<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Authress Engineering Blog</title>
    <description>The latest articles on DEV Community by Authress Engineering Blog (@authress).</description>
    <link>https://dev.to/authress</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2625%2F18c2bb45-3a91-4fc8-86a0-3006f2b6b93a.png</url>
      <title>DEV Community: Authress Engineering Blog</title>
      <link>https://dev.to/authress</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/authress"/>
    <language>en</language>
    <item>
      <title>The Risks of User Impersonation</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 24 Jan 2025 17:58:49 +0000</pubDate>
      <link>https://dev.to/authress/the-risks-of-user-impersonation-58nf</link>
      <guid>https://dev.to/authress/the-risks-of-user-impersonation-58nf</guid>
      <description>&lt;h2&gt;
  
  
  What is user impersonation?
&lt;/h2&gt;

&lt;p&gt;User impersonation is anything that allows your systems to believe the current logged in user is someone else. With regards to JWTs and access tokens, this means that one user obtains a JWT that contains another user's &lt;code&gt;User ID&lt;/code&gt;. User impersonation or logging in as a customer can be used as a tool to help identify many issues from user authentication and onboarding to corrupted data in complex multi-service business logic flows.&lt;/p&gt;

&lt;p&gt;However, at first glance it should be obvious that there are major security implications to such an approach. Even if it isn't, this article will extensively review user impersonation and its security implications, as well as offer alternative suggestions for achieving a similar outcome in a software system without compromising security.&lt;/p&gt;

&lt;h2&gt;
  
  
  The impersonation use cases
&lt;/h2&gt;

&lt;p&gt;No solution is relevant in a vacuum, so let's consider the concrete issues that you might actually have, and the reason you've arrived at this &lt;a href="https://authress.io/knowledge-base/academy/topics" rel="noopener noreferrer"&gt;Authress Academy&lt;/a&gt; article. If we were to jump straight into a solution, we'd likely end up sacrificing security, or worse, our users' sensitive data, in favor of suboptimal solutions.&lt;/p&gt;

&lt;p&gt;Possible use case user stories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One of your users reports that they are experiencing an issue with a screen in your application portal not showing the correct information. As a support engineer, you want to review the exact display in the application UI that your user sees, so that you can verify the UI is indeed broken and something is actually going wrong.&lt;/li&gt;
&lt;li&gt;Similar to the above, you want to know whether the display issue is caused by the UI itself or by the data the application UI is fetching, which would point to a service API issue.&lt;/li&gt;
&lt;li&gt;Sometimes it is a problem with a complex API server flow. A click in your application portal was expected to perform a data change, transformation, or API request to your backend services, but it may not have been sent with the appropriate data. As a product engineer, you would like to verify that the correct request data is being sent to your service API.&lt;/li&gt;
&lt;li&gt;As a system admin, multiple third party systems are interacting with each other and something™ isn't working, and because you are a great collaborator, even though it isn't your responsibility, you want to help out your customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, this list isn't exhaustive, but already you can start to see that, while user impersonation might seem useful when focusing on these concrete problems, none of them actually requires it to debug. The root causes often fall into at least one of these categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This is a UI component display issue.&lt;/li&gt;
&lt;li&gt;An unexpected request is being sent or isn't sent to your service API from your application portal.&lt;/li&gt;
&lt;li&gt;The wrong data is being sent in the request from your application UI to your API.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;READ&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;WRITE&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a multi-system problem and not an access issue; what you really need to continue debugging is a duplicated environment that exactly matches current production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: none of these root causes comes close to needing user impersonation; each has straightforward alternatives that are both secure and frequently simpler to implement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported libraries
&lt;/h2&gt;

&lt;p&gt;Fundamentally, &lt;strong&gt;user impersonation&lt;/strong&gt; is insecure by design; we'll see why in a moment. There are much better ways to provide insight into your specific scenario that actually take security into account. But let's assume that we do implement user impersonation. Is there help available for us from our favorite overengineered solution?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ankane/pretender" rel="noopener noreferrer"&gt;Ruby - Rails pretender&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/django-hijack/django-hijack" rel="noopener noreferrer"&gt;Python - Django hijack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/express-user-impersonation" rel="noopener noreferrer"&gt;Nodejs - Express/Passport impersonate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert your favorite monolithic HTTP Framework here&lt;/strong&gt; ➤ Deprecated Solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's interesting is that in doing the research to actually find existing implementations, 86% of the repos and links I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No longer exist, and haven't existed for quite some time&lt;/li&gt;
&lt;li&gt;Were archived over 5 years ago&lt;/li&gt;
&lt;li&gt;Have fewer than 10 stars on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if people are trying to make this happen, the tools don't even exist to ensure that we are doing it correctly and safely. The results of this search tell us something. Even more surprising is that most Auth SaaS solutions don't offer this either. As it turns out, either no one really cares that much, or it is next to impossible to get right, such that no solution can exist. Well, that can't be right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangers of user impersonation
&lt;/h2&gt;

&lt;p&gt;Let's assume for a moment that the collective wisdom is correct, and no solutions exist because it is dangerous. What exactly are those dangers? To help convey these issues, say that we managed to get one of these legacy packages above actually working with our system, the first problem that we'll run into is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who actually has access to perform this User Impersonation in the first place? Who are our admins?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Defining the admins
&lt;/h3&gt;

&lt;p&gt;Of course, allowing everyone to impersonate one another basically means our authentication provides no value. We might as well let users enter whatever username they like on every post they make. Realistically, we want to restrict this list to those for whom it actually makes sense to have the ultimate &lt;code&gt;su&lt;/code&gt; privilege.&lt;/p&gt;

&lt;p&gt;Figuring out who the admins should be, and maintaining access to the closely guarded endpoint that grants user impersonation, is a common problem that eludes even the most sophisticated companies. The most notorious examples of getting this wrong were the &lt;a href="https://en.wikipedia.org/wiki/2020_Twitter_account_hijacking" rel="noopener noreferrer"&gt;Twitter 2020 admin tools hack&lt;/a&gt; and the &lt;a href="https://msrc.microsoft.com/blog/2023/07/microsoft-mitigates-china-based-threat-actor-storm-0558-targeting-of-customer-email/" rel="noopener noreferrer"&gt;Microsoft Storm-0558&lt;/a&gt; breaches. Attackers were able to compromise admin-level account tools, and use them to steal and impersonate actual users. Historically, one of these companies had paid significant attention to its own internal security, was, if not the first, one of the first to introduce the notion of public social logins, and was no stranger to the issues at hand; the other was Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Maintaining both the admin list, and correctly securing the endpoint to allow impersonation in the first place.&lt;/strong&gt;&lt;/p&gt;
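&lt;p&gt;As a rough illustration of how narrow this gate needs to be, here is a minimal sketch of an allow-list check guarding an impersonation endpoint. The list and function names here are hypothetical, not from any particular framework:&lt;/p&gt;

```javascript
// Hypothetical sketch: a hard-coded allow-list guarding impersonation.
// In practice this list is a moving target: it must be kept current,
// audited, and protected at least as carefully as any other credential.
const IMPERSONATION_ADMINS = new Set(['me_admin', 'other_admin']);

// Returns true only when the caller is on the allow-list and is not
// trying to impersonate another admin, which would escalate privileges.
function canImpersonate(callerUserId, targetUserId) {
  if (!IMPERSONATION_ADMINS.has(callerUserId)) {
    return false; // caller is not an admin at all
  }
  if (IMPERSONATION_ADMINS.has(targetUserId)) {
    return false; // admins impersonating admins escalates privileges
  }
  return true;
}
```

&lt;p&gt;Even this toy version shows the shape of the problem: the list itself, and the endpoint it protects, become security-critical assets of their own.&lt;/p&gt;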

&lt;h3&gt;
  
  
  2. The implementation
&lt;/h3&gt;

&lt;p&gt;The next issue with impersonation becomes apparent when we start to question how it can even work in practice. &lt;em&gt;In theory, practice is the same as theory; in practice, it is not.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once an admin is authorized to impersonate a user, what exactly is happening in our platform? Let's flash back to &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication&lt;/a&gt;. In order to secure your system, to ensure the right users have access to the right data at the right time, your users must send a session cookie or session token on every request, which your API can verify to confirm the user is logged in. This could be a completely opaque GUID that represents some data in your database (a reference token) or a more secure JWT that is stateless. In any case, your system identifies users via your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication Strategy&lt;/a&gt;, and at the end of the day identification comes down to a single property in a single object somewhere. An example could be the JWT &lt;code&gt;subject claim&lt;/code&gt; property:&lt;/p&gt;

&lt;p&gt;User user_001 JWT access token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In OAuth/OpenID, the &lt;code&gt;sub&lt;/code&gt; claim in a JWT represents the &lt;strong&gt;User ID&lt;/strong&gt;. Thus this particular token represents a verified user with the identity &lt;code&gt;user_001&lt;/code&gt;. Anyone who holds this token now has access to impersonate this user. Hopefully, you have some logging in place to identify when a user is being impersonated and who actually started the impersonation process. But how do we actually impersonate this user?&lt;/p&gt;
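&lt;p&gt;To make the role of the &lt;code&gt;sub&lt;/code&gt; claim concrete, here is a minimal sketch of reading it out of a JWT. Note this only decodes the payload; a real service must verify the token's signature (for example with a JOSE library) before trusting anything inside it:&lt;/p&gt;

```javascript
// Illustrative only: decode a JWT payload to read the `sub` claim.
// A JWT is three base64url segments separated by dots:
// header.payload.signature. This sketch deliberately skips signature
// verification, which any real API must perform first.
function decodeJwtPayload(jwt) {
  const payloadSegment = jwt.split('.')[1];
  const json = Buffer.from(payloadSegment, 'base64url').toString('utf8');
  return JSON.parse(json);
}
```

&lt;p&gt;Whatever &lt;code&gt;sub&lt;/code&gt; comes out of that (verified) payload is who your system believes is making the request, which is exactly why minting a token with someone else's &lt;code&gt;sub&lt;/code&gt; is impersonation.&lt;/p&gt;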

&lt;p&gt;Well, of course, I need to convert a token that represents my admin user into a token that represents the user I want to impersonate. Here is an example of the token I hold right now.&lt;/p&gt;

&lt;p&gt;My admin user token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our system, in this scenario, uses the &lt;code&gt;sub&lt;/code&gt; property to determine which user is accessing the system, I of course need a token that replaces the current &lt;code&gt;sub&lt;/code&gt; value of &lt;code&gt;me_admin&lt;/code&gt; with a &lt;code&gt;sub&lt;/code&gt; of &lt;code&gt;user_001&lt;/code&gt;. So when I impersonate the user, the result &lt;strong&gt;must be a token&lt;/strong&gt; that looks exactly like the user token:&lt;/p&gt;

&lt;p&gt;User token generated by the admin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some HTTP/auth frameworks have thought a whole two seconds longer than the rest and might have decided to add an additional property to indicate that the token was created through the process of impersonation by an admin rather than directly by the user:&lt;/p&gt;

&lt;p&gt;User token generated by the admin with magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;generated_by&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this might even seem like a good idea; however, in practice it creates a &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;Pit of Failure&lt;/a&gt;. Enabling admins to create new tokens that contain another user's identity causes two distinct problems.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first issue is that one admin user can impersonate another admin user. And that second admin user might potentially have more access and be authorized for more sensitive information. This means that it isn't so straightforward to just add in impersonation and assume that everything will work out. Our &lt;strong&gt;List of Admins&lt;/strong&gt; can no longer be just a list; it must now also encode a hierarchical order of who can impersonate whom. If you've been following along, this looks a lot like what &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authress Authorization&lt;/a&gt; provides. Of course you don't absolutely have to have that, but if you don't, then you've sacrificed some security.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second issue is that not every application you have might be interested in allowing users to be impersonated. Any mature system, and even most early software ventures, has some data that you are even less interested in exposing than the rest. Data that is sensitive by nature, or regulated, fits this picture. This could be Personally Identifiable Information (PII), credit card data (PCI-DSS), or really anything that has been regulated in your locality by governing bodies. You might breach these regulations through user impersonation if, for instance, your support engineer is in a different &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication/selecting-data-residencies" rel="noopener noreferrer"&gt;Data Residency&lt;/a&gt; than the user. For example, when attempting to debug issues in a UI, almost never is the &lt;strong&gt;Date Of Birth (DOB)&lt;/strong&gt; of the user absolutely necessary to show on the screen. Sure, it is relevant in some use cases, but in most debugging scenarios it is not.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If your authentication depends on the property &lt;code&gt;sub&lt;/code&gt; in the JWT, then an application cannot opt out of user impersonation. Since you are changing the &lt;code&gt;sub&lt;/code&gt; to be the impersonated user, every application will see the new &lt;code&gt;sub&lt;/code&gt; value, even if they do not want to support user impersonation. &lt;strong&gt;Strike 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;All applications are forcibly opted in. If an application wants to opt out, then the second claim &lt;code&gt;generated_by&lt;/code&gt;, or its respective implementation, is required. But even then, all applications start opted in. That means when you design a new application, you have to remember that you might want to block admins from accessing user data in it: data is insecure by default, unless explicitly designed otherwise. This is the pit of failure; a pit of success would be opt-in: data is secured by default, unless otherwise excluded. &lt;strong&gt;Strike 2.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
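&lt;p&gt;To see that opt-out burden concretely, here is a sketch of the check every service that does &lt;em&gt;not&lt;/em&gt; want impersonated sessions would have to remember to add. The claim name follows the hypothetical &lt;code&gt;generated_by&lt;/code&gt; example above; forget this check in a single service and that service is silently opted in:&lt;/p&gt;

```javascript
// Sketch of the opt-out burden: every service that must NOT accept
// impersonated sessions has to remember this check. Forgetting it
// silently opts the service in -- the pit of failure described above.
function rejectImpersonatedTokens(claims) {
  if (claims.generated_by && claims.generated_by !== claims.sub) {
    // the token was minted by an admin on the user's behalf
    throw new Error('impersonated tokens are not accepted by this service');
  }
  return claims.sub;
}
```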

&lt;blockquote&gt;
&lt;p&gt;A quick call-out is worthwhile on how to secure data like a user's DOB. UIs don't need to know this information in most cases. The screens and activities where the DOB is valuable actually care whether the user &lt;code&gt;isBornInJanuary&lt;/code&gt; or &lt;code&gt;isOlderThan18&lt;/code&gt;, not the actual date of birth of the user. Unless of course this is the user's DOB selection screen, in which case this component rarely needs to be validated by a support engineer; and if you believe that user impersonation is necessary to help validate the user DOB entry screen, this article isn't going to be of any help for you.&lt;/p&gt;
&lt;/blockquote&gt;
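&lt;p&gt;A sketch of what serving derived booleans instead of the raw DOB might look like. The function name is illustrative; the point is that the regulated value is computed server-side and never leaves the API:&lt;/p&gt;

```javascript
// Sketch: expose derived booleans instead of the raw date of birth.
// The API computes these server-side, so the UI (and any support
// engineer looking at it) never sees the regulated value itself.
function deriveDobClaims(dateOfBirth, now = new Date()) {
  const dob = new Date(dateOfBirth);
  // approximate age in years, good enough for a threshold check
  const ageYears = (now.getTime() - dob.getTime()) / (365.25 * 24 * 3600 * 1000);
  return {
    isBornInJanuary: dob.getUTCMonth() === 0,
    isOlderThan18: ageYears >= 18
  };
}
```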

&lt;h3&gt;
  
  
  3. Secondary system data leakage
&lt;/h3&gt;

&lt;p&gt;Not only do we need to worry about vulnerabilities in our primary user applications, and about leaking the data associated with them; now we also need to worry about protecting the secondary systems used to impersonate users AND about leaking the data associated with them as well. Internal systems, by their very nature, usually end up having worse security measures in place because fewer people use them. Fewer users and lower volume mean less attention given to such an app, and therefore more successful attacks. In practice, these applications are rarely changed but frequently break, and most importantly have low priority when it comes to innovation and implementing necessary improvements. They don't end up in your OKRs for this quarter, and no one is getting promoted over them.&lt;/p&gt;

&lt;p&gt;We are so concerned that someone might abuse these tools that we ourselves leak user access tokens and data into logging systems. We log so zealously, to ensure we have captured every usage of these tools, that we end up logging what we should not. And whatever we log has probably also been exported to third-party reporting tools. It is a Catch-22: we know we need to log and report on actions taken by an admin impersonating a user, but that means logging data we would not normally log. The goal of preventing security issues creates a new attack surface.&lt;/p&gt;

&lt;p&gt;The result is that these systems will likely end up logging usage of user tokens. That's a new attack surface, and due to the low priority given to fixing these systems, they are actually &lt;strong&gt;twice as likely to leak user data&lt;/strong&gt; compared to our primary user applications.&lt;/p&gt;
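&lt;p&gt;A best-effort mitigation is to scrub tokens before log lines leave the process. This is only a sketch, and no substitute for deciding what to log in the first place; it relies on the recognizable three-part base64url shape of a JWT:&lt;/p&gt;

```javascript
// Sketch: redact anything shaped like a JWT before it reaches the
// logging pipeline. Best-effort only -- the real fix is to never put
// tokens into log statements to begin with.
const JWT_PATTERN = /[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]*/g;

function redactTokens(logLine) {
  return logLine.replace(JWT_PATTERN, '[REDACTED_TOKEN]');
}
```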

&lt;h3&gt;
  
  
  4. Corrupted audit trails
&lt;/h3&gt;

&lt;p&gt;Frequently we can conclude a priori that user impersonation is actually wrong. In the debugging scenarios, the last thing you want is access to modify the user's data. If you actually needed to modify a user's private data, or one of your customers' account information, you would definitely want a dedicated system to handle that. This means you don't actually want to be the user, and you don't want to impersonate the user; you just want to see as the user, with the explicit caveat of &lt;strong&gt;read-only permissions&lt;/strong&gt;. You only want to see what they see, not be able to modify their data. Accidentally modifying user data is guaranteed to happen if the only way to verify a user-facing UX problem is to completely impersonate a user and get full write access to their account.&lt;/p&gt;
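&lt;p&gt;A minimal sketch of such a read-only guard, assuming the session carries a flag marking it as a support view; the flag name is purely illustrative:&lt;/p&gt;

```javascript
// Sketch: block all mutating HTTP methods for read-only support
// sessions, so a support engineer can see what the user sees but can
// never change it. The session flag name is illustrative.
const WRITE_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

function assertReadOnlyAccess(httpMethod, session) {
  if (session.isSupportReadOnly && WRITE_METHODS.has(httpMethod)) {
    throw new Error('read-only support sessions cannot modify user data');
  }
}
```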

&lt;p&gt;Without even thinking hard about it, the following issues are associated with impersonating the user in this context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit trails incorrectly say the user changed data when they did not. ➤ An admin impersonating the user did it.&lt;/li&gt;
&lt;li&gt;The user's sessions may start to include the one generated by the admin. ➤ It would be an understatement to say a user would be concerned if they saw a session on a sensitive account modifying data from a location they are not in.&lt;/li&gt;
&lt;li&gt;Logging data in the applications is incorrectly recorded, or may not be recorded at all. ➤ You may be tempted to hide these admin interactions.&lt;/li&gt;
&lt;li&gt;And lastly, in every case, now we need to alter our systems to be not only aware of how to process the data due to impersonation, but how to log it.  ➤ Impersonation is a virus that starts to infect all of our systems.&lt;/li&gt;
&lt;/ul&gt;
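&lt;p&gt;One way to keep the audit trail honest is to record the acting admin and the affected user as separate fields in every audit event. A sketch, with illustrative field names:&lt;/p&gt;

```javascript
// Sketch: an audit event that separates who the action affected
// (`subject`) from who actually performed it (`actor`), so an admin
// acting on a user's behalf is never recorded as the user themselves.
function buildAuditEvent({ actorUserId, subjectUserId, action }) {
  return {
    action,
    subject: subjectUserId,
    actor: actorUserId,
    onBehalfOf: actorUserId !== subjectUserId, // true for impersonation
    recordedAt: new Date().toISOString()
  };
}
```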

&lt;h2&gt;
  
  
  The practical-ish solutions
&lt;/h2&gt;

&lt;p&gt;If generating a new token that contains the impersonated &lt;strong&gt;User ID&lt;/strong&gt; is so bad, there must be better solutions out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution A: Additional token claim property
&lt;/h3&gt;

&lt;p&gt;What if we don't change the subject &lt;code&gt;sub&lt;/code&gt; claim, but instead add a new claim? That way, only the services that understand this claim and actually want to use it would do so. Services that don't know about it keep using the unmodified &lt;code&gt;sub&lt;/code&gt; claim, and admins would still look like admins. Only services that care about a new &lt;code&gt;adminIsImpersonatingUserId&lt;/code&gt; claim property would know to use it and how to handle it. This would give you security by default, and only expose the services that have already explicitly designed support for the danger. You would have to opt in. Success, finally!&lt;/p&gt;

&lt;p&gt;Theoretically this is great, and while it is a bit more secure than altering the subject, in practice, we start to write code that looks like this:&lt;/p&gt;

&lt;p&gt;Resolve User Identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveUserIdentity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;adminIsImpersonatingUserId&lt;/span&gt;
          &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then that code ends up in a shared library which all our services use. So while our intentions were good, the reinforcing system loops make this no better than the alternatives. The reason is that we often feel the need to standardize our code across even a small number of services, because code duplication is widely considered a bad thing. So the &lt;code&gt;resolveUserIdentity&lt;/code&gt; method leads us to the following pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We change our Auth solution to add the new claim to the JWT during impersonation.&lt;/li&gt;
&lt;li&gt;Only those services that need to care about this add support for it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point we are still 100% secure. But then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We update some shared libraries that support JWT verification and add the method &lt;code&gt;resolveUserIdentity&lt;/code&gt; to it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;resolveUserIdentity&lt;/code&gt; replaces all the checks to consume the new claim.&lt;/li&gt;
&lt;li&gt;All existing services get updated to use this shared library, and are exposed to the dangers of impersonation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A new claim won't help us. This means that now we are back to the same problem, and arguably the &lt;strong&gt;situation is worse&lt;/strong&gt;. Instead of all the services in the platform trusting the standardized &lt;code&gt;sub&lt;/code&gt;, we now maintain a bespoke solution just for our system. This is especially important: the &lt;code&gt;sub&lt;/code&gt; claim is an &lt;code&gt;OAuth&lt;/code&gt; and &lt;code&gt;OpenID&lt;/code&gt; industry standard (&lt;a href="https://datatracker.ietf.org/doc/html/rfc9068" rel="noopener noreferrer"&gt;RFC 9068&lt;/a&gt;), and everyone in the industry is familiar with it. However, just for your system, there is now a new claim which ends up being treated as the canonical &lt;code&gt;sub&lt;/code&gt;, but it is not standard, not self-documenting, unexpected, and unique. Complexity reduces security. &lt;strong&gt;Strike 3.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more about the systemic issues with a JWT or session token based permission system, permission attenuation is discussed in depth in the &lt;a href="https://authress.io/knowledge-base/academy/topics/offline-attenuation" rel="noopener noreferrer"&gt;token scoping academy topic&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution B: DOM Recording
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;See earlier impersonation use cases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we flash back to the original user stories that drove us to implement user impersonation in the first place, we might start to see a pattern emerge. Most of the time, the issue is that something is wrong with the User Experience: the user is stuck in some way, the data isn't being displayed correctly, or some component is broken.&lt;/p&gt;

&lt;p&gt;All of these are user-facing issues that appear purely in the UI. The source of the data, and the security therein, has near-zero value to us in validating the user experience. Attempting to use &lt;strong&gt;expensive&lt;/strong&gt; full user impersonation instead of simple &lt;strong&gt;UI component&lt;/strong&gt; tests is the exact same problem we see when tests are incorrectly implemented at the wrong level.&lt;/p&gt;

&lt;p&gt;Let's use the Testing Pyramid as an analogy. The canonical testing pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" alt="The Testing Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is our &lt;strong&gt;unit tests&lt;/strong&gt;, those tests are cheap and easy to write, find the most issues, and ensure our system is working without much effort.&lt;/li&gt;
&lt;li&gt;Then come the &lt;strong&gt;service level tests&lt;/strong&gt;. Or in the case of UIs, these are our screen tests. Multiple pieces of functionality and components are combined together in these tests. We don't want many of them; perhaps 10% max of all our tests test full screens or services. Most of the functionality of the service or screen is already validated in the unit tests, i.e. we know that our core functions, as well as buttons, sliders, pickers, etc., all work correctly.&lt;/li&gt;
&lt;li&gt;Now come the 1% &lt;strong&gt;integration or end-to-end tests&lt;/strong&gt;. You almost never want these; only the most critical flows of your application should be validated. When they report a failure, you have no idea what might have caused that particular failure, you just know there is a problem. In the case of an application like a social media platform, the integration test you want is making a new post. (Obviously there is no reason to test the login flow, since your auth provider has you already covered there!)&lt;/li&gt;
&lt;li&gt;At the top of the pyramid is &lt;strong&gt;manual exploratory testing&lt;/strong&gt;. That which cannot be automated, and most importantly needs the intelligence and creativity of a human to identify potential problems in your software application. This is the most expensive and you rarely have an interest in squandering this effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference between this and a support case is the context — &lt;strong&gt;the why&lt;/strong&gt;. The services, applications, business logic, and tools that we have at our disposal are all the same. We need to trust that our tests exist to validate the problems we could have. It is always a mistake to invest effort in the top of the pyramid when we lack the assets at the bottom. Likewise, our support pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" alt="The Support Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is &lt;strong&gt;application logs&lt;/strong&gt;. There is no sense in attempting to tackle any of the higher layers until you have sufficient application logs that exactly report incoming requests, outgoing responses, unexpected data scenarios, edge cases that aren't completely implemented, and systemic issues.&lt;/li&gt;
&lt;li&gt;Just above that is &lt;strong&gt;documentation&lt;/strong&gt;. This includes expected common flows, uncommon flows, and demos of the more complex-to-use aspects of our application. The biggest benefit of this documentation is that it enables us to help our users. I want to repeat that it is more for us than it is for our users. The pyramid exists to inform us what we should do, not how our users should operate.&lt;/li&gt;
&lt;li&gt;The next rung up is &lt;strong&gt;User recordings&lt;/strong&gt;. For users that are having issues, we have concrete recorded data for their flow. The flows would include anything relevant to the application: how they used it and what actions they took, all so we can actually see what happened in context when there is a problem. No one wants to spend any time looking at recordings if they don't have to. It is also very difficult to identify the root cause of problems by reviewing a recording, but having them is indispensable to your support engineers when they need them, when a user has reported an issue. Solutions include &lt;a href="https://posthog.com/" rel="noopener noreferrer"&gt;PostHog&lt;/a&gt;, &lt;a href="https://www.fullstory.com/" rel="noopener noreferrer"&gt;FullStory&lt;/a&gt;, and &lt;a href="https://sentry.io/welcome/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt;. If you don't have these recordings, then the next best alternative (which is very far away) is getting a live screencast from the user. These are less useful and more expensive to obtain. Worst of all, they can and &lt;a href="https://blog.1password.com/okta-incident/" rel="noopener noreferrer"&gt;have been used&lt;/a&gt; to breach sensitive systems.&lt;/li&gt;
&lt;li&gt;At the very top, is of course the thing you never want to have to do, and the topic of this article: &lt;strong&gt;Full user impersonation&lt;/strong&gt;. If everything else fails then at least we have user impersonation left in our toolkit. But this must only be used after we have significantly invested in all the other strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Assuming we have tackled the bottom two rungs of the pyramid, the missing next component is the &lt;strong&gt;User recordings&lt;/strong&gt;. If you have those, which offer the ability to sanitize the data coming from users, then you've got the solution to 99% of all support cases. Having people jump in and impersonate users is just not necessary. And most importantly, if we look at who most often needs to impersonate users, it isn't even the people who should have access to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" alt="Danger of impersontation" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Revisiting user impersonation
&lt;/h2&gt;

&lt;p&gt;Do you want to see the data, or do you want to see what the user sees? In almost every case it is the former, and seeing the data can be done through an admin app. In the rare case that it is the latter, we would need the exact permissions the user has, or some safer strict subset of them. So what's the right way to handle user impersonation in the case that we just can't live without it?&lt;/p&gt;

&lt;p&gt;The most important principle here is &lt;strong&gt;Secure by Default&lt;/strong&gt;. As we've seen, a blanket implementation is wrong, and there are too many &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;pits of failure&lt;/a&gt; with the JWT, auth session, or reference token based approach.&lt;/p&gt;

&lt;p&gt;Looking at the support engineer use case, our needs would be satisfied if we were to explicitly hand out to the support staff just the &lt;code&gt;read:logs&lt;/code&gt; permission to handle that specific support case. It is quite something else to generate whole valid tokens that contain a subject different from the user requesting them and give those out to specific people. So as long as we have a system that allows us to provide our team members with explicit permissions to only the exact resources they need, then we have the capability to ensure we have a secure system that also solves all our use cases.&lt;/p&gt;
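&lt;p&gt;A minimal sketch of what that explicit, temporary grant could look like, assuming a simple in-memory authorizer. The &lt;code&gt;read:logs&lt;/code&gt; permission string comes from the discussion above; the &lt;code&gt;Authorizer&lt;/code&gt; interface itself is hypothetical, not a real API.&lt;/p&gt;

```python
# Sketch: grant a support engineer one scoped, time-limited permission
# instead of minting a token with someone else's identity in it.
# The Authorizer class is an illustrative assumption, not a real library.
from datetime import datetime, timedelta, timezone


class TemporaryGrant:
    def __init__(self, user_id, permission, ttl_minutes):
        self.user_id = user_id
        self.permission = permission
        self.expires = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)


class Authorizer:
    def __init__(self):
        self._grants = []

    def grant(self, user_id, permission, ttl_minutes=60):
        # Grant exactly one permission, for a limited time, to a real identity.
        # Identity never changes; only access does.
        self._grants.append(TemporaryGrant(user_id, permission, ttl_minutes))

    def is_authorized(self, user_id, permission):
        now = datetime.now(timezone.utc)
        return any(
            g.user_id == user_id and g.permission == permission and g.expires > now
            for g in self._grants
        )


authz = Authorizer()
authz.grant("support-engineer-1", "read:logs", ttl_minutes=30)
print(authz.is_authorized("support-engineer-1", "read:logs"))   # True
print(authz.is_authorized("support-engineer-1", "write:logs"))  # False
```

&lt;p&gt;The audit trail stays honest: every request is still made by the support engineer's own identity, just with one extra permission attached.&lt;/p&gt;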

&lt;h2&gt;
  
  
  How Authress supports user impersonation
&lt;/h2&gt;

&lt;p&gt;I want to end this article with a discussion about how Authress solves the top-of-the-pyramid user impersonation story. The caveat here is that it is sometimes a trade-off some companies really want: they absolutely want to sacrifice security and increase vulnerabilities as well as their attack surface by introducing full user impersonation functionality. However, from experience, very few of our customers have anything implemented in this space at all, and those that do have hooked their process into &lt;strong&gt;easy to grant permissions&lt;/strong&gt; through Authress, rather than &lt;strong&gt;full user identity impersonation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real solution is to actually consider your support team persona when designing features. And this is what Authress optimizes for.&lt;/p&gt;

&lt;p&gt;The flow that we consider the most secure is to explicitly and &lt;strong&gt;temporarily grant your support user persona exactly one small additional set of permissions&lt;/strong&gt; relevant for the support case. When we do this we don't change how we determine identity, we only change the way we determine access. Authress supports this by allowing quick cloning of &lt;a href="https://authress.io/knowledge-base/docs/authorization/access-records" rel="noopener noreferrer"&gt;User Based Access Records&lt;/a&gt; which represent the permissions a user has. Since cloning is dynamic, a temporary access record can be created that only contains the &lt;code&gt;READ&lt;/code&gt; equivalent roles that the user has. And in most cases, you can just directly assign your support engineers to an &lt;a href="https://authress.io/app/#/settings?focus=groups" rel="noopener noreferrer"&gt;Authress Permission Group&lt;/a&gt; with &lt;code&gt;READ ✶&lt;/code&gt; access, and never need to touch permissions again.&lt;/p&gt;

&lt;p&gt;Here is an example cloned access record, where the support engineer received just the &lt;strong&gt;Viewer&lt;/strong&gt; Role to all organizations so that documents and users could be &lt;code&gt;Read&lt;/code&gt; not &lt;code&gt;Updated&lt;/code&gt;:&lt;/p&gt;
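&lt;p&gt;The cloning step can be sketched like this. The record shape loosely mirrors access records (statements with roles and resources), but the exact field names and role names here are illustrative assumptions rather than the real Authress API.&lt;/p&gt;

```python
# Sketch: clone an access record, reassign it to the support engineer,
# and downgrade every role to a read-only equivalent. Field and role
# names are assumptions for illustration.
import copy

READ_ONLY_EQUIVALENT = {
    "Authress:Owner": "Authress:Viewer",
    "Authress:Editor": "Authress:Viewer",
    "Authress:Viewer": "Authress:Viewer",
}


def clone_as_read_only(record: dict, support_user_id: str) -> dict:
    cloned = copy.deepcopy(record)
    cloned["recordId"] = f"{record['recordId']}-support-clone"
    # Reassign the cloned record to the support engineer's own identity.
    cloned["users"] = [{"userId": support_user_id}]
    for statement in cloned["statements"]:
        # Every role becomes its read-only equivalent; unknown roles
        # conservatively collapse to Viewer.
        statement["roles"] = sorted(
            {READ_ONLY_EQUIVALENT.get(r, "Authress:Viewer") for r in statement["roles"]}
        )
    return cloned


original = {
    "recordId": "rec-customer-1",
    "users": [{"userId": "customer-1"}],
    "statements": [
        {"roles": ["Authress:Owner"], "resources": [{"resourceUri": "/organizations/org-1/*"}]}
    ],
}
print(clone_as_read_only(original, "support-engineer-1")["statements"][0]["roles"])
# → ['Authress:Viewer']
```

&lt;p&gt;The original customer record is never modified; the temporary clone can be deleted the moment the support case closes.&lt;/p&gt;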

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" alt="Access record example" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The firehose recommendations
&lt;/h2&gt;

&lt;p&gt;In case you want to ignore the advice of this academy article and, instead of using Authress permissions to drive access control as recommended, implement user impersonation anyway, I do want to include recommendations that will help reduce the impact of security and compliance issues related to user impersonation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not hide user impersonation; it will be tempting to obscure your usage of it from your customers. Instead make sure it is visible and clear for everyone, especially your customers. I know you don't want them to know, but they should know, and they may even need to know, especially if something goes wrong.&lt;/li&gt;
&lt;li&gt;Make sure all actions are recorded in an audit trail, both those by your admin who impersonated the user and those by the application user. &lt;strong&gt;Especially the admin&lt;/strong&gt;. There will definitely be questions related to the "last person that touched this" and of course "it was working before your team looked at it". You will need a way to be confident in your response to your customers when it wasn't an admin that touched it last.&lt;/li&gt;
&lt;li&gt;If you're operating in any high-security environment, FedRAMP, ITAR, or the like, always require customer user action before the support engineer has access to the account data. Some prominent cloud providers believe having an email with the user agreeing is sufficient for this. I'm here to say it &lt;em&gt;is not sufficient&lt;/em&gt;. Often the people who can create support cases do not and should not have admin access to the customer account to view all the data, so someone without the customer admin role should not be able to grant your support engineering staff access to sensitive data in the account. &lt;strong&gt;You need an admin to click a button.&lt;/strong&gt; This is usually done through a &lt;a href="https://authress.io/knowledge-base/docs/advanced/step-up-authorization#3-make-the-authorization-request" rel="noopener noreferrer"&gt;Step-Up Authorization Request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Impersonation can be valuable in some environments, but it is often completely useless in others. Especially in spaces with regulatory requirements, it's much better to diagnose issues from outside the impacted account, either through data replication or a permissions based approach.&lt;/li&gt;
&lt;li&gt;Ensure your impersonation logic is completely tested. There should be no better tested piece of functionality in your software system.&lt;/li&gt;
&lt;li&gt;Audit trails should always keep a "This was run by User X" annotation on audit records: not just the user ID, but also any additional information from the admin. Our recommendation is to include both the &lt;code&gt;Admin User ID&lt;/code&gt; and the &lt;code&gt;Support Ticket ID&lt;/code&gt; on every log statement.&lt;/li&gt;
&lt;li&gt;Start with your customer expectations. What sort of transparency do they explicitly expect? Do not guess. Err on the side of overcommunicating, rather than under.&lt;/li&gt;
&lt;li&gt;Please reconsider doing this in the first place if you don't have the capacity for a dedicated team accountable for this functionality. Often this will involve your legal team when it doesn't go right.&lt;/li&gt;
&lt;li&gt;When (not if) credentials leak, who leaked those credentials? Was it your customer, or was it through your admin application, or by one of your support engineers? Always be able to tell where those credentials came from, so that you can respond to the compromise as effectively as possible.&lt;/li&gt;
&lt;li&gt;If you want to start anywhere, go back and invest in your admin/support tools so that they can expose the data that you need, rather than focusing on user impersonation. If those tools are insufficient check back at the Support Engineer Pyramid again.&lt;/li&gt;
&lt;/ul&gt;
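&lt;p&gt;The audit-trail recommendation above can be sketched with a logging adapter that stamps every statement with the &lt;code&gt;Admin User ID&lt;/code&gt; and &lt;code&gt;Support Ticket ID&lt;/code&gt;; the JSON field names and adapter setup here are illustrative assumptions.&lt;/p&gt;

```python
# Sketch: stamp every log statement with who is really acting and why.
# The field names runByUserId / supportTicketId are assumptions that
# follow the recommendation above, not any standard schema.
import json
import logging


class SupportContextAdapter(logging.LoggerAdapter):
    """Attach the admin identity and support ticket to every audit record."""

    def process(self, msg, kwargs):
        annotated = {
            "message": msg,
            "runByUserId": self.extra["admin_user_id"],
            "supportTicketId": self.extra["support_ticket_id"],
        }
        return json.dumps(annotated), kwargs


logging.basicConfig(level=logging.INFO, format="%(message)s")
audit = SupportContextAdapter(
    logging.getLogger("audit"),
    {"admin_user_id": "admin-42", "support_ticket_id": "TICKET-1234"},
)
audit.info("Read customer document doc-77 on behalf of customer-1")
```

&lt;p&gt;Because the annotation is applied by the adapter rather than by each call site, no individual log statement can forget to record who really touched the data.&lt;/p&gt;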

&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to the &lt;a href="https://authress.io/app/#/support" rel="noopener noreferrer"&gt;Authress development team&lt;/a&gt; or follow along in the &lt;a href="https://authress.io/knowledge-base" rel="noopener noreferrer"&gt;Authress documentation&lt;/a&gt; and join our community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>authorization</category>
      <category>identity</category>
      <category>security</category>
    </item>
    <item>
      <title>Are millions of accounts vulnerable due to Google's OAuth Flaw?</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Wed, 15 Jan 2025 17:01:30 +0000</pubDate>
      <link>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</link>
      <guid>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article is a rebuttal to &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Truffle Security's&lt;/a&gt; post on &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Millions of Accounts Vulnerable due to Google's OAuth Flaw&lt;/a&gt;. (&lt;em&gt;&lt;a href="https://authress.io/knowledge-base/assets/files/truffle-security-google-oauth-vulnerability-19b387e9c84f8ccfe621c0301c2a19d8.pdf" rel="noopener noreferrer"&gt;Alt link&lt;/a&gt;&lt;/em&gt;) Even more ridiculous might be that their post got picked up by no small number of news outlets that all should be ashamed of themselves, far too many to actually link in this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Are millions of accounts vulnerable due to Google's OAuth Flaw?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a true &lt;a href="https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines" rel="noopener noreferrer"&gt;Betteridge's law of headlines&lt;/a&gt; fashion, the answer is a resounding &lt;strong&gt;No&lt;/strong&gt;. Which explains why Google ignored this vulnerability in the first place:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" alt="Google Workspace response" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR of the source article claims that due to the nature of how Google OAuth works, &lt;strong&gt;"Millions of Americans' data and accounts remain vulnerable"&lt;/strong&gt;. It relies on the nature of Domain Ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Claim
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Google’s OAuth login doesn’t protect against someone purchasing a failed startup’s domain and using it to re-create email accounts for former employees.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Domains are the root of trust* for many businesses. At Authress we rely on &lt;code&gt;authress.io&lt;/code&gt; to establish trust with our customers, just as at your business you rely on your domains for your customers. This is "Root of Trust" with an asterisk because in reality the root of trust lies with the domain authority, the domain registrar, and the issuer of your TLS certificates for HTTPS encryption. But that is outside of the scope of this article.&lt;/p&gt;

&lt;p&gt;The claim in the original article is that it is OAuth and specifically Google's OAuth that is at fault and nothing else. And that somehow domain ownership is linked to the exposure of customer data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Gaining access to your trusted domain is one way in which attackers attempt to circumvent your security strategy and compromise your users. If malicious attackers can utilize your domain to trick your users, then they can impersonate your business and steal their personal information, bank accounts, and credit card numbers. This is the basis for why phishing is popular today. As a matter of fact, phishing is so popular precisely because compromising a domain is incredibly hard, and is usually executed through a &lt;a href="https://www.cloudflare.com/learning/dns/dns-cache-poisoning/" rel="noopener noreferrer"&gt;DNS Poisoning attack&lt;/a&gt;. The strategy behind phishing is to purchase, as the next best thing, alternative domains that look and feel like the valid domain (&lt;a href="https://www.zscaler.com/blogs/security-research/phishing-typosquatting-and-brand-impersonation-trends-and-tactics" rel="noopener noreferrer"&gt;Typosquatting&lt;/a&gt;). These facsimiles exist for exactly that reason.&lt;/p&gt;

&lt;p&gt;Besides using separate domains, attackers will often also attempt &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Security/Subdomain_takeovers" rel="noopener noreferrer"&gt;Subdomain takeovers&lt;/a&gt;, which are a blend of domain compromise and using an alternative domain.&lt;/p&gt;

&lt;p&gt;However, in this case, attackers cleverly will attempt to use your existing corporate domain after you believe you are done with it. The expected flow involving Google Workspace's OAuth looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You buy a domain for your company, let's call it &lt;code&gt;yourcompany.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sign up for an Employee Identity Solution (IdP) that provides OAuth; there are many solutions here: Google Workspace, &lt;a href="https://okta.com/" rel="noopener noreferrer"&gt;Okta&lt;/a&gt;, &lt;a href="https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id" rel="noopener noreferrer"&gt;Microsoft Entra ID&lt;/a&gt;, &lt;a href="https://www.pingidentity.com/en/resources/blog/post/okta-vs-ping-best-iam-digital-security.html" rel="noopener noreferrer"&gt;Ping Identity&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Then your employees use that identity solution to sign in to a third party product such as Stripe, AWS, PostHog, etc...&lt;/li&gt;
&lt;li&gt;Lastly you give critical data to that product, business sensitive information, like your pets' birthdays.&lt;/li&gt;
&lt;li&gt;That third party application saves that data because they like data very much.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" alt="Corporate Login Flow" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity
&lt;/h2&gt;

&lt;p&gt;When you log into your favorite third party application, there needs to be an identifier sent from the Employee Identity Solution to that third party. The Third Party trusts your chosen identity solution as well as that identifier. Here is an example token generated by Google Workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"210169484474386"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736946817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736996817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warren@yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"given_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"family_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"locale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The identifier in the token is the &lt;code&gt;sub&lt;/code&gt; claim with the value &lt;code&gt;210169484474386&lt;/code&gt;. This is my User ID. (Note: this is not actually my user ID; I made it up for the purposes of this post, so feel free to do with it as you wish.)&lt;/p&gt;

&lt;p&gt;Your third party application uses this &lt;code&gt;sub&lt;/code&gt; property to uniquely identify you, and then authorize you to your company's sensitive cat photos.&lt;/p&gt;
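&lt;p&gt;A minimal sketch of that lookup, keyed on the issuer plus the stable &lt;code&gt;sub&lt;/code&gt; claim rather than on the email address. The in-memory dict stands in for a real database, and the claim values echo the example token above; this assumes token signature verification has already happened.&lt;/p&gt;

```python
# Sketch: key accounts on (issuer, sub), never on email.
# "accounts" stands in for a real database; claims are assumed to be
# from an already-verified ID token.
accounts = {}  # (issuer, subject) -> account data


def upsert_account(verified_claims: dict) -> dict:
    key = (verified_claims["iss"], verified_claims["sub"])
    account = accounts.setdefault(key, {"documents": []})
    # Email is display/contact metadata only; it is never the lookup key.
    account["email"] = verified_claims.get("email")
    return account


original = {"iss": "https://accounts.google.com", "sub": "210169484474386",
            "email": "warren@yourcompany.com"}
upsert_account(original)["documents"].append("sensitive-cat-photos")

# An attacker who re-registers yourcompany.com can reproduce the same
# email address, but receives a NEW "sub" from the identity provider.
attacker = {"iss": "https://accounts.google.com", "sub": "999999999999999",
            "email": "warren@yourcompany.com"}
print(upsert_account(attacker)["documents"])  # [] — no access to old data
```

&lt;p&gt;Because the recreated account receives a different &lt;code&gt;sub&lt;/code&gt;, the matching email alone grants no access to the original account's data.&lt;/p&gt;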

&lt;h2&gt;
  
  
  The Vulnerability
&lt;/h2&gt;

&lt;p&gt;Now, imagine that you close your Google Workspace account because your company goes bankrupt (this frequently happens because, as much as we want to believe companies are successful through hard work, the &lt;a href="https://www.youtube.com/watch?v=3LopI4YeC4I" rel="noopener noreferrer"&gt;truth is that it is actually luck&lt;/a&gt;). Along with your Google Workspace account, your domain &lt;code&gt;yourcompany.com&lt;/code&gt; will likely expire, unless you harbor some secret prayers that one day you will be able to sell it instead of letting it expire worthless. Let's assume the yourcompany.com domain is now available for anyone to purchase. By purchasing that domain, an attacker can create a new Google Workspace account, in hopes of gaining access to those exact same third parties you had used for your business.&lt;/p&gt;

&lt;p&gt;This actually isn't even the first time something like this has been attempted, and frequently it works due to hard-coded solutions in many applications. In a cruel twist of fate, here is a great example of being able to compromise the attackers themselves, because they had used an application which relied on &lt;a href="https://labs.watchtowr.com/more-governments-backdoors-in-your-backdoors/" rel="noopener noreferrer"&gt;expired trusted malicious domains&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't happen with Google OAuth, however. When you close the Google Workspace account, the &lt;code&gt;User ID&lt;/code&gt; with the value &lt;code&gt;210169484474386&lt;/code&gt; ceases to exist. This is what Google is confirming by closing the original bug report. An attacker recreating the Google Workspace account is unable to generate the same sub again. So even if an attacker attempted to create a new Google Workspace from the expired and unclaimed domain &lt;code&gt;yourcompany.com&lt;/code&gt;, the sub would be different, and your third party application would reject access.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the problem?
&lt;/h3&gt;

&lt;p&gt;The issue is that some third party applications decided not to use the &lt;code&gt;sub&lt;/code&gt; claim. The author of the Truffle Security post suggests that this is due to some bug in the Google OAuth implementation, but the reality is that OAuth has nothing to do with this problem. The failure to use the &lt;code&gt;sub&lt;/code&gt; claim stems from this shiny property in the identity token called &lt;code&gt;email&lt;/code&gt;. In the original token above, you can see the user's email there: &lt;code&gt;warren@yourcompany.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A third party that utilizes this email address to uniquely identify users is allowing malicious attackers who compromise employee identity providers through expired domains to take over your account. There are lots of reasons they do this, but primarily it is because they like the way the &lt;code&gt;@&lt;/code&gt; looks in their database.&lt;/p&gt;

&lt;p&gt;That means this is actually &lt;strong&gt;a vulnerability on the third party application side&lt;/strong&gt;. Any third party application that allows users to log in with just an email is inherently creating a vulnerability in its own platform and setting itself up to expose its (ex-)users' data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulnerability review
&lt;/h2&gt;

&lt;p&gt;So this actually has nothing to do with Google Workspace at all, and an attacker can use any email provider to perpetrate this attack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Buy an expired domain and register your domain in a new email provider&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;Profit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although in this case the &lt;code&gt;...&lt;/code&gt; is simply: &lt;strong&gt;Attempt a password reset or magic-link authentication for that third party application.&lt;/strong&gt; &lt;em&gt;In a similar attack a vulnerability was utilized by attackers through an &lt;a href="https://www.rescana.com/post/critical-zendesk-email-spoofing-vulnerability-cve-2024-49193-risks-and-mitigation-strategies" rel="noopener noreferrer"&gt;email support system&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The real vulnerability
&lt;/h3&gt;

&lt;p&gt;This shows us that OAuth and Google Workspace aren't actually the source of the issue here; it's the third party application. I've frequently condemned &lt;a href="https://authress.io/knowledge-base/articles/magic-links-passwordless-login" rel="noopener noreferrer"&gt;Magic-Link based Authentication&lt;/a&gt;, and while there are some areas where it unfortunately still provides value, it isn't worth it if you care about security. The fact that the email is provided by Google is just unfortunate. Emails are helpful for identifying where to send messages to users who want emails, but they should never be used anywhere related to security.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dismantling the solution
&lt;/h3&gt;

&lt;p&gt;The original article suggests that adding yet two more claims/properties to the User Identity Token will solve the problem. One claim isn't good enough, let's have three!&lt;/p&gt;

&lt;p&gt;Given that the problem is that third party applications are ignoring the already existing &lt;code&gt;sub&lt;/code&gt; claim, I find this to be quite the naïve suggestion. No amount of additional claims will prevent third parties from incorrectly substituting in their own beliefs where actual security is necessary. This is just an unfortunate truth. We see this every day, and it is one of the reasons we built &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; in the first place. The defaults that exist in SDKs, frameworks, protocols, and standards are just not enough for people to do the right thing; explicit investment has to be made in preventing the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Third Party Application responsibility
&lt;/h3&gt;

&lt;p&gt;The last part of the problem is that the author of the original article claims:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What can Downstream Providers do to mitigate this? At the time of writing, there is no fix&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That just isn't true. Third party applications that allow email based authentication must delete user data after account deactivation. Once you stop paying for a third party application, that data must be deleted and never exposed again unless you resume access and the third party verifies your identity. I prefer taking guidance from &lt;a href="https://pages.nist.gov/800-63-3-Implementation-Resources/63A/verification/" rel="noopener noreferrer"&gt;NIST 800-63A&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a user, you too can do something. If you have sensitive data, you could decide not to use any third party applications, unless of course you actually pay for them and ensure that you delete your account before your company stops using the application. If you give someone your data, they have it; assume the worst. We can and should put more responsibility onto these third party application services that rely on unsafe email addresses, and often SMS numbers, for authentication. As long as you treat email auth as a valid solution, you will forever be just as culpable as the third parties who rely on it. Use &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;OAuth and SAML&lt;/a&gt; for your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;business authentication&lt;/a&gt; and make sure to provide sufficiently &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;secure options&lt;/a&gt; to the users of the products and services you build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumer exposure
&lt;/h2&gt;

&lt;p&gt;The original article also seems to conflate this vulnerability with direct risks to consumers. There is nothing about this vulnerability that directly affects consumers. Sure, there are impacts to consumers regarding data privacy, but the vulnerability discussed in this article doesn't include them.&lt;/p&gt;

&lt;p&gt;That's because, as a consumer, when you use an application, that application stores your data in its primary databases. When the company that manages that application fails, both its databases and its bank accounts are empty. You don't have to worry about that data. But you do have to worry about who they gave your data to, and you have to worry about that irrespective of the company or its state. Many companies have come under investigation for just that. This is the whole premise of &lt;a href="https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal" rel="noopener noreferrer"&gt;Facebook's Cambridge Analytica scandal&lt;/a&gt;. Facebook gave user personal data to Cambridge Analytica when they should not have had access to it. Facebook didn't even need to be bankrupt for there to be a problem.&lt;/p&gt;

&lt;p&gt;The core of the issue isn't the data you have given to the company; the problem is the data they have shared with others. But no amount of praying or technological solutions is going to fix that. The problems proposed in this article regarding the domain vulnerability in question are related to the data given to third party applications secured by the company's corporate domain. The data most vulnerable in these circumstances is business-to-business relationship data. Billing information, strategic partnerships, invoices, business strategies: these are at risk.&lt;/p&gt;

&lt;p&gt;For example, at Authress we sometimes use Stripe. In Stripe we have customer account information, including customer emails for sending invoices. If you are using Stripe or another payment provider, then chances are you too are storing some sort of customer data there. If your company goes bankrupt and an attacker uses the domain vulnerability to perform a password reset on your Stripe account, they will then have access to your old company's customer invoice and email data. You probably don't care, but you should.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I think we can say definitively: &lt;strong&gt;no, there aren't millions of people at risk with this vulnerability&lt;/strong&gt;. Sure, your data is at risk; it always has been at risk and always will be. But Google's OAuth implementation, while problematic, honestly doesn't change anything at all. You can continue to file your data deletion requests with your third party application providers when you don't think they are doing too well. But if they aren't doing that well, I sincerely doubt they are deleting your data, let alone deleting it from their own third party providers. I don't know what will become of the original published articles or Google's response, but I felt strongly about first educating readers on the problem rather than lambasting Google Workspace over their response. The claim by the original author that &lt;strong&gt;millions of accounts are vulnerable due to Google's OAuth flaw&lt;/strong&gt; is just irresponsible.&lt;/p&gt;

&lt;p&gt;Curious about this and want to discuss it more?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>security</category>
      <category>startup</category>
      <category>oauth</category>
    </item>
    <item>
      <title>How does machine to machine authentication work?</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Wed, 06 Dec 2023 11:37:34 +0000</pubDate>
      <link>https://dev.to/authress/how-does-machine-to-machine-authentication-work-569d</link>
      <guid>https://dev.to/authress/how-does-machine-to-machine-authentication-work-569d</guid>
      <description>&lt;p&gt;Machine to machine auth is how you ensure secure communication between individual services, and each service can authorize others to access protected resources.&lt;/p&gt;

&lt;p&gt;This article is part of the &lt;a href="https://authress.io/knowledge-base/academy/topics" rel="noopener noreferrer"&gt;Authress Academy&lt;/a&gt; and discusses how machine clients interact with each other in a secure way. Specifically it will dive into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A refresher on how JWTs work&lt;/li&gt;
&lt;li&gt;What a machine service is&lt;/li&gt;
&lt;li&gt;Token generation and how it differs from user JWT access tokens&lt;/li&gt;
&lt;li&gt;How generated JWTs are secured&lt;/li&gt;
&lt;li&gt;How to validate service client JWTs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Throughout the article we'll refer to our clients as &lt;code&gt;service clients&lt;/code&gt;. A service client is a machine or service entity that needs access to another service. Examples of service clients might be a service that interacts with the Slack or Google Workspace APIs, or one service in your platform that needs to communicate with another, such as an &lt;code&gt;Orders Service&lt;/code&gt; that wants to fetch user-related information from a &lt;code&gt;User Profile Service&lt;/code&gt;. These services talk to each other by creating and securing HTTP requests between each other. This is known as &lt;code&gt;machine-to-machine&lt;/code&gt; communication.&lt;/p&gt;

&lt;p&gt;Every client and user needs to identify who they are so that we can verify that identity and ensure that the client is authorized to perform only the actions they are allowed to perform.&lt;/p&gt;

&lt;p&gt;The working setup is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End users that log into UIs&lt;/li&gt;
&lt;li&gt;Backend services that receive calls from these UIs&lt;/li&gt;
&lt;li&gt;These same services may also call each other&lt;/li&gt;
&lt;li&gt;A centralized authentication service such as &lt;a href="https://authress.io/knowledge-base/docs/category/introduction" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; that enables verification of tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6ifuszhoy7x6wnfi94i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6ifuszhoy7x6wnfi94i.png" alt="Mahcine to machine integrations" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll remember that every token created in a platform must be verifiable. If these tokens can't be verified, then anyone can create any token, even one that has admin privileges. To guard against this, all tokens will be JWTs. JWTs have two important components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user ID - known as the &lt;code&gt;sub&lt;/code&gt; claim&lt;/li&gt;
&lt;li&gt;The issuer - the source that created the token found in the &lt;code&gt;iss&lt;/code&gt; claim&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is an example JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io/tokens"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These JWTs can be verified by using a standard library such as an &lt;a href="https://authress.io/knowledge-base/docs/SDKs" rel="noopener noreferrer"&gt;Authress SDK&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the case of your end users, you have a login portal that users are directed to in order to log in. Through the Auth provider, users are forwarded to a provider of their choice, such as Google, to log in. Once returning, your auth provider will verify the user identity and generate a JWT that represents that user.&lt;/p&gt;

&lt;p&gt;However, in the case of service clients, there is no user interaction, there is no password, so how do these service clients get valid tokens to call other services?&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Service Client tokens
&lt;/h2&gt;

&lt;p&gt;Just as we have end user tokens we'll want to have service tokens as well. To make security in the platform simple and consistent, these tokens should have the exact same form as the user tokens. That means they should look exactly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service-client-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io/tokens"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll notice here that instead of the user ID present in the &lt;code&gt;sub&lt;/code&gt; claim from above, we want to see the service client ID.&lt;/p&gt;

&lt;p&gt;Users get tokens by navigating through the authentication login flow. That flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Users register with a username, email, biometrics, WebAuthN, Face ID, etc...&lt;/li&gt;
&lt;li&gt;Then later, users navigate to the authentication service and use the same strategy as selected during registration.&lt;/li&gt;
&lt;li&gt;The user receives back a JWT that contains their username in the &lt;code&gt;sub&lt;/code&gt; claim.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We need a similar process for service clients as well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Register a service client and receive some credentials.&lt;/li&gt;
&lt;li&gt;Service client calls the authentication service with the credentials.&lt;/li&gt;
&lt;li&gt;The service client is returned a JWT that contains the service client ID in the &lt;code&gt;sub&lt;/code&gt; claim.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The credential generation options
&lt;/h2&gt;

&lt;p&gt;Step (3) is the same as in the user login case. This means as long as the service client interacts with the same &lt;strong&gt;Auth service&lt;/strong&gt;, it will get back a valid service client JWT access token that can be easily verified. Step (2) can be accomplished if the auth service has an endpoint that accepts service client credentials and returns JWTs. The real question is how step (1) happens, and what credentials really are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credentials&lt;/strong&gt; are the means by which service clients identify themselves. Further, we need the Auth service here because we need some way of verifying the credentials that are generated. Without that, any service client could generate any credentials and impersonate both your users and other services. That means we need a service that is trustworthy.&lt;/p&gt;

&lt;p&gt;There are many ways for service clients to identify themselves. The core component is that the client can convey to the receiving service who they are. Some ways to do this are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A plain text string that says &lt;strong&gt;I'm Service X&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The service passes a string that literally says &lt;strong&gt;I am service X&lt;/strong&gt;. The problems with this should be obvious: any service can pretend to be any other service. When you have only a couple of services, this might not be a problem, but it would require that all these services are protected behind some complex firewall, because if they are public, your services will not be able to distinguish between one of your valid services and a malicious attacker attempting to impersonate them. Since your services are probably handling requests from users as well, this doesn't work in real production environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-XGET&lt;/span&gt; https://example.servire.com &lt;span class="nt"&gt;-H&lt;/span&gt;&lt;span class="s2"&gt;"Authorization: Service X"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. An API key
&lt;/h3&gt;

&lt;p&gt;When we say an API key, we usually mean a plain text string that is generated by the Auth service and is treated very similarly to a password. When a service client wants a valid JWT, it presents the API key, which the Auth service can verify. Often the API key is coupled to a specific service client: when you register the client in the Auth service, usually using the UI, you'll get back an API key. Many services with low security concerns will allow you to generate API keys for service clients to interact with. API keys are susceptible to database vulnerabilities as well as potential timing attacks. This is reviewed further in other academy articles, and won't be discussed here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffurytzmqbhryaz9smhj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffurytzmqbhryaz9smhj3.png" alt="API Key creation for service client" width="800" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Generated x509 certificate
&lt;/h3&gt;

&lt;p&gt;x509 certificates are a complex strategy that enables the client to encrypt requests using that certificate. They are usually used in a scheme called mTLS. The problem with mTLS is that it requires a trusted certificate exchange in order to even generate the certificate.&lt;/p&gt;

&lt;p&gt;If you can't guarantee the certificate exchange is secure, then this opens opportunities for vulnerabilities, making it worse than a plain text API key. Additionally, the generation of these certificates is not easily done. Lastly, they often don't provide a meaningful level of security: while they are secure, their generation is difficult, keeping them secure is difficult, and they probably won't help at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Certificate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3 (0x2)&lt;/span&gt;
        &lt;span class="na"&gt;Serial Number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;04:00:00:00:00:01:15:4b:5a:c3:94&lt;/span&gt;
        &lt;span class="na"&gt;Signature Algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sha1WithRSAEncryption&lt;/span&gt;
        &lt;span class="na"&gt;Issuer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C=BE, O=GlobalSign nv-sa, OU=Root CA&lt;/span&gt;
        &lt;span class="na"&gt;Subject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C=BE, O=GlobalSign nv-sa, OU=Root CA&lt;/span&gt;
        &lt;span class="na"&gt;Subject Public Key Info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Public Key Algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rsaEncryption&lt;/span&gt;
                &lt;span class="s"&gt;Public-Key&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(2048 bit)&lt;/span&gt;
                &lt;span class="s"&gt;Modulus&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;00:da:0e:e6:99:8d:ce:a3:e3:4f:8a:7e:fb:f1:8b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="s"&gt;...&lt;/span&gt;
                &lt;span class="na"&gt;Exponent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;65537 (0x10001)&lt;/span&gt;
    &lt;span class="na"&gt;Signature Algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sha1WithRSAEncryption&lt;/span&gt;
         &lt;span class="s"&gt;d6:73:e7:7c:4f:76:d0:8d:bf:ec:ba:a2:be:34:c5:28:32:b5:&lt;/span&gt;
         &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the key exchange, payloads can be directly decrypted by the API, and verified that way.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. An attestation provided by a third party service
&lt;/h3&gt;

&lt;p&gt;Another way to provide assurance that a request is coming from the right service client is to use yet another system that provides some sort of attestation for these requests. Systems such as WS-Federation, Kerberos tickets, and others exist. Most of these are no longer prevalently used because they lack sophisticated configurability, are unnecessarily complicated to set up, lack integrations with other tools, do not provide managed or cloud native solutions, or simply did not follow a standard such as OAuth.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Ed25519 public/private key pairs
&lt;/h3&gt;

&lt;p&gt;The last option available is using asymmetric public/private key pairs to sign and verify requests. This is the most secure option available. While some providers offer public/private key signatures, they use the weaker RS256 signing algorithm; however, this is still better than any of the other alternatives above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EdDSA signatures created via Ed25519 are the norm&lt;/strong&gt;. These pairs are created either by the Auth Service or the Service Client and then exchanged: the Auth Service gets the public key, the Service Client gets the private key. From that point on, for every HTTP request the service client will sign requests or create JWTs which the Auth Service can verify. If the Auth Service can verify the request from the signature, that means other services can use the Auth Service to verify these requests as well.&lt;/p&gt;

&lt;p&gt;For the remainder of this academy article, we will assume that service clients use a &lt;code&gt;private key&lt;/code&gt; in the Ed25519 form. There are many reasons for this, but most of them boil down to the alternatives being either unnecessarily complicated or unsafe.&lt;/p&gt;
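&lt;p&gt;To make the Ed25519 flow concrete, here is a minimal sketch of minting and verifying an EdDSA-signed JWT by hand, using Python's &lt;code&gt;cryptography&lt;/code&gt; package. The client ID and issuer values mirror the examples in this article; a real implementation would use a maintained JWT library rather than hand-rolled encoding.&lt;/p&gt;

```python
# Sketch: mint a service-client JWT with an Ed25519 private key, then verify it
# with the matching public key (the role the Auth service plays).
import base64
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def b64url(data: bytes) -> str:
    # JWT segments use unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


# The service client keeps the private key; the Auth service holds the public key
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

header = {"alg": "EdDSA", "typ": "JWT"}
payload = {"sub": "service-client-A", "iss": "https://api.authress.io/clients/service-client-A"}

signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
signature = private_key.sign(signing_input.encode())
jwt_token = f"{signing_input}.{b64url(signature)}"

# Verification: recompute the signed portion and check the signature
head_b64, body_b64, sig_b64 = jwt_token.split(".")
sig = base64.urlsafe_b64decode(sig_b64 + "=" * (-len(sig_b64) % 4))
try:
    public_key.verify(sig, f"{head_b64}.{body_b64}".encode())
    verified = True
except InvalidSignature:
    verified = False

print(verified)  # True
```

&lt;p&gt;Because only the public key is needed for verification, the private key never leaves the service client, which is exactly what makes this scheme safer than shared API keys.&lt;/p&gt;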

&lt;h2&gt;
  
  
  Securing request chains using JWT tokens
&lt;/h2&gt;

&lt;p&gt;Now that we know how &lt;strong&gt;Credentials&lt;/strong&gt; are created for our service client, we'll need to start using them. Here we'll review a few different request flows.&lt;/p&gt;

&lt;h3&gt;
  
  
  End user request flow
&lt;/h3&gt;

&lt;p&gt;Here we'll review the flow when one of your users using your UIs makes a request to your service API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wkjlr7lavyppra7k2vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wkjlr7lavyppra7k2vd.png" alt="Using JWTs to authenticate machine services" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The User gets a JWT access token by logging in and now has it available in your UI.&lt;/li&gt;
&lt;li&gt;From there your UI makes an API request to &lt;code&gt;Service A&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Service A&lt;/code&gt; needs data from &lt;code&gt;Service B&lt;/code&gt; so it makes a subsequent request.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this circumstance, Service A can actually pass along the user's JWT access token to Service B. If the resources in Service B are actually owned by the User, then Service A doesn't need to generate its own token, it can utilize the User's.&lt;/p&gt;

&lt;p&gt;This is the same flow that happens when a service asks for access to use your Google Drive. The token generated in the UI is passed through your service back to Google Drive to authenticate.&lt;/p&gt;

&lt;p&gt;When Service A and Service B receive the User's JWT access token in a request, they must verify that token and then also check the relevant authorization. We have to do this to make sure the user actually has access to the resource they are asking for.&lt;/p&gt;

&lt;p&gt;Both services can verify the token the same way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io/tokens"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;SIG&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the token, we will open it, grab the &lt;code&gt;iss&lt;/code&gt;, and ask the &lt;code&gt;iss&lt;/code&gt; for the public keys associated with this token. Once we get back the public keys, we can use them to verify the signature of the token.&lt;/p&gt;
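&lt;p&gt;The first of those steps can be sketched with nothing but the standard library: split the token and base64url-decode the payload to read the &lt;code&gt;iss&lt;/code&gt; claim. The token below is fabricated for illustration; signature verification itself should be done with a JWT library against the issuer's published keys (commonly discoverable via &lt;code&gt;{iss}/.well-known/jwks.json&lt;/code&gt; or the issuer's OpenID configuration).&lt;/p&gt;

```python
# Sketch: step one of verification is opening the token to find the issuer.
import base64
import json


def decode_segment(segment: str) -> dict:
    # JWT segments are unpadded base64url; restore padding before decoding
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def issuer_of(jwt_token: str) -> str:
    _header_b64, payload_b64, _signature_b64 = jwt_token.split(".")
    return decode_segment(payload_b64)["iss"]


# Fabricated, unsigned example token for illustration only
claims = {"sub": "user-001", "iss": "https://api.authress.io/tokens"}
token = ".".join([
    base64.urlsafe_b64encode(json.dumps({"alg": "EdDSA"}).encode()).rstrip(b"=").decode(),
    base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode(),
    "fake-signature",
])

print(issuer_of(token))  # https://api.authress.io/tokens
# From here: fetch the issuer's public keys and verify the signature
# before trusting any claim in the token.
```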

&lt;p&gt;Examples of how to authorize requests are available in the &lt;a href="https://authress.io/knowledge-base/docs/authentication/validating-jwts#example-verifiers" rel="noopener noreferrer"&gt;Authress Knowledge Base&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct service authorization
&lt;/h3&gt;

&lt;p&gt;Sometimes, however, we can't use the end user's JWT. That's because of one of these reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The end user doesn't own the resources in Service B. They could be Service A's private resources; credentials to access a database are one example. That means the user's JWT won't have access.&lt;/li&gt;
&lt;li&gt;The resources don't exist yet, and we need to create them. Sometimes resources can only be created by service clients and then granted to users. Resources, before they are created, are either claimable by anyone using something called &lt;a href="https://authress.io/app/#/api?route=post-/v1/claims" rel="noopener noreferrer"&gt;Resource Claims&lt;/a&gt;, or are owned by Service A. In the case they are owned by Service A, Service A needs to call Service B as itself.&lt;/li&gt;
&lt;li&gt;The service isn't owned by you, but is actually owned by a third party developer who develops apps or plugins for your platform. This means your third party won't necessarily know what to do with your user's JWT AND even if they did, you would not want to grant them access by giving them a valid user token. (See more about this in &lt;a href="https://authress.io/knowledge-base/docs/extensions" rel="noopener noreferrer"&gt;Platform Extensions&lt;/a&gt;.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, &lt;code&gt;Service A&lt;/code&gt;, will need to use its credentials to generate a valid JWT and then call &lt;code&gt;Service B&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using the credentials, &lt;code&gt;Service A&lt;/code&gt; can sign a request asking for a JWT, and then send that signed request to the &lt;code&gt;Auth Service&lt;/code&gt;. It will then get back a JWT that contains its client ID as the &lt;code&gt;sub&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service-client-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"client_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service-client-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io/tokens"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;SIG&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, this same service client can perform what is known as &lt;strong&gt;Offline authentication&lt;/strong&gt; by using their private key to generate a service client minted JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service-client-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"client_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service-client-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io/clients/service-client-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;SIG&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll notice here that the &lt;code&gt;issuer&lt;/code&gt; has changed to one that identifies &lt;code&gt;Service A&lt;/code&gt; as the issuer. Whether you choose offline or online authentication for your service clients is an implementation detail. It is more consistent to authenticate online, but offline offers a huge number of benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional: Bring your own keys (BYOK)
&lt;/h2&gt;

&lt;p&gt;Now, with every integration between services secure, we can technically move on to more important things. However, the secure storage of credentials is also not a trivial problem. More details are in &lt;a href="https://authress.io/knowledge-base/articles/securely-store-client-key-secret#using-encrypted-credentials" rel="noopener noreferrer"&gt;how to securely store credentials&lt;/a&gt;. We'll remember from above that the critical components are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A public/private key pair to sign and verify JWT tokens&lt;/li&gt;
&lt;li&gt;An Auth service to store the public keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means Auth services, including &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;, don't necessarily care where the public/private key pair comes from. Any pair can be used, so long as it provides modern asymmetric cryptography. For Authress, this means you can generate your own EdDSA keys or even bring your own AWS Key Management Service (KMS) keys to use with Authress. For this, only updating the public key is required, and it can be done as part of your CI/CD process or via the &lt;a href="https://authress.io/app/#/api?route=post-/v1/clients/-clientId-/access-keys" rel="noopener noreferrer"&gt;Authress Service Client API&lt;/a&gt;.&lt;/p&gt;
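&lt;p&gt;As one illustration of BYOK, generating an Ed25519 pair yourself might look like the sketch below, using Python's &lt;code&gt;cryptography&lt;/code&gt; package. Only the serialized public key would be uploaded to the Auth service; where you keep the private key (a KMS, a secrets manager) is up to you.&lt;/p&gt;

```python
# Sketch: generate your own Ed25519 key pair for BYOK. The public key is what
# gets uploaded to the Auth service; the private key never leaves your control.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()

# Public half: safe to share, upload this to the Auth service
public_pem = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
).decode()

# Private half: store securely (shown unencrypted here for brevity only)
private_pem = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
).decode()

print(public_pem.splitlines()[0])  # -----BEGIN PUBLIC KEY-----
```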

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not send the service client credentials on every request?
&lt;/h3&gt;

&lt;p&gt;It works the same as with &lt;strong&gt;user passwords&lt;/strong&gt; in browsers. You only send your password on the login page, then the site generates a session credential. Every subsequent request uses only the session credential, not the password, so the password is not present in every request. Ideally, the &lt;strong&gt;login page&lt;/strong&gt; is a separate website with more security around dependency management and development workflows to prevent password &amp;lt;=&amp;gt; session token attacks. Now the login page specifically has to be compromised, instead of any one of numerous front-end applications. That's much easier to control for. Additionally, the more services that have access to credential generation processes, such as the &lt;strong&gt;password&lt;/strong&gt;, the larger your attack surface is, and the more places the &lt;strong&gt;password&lt;/strong&gt; can end up in logs.&lt;/p&gt;

&lt;p&gt;Further, other services have no idea what to do with the service client credentials. &lt;code&gt;Service B&lt;/code&gt; can't handle &lt;code&gt;Service A&lt;/code&gt;'s credentials; only the &lt;code&gt;Auth service&lt;/code&gt; knows whether they are valid. Worse still, if the credentials are sent, then &lt;code&gt;Service B&lt;/code&gt; could impersonate &lt;code&gt;Service A&lt;/code&gt; and get access to &lt;code&gt;Service A&lt;/code&gt;'s private resources. Giving someone else your credential is the same as giving them your password: they can impersonate you. Services are no exception to this rule.&lt;/p&gt;




&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to the &lt;a href="https://authress.io/app/#/support" rel="noopener noreferrer"&gt;Authress development team&lt;/a&gt; or follow along in the &lt;a href="https://authress.io/knowledge-base/docs/category/introduction" rel="noopener noreferrer"&gt;Authress documentation&lt;/a&gt; and,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kfcgr5kmkwi1rxu471k.png" alt="Join the community" width="348" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>security</category>
      <category>backend</category>
      <category>programming</category>
    </item>
    <item>
      <title>AWS Advanced: Serverless Prometheus in Action</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 22 Aug 2023 13:09:09 +0000</pubDate>
      <link>https://dev.to/authress/aws-advanced-serverless-prometheus-in-action-j1h</link>
      <guid>https://dev.to/authress/aws-advanced-serverless-prometheus-in-action-j1h</guid>
      <description>&lt;p&gt;(Note, this article continues from Part 1: &lt;a href="https://dev.to/wparad/aws-metrics-advanced-40f8"&gt;AWS Metrics: Advanced&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  We can't use Prometheus
&lt;/h2&gt;

&lt;p&gt;It turns out Prometheus can't support serverless. Prometheus works by polling your service endpoints, fetching the metrics they expose, and storing them. For simple things you would just expose the current "CPU and Memory percentages". That works for virtual machines. It does not work for ECS Fargate, and it definitely does not work for AWS Lambda.&lt;/p&gt;

&lt;p&gt;There is actually a blessed solution to this problem. Prometheus suggests what is known as a &lt;a href="https://prometheus.io/docs/practices/pushing/" rel="noopener noreferrer"&gt;PushGateway&lt;/a&gt;. You deploy yet another service, which you run yourself, and push metrics to it. Then later, Prometheus can come and pick up the metrics by polling the PushGateway.&lt;/p&gt;

&lt;p&gt;There is zero documentation for this. And that's because Prometheus was built to solve one problem: K8s. Prometheus exists because K8s is way too complicated to monitor and track usage yourself, so all the documentation that you will find is a YAML file with some meaningless garbage in it.&lt;/p&gt;

&lt;p&gt;Also, we don't want to run another service that does actual things; that's both a security concern and a maintenance burden. The reason PushGateway exists, supposedly to solve the problems of serverless, is confusing: why doesn't Prometheus just support pushing events directly to it? And if you look closely enough at the AWS console for Prometheus, you might also notice this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7vludkr60dp5zvdiu8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7vludkr60dp5zvdiu8b.png" alt="Remote write hint for prometheus" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's a &lt;code&gt;Remote Write Endpoint&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;You've got me, because there is no documentation on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  🗎 Prometheus RemoteWrite Documentation
&lt;/h2&gt;

&lt;p&gt;So I'm going to write here the documentation for everything you need to know about Remote Write, which Prometheus does support, even though you won't find documentation anywhere on it. I'll explain why.&lt;/p&gt;

&lt;p&gt;Throughout your furious searching on the internet for how to get Prometheus working with Lambdas and other serverless technology, you will no doubt find a large number of articles trying to explain how different types of metrics work in Prometheus. But metric types are a lie. They don't exist, they are fake, ignore them.&lt;/p&gt;

&lt;p&gt;To explain how remote write works, I need to first explain what we've learned about Prometheus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus Data Storage strategy
&lt;/h3&gt;

&lt;p&gt;Prometheus stores time series; that's all it does. A time series has a set of labels and a list of values, each at a particular time. Prometheus then makes those time series easy to query. That's it, that's all there is.&lt;/p&gt;

&lt;p&gt;Metric types exist, because the initial source of the metric data doesn't want to think about time series data, so Prometheus SDKs offer a bunch of ways for you to &lt;code&gt;create&lt;/code&gt; metrics, the internal SDKs convert those metric types to different time series and then these time series are hosted on a &lt;code&gt;/metrics&lt;/code&gt; endpoint available for Prometheus to come by and use.&lt;/p&gt;
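&lt;p&gt;As an illustration of that conversion (the object shape mirrors a remote-write style payload, but the function and field names here are ours, not an official SDK API), a "counter" is really just a running total sampled over time:&lt;/p&gt;

```javascript
// Sketch: how an SDK "counter" metric flattens into a labeled time series.
// Each increment event becomes a (timestamp, running total) sample.
function counterToTimeseries(name, labels, events) {
  let total = 0;
  const samples = events.map(({ timestamp }) => ({ timestamp, value: ++total }));
  // Prometheus convention: counter names carry a _total suffix.
  return { labels: { __name__: `${name}_total`, ...labels }, samples };
}

const series = counterToTimeseries('requests', { route: '/v1/users' }, [
  { timestamp: 1000 }, { timestamp: 2000 }, { timestamp: 3000 }
]);
```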

&lt;p&gt;This is confusing, I know. It works for &lt;code&gt;# of events of type A happened at Time B&lt;/code&gt;. But how does it support &lt;code&gt;average response time for type T&lt;/code&gt;? I'll get to this later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling data transfer
&lt;/h3&gt;

&lt;p&gt;Because you could have multiple Prometheus services running in your architecture, they need to communicate with each other. This is where RemoteWrite comes in. RemoteWrite is meant to let you run your own Prometheus service and copy the data from one Prometheus to another.&lt;/p&gt;

&lt;p&gt;That's our ticket out of here. We can fake being a Prometheus server and publish our time series to the AWS Managed Prometheus service directly. As long as we fit the API used for RemoteWrite, we can push metrics straight to Prometheus. 🎉&lt;/p&gt;

&lt;p&gt;The problem here is that most libraries don't even support writing to the RemoteWrite URL. Now that we have the URL, we need to figure out how to write to it and also how to secure it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Prometheus SDK
&lt;/h3&gt;

&lt;p&gt;Luckily, in nodejs there is the &lt;code&gt;prometheus-remote-write&lt;/code&gt; library. It supports AWS SigV4, which means that we can put this library in a Lambda + APIGateway service and proxy requests to Prometheus through it. It also sort of handles the messy bit with the custom protobuf format. (Remember, K8s was created by Google, so everything is more complicated than it needs to be.) With APIGateway we can authenticate our microservices to call the metrics microservice we need to build. The API can take the request and use IAM to secure the push to Prometheus. (It's worth noting here that you can actually push from one AWS account to another account's Prometheus workspace, but trying to get this to work is bad for two reasons: first, you never want to expose one AWS account's infra services to another, which is just bad architecture; and second, trying to get a Lambda to assume a role and then pass the credentials correctly to the libraries that need them to authenticate is a huge headache.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwenqby1x3jy890qzqre0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwenqby1x3jy890qzqre0.png" alt="Metrics microservice architecture" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is the whole code of the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createSignedFetcher&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sigv4-fetch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pushTimeseries&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prometheus-remote-write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cross-fetch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;signedFetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createSignedFetcher&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aps&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;fetch&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://aps-workspaces.eu-central-1.amazonaws.com/workspaces/ws-00000000-0000-0000-00000000/api/v1/remote_write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;signedFetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;series&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pushTimeseries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
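&lt;p&gt;For context, a calling microservice would POST a body like the following to this proxy. The exact shape is our own convention, assumed from the handler's use of &lt;code&gt;request.body.service&lt;/code&gt; and &lt;code&gt;request.body.series&lt;/code&gt;; the proxy simply forwards &lt;code&gt;series&lt;/code&gt; to &lt;code&gt;pushTimeseries&lt;/code&gt; and turns &lt;code&gt;service&lt;/code&gt; into a shared label:&lt;/p&gt;

```javascript
// Illustrative request body for the metrics proxy above; field names are
// assumptions matching the handler's use of request.body.service and .series.
const payload = {
  service: 'billing-api',
  series: [{
    labels: { __name__: 'response_status_code_total', status_code: '200' },
    samples: [{ value: 1, timestamp: Date.now() }]
  }]
};
```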



&lt;p&gt;And just like that we now have data in Prometheus...&lt;/p&gt;

&lt;p&gt;But where is it?&lt;/p&gt;

&lt;h2&gt;
  
  
  ???
&lt;/h2&gt;

&lt;p&gt;So Prometheus has no viewer. Unlike DynamoDB and other services, AWS provides no way to look at the data in Prometheus directly. So we have no idea if it is working. The API response tells us &lt;code&gt;200&lt;/code&gt;, but like good engineers we aren't willing to trust that. We've also turned on Prometheus logging, and that's not really enough help. How the hell do we look at the data?&lt;/p&gt;

&lt;p&gt;At this point we are praying there is some easy solution for displaying the data. My personal theory is that this is how AWS gets you: AWS Prometheus is cheap, but you have to throw AWS Grafana on top in order to use it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x0jydk1ypydskfvvoqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x0jydk1ypydskfvvoqj.png" alt="Grafana pricing" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that says $9 per user. Wow that's expensive to just look at some data. I don't even want to create graphs, just literally show me the data.&lt;/p&gt;

&lt;p&gt;What's really cool though is Grafana Cloud offers a near free tier for just data display, and that might work for us:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funax09gpjl5a6uaotqhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funax09gpjl5a6uaotqhc.png" alt="Grafana Cloud Pricing" width="406" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well at least the free tier makes it possible for us to validate our managed Prometheus service is getting our metrics.&lt;/p&gt;

&lt;p&gt;And after way too much pain and suffering it turns out it is!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvup7rgv194dvvv3yl240.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvup7rgv194dvvv3yl240.png" alt="First pass of prometheus" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, we only sent three metric updates to Prometheus, so why are there so many data points? The problem is actually in the response from AWS Prometheus. If we dive down into the actual request that Grafana is making to Prometheus, we can see the results actually include all these data points. That means it isn't something weird with the configuration of the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbh1u6osu2mhigxm90bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbh1u6osu2mhigxm90bv.png" alt="Too many duplicated data points" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm pretty sure it has to do with the fact that the &lt;code&gt;step size is 15s&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnerud95gm39y48sfysyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnerud95gm39y48sfysyh.png" alt="Image description" width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It doesn't really matter that it does this, because all our graphs will be continuous anyway and expose the connected data. Also, since this is a sum over the timespan, we should absolutely treat this as having a minimum 15s resolution unless we actually do get summary metrics more frequently.&lt;/p&gt;

&lt;p&gt;No matter what anyone says, these graphs are beautiful. It was the first thing that hit me when I actually figured out what all the buttons were on the screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🗸 Metrics stored in database&lt;/li&gt;
&lt;li&gt;🗸 Cost effective storage&lt;/li&gt;
&lt;li&gt;🗸 Display of metrics&lt;/li&gt;
&lt;li&gt;🗸 Secured with AWS or our corporate SSO&lt;/li&gt;
&lt;li&gt;🗸 Low TCO to maintain metric population&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Solution has been:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Prometheus&lt;/li&gt;
&lt;li&gt;Lambda Function&lt;/li&gt;
&lt;li&gt;Grafana Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Some lessons here:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Grafana UX is absolutely atrocious
&lt;/h3&gt;

&lt;p&gt;Most of the awesome things you want to do aren't enabled by default. For instance, if you want to connect Athena to Grafana so that your bucket can be queried, you first have to enable the plugin for Athena in Grafana. Only then can you create a DataSource for Athena. It makes no sense why everything is hidden behind a plugin. The same is true for AWS Prometheus; it doesn't just work out of the box.&lt;/p&gt;

&lt;p&gt;Second, even after you do that, your plugin still won't work. The datasource can't be configured in a way that works, because the data source configuration options need to be separately enabled by filing a support ticket with Grafana. I died a bit when they told me that.&lt;/p&gt;

&lt;p&gt;In our products &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; and &lt;a href="https://standup-and-prosper.com" rel="noopener noreferrer"&gt;Standup &amp;amp; Prosper&lt;/a&gt; we take great pride in having everything self-service, that also means our products' features are discoverable by every user. Users don't read documentation. That's a fact, and they certainly don't file support tickets. That's why every feature has a clear name and description next to it to explain what it does. And you never have to jump to the docs, but they are there if you need them. We would never hide a feature behind a hidden flag that only our support has access to.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The documentation for Prometheus is equally bad
&lt;/h3&gt;

&lt;p&gt;Since Prometheus was not designed to be useful but instead designed to be used with K8s, there is little to no documentation using Prometheus to do anything useful. Everyone assumes you are using some antiquated technology like K8s and therefore the metrics creation and population is done for you. So welcome to pain, but at least now there is this guide so you too can effectively use Prometheus in a serverless fashion.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Always check the AWS quotas
&lt;/h3&gt;

&lt;p&gt;The default retention is 150 days, but this can be increased. One of the problems with AWS CloudWatch is that if you mess up, you have 15 months of charges. But here we start with only about five months. That's a huge difference. We'll plan to increase this, and I'm sure it will be configurable by API later.&lt;/p&gt;

&lt;p&gt;Just remember you need to review quotas for every new service you start using so you don't get bitten later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mynzeeve3iijx62t0ea.png" alt="AWS Prometheus Quotas" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Lacks the polish of a secure solution
&lt;/h3&gt;

&lt;p&gt;Grafana needs to be able to authenticate to our AWS account in order to pull the data it needs. It should only do this in one of two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses the current logged-in Grafana user's OAuth token to exchange for a valid AWS IAM role&lt;/li&gt;
&lt;li&gt;Generates an OIDC JWT that can be registered in AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet we have...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aaubmdy8p6cd2tsut9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aaubmdy8p6cd2tsut9p.png" alt="Grafana's Prometheus authentication options" width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does neither of these. Nor does it support dynamic function code to better support this. Sad. We are forced to use an AWS IAM user with an access key + secret. Which we all know you &lt;a href="https://dev.to/authress/when-to-use-aws-credentials-2ki4"&gt;should never ever use&lt;/a&gt;. And yet we have to do it here.&lt;/p&gt;

&lt;p&gt;I will say, there is something called &lt;a href="https://grafana.com/docs/grafana-cloud/connect-externally-hosted/configure-private-datasource-connect/" rel="noopener noreferrer"&gt;Private DataSource Connections (PDC)&lt;/a&gt;, but whether it actually solves the problem isn't well documented. Plus, if it did, that means we'd have to write some &lt;code&gt;Go&lt;/code&gt;, and no one wants to do that.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Prometheus metric types are a lie
&lt;/h3&gt;

&lt;p&gt;Earlier I mentioned that perhaps you want metrics that are something other than a time series. The problem is Prometheus actually doesn't support that. That's confusing because you can find guides like this &lt;a href="https://prometheus.io/docs/concepts/metric_types/" rel="noopener noreferrer"&gt;Prometheus Metric Types&lt;/a&gt; which lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Counter&lt;/li&gt;
&lt;li&gt;Gauge&lt;/li&gt;
&lt;li&gt;Histogram&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;li&gt;etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, you'll notice the pathetic lack of libraries in nodejs and Rust.&lt;/p&gt;

&lt;p&gt;How can these metric types exist in a time series way? The truth is they can't. And when Prometheus says you can have these, what it really means is that it will take your data and mash it into a time series even if it doesn't work.&lt;/p&gt;

&lt;p&gt;A simple example is API Response Time. You could track request time in milliseconds, and then count the number of requests that took that long. There would be a new time series for every possible request time. That feels really wrong.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 requests at 1ms&lt;/li&gt;
&lt;li&gt;6 requests at 2ms&lt;/li&gt;
&lt;li&gt;1 requests at 3ms&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that's essentially how Prometheus works. We can do slightly better, and the solution here is to understand how the Histogram configuration works for Prometheus. What actually happens is that we need to decide, a priori, which buckets we care about. For instance, we might create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less than 10ms&lt;/li&gt;
&lt;li&gt;Less than 100ms&lt;/li&gt;
&lt;li&gt;Less than 1000ms&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then when we get a Response Time, we add a ++1 to each of the buckets that it matches. A 127ms request would only be in the 1000ms bucket, but a 5ms would be in all three.&lt;/p&gt;
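&lt;p&gt;A minimal sketch of that cumulative-bucket logic (the bucket bounds are just the example values above, not a recommendation):&lt;/p&gt;

```javascript
// Cumulative histogram buckets: each observation increments every bucket
// whose upper bound it fits under, matching Prometheus's `le` semantics.
const bucketBounds = [10, 100, 1000]; // upper bounds in ms, decided a priori

function observe(counts, responseTimeMs) {
  for (const le of bucketBounds) {
    if (responseTimeMs <= le) counts[le] = (counts[le] || 0) + 1;
  }
  return counts;
}

const counts = {};
observe(counts, 127); // falls only under the 1000ms bound
observe(counts, 5);   // falls under all three bounds
```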

&lt;p&gt;Later, when querying this data, you can filter on the buckets (not the response time) that you care about. That means something like 1000 buckets in 10ms steps may make sense, or:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N buckets in 1ms steps from 1-20ms&lt;/li&gt;
&lt;li&gt;M buckets in 5ms steps from 20-100ms&lt;/li&gt;
&lt;li&gt;O buckets in 20ms steps from 100-1000ms&lt;/li&gt;
&lt;li&gt;P Buckets in 1s steps from 1000ms+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's roughly 100 buckets to keep track of. Depending on the SLOs and SLAs you have, you might need SLIs at different granularities.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏁 The Finish Line
&lt;/h2&gt;

&lt;p&gt;We were almost done and right before we were going to wrap up we saw this error:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;err: out of order sample.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What the hell does that mean? Well, it turns out that Prometheus cannot handle samples whose timestamps arrive out of order. What the actual hell!&lt;/p&gt;

&lt;p&gt;Let me say that again: Prometheus does not accept out-of-order writes. That's a problem, because we batch the metrics we send. We batch them because we don't have all the data available, and we don't have it because CloudWatch doesn't send it to us all at once, nor in order.&lt;/p&gt;

&lt;p&gt;We could wait the requisite ~24 hours for CloudWatch to update the metrics. But there is no way we are going to wait 24 hours. We want our metrics as live as possible; it isn't critical that they are live, but there is no good reason to wait. If the technology does not support it, then it is the wrong tech (does it feel wrong yet, eh, maybe...). The second solution is to use the hack flag &lt;code&gt;out_of_order_time_window&lt;/code&gt; that Prometheus supports. Why it doesn't support this out of the box makes no sense, but then again MySQL and PostgreSQL didn't support updating a table schema without a table lock for the longest time. The problem is that, at the time of writing, AWS does not let us set the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb" rel="noopener noreferrer"&gt;out_of_order_time_window&lt;/a&gt; flag.&lt;/p&gt;

&lt;p&gt;That only leaves us with two solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ignore out of order processing, and drop these metrics on the floor&lt;/li&gt;
&lt;li&gt;ignore the original timestamp of processing the message and just publish &lt;code&gt;now&lt;/code&gt; as the timestamp on every message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Guess which one we decided to go with.... That's right, we don't really care about the order of the metrics. It doesn't matter if we got a spike exactly at 1:01 PM or 1:03 PM; most of the time, no one will notice this difference or care. When there is a problem we'll trust our actual logs way more than the metrics anyway; metrics aren't the source of truth.&lt;/p&gt;

&lt;p&gt;And I bet you thought that's what I was going to say. It's actually true, we would be okay with that solution. But the problem is that this STILL DOES NOT FIX PROMETHEUS. You will still end up with out-of-order metrics, so the solution for us was to add the CloudFront log filename as a unique key in a &lt;code&gt;label&lt;/code&gt; in Prometheus, so our labels look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response_status_code_total&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;_unique_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cloudFrontFileName&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember, labels are just the Dimensions in CloudWatch; they are how we will query the data. And with that, we don't get any more errors, because &lt;code&gt;out of order data points&lt;/code&gt; only happen within a single time series. Specifying the &lt;code&gt;_unique_id&lt;/code&gt; causes the creation of one time series per CloudFront log file. (Is this an okay thing to do? Honestly, it's impossible to tell, because there is zero documentation on how exactly this impacts Prometheus at scale. Realistically there are a couple of other options to improve upon this, like having a random &lt;code&gt;_unique_id&lt;/code&gt; (1-10) and retrying with a different value if it doesn't work. That would limit the number of unique time series to 10.)&lt;/p&gt;

&lt;p&gt;Further, since we "group" (or what Grafana calls "sum") by the other labels anyway, the extra &lt;code&gt;_unique_id&lt;/code&gt; label gets automatically ignored:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf82tuyo5a5ovy6jgbji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf82tuyo5a5ovy6jgbji.png" alt="How to create groups in Grafana" width="477" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  And the result is in!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6in6eo0gfhfmwaskaxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6in6eo0gfhfmwaskaxa.png" alt="Response Status Codes" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the total cost:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$0.00&lt;/strong&gt;, wow!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjhg3cqikp258fbda0jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjhg3cqikp258fbda0jm.png" alt="Total AWS Prometheus cost" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Come join our &lt;a href="https://authress.io/community/" rel="noopener noreferrer"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kfcgr5kmkwi1rxu471k.png" alt="Join the community" width="348" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>microservices</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>AWS Metrics: Advanced</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 22 Aug 2023 13:06:05 +0000</pubDate>
      <link>https://dev.to/authress/aws-metrics-advanced-40f8</link>
      <guid>https://dev.to/authress/aws-metrics-advanced-40f8</guid>
      <description>&lt;p&gt;Normally I'm the last proponent of collecting metrics. The reason is: &lt;strong&gt;metrics don't tell you anything&lt;/strong&gt;. And anything that tells you nothing is an absolute waste of time setting up. However, alerts tell you a lot. If you know that something bad is happening then you can do something about it.&lt;/p&gt;

&lt;p&gt;The difference between alerts and metrics is &lt;code&gt;Knowing what's important&lt;/code&gt;. If you aren't collecting metrics yet, the first thing to do is decide what is a problem and what isn't. Far too often I feel like I'm answering the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I know if the compute utilization is above 90%?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer is: it doesn't matter, because if you knew, what would you do with that information? Almost always the answer is "I don't know" or "my director told me it was important".&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠ So what is important
&lt;/h2&gt;

&lt;p&gt;That's probably the hardest question to answer with a singular point. So for the sake of this article to make it concrete and relevant, let me share what's important for us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Up-time requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've been at an inflection point within Rhosys for a couple of years now. For &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; and &lt;a href="https://standup-and-prosper.com" rel="noopener noreferrer"&gt;Standup &amp;amp; Prosper&lt;/a&gt; we run highly reliable services. These have to be up at least 4 nines, and we often contract out for SLAs at 5 nines. But the raw up-time number isn't what's relevant anymore, because, as most reliability experts know, your service can be up but still return a 5XX here and there. This is what's known as partial degradation. You may think one 5XX isn't a problem, but at millions of requests per day this amounts to a non-trivial number of errors. Even if it is just one 5XX per day, it absolutely is important if you don't know why it happened. It's one thing to ignore an error because you know why; it's quite another to ignore it because you don't.&lt;/p&gt;

&lt;p&gt;Even 4XXs are often a concern for us. Too many 400s could tell us something is wrong. From a product standpoint, a 4XX means we did something wrong in our design, because one of our users should never get to the point where they get back a 4XX. If they do, it means they were confused about our API or the data they are looking at. That is critically important information. Further, a 4XX could mean we broke something in a subtle way: something that used to return a 2XX now unintentionally returning a 4XX means we broke at least one of our users' implementations. This is very, very bad.&lt;/p&gt;

&lt;p&gt;So, actually what we want to know is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are there more 400s now than there should be?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A simple example is when a customer of ours calls our API and forgets to &lt;code&gt;URL Encode&lt;/code&gt; the path; that usually means they accidentally called the wrong endpoint. For instance, if the UserId had a &lt;code&gt;/&lt;/code&gt; in it, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route: &lt;code&gt;/users/{userId}/data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Incorrect Endpoint: &lt;code&gt;/users/tenant1/user001/data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Correct Endpoint: &lt;code&gt;/users/tenant1%2Fuser001/data&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
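&lt;p&gt;In client code, the fix is simply to percent-encode the path segment before building the URL. A minimal sketch using the standard &lt;code&gt;encodeURIComponent&lt;/code&gt;:&lt;/p&gt;

```javascript
// A user ID containing a slash must be percent-encoded before it is
// placed into a path segment, otherwise it changes the shape of the route.
const userId = 'tenant1/user001';

const wrongPath = '/users/' + userId + '/data';
const correctPath = '/users/' + encodeURIComponent(userId) + '/data';

console.log(wrongPath);   // '/users/tenant1/user001/data'
console.log(correctPath); // '/users/tenant1%2Fuser001/data'
```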

&lt;p&gt;When this happens we could tell the caller &lt;code&gt;404&lt;/code&gt;, but that's actually the wrong thing to do. It's the right error code, but it conveys the wrong message. The caller will think there is no data for that user, even when there is. That's because the normal execution of the endpoint returns &lt;code&gt;404&lt;/code&gt; when there is no data associated with the user or the user doesn't exist. Instead, when we detect this, we return &lt;code&gt;422: Hey you called a weird endpoint, did you mean to call this other one&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A great Developer Experience (DX) means returning the right result when the user asked for something in the wrong way. If we know there is a problem, we can already return the right answer, we don't need to complain about it. However, sometimes that's dangerous. Sometimes the developer thought they were calling a different endpoint, so we have to know when to guess and when to return a &lt;code&gt;422&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In reality, when we ask &lt;code&gt;Are there more 400s than there should be&lt;/code&gt;, we are looking to do anomaly detection on some metrics. If there is an issue, we know to look at the recently released code and see when the problem started to happen.&lt;/p&gt;

&lt;p&gt;This is the epitome of using anomaly detection: are the requests we are getting at this moment what we expect them to be, or is there something unexpected and different happening?&lt;/p&gt;

&lt;p&gt;To answer this question, finally we know we need some metrics. So let's take a look at our possible options.&lt;/p&gt;




&lt;h2&gt;
  
  
  🗳 The metric-service candidates
&lt;/h2&gt;

&lt;p&gt;We use AWS heavily, and luckily AWS has some potential solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-guru/" rel="noopener noreferrer"&gt;AWS DevOps Guru&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/lookout-for-metrics/" rel="noopener noreferrer"&gt;AWS Lookout anomaly detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html" rel="noopener noreferrer"&gt;AWS CloudWatch Alarms with anomaly detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we decided to try some of these out and see if any one of them can support our needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🗸 The Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All terrible&lt;/strong&gt;. We had the most interest in using DevOps Guru; the problem is that it just never finds anything. It just can't review our logs to find problems. The one case where it is useful is for RDS queries, to determine if you need an index. But what happens when you fix all your indexes? Then what?&lt;/p&gt;

&lt;p&gt;After turning on Guru for a few weeks, we found nothing*. Okay, almost nothing: we found a smattering of warnings regarding the not-so-latest version of some resources being used, or permissions that aren't what AWS thinks they should be. But other than that, it was useless. You can imagine that for an inexperienced team, having DevOps Guru enabled will help you about as much as Dependabot does at discovering actual problems in your GitHub repos. The surprise, however, is that it is cheap.&lt;/p&gt;

&lt;p&gt;DevOps Guru - cheap but worthless.&lt;/p&gt;

&lt;p&gt;Then we took a look at AWS Lookout for Metrics. It promises some sort of advanced anomaly detection on your metrics. AWS Lookout is actually used for other things, not primarily metrics, so this was a surprise. And it seemed great, exactly what we were looking for. The price also appears reasonable at $0.75 / metric, and we only plan on having a few metrics, right? So that shouldn't be a problem. Let's put this one in our back pocket while we investigate the CloudWatch anomaly detection alarms.&lt;/p&gt;

&lt;p&gt;At the time of writing this article we already knew something about anomaly detection using CloudWatch Alarms. The reason is we have anomaly detection set on some of our AWS WAF (Web Application Firewalls). Here's an example of that anomaly detection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvz3nllc41a5j90szkfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvz3nllc41a5j90szkfo.png" alt="WAF total requests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see there is a little bit of red here where the results were unexpected. This looks a bit cool, although it doesn't work out of the box; most of the time we ended up with the dreaded alarm flapping, sending out alerts at all hours of the day.&lt;/p&gt;

&lt;p&gt;To really help make this alarm useful we did three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A lot of trial and error with the period and relevant datapoint count to trigger the alarm&lt;/li&gt;
&lt;li&gt;The number of deviations outside the norm to be considered an issue: is the change 1 =&amp;gt; 2 a problem, or is 1 =&amp;gt; 3 a problem?&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;logarithm&lt;/code&gt; based volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, while those numbers on the left aren't exactly &lt;code&gt;Log(total requests)&lt;/code&gt;, they are something like that, and this graph is the result. Logarithms are great here, because the anomaly detection will alert as soon as the value is outside of a band. And that band is magic: you've got no control over it for the most part. (You can't choose how it thinks, but you can choose how thick it is.)&lt;/p&gt;

&lt;p&gt;We didn't really care that there are 2k rps instead of 1.5k rps for a time window, but we do care if there are 3k rps or 0 rps. So the logarithm really makes more sense here: a difference in magnitude is what matters.&lt;/p&gt;
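&lt;p&gt;As a quick sketch of why the log scale helps (the &lt;code&gt;+ 1&lt;/code&gt; guard for zero traffic is our own choice, not part of any CloudWatch feature):&lt;/p&gt;

```javascript
// Emit log10(requests + 1) instead of the raw count, so the anomaly band
// reacts to order-of-magnitude changes rather than routine fluctuation.
function logVolume(requestCount) {
  return Math.log10(requestCount + 1);
}

// 1.5k rps vs 2k rps: nearly identical on a log scale...
const normal = logVolume(1500); // ~3.18
const busy = logVolume(2000);   // ~3.30
// ...while zero traffic stands out clearly.
const dead = logVolume(0);      // 0
```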

&lt;p&gt;So now that we know the technology does what we want, we can start.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⏵ Time to start creating metrics
&lt;/h2&gt;

&lt;p&gt;The WAF anomaly detection alarm looks great, though it isn't perfect; hopefully AWS will teach the ML over time what is reasonable, and let's pray that it works (I don't have much confidence in that). But at least it is a pretty good starting point. And since we are going to be creating metrics anyway, we can reevaluate the success afterwards and potentially switch to AWS Lookout for Metrics if everything looks good.&lt;/p&gt;

&lt;h2&gt;
  
  
  💰 $90k
&lt;/h2&gt;

&lt;p&gt;Now that's a huge bill. It turns out we must have done something wrong, because according to AWS CloudWatch billing and our own calculation, we'll probably end up paying $90k this month on metrics.&lt;/p&gt;

&lt;p&gt;Let's quickly review, we attempted to log APM metrics (aka Application Performance Monitoring) using CloudWatch metrics. That means for each endpoint we wanted to log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The response status code - &lt;code&gt;200&lt;/code&gt;, &lt;code&gt;404&lt;/code&gt;, etc..&lt;/li&gt;
&lt;li&gt;The customer account ID - &lt;code&gt;acc-001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The HTTP Method - &lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;PUT&lt;/code&gt;, &lt;code&gt;POST&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The Route - &lt;code&gt;/v1/users/{userId}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's one metric, with these four dimensions, and at $0.30 / custom metric / month, we assumed that means $0.30 / month. &lt;strong&gt;However, that is a lie&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AWS CloudWatch Metrics charges you not by metric name, but by each unique combination of dimensions. That means that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /v1/users/{userId}&lt;/code&gt; returning a 200 for customer 001&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DELETE /v1/users/{userId}&lt;/code&gt; returning a 200 for customer 002&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are two different metrics. A quick calculation tells us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~17 response codes used per endpoint (we heavily use 200, 201, 202, 400, 401, 403, 404, 405, 409, 412, 421, 422, 500, 501, 502, 503, 504)&lt;/li&gt;
&lt;li&gt;~5 http verbs&lt;/li&gt;
&lt;li&gt;~100 endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since metrics are retained for ~15 months, using even one of these status codes in the last 15 months will add to your bill. And this math shows it will cost ~$2,550 &lt;strong&gt;per customer&lt;/strong&gt;.&lt;/p&gt;
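&lt;p&gt;The arithmetic behind that number, using the counts listed above:&lt;/p&gt;

```javascript
// CloudWatch bills each unique dimension combination as its own metric.
const statusCodes = 17;
const httpVerbs = 5;
const endpoints = 100;
const pricePerMetricMonth = 0.30; // USD per custom metric per month

const metricsPerCustomer = statusCodes * httpVerbs * endpoints; // 8500 combinations
const costPerCustomer = metricsPerCustomer * pricePerMetricMonth; // ~2550 USD / month
```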

&lt;p&gt;If you only had 100 customers for your SaaS, that's going to cost you &lt;code&gt;$255,000&lt;/code&gt; per month. And we have a lot more customers than that. Thankfully we did some testing first before releasing this solution.&lt;/p&gt;

&lt;p&gt;I don't know how anyone uses this solution, but it isn't going to work for us. We aren't going to pay this ridiculous extortion to use the metrics service; we'll find something else. Worse still, that means AWS Lookout for Metrics is also not an option, because at $0.75 / metric it would be in the same ballpark; let's just call it $5k per customer per month. Now, while I don't mind shelling out $5k per customer for a real solution to help us keep our 5-nines SLAs, it's going to have to do a lot more than just keep a database of metrics.&lt;/p&gt;

&lt;p&gt;We are going to have to look somewhere else.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚑 SaaS to the rescue?
&lt;/h2&gt;

&lt;p&gt;I'm sure someone out there is saying we use ________ (insert favorite SaaS solution here), but the truth is none of them supports what we need.&lt;/p&gt;

&lt;p&gt;Datadog and NewRelic were long ago eliminated from our allowed list because they resorted to malicious marketing phone calls directly to our personal phone numbers, multiple times. That's disgusting. And even if we did allow one of them to be picked, they are really expensive.&lt;/p&gt;

&lt;p&gt;What's worse, all the other SaaS solutions that provide APM fail in one of these ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;their UI/UX is not just bad, it's terrible,&lt;/li&gt;
&lt;li&gt;they don't work with serverless,&lt;/li&gt;
&lt;li&gt;they are more expensive than Datadog; I don't even know how that is possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But wait, didn't AWS release some managed service for metrics...&lt;/p&gt;

&lt;h2&gt;
  
  
  𓁗 Enter AWS Athena
&lt;/h2&gt;

&lt;p&gt;AWS Athena isn't a solution, it's just a query engine on top of S3, so when we say "use AWS Athena" what we really mean is "stick your data in S3". But actually we mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stick your data in S3, but do it in a very specific and painstaking way. It's so complicated and difficult that AWS wrote a whole second service to take the data in S3 and put it back in S3 differently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That service is called &lt;code&gt;Glue&lt;/code&gt;. We don't want to do this; we don't want something crawling our data and attempting to reconfigure it, it just doesn't make sense. We already know the data at the time of consumption: we get a timespan with some data, and we write back that timespan. Since we already know the answer, using a second service dedicated to creating timeseries doesn't make sense. (It absolutely would make sense if we had non-timeseries data and needed to convert it to timeseries for querying, but we do not.)&lt;/p&gt;

&lt;p&gt;The real problem here however, is that every service we have would need to figure out how to write this special format to S3 so that it could be queried. Fuck.&lt;/p&gt;

&lt;p&gt;While we could build out a &lt;code&gt;Timeseries-to-S3 Service&lt;/code&gt;, I'd rather not. The Total Cost of Ownership (TCO) of owning services is really really high. We knew this already before we built our own statistics platform and deprecated it, we don't want to do it again.&lt;/p&gt;

&lt;p&gt;So Athena was out. Sorry.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔥 Enter Prometheus
&lt;/h2&gt;

&lt;p&gt;And no, I'm not talking about Elasticsearch; that doesn't scale, and it isn't really managed. It's a huge pain. It's like taking a time machine back to the days of on-prem DBAs, except these DBAs work at AWS and are less accessible.&lt;/p&gt;

&lt;p&gt;The solution: AWS runs managed Grafana and managed Prometheus. The true SaaS is Grafana Cloud, but AWS has a managed version, and of course these are also two different AWS services.&lt;/p&gt;

&lt;p&gt;Apparently Prometheus is a metrics solution. It keeps track of metrics, and its pricing is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y28sh0qk9t8m17vtu0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y28sh0qk9t8m17vtu0a.png" alt="Prometheus pricing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We know we aren't going to have a lot of samples (aka API requests to Prometheus), so let's focus on the storage...&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$0.03/GB-Mo&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Looking at a metric storage line as &lt;code&gt;status code + verb + path + customer ID&lt;/code&gt;, we get ~200B per sample, which works out to about &lt;code&gt;$1e-7&lt;/code&gt; per request (ingestion plus storage). For $100 / month that would afford us &lt;code&gt;847,891,077&lt;/code&gt; API requests per month (this assumes the &lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html" rel="noopener noreferrer"&gt;default ~150 days&lt;/a&gt; retention time). Let's call it 1B requests per month; that's a sustained ~400 rps. Now while that is a fraction of where we are at, the pricing for this feels so much better. We pay only for what we use, and the amortized cost per customer also makes a lot more sense.&lt;/p&gt;
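&lt;p&gt;Roughly how that estimate falls out. The combined ~&lt;code&gt;$1e-7&lt;/code&gt; per request is the figure quoted above; the 30-day month is our own simplifying assumption:&lt;/p&gt;

```javascript
// Back-of-the-envelope from the figures above: a ~200 byte sample costs
// roughly 1e-7 USD per request once ingestion and ~150 days of storage
// at $0.03/GB-month are combined.
const costPerRequest = 1e-7; // USD, combined figure quoted above
const monthlyBudget = 100;   // USD

const requestsPerMonth = monthlyBudget / costPerRequest; // ~1 billion
const sustainedRps = requestsPerMonth / (30 * 24 * 3600); // ~385 rps
```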

&lt;p&gt;So it seems we are definitely going to use Prometheus.&lt;/p&gt;

&lt;p&gt;For how we did that exactly, check out Part 2: &lt;a href="https://dev.to/wparad/aws-advanced-serverless-prometheus-in-action-j1h"&gt;AWS Advanced: Serverless Prometheus in Action&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you liked this article come join our &lt;a href="https://authress.io/community/" rel="noopener noreferrer"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>monitoring</category>
      <category>microservices</category>
    </item>
    <item>
<title>Denylists and Invalidating user access</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Wed, 05 Jul 2023 14:15:26 +0000</pubDate>
      <link>https://dev.to/authress/denylists-and-invaliding-user-access-1cgl</link>
      <guid>https://dev.to/authress/denylists-and-invaliding-user-access-1cgl</guid>
      <description>&lt;p&gt;This article is part of the &lt;a href="https://authress.io/knowledge-base/academy/topics"&gt;Authress Academy&lt;/a&gt; and discusses the different ways to invalidate a user's access and revoke their tokens.&lt;/p&gt;

&lt;p&gt;It discusses solutions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Denylists&lt;/li&gt;
&lt;li&gt;OAuth JWT token expiry&lt;/li&gt;
&lt;li&gt;API Gateways&lt;/li&gt;
&lt;li&gt;Refresh Tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;To secure communication between different systems or services that you have, an access token is sent between the user, client, or machine to your service API. Usually, this is some sort of JWT access token.&lt;/p&gt;

&lt;p&gt;In the case of Authress and other providers, a JWT token is generated when the user logs in, that token is available for them to gain access to their resources and prove their identity to your services.&lt;/p&gt;

&lt;p&gt;When that token expires, they must then fetch a new token. (Fetching new tokens won't be discussed in this article, the KB has additional articles on &lt;a href="https://authress.io/knowledge-base/docs/category/authentication"&gt;Authress Authentication&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;However, there are cases where we want to terminate access as soon as possible: user role changes, permission revocation, or token exposure may all create the need to terminate access.&lt;/p&gt;

&lt;h2&gt;
  
  
  OAuth JWT access tokens + Scopes
&lt;/h2&gt;

&lt;p&gt;Many authentication solutions and identity providers generate JWT access tokens and provide these as a way to authenticate users to your applications. They are generated for the user in the UI, and are sent to your services via the &lt;code&gt;Authorization&lt;/code&gt; header. Generated JWTs are usually created through a standardized OAuth exchange, the details of which aren't immediately relevant here, but there are a number of other Authress KB articles that discuss the topic. For the rest of the article, we'll assume that you know what a JWT access token is and how to get one. (If you are unsure, please don't hesitate to join our &lt;a href="https://authress.io/community"&gt;Community&lt;/a&gt; and ask!)&lt;/p&gt;

&lt;p&gt;In the case of authentication solutions that do not offer any sort of authorization or access control, they may attempt to store a user role (or roles) in the JWT in the &lt;code&gt;scope&lt;/code&gt; JWT property claim.&lt;/p&gt;

&lt;h3&gt;
  
  
  OAuth JWT Scopes
&lt;/h3&gt;

&lt;p&gt;The OAuth JWT scope claim is a property which contains the list of permissions and roles that the user has approved to be passed to the JWT. However, since the user gets their own JWT, this may seem unnecessarily complex. The standard OAuth use case is that the requestor of the JWT is not the same as the user themselves: your user is actually approving the generation of the JWT by your UI, and then your service will validate the JWT. Scopes here are a way to restrict which permissions the user is passing to the token. When this JWT is used, the &lt;code&gt;scopes&lt;/code&gt; provide a way to filter the resources that are actually allowed to be accessed.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user has access to all their own resources of the types: documents, videos, photos&lt;/li&gt;
&lt;li&gt;A UI could request the scopes &lt;code&gt;documents&lt;/code&gt; and &lt;code&gt;photos&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The user approves the UI requests for those scopes
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://login.authress.io"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"documents photos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openid profile email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"https://api.authress.io"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated JWT would only contain the &lt;code&gt;documents&lt;/code&gt; and the &lt;code&gt;photos&lt;/code&gt; scopes and not the &lt;code&gt;videos&lt;/code&gt; one. That means of all the resources that the User has access to, the JWT only has access to user's documents and photos.&lt;/p&gt;
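&lt;p&gt;To inspect which scopes a token actually carries, the payload segment can be decoded without verification (for debugging only; never skip signature verification when authorizing a request). A sketch for Node.js:&lt;/p&gt;

```javascript
// Decode the middle (payload) segment of a JWT for inspection.
// This does NOT verify the signature; it only reads the claims.
function decodeJwtPayload(jwt) {
  const payloadSegment = jwt.split('.')[1];
  const json = Buffer.from(payloadSegment, 'base64url').toString('utf8');
  return JSON.parse(json);
}

function hasScope(jwt, scope) {
  const claims = decodeJwtPayload(jwt);
  return (claims.scope || '').split(' ').includes(scope);
}
```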

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv41bunrjjse5oo0xhry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv41bunrjjse5oo0xhry.png" alt="Scope restricted resources" width="798" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll notice that scopes are not the same as the permissions the user has. The user has permissions to only some resources, but the scopes restrict access even further: scopes constrain access based on what the user already has permissions to. OAuth JWT scopes are optimally used when you don't control both the UI and the Service, and there are two reasons for that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Most Authentication providers allow the User to specify any and all scopes that they want in the token. This means if you are using the Scopes to control access to resources that the user can interact with, you have a vulnerability. You are letting your users control what they have access to. Scopes must not be used for access control for the User.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You'll need some way to capture which permissions the user actually granted to another service. If a third party is accessing your resources, or you are accessing a third party's resources, the &lt;code&gt;scopes&lt;/code&gt; granted to the client application control that access. This is a different list of permissions because it is a different entity performing the API request.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Issues with scopes
&lt;/h3&gt;

&lt;p&gt;Since scopes are embedded into the JWT token, as long as the JWT token is valid (between the &lt;code&gt;nbf&lt;/code&gt; and &lt;code&gt;exp&lt;/code&gt; times) then the holder of that token has access to the user's resources that are granted via the scopes in the token. This means that there is no way to block requests in a situation where you want to invalidate the access that the token grants.&lt;/p&gt;
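&lt;p&gt;A sketch of why: the validity check a service performs is purely local to the token's claims, so nothing in it can know that access was revoked after issuance (the claim names follow the JWT spec; the helper itself is illustrative):&lt;/p&gt;

```javascript
// A JWT is accepted purely on its embedded time window; nothing in this
// check can know that the grant was revoked after the token was issued.
function isWithinValidityWindow(claims, nowSeconds) {
  const now = nowSeconds !== undefined ? nowSeconds : Math.floor(Date.now() / 1000);
  const afterNotBefore = claims.nbf === undefined ? true : now >= claims.nbf;
  const beforeExpiry = claims.exp !== undefined ? claims.exp > now : false;
  return afterNotBefore ? beforeExpiry : false;
}
```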

&lt;p&gt;This is the OAuth spec, and while this seems like a missing part of the specification, if we consider the difference between the access granted to the user and the scopes granted to the third party on behalf of the user, it makes sense that in most cases this doesn't matter: invalidating a JWT wouldn't make sense.&lt;/p&gt;

&lt;p&gt;This also makes sense when we consider what these JWT tokens are. A JWT access token is a representation of a user identity. The Authentication Service generates the token to represent the user's identity, and then the user, client, or third party presents that token to prove who they are. However, it doesn't really say anything about what they can access or why. Further, JWTs don't stop representing the user just because their access changes.&lt;/p&gt;

&lt;p&gt;As discussed in Academy for &lt;a href="https://authress.io/knowledge-base/academy/topics/access-control-strategies"&gt;access control strategies&lt;/a&gt; and &lt;a href="https://authress.io/knowledge-base/academy/topics/offline-attenuation"&gt;token attenuation&lt;/a&gt;:&lt;/p&gt;

&lt;h3&gt;The most common mistake is putting the roles or permissions into the JWT.&lt;/h3&gt;

&lt;p&gt;When the permissions are in the JWT (using the &lt;code&gt;scope&lt;/code&gt; or a custom claim), the permissions have become coupled to the identity of the user. These are separate things, and deserve separate treatment, and we'll go into some of the further issues and resolutions below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Revoking a JWT access token
&lt;/h2&gt;

&lt;p&gt;Now that we know that the permissions coupling to the JWT access token creates some problems, we can discuss what we can do about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  0. Do nothing approach
&lt;/h3&gt;

&lt;p&gt;The default and simplest approach is to do nothing. For most implementations and products this may be the best answer; however, if you are in healthcare, banking, or another highly regulated industry, this is not going to be the right approach, and skipping this option is recommended.&lt;/p&gt;

&lt;p&gt;When the JWT expires, the user will lose access to the scopes and permissions that were specified in the token. This may seem bad from a security standpoint; however, in most cases this is exactly the flow that makes the most sense. Most cases don't need to be concerned with malicious attackers using still-valid JWTs. When the user signs out of the UI, discard the JWT; once it leaves memory it should no longer be a problem. If we are worried that a hypothetically saved JWT could still be used after log out, we should ask &lt;strong&gt;how&lt;/strong&gt; that token came to be available in the first place.&lt;/p&gt;

&lt;p&gt;One common answer is that the user is on a shared machine. However, on shared machines user logout is not reliable, and shared machines are not trustworthy. Simply logging out, even if the token were revoked, will not prevent the vulnerabilities that allow the token to be captured and the user impersonated. It is an illusion that there is a solution to this problem other than wiping the machine's OS (which still might not be sufficient).&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Limiting the access token lifetime
&lt;/h3&gt;

&lt;p&gt;Going further, ensure that the lifetime of the token is as limited as possible. With authentication session management, we can generate a new token for the user when the current one expires. So instead of 24 hours, 4 hours, or even 1 hour, if the token expiry is 5 minutes, the user gets a new token every 5 minutes. Even if the user doesn't log out, that token is going to expire within the next 5 minutes, almost completely eliminating the feasibility of an attack using an exfiltrated token.&lt;/p&gt;
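&lt;p&gt;As a minimal sketch of this flow (the &lt;code&gt;refreshToken&lt;/code&gt; callback is hypothetical, standing in for whatever your session management exposes), a client can read the token's &lt;code&gt;exp&lt;/code&gt; claim and request a replacement shortly before expiry:&lt;/p&gt;

```javascript
// Decode the (unverified) payload of a JWT to read its expiry.
// This is only for scheduling; signature verification still happens server-side.
function getTokenExpiry(jwt) {
  const payload = JSON.parse(Buffer.from(jwt.split('.')[1], 'base64url').toString('utf8'));
  return payload.exp * 1000; // `exp` is in seconds since the epoch
}

// Schedule a refresh 30 seconds before the token expires.
// `refreshToken` is a hypothetical callback that asks the IdP session for a new JWT.
function scheduleRefresh(jwt, refreshToken) {
  const msUntilRefresh = Math.max(0, getTokenExpiry(jwt) - Date.now() - 30_000);
  return setTimeout(refreshToken, msUntilRefresh);
}
```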

&lt;p&gt;Using Authress or another centralized authentication identity provider, you can typically configure how quickly tokens expire.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs86y5e285ikrlnk2biyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs86y5e285ikrlnk2biyk.png" alt="Short lived token lifetimes" width="741" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use a shared Denylist
&lt;/h3&gt;

&lt;p&gt;In a platform with many services, creating an endpoint on one service that lets the user log out and records that timestamp isn't sufficient to ensure that the token with the scopes doesn't get used again. This leads to the need for a unified solution for token management. We already know we need a unified solution for authentication, but this adds an additional layer of complexity. One common solution is an API Gateway: a reverse-proxy that sits in front of all the services in your platform.&lt;/p&gt;

&lt;p&gt;The API Gateway receives every request, verifies that the token is valid based on your identity provider and internal token cache, and then forwards the request or denies it. Additionally, it can offer an endpoint that allows your services to revoke a still valid identity token.&lt;/p&gt;
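&lt;p&gt;A sketch of what the gateway-side check might look like, assuming tokens carry a &lt;code&gt;jti&lt;/code&gt; claim and using an in-memory set as a stand-in for the shared revocation store:&lt;/p&gt;

```javascript
// In-memory stand-in for the shared revocation store (Redis, DynamoDB, etc.).
const revokedTokenIds = new Set();

// Called by the logout/revocation endpoint exposed to your services.
function revokeToken(jti) {
  revokedTokenIds.add(jti);
}

// Gateway check: run after signature validation, before forwarding the request.
// `claims` is the already-verified JWT payload.
function isTokenRevoked(claims) {
  return revokedTokenIds.has(claims.jti);
}
```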

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99cie8wfdo1y00pgtlc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99cie8wfdo1y00pgtlc6.png" alt="API Gateway implementation" width="574" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The drawback is that this requires an additional piece of technology on top of your existing Authentication IdP, and it introduces a requirement that your user logout flow now call your exposed API Gateway to invalidate these tokens. There are solutions that make storing this data in a database easier, but it is still something extra you'll need to add to your API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed Cache Alternative:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you aren't interested in setting up an API Gateway, which often is infeasible in a multi-region deployment or in a highly distributed system, then alternatively a distributed shared caching solution can work. The Authress recommendation is to avoid having a shared cache wherever possible, since it increases the cost of maintenance and often serves as a single point of failure that wasn't designed to handle fault tolerance at the scale that IdPs are designed for. (See &lt;a href="https://authress.io/knowledge-base/docs/advanced/authress-downtime-protections"&gt;Authress downtime protections&lt;/a&gt; for some of these fault tolerant protections.)&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Permission changes
&lt;/h3&gt;

&lt;p&gt;So far we have only discussed invalidating credentials when the user logs out, but what happens when their roles and permissions change? Your API Gateway isn't going to know what to do, unless you add permission and role changes to it as well. At that point you've built your own AuthZ solution, and it would be better to use an existing IdP as part of your reverse-proxy solution. (In that case, you might want to check out &lt;a href="https://authress.io/knowledge-base/articles/so-you-want-your-own-authorization"&gt;Building your own AuthZ solution&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Worse, if you invalidate the token in the API Gateway, the users in your UI never learn that they need to get a new token. The invalidation happened on the service side without their knowledge, so they'll start getting &lt;code&gt;401&lt;/code&gt; errors back from the API Gateway. A workaround is to add handling code to your UI that forces a re-login whenever a service responds with a &lt;code&gt;401&lt;/code&gt;. But that code has to be replicated to every UI you have, and since not every service might be behind the API Gateway, special handling is also necessary to know which service the user is calling: for some calls a &lt;code&gt;401&lt;/code&gt; is expected, while for others a &lt;code&gt;401&lt;/code&gt; means "my roles changed". That creates a poor UX if users are forced to log out every time their roles change.&lt;/p&gt;

&lt;p&gt;The best solution here is to decouple the authorization from the token itself. By storing the authorization access control and permissions for the user in a separate service from the user identity handling, when the access changes, then no further updates are necessary.&lt;/p&gt;

&lt;p&gt;That's because authorization is checked realtime instead of only being populated during token generation. With JWT scopes, the permissions and roles are cemented into the token, but with an Authorization solution the access is dynamic. When roles change, so does the user's live access. This is a significant improvement, because there is no extra storing of access tokens nor a need to invalidate them one by one.&lt;/p&gt;
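&lt;p&gt;A rough sketch of the decoupled check (the &lt;code&gt;authorizationClient.hasAccess&lt;/code&gt; call is hypothetical, loosely modeled on how dedicated authorization services expose access checks):&lt;/p&gt;

```javascript
// Realtime check: the service asks the authorization system on every request,
// instead of trusting scopes baked into the token at generation time.
async function canUserAccess(authorizationClient, userId, permission, resourceUri) {
  // The authorization service holds the live access records, so role changes
  // take effect immediately -- no token invalidation required.
  return authorizationClient.hasAccess(userId, permission, resourceUri);
}
```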

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawd3nlomw4k4oaeni3cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawd3nlomw4k4oaeni3cd.png" alt="API Gateway implementation using Authress" width="573" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, depending on the solution you are using (here we assume &lt;a href="https://authress.io/knowledge-base/docs/category/authorization"&gt;Authress Authentication&lt;/a&gt;), when the user logs out, the session may also be terminated on the provider side. This means that a solution that offers both Authentication and Authorization, while keeping them segregated, works best for handling token invalidation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concerns
&lt;/h2&gt;

&lt;p&gt;One common concern, as we go further down this list, is the increasing reliance on a shared centralized system. Most shared systems are not fault tolerant: they weren't built to scale, nor built to be reliable. Many open source implementations of token invalidation, caching, and gateways suffer from this lack of reliability. And the more reliably they were designed, the more difficult they are to maintain, because they pass the burden of maintenance onto the development team that runs the open source solution. It's better to go with a high-SLA AuthN solution that already supports your needs out of the box.&lt;/p&gt;

&lt;p&gt;Rather than having to build this yourself and maintain it, finding the right product that fits your needs is a must. The current &lt;a href="https://authress.io/knowledge-base/articles/auth-situation-report"&gt;auth situation report&lt;/a&gt; is available in the KB for a deep dive on the different auth technology pieces.&lt;/p&gt;

&lt;p&gt;Here are some hints about how reliable these solutions are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reliability designed in
&lt;/h3&gt;

&lt;p&gt;High-SLA services (at least four 9's) are designed with this distributed reliability in mind. The service isn't a single point of failure, since it is frequently replicated to multiple regions in multiple datacenters with additional backups. Your technology can be anywhere in the world and your users on the opposite side, and both your services and your users can still experience very low latency. This is often achieved with distributed CDNs and edge-node authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Selection of the right protocol and technology
&lt;/h3&gt;

&lt;p&gt;When designing token invalidation, a large part of being able to execute effectively is picking technologies that are high performance by default. Some providers offer the &lt;a href="https://authress.io/knowledge-base/docs/authorization/service-clients/authress-implementation"&gt;EdDSA&lt;/a&gt; token signature standard, which is faster and produces smaller signatures. Utilizing distributed public keys enables a hands-off approach to verifying credentials. With a provider that supports public key cryptography, tools like an API Gateway, or the services themselves, can verify tokens without making any API calls. The public keys are cacheable for an extended period of time, enabling fast requests without a single point of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Clear separation of responsibilities
&lt;/h3&gt;

&lt;p&gt;Solutions that recommend storing the access permissions inside the identity JWT immediately get off to a bad start. As we've seen above, this encourages coupling in a way that causes issues in the important security edge cases. While it can seem like a simplification, it actually just creates problems. There are very few products, services, and applications where JWT scopes for permissions are actually a good fit. Therefore, selecting a technology that separates identity from access control is important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What about refresh tokens
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Refresh tokens&lt;/code&gt; are part of the OAuth standard and exist to enable third party services to impersonate users even after the user's JWT access token has expired. That has nothing to do with token invalidation, and they don't help us at all in our use case.&lt;/p&gt;

&lt;p&gt;Revocation comes up often with OAuth because documentation sources often point to Refresh tokens as a solution. In the OAuth spec, Refresh tokens can actually be revoked, and solutions, recommendations, and implementations get stuck on this part. Refresh tokens are long lived, and revoking one prevents new access tokens from being generated from it. But just because the refresh token is revoked does not mean the access token is revoked. In almost all implementations that's the case, because access tokens are validated locally by each service, while the revocation database lives on the authorization side. So attempting to use refresh token revocation to block access token usage would require every API request to call out to the revocation database to verify the token. This converts the standard distributed, offline, public-key token verification process into one that is online, centralized, and slow.&lt;/p&gt;




&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to the &lt;a href="https://authress.io/app/#/support"&gt;Authress development team&lt;/a&gt; or follow along in the &lt;a href="https://authress.io/knowledge-base/docs/category/introduction"&gt;Authress documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>api</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Myths about API HTTP clients</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Thu, 29 Jun 2023 08:38:42 +0000</pubDate>
      <link>https://dev.to/authress/myths-about-api-http-clients-ecc</link>
      <guid>https://dev.to/authress/myths-about-api-http-clients-ecc</guid>
      <description>&lt;p&gt;Having built many Product APIs in my experience for multiple companies, there are a number of Myths we've come to learn about APIs in general. Here are some of the more interesting ones we've learned through building &lt;a href="https://authress.io"&gt;Authress&lt;/a&gt;, which is an AuthN + AuthZ solution. For reference here is the &lt;a href="https://authress.io/app/#/api"&gt;Authress API&lt;/a&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Myths
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Clients should ignore properties in an HTTP response that they don’t understand.
&lt;/h2&gt;

&lt;p&gt;In reality, they will start to depend on every undocumented property in every conceivable way (see &lt;a href="https://www.hyrumslaw.com/"&gt;Hyrum's Law&lt;/a&gt;). Our APIs are frequently used in unexpected ways. Before every change, we go back through our logs to see every call that was made (in a reasonable timeframe) to make sure we don't break any previously expected behavior. "How could our customers break if we change this?" is not a game we play; it is a frequent topic of conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Clients should follow redirects if presented with them.
&lt;/h2&gt;

&lt;p&gt;But they never will. Clients will assume that getting back anything other than &lt;code&gt;200&lt;/code&gt; is an error. Yes, anything other than exactly &lt;code&gt;200&lt;/code&gt;: even a &lt;code&gt;201&lt;/code&gt; will cause an error. So we have to be very careful when changing the status code we return, or even the &lt;code&gt;errorCode&lt;/code&gt; in the error response. Changing a &lt;code&gt;400&lt;/code&gt; to a &lt;code&gt;422&lt;/code&gt; will break someone, because they definitely wrote code that says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;statuscode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;doSomething&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Clients will follow the recommended HTTP status codes and headers to handle async requests.
&lt;/h2&gt;

&lt;p&gt;Not even remotely. Clients usually assume all API calls are synchronous and instantaneous. That means the default behavior always has to make sense, and it can never change. When you want to offer clients better functionality, it either has to be the default or sit behind a documented RFC header flag like &lt;code&gt;Prefer: respond-async&lt;/code&gt;. Clients will call these endpoints far more than they should.&lt;/p&gt;
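&lt;p&gt;As a sketch, a server can keep synchronous behavior as the default and only switch when the client explicitly opts in via the RFC 7240 &lt;code&gt;Prefer&lt;/code&gt; header (the handler shape here is hypothetical):&lt;/p&gt;

```javascript
// Decide whether to process a request asynchronously, based on the RFC 7240
// `Prefer` header. The default stays synchronous so existing clients never break.
function wantsAsync(headers) {
  const prefer = (headers['prefer'] || '').toLowerCase();
  return prefer.split(',').map(p => p.trim()).includes('respond-async');
}
```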

&lt;h2&gt;
  
  
  4. Clients will use your published API specification and SDKs.
&lt;/h2&gt;

&lt;p&gt;Instead, they'll make every kind of call to your API; they don't care whether you can handle it. If there is an API rate limit or size limit, you are guaranteed that they will attempt to overcome it. They'll use every API software client known to humans to call your API, not just Postman. And the bugs in those clients will become your problems: they'll often not work for some reason, and then you get a nice introduction to how yet-another-api-client has some weird defaults.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Clients can use your published API specification to make HTTP calls.
&lt;/h2&gt;

&lt;p&gt;Clients expect your API to be available in every programming language and every framework for every language. If it's not available in their preferred language? Too bad, they'll just use a different service. They will also find every bug in your SDK: not just actual bugs, but issues with how a library dependency of a dependency of a dependency misinterpreted an RFC, so now you need to land bug fixes in 3 other open source projects to get your customer working.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. You can make a secure Login/Signup portal for your users.
&lt;/h2&gt;

&lt;p&gt;No matter what you do, these unprotected endpoints will be spammed until you run out of money in your wallet. We know because we run &lt;a href="https://authress.io"&gt;Authress&lt;/a&gt;, which provides this for our customers, and we see no end of such requests. Don't build these endpoints yourself: use a federated provider, or better yet pick the right identity aggregator for your use case by reviewing this &lt;a href="https://authress.io/knowledge-base/articles/auth-situation-report"&gt;Auth situation report&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. You will only get requests that make sense.
&lt;/h2&gt;

&lt;p&gt;Welcome to the world of bots, where every day your APIs will be spammed by an arbitrary list of fuzzers attempting to find exposed and unsecured MongoDBs, PHP WordPress instances, and any number of other requests with random &lt;code&gt;Authorization&lt;/code&gt; headers. Be prepared to handle errors in your application that you never planned for, because no human would have thought to make that API call. Everything that can go wrong, will. And those errors will always be the first to show up in your alert tracker.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Following the standard for what you need will work out--RFC, IETF, W3C.
&lt;/h2&gt;

&lt;p&gt;It would be nice if there was actually a standard for what you are working on, but at best there are just some documents that sort of look like what you want. Here's an awesome &lt;a href="https://github.com/kdeldycke/awesome-falsehood"&gt;curated list of falsehoods you might believe&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your clients will always request something that doesn't match the spec. Worse still, they'll require it in order to use your API. The spec doesn't support it, not even remotely, leaving you in a really bad place. Worst of all, you know that they are right.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Supporting an integration with a third party for your customer is easy.
&lt;/h2&gt;

&lt;p&gt;Every API today needs to support a deluge of third party products that integrate with it. When we built Authress, we had to integrate with just a couple of OAuth/SAML/OIDC/AD solutions. Now the list goes on and on: Google, Azure, Steam, Zoho, Slack, Discord, MagicLinks, etc. And that's just the AuthN part; we also integrate with every cloud provider (AWS, GCP, Azure) to send logs to our partners' SIEM tools (SumoLogic, Elastic, DataDog), and still more for other types of data integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Clients will cache reasonably and use your Cache-Control header to do so.
&lt;/h2&gt;

&lt;p&gt;Clients will never cache their responses unless they are somehow forced to. Even when the data almost never changes, the idea of caching it for even one or two seconds never crosses their mind. They will happily call your API for the same data over and over again. Rate limiting will, at some point, force them to make a change.&lt;/p&gt;
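&lt;p&gt;For illustration, a minimal client-side cache that honors the server's &lt;code&gt;Cache-Control&lt;/code&gt; &lt;code&gt;max-age&lt;/code&gt; might look like this (the fetch implementation is injected and returns a simplified response shape, so the sketch stays self-contained):&lt;/p&gt;

```javascript
// A minimal client-side cache that honors the server's Cache-Control max-age.
// `fetchImpl(url)` resolves to { body, headers } -- a simplified response shape.
function createCachingClient(fetchImpl) {
  const cache = new Map(); // url -> { body, expiresAt }
  return async function get(url) {
    const cached = cache.get(url);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.body; // served from cache, no network call
    }
    const response = await fetchImpl(url);
    const match = /max-age=(\d+)/.exec(response.headers['cache-control'] || '');
    if (match) {
      cache.set(url, { body: response.body, expiresAt: Date.now() + Number(match[1]) * 1000 });
    }
    return response.body;
  };
}
```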

&lt;h2&gt;
  
  
  11. Your rate limits will make sense for your users.
&lt;/h2&gt;

&lt;p&gt;The only thing rate limiting does is create an expensive piece of technology you now need to maintain. If you have multiple services or applications, then you've created a very dangerous piece of code/library/reverse-proxy in which one small bug will cause downtime for everyone.&lt;/p&gt;

&lt;p&gt;Further, rate limits are surprises for your clients. One day your clients will accidentally hit them and start having huge problems, because rate limits aren't for them, they are for you. So if you make the mistake of adding rate limits, you now need to monitor all the traffic in and out of your service to make sure none of your clients ever hits the limit. Because if they do, it's the same as your service being down.&lt;/p&gt;

&lt;p&gt;Likely you don't actually even need the rate limit, but one time, one person with too much authority said you needed it, and now you are stuck. Or worse still, you are handing out API keys to your customers and never thought about how to track and secure them. (This, by the way, is why Authress offers &lt;a href="https://authress.io/knowledge-base/docs/authorization/service-clients#enabling-your-users-to-call-your-apis"&gt;API keys as a service&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  12. We can release new versions of our API when something changes.
&lt;/h2&gt;

&lt;p&gt;This is probably the worst lie. No, you can never release a new version of your API. Sure, you can publish it, but no one will use it, or maybe only a few people will. Realistically, integrations don't get updated. If you release an endpoint today, assume you need to maintain it indefinitely, unless you want to have the conversation with your CEO about why you lost 90% of your customer base because you dropped a necessary endpoint and their product stopped working.&lt;/p&gt;

&lt;p&gt;No one wants to change anything that is working; don't try to make them. It's also worth mentioning that there are ZERO consistent ways to change a REST API to support multiple versions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintaining different versions of the spec is a nightmare, and that's just the documentation part&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;v1.service.com&lt;/code&gt; - Using a subdomain is the worst possible idea: now instead of maintaining one service, you've got two that do almost exactly the same thing; 99% of the endpoints are identical, and one of them is only slightly different.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/v1/resources&lt;/code&gt; - Different versions in the resource path of the URL actually means these are different resources. If they are, don't use a version here; just change the name of the resource&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;X-Header: v1&lt;/code&gt; - A header is difficult to discover, and worse still, there's no way to default it to a new version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resources?version=v1&lt;/code&gt; - Query parameters seem like the best possible solution, but they have all the same problems as headers: difficult to document, difficult to discover, difficult to know which one is right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of the day, don't release breaking changes to your API.&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Don't get stuck in the Myths of building an API. And if you find yourself wanting a more secure API in the process, you know where to go.&lt;/p&gt;

&lt;p&gt;Come join our &lt;a href="https://authress.io/community/"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>development</category>
      <category>backend</category>
    </item>
    <item>
      <title>Breaking up the monolith: Breaking changes</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 05 Aug 2022 19:17:05 +0000</pubDate>
      <link>https://dev.to/authress/breaking-up-the-monolith-breaking-changes-14dd</link>
      <guid>https://dev.to/authress/breaking-up-the-monolith-breaking-changes-14dd</guid>
      <description>&lt;p&gt;Before we get into how to handle a breaking change, we should first identify what is even a breaking change.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a breaking change
&lt;/h1&gt;

&lt;p&gt;A breaking change is:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;anything that causes a hypothetical client of your service, using the service in any way, to start behaving differently.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s a broad statement, but it’s true. Even if you don’t make changes to the API, if you change the expectations around how endpoints work, it will break clients. Therefore it is a breaking change. It’s also important to realize that you might not know how every client is using your API, so whether or not there is a real client you can point to is irrelevant.&lt;/p&gt;

&lt;p&gt;Some examples of breaking changes might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API interface property type is changed (from &lt;code&gt;int&lt;/code&gt; to &lt;code&gt;string&lt;/code&gt; for instance)&lt;/li&gt;
&lt;li&gt;Size of the data property is changed (from 3 to 4 characters)&lt;/li&gt;
&lt;li&gt;Returning an additional enum value in an enum property where you didn't first explain to the clients that the list can be expanded. The general recommendation is that this isn't a breaking change, but remember: it doesn't matter if you think it is, it matters if your clients do.&lt;/li&gt;
&lt;li&gt;If you return an inconsistent or different &lt;code&gt;error code&lt;/code&gt; or &lt;code&gt;response status code&lt;/code&gt;. Returning a &lt;code&gt;400&lt;/code&gt; instead of a &lt;code&gt;404&lt;/code&gt; can be considered a breaking change. &lt;code&gt;404&lt;/code&gt; means something, it’s possible that the &lt;code&gt;404&lt;/code&gt; was a bug, and the resource really existed. So sometimes making a breaking change is a good thing.&lt;/li&gt;
&lt;li&gt;Allowing the schema type of a property to be different in different circumstances, i.e. returning an &lt;code&gt;int&lt;/code&gt; or a &lt;code&gt;string&lt;/code&gt;. Just don't do this. While it is possible to document union types, they're a huge headache for development teams to deal with.&lt;/li&gt;
&lt;li&gt;Requiring a previously &lt;code&gt;optional&lt;/code&gt; property or requiring a new header to continue having previous functionality. Clients not sending the header or the property will now get a &lt;code&gt;400&lt;/code&gt; or &lt;code&gt;422&lt;/code&gt; back on their response, instead of the previous &lt;code&gt;2xx&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
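&lt;p&gt;For the expandable-enum case above, a tolerant-reader client can map unknown values to a safe default instead of throwing (a sketch; the status names are made up):&lt;/p&gt;

```javascript
// Tolerant-reader handling of an enum that the API may extend over time.
const KNOWN_STATUSES = new Set(['active', 'suspended', 'deleted']);

function normalizeStatus(status) {
  // Treat any unrecognized value as a safe default instead of throwing,
  // so a newly added enum value doesn't break this client.
  return KNOWN_STATUSES.has(status) ? status : 'unknown';
}
```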
&lt;h1&gt;
  
  
  How to handle breaking changes
&lt;/h1&gt;

&lt;p&gt;When we run a service, either a UI or an API, that service has endpoints or URLs that point to representations of resources. When we change the schema/interface/expectations around how an endpoint works, we are introducing what is known as a breaking change.&lt;/p&gt;

&lt;p&gt;One example of a breaking change is changing a property in the response from type &lt;strong&gt;int&lt;/strong&gt; to type &lt;strong&gt;string&lt;/strong&gt;. It is breaking because the change can cause a client of the API to incorrectly parse the response.&lt;/p&gt;

&lt;p&gt;If the client has code that says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (response.property * 10 &amp;lt; 100) { doSomething(); }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then how the property is handled by different languages might result in a runtime exception, or worse, no exception but improper handling of the result.&lt;/p&gt;

&lt;p&gt;There's obviously a need to introduce the string version of the property. We don't have to care where the need came from, but one example could be: we ran out of numbers. Transactional data runs into this problem all the time, and converting from a sequential int to a GUID is one way to help.&lt;/p&gt;

&lt;p&gt;Note: A general solution to this problem is that &lt;strong&gt;all identifiers must always be strings&lt;/strong&gt;; &lt;strong&gt;never make an identifier an integer.&lt;/strong&gt;&lt;/p&gt;
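&lt;p&gt;One concrete reason for this rule: JSON numbers become IEEE-754 doubles in JavaScript, so a 64-bit integer identifier can silently lose precision, while the same identifier as a string survives intact:&lt;/p&gt;

```javascript
// A 64-bit integer id loses precision when parsed as a JavaScript number,
// because JSON numbers become IEEE-754 doubles (exact only up to 2^53).
const numericId = JSON.parse('{"id": 9007199254740993}').id;
console.log(numericId); // 9007199254740992 -- off by one, silently

// The same identifier as a string survives intact.
const stringId = JSON.parse('{"id": "9007199254740993"}').id;
console.log(stringId); // "9007199254740993"
```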

&lt;p&gt;That is the solution going forward, but if we already made the mistake, hindsight is obvious and doesn't help us.&lt;/p&gt;

&lt;p&gt;So what do we do?&lt;/p&gt;

&lt;h1&gt;
  
  
  Versioning endpoints
&lt;/h1&gt;

&lt;p&gt;One solution is to prefix all your endpoints (or use a header or query parameter) to tell the service which version of an endpoint to use. Let's define what we mean by versioning an endpoint. Versioning an endpoint is not "running multiple versions of the service at the same time"; it means adding an indicator to the endpoint so that callers can select which version they want. While in practice this can be done, it isn't a concept that actually gets put into practice, and here we will see why.&lt;/p&gt;

&lt;p&gt;For instance we might have:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /v1/demo&lt;/code&gt;, and now we'll introduce &lt;code&gt;GET /v2/demo&lt;/code&gt;, where v1 returns an integer in the &lt;code&gt;property&lt;/code&gt; field and v2 returns a string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This works, but it is a very bad and terrible idea&lt;/strong&gt;. The reason is that clients that want the new functionality have to find and update their code to reference the new version of the endpoint. Another reason is that you might have multiple changes in progress at one time: does the v3 endpoint contain two changes at once? What if you have three changes? How does that even work?&lt;/p&gt;

&lt;p&gt;Another core problem is that it just isn't RESTful. Whatever you might say about whether REST is important, we should at least agree that this is a true statement: having two different endpoints means that the resources at those endpoints should be different. Further, what happens when you actually create a v2 resource and want the property to be &lt;code&gt;deadc0de&lt;/code&gt;? This resource now cannot be returned from the v1 endpoint, because v1 only understands int, not the string value this property has.&lt;/p&gt;

&lt;p&gt;Still further issues include the increased complexity for clients that don't care about this change, but care about later changes. They want to keep using the v1 endpoint because they need int for now, but they have a critical change they need that you've released in v3. They can't get it until they take the string upgrade.&lt;/p&gt;

&lt;p&gt;Not to mention the maintenance burden on the service side of keeping track of multiple endpoints. And even if we clean up in the end, we're left with the problem that we've got an endpoint stuck on v2.&lt;/p&gt;

&lt;p&gt;The last problem is visibility. Along with the complexity, we might not even have a way to solve the problem if we release an SDK. The SDK has hard-coded the v1 endpoint, and it would be a mess if we had to introduce duplicate DTOs every time we wanted to make a small change. Not to mention the nightmare later, seeing as we'll have the exact same problem there: having breaking changes in a library just moves the problem. And worse, it moves the problem to every library you maintain.&lt;/p&gt;

&lt;h1&gt;
  
  
  Further issues
&lt;/h1&gt;

&lt;p&gt;The issue is compounded even further if we have multiple endpoints. What happens if we have two endpoints:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /v1/resource&lt;/code&gt; and &lt;code&gt;POST /v1/resource&lt;/code&gt;. And let's say that &lt;code&gt;property&lt;/code&gt; is a write-only value used only in the &lt;code&gt;POST&lt;/code&gt;. Now we have a huge discrepancy if we roll the v1 &lt;code&gt;POST&lt;/code&gt; to a v2. If it isn't obvious, think about what happens when a different change is necessary only to the v1 &lt;code&gt;GET&lt;/code&gt; endpoint. A v2 of the &lt;code&gt;GET&lt;/code&gt; now has a totally different meaning than the v2 of the &lt;code&gt;POST&lt;/code&gt;. A client updating their code won't know the semantic meaning of v2 and doesn't know that it means something different. This is creating a &lt;strong&gt;pit of failure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There’s a joke here about &lt;strong&gt;how many Haskell programmers it takes to change a light bulb:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One, but you have to change the whole house.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can of course go to the extreme of releasing a new version of the whole service with everything identical, and running both versions at the same time. When existing clients have migrated to the new version, you shut down the old one.&lt;/p&gt;

&lt;p&gt;Please don’t do this: some clients will never migrate, and you will be stuck paying for duplicate resources until the end of time. For very small, early services, you are better off just breaking your clients.&lt;/p&gt;

&lt;h1&gt;
  
  
  One more example
&lt;/h1&gt;

&lt;p&gt;Twitter is a service: you can use it, and they release new features all the time. Surely there are breaking changes, but when you want to go to Twitter, you go to &lt;a href="https://twitter.com"&gt;https://twitter.com&lt;/a&gt;; you don’t go to &lt;a href="https://v2.twitter.com"&gt;https://v2.twitter.com&lt;/a&gt;. You still get new features automatically, and there might even be breaking changes.&lt;/p&gt;

&lt;p&gt;“Now, now Warren, that’s not the same.”&lt;/p&gt;

&lt;p&gt;Okay, but bear with me. Even though the UI is a service, you never need to go to a different URL to get new functionality, even when the UX breaks your experience. Yes, Twitter breaks your experience all the time. But it doesn’t break your client. Admittedly it’s not the best example, so let’s dive in.&lt;/p&gt;

&lt;p&gt;If we look at all the APIs released in the world, the number of endpoint version changes is minuscule, almost zero. It’s so small that I’ve found it better to tell my teams not to put the v in the endpoint URL at all. If you need a different resource for some reason, just create a different endpoint. If it is the same resource, then update the endpoint, but don’t break it.&lt;/p&gt;

&lt;p&gt;Go for it, go find a public API out there that versions its endpoints regularly. It doesn’t exist. Even GCP’s APIs are mostly on v2, and this comes from a company that frequently deprecates things before they are released. Adding support for versioning in endpoints is over-engineering. Here’s the &lt;a href="https://developer.twitter.com/en/docs/twitter-api"&gt;twitter v2 api&lt;/a&gt;, and Twitter has been around as a company since 2006. When your service is 16 years old, I’ll be only slightly disappointed if you release a v2 :).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We don’t need to version endpoints&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  So, what’s the solution?
&lt;/h1&gt;

&lt;p&gt;If replacing the whole service is on one side of the spectrum, what’s on the other side?&lt;/p&gt;

&lt;p&gt;Is it regret that we have an int for the rest of time?&lt;/p&gt;

&lt;p&gt;I hope not.&lt;/p&gt;

&lt;p&gt;Instead, what we can do is add a new property: &lt;code&gt;propertyString&lt;/code&gt;, or &lt;code&gt;propertyV2&lt;/code&gt;, &lt;code&gt;propertyAdvanced&lt;/code&gt;, &lt;code&gt;propertyOtherThing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is really easy. It doesn’t solve every problem, but you can return the new value in this new property and leave the current one alone. In rare cases, when we created an issue with the primary key of the resource, this obviously won’t work, and creating a new resource/endpoint might be the only solution. This is an edge case, but it does happen. Rather than come up with “one solution to rule them all”, we would rather have a better solution to 99% of the problems and an okay solution to the 1%.&lt;/p&gt;
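&lt;p&gt;A minimal sketch of this approach (the names &lt;code&gt;property&lt;/code&gt; and &lt;code&gt;propertyString&lt;/code&gt; are illustrative, not from any real API):&lt;/p&gt;

```javascript
// Serialize a resource so legacy clients keep the int property untouched,
// while new clients read the richer string representation.
// All names here are illustrative, not from a real API.
function serializeResource(resource) {
  return {
    id: resource.id,
    property: resource.legacyIntValue,   // legacy clients keep parsing an int
    propertyString: resource.stringValue // new clients opt in to the string
  };
}

const dto = serializeResource({ id: 'res_1', legacyIntValue: 42, stringValue: 'deadc0de' });
console.log(dto); // both representations travel together; nothing breaks
```

&lt;p&gt;Old clients never see a change; new clients simply read the new field.&lt;/p&gt;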

&lt;p&gt;Later, after all the SDKs are updated and the clients are using the new property, we can delete the old one. Better still, we can delete the property from our documentation but leave it available. There’s almost no reason to delete it: the cost of keeping it is very small, and in most cases its maintenance is trivial.&lt;/p&gt;

&lt;p&gt;Adding a new property is no different than adding a feature. The only time you’ll need to do something special is when you go to delete the property. So treat it like everything else until then, a new separate property, and just don’t break the clients.&lt;/p&gt;

&lt;h1&gt;
  
  
  The interesting opportunities
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Situation 1: Just don’t change the damn thing.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Leave it the way it is. I know you hate it, but honestly the work to change it isn’t worth it. Sure, you could make some updates, let clients choose how they want to call your service, and return the new shape. But if the change is merely semantic, get over it. If you want to be a good engineer, focus on the business impact, not on whether you are unhappy it’s an int.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Situation 2: You coupled your DB to your API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are some frameworks that I consider atrocious: they never should have been created, and the software community is worse off for having them. I’m not going to enumerate the list, but it comes down to anything that makes it easy to couple your DB schema to your API interface. (And don’t get me started on monolithic technologies that let you couple your DB to your UI presentation logic.) Things like GraphQL can be good solutions to specific problems, but are often abused by inexperienced engineers to do exactly this.&lt;/p&gt;

&lt;p&gt;The critical thing to do here is to abstract your DB schema from the interface. Your clients don’t care about the DB schema; they care about the service interface. If you can’t make DB changes without causing a breaking change in your API, first separate the two. Create a serialization layer, an abstraction layer, an auto-mapper, a schemaless NoSQL solution, etc. It doesn’t really matter how you do it, as long as you do it. You will absolutely need to change your DB at some point, and you can’t let the rigidity of your API prevent you from doing so.&lt;/p&gt;
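&lt;p&gt;As a hedged sketch of what that separation can look like (the column and field names below are made up for illustration), a tiny mapper is often enough:&lt;/p&gt;

```javascript
// A thin mapping layer: the DB row (snake_case columns) never leaks into
// the API response. DB schema changes only reach clients when this mapper
// is deliberately updated. All names are illustrative.
function toApiResource(dbRow) {
  return {
    resourceId: String(dbRow.resource_id),          // API contract: ids are strings
    status: dbRow.is_deleted ? 'deleted' : 'active' // enum instead of booleans
  };
}

console.log(toApiResource({ resource_id: 7, is_deleted: false, internal_notes: 'never exposed' }));
// → { resourceId: '7', status: 'active' }
```

&lt;p&gt;New DB columns stay invisible by default, so the API contract can’t break by accident.&lt;/p&gt;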

&lt;h2&gt;
  
  
  &lt;strong&gt;Situation 3: Security issues found!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It happens, you find a security vulnerability, some property in the API is either exposing data it shouldn’t be, exposing data to whom it shouldn’t be, or just not working correctly. You can’t go on exposing it, and so that required property you have is going to become optional, there’s no way around that.&lt;/p&gt;

&lt;p&gt;However, changing a property from required to optional in a response body is a breaking change, and clients depending on a non-empty value will break. There is really nothing you can do here other than eat the vulnerability, and there is no better reason to break clients than remaining security compliant. I know something about security in APIs, having designed them for many companies: a lot of clients would rather have their business fail than make improvements. Sometimes you have to bite the bullet and let these clients start throwing exceptions. But if you do, please communicate that you are doing it.&lt;/p&gt;
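&lt;p&gt;On the client side, the least painful way to survive a property going from required to optional is a defensive read. A sketch (the &lt;code&gt;email&lt;/code&gt; field is hypothetical):&lt;/p&gt;

```javascript
// Guard against a response field that used to be required but is now
// optional after a security fix. The field name `email` is hypothetical.
function displayContact(user) {
  return user.email ?? '(hidden)'; // fall back instead of throwing
}

console.log(displayContact({ id: 'u1', email: 'a@example.com' })); // a@example.com
console.log(displayContact({ id: 'u2' }));                         // (hidden)
```

&lt;p&gt;Clients written this way keep working the day the field disappears.&lt;/p&gt;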

&lt;h2&gt;
  
  
  &lt;strong&gt;Situation 4: Deprecation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the case that you really, really need to change the schema for non-security reasons: communicate it. Even if you do need to change the schema for security reasons, communicate it. There is no reason not to always communicate.&lt;/p&gt;

&lt;p&gt;In the case of a service endpoint or service resources that you don’t need any more, either because they aren’t a business competency or because the cost of management is too high for any reason, remove them.&lt;/p&gt;

&lt;p&gt;You want to remove them, and some of your clients might even want you to. If your documentation or service is confusing, just delete the endpoints.&lt;/p&gt;

&lt;p&gt;However, you don’t want to break clients. So instead, come out with a deprecation plan. The best deprecation plans run between 6 months and 1 year, with a committed date for turning off the endpoint. The trouble is that even with all of this, some clients will wait until the last email before telling you they can’t migrate. You can certainly try to avoid this, but that day is coming, and they will still be using your legacy thing.&lt;/p&gt;
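&lt;p&gt;The article doesn’t prescribe a mechanism for announcing the plan, but one common convention (an assumption here, not something from this post) is the &lt;code&gt;Deprecation&lt;/code&gt; and &lt;code&gt;Sunset&lt;/code&gt; response headers on the legacy endpoint:&lt;/p&gt;

```javascript
// Sketch: headers announcing a committed shutdown date on a legacy
// endpoint. The date and link target are made up for illustration.
function deprecationHeaders(sunsetDate) {
  return {
    Deprecation: 'true',
    Sunset: sunsetDate.toUTCString(),        // the committed turn-off date
    Link: '</docs/migration>; rel="sunset"'  // where to read about migrating
  };
}

console.log(deprecationHeaders(new Date(Date.UTC(2023, 5, 30))));
```

&lt;p&gt;Machine-readable headers surface in client logs long before that last email goes out.&lt;/p&gt;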

&lt;h1&gt;
  
  
  The Conclusion
&lt;/h1&gt;

&lt;p&gt;Just don’t make breaking changes to interfaces. Learn from the mistakes you made in the past: live with unnecessary extra properties, remove documentation for old features, and focus on the new ones. Take the lessons and build better services, because the cost and headache of trying to fix a breaking change is so high. If you are okay with breaking your clients, then just do it; trying to work around the problem is a waste of time and resources. Otherwise, don’t make the breaking change at all, either by adding new properties or by living with the properties you have.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some Quick Advice
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Use strings for all fields that don’t represent numbers. If the property is a number then an int is fine; if it isn’t a number, please don’t use an int.&lt;/li&gt;
&lt;li&gt;All IDs should be strings.&lt;/li&gt;
&lt;li&gt;Do not have mutually exclusive boolean properties such as &lt;code&gt;is_active&lt;/code&gt; and &lt;code&gt;is_deleted&lt;/code&gt;; use status enums instead.&lt;/li&gt;
&lt;li&gt;Use objects instead of bare properties. Instead of &lt;code&gt;otherResourceId&lt;/code&gt;, use &lt;code&gt;otherResource: { id: '' }&lt;/code&gt;; then you can add additional properties later.&lt;/li&gt;
&lt;li&gt;Prefer arrays to single elements, so when that thing expands you are prepared to add additional objects. It’s much easier to have an array with a single object than it is to explain why you have both a property called &lt;code&gt;thing&lt;/code&gt; and another called &lt;code&gt;thingList&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Resources usually don’t have versions, but resources can point to versions of other things. Audit trails and changelogs are a different matter.&lt;/li&gt;
&lt;li&gt;Before naming a property &lt;code&gt;thingV2&lt;/code&gt;, try to come up with a more descriptive name, such as &lt;code&gt;thingAdvanced&lt;/code&gt; or &lt;code&gt;thingWithExtraStuff&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Don’t have a &lt;code&gt;type&lt;/code&gt; property at the top level. If you have two types called &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, instead have a property bag &lt;code&gt;{ a: {}, b: {} }&lt;/code&gt; where you can store the properties specific to each type.&lt;/li&gt;
&lt;/ul&gt;
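&lt;p&gt;Pulling the advice above together into one resource shape (every name below is illustrative):&lt;/p&gt;

```javascript
// One resource shape following the advice above. All names are illustrative.
const resource = {
  id: 'res_001',                    // ids are strings
  status: 'active',                 // enum, not is_active/is_deleted booleans
  otherResource: { id: 'oth_001' }, // object, so more fields can join `id` later
  tags: ['primary'],                // an array even while there is one element
  // per-type property bags instead of a top-level `type` discriminator:
  a: { onlyForA: true }
};

console.log(resource.otherResource.id); // oth_001
```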




</description>
      <category>rest</category>
      <category>microservices</category>
      <category>monoliths</category>
      <category>api</category>
    </item>
    <item>
      <title>AWS CloudWatch: How to scale your logging infrastructure</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Mon, 30 May 2022 11:58:07 +0000</pubDate>
      <link>https://dev.to/authress/aws-cloudwatch-how-to-scale-your-logging-infrastructure-8j0</link>
      <guid>https://dev.to/authress/aws-cloudwatch-how-to-scale-your-logging-infrastructure-8j0</guid>
      <description>&lt;p&gt;An obvious story you might decide to tell yourself is &lt;strong&gt;Logging is easy&lt;/strong&gt;. And writing to the &lt;strong&gt;console&lt;/strong&gt; or &lt;strong&gt;printing&lt;/strong&gt; out debugging messages may seem easy, and when running a service locally it usually is. As soon as you cross the magical barrier that is the cloud, for some reason this gets really complicated.&lt;/p&gt;

&lt;p&gt;So complicated that many, many companies think they can compete on delivering exactly this solution. But this isn’t a post about which of those to use, nor is it a marketing ploy for a specific provider. (I’ve used a lot of them, and for some reason they are all terrible; the only thing more terrible than using a SaaS provider for logging was running an open-source stack, with ELK being the worst logging infrastructure ever created. Your logging infra should cost at most 10% of your spend and next to 0% of your development time, yet with most providers it’s more like 50%.)&lt;/p&gt;

&lt;p&gt;For something that usually costs around 30% of your total cloud spend, you would expect to get something useful out of logging. And you do: logging is critical for the sustainability of your service and your business. At Rhosys, we frequently need to know not only whether our services are working, but how effectively they are working. Dashboards that only monitor call counts and latencies are worthless to a business; we need to know exactly what the business-relevant logs look like. Like most security-conscious companies (non-security-conscious companies probably want to ignore what I say next, otherwise it will feel like a bit of holy water burning your internal devil), we have multiple AWS accounts, each with a dedicated purpose and assigned to only one team, and only that team has access to that specific AWS account. You don’t share accounts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;How we set them up is less important; what is important is that each product gets its own AWS account. It just makes sense, and it’s required when a different team owns each one. Since Rhosys has three core products (at the time of writing), we have something like 40 AWS accounts (because AWS, of course):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 AWS account to run &lt;a href="https://authress.io/"&gt;Authress&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;1 AWS account to run &lt;a href="https://standup.teaminator.io/"&gt;Standup &amp;amp; Prosper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;1 AWS account to run &lt;a href="https://modulemancer.com/"&gt;Modulemancer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;1 AWS account for open source and a bunch of our partnerships with AWS&lt;/li&gt;
&lt;li&gt;1 account per developer&lt;/li&gt;
&lt;li&gt;and then tons more because why not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t a story about security though, it’s about maintenance, and since each of our products is in a separate account, there are some complexities with actually figuring out the core problem of &lt;strong&gt;How is our service doing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because we are using AWS and lots of serverless technologies, we make heavy use of CloudWatch Logs. CW Logs is great: it’s better than every other SaaS logging tool out there, and it’s fantastic for monitoring as well. (But it’s terrible at alerting.) At this point we still don’t have a great solution to “report this problem to the dev team”, and that’s because CW Logs doesn’t offer a way to send an email or trigger an alert that &lt;strong&gt;actually includes what is wrong&lt;/strong&gt;. The &lt;strong&gt;monitoring solution&lt;/strong&gt; aggregates data instead of annotating and indexing it, so you’ll need SNS + CW Insights to help you.&lt;/p&gt;
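&lt;p&gt;As a hedged illustration of the CW Insights side (the field names assume the structured log shape shown later in this post), a query to surface actionable errors might look like:&lt;/p&gt;

```
fields @timestamp, level, title, details.accountId
| filter level = "ERROR"
| sort @timestamp desc
| limit 20
```

&lt;p&gt;Insights can query the structured JSON fields directly, which is most of what you need for an on-call digest.&lt;/p&gt;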

&lt;h3&gt;
  
  
  The Logs
&lt;/h3&gt;

&lt;p&gt;So back to focusing on our three product accounts. For the most part, and I’m glossing over some finer details, we log directly to CloudWatch Logs, and it’s great. What isn’t great is wanting to see all the logs in one place (which is usually wrong, because different teams can have different solutions). But you might want to see all the alerts, business problems, and critical issues in a digestible format. This doesn’t have to be one place; it’s sufficient to have one CW dashboard per account and to switch easily between them.&lt;/p&gt;

&lt;p&gt;Another solution is multiple instances of log collection: that is, deploying log-aggregation services to every AWS account. The problem is that means running a worker in every account to handle logging. That’s wrong. Having to deploy an agent for every service, for every region, or even for every AWS account is bad architectural design, and it doesn’t scale. This has to be automated and require near-zero burden on the accounts that opt in.&lt;/p&gt;

&lt;p&gt;Like the good microservice architects we are, we funnel the relevant business related logs to a secured logging account. For our expectation on how and what we log, there’s a separate article where I speak in depth about our expectations around &lt;a href="https://dev.to/wparad/hacking-your-product-support-strategy-4kl4"&gt;logging, their purpose, and how to get the most value out of them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The TL;DR of that article is that we have log statements that look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[Action Required] Failed to automatically handle plan
upgrade, review and determine why it failed and how to
more gracefully improve this problem in the future.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;details&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets converted in CloudWatch Logs to a base64 mess that we need a complex handler to disentangle. This is the meat of our log aggregator:&lt;/p&gt;

&lt;p&gt;(Note: the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#LambdaFunctionExample"&gt;awslogsData is actually a list from CW&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;logEvent&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;awslogsData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logEvents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;logStream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;awslogsData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;logGroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;awslogsData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;logEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;extractedFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;extractedTimeStamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;logEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;extractedFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;logEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;extractedFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Handle timeouts explicitly&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Task timed out after&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt; (RequestId: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requestId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="c1"&gt;// Handle everything else&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// We want to pull out the JSON object from our logs&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;eventMatcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;INFO|TRACE|ERROR|WARN&lt;/span&gt;&lt;span class="se"&gt;)\s&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(?:[\w&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?(\{&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;\})\s&lt;/span&gt;&lt;span class="sr"&gt;*$/&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fallbackLevel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;eventMatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INFO&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;eventMatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;

    &lt;span class="c1"&gt;// If the message is a special error which has the code === 'ForceRetryExecution' then ignore it, we use this for enabling internal retries&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;ForceRetryExecution&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;$/i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Normalize a bunch of properties depending on exactly where the real message data is&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stringOrObjectMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;stringOrObjectMessage&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stringOrObjectMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stringOrObjectMessage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;level&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;fallbackLevel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promise&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;loggedMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Actually do something with the message&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parsedLogEvent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So hopefully that abbreviated mess above shows where the value comes from in our structured logging. Since all of our services log with structure, it’s easy for us to parse and handle the logs in a unified way. I highly recommend a consistent logging approach along these lines. Structured logs make it easy to debug any service you have without relearning a new pattern. Of course this can differ across teams, but then they’ll have their own needs and their own aggregation systems. And when you want to add additional value to every logging source, you can do it in one place, without updating some library that you then force every one of your services in every AWS account to update (as if that were even a real strategy).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fallacy
&lt;/h3&gt;

&lt;p&gt;The trouble is that while we have a great way to handle logs and a great way to log data, we have no easy way to port the logs from one AWS account to another. You would think that, with the infinite flexibility of an IAM system, you could assign a resource policy to the lambda function and use it across AWS accounts. Alas, you cannot. It turns out that building a successful authorization framework is a huge challenge, and while AWS has done a great job thus far, we can attest that managing one and solving for every edge case is a Sisyphean burden. (How do we know? We did it, using &lt;a href="https://authress.io/knowledge-base/so-you-want-your-own-authorization"&gt;Authress&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The only way to port logs from one AWS account to another in an automated fashion (remember, we want a full-service solution; we don’t want to deploy a log-subscription lambda function to every region of every account) is to use AWS Kinesis OR AWS Kinesis Firehose.&lt;/p&gt;

&lt;p&gt;Wait, those are different things you say? YES they are!&lt;/p&gt;

&lt;p&gt;For lack of clear documentation from AWS: Kinesis is a shared database, and Kinesis Firehose is a transport mechanism. So you can either stick the data from CloudWatch Logs into a specialized shared DB (Kinesis), or you can delegate the work of transporting the data somewhere else (Kinesis Firehose). But Kinesis Firehose forces you to stream to a DB, so your options are Database or Database. And that database cannot be CloudWatch Logs, nor does Kinesis support calling lambda directly, because hey, WHY NOT!&lt;/p&gt;

&lt;p&gt;Since Kinesis is always on, it costs the wrong kind of money. We want full scalability, so we’ve gone with Firehose. Spin up a Firehose in the logging account, and use that with every CloudWatch subscription.&lt;/p&gt;

&lt;p&gt;I can just set a resource policy on my Firehose to allow my whole AWS org access to make subscriptions from CW Logs, right? NOPE, you need to create what are known as custom Log Destinations, and enable other accounts to use that. That’s multiple additional AWS resources to manage.&lt;/p&gt;

&lt;p&gt;Oh, also Kinesis Firehose isn’t a valid event source for lambda. “WHAT” — you say. That’s right, you need to funnel the data to an S3 bucket, and then use a Lambda trigger to actually hit the Log parsing Lambda Function.&lt;/p&gt;

&lt;p&gt;(And for fun, the data that comes into the Lambda via S3 from Kinesis isn’t delimited; the records are directly concatenated. Why it does not automatically put delimiters between the records by default is beyond me. But it’s nothing that a simple &lt;code&gt;.replace(/}{/g, '}\n{').split('\n')&lt;/code&gt; can’t fix.)&lt;/p&gt;
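&lt;p&gt;That one-liner amounts to re-delimiting the concatenated JSON records before parsing them. A minimal sketch, assuming each record is a JSON object and the sequence &lt;code&gt;}{&lt;/code&gt; never appears inside a string value:&lt;/p&gt;

```javascript
// Firehose writes CloudWatch records to S3 back-to-back with no delimiter.
// Re-insert delimiters between adjacent objects, then parse each one.
// Caveat: this naive split assumes '}{' never occurs inside a string value.
function splitConcatenatedRecords(raw) {
  return raw
    .replace(/}{/g, '}\n{')
    .split('\n')
    .map(function (line) { return JSON.parse(line); });
}

const raw = '{"id":1,"message":"first"}{"id":2,"message":"second"}{"id":3,"message":"third"}';
const records = splitConcatenatedRecords(raw);
// records is an array of three parsed objects
```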

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;As a result, there are a number of moving pieces which allow us to aggregate the logs in a single account for alerting purposes. Remember, you’ll want to deploy this in every region, not just one; your logs should stay in the region they are generated in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gDbRnTZn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AGXerV562eOvT_ez5kzKypA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gDbRnTZn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AGXerV562eOvT_ez5kzKypA.png" alt="" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Multiaccount AWS Architecture Diagram&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And the relevant CloudFormation Template to generate these resources in the Logging AWS Account:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the bucket where we temporarily store the logs:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;CrossAccountLogBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::S3::Bucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;AccessControl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Private&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;BucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${AWS::AccountId}-${AWS::Region}-cross-account-logging-sink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;NotificationConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;LambdaConfigurations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s3:ObjectCreated:*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LambdaFunctionAlias&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Allow it to directly invoke the Lambda Function LambdaFunctionAlias
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;S3LambdaInvokePermission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Lambda::Permission&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LambdaFunctionAlias&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda:InvokeFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s3.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;SourceAccount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::AccountId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;SourceArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:s3:::${AWS::AccountId}-${AWS::Region}-cross-account-logging-sink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create the Kinesis Firehose
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;LogDeliveryStream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::KinesisFirehose::DeliveryStream&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;DeliveryStreamName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${serviceName}-${AWS::Region}-Log-Sink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;ExtendedS3DestinationConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;BucketARN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${CrossAccountLogBucket.Arn}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="nx"&gt;RoleARN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LogStreamRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Allow the Firehose to write to S3
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;LogStreamRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${serviceName}-${AWS::Region}-CrossAccountKinesisLogStream&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firehose.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:ExternalId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::AccountId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;Policies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s3:PutObject&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${CrossAccountLogBucket.Arn}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${CrossAccountLogBucket.Arn}/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
              &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kinesis:GetRecords&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:kinesis:${AWS::Region}:${AWS::AccountId}:stream/${serviceName}-${AWS::Region}-Log-Sink*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create the CloudWatch Destination which can write to Firehose
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;AggregateLogEventsSubscriptionDestination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Logs::Destination&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;DestinationName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${serviceName}-CrossAccountLogStream&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;RoleArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CloudWatchDelegatedRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;TargetArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${LogDeliveryStream.Arn}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;DestinationPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logs:PutSubscriptionFilter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:logs:${AWS::Region}:${AWS::AccountId}:destination:${serviceName}-CrossAccountLogStream&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws:PrincipalOrgID&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;AWSOrgID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;And enable it to write to Firehose
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;CloudWatchDelegatedRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${serviceName}-${AWS::Region}-CloudWatchCrossAccountAccess&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logs.${AWS::Region}.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws:PrincipalOrgID&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;AWSOrgID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;Policies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FirehoseAccess&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firehose:PutRecord&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;${LogDeliveryStream.Arn}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
          &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step is to create a subscription filter in each log source account on the existing CloudWatch Log Groups, and you are done.&lt;/p&gt;
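&lt;p&gt;For completeness, the per-account subscription filter looks something like this, sketched in the same CloudFormation-as-JS style as above. The &lt;code&gt;LoggingAccountId&lt;/code&gt; parameter and the &lt;code&gt;LogGroupName&lt;/code&gt; value are illustrative; substitute your own:&lt;/p&gt;

```javascript
// Sketch of the subscription filter deployed in each log source account.
// LoggingAccountId and the LogGroupName value are illustrative placeholders.
const SubscriptionFilter = {
  Type: 'AWS::Logs::SubscriptionFilter',
  Properties: {
    // The cross-account Log Destination created in the logging account above
    DestinationArn: { 'Fn::Sub': 'arn:aws:logs:${AWS::Region}:${LoggingAccountId}:destination:${serviceName}-CrossAccountLogStream' },
    // An empty filter pattern forwards every log event
    FilterPattern: '',
    LogGroupName: { 'Fn::Sub': '/aws/lambda/${serviceName}' }
  }
};
```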




</description>
      <category>logging</category>
      <category>aws</category>
      <category>cloudwatch</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Step-up authorization</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 08 Apr 2022 09:16:42 +0000</pubDate>
      <link>https://dev.to/authress/step-up-authorization-5fbf</link>
      <guid>https://dev.to/authress/step-up-authorization-5fbf</guid>
<description>&lt;p&gt;Step-up authorization is the process of elevating a user’s auth from a base level to a privileged state. This is usually achieved by utilizing the user’s preconfigured two-factor authentication methods.&lt;/p&gt;

&lt;p&gt;The result should be that the user is now able to access the more restricted resources on an account. This can be used to safeguard critical aspects of an account without inconveniencing the user by forcing 2FA from the beginning.&lt;/p&gt;

&lt;p&gt;However, this is where trouble lies. Unfortunately, typical solutions take one of these forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forcing the user to do a full login again, using a different method, tenant, connection, or user identity pool which has 2FA enabled&lt;/li&gt;
&lt;li&gt;Storing the updated credential and losing the previous login&lt;/li&gt;
&lt;li&gt;Leaving the user stuck in an elevated state&lt;/li&gt;
&lt;li&gt;Being unable to segregate and explicitly identify which resources should be restricted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called step-up authentication, and it is an anti-pattern. The user’s identity hasn’t changed, and usually you aren’t concerned with user identity here.&lt;/p&gt;

&lt;p&gt;(There are some cases where we might require multiple forms of user identity that are desired for trust or user delegation, but all of these provide a better user experience by having remembered 2FA, location/ip address based validation, and user agent/browser token persistence during the initial login. All these aspects would be in the authentication domain and authentication should stop at user login.)&lt;/p&gt;

&lt;p&gt;Most of the time however, what we really want to do, is secure access to specific resources for the users benefit. Since we are talking about resource access, the appropriate term is &lt;strong&gt;step-up authorization&lt;/strong&gt; and not &lt;strong&gt;step-up authentication&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  This is an authorization and access problem, not an identity one.
&lt;/h4&gt;

&lt;p&gt;So using login or identity tools to do this is a mistake. It’s a different domain and so we should use the tools in the appropriate domain.&lt;/p&gt;

&lt;p&gt;To do step-up authorization, your IAM or authorization provider must support protected resource configuration to allow specifying when a user needs a step-up challenge. This also has the unique advantage of isolating security spheres and keeping the privileged access limited to the service/product/location that requested it, rather than applying it to the user identity across the board.&lt;/p&gt;

&lt;p&gt;In the case of user token changes (i.e. the use of step-up authentication with re-login), you can’t restrict which resources need elevated access. Further, there is no way to reduce the access again without logging the user out, or without hacking the integration with the user’s identity provider to store multiple tokens, assuming the identity provider enables this at all. That means storing multiple tokens in every user client session, which becomes even more of a problem when multiple apps or multiple user agents are involved.&lt;/p&gt;

&lt;p&gt;Another way to look at the issue is that the user agent should never have to care about step-up access; the service API is the only component that should know it is required, and it should ask for it at the right time. Making step-up a first-class requirement of user agents such as browsers incorrectly couples step-up flows with the UI the user interacts with. This fragments the implementation and moves the understanding of the security features from where they are needed to where they should be opaque. Security is needed in the service, but the setup and management is forced into the UI. This is clearly incorrect. We want the knowledge of the security implementation to be directly aligned with the location where the step-up is requested.&lt;/p&gt;

&lt;p&gt;For the purposes of this implementation, we’ll walk through the architecture and integration with Authress to see how to easily implement step-up auth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-Up Authorization configuration
&lt;/h3&gt;

&lt;p&gt;Authress breaks this down into a few parts; we’ll go over each of them here.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Specify the resource is protected
&lt;/h3&gt;

&lt;p&gt;The first step is to mark access to the specific resource as protected by step-up authorization. There are multiple ways to do this in practice, the preferred way is to specify in the access record that permission to access the resource should only happen with elevated token permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OEvG1w3h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/781/0%2A-4G7LkuaysZEqQ88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OEvG1w3h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/781/0%2A-4G7LkuaysZEqQ88.png" alt="" width="781" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enabling this feature causes authorization checks to fail unless the user and token requesting the resource have been stepped up.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Complete the normal login flow
&lt;/h3&gt;

&lt;p&gt;Users will navigate through your login as normal. As part of their account configuration, make sure to capture any mechanism you would like to use for the step-up flow. You’re probably already capturing their email or phone number, but any available mechanism could be valid. You could even use Authress multi-signature request approval for this: Authress supports &lt;strong&gt;Access Requests&lt;/strong&gt;, which generate a long-running process to approve the request before granting the user access. In some cases multiple entities must be involved in the step-up request, and Authress provides a way to support that.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Make the authorization request
&lt;/h3&gt;

&lt;p&gt;As usual, the user navigates through the resources in your platform. When they attempt to perform actions on your resources, you make the appropriate authorization checks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { AuthressClient } = require('authress-sdk');

const authressClient = new AuthressClient()

[GET('/v1/resources/{resourceId}')]
function getResource(request) {
  try {
    await authressClient.userPermissions.authorizeUser(userId, `resources/${resourceId}`, 'READ');
    _// Application route code_
    return OK;
  } catch (error) {
      if (error.code === 'StepUpAuthorizationRequired') {
          await issueStepUpChallenge(error.stepUpChallengeToken);
          return Forbidden;
      }
      return ServiceUnavailable;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;For language specific implementations of authorization checks,&lt;/em&gt; &lt;a href="https://authress.io/app/#/api"&gt;&lt;em&gt;see the available Authress SDKs&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Perform the step-up
&lt;/h3&gt;

&lt;p&gt;The response from Authress indicating that step-up access is required includes the details needed to enforce the step-up challenge. Authress can generate a challenge code for future verification and verify the returning code for the user; this prevents just any valid client from approving the step-up. The recommended approach is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On this failure, inspect the response and check whether a step-up challenge is necessary.&lt;/li&gt;
&lt;li&gt;In the case where a step-up is required, issue the step-up challenge to the user and have them complete their flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this is all in the user authorization domain, the user doesn’t need to sign in again, and no changes to the user identity nor any delegation to an identity provider are necessary. Just issue the challenge and wait to hear back from the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Record the step-up challenge result
&lt;/h3&gt;

&lt;p&gt;In the case that the user successfully completes your step-up challenge, the question becomes:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does the app link this elevated access to the existing user identity?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since we want to restrict elevated permissions to authorization requests utilizing the user’s access token only, make a request to Authress’ user identity endpoint for the access token.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(This should be trivially obvious: if the user is logged in on more than one device, we only want one access token to be granted elevated permissions. This prevents accidentally granting elevated access to every access token ever issued that might still be active. Additionally, the access token granted step-up is fully controlled by your app/service/platform. This means the user can complete the challenge on their mobile device and elevated access will automatically apply in the browser session, without extra actions needed in either.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To record the step-up authorization challenge success for an access token, make the relevant call to the Authress API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { AuthressClient } = require('authress-sdk');

const authressClient = new AuthressClient()

function handleChallengeResult(stepUpChallengeToken) {
  _// Set the user's token as part of the request_
  const authorizationToken = request.headers.get('authorization');
  authressClient.setToken(authorizationToken);
  _// Use the stepUpChallengeToken from the error response to upgrade the state_
  await authressClient.userPermissions.stepUpAuthorization(stepUpChallengeToken);
  return OK;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Continue with the user resource action
&lt;/h3&gt;

&lt;p&gt;Once the step-up challenge has been completed by the user and verified by the app, service, or platform using one of the user’s registered multi-factor devices, the user can be directed to make the same resource requests again. These requests will now succeed, and the user can perform their desired actions.&lt;/p&gt;

&lt;p&gt;Once the actions have been completed, the app can optionally remove the step-up authorization by issuing another call to Authress and revoking the permission. This also makes for a great flow in high-risk environments where the user might not trust their device. In these situations, they can request that the step-up always be performed, and their associated access records can be modified to require that the step-up happens before access to the protected resources is granted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://authress.io/knowledge-base/step-up-authorization"&gt;&lt;em&gt;https://authress.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>authentication</category>
      <category>authorization</category>
    </item>
    <item>
      <title>Breaking up the monolith: Zero downtime migrations</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Sun, 27 Feb 2022 10:25:50 +0000</pubDate>
      <link>https://dev.to/authress/breaking-up-the-monolith-zero-downtime-migrations-57ap</link>
      <guid>https://dev.to/authress/breaking-up-the-monolith-zero-downtime-migrations-57ap</guid>
      <description>&lt;p&gt;It’s pretty common in monolith architectures to have to handle migrations. But this isn’t the only place. Microservices also frequently will have the need for database migrations. Unlike in monoliths where zero-downtime is usually impossible, microservices &lt;strong&gt;enable&lt;/strong&gt; the capability to perform migrations with no downtime whatsoever.&lt;/p&gt;

&lt;p&gt;Opponents of microservices, you know who you are, might say “oh, now you need to learn to do this”. You don’t; you can continue with your app, which goes offline on the weekends, and display a prominent banner that says “our development team has been outsourced so we can’t do this simple activity without impacting our users.”&lt;/p&gt;

&lt;h3&gt;
  
  
  The fundamental need
&lt;/h3&gt;

&lt;p&gt;Even in the simplest cases, you might find yourself in a situation where a database migration is important. You’ve labeled a critical column poorly, you are trying to reduce your DB size, or, most probably, your primary index is wrong. It is unfortunate, but it does happen. The recommended approach is &lt;strong&gt;do nothing&lt;/strong&gt;. That’s right: first evaluate how much of a problem this actually is. Sure, it has been working okay until now, and it will continue to get worse, but how much worse? Are we talking about 0.001% of calls being problematic, or 100% of calls suffering a performance degradation of an extra 1ms?&lt;/p&gt;

&lt;p&gt;Before performing any work, always evaluate the value of doing so. While migrations are easy, that doesn’t mean they should be done.&lt;/p&gt;

&lt;h3&gt;
  
  
  The setup of a migration
&lt;/h3&gt;

&lt;p&gt;So you’ve decided that the migration should be done. No problem, time to make it happen. Migrations always work the same way, and they are easy once you know what to do.&lt;/p&gt;

&lt;p&gt;(Caveat: This works reliably for database sizes up to a few terabytes. Larger than that, it will start to take some real time and put a real load on your migration infrastructure. In those cases you might think about a complementary approach that ports the data from one location to another in addition to this pattern.)&lt;/p&gt;

&lt;p&gt;Here is the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy a new replica database — You will have two versions of your DB running for the duration of the migration, no exceptions. You cannot do this in one step; it must be done over time. It is always a trade-off for migrations: either take downtime and do it quickly, or take no downtime and do it slowly. Zero-downtime is a slow and careful approach. One important part is to have a field called &lt;strong&gt;lastUpdated&lt;/strong&gt; to indicate the most recent &lt;strong&gt;write&lt;/strong&gt;. If you don’t already have something like this in your object, it’s really valuable in itself, but now is the required time to add it.&lt;/li&gt;
&lt;li&gt;Duplicate the write logic — Duplicate all logic to write to both the current location and the new location. There are a bunch of ways to do this; the easiest is to duplicate the code, and the hardest is to set up post-processing of your DB that reads from the old location and writes to the new location. (If you were using AWS DynamoDB this would be a &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html"&gt;DynamoDB Stream&lt;/a&gt;.) — Make sure that the &lt;strong&gt;lastUpdated&lt;/strong&gt; property is being set. It should also be used to ensure that written data has a lastUpdated later than the current value.&lt;/li&gt;
&lt;li&gt;Replicate existing data — At this point all &lt;strong&gt;New&lt;/strong&gt; data is going to be in both places. Any data row/object/item that gets written will be exactly the same, the only problem is any old data, this will live in the legacy system but not the new one. The next step is to copy all the existing data to your new database. When these updates happen they must check that their value is greater than the previous &lt;strong&gt;lastUpdated.&lt;/strong&gt; This check ensures that a new write isn’t overwritten by old data. If you don’t do this, it will be impossible to ensure a complete and correct migration.&lt;/li&gt;
&lt;li&gt;Validate the migration — Now is the time to check that the data migration worked correctly. Run &lt;strong&gt;read&lt;/strong&gt; operations on both tables and validate that every row is the same; you can ignore any row that was updated &lt;strong&gt;after&lt;/strong&gt; the validation started running (because we know those will be the same). The check is something like: read from table 1 and check when it was last updated (if possible); if it needs to be validated, read from table 2; if its lastUpdated is after the “migration start time”, ignore it; if the rows are the same, ignore them; if they are different, investigate.&lt;/li&gt;
&lt;li&gt;Start using the new table — Change all the read operations to use the new table. Since all the data in the tables is now consistent, it doesn’t matter which one we read from, so switch to reading from table 2.&lt;/li&gt;
&lt;li&gt;Cleanup — Change the “write to table 1 logic” to “write to table 2”, and delete the “duplicate data from table 1 to table 2”. We don’t need the duplication anymore, and since we don’t need table 1 anymore, we can remove the logic to write to it.&lt;/li&gt;
&lt;li&gt;Delete Legacy database — Now we can delete table 1, it isn’t necessary anymore, and that’s the end of the migration.&lt;/li&gt;
&lt;/ol&gt;
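&lt;p&gt;The &lt;strong&gt;lastUpdated&lt;/strong&gt; guard at the heart of steps 2–4 can be sketched in a few lines. This is illustrative logic, assuming document-style records carrying a &lt;code&gt;lastUpdated&lt;/code&gt; timestamp; the function names are made up for this example:&lt;/p&gt;

```javascript
// Guard for duplicated writes (steps 2 and 3): old data replayed from table 1
// must never overwrite a newer write that has already landed in table 2.
function shouldApplyWrite(existingRecord, incomingRecord) {
  if (!existingRecord) {
    return true; // nothing in table 2 yet, always apply
  }
  return incomingRecord.lastUpdated > existingRecord.lastUpdated;
}

// Validation pass (step 4): compare a row across both tables, ignoring rows
// written after the validation started, since duplication keeps those in sync.
function validateRow(row1, row2, migrationStartTime) {
  if (row2 && row2.lastUpdated > migrationStartTime) {
    return 'ignored';
  }
  if (JSON.stringify(row1) === JSON.stringify(row2)) {
    return 'consistent';
  }
  return 'investigate';
}
```

&lt;p&gt;Every row that comes back &lt;code&gt;investigate&lt;/code&gt; is a duplication bug to chase down before switching reads to the new table.&lt;/p&gt;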

&lt;p&gt;There is however a race condition here that you need to be aware of. When we are in Step 5 (start using the new table):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing to table 1&lt;/li&gt;
&lt;li&gt;Duplicating to table 2&lt;/li&gt;
&lt;li&gt;Reading from table 1 — About to switch this to table 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If there is a delay between writing to table 1 and duplicating to table 2, you could be reading stale data. In 99% of cases this doesn’t matter, but in some areas it could be a problem, so you need to make sure that the logic duplicating this data has completed before doing the switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await write1();
await duplicateTo2();
await read2();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all you need to ensure, this works even in distributed systems and when using transactions.&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>migration</category>
      <category>database</category>
    </item>
    <item>
      <title>Adding Custom Domains to your SaaS</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 25 Feb 2022 11:17:26 +0000</pubDate>
      <link>https://dev.to/authress/adding-custom-domains-to-your-saas-4hci</link>
      <guid>https://dev.to/authress/adding-custom-domains-to-your-saas-4hci</guid>
      <description>&lt;p&gt;You're building out a SaaS solution and realize for one reason or another supporting &lt;strong&gt;custom domains&lt;/strong&gt; for your customers is a must. There are some products out there that can do this for you in some way, but they either cost a lot, are a huge maintenance burden or aren't scalable. Here, we'll walk through using AWS to add custom domains to your solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  What are Custom Domains?
&lt;/h2&gt;

&lt;p&gt;First, let's discuss what a &lt;strong&gt;custom domain&lt;/strong&gt; actually is. A custom domain lets your customer brand your API, UI, service, or product with their domain on top of your service. This provides a huge amount of value for both parties: customers can offer your service to their third parties or hide the implementation detail of what they are using. No one wants to expose their dependencies to customers; it's unnecessary and dangerous.&lt;/p&gt;

&lt;p&gt;For example if you have a product @ &lt;a href="https://product-service.com" rel="noopener noreferrer"&gt;https://product-service.com&lt;/a&gt; and offer custom domains for your customers, they may look like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://customer-A.product-service.com&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://02c77286.product-service.com&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This might be fine in a lot of situations: you can give the customer this identifier or URL and call it a day. You might even have some logic that associates it with their user/account to automatically redirect them. That's all good, but in the first case, suppose you want to include a &lt;strong&gt;Support Portal&lt;/strong&gt; as part of your &lt;strong&gt;Product Service&lt;/strong&gt;: you don't want to redirect your customers to &lt;code&gt;https://customer-A.third-party-website.com&lt;/code&gt;; you want them to stay on the same domain.&lt;/p&gt;

&lt;p&gt;As a concrete example, when you sign up for our product you get an account ID that looks like &lt;code&gt;bwdlb5r89jwhy232&lt;/code&gt;, so the URLs look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0qxeen8b0cwfgo5hocd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0qxeen8b0cwfgo5hocd.png" alt="Image description" width="777" height="264"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  So why are we doing this again?
&lt;/h2&gt;

&lt;p&gt;Just like the support portal option, custom domains have two other important uses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Internal benefit
&lt;/h4&gt;

&lt;p&gt;One example is your own benefit. We'll use our product &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; as an example. Authress provides an API for access control, which means every customer has their user identities, resources, roles, and permissions managed by the service. Since Authress optimizes permission queries as well as user credential security and caching, segregating each customer to their own subdomain has a lot of value for us. Authress also charges by API call, so tracking these calls is critical, and the subdomain is a great way to separate them.&lt;/p&gt;

&lt;p&gt;When the service allocates a subdomain to an account, the subdomain matches the accountId:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://accountId-A1.api.authress.io&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And while this is easy for developers to use in a service, it isn't pretty. So we provide the capability to mask this domain with one the customer chooses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Security implications
&lt;/h4&gt;

&lt;p&gt;Security is improved via:&lt;/p&gt;

&lt;p&gt;A) There are lots of restrictions that exist on a per-domain basis; CORS and cookies are among them. Allowing your users to set a custom domain for your product that matches their domain lets them consume these resources in a safe way. Having different domains may tempt you to store some things that you shouldn't.&lt;/p&gt;

&lt;p&gt;B) User SSO and authentication. For SSO, every business customer will want to have their own login provider (IdP) utilized. That means Account 1 uses IdP A, Account 2 uses IdP B, and so on. There is no way to know before a user logs in which IdP to use, so we need to guess. We could guess by forcing the user to enter their email, and then using the &lt;code&gt;domain&lt;/code&gt; of the email.&lt;/p&gt;

&lt;p&gt;This is a terrible solution. Instead, let the customer use a custom domain on top of your subdomain to configure their IdP. Then you can just look at the domain to determine which IdP to use. (Spoiler: this is exactly what Authress does to automatically support SSO for B2B products; you don't need to do any extra work if implemented through Authress connections.)&lt;/p&gt;
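&lt;p&gt;A minimal sketch of the domain-based IdP lookup; the domains and IdP identifiers here are purely illustrative:&lt;/p&gt;

```javascript
// Each customer account maps its custom login domain to its own IdP.
// The domain names and IdP identifiers below are made up for this example.
const idpByDomain = new Map([
  ['login.customer-a.com', 'okta-customer-a'],
  ['login.customer-b.com', 'azuread-customer-b']
]);

// Pick the IdP from the Host header of the login request - no need to ask
// the user for their email and guess from the email domain.
function resolveIdp(requestHost) {
  return idpByDomain.get(requestHost.toLowerCase()) || 'default-login';
}
```

&lt;p&gt;Because the custom domain arrives on every request, the IdP choice is deterministic before the user types anything.&lt;/p&gt;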




&lt;h2&gt;
  
  
  Setting up a custom domain
&lt;/h2&gt;

&lt;p&gt;So you want to let your customers tell you which domain they want, and then you map&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://custom.domain.com =&amp;gt; https://customer-A.product-service.com&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is mostly a DNS record, and you are done. All they have to do is add a &lt;strong&gt;CNAME&lt;/strong&gt; record to their Domain Registrar's hosted zone.&lt;/p&gt;

&lt;p&gt;Sounds easy doesn't it? So why do all these companies try to charge you a ton for this?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;While the CNAME works out of the box, the real problem comes with TLS or HTTPS. In order for that CNAME to work and for their users to correctly have encrypted traffic, &lt;strong&gt;YOUR&lt;/strong&gt; service needs to serve a TLS certificate that matches &lt;strong&gt;THEIR&lt;/strong&gt; domain.&lt;/p&gt;

&lt;p&gt;If this were your domain, you would add a DNS validation record to your hosted zone, and your cert would work. Or you could use &lt;a href="https://aws.amazon.com/certificate-manager/" rel="noopener noreferrer"&gt;AWS Certificate Manager&lt;/a&gt; and this would work without even any code.&lt;/p&gt;

&lt;p&gt;But you don't own this domain, which means somehow you need them to give you a cert, and this cert has to be renewed. That means every 3 months to 1 year you would need a new cert... for every customer...&lt;/p&gt;

&lt;p&gt;This also means maintaining a database of these encrypted certs and safe-guarding them, because if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lose them, then all your customers can no longer access your site&lt;/li&gt;
&lt;li&gt;expose access to them, then anyone can impersonate your customers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these are really bad.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;(A small aside: If you actually wanted to pay for this, there is a discussion about all the different options available on &lt;a href="https://www.indiehackers.com/post/how-to-implement-custom-domains-support-for-my-saas-product-214828c6ca" rel="noopener noreferrer"&gt;IndieHackers&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;As I mentioned, there are tons of solutions to this out in the world, but none of them are great: they are costly to maintain, and worse, if something goes wrong, it will go really wrong. So here I'm going to outline how you can do this with AWS Certificate Manager (ACM) and AWS CloudFront (CF) and not have to worry about anything else.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Set up your service
&lt;/h3&gt;

&lt;p&gt;Run your service like you would, wherever you would, this could be AWS Lambda or ECS (or if you are a masochist--EC2, or just want to burn a lot of money for no reason--EKS).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Support wildcard processing
&lt;/h3&gt;

&lt;p&gt;Your service needs to work with wildcards or generically specified domains. This means that &lt;code&gt;https://*.product-service.com&lt;/code&gt; needs to be handled correctly by your stack. For APIGW, this means adding a custom domain for AWS APIGW that includes the &lt;code&gt;*&lt;/code&gt;. For a static site running in S3, however, you don't need any extra magic.&lt;/p&gt;
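&lt;p&gt;For example, with the wildcard in place, your service can recover the customer identifier directly from the &lt;code&gt;Host&lt;/code&gt; header. A small sketch using this article's example base domain:&lt;/p&gt;

```javascript
// Extract the customer identifier from the Host header when the stack is
// configured for *.product-service.com (the article's example base domain).
const BASE_DOMAIN = 'product-service.com';

function customerFromHost(host) {
  const normalized = host.toLowerCase().replace(/:\d+$/, ''); // strip any port
  if (!normalized.endsWith('.' + BASE_DOMAIN)) {
    return null; // not one of our wildcard subdomains
  }
  // Everything before ".product-service.com" is the customer subdomain
  return normalized.slice(0, -(BASE_DOMAIN.length + 1));
}
```

&lt;p&gt;The same lookup works whether the request arrived on the raw subdomain or via a customer's CNAME, as long as the proxy forwards the original host.&lt;/p&gt;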

&lt;h3&gt;
  
  
  3. Enable custom domain requests
&lt;/h3&gt;

&lt;p&gt;Let your customer request: "I want to set up a custom domain, my domain is &lt;code&gt;example.com&lt;/code&gt;". Generally speaking, require them to use a subdomain, because you likely aren't hosting their main website. For Authress, we require subdomains like &lt;code&gt;login.example.com&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Deploy Customer Account specific resources
&lt;/h3&gt;

&lt;p&gt;To do this we'll deploy a new CloudFront distribution and a new certificate, both in the us-east-1 region (this is important). Both resources should be requested with the custom domain specified.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new CloudFront Distribution with &lt;strong&gt;no&lt;/strong&gt; alternate domain name set. (If you try to use the cert before it is validated, it will fail, so don't do this.) This distribution should point to your existing infrastructure and have custom headers set for the specific customer accountId/subdomain. The CF distribution will have a domain like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dmv3o3npogaoea.cloudfront.net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Request a new ACM certificate with the custom domain and get back the DNS validation options, they look like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: _85dd6f5e25429205f5ffa77.example.com.
Value: _caee56563101.aocomg.acm-validations.aws.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Mask these values so you can make updates later by creating two CNAME records in your hosted zone:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CNAME: validation-customer-A.product-service.com
Value: _caee56563101.aocomg.acm-validations.aws (value from above)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AND&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: customer-A.product-service.com
Value: dmv3o3npogaoea.cloudfront.net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Return these two sets of DNS CNAME values:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: _85dd6f5e25429205f5ffa77.example.com (Name from above)
Value: validation-customer-A.product-service.com (our custom name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AND&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: example.com (their custom domain)
Value: customer-A.product-service.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
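&lt;p&gt;Putting the records from steps 5 and 6 together, each customer-facing name resolves through your masked records. A toy resolver over the example values from this walkthrough shows both chains:&lt;/p&gt;

```javascript
// The CNAME records from steps 5 and 6, flattened into one lookup table.
const cnames = new Map([
  // customer's hosted zone (the records you hand them in step 6)
  ['example.com', 'customer-A.product-service.com'],
  ['_85dd6f5e25429205f5ffa77.example.com', 'validation-customer-A.product-service.com'],
  // your hosted zone (step 5) - masking lets you rotate the targets later
  ['customer-A.product-service.com', 'dmv3o3npogaoea.cloudfront.net'],
  ['validation-customer-A.product-service.com', '_caee56563101.aocomg.acm-validations.aws']
]);

// Follow CNAMEs until we reach a terminal name, collecting the chain.
function resolveChain(name) {
  const chain = [name];
  while (cnames.has(chain[chain.length - 1])) {
    chain.push(cnames.get(chain[chain.length - 1]));
  }
  return chain;
}
```

&lt;p&gt;Because of the masking layer, you can later repoint &lt;code&gt;customer-A.product-service.com&lt;/code&gt; at a different distribution without the customer touching their DNS.&lt;/p&gt;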



&lt;h3&gt;
  
  
  7. Wait for your customer to add these two CNAME records to their domain hosted zone.
&lt;/h3&gt;

&lt;p&gt;Concretely, from our example it would look like this:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks79wduxhuobnpdfs1pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks79wduxhuobnpdfs1pa.png" alt="Image description" width="790" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2fyo2e2o7dar3yp5eg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2fyo2e2o7dar3yp5eg.png" alt="Image description" width="800" height="187"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  8. Once verified update your CloudFront distribution alternate domain name and certificate that was just verified.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faytwbu4156hzgbmwkn90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faytwbu4156hzgbmwkn90.png" alt="Image description" width="608" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9aplk34peqtva73rky2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9aplk34peqtva73rky2.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Done.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Now you should be successfully running the CloudFront distribution as a TLS proxy for your service, website, app, etc. It allows the white labeling that your customers need, and you don't need to pay a fortune to manage anything. ACM and CF are free, and all the traffic you would pay for anyway just gets migrated to a new CF distribution.&lt;/p&gt;

&lt;p&gt;Even better, you can attach protections to your infrastructure to monitor how these proxies are being used. Further, all the tools exist at the CF level for rate limiting, DDoS protection, real-time logging, custom assets, and the list goes on.&lt;/p&gt;




&lt;p&gt;Come join our &lt;a href="https://authress.io/community/" rel="noopener noreferrer"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kfcgr5kmkwi1rxu471k.png" alt="Join the community" width="348" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>aws</category>
      <category>saas</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
