DEV Community: Yan Cui

Is it safe to use ID tokens with Cognito authorizers?

Yan Cui — Tue, 03 Sep 2024 01:32:48 +0000

A common narrative is that you should always use access tokens to call your APIs, while ID tokens are strictly for identifying users.

Some of it has come from this article by Auth0 [1], which makes a strong statement against using ID tokens to call APIs.

However, things are usually more nuanced. In some cases, using ID tokens instead of access tokens is both acceptable and pragmatic. Cognito User Pools might be one of these cases.

Cost of using access tokens in Cognito

The common practice amongst Cognito users is to use ID tokens.

Before January 2024, you couldn’t customize access tokens to include the OAuth scopes. So, using access tokens for authorization just wasn’t an option.

But now, Cognito lets you customize access tokens through the Pre-Token Generation trigger, like this [2]. With this, you can implement authorization in API Gateway using access tokens and OAuth scopes.

However, this requires Cognito’s advanced security features [3], which are charged at a much higher rate: they start at $0.050 per MAU and have no free tier. You must also pay the standard MAU cost for Cognito User Pools, which starts at $0.0055 per MAU and has a generous free tier of 50,000 MAU.

This significantly raises the cost of using Cognito User Pools. Here is how much it’d cost you per month with Advanced Security Features (using access tokens) vs. without (using ID tokens).
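As a rough back-of-the-envelope illustration, the sketch below uses only the starting rates quoted above. It assumes flat per-MAU rates, whereas actual Cognito pricing is tiered and gets cheaper at volume, so treat the results as an upper-bound estimate rather than a quote. The function and constant names are made up for this example.

```javascript
// Starting rates quoted in this post (USD per monthly active user).
const BASE_RATE = 0.0055;     // standard Cognito User Pool price per MAU
const BASE_FREE_TIER = 50000; // MAUs covered by the free tier
const ASF_RATE = 0.05;        // advanced security features, no free tier

// Assumption: flat rates with no volume discounts.
function monthlyCost(mau, { advancedSecurity }) {
  const base = Math.max(0, mau - BASE_FREE_TIER) * BASE_RATE;
  const asf = advancedSecurity ? mau * ASF_RATE : 0;
  return base + asf;
}

for (const mau of [10000, 50000, 100000]) {
  const idTokens = monthlyCost(mau, { advancedSecurity: false });
  const accessTokens = monthlyCost(mau, { advancedSecurity: true });
  console.log(
    `${mau} MAU: $${idTokens.toFixed(2)} (ID tokens) vs $${accessTokens.toFixed(2)} (access tokens)`
  );
}
```

Under these assumptions, 50,000 MAUs stay within the free tier when you use ID tokens, but cost about $2,500 a month once advanced security features are switched on.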

This might be fine for B2B use cases where you tend to have few high-value users. But it’s practically a death sentence for many B2C businesses, many of which have thousands of free users.

“Why not use something else instead?”

For this many MAUs, you’d pay even more with vendors such as Auth0 and Okta, most of which require you to sign an enterprise contract before you can reach this scale.

Cognito’s greatest strengths are cost efficiency and its integration with other AWS services, such as API Gateway and AppSync. If access tokens are significantly more costly, one must ask,

“Is it worth it?”

Are access tokens more secure? If so, are they THAT much more secure and worth the extra cost?

ID tokens vs. Access tokens

Whenever I show an example of using Cognito with ID tokens, someone tells me, “You should use access tokens instead!” But I have yet to hear a compelling argument for why ID tokens are less secure.

Let’s quickly debunk the common arguments.

“ID tokens are not designed for authorization”

Yes, access tokens were intended for APIs and ID tokens for authentication purposes. However, this division is not always necessary or practical.

After all, there is more than one way to implement authorization.

You can implement the authorization logic in the identity provider and embed the authorization decision in access tokens (in the form of OAuth scopes).

However, you can also implement the authorization logic directly in the API. In my last post [4], I demonstrated how you can do this with a Lambda authorizer and Cognito groups. We will also explore other ways to implement authorization with API Gateway in the coming weeks.

It’s two routes to the same result – being able to control who can do what in your system.

“ID tokens have more information”

This is true. ID tokens contain information about the user, such as their name and email.

But what can attackers do with this information if they managed to steal your ID tokens? Nothing. Most likely.

As an attacker, I have easier ways to acquire names and emails than to steal ID tokens from a system.

“ID tokens give you access to the API”

So do access tokens.

“Access tokens can only be created by a trusted source”

The same is true of ID tokens.

“Access tokens have limited lifetime”

So do ID tokens. You can configure the validity period for both access and ID tokens in Cognito (and with other vendors). It’s a matter of making sensible architectural decisions.

“You can bind access tokens to specific senders to avoid abuse”

Yes, the Auth0 article mentioned this.

The linked article [5] discusses two techniques for binding a token to a specific sender:

Mutual TLS authentication (MTLS)
Demonstrating Proof-of-Possession (DPoP)

However, nothing about these techniques is specific to access tokens. They will work equally well for ID tokens.

Summary

Until someone can prove otherwise, I believe it’s perfectly safe to use ID tokens with Cognito authorizers.

ID tokens are not inherently less secure than access tokens. Furthermore, all the techniques that make access tokens more secure also apply to ID tokens.

So, there are no security downsides to using ID tokens with Cognito.

On the other hand, there are significant costs to using access tokens.

I’m not saying that you shouldn’t use access tokens! In fact, I will show you how to use access tokens to implement authorization in API Gateway in the next post.

But you should know the trade-offs and not blindly pick a more costly approach based on hearsay.

Links

[1] “ID Token and Access Token: What’s the Difference?” by Auth0

[2] How to customize access tokens in Amazon Cognito user pools

[3] Cognito User Pool advanced security features

[4] Fine-grained access control in API Gateway with Cognito groups & Lambda authorizer

[5] Identity, Unlocked… Explained | Episode 1

The post Is it safe to use ID tokens with Cognito authorizers? appeared first on theburningmonk.com.

Fine-grained access control in API Gateway with Cognito groups & Lambda authorizer

Yan Cui — Thu, 29 Aug 2024 08:09:48 +0000

In security and access control, authentication and authorization mean two distinct but related things.

Authentication verifies the identity of a user or system.

Authorization determines what actions an authenticated user is allowed to perform in your system.

API Gateway has built-in integration with Cognito, but it doesn’t provide any fine-grained authorization out-of-the-box.

By default, a Cognito authorizer only checks if a user’s bearer token is valid and that the user belongs to the right Cognito User Pool.

There are many ways you can implement fine-grained authorization with API Gateway. Here are three that I have come across over the years:

Using Lambda authorizer with Cognito groups;
Using Cognito access tokens with OAuth scopes;
Using Lambda authorizer with Amazon Verified Permissions [1];

Over the next few weeks, let’s look at these approaches in-depth and then compare them at the end.

Today, let’s look at Lambda authorizer with Cognito groups.

Model roles with Cognito groups

In Cognito, you can use groups to model the different roles in your system, e.g. Admin, ReadOnly.

Users can belong to more than one group at once, just as they can have multiple roles within a system.

Cognito encodes the groups a user belongs to in the ID token. If you decode the ID token, you will see something like this:



{
  "sub": "f438b478-6031-70f3-a346-4f8e84e00b62",
  "cognito:groups": [
    "ReadOnly",
    "Admin"
  ],
  "email_verified": true,
  "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_xxx",
  "cognito:username": "f438b478-6031-70f3-a346-4f8e84e00b62",
  "origin_jti": "e0d4077e-7092-45dd-ac13-1d60d382629b",
  "aud": "1u7c0elmc6v3qrr68s4vpo63sm",
  "event_id": "fd811c5a-5ac7-4644-92ae-a9738a33bd76",
  "token_use": "id",
  "auth_time": 1724807904,
  "exp": 1724811504,
  "iat": 1724807904,
  "jti": "5fa8be1d-411f-418d-8508-b6b8fe64ff9b",
  "email": "example@example.com"
}

Here, we can see the user belongs to both the Admin and ReadOnly groups.
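To make this concrete, here is a minimal sketch of how a Lambda authorizer might read the groups out of an ID token. It only decodes the payload and does not verify the signature. Unlike the built-in Cognito authorizer, a Lambda authorizer has to validate the token itself (for example, with a JWT verification library such as aws-jwt-verify) before trusting any claim. The helper names and the sample token are made up for illustration.

```javascript
// Decode the payload (middle segment) of a JWT. This does NOT verify the signature.
// Assumption: the token has already been verified before its claims are trusted.
function decodeJwtPayload(token) {
  const segments = token.split(".");
  if (segments.length !== 3) {
    throw new Error("expected a JWT with three dot-separated segments");
  }
  const json = Buffer.from(segments[1], "base64url").toString("utf8");
  return JSON.parse(json);
}

// Return the user's Cognito groups, or an empty array if the claim is absent.
function getGroups(idToken) {
  const claims = decodeJwtPayload(idToken);
  return claims["cognito:groups"] ?? [];
}

// Build an unsigned sample token purely to exercise the helpers above.
const encode = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
const sampleToken = [
  encode({ alg: "none" }),
  encode({ sub: "f438b478", "cognito:groups": ["ReadOnly", "Admin"] }),
  "signature-placeholder",
].join(".");

console.log(getGroups(sampleToken)); // [ 'ReadOnly', 'Admin' ]
```

Users who are not in any group have no cognito:groups claim at all, which is why the helper falls back to an empty array.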

Lambda authorizer

A Lambda authorizer can use this information to generate its policy document. As a reminder, a Lambda authorizer can return a policy document like this:



{
  "principalId": "username",
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "execute-api:Invoke",
        "Effect": "Allow",
        "Resource": "arn:aws:execute-api:us-east-1:12345:xxx/dev/GET/resource"
      }
    ]
  }
}

So, we need to take the list of groups a user belongs to and turn them into a set of policy statements.

Mapping roles to policies

One approach is to keep a mapping in your code like this.



const POLICY_STATEMENTS = {
  "Admin": [{
    "Action": "execute-api:Invoke",
    "Effect": "Allow",
    "Resource": "arn:aws:execute-api:us-east-1:123456789012:xxx/dev/*"
  }],
  "ReadOnly": [{
    "Action": "execute-api:Invoke",
    "Effect": "Allow",
    "Resource": [ 
      "arn:aws:execute-api:us-east-1:123456789012:xxx/dev/GET/token",
      "arn:aws:execute-api:us-east-1:123456789012:xxx/dev/POST/task",
      // ...
    ]
  }]
}

In many systems, there are a small number of roles that supersede each other. That is, they are hierarchical, and a higher role has all the permissions of a lower role plus some.

In this case, we need to find the most permissive role that the user has.



// assuming we have only two roles, Admin and ReadOnly
// and Admin supersedes ReadOnly
const statement = groups.includes("Admin")
  ? POLICY_STATEMENTS["Admin"]
  : POLICY_STATEMENTS["ReadOnly"]

return {
  "principalId": username,
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": statement
  }
}

But what if the roles are more lateral? That is, a user’s permissions are derived from all its roles.

Well, that’s easy enough to accommodate.



const statement = [];

// collect the policy statements for every role the user has
for (const role of ["Admin", "ReadOnly"]) {
  if (groups.includes(role)) {
    statement.push(...POLICY_STATEMENTS[role]);
  }
}

return {
  "principalId": username,
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": statement
  }
}
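One edge case the snippets above leave open is a user who belongs to none of the known groups. The sketch below is one possible way to handle it: it merges the statements for every matching role and falls back to an explicit deny when nothing matches. The ARNs are placeholders, and buildAuthorizerResponse and DENY_ALL are hypothetical names, not part of any AWS API.

```javascript
// Hypothetical role-to-statement mapping; the ARNs are placeholders.
const POLICY_STATEMENTS = {
  Admin: [{
    Action: "execute-api:Invoke",
    Effect: "Allow",
    Resource: "arn:aws:execute-api:us-east-1:123456789012:xxx/dev/*",
  }],
  ReadOnly: [{
    Action: "execute-api:Invoke",
    Effect: "Allow",
    Resource: ["arn:aws:execute-api:us-east-1:123456789012:xxx/dev/GET/token"],
  }],
};

// Explicit deny used when the user has no recognised role.
const DENY_ALL = [{ Action: "execute-api:Invoke", Effect: "Deny", Resource: "*" }];

function buildAuthorizerResponse(username, groups) {
  // Union of the statements for every known role the user has.
  const statement = Object.keys(POLICY_STATEMENTS)
    .filter((role) => groups.includes(role))
    .flatMap((role) => POLICY_STATEMENTS[role]);

  return {
    principalId: username,
    policyDocument: {
      Version: "2012-10-17",
      Statement: statement.length > 0 ? statement : DENY_ALL,
    },
  };
}

console.log(JSON.stringify(buildAuthorizerResponse("alice", ["ReadOnly"]), null, 2));
```

Keeping this as a pure function of the username and groups means you can unit test it without invoking the authorizer.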

Conclusion

This is my preferred approach for simple use cases.

It’s easy to follow and test and makes no API calls (i.e. no extra latency overhead).

Furthermore, it does not require Cognito’s Advanced Security Features, which are charged at a much higher rate [2]. This makes it a very cost-efficient approach.

However, using a Lambda authorizer means you need to think about cold starts and their impact on user experience.

Also, the roles and policies are static. Whilst it’s good enough for simple use cases, it cannot (easily) support more advanced use cases. For example, if you need to allow users to create custom roles while maintaining the tenant boundary.

Amazon Verified Permissions is a better fit for more advanced use cases. More on it later.

Links

[1] Amazon Verified Permissions service

[2] Cognito’s pricing page

The post Fine-grained access control in API Gateway with Cognito groups & Lambda authorizer appeared first on theburningmonk.com.

What’s the best way to do fan-out/fan-in serverlessly in 2024?

Yan Cui — Sun, 04 Aug 2024 18:48:10 +0000

Back in 2018, I shared [1] several ways to implement fan-out/fan-in with Lambda. A lot has changed since, so let’s explore the solution space in 2024.

Remember, what’s “best” depends on your context. I will do my best to outline the trade-offs you should consider.

Also, I will only consider AWS services in this post. But there is a wealth of open-source and third-party services that you can use too, such as Restate.

If you’re not sure whether you need fan-out/fan-in or map-reduce, then you should read my previous post first [2]. I explained the difference between the two and when to use which.

Ok, let’s go!

Step Functions

Fan-out has always been easy on AWS. You can use SQS to distribute tasks across many workers for parallel processing.

But fan-in makes you work! You need to keep track of the individual results so you can act on them when all the results are in.

Luckily for us, Step Functions has productized the fan-out/fan-in pattern with its Map state.

You provide an input array and specify how to process each item in the array. The Map state will process the items with as much parallelism as possible. It handles collating the results into the correct order and returns them as an array.

You can take this output array as input to another Task state to act on the results.

Importantly, the Map state has two modes:

Inline mode takes the input array from the state input and allows up to 40 concurrent iterations. This is good enough for most use cases. But it falls short when you have a large input array or require higher concurrency.
Distributed mode can use a JSON/CSV file in S3 as input and allows up to 10,000 concurrent iterations! This is referred to as the “distributed Map” state and is priced differently.

With these two modes, you can use the Map state for fan-out/fan-in at practically any scale.

The main considerations here are ease of configuration and cost.

Ease of configuration

Inline mode is easier to configure than Distributed mode. Both modes need an ItemProcessor to tell the state machine how to process the array items.

The Distributed mode also needs an ItemReader and optionally (but likely) a ResultWriter. This is so the state machine knows how to read the input from S3 and potentially write the results to S3 too.

In addition, you have to decide what ExecutionType to use for the distributed map.

This is an important decision.

Each iteration of the map state is executed as a child workflow. The ExecutionType setting determines if the child workflow is executed as a standard workflow or an express workflow.
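To show the difference in configuration, here is a sketch of the two Map state variants, written as JavaScript objects that serialize to Amazon States Language JSON. The function name, bucket names, object key and concurrency values are placeholders chosen for illustration.

```javascript
// Shared item processor: one Lambda task per array item.
// "process-item" is a hypothetical function name.
const processItem = {
  StartAt: "ProcessItem",
  States: {
    ProcessItem: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "process-item", "Payload.$": "$" },
      End: true,
    },
  },
};

// Inline mode: items come from the state input, and only an ItemProcessor is needed.
const inlineMap = {
  Type: "Map",
  ItemsPath: "$.items",
  MaxConcurrency: 40,
  ItemProcessor: { ProcessorConfig: { Mode: "INLINE" }, ...processItem },
  End: true,
};

// Distributed mode: items come from S3, results go back to S3,
// and ExecutionType picks the child workflow type.
const distributedMap = {
  Type: "Map",
  MaxConcurrency: 1000,
  ItemReader: {
    Resource: "arn:aws:states:::s3:getObject",
    ReaderConfig: { InputType: "JSON" },
    Parameters: { Bucket: "my-input-bucket", Key: "items.json" },
  },
  ItemProcessor: {
    ProcessorConfig: { Mode: "DISTRIBUTED", ExecutionType: "EXPRESS" },
    ...processItem,
  },
  ResultWriter: {
    Resource: "arn:aws:states:::s3:putObject",
    Parameters: { Bucket: "my-output-bucket", Prefix: "results" },
  },
  End: true,
};

const definition = { StartAt: "FanOut", States: { FanOut: distributedMap } };
console.log(JSON.stringify(definition, null, 2));
```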

Standard workflows can run for up to a year. But an express workflow can only run for five minutes. See this article [3] for a more detailed comparison of these two workflow types.

Perhaps most importantly, this decision affects how the iterations are charged.

Cost

In Inline mode, the states in the ItemProcessor are executed as part of the state machine. For a Standard Workflow, they are charged at $25 / million state transitions. For an Express Workflow, they are charged based on memory used and duration.

The Distributed mode executes each iteration as a child workflow. The ExecutionType setting dictates whether the iterations are executed as Standard Workflows or Express Workflows.

Express Workflows are priced exactly as before.

However, Standard Workflows are priced as one state transition per iteration. That is, even if an iteration executes 10 state transitions, they are priced as one state transition.
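To put some numbers on this, the sketch below compares the two billing models for Standard Workflows using the $25 per million state transitions figure quoted above. The item counts and the 10 transitions per iteration are made-up inputs, and the estimate ignores the parent workflow's own transitions and everything outside Step Functions.

```javascript
const PRICE_PER_TRANSITION = 25 / 1_000_000; // $25 per million state transitions

// Inline mode in a Standard Workflow: every transition in every iteration is billed.
function inlineStandardCost(items, transitionsPerIteration) {
  return items * transitionsPerIteration * PRICE_PER_TRANSITION;
}

// Distributed mode with Standard child workflows: one billed transition per iteration.
function distributedStandardCost(items) {
  return items * PRICE_PER_TRANSITION;
}

console.log(inlineStandardCost(10_000, 10).toFixed(2));       // 2.50
console.log(distributedStandardCost(10_000).toFixed(2));      // 0.25
console.log(distributedStandardCost(1_000_000).toFixed(2));   // 25.00
```

So under these assumptions, fanning out to a million items with Standard child workflows costs roughly $25 in state transitions, no matter how many steps each iteration runs.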

With this in mind, the Step Functions cost for processing large input arrays will not be astronomical.

However, you also need to take into account other associated costs, such as the cost of Lambda invocations. If you need to process a large array of inputs, consider processing them in batches.

Summary

Step Functions’ Map state is the simplest solution for implementing fan-out/fan-in serverlessly.

It can handle workloads at any scale.

Even if you need to fan out to millions of tasks, the distributed map state can offer a cost-effective solution.

If you want a balanced solution that offers good developer productivity and cost efficiency, you should choose Step Functions.

But if cost is your primary concern, maybe because you need to produce millions of tasks frequently, then you should implement a custom solution.

Custom build solutions

Here’s a common pattern for building a custom fan-out/fan-in solution:

1. When an input file is dropped in S3, it triggers a Lambda function to process the input file.
2. The Lambda function parses the input file and fans out the items to an SQS queue. It also saves the no. of tasks in DynamoDB.
3. A Lambda function processes the tasks from SQS in batches. To handle partial failures, it should implement partial batch responses [4].
4. As the SQS function processes each task, it writes the intermediate result to DynamoDB.
5. The data change events from step 4 are captured in a DynamoDB stream and used to trigger another Lambda function.
6. This function counts the no. of newly completed tasks and updates the total no. of completed tasks in DynamoDB. When it sees all the tasks are completed, it will iterate over all the results and produce a final output.
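For the SQS function, a partial batch response might look like the sketch below. It returns the IDs of the messages that failed so that only those are retried, which relies on ReportBatchItemFailures being enabled on the event source mapping. processTask is a hypothetical stand-in for the real work, including the DynamoDB write.

```javascript
// Hypothetical task processor: replace with the real work and the DynamoDB write.
async function processTask(task) {
  if (task.shouldFail) {
    throw new Error(`task ${task.id} failed`);
  }
  return { id: task.id, result: "done" };
}

// SQS-triggered Lambda handler that reports only the failed messages.
async function handler(event) {
  const batchItemFailures = [];

  for (const record of event.Records) {
    try {
      await processTask(JSON.parse(record.body));
    } catch (err) {
      // Only this message is returned to the queue; the rest of the batch is deleted.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
}
```

Without this, a single bad message would cause the whole batch to be retried, and the tasks that already succeeded would be processed again.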

This is likely a more cost-efficient solution than Step Functions.

There is built-in batching for the SQS and DynamoDB stream functions, courtesy of Lambda’s EventSourceMapping. So the Lambda-related processing costs will be lower.

Furthermore, here is a rough estimate for other costs (assuming one million items):

1 million SQS SendMessage calls (for fan-out): $0.40
1 million DynamoDB PutItem calls (assuming each intermediate result is smaller than 1 KB): $1.25
Assuming the DynamoDB stream function uses a batch size of 100, we need to make 10,000 DynamoDB GetItem & PutItem calls to update the count (not using atomic increments because they are not reliable): $0.015
Once all the results are in, we need to query the table to fetch all results. Assuming each result is 1 KB in size, we can fetch 4 results per Read Request Unit (RRU). It will take 250,000 RRUs in total, or $0.0625.

As you can see, these estimated costs are much lower than those for Step Functions. However, this solution has many moving parts, and you have to own the uptime.

If you frequently process millions of items then a custom solution like this can make sense.

Summary

When choosing a solution, it’s not just about finding the cheapest option. It’s about getting the most value for your investment.

Remember, you get what you pay for.

Think of it like buying tools for a job. You don’t always buy the cheapest tools because they might not last. But you also don’t need the most expensive ones if they offer more than you need. You aim for the right balance – reliable enough to get the job done efficiently without overspending.

In the same way, when building your serverless architecture, choose the solution that offers the best trade-off between cost, complexity, and capability for your specific needs.

Links

[1] How to do fan-out and fan-in with AWS Lambda

[2] Do you know your Fan-Out/Fan-In from Map-Reduce?

[3] Choosing workflow type in Step Functions

[4] Implementing partial batch responses

The post What’s the best way to do fan-out/fan-in serverlessly in 2024? appeared first on theburningmonk.com.

How to handle execution timeouts in AWS Step Functions

Yan Cui — Sun, 21 Apr 2024 00:09:42 +0000

Step Functions lets you set a timeout on Task states and the whole execution.

By default, a Task state times out after 60 seconds. But an execution can run for a year if no TimeoutSeconds is configured. To a user, the execution would appear as “stuck”.

AWS best practices recommend using timeouts to avoid such scenarios [1]. So it’s important to consider what happens when you experience a timeout.

You can use the Catch clause to handle the States.Timeout error when a Task state times out. You can then perform automated remediation steps.
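As an illustration, here is a sketch of a state machine definition with a Task-level timeout and a Catch clause for States.Timeout, written as a JavaScript object that serializes to Amazon States Language JSON. The state names, function names and timeout values are placeholders.

```javascript
const definition = {
  StartAt: "CallSlowService",
  TimeoutSeconds: 300, // timeout for the whole execution
  States: {
    CallSlowService: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "call-slow-service", "Payload.$": "$" },
      TimeoutSeconds: 30, // timeout for this Task state only
      Catch: [
        {
          ErrorEquals: ["States.Timeout"],
          ResultPath: "$.error", // keep the original input alongside the error
          Next: "Remediate",
        },
      ],
      End: true,
    },
    Remediate: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Parameters: { FunctionName: "remediate-timeout", "Payload.$": "$" },
      End: true,
    },
  },
};

console.log(JSON.stringify(definition, null, 2));
```

Note that the Catch clause here only covers the Task-level timeout. The top-level TimeoutSeconds ends the whole execution, and no state inside it gets a chance to react.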

But what happens when the whole execution times out? How can we catch and handle execution timeouts like we do with Task states?

Here are 3 ways to do it.

EventBridge

Standard Workflows publish TIMED_OUT events to the default EventBridge bus. We can create an EventBridge rule to match against these events. That way, we can trigger a Lambda function to handle the error.

The event contains the state machine ARN, execution name, input and output. We can even use the execution ARN to fetch the full audit history of the execution.
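A minimal sketch of this setup is shown below: an event pattern for the EventBridge rule and a Lambda handler that pulls the useful fields out of the event. The summarizeTimeout helper and the sample values are hypothetical, and what you do with the summary (alerting, remediation) is up to you.

```javascript
// Event pattern for the EventBridge rule: match executions that timed out.
const timedOutPattern = {
  source: ["aws.states"],
  "detail-type": ["Step Functions Execution Status Change"],
  detail: { status: ["TIMED_OUT"] },
};

// Pull out the fields we care about from the event.
function summarizeTimeout(event) {
  const { executionArn, stateMachineArn, name, status, input } = event.detail;
  return {
    executionArn,
    stateMachineArn,
    executionName: name,
    status,
    // `input` is delivered as a JSON string and may be absent.
    input: input ? JSON.parse(input) : null,
  };
}

// Lambda handler targeted by the rule.
async function handler(event) {
  const summary = summarizeTimeout(event);
  console.log("execution timed out", JSON.stringify(summary));
  return summary;
}
```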

That should give us everything we need to figure out what happened.

Unfortunately, this approach only works for Standard Workflows. Express Workflows do not emit events to EventBridge.

CloudWatch Logs

Both Standard and Express Workflows can write logs to CloudWatch. When an execution times out, it writes a log event like this:

We can use CloudWatch log subscription to send these events to a Lambda function to handle the timeout.

However, these log events are not as easy to use as the EventBridge events.

We can extract the state machine name and execution name from the execution ARN. But not the input and output.
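Extracting the names could look like the sketch below. It assumes the Standard Workflow layout arn:aws:states:<region>:<account>:execution:<stateMachineName>:<executionName>. For Express executions the trailing part of the ARN may contain more than the execution name, so check it against your own log events. parseExecutionArn is a hypothetical helper.

```javascript
// Split an execution ARN into its parts.
// Assumed layout: arn:aws:states:<region>:<account>:<type>:<stateMachineName>:<executionName>
function parseExecutionArn(executionArn) {
  const parts = executionArn.split(":");
  if (parts.length < 8 || parts[0] !== "arn" || parts[2] !== "states") {
    throw new Error(`not a Step Functions execution ARN: ${executionArn}`);
  }
  const [, , , region, accountId, resourceType, stateMachineName, ...rest] = parts;
  return {
    region,
    accountId,
    resourceType,
    stateMachineName,
    // Anything after the state machine name is kept together.
    executionName: rest.join(":"),
  };
}

console.log(
  parseExecutionArn("arn:aws:states:us-east-1:123456789012:execution:OrderFlow:run-42")
);
```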

For Standard Workflows, we can use the GetExecutionHistory [2] API to fetch the execution history. But this does not support Express Workflows. Instead, we must rely on the audit history logged to CloudWatch.

These are not always available because we will likely set the log level to ERROR to minimize the cost of CloudWatch Logs.

This approach can work for both Standard and Express Workflows. However, it might not be practical because the log event provides limited information about the execution.

Nested workflows

We can solve the abovementioned problems by nesting our state machine inside a parent Standard Workflow.

Works for both Standard and Express Workflows.
We have the input and output for the execution.

This is a simple and elegant solution. It’s definitely my favourite approach for handling execution timeouts.

Honourable mentions

There are other variants besides the approaches we discussed here. You can even turn this problem into an ad-hoc scheduling problem.

For example, you can send a message to SQS with a delivery delay matching the state machine timeout. Or create a schedule in EventBridge Scheduler to be executed when the state machine would have timed out.

In both cases, you run into the limitation that Step Functions’ DescribeExecution and ListExecutions APIs don’t support Express Workflows.

This makes it difficult to find out if an execution timed out in the end. It’s only possible to do this by querying CloudWatch Logs. I don’t think the extra complexity and cost are worth it. So, I’d recommend using one of the three proposed solutions here instead.

Links

[1] Use timeouts to avoid stuck executions

[2] Step Function’s GetExecutionHistory API

The post How to handle execution timeouts in AWS Step Functions appeared first on theburningmonk.com.

How to apply the TDD mindset to serverless

Yan Cui — Tue, 09 Apr 2024 08:28:29 +0000

Testing is an integral part of software development. Your tests are a living documentation of your system. They inform others how to use your system, but they are so much more than that.

One of the most misunderstood parts of Test-Driven Development (TDD) is the “Driven” part of the name.

It’s not just about “writing tests before you write the code”. If your tests do not inform and drive your API design, then you’re not really doing TDD.

When I say “API”, I mean the general meaning of the term, not the API spec for HTTP APIs, although they are part of it. In the context of a serverless application, you have to cover multiple APIs with your tests.

For an HTTP API, there are the API contracts you agree with the caller.
For an event-driven architecture, there are event contracts that everyone agrees on.
For the Lambda function, there is the handler function’s signature (its input and output types). In most cases, we don’t have control over these, as the function’s event source dictates them.
The code modules that are executed when a Lambda function is invoked.

Different types of tests [1] can drive the design of the different APIs above.

For example, end-to-end tests exercise the system from the outside. An e2e test will exercise an HTTP API by calling its HTTP endpoints as a client would. They should tell you if your API is difficult to use and drive you towards a better API design.

Similarly, unit tests exercise the business logic encapsulated in the code modules. They should tell you if you have the right abstraction and modularity in place and drive you towards a better code organization.

To apply the TDD mindset to serverless development:

Write e2e tests for your API before you even think about the Lambda functions behind those API endpoints.
Use the tests to identify problems with the API design. Is data missing from the API responses? Are we asking the caller to make multiple calls when we can do everything they want in one?
Iterate and improve.
Once the API spec is set, you can map API endpoints to Lambda function(s).
Now, it’s time to implement the Lambda function handlers.
Use unit tests to test your domain logic and use them to drive your API design.

Does this look similar to what you’d do if you were building serverful applications running on containers or EC2?

It should! The environment our code runs in should not influence how we can use tests to drive our design.

But with serverless, we want to leverage the cloud to its full potential and delegate the heavy lifting to the cloud provider. It changes what and how much code we have to write and maintain. So, it changes how and what we need to test.

For example, I’d delegate authentication and authorization to API Gateway. So, there is no authentication-related code in my Lambda function, and there’s no need for me to write unit tests for it. Instead, authentication is checked as part of my e2e tests. I might even have an e2e test to make sure that unauthorized requests are rejected by API Gateway.

Similarly, most of my Lambda functions are simple and do not have complex business logic. So, there is a low return on investment (ROI) from unit tests. Instead, I focus on testing the integration with the external dependencies (such as DynamoDB tables) with “remocal tests”.

I wrote about my testing strategy for serverless applications [2] previously. If you want to learn more about testing serverless applications, please read that.

But remember, tests are not just for catching bugs and preventing regressions. They are also living documentation for your system and a way to drive its design.

Links

[1] What is the Test Honeycomb? (and why you should care)

[2] My testing strategy for serverless applications

The post How to apply the TDD mindset to serverless appeared first on theburningmonk.com.

Here are four ways you can implement WebSockets using serverless

Yan Cui — Wed, 03 Apr 2024 00:02:18 +0000

The myth that “you can’t do WebSockets with serverless” still persists today, even though we have some very good ways to implement WebSockets without needing to manage any servers.

Part of the problem is that many still falsely equate “serverless” with Lambda. But serverless is much more than that. To me, it describes any technology with the following traits:

No need to manage servers.
Scale to zero.
Usage-based pricing with no minimum charge.

With this in mind, API Gateway, AppSync, and IoT Core are all serverless technologies. All three let you implement WebSockets.

Momento Topics [1] is another good option if you are open to exploring non-AWS options.

API Gateway

API Gateway WebSockets are very low-level and require a lot of work.

As the application developer, you must maintain the mapping from WebSocket connection IDs to users. You do this by implementing Lambda functions that handle API Gateway’s onConnect and onDisconnect events.

When sending messages to a user, you must find the user’s connection ID (in your own database) and call the API Gateway Management API to send the message.

Because you can only send messages one at a time, it can be very inefficient and costly to implement group chats or broadcasts.

Imagine building a sports streaming app and wanting to notify everyone watching Barcelona vs Real Madrid that a goal has gone in. If you have a million live viewers, then that translates to:

a million reads from your DynamoDB table
a million API requests to the API Gateway Management API
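The sketch below shows why a broadcast scales linearly with the number of connections. The two dependencies are injected so the example stays self-contained: loadConnectionIds stands in for reading your own connection table, and postToConnection stands in for a call to the API Gateway Management API (for example, via PostToConnectionCommand in the AWS SDK for JavaScript v3). Both names are assumptions made for this example.

```javascript
// Send the same message to every connection, one API call per connection.
async function broadcast(loadConnectionIds, postToConnection, message) {
  const connectionIds = await loadConnectionIds();
  const stale = [];
  let sent = 0;

  for (const connectionId of connectionIds) {
    try {
      await postToConnection(connectionId, message);
      sent += 1;
    } catch (err) {
      // A GoneException means the client has disconnected; remember it for cleanup.
      if (err.name === "GoneException") {
        stale.push(connectionId);
      } else {
        throw err;
      }
    }
  }

  return { sent, stale };
}
```

A real implementation would send in parallel rather than sequentially, but that only changes the latency. The number of API requests stays the same.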

Of all the options here, this is my least favourite.

AppSync

By comparison, AppSync subscriptions are a breeze to work with.

In the GraphQL schema, you connect a subscription operation (i.e. what a client can subscribe to) to a mutation operation.

That’s it! AppSync takes care of the rest.

Multiple subscribers can subscribe to the same update. When someone publishes a post, the update is automatically broadcast to all the subscribers.

Subscribers can filter what updates they want. To support this, you only have to add arguments to the subscription operation. Here, we allow the caller to filter on author, title, publishYear and publishMonth.
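As a sketch of what such a schema might look like, the snippet below holds a GraphQL schema in a JavaScript template string. The subscription is linked to a mutation with AppSync's @aws_subscribe directive, and the optional arguments act as filters. The addPost mutation, the getPost query and the exact field types are assumptions made for this example.

```javascript
// Hypothetical AppSync schema: `addedPost` fires whenever `addPost` is called.
const schema = /* GraphQL */ `
  type Post {
    id: ID!
    author: String!
    title: String
    content: String
    url: String
    publishYear: Int
    publishMonth: Int
  }

  type Query {
    getPost(id: ID!): Post
  }

  type Mutation {
    addPost(author: String!, title: String, content: String, url: String): Post!
  }

  type Subscription {
    addedPost(author: String, title: String, publishYear: Int, publishMonth: Int): Post
      @aws_subscribe(mutations: ["addPost"])
  }
`;

console.log(schema);
```

Every argument on the subscription is optional, so a client can filter on any combination of them or receive every new post.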

If you only want to receive new posts from “Cixin Liu” then you simply call

addedPost(author: "Cixin Liu") {
  id 
  author
  title
  content
  url
}

If you want to learn more about AppSync subscriptions, then check out this explainer video [2].

IoT Core

Despite its name, the AWS IoT Core service is not just for IoT. It’s actually just a serverless messaging service that speaks MQTT [3].

You can subscribe and publish messages to topics. Pretty simple.

But it has some nice quality-of-life features because it’s designed with IoT in mind. For example, it can store messages if a device is offline and deliver them when it next comes online.

The quickest way to understand how it works is to try it out yourself. Head over to the IoT Core console and click “MQTT test client”. Here, you can subscribe to a topic (e.g. “test”) and then publish a message to the topic.

Momento Topics

Momento Topics [1] are conceptually very simple. You can subscribe to a topic and receive and publish messages through a WebSocket connection.

Allen Helton wrote a nice article [4] on the architecture behind Momento Topics. I highly recommend giving it a read.

The innovation that Momento Topics brings to the table is its pricing structure.

AppSync, API Gateway and IoT Core all charge based on a combination of:

Number of messages sent and received through WebSockets.
Connection time.
Standard EC2 data transfer cost, which is code for “it’s complicated” [5].

Momento, on the other hand, offers a pay-per-use pricing model. Even if you have millions of connected users, you don’t pay for the connections if they are idle.

This offers a different cost dimension that we can use to align with our business model.

The Frugal Architect [6] Law 2 says that our architecture’s cost should grow along the same dimension as how the business would make money.

In other words, if your business revenue grows with user engagement, then so should your architecture’s cost. If millions of connected but inactive users don’t generate revenue for your business, then connection time shouldn’t be the driver of your cost either.

Comparison

Of the four options we discussed here, API Gateway WebSockets requires the most effort to implement. Again, it’s a low-level construct, and you must work with connections and map them to users.

API Gateway also doesn’t support broadcasts. You can only send one message at a time.

However, it’s a two-way (i.e. duplex) connection. The client can send and receive messages through the WebSocket connection. It’s also a vanilla WebSockets connection, and you can develop your own application protocol on top of it.

With AppSync and IoT Core, your application must speak GraphQL and MQTT, respectively. This can be a barrier to entry because it requires additional application dependencies, and you need to understand how these protocols work.

Security

Regarding security, all four options provide some form of authentication and authorization mechanism.

With API Gateway, you can use AWS IAM or a Lambda authorizer to protect the WebSocket API. Once connected, you can decide what data is sent to the user and validate any messages received from the user.

You have a lot of control here but also a lot of responsibilities!

With AppSync, you can use all the available authentication and authorization mechanisms [7] to protect the Subscription endpoint. This controls who can subscribe for updates.

Additionally, you can use a resolver template to control who can subscribe to which updates [8]. For example, to stop a user from subscribing to notifications intended for another user.

With MQTT and Momento Topics, you can control who can access which topic. To ensure a user can’t subscribe to other users’ updates, you should scope a user’s permission to just their own topic. Luckily, in both cases, topics are a virtual construct and do not need to be explicitly created before use. So there’s no limit on how many topics you can have.

With all this in mind, here’s a summary of how these services stack up.

The key takeaway is that you have several options for implementing WebSockets serverlessly. Each offers a slightly different set of trade-offs. And that’s a great place to be.

If you want to learn more about building serverless applications, then check out my upcoming workshop [9] and get 30% OFF with our early bird tickets (available until April 26th 2024).

Links

[1] Momento Topics

[2] What are AppSync Subscriptions?

[3] The MQTT specification

[4] How we built Momento Topics, a serverless WebSocket service

[5] Understanding data transfer cost in AWS

[6] The Frugal Architect

[7] AppSync authentication and authorization

[8] Authorized AppSync Subscriptions

[9] Production-Ready Serverless workshop

The post Here are four ways you can implement WebSockets using serverless appeared first on theburningmonk.com.

DynamoDB now supports resource-based policies. But is that a good idea?

Yan Cui — Sat, 23 Mar 2024 13:04:26 +0000

DynamoDB announced support for resource-based policies [1] a few days ago. It makes cross-account access to DynamoDB tables easier. You no longer need to assume an IAM role in the target table’s account.

I was confused by this update and wondered if it was even a good idea. If you need cross-account access to DynamoDB, then it’s surely a sign you’re breaking service boundaries, right?

As I said before [2], a microservice should own its data and shouldn’t share a database with another microservice.

In many organizations, microservices run in their own accounts. This provides another layer of insulation between services. It can also compartmentalize security breaches and problems related to AWS limits.

When a service needs to access another service’s data, it should go through its API. It shouldn’t reach in and help itself to their data. It’s insecure and creates a hidden coupling between the two services.

This is not how you should use the new resource-based policies!

However, there are legitimate use cases where cross-account access can be helpful. I want to thank everyone who shared their thoughts and experiences with me.

ETL / data pipelines

ETL jobs and data pipelines often need to read data from multiple sources or subscribe to DynamoDB streams to capture live data change events.

These require cross-account access to DynamoDB tables, and the new resource-based policies make them easier.
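For illustration, here is roughly what such a resource-based policy could look like for a read-only ETL job. This is a sketch, not a recommendation: the ARNs are placeholders, the list of actions is my guess at what a typical ETL job needs, and the policy only covers the table itself.

```javascript
// Builds a resource-based policy that lets a role in another account read a table.
// tableArn and readerRoleArn are supplied by the caller; the values below are placeholders.
const crossAccountReadPolicy = ({ tableArn, readerRoleArn }) => ({
    Version: '2012-10-17',
    Statement: [{
        Sid: 'AllowCrossAccountRead',
        Effect: 'Allow',
        Principal: { AWS: readerRoleArn },
        // assumption: a read-only ETL job needs these actions
        Action: ['dynamodb:GetItem', 'dynamodb:Query', 'dynamodb:Scan'],
        Resource: tableArn
    }]
})

const policy = crossAccountReadPolicy({
    tableArn: 'arn:aws:dynamodb:us-east-1:111111111111:table/orders',
    readerRoleArn: 'arn:aws:iam::222222222222:role/etl-job'
})

console.log(JSON.stringify(policy, null, 2))
```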

Migration to new service/account

In this video [3], I discussed using the Strangler Pattern to migrate an existing monolith to serverless. The new service must still access the old database as part of the transitional phase.

This is when you will need cross-account access to the original DynamoDB table. Once your new service owns all the data access patterns, you can safely migrate existing data to a new table in the new account. See this post [4] to see how you can do this without downtime.

Once the migration is complete, you can drop the cross-account access.

There are other cases when this type of transitional need arises.

For example, when moving from a single-account strategy to a multi-account strategy.

In that scenario, you’d create a new instance of the service in a new account. During the transition period, the new account must access data in the old account until the data is migrated to the new account.

Another layer of access control

Resource-based policies give InfoSec teams more control.

Application teams can use identity-based policies to control data access, and InfoSec teams can use resource-based policies to add another layer of control and governance.

This is important in regulated environments that handle sensitive user information. For example, developers need read-only permissions to debug problems in production. However, InfoSec can use resource-based policies to deny access to DynamoDB tables by IAM users.

This can be applied selectively to only the tables holding sensitive user information. You can also carve out exceptions, such as allowing an “emergency breakglass” role to modify table settings (e.g. provisioned throughput settings) without granting data access.
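As a rough sketch, a deny policy along these lines might look like the following. The table ARN and account ID are placeholders, and the choice to match IAM users with the aws:PrincipalArn condition key is my assumption about how you might express it. Only data-read actions are listed, so changes to table settings are not affected by this statement.

```javascript
// Builds a resource-based policy that denies data reads by IAM users.
// tableArn and accountId are placeholders supplied by the caller.
const denyIamUsersPolicy = ({ tableArn, accountId }) => ({
    Version: '2012-10-17',
    Statement: [{
        Sid: 'DenyDataReadsByIamUsers',
        Effect: 'Deny',
        Principal: '*',
        // data access only; actions such as dynamodb:UpdateTable are not listed
        Action: [
            'dynamodb:GetItem',
            'dynamodb:BatchGetItem',
            'dynamodb:Query',
            'dynamodb:Scan'
        ],
        Resource: tableArn,
        Condition: {
            // assumption: match every IAM user in the account by ARN pattern
            ArnLike: { 'aws:PrincipalArn': `arn:aws:iam::${accountId}:user/*` }
        }
    }]
})

console.log(JSON.stringify(denyIamUsersPolicy({
    tableArn: 'arn:aws:dynamodb:us-east-1:111111111111:table/customers',
    accountId: '111111111111'
}), null, 2))
```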

Sharing data with external parties

In rare situations, you might need to share data with external vendors or partners. Resource-based policies make it easier to control who can access what in your account.

Instead of keeping track of the different IAM roles the partners have to assume, you can see all the policies in one place.

Multi-account global table

Several Amazon employees mentioned a common architectural pattern within Amazon. A global service would deploy to different accounts – one region per account.

DynamoDB global tables do not support cross-account replication, so the application must replicate data changes across regions.

Understandably, this is a rather painful exercise. Resource-based policies significantly simplify this process.

This is an interesting architectural pattern. It’s not something that I have seen myself. All the global services I have seen or worked on would deploy to multiple regions in the same account.

From a security point of view, I can understand the rationale. If an account is compromised, an attacker will likely gain access to other regions. An account boundary offers stronger isolation.

Internal tooling likely played a role as well. This pattern is probably ingrained in how internal automation platforms work within Amazon.

Summary

So, DynamoDB finally supports resource-based policies.

While it simplifies cross-account access to DynamoDB tables and eliminates the need to assume IAM roles, it still raises the question: “Is cross-account data access even a good idea?”

Suggestions of a “central data account” are misguided. Placing an account boundary between a service and its data is a bad idea. It will introduce unnecessary complexity to IaC and CI/CD and increase the likelihood of hitting service limits. It also erodes service boundaries and the autonomy of a team to manage and own its services.

It’s also a single point of failure from a security perspective. A compromised account would give the attacker access to all of your data!

Despite these concerns, several valid use cases will benefit from this, such as:

ETL jobs or data pipelines.
The transitional phase of an account migration process.
InfoSec teams looking for strong data governance.
When you need to share data with external parties.
When you require multi-account global tables.

I will leave you with this: cross-account access to DynamoDB tables is almost always a smell. But as with everything, there are exceptions and edge cases. You should think carefully before you use resource-based policies to enable cross-account access to your DynamoDB tables.

Because exceptions are just that, exceptions. Most of you (myself included) are the rule, not the exception.

Links

[1] DynamoDB now supports resource-based policies

[2] My thoughts on the microservices vs. serverless false dichotomy

[3] Migrating existing monolith to serverless in 8 steps

[4] How to perform database migration for a live service with no downtime

The post DynamoDB now supports resource-based policies. But is that a good idea? appeared first on theburningmonk.com.

When to use Step Functions vs. doing it all in a Lambda function

Yan Cui — Sun, 10 Mar 2024 21:12:42 +0000

I’m a big fan of AWS Step Functions. I use it to orchestrate all sorts of workflows, from payment processing to map-reduce jobs.

But it’s yet another AWS service you need to learn and pay for. And it introduces additional complexities, such as:

It’s hard to test [1].
Your business logic is split between configuration and code.
New decision points. Such as whether to use Express Workflows or Standard Workflows [2].

So it’s fair to ask “Why should we even bother with Step Functions?” when you can do all the orchestration in code, inside a Lambda function.

Let’s break it down.

Lambda pros

Doing all the orchestration in code is simpler. It’s more familiar.

Everything you can do in Step Functions, you can do with just a few lines of code. Case in point:

module.exports.handler = async (event) => {
    // error handling
    try {
        await doX()
    } catch (err) {
        // handle errors
    }

    // branching logic
    if (event.condition === true) {
        await doSomething()
    } else {
        await doSomethingElse()
    }

    // parallelism: start one doY call per input item and wait for all of them
    const promises = event.input.map(x => doY(x))
    const results = await Promise.all(promises)

    return results
}

It’s likely cheaper.

A Step Functions state machine would likely use Lambda for its Task states.

In which case, you’d end up paying for both:

The Lambda invocations. There are multiple invocations instead of just one. However, the total billable execution time should be similar.
Step Functions state transitions. At $25 per million state transitions, it’s one of the more expensive services in AWS.

Paying for two services is likely more expensive than paying for just one.
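To put a number on the state transition cost, here is a back-of-the-envelope calculation using the $25 per million figure above. The number of executions and transitions per execution are made up for the example, and the Lambda cost comes on top of this.

```javascript
const PRICE_PER_MILLION_TRANSITIONS = 25 // USD, the figure quoted above

// Cost of the state transitions alone; Lambda invocations are not included.
const stateTransitionCost = (executions, transitionsPerExecution) =>
    (executions * transitionsPerExecution / 1_000_000) * PRICE_PER_MILLION_TRANSITIONS

// e.g. one million executions of a workflow with 5 state transitions each
console.log(stateTransitionCost(1_000_000, 5)) // 125
```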

It’s likely more scalable.

When you use both Step Functions and Lambda functions (for the Task states) you are constrained by the throughput limits of both services.

Step Functions Standard Workflows have modest limits on the no. of state transitions and the no. of executions you can start per second.

Both of these limits can be raised. So with proper planning, they wouldn’t be an issue.

Without Step Functions, you are limited only by Lambda’s limit on concurrent executions. Like the Step Functions limits, it has a default value that can be raised.

Again, with proper planning, and given the recent scaling changes for Lambda [3], you will be OK.

Both the cost and scalability arguments are situational and depend on several architectural choices.

E.g. do you use Standard or Express workflows?

Do you use Lambda functions for Task states or direct service integrations?

Have you estimated your throughput needs and raised the soft limits accordingly?

Because of these factors, they are only “likely” to be true based on what I think the average AWS customer is capable of.

Lambda cons

Cold starts.
15 mins max duration.
Not good for waiting. Because you pay for execution time by the millisecond. It’s a poor solution when it comes to waiting for something to happen because all that execution time (and money) is wasted.

Step Functions pros

The visual workflows make it very easy to debug problems. This is true even for non-technical team members, such as product owners and customer support teams.

Step Functions has a built-in audit history of everything that happened, including:

When did a state start?
What were the input and output?
Error message and stack trace.

Step Functions have direct service integration with almost every AWS service. So it’s possible to implement an entire workflow without needing any Lambda functions.
No Lambda, no cold starts. No cold starts = more predictable performance.
Long execution time. A Standard Workflow can run for up to a year.
Callback patterns are a great way to support human decisions (e.g. approve a deployment request) in a workflow.
Standard Workflows are arguably the most cost-efficient way to wait. Because you don’t pay for the duration, only the state transition.
You can implement more robust error handling. This is important for business-critical workflows.

To make a workflow more robust, you need to have both:

Retries. Preferably with exponential backoff and random jitters.
A fallback or error handling logic. In case an operation fails after a reasonable number of retries.

With Lambda, this puts you between two opposing forces:

Security best practices recommend short timeouts. This mitigates the impact of denial-of-wallet attacks.
We need a long timeout to cater for the worst-case scenario. We need to allow enough time for the retries and the fallback. This means the timeout needs to be many times the expected execution time on a happy path. Oh, and you might also need retries in your fallback!

It’s difficult to guess the right timeout in these situations. Especially when your workflow might have different branches. And if you get it wrong, then your workflow would be killed off halfway and there’s no easy way to restart from the point of failure.
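To see how quickly the numbers add up, here is a small worked example for a single operation with retries. The parameter names and values are made up for illustration. It ignores jitter and the fallback, so treat the result as a rough estimate for one step, not the whole workflow.

```javascript
// Worst-case time for one operation that is retried with exponential backoff.
const worstCaseTimeoutMs = ({ attemptTimeoutMs, maxRetries, baseDelayMs, backoffRate }) => {
    // every attempt (the first try plus each retry) may run until it times out
    const attempts = (maxRetries + 1) * attemptTimeoutMs

    // the delay before retry n is baseDelayMs * backoffRate^(n - 1)
    let delays = 0
    for (let n = 0; n < maxRetries; n++) {
        delays += baseDelayMs * Math.pow(backoffRate, n)
    }

    return attempts + delays
}

// 4 attempts x 3s + delays of 1s, 2s and 4s
console.log(worstCaseTimeoutMs({
    attemptTimeoutMs: 3000,
    maxRetries: 3,
    baseDelayMs: 1000,
    backoffRate: 2
})) // 19000
```

So an operation that normally takes a fraction of a second needs a 19s budget in the worst case, and that’s before you add the fallback.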

By lifting the error handling and retries out of your code and into the state machine itself, you alleviate this tension.
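Here is a sketch of what that might look like for a single Task state, written as a JavaScript object for readability. The Retry and Catch fields are part of the Amazon States Language. The state names, the function ARN and the retry values are placeholders I made up.

```javascript
// A Task state with retries and a fallback, expressed as a plain object.
const chargeCustomer = {
    Type: 'Task',
    // placeholder ARN
    Resource: 'arn:aws:lambda:us-east-1:123456789012:function:charge-customer',
    Retry: [{
        ErrorEquals: ['States.TaskFailed'],
        IntervalSeconds: 1,
        MaxAttempts: 3,
        BackoffRate: 2,          // exponential backoff
        JitterStrategy: 'FULL'   // random jitter
    }],
    Catch: [{
        // runs only after the retries are exhausted
        ErrorEquals: ['States.ALL'],
        Next: 'RefundAndNotify'  // hypothetical fallback state
    }],
    Next: 'SendReceipt'          // hypothetical next state
}

console.log(JSON.stringify(chargeCustomer, null, 2))
```

The Lambda function behind the Task state only has to handle the happy path, so it can keep a short timeout.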

Step Functions lets you resume an execution from that point of failure [4].

Step Functions cons

Cost.
Learning curve.
More complexity. There are a lot more things to configure and manage.
Your business logic is split between the State Machine definition and your code (if using Lambda for Task states).
Hard to test. However, this is getting easier with the new TestState API [5].

Conclusion

In conclusion, Step Functions offer a plethora of capabilities. But they come with their own set of complexities and costs.

Whether it’s right for you depends on the demands of your use case.

Generally, I advocate for the path of least resistance: simple workflows call for simple solutions. And Lambda functions excel in these scenarios.

However, for more complex workloads, Step Functions can simplify the implementation with its built-in functionalities that would otherwise require custom solutions.

For example, when you need to incorporate human decisions into a workflow. You should use Step Functions and leverage its callback patterns instead of creating a bespoke solution.
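As a rough sketch, a callback state could look like this (again as a JavaScript object). The .waitForTaskToken suffix and the $$.Task.Token context value are part of Step Functions. The queue URL, state name, input path and timeout are placeholders. The execution pauses at this state until someone reports back with the task token, e.g. via the SendTaskSuccess or SendTaskFailure API.

```javascript
// A Task state that sends an approval request and waits for a callback.
const waitForApproval = {
    Type: 'Task',
    Resource: 'arn:aws:states:::sqs:sendMessage.waitForTaskToken',
    Parameters: {
        // placeholder queue URL
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/approval-requests',
        MessageBody: {
            // the approver needs this token to resume the execution
            'taskToken.$': '$$.Task.Token',
            // hypothetical input field that describes the request
            'request.$': '$.deployment'
        }
    },
    TimeoutSeconds: 86400, // give up if nobody responds within a day
    Next: 'Deploy'         // hypothetical next state
}

console.log(JSON.stringify(waitForApproval, null, 2))
```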

Similarly, for workflows where resuming from a point of failure is important, you should go with Step Functions.

Personally, I heavily lean towards Step Functions for business-critical workflows. The advantages of visualization, audit trails, and robust error handling align with the high stakes involved. These workflows, like payment processing, warrant the extra investment in Step Functions due to their critical nature.

I will leave you with this diagram with the pros & cons captured in one place. To learn more about testing Step Functions, check out my course, Testing Serverless Architectures [6]. I have a whole chapter dedicated to Step Functions.

Links

[1] Why is Step Functions so hard to test?

[2] What is AWS Step Functions? An in-depth overview.

[3] AWS Lambda functions now scale up to 12X faster

[4] Introducing AWS Step Functions redrive to recover from failures more easily

[5] Does Step Function’s new TestState API make end-to-end tests obsolete?

[6] Testing Serverless Architectures course

The post When to use Step Functions vs. doing it all in a Lambda function appeared first on theburningmonk.com.

When to use API Gateway vs. Lambda Function URLs

Yan Cui — Sun, 03 Mar 2024 23:06:11 +0000

“Lambdalith” is a monolithic approach to building serverless applications where a single Lambda function serves an entire API, instead of one function per endpoint.

It’s an increasingly popular approach.

It provides portability between Lambda functions and container applications. You can lift and shift an existing application into Lambda without rewriting it. You can use web frameworks you are already familiar with, and lean on the existing ecosystems of tools, ORMs and middleware. It also makes testing easier, because you can apply familiar testing methodologies.
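To make the idea concrete, here is a bare-bones Lambdalith without any web framework. It assumes the event shape that Function URLs deliver (the HTTP method under requestContext.http.method and the path in rawPath). The routes and their responses are made up, and in practice a web framework would do this routing for you.

```javascript
// Route table keyed by "METHOD /path". The routes are illustrative.
const routes = {
    'GET /health': async () => ({
        statusCode: 200,
        body: JSON.stringify({ ok: true })
    }),
    'POST /orders': async (event) => ({
        statusCode: 201,
        body: JSON.stringify({ received: JSON.parse(event.body || '{}') })
    })
}

// A single function that serves every endpoint of the API.
// In a real function you would export it, e.g. module.exports.handler = handler
const handler = async (event) => {
    const { method } = event.requestContext.http
    const route = routes[`${method} ${event.rawPath}`]

    return route
        ? route(event)
        : { statusCode: 404, body: JSON.stringify({ message: 'Not found' }) }
}
```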

Tools like the AWS Lambda Web Adapter [1] have made this approach more accessible. In addition, Lambda Function URLs [2] also work well with this pattern.

Which brings up a good question.

“In 2024, if you want to build a REST API using serverless technologies. Should you use Lambda Function URLs or API Gateway?”

Here are some trade-offs you should consider when making this decision.

Function URL pros

It works naturally with Lambdaliths.
No latency overhead from API Gateway.
No API Gateway-related costs.
It supports response streaming [3]. This is useful for returning large payloads in the HTTP response.
There are fewer things to configure.
Your API can run for up to 15 mins.

Function URL cons

You have to use Lambdaliths and deal with all the shortcomings that come with that.
No per-endpoint metrics. It’s possible to make up for this by implementing custom metrics with embedded metric format (EMF) [4].
No direct integration with WAF. Although this is possible through CloudFront, the original function URL would still be unprotected.
It only supports AWS_IAM auth.
You cannot configure different auth methods per endpoint.
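On the per-endpoint metrics point, here is a minimal sketch of what an EMF log line could look like. The _aws envelope follows the embedded metric format. The namespace, dimension name and metric name are my own choices for the example.

```javascript
// Builds an EMF payload that records latency with the endpoint as a dimension.
const endpointMetric = ({ endpoint, latencyMs, namespace = 'my-api' }) => ({
    _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
            Namespace: namespace,          // hypothetical namespace
            Dimensions: [['Endpoint']],    // refers to the Endpoint field below
            Metrics: [{ Name: 'Latency', Unit: 'Milliseconds' }]
        }]
    },
    Endpoint: endpoint,
    Latency: latencyMs
})

// In a Lambda function, writing the JSON to stdout sends it to CloudWatch Logs.
console.log(JSON.stringify(endpointMetric({ endpoint: 'GET /orders', latencyMs: 42 })))
```

You would emit one of these per request from inside the Lambdalith, using whatever route the request matched as the Endpoint value.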

Where Function URL makes sense

If you want to build a Lambdalith (i.e. you are not forced into it by the choice to use Function URLs) and you don’t need any of the additional features that API Gateway offers, then Function URLs make sense: they are cheaper, faster and have fewer moving parts.

Similarly, if you need to return a large payload (> 10MB) or to run for more than 29s, then Function URL can also make sense. That is if you can’t refactor the client-server interaction.

Given the limited support for authentication & authorization, it’s not suitable for user-facing APIs. These APIs often require a Cognito authorizer or a custom Lambda authorizer.

This leaves function URLs as best suited for public APIs or internal APIs inside a microservices architecture.

By “internal API”, I refer to APIs that are used by other services but not by the frontend application directly. These APIs usually require AWS_IAM auth because the caller is another AWS resource, such as a Lambda function, an EC2 instance or an ECS task.

API Gateway pros

It’s more flexible. It works with both Lambdaliths and the one function per endpoint approach.
It has direct integration with most AWS services. So in many cases, you don’t even need a Lambda function. If you’re curious about when and why you should consider this, check out this video [5] where I explain the rationale behind service proxies.
It can proxy requests to any HTTP APIs. This is very useful for integrating with 3rd party APIs.
It offers lots more features, including (but not limited to) the following:

- cognito authorizer
- usage plans (great for SaaS applications that offer tiered pricing)
- built-in request validation with request models
- detailed per-endpoint metrics
- mock endpoints (useful for endpoints that return static data)
- request and response transformation (also useful for integrating with 3rd party APIs)
- WebSockets support
- and lots more…

API Gateway cons

Additional latency overhead.
Additional cost.
No response streaming.
29s integration limit.
10MB response limit.

When API Gateway makes sense

Given the vast array of features that API Gateway offers, it makes sense in most cases if you’re OK with the additional cost that comes with the convenience.

The 29s and 10MB response limits can be problematic in some cases.

But they can be mitigated with patterns such as Decoupled Invocations [6] and S3 presigned URLs [7]. However, these workarounds require you to refactor the client-server interaction, so they are not always viable.

Summary

Because of its flexibility, I prefer API Gateway over Function URLs or ALBs. But Function URL is a useful tool, especially when cost and performance are your primary concerns.

It’s also an important lift-and-shift option for people migrating from an existing EC2 or container-based application. It lets them enjoy the benefits of Lambda without rewriting their application.

Finally, as Ben eloquently puts it, this tradeoff represents a deeper choice we often face.

What’s easier at author time might mean more ownership at runtime. For example, we don’t get per-endpoint metrics from function URLs so we have to bring that capability to the table ourselves and be responsible for them.

Something for you to think about ;-)

Links

[1] AWS Lambda Web Adapter

[2] AWS Lambda: function URL is live!

[3] Introducing AWS Lambda response streaming

[4] Embedded metric format specification

[5] API Gateway: Why you should use Service Proxies

[6] How to use the Decoupled Invocation pattern with AWS Lambda and serverless

[7] Hit the 6MB Lambda payload limit? Here’s what you can do

The post When to use API Gateway vs. Lambda Function URLs appeared first on theburningmonk.com.

First impressions of the fastest JavaScript runtime for Lambda

Yan Cui — Tue, 27 Feb 2024 12:27:10 +0000

I thought Lambda needed a specialised runtime. One that works well with its resource-constrained execution environment. I even floated a few ideas in the past but sadly I don’t have the chops to make them happen myself.

So I was pleasantly surprised when AWS open-sourced the LLRT runtime for JavaScript [1]!

What is LLRT?

LLRT, or Low Latency Runtime, is a new and experimental JavaScript runtime for Lambda. It promises 10x faster startup time. Which should significantly help with the dreaded Lambda cold starts.

Naturally, I had to test it out for myself and see if the hype was real.

My first impressions of LLRT

In my limited tests, the init duration for a simple function went from ~750ms to 55ms. Which was very impressive! The function under test only uses the DynamoDB client from the AWS SDK v3 and makes one request to DynamoDB.

Along the way, I also discovered several limitations:

It only supports ESM modules.
It doesn’t work with the popular middy [2] middleware engine. Because LLRT hasn’t implemented the node:stream API. See the API compatibility [3] page for the list of supported APIs.
It doesn’t work with the Lambda Powertools [4]. Because LLRT doesn’t allow importing the node:console module in userland. See Andrea Amorosi’s response here for more details.

LLRT is not ready for prime time yet. And given that it’s been in the works for almost two years, we should set expectations accordingly.

It’s probably not going to be production-ready in the coming months. But the potential is there and I’m really excited about it!

For one thing, it might finally stop people talking about cold starts. As I said on LinkedIn [5] yesterday, most people are overthinking Lambda cold starts. If you’re not sure whether Lambda cold starts are a problem for you, then go read the LinkedIn post and come back here.

Back to LLRT.

I later spoke with Richard Davison, the creator of the LLRT project.

I wanted to learn more about the project and the design decisions behind it.

What made it start so much faster?

What trade-offs did they make?

LLRT is the answer to the question “What would a purpose-built JavaScript runtime look like for Lambda?”

A lot of people see that “It’s built in Rust” and automatically assume that’s why it’s fast. But it’s more than that.

When it comes to performance optimizations, it always boils down to what you let go. If you choose to do everything the incumbent does, then you won’t make any significant performance gains.

No JIT compilation

With LLRT, they chose to not include a JIT compiler. Because it’s focused on Lambda’s resource-constrained and short-lived execution environments.

As a result, LLRT is likely less performant than the Node.js runtime for CPU-intensive tasks.

However, most Lambda functions do not perform CPU-intensive tasks. Instead, they tend to be IO-intensive.

And the Lambda execution environments are short-lived. So a JIT compiler would have been less effective at optimizing hot code paths.

At the same time, a JIT compiler comes with significant startup costs. It also introduces occasional latency spikes when it needs to evict cached items.

So it appears a sensible trade-off for LLRT to not include a JIT compiler.

QuickJs + Rust

The QuickJs engine [6] plays a crucial role in LLRT and its outstanding performance. Another reason why LLRT is fast is because they wrote all the APIs in Rust.

As much as possible, the team wants to stay in native code to guarantee a strong performance. Bun [7] took the same approach and implemented all the APIs in a system language called Zig [8].

The downside to this approach is that it’s harder for contributors to get in on the action. There are a lot fewer Rust developers than Node.js developers. And even fewer Rust developers who are interested in a JavaScript runtime.

LLRT vs Bun (and other JS runtimes)

LLRT is different from other JavaScript runtimes. It’s not a general-purpose runtime for JavaScript. It doesn’t have to worry about running in the browser or on mobile phones.

Instead, it’s solely focused on the Lambda execution environment.

This allows them to make decisions that just wouldn’t make sense with Bun or Node.js. Decisions such as not including a JIT compiler, or which JavaScript APIs to implement first, or at all.

The goal is to eventually become WinterCG compliant. But we don’t have to wait for that to start using LLRT. For LLRT to be useful (but not perfect!), it just needs to support the AWS SDK and a few popular libraries.

Summary

To summarize, LLRT is an exciting new runtime for JavaScript.

It’s purpose-built for Lambda.

It’s not intended as a general-purpose runtime for JavaScript.

It makes design trade-offs (such as not having a JIT compiler) to achieve an optimal cold start time.

It uses the QuickJs engine and implements all the JavaScript APIs in Rust.

To learn more about LLRT and how to contribute to it, please check out my conversation with Richard on YouTube [9].

I will be covering LLRT in my upcoming workshop, including how to use it with the Serverless Framework and CDK. If you wanna take your serverless game to the next level, then you should check it out! More information is available on the course page [10].

Links

[1] GitHub repo for LLRT

[2] Middy middleware engine

[3] LLRT’s API compatibility page

[4] AWS Lambda Powertools for TypeScript

[5] Most people are overthinking about Lambda cold starts

[6] QuickJs engine

[7] The Bun runtime for JavaScript

[8] The Zig programming language

[9] My interview with Richard Davison, the creator of LLRT

[10] Production-Ready Serverless workshop

The post First impressions of the fastest JavaScript runtime for Lambda appeared first on theburningmonk.com.

What’s the best way to migrate Cognito users to a new user pool?

Yan Cui — Wed, 21 Feb 2024 01:18:10 +0000

I shared on LinkedIn [1] the other day that you should avoid using Cognito subs as the user ID for your system. One of the reasons is that a user’s sub does not carry over when you migrate to a new user pool.

Someone responded by asking “Is this type of migration really that common that it necessitates consideration?”

It’s a great question, so let’s dive into it.

When should you consider a user pool migration?
How best to do this migration?

When to consider user pool migration

Migrating users from one Cognito User Pool to another can be highly disruptive. But sometimes it’s our last resort.

Here are some common reasons why you have to consider it:

Changing user pool settings

Many user pool settings cannot be changed after the user pool is created. Most notably, immutable attributes cannot be changed to mutable later. For example, you might mark the email attribute as immutable and then change your mind later.

This is by far the most common reason for migration between Cognito user pools.

Reorganization of environments/apps

Another reason is when you need to reorganize your application(s) and AWS environments. Maybe you want to consolidate multiple user pools into one. Or maybe you want to move your production users to a new production account.

This might be to meet regulatory requirements.

For example, you may need to limit the no. of people who have access to production user data. But if you had one AWS account for all environments then it’s difficult to meet that access control requirement. So to stay compliant, you have to move the production user data into its own account and restrict access to it.

Data sovereignty requirements

You may need to move some users to a different region to comply with data protection regulations. This means moving some users from one user pool to another.

The BEST way to migrate to a new user pool

The challenge with this migration is that it’s not possible to extract the user password from Cognito.

This is a good thing. It shows that Cognito follows security best practices and does not store user passwords in plain text.

But it makes our lives more difficult during a migration.

Broadly speaking, here are three ways to approach this migration:

Option 1. Add existing users to the new user pool and set a temporary password. Email the users with their temporary password and ask them to log in and change the password.

Option 2. Use the migrate user Lambda trigger [2] to migrate users from the old user pool when the user next signs in. This trigger fires whenever a user is not found at the time of sign-in or in the forgot password flow. See the official guide [3] on how to use this trigger to import users into Cognito.
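Here is a rough sketch of what the migrate user trigger could look like. The trigger sources and response fields are the ones Cognito uses for this trigger. The authenticate and lookup helpers are hypothetical: they stand for whatever code you write to check the credentials against, or find the user in, the old user pool. I’m also assuming the email was already verified in the old pool.

```javascript
// authenticate(username, password) and lookup(username) are hypothetical helpers
// that talk to the OLD user pool. Both resolve to the user's attributes, or null.
const makeMigrateUserHandler = ({ authenticate, lookup }) => async (event) => {
    let user = null

    if (event.triggerSource === 'UserMigration_Authentication') {
        // sign-in: check the supplied password against the old user pool
        user = await authenticate(event.userName, event.request.password)
        if (user) {
            event.response.finalUserStatus = 'CONFIRMED'
        }
    } else if (event.triggerSource === 'UserMigration_ForgotPassword') {
        // forgot password: there is no password to check, just find the user
        user = await lookup(event.userName)
    }

    if (!user) {
        throw new Error('Bad username or password')
    }

    event.response.userAttributes = {
        email: user.email,
        email_verified: 'true' // assumption: already verified in the old pool
    }
    // don't send the usual welcome message to a migrated user
    event.response.messageAction = 'SUPPRESS'

    return event
}
```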

Both of these options suck…

Option 1 puts the burden of migration on the users. Some users will churn, and it’s not a rabbit that you can pull out of the hat too many times.

Option 2 has a loooong tail. You have to keep the Lambda trigger until the last user has been migrated. Chances are some users have churned and will never come back and sign in.

In practice, most people use a combination of these two options:

Use the migrate user Lambda trigger to migrate active users. Leave the trigger running for, say, 6 months.
After 6 months, disable the trigger and apply option 1) on all remaining users. Those who have not churned will log back in and set a new password.

And that leaves us with my favourite.

Option 3. Go passwordless.

If passwords make migration difficult, then ditch them! Use this opportunity and modernize your app with passwordless authentication.

Cognito doesn’t support passwordless authentication out-of-the-box. But you can implement these custom flows with Lambda triggers.
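To give you a flavour of these triggers, here is a sketch of the Define Auth Challenge trigger, which decides what happens next in a custom flow. The request and response fields are the ones Cognito passes to this trigger. The limit of three attempts is an arbitrary choice for the example. The triggers that create and verify the challenge are not shown.

```javascript
const MAX_ATTEMPTS = 3 // an arbitrary limit chosen for this sketch

const defineAuthChallenge = async (event) => {
    const session = event.request.session || []
    const last = session[session.length - 1]

    if (last && last.challengeResult === true) {
        // the user answered the last challenge correctly, so sign them in
        event.response.issueTokens = true
        event.response.failAuthentication = false
    } else if (session.length >= MAX_ATTEMPTS) {
        // too many wrong answers
        event.response.issueTokens = false
        event.response.failAuthentication = true
    } else {
        // present a (new) custom challenge, e.g. a one-time passcode
        event.response.issueTokens = false
        event.response.failAuthentication = false
        event.response.challengeName = 'CUSTOM_CHALLENGE'
    }

    return event
}
```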

I shared two ways to implement passwordless authentication with Cognito:

Using one-time passcodes [4]
Using magic email links [5]

Besides these, you should also consider social sign-in with Google, LinkedIn, X/Twitter and so on.

Conclusions

Migrating users to a new Cognito user pool is perhaps more nuanced than you might think. There are technical limitations to consider. And you have to worry about the impact on your user experience.

It’s something that I try to avoid if I can. But if I must, I prefer to turn my misfortunes into an opportunity and improve my sign-in experience with passwordless authentication.

Do you agree?

Links

[1] Why you should stop using Cognito subs as user ID

[2] Migrate user Lambda trigger

[3] Importing users into user pool with a user migration Lambda trigger

[4] Passwordless authentication with Cognito: one-time passcodes

[5] Passwordless authentication with Cognito: magic email links

The post What’s the best way to migrate Cognito users to a new user pool? appeared first on theburningmonk.com.

How to secure CI/CD roles without burning production to the ground

Yan Cui — Fri, 16 Feb 2024 00:43:03 +0000

By now, most of us have moved away from using IAM users for CI/CD pipelines. Instead, we’d use dedicated CI/CD roles, one for each pipeline. This forces us to consider who can assume this role.

Identity federation is widely supported by 3rd-party providers such as GitHub Actions [1]. So, no more putting IAM credentials in CI/CD tools and worrying that they might be compromised in a security breach [2].
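With identity federation, the question of who can assume the role is answered in the role’s trust policy. Here is a sketch of one for GitHub Actions. The account ID, repo and branch are placeholders, and restricting the role to a single branch is my assumption about how tightly you might want to scope it.

```javascript
// Builds a trust policy that lets GitHub Actions workflows from one repo
// and branch assume the CI/CD role. All inputs are placeholders.
const githubTrustPolicy = ({ accountId, repo, branch }) => ({
    Version: '2012-10-17',
    Statement: [{
        Effect: 'Allow',
        Principal: {
            Federated: `arn:aws:iam::${accountId}:oidc-provider/token.actions.githubusercontent.com`
        },
        Action: 'sts:AssumeRoleWithWebIdentity',
        Condition: {
            StringEquals: {
                'token.actions.githubusercontent.com:aud': 'sts.amazonaws.com',
                // only workflows running on this repo and branch can assume the role
                'token.actions.githubusercontent.com:sub': `repo:${repo}:ref:refs/heads/${branch}`
            }
        }
    }]
})

console.log(JSON.stringify(githubTrustPolicy({
    accountId: '123456789012',
    repo: 'my-org/my-service',
    branch: 'main'
}), null, 2))
```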

However, attackers can still compromise the pipeline through supply chain attacks. For example, by compromising a Docker image we depend on in our CI/CD pipeline. Or by compromising static analysis tools such as eslint [3].

So, the question of “How best to limit the CI/CD role’s permissions?” has come up many times during my Production-Ready Serverless [4] workshop.

Your instinct might be to lock down the CI/CD role to just what it needs. Because we all want to follow the principle of least privilege.

There are different ways to achieve this. Here are two common approaches:

Start with a blank slate and add permissions to the role until it’s able to deploy your application.
Start with the AdministratorAccess policy and allow the role to do everything. Use CloudTrail to track the actions performed by the role over some time. Then use the CloudTrail data to create a tailored set of IAM permissions.

The first approach is very time-consuming and laborious and is best avoided. But in truth, both approaches have significant drawbacks in the long run.
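To illustrate the second approach, here is a minimal sketch of how you might summarise CloudTrail records into the actions a role has used. It assumes you have already exported the records as a list of dictionaries, and the function name is hypothetical. Treat the output as a starting point only: CloudTrail event names do not always map one-to-one to IAM actions (some Lambda event names carry an API version suffix, for example), so the result still needs a manual review.

```python
from collections import defaultdict


def actions_from_cloudtrail(events: list[dict], role_arn: str) -> dict[str, set[str]]:
    """Group the event names recorded for one role by service prefix."""
    used = defaultdict(set)
    for event in events:
        # For assumed-role sessions, the role ARN is under sessionIssuer.
        issuer = (
            event.get("userIdentity", {})
            .get("sessionContext", {})
            .get("sessionIssuer", {})
        )
        if issuer.get("arn") != role_arn:
            continue
        # "s3.amazonaws.com" -> "s3"
        service = event["eventSource"].split(".")[0]
        used[service].add(event["eventName"])
    return dict(used)
```

Note that this only captures what the role did during the observation window. Anything that didn't happen in that window, such as a rollback, won't show up.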

Rollback surprises

For most people, CloudFormation rollbacks don’t happen often. Especially for those of us who don’t write CloudFormation templates by hand.

So when a deployment fails for the first time, you find out that the CI/CD role is missing a whole bunch of permissions! This can leave you in an awkward position where your stack is stuck in the ROLLBACK_FAILED state. And you are stuck until you resolve the missing permissions.

In essence, crafting a least privileged CI/CD role takes twice as much work as you think. Because you also need to factor in all the permissions for rollbacks.

A tax on productivity

Your architecture is not static. It needs to evolve with your application and your business needs. Every time you introduce a new service, you also need to revise the CI/CD role.

Want to start using OpenSearch? Not before you update the CI/CD role!

Want to start using Bedrock? Not before you … you get the point.

In organizations where the application team doesn’t own the CI/CD role, this productivity tax can be steep. It requires a lot of back-and-forth between teams and hits productivity hard.

And you likely have to do this for every service because every service has its own CI/CD pipeline.

This friction is a constant tax on your productivity. It hampers innovation and delays the adoption of new services into the tech stack. Because introducing a new service means new CI/CD permissions. Who in their right mind wants to deal with all that bureaucracy?

They are not as safe as you think

The reason why we pursue least privileged CI/CD roles is to reduce the blast radius of a security breach. If an attacker compromises our CI/CD pipeline, we want to limit what the attacker can do.

However, the CI/CD role likely needs to deploy IAM roles and Lambda functions. In the event of a compromise, a least privileged CI/CD role is not enough to stop the attacker.

Because the attackers can use the CI/CD role to create confused deputies to act on their behalf.

For example, attackers can use the CI/CD role to create IAM roles for them to assume. And they can also create Lambda functions to carry out malicious activities.

We can mitigate these risks to some extent with permission boundaries. But that requires lots more work and it’s hard to get right. And again, you have to do it for every CI/CD role.
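As a rough sketch of what that involves, the statements below only let the CI/CD role create or modify roles that carry a specific permission boundary, and stop it from removing the boundary afterwards. The boundary policy ARN is a placeholder and the function name is hypothetical. The boundary policy itself still has to be written, which is where most of the hard work is.

```python
def require_boundary_statements(boundary_arn: str) -> list[dict]:
    """IAM statements that force every role the pipeline touches to keep a boundary."""
    return [
        {
            "Sid": "CreateRolesOnlyWithBoundary",
            "Effect": "Allow",
            "Action": ["iam:CreateRole", "iam:AttachRolePolicy", "iam:PutRolePolicy"],
            "Resource": "*",
            # The request is allowed only if the role has this boundary attached.
            "Condition": {"StringEquals": {"iam:PermissionsBoundary": boundary_arn}},
        },
        {
            "Sid": "NeverRemoveBoundary",
            "Effect": "Deny",
            "Action": "iam:DeleteRolePermissionsBoundary",
            "Resource": "*",
        },
    ]
```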

Prefer a permissive role that is hard to abuse

It’s counterproductive to blindly pursue least privileged CI/CD roles. Beyond a certain point, you get diminishing returns on your investment.

That is, you have to work much harder to address security risks that are unlikely to happen.

With that in mind, here’s my preferred approach to securing a CI/CD pipeline.

Use a separate AWS account for each environment

This is something everyone should do. It insulates environments from each other. If one environment is compromised, it’s contained by the account boundary. If an attacker compromises your dev environment, they won’t be able to access your production data.

For large organizations, you should go a step further. I recommend having at least one account per team per environment. That way, teams are insulated from other teams’ mistakes.

Business-critical systems should have their own set of accounts too. That is one account per service per environment. It’s not always necessary. But it’s a good idea to protect these business-critical systems from other systems.

Service Control Policies (SCPs)

Use SCPs to deny access to unused resources. These broad strokes include denying access to unused AWS regions and unused services. This way, you can eliminate large parts of AWS as possible attack surfaces.

For example, if you’re not running any EC2 instances, then deny all EC2 activities. That’s one less target for crypto-jacking attacks. But hey, ECS is the new EC2 for crypto-jacking [5]. If you can’t disable ECS as a whole, then at least deny access to the regions where you don’t have any containers.
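Here is a minimal sketch of an SCP along those lines, built as a Python dictionary so it is easy to generate per organization. The list of exempted global services is my assumption and is not exhaustive, so check it against the services you actually rely on before using anything like this. The function name is hypothetical.

```python
def build_scp(allowed_regions: list[str], unused_services: list[str]) -> dict:
    """Deny everything outside the allowed regions, plus whole unused services."""
    # Assumption: these global services are exempt from the region check,
    # otherwise the region deny would break them.
    global_services = [
        "iam:*", "organizations:*", "sts:*",
        "cloudfront:*", "route53:*", "support:*",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnusedRegions",
                "Effect": "Deny",
                "NotAction": global_services,
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            },
            {
                "Sid": "DenyUnusedServices",
                "Effect": "Deny",
                # e.g. "ec2" -> "ec2:*"
                "Action": [f"{service}:*" for service in unused_services],
                "Resource": "*",
            },
        ],
    }
```

For example, build_scp(["eu-west-1"], ["ec2"]) denies all EC2 actions everywhere and denies every non-global action outside eu-west-1.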

Attribute-Based Access Control (ABAC)

Support for ABAC [6] is improving all the time, although it’s still lacklustre in its current form. But many services already support the aws:ResourceTag/${TagKey} and aws:PrincipalTag/${TagKey} conditions.

These conditions can be used to stop attackers from accessing and creating resources. They can be added to the IAM permissions of the CI/CD role. Or they can be enforced through a permission boundary.
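As a sketch, a statement that only lets the role touch resources whose tag matches its own might look like the following. The tag key "team" is a made-up example and the function name is hypothetical. Not every service or action supports these condition keys, so treat this as an illustration rather than something to paste in.

```python
def abac_statement(actions: list[str], tag_key: str) -> dict:
    """Allow the actions only on resources tagged like the calling principal."""
    return {
        "Effect": "Allow",
        "Action": actions,
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                # IAM resolves the ${...} policy variable at evaluation time.
                f"aws:ResourceTag/{tag_key}": "${aws:PrincipalTag/" + tag_key + "}"
            }
        },
    }
```

With this in place, a call to abac_statement(["lambda:UpdateFunctionCode"], "team") only permits updates to functions whose team tag equals the team tag on the CI/CD role.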

To make them more effective, the CI/CD role should be denied the ability to find out about itself. That way, the attacker can’t find out what tags the CI/CD role has and what tags they need to use to create new resources.
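One way to sketch this is an explicit deny on the IAM read actions, scoped to the role's own ARN. The role ARN below is a placeholder and the function name is hypothetical. The list of actions is my assumption of what "finding out about itself" covers and may not be complete.

```python
def deny_self_inspection(role_arn: str) -> dict:
    """Stop the role from reading its own tags and policies."""
    return {
        "Sid": "DenySelfInspection",
        "Effect": "Deny",
        "Action": [
            "iam:GetRole",  # the response includes the role's tags
            "iam:ListRoleTags",
            "iam:GetRolePolicy",
            "iam:ListRolePolicies",
            "iam:ListAttachedRolePolicies",
        ],
        # Scoped to the CI/CD role itself, not to every role in the account.
        "Resource": role_arn,
    }
```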

Doing so helps us stop the attacker from accessing our resources through the CI/CD role. But they might target the following services to execute malicious code or escalate their privilege:

Lambda
EC2
ECS
IAM
CodePipeline
CloudFormation

Luckily, all these services (and many others) support the aws:ResourceTag/${TagKey} condition. We can use this condition to stop the attacker from creating new resources.

A permissive CI/CD role

After doing the above:

We have removed the attack surface associated with unused regions and services.
We have limited the blast radius of a compromise to individual teams and services.
We have made it difficult for attackers to abuse the CI/CD role by requiring correct tags on resources. Which the attacker is not able to figure out easily because the CI/CD role is not allowed to describe itself.

Now we can allow ourselves to have a permissive CI/CD role that is not a pain in the ass to create and maintain! Because the role will be difficult to abuse, and there is a ceiling to what it can actually do.

By “permissive”, I don’t necessarily mean “AdministratorAccess”. But you can give broad read & write permissions to all the services you use. Instead of picking out just the actions you need with a fine-tooth comb!
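For illustration, such a role's policy could be as simple as a wildcard per service you use. The service list here is a made-up example and the function name is hypothetical. In practice, you would combine this with the tag conditions and SCPs discussed earlier.

```python
def permissive_policy(services: list[str]) -> dict:
    """Grant every action on the listed services, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BroadAccessToServicesWeUse",
                "Effect": "Allow",
                # De-duplicate and sort so the generated policy is stable.
                "Action": [f"{service}:*" for service in sorted(set(services))],
                "Resource": "*",
            }
        ],
    }
```

Adopting a new service then means adding one entry to the list, rather than hunting down every individual action the deployment needs.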

Conclusions

Achieving a balance between security and productivity is critical.

Adopting more permissive CI/CD roles introduces higher risks. But it’s counterbalanced with the aforementioned compensating controls and security practices.

You also need continuous monitoring and regular security audits. The SCPs and ABAC settings should be reviewed regularly.

It’s equally important to cultivate a security-conscious culture within teams. When everyone is aware of and responsible for security, the CI/CD pipeline becomes more resilient.

Ultimately, the goal is not to eliminate risk but to manage it efficiently. When we combine a permissive role with strong security controls, we can be safe and productive!

Links

[1] Identity federation for GitHub Actions on AWS

[2] CircleCI incident report for Jan 4, 2023 security breach

[3] ESLint: Compromising the Build using Supply Chain Attack

[4] Production-Ready Serverless workshop

[5] Tales from the cloud trenches: Amazon ECS is the new EC2 for crypto mining

[6] Which AWS services support ABAC

The post How to secure CI/CD roles without burning production to the ground appeared first on theburningmonk.com.