<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eoin Shanaghy</title>
    <description>The latest articles on DEV Community by Eoin Shanaghy (@eoinsha).</description>
    <link>https://dev.to/eoinsha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F116529%2Fbee08367-6062-4b28-9a45-765d121b940b.jpeg</url>
      <title>DEV Community: Eoin Shanaghy</title>
      <link>https://dev.to/eoinsha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eoinsha"/>
    <language>en</language>
    <item>
      <title>How to use EventBridge as a Cross-Account Event Backbone</title>
      <dc:creator>Eoin Shanaghy</dc:creator>
      <pubDate>Tue, 24 May 2022 16:01:48 +0000</pubDate>
      <link>https://dev.to/eoinsha/how-to-use-eventbridge-as-a-cross-account-event-backbone-5fik</link>
      <guid>https://dev.to/eoinsha/how-to-use-eventbridge-as-a-cross-account-event-backbone-5fik</guid>
      <description>&lt;p&gt;Two of the emerging best practices in modern AWS applications are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a separate AWS account per application&lt;/li&gt;
&lt;li&gt;Decouple communication between systems using events instead of point-to-point, synchronous communication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post will show how EventBridge can provide an ideal event backbone for applications in multiple AWS accounts, achieving both of these best practices with minimal complexity. To illustrate these concepts, we will use the example of an eCommerce application. We will simplify it and imagine just two services, each running in a separate AWS account. Simple applications do not need to have all components separated in their own accounts like this, but as the product grows and each service becomes sufficiently complex, with different teams involved, it becomes necessary to use dedicated accounts.&lt;/p&gt;

&lt;p&gt;Our eCommerce application has two services - the Order Service and the Delivery Service. There is a logical link between orders and the delivery of products being fulfilled, but these are regarded as separate services that should not be tightly coupled.&lt;/p&gt;

&lt;p&gt;In this multi-account setup, we will also dedicate an account for the event backbone itself. This gives us three accounts, with the potential to add more as we grow the number of services or applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8eab5jee6gieve4rc67y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8eab5jee6gieve4rc67y.png" alt="A simple diagram showing three AWS accounts for order service, delivery service and the event backbone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we dive into the solution, let's talk about why account separation and event-driven communication are regarded as best practices. We want to avoid the mistake of accepting 'best practices' without understanding the reasoning!&lt;/p&gt;

&lt;h2&gt;
  
  
  Separate AWS accounts
&lt;/h2&gt;

&lt;p&gt;An AWS account is the ultimate boundary for permissions and quotas. IAM provides fine-grained access control and is one of the most impressive and fundamental AWS services there is. There is no escaping the fact that enforcing &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege" rel="noopener noreferrer"&gt;the principle of least privilege&lt;/a&gt; takes significant time and a solid understanding of IAM. If you mix applications, environments or teams in a single AWS account, you rely on permissions boundaries, complex IAM policies and strict policy change processes to avoid impacting others' resources. This results in a significant overhead in engineering time as well as the risk of human error. By giving each application and environment a separate account, you can rely more on the account boundary to protect against the workloads and actions of others. You can still enforce least privilege, but in a much more agile way, by detecting and continually improving policies instead of strict enforcement up front. For development accounts, this is a big productivity boost.&lt;/p&gt;

&lt;p&gt;AWS quotas put limits on the number of requests you can make or resources you can utilise by default in an account. You have a mix of soft quotas, which can be raised if you ask, and hard quotas, which, more often than not, cannot budge. There may be some exceptions here if you have specific needs and engage directly with your AWS account manager. Once you mix workloads in an AWS account, those workloads share quotas. This can limit the scalability of each application or, more drastically, allow one application to stop another from operating simply by reaching a quota. Take AWS Lambda as an example. If your region has a default quota of 1,000 concurrent executions and one application reaches it, every other workload using Lambda in that account will be throttled. Using separate accounts removes the risk of this cross-application side effect.&lt;/p&gt;

&lt;p&gt;The drawback of using separate accounts for everything is the account management overhead. If you end up with more than a handful of accounts (hundreds or even thousands is not uncommon!), account automation is a must. I recommend a solution based on &lt;a href="https://github.com/org-formation/org-formation-cli" rel="noopener noreferrer"&gt;org-formation&lt;/a&gt; for this. To read more about the benefits of a multi-account strategy, check out &lt;a href="https://cloudash.dev/blog/aws-multi-account-benefits" rel="noopener noreferrer"&gt;this post&lt;/a&gt; by Tomasz Łakomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event based communication
&lt;/h2&gt;

&lt;p&gt;The case for event-based communication is less clear-cut and more a matter of nuanced trade-offs. With synchronous communication, an application sends a request to a known address and waits to receive a response. The advantage for a developer or architect is that you know which service you are addressing and you can follow the flow of logic and data clearly. There are many disadvantages, however.&lt;/p&gt;

&lt;p&gt;Synchronous communication means you have to know the address of the application or service that is processing your request. This is called &lt;em&gt;location coupling&lt;/em&gt;, and it means you need a mechanism to update the address in all clients if it changes. Service discovery solutions address this, but they bring their own complexity. With synchronous communication, you also get &lt;em&gt;temporal coupling&lt;/em&gt;, since the action of making a request is bound in time to the processing of that request. Temporal coupling has a greater impact, since it results in failures when the request processor is not online, not reachable, or just busy with other requests. Temporal coupling means that the receiver must scale exactly in line with the request volume.&lt;/p&gt;

&lt;p&gt;Asynchronous (event-driven) communication can remove these forms of coupling. Instead of sending events or requests to a known receiver, you send events to a bus, queue or topic. The receiver can scale independently and even delay processing. Message durability provided by the bus or queue can ensure that events don't get lost or undelivered, even if the event processor is temporarily offline.&lt;/p&gt;

&lt;p&gt;That said, asynchronous communication is harder to reason about. It becomes more difficult to follow the flow of data and logic. I would say it requires a mindset change for engineers and also means you need better observability tooling to capture event flows.&lt;/p&gt;

&lt;p&gt;While event-driven seems to be the more architecturally sound approach, there is still a case for going synchronous. We have all become used to integrating SaaS platforms using APIs and webhooks. This is essentially all synchronous communication. It has become a de-facto standard for SaaS product integration because it is easy for the consumer to understand, get started and troubleshoot. It shifts the burden to the SaaS provider who now has to ensure the API is always available, robust and scalable.&lt;/p&gt;

&lt;p&gt;Even though I'm a big fan of event-driven, I still think there's a valid case for good, well-documented, synchronous APIs where simplicity and clarity are more important than decoupled perfection. A well-balanced enterprise architecture might combine a small number of REST APIs at high-level boundaries across distinct applications with asynchronous messages for callbacks and updates as well as lower-level, inter-service communication.&lt;/p&gt;

&lt;p&gt;We have covered the reasons for these two underlying best practices. Let's now dive into our cross-account, event-driven solution using EventBridge!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Event Backbone?
&lt;/h2&gt;

&lt;p&gt;An Event Backbone is simply an event communication mechanism that serves multiple applications. The concept evolved from the idea of an Enterprise Service Bus (ESB) in the era of service-oriented architecture (SOA). ESBs, however, have a broader set of capabilities, including more complex routing, extensive transformation capabilities and even business rule execution. An event backbone is fundamentally simpler, focusing on scalable, reliable event transport and ensuring that business logic belongs to the services and not the bus.&lt;/p&gt;

&lt;p&gt;The term is &lt;a href="https://kgb1001001.github.io/cloudadoptionpatterns/Event-Based-Architecture/Event-Backbone/" rel="noopener noreferrer"&gt;commonly used&lt;/a&gt; for such systems based on Apache Kafka, since Kafka was one of the first technologies that enabled event backbones for microservice communication with massive scale and performance. Since Kafka was first released over a decade ago, cloud managed services have evolved to the degree where you don't need Kafka to have a scalable, reliable event backbone. Amazon EventBridge is the most obvious example, since it has managed to pull off the amazing feat of having a large feature set and massive scalability while remaining one of the simplest cloud services there is.&lt;/p&gt;

&lt;p&gt;If you are a Kafka fan, AWS is putting effort into reducing the complexity with Amazon MSK (Managed Streaming for Apache Kafka), including the MSK Serverless option. I would compare MSK to EventBridge in the same way I would compare EKS to Fargate or Lambda: you get a lot more control and configurability, but even with the AWS managed service, you still have plenty of complexity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jhf8xkqo29yoeb7x6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jhf8xkqo29yoeb7x6l.png" alt="The three accounts showing EventBridge as the selected technology for the event backbone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The beauty of something like EventBridge is that the investment is so low. If your needs evolve, you can adapt and use alternative options for specific cases. You are not stuck with it because of a large investment in infrastructure or training. If you need durability, add SQS! If you need lower-latency, ordered streams, you can add Kinesis! It's possible to build an event backbone on Kinesis or SNS/SQS, but EventBridge is still the best place to start: it integrates with more services and has really good cross-account support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-account EventBridge
&lt;/h2&gt;

&lt;p&gt;We already mentioned that EventBridge has good support for cross-account scenarios. With EventBridge, you can create a Rule with any other EventBridge &lt;em&gt;bus&lt;/em&gt; as a target. This bus can be in a different account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg5ec8dd6kky1pdwwqxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg5ec8dd6kky1pdwwqxu.png" alt="Cross account EventBridge"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this to work, the target bus should have a policy that allows the source account to send events to it.&lt;/p&gt;
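&lt;p&gt;As a sketch, such a resource policy can be generated programmatically. The following Python snippet builds an illustrative bus policy allowing a set of sender accounts to call &lt;code&gt;events:PutEvents&lt;/code&gt; on the target bus (the ARN and account IDs are placeholders, not values from any real deployment):&lt;/p&gt;

```python
import json

def allow_put_events(bus_arn, sender_account_ids):
    """Build an illustrative EventBridge bus resource policy that lets
    each sender account call events:PutEvents on the target bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"AllowAccount{account_id}",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
            for account_id in sender_account_ids
        ],
    }

# Placeholder bus ARN and account IDs
policy = allow_put_events(
    "arn:aws:events:eu-west-1:111111111111:event-bus/global-bus",
    ["222222222222", "333333333333"],
)
print(json.dumps(policy, indent=2))
```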

&lt;p&gt;Now, let's imagine this idea at a larger scale, where we have multiple accounts, each with their own applications or services. Where does each application need to send events? There are a few options here. If you want to explore all the options, take a look at &lt;a href="https://youtu.be/Wk0FoXTUEjo" rel="noopener noreferrer"&gt;this great talk from re:Invent 2020 on Building event-driven architectures&lt;/a&gt;. In this article, I'll focus on my preferred option, referred to in that video as the "single-bus, multi-account pattern". There are in fact multiple buses, but a central bus in a dedicated account is used to route messages to multiple accounts, each with their own local bus.&lt;/p&gt;

&lt;p&gt;The important characteristics of this architecture are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every service sends events to a global bus: a dedicated bus in a separate account&lt;/li&gt;
&lt;li&gt;Every service receives events from a local bus in its own account&lt;/li&gt;
&lt;li&gt;The global bus has rules to route all events to every local bus except the local bus of the event sender. This could be classified as a &lt;em&gt;fan-out pattern&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Every account comes with a &lt;code&gt;default&lt;/code&gt; EventBridge bus, so it's not mandatory to create custom buses. We do so to control permissions at the bus level, and to separate these custom events completely from the AWS service events that are sent to the default bus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need a global bus &lt;em&gt;and&lt;/em&gt; local buses?
&lt;/h2&gt;

&lt;p&gt;You might ask why services can't send events to their local bus instead of the global bus. Since each service receives events from its local bus, should it not publish there too? Apart from adding an additional layer, it's simply not possible with EventBridge. You cannot have events transitively routed to a third bus (&lt;code&gt;local -&amp;gt; global -&amp;gt; local&lt;/code&gt;). Only one cross-account hop is allowed in the chain (&lt;code&gt;global -&amp;gt; local&lt;/code&gt;). This is covered in the &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-cross-account.html" rel="noopener noreferrer"&gt;EventBridge documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If a receiver account sets up a rule that sends events received from a sender account on to a third account, these events are not sent to the third account.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You might also wonder why we can't get rid of the local buses altogether and just have the global bus, letting all services send and receive events to and from it. There are two main reasons against this approach. First, to receive messages from a bus in another account, you would have to create rules in that account's bus for every pattern you want to match. This is not a clean separation of concerns. Secondly, even if you did create rules in the global bus, an EventBridge rule cannot invoke an arbitrary cross-account target, such as a Lambda function; the only supported cross-account target is another EventBridge bus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-account EventBridge backbone example
&lt;/h2&gt;

&lt;p&gt;Let's return to our eCommerce use case. Our application has two services - the Order Service and the Delivery Service. In a real-world scenario, these systems have enough features and logic to warrant separating them into different accounts. There is a logical link between orders and the delivery of products being fulfilled, but these are regarded as separate services that should not be tightly coupled.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When orders are created, we want the delivery service to be notified.&lt;/li&gt;
&lt;li&gt;When deliveries are sent, we want to update orders accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have two services and three accounts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Order Service account, which has the order service logic and its own "local" EventBridge bus&lt;/li&gt;
&lt;li&gt;The Delivery Service account, which has the delivery service logic and also has its own "local" EventBridge bus&lt;/li&gt;
&lt;li&gt;The Global Bus account, which only has a "global" EventBridge bus. This is used to route messages to other accounts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The flow of events for the order creation use case is as follows.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An HTTP POST API is used to create an order. The backing Lambda function generates an order ID and sends an &lt;code&gt;Order.Created&lt;/code&gt; event to the global bus.&lt;/li&gt;
&lt;li&gt;The delivery service picks up the &lt;code&gt;Order.Created&lt;/code&gt; event from its local bus, processes the order, and sends a &lt;code&gt;Delivery.Updated&lt;/code&gt; event including all the important delivery details to the global bus.&lt;/li&gt;
&lt;li&gt;The order service picks up the &lt;code&gt;Delivery.Updated&lt;/code&gt; event from its local bus, and finally sends an &lt;code&gt;Order.Updated&lt;/code&gt; event to the global bus.&lt;/li&gt;
&lt;/ol&gt;
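&lt;p&gt;To make step 1 concrete, here is a sketch of the &lt;code&gt;PutEvents&lt;/code&gt; entry the order service might build. The bus name, source and payload fields are illustrative, not taken from the example repository:&lt;/p&gt;

```python
import json

# A sketch of the PutEvents entry the order service might send for step 1.
# Bus name, source and detail contents are illustrative placeholders.
order_created_entry = {
    "EventBusName": "global-bus",
    "Source": "order.service",
    "DetailType": "Order.Created",
    "Detail": json.dumps({
        "metadata": {"correlationId": "abc-123"},
        "data": {"orderId": "order-001", "customerId": "cust-42"},
    }),
}

# With boto3, this entry would be passed to
# events_client.put_events(Entries=[order_created_entry])
print(order_created_entry["DetailType"])
```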

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08yys99sjzmwe2ij6zn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08yys99sjzmwe2ij6zn3.png" alt="Distributing cross-account events using a Global Bus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example Source&lt;/strong&gt;&lt;br&gt;
 The full source code with documentation for this is available on &lt;a href="https://github.com/fourTheorem/cross-account-eventbridge/" rel="noopener noreferrer"&gt;github.com/fourTheorem/cross-account-eventbridge&lt;/a&gt;. It includes a CDK pipeline for deployment of all resources to the three accounts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Global bus event rules
&lt;/h2&gt;

&lt;p&gt;Events cannot be sent anywhere in EventBridge without a rule. Rules can be based on a schedule or an event pattern. For our backbone, we need to create pattern-based routing rules in the global bus. We create a single rule for each service account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;eventPattern:
  account:
    - 'anything-but': '123456789012'  # Anything but the source account
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This rule will route all events to every service account except the one that sent the message.&lt;/p&gt;
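&lt;p&gt;The routing behaviour can be illustrated with a small Python sketch that mimics the &lt;code&gt;anything-but&lt;/code&gt; fan-out decision. The account IDs and bus names are placeholders:&lt;/p&gt;

```python
def route_targets(event_account, local_buses):
    """Sketch of the global bus fan-out: an event goes to every local
    bus except the one in the account that published it, mirroring an
    'anything-but' rule per service account."""
    return [
        bus for account, bus in local_buses.items()
        if account != event_account
    ]

# Placeholder account IDs and bus names
local_buses = {
    "111111111111": "order-local-bus",
    "222222222222": "delivery-local-bus",
}
print(route_targets("111111111111", local_buses))  # ['delivery-local-bus']
```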

&lt;h2&gt;
  
  
  Logging events for debugging and auditing
&lt;/h2&gt;

&lt;p&gt;One of the challenges people encounter with EventBridge when using it for the first time relates to observability. It can be difficult to understand which events are flowing through a bus and see their contents so that you can troubleshoot delivery failures. A simple way to address this is to create a rule to capture and log all events to CloudWatch Logs. How do you capture all events? EventBridge rules require you to have at least one condition in your filter, but a prefix match expression with an empty string will capture all events from any source:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;eventPattern:
  source:
    - prefix: ''
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h2&gt;
  
  
  Event structure, schemas and validation
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Parts of this section were taken from &lt;a href="https://www.fourtheorem.com/blog/what-can-you-do-with-eventbridge" rel="noopener noreferrer"&gt;What can you do with EventBridge? (fourTheorem blog)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With EventBridge, you have no obligation to provide a schema for your events but there is support if you wish to do so. Without a schema, you can be flexible in how you evolve the event structure but it can also lead to confusion for other developers who are trying to consume your events or even publish them in a way that is consistent with the organisation. This is even more important when we are talking about an event backbone, since you can assume that producers and consumers are in different teams or departments.&lt;/p&gt;

&lt;p&gt;Start with a clear set of principles for the structure of these events and how to manage changes in event structure over time. With EventBridge, each event can contain the following properties:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Property&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;This defines the message origin. For AWS services, this might be something like &lt;code&gt;aws.config&lt;/code&gt; but for your custom events you could specify the application or service name, like &lt;code&gt;order.service&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;DetailType&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;This usually denotes the event type, for example, &lt;code&gt;Order.Created&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;EventBusName&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;The name of your custom bus, or &lt;code&gt;default&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Detail&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;This is the JSON-encoded payload of the message and it can contain relevant details about your event (e.g. order ID, customer name, product name, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The event structure can represent a contract between the producer and the consumer. Once a consumer strictly relies on fields being available in the &lt;code&gt;Detail&lt;/code&gt; payload, you have semantic coupling between producer and consumer. There is a balance to be struck between including as much detail as possible in the message and reducing this semantic coupling. Too little data means that consumers will likely have to go and fetch extra data from the originating system.&lt;/p&gt;

&lt;p&gt;Too much data for your events means consumers come to rely on all that data in a certain structure, semantically coupling it to the producer and making it hard to change the structure later. A reasonable approach here is to start with less data and add properties incrementally as the need arises.&lt;/p&gt;

&lt;p&gt;Some basic principles for event structure include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not rely on the &lt;code&gt;Source&lt;/code&gt; for pattern matching as a consumer. A consumer should not need to be concerned with where the event came from.&lt;/li&gt;
&lt;li&gt;Enforce a consistent structure for &lt;code&gt;DetailType&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Separate the &lt;code&gt;Detail&lt;/code&gt; into two child properties, &lt;code&gt;metadata&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt;. This allows you to add additional metadata without mixing it in with the event payload itself. &lt;/li&gt;
&lt;/ul&gt;
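&lt;p&gt;The &lt;code&gt;metadata&lt;/code&gt;/&lt;code&gt;data&lt;/code&gt; split can be sketched as a small helper. The field names here are illustrative, not a prescribed standard:&lt;/p&gt;

```python
import json

def make_detail(event_type, data, correlation_id):
    """Sketch of a Detail envelope split into metadata and data, so
    extra metadata can be added later without touching the payload."""
    return json.dumps({
        "metadata": {
            "eventType": event_type,
            "correlationId": correlation_id,
        },
        "data": data,
    })

detail = make_detail("Order.Created", {"orderId": "order-001"}, "abc-123")
parsed = json.loads(detail)
print(sorted(parsed))  # ['data', 'metadata']
```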

&lt;p&gt;While it is optional to &lt;em&gt;enforce&lt;/em&gt; schema validation, it is worthwhile if you are serious about EventBridge adoption at scale. EventBridge allows you to publish schemas in a registry for other teams and developers. This feature supports typed code binding generation too. If you do not want to create and upload the schema, you have the option to let EventBridge discover the schemas for you from events passing through the bus.&lt;/p&gt;

&lt;p&gt;If stricter schema enforcement is something you want to do, I'd recommend looking at the approach taken by PostNL as described by &lt;a href="https://twitter.com/donkersgood/" rel="noopener noreferrer"&gt;Luc van Donkersgoed&lt;/a&gt; in &lt;a href="https://www.youtube.com/watch?v=nyoMF1AEI7g" rel="noopener noreferrer"&gt;this insightful talk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For great ideas on structuring event payloads, take a read of &lt;a href="https://medium.com/lego-engineering/the-power-of-amazon-eventbridge-is-in-its-detail-92c07ddcaa40" rel="noopener noreferrer"&gt;Sheen Brisals' post on the Lego Engineering blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading and viewing
&lt;/h2&gt;

&lt;p&gt;We have covered the fundamental building blocks for a cross-account backbone with EventBridge. There is plenty more you can do with EventBridge, like using archives and event replaying, as well as integrating it into other AWS services. For a small amount of upfront effort and minimal ongoing maintenance, you can achieve a very flexible and scalable event bus for many applications across accounts.&lt;/p&gt;

&lt;p&gt;If you want to read more on EventBridge, &lt;a href="https://twitter.com/loige" rel="noopener noreferrer"&gt;Luciano Mammino&lt;/a&gt; and I have written an article and have a YouTube video and podcast episode to accompany it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.fourtheorem.com/blog/what-can-you-do-with-eventbridge" rel="noopener noreferrer"&gt;What can you do with EventBridge? (fourtheorem.com blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.fourtheorem.com/blog/what-do-you-need-to-know-about-sns]" rel="noopener noreferrer"&gt;What do you need to know about SNS? (fourtheorem.com blog)&lt;/a&gt; - includes a comparison of SNS and EventBridge&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/UjIE5qp-v8w" rel="noopener noreferrer"&gt;AWS Bites Episode 23: What’s the big deal with EventBridge? - YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/fourTheorem/cross-account-eventbridge" rel="noopener noreferrer"&gt;GitHub source code for this example&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also have a full series of podcast episodes covering all the main AWS event services, including a deep dive on Kafka, so check out the playlist &lt;a href="https://www.youtube.com/watch?v=CG7uhkKftoY&amp;amp;list=PLAWXFhe0N1vLHkGO1ZIWW_SZpturHBiE_" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd like to say a big thank you to &lt;a href="https://github.com/direnakkocdemir" rel="noopener noreferrer"&gt;Diren Akkoc Demir&lt;/a&gt;, &lt;a href="https://twitter.com/micktwomey" rel="noopener noreferrer"&gt;Michael Twomey&lt;/a&gt; and &lt;a href="https://twitter.com/loige" rel="noopener noreferrer"&gt;Luciano Mammino&lt;/a&gt; for their reviews and excellent feedback. 🙏 &lt;/p&gt;




&lt;p&gt;Eoin Shanaghy works at &lt;a href="https://fourTheorem.com" rel="noopener noreferrer"&gt;fourTheorem&lt;/a&gt;, architecting and building systems like this with a great team of colleagues. Please &lt;a href="https://twitter.com/eoins" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt; if you have any feedback or questions!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>eventbridge</category>
      <category>crossaccount</category>
      <category>kafka</category>
    </item>
    <item>
      <title>3 Ways to Read SSM Parameters </title>
      <dc:creator>Eoin Shanaghy</dc:creator>
      <pubDate>Tue, 23 Nov 2021 22:59:30 +0000</pubDate>
      <link>https://dev.to/eoinsha/3-ways-to-read-ssm-parameters-4555</link>
      <guid>https://dev.to/eoinsha/3-ways-to-read-ssm-parameters-4555</guid>
      <description>&lt;p&gt;AWS Systems Manager Parameter Store (or SSM Parameter Store) is a convenient way to store hierarchical parameters in AWS. You can use it for any configuration values, including secure values like passwords or API keys. It integrates well with other AWS services too.&lt;/p&gt;

&lt;p&gt;When it comes to reading parameters from SSM, there are a few available options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read at runtime&lt;/li&gt;
&lt;li&gt;Read at build time&lt;/li&gt;
&lt;li&gt;Read at deploy time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8zb65gkoc50lk90n82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8zb65gkoc50lk90n82.png" alt="Build time, deploy time and run time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When thinking about these options it's important to consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When can you be sure the parameter will be available?&lt;/li&gt;
&lt;li&gt;Can the parameter value change?&lt;/li&gt;
&lt;li&gt;Is the parameter a secret that should be protected from data leaks?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Reading at Runtime
&lt;/h2&gt;

&lt;p&gt;Reading at runtime is the safest way to both protect a secret parameter and deal with parameters that might not be available at build or deploy time. It also helps to deal with parameters whose values change frequently. Parameters can be read using the AWS SDK for your language of choice (e.g., &lt;a href="https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SSM.html" rel="noopener noreferrer"&gt;JS&lt;/a&gt;, &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm.html#SSM.Client.get_parameters" rel="noopener noreferrer"&gt;Python&lt;/a&gt;) or the &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_GetParameters.html" rel="noopener noreferrer"&gt;SSM API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's a good idea to think about how often you will make the call to read parameter values. Parameter Store has a low &lt;a href="https://docs.aws.amazon.com/general/latest/gr/ssm.html" rel="noopener noreferrer"&gt;throughput quota&lt;/a&gt; by default (40 per second) and reading the value every time you need it will also add latency. The typical solution here is to load at startup time. For example, in the context of a Lambda function, the parameter read could be outside the handler function, executed when your code is bootstrapped.&lt;/p&gt;

&lt;p&gt;If you are worried about keeping hold of stale values at runtime, you can re-read after a reasonable timeout. &lt;a href="https://pypi.org/project/ssm-cache/" rel="noopener noreferrer"&gt;ssm-cache&lt;/a&gt; in Python and &lt;a href="https://www.npmjs.com/package/@middy/ssm" rel="noopener noreferrer"&gt;Middy SSM&lt;/a&gt; for Node.js Lambda functions are two of the open source libraries that make this easy.&lt;/p&gt;
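
&lt;p&gt;The re-read-after-a-timeout pattern those libraries implement can be sketched in a few lines of Python. This is a simplified, hand-rolled version for illustration only; the &lt;code&gt;fetch&lt;/code&gt; callable and the 60-second TTL are assumptions, and libraries like ssm-cache handle the edge cases for you:&lt;/p&gt;

```python
import time

class CachedParameter:
    """Cache a parameter value, re-reading it after max_age seconds."""

    def __init__(self, fetch, max_age=60):
        self.fetch = fetch        # e.g. a closure around an SSM read
        self.max_age = max_age
        self.value = None
        self.loaded_at = None

    def get(self, now=None):
        # "now" is injectable to make the TTL behaviour easy to test.
        now = time.monotonic() if now is None else now
        if self.loaded_at is None or now - self.loaded_at > self.max_age:
            self.value = self.fetch()
            self.loaded_at = now
        return self.value
```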

&lt;p&gt;The runtime reading approach keeps your secret parameters in memory only, so you can be reassured that, as long as you don't write these values anywhere else, the risk of secrets being exfiltrated is lower than with files or environment variables containing plaintext secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading at build time
&lt;/h2&gt;

&lt;p&gt;There is a subtle difference between this option and the next one (reading at deploy time). Let's assume that you are using &lt;em&gt;Infrastructure as Code&lt;/em&gt; and you are packaging your code and infrastructure together at build time. This packaging process happens before you deploy to AWS.&lt;/p&gt;

&lt;p&gt;Reading at build time uses the SDK or API to load a parameter value and include it &lt;em&gt;somewhere&lt;/em&gt; in your code. This could be a generated &lt;code&gt;.env&lt;/code&gt; file or a value set in a Lambda function environment variable.&lt;/p&gt;

&lt;p&gt;With the Serverless Framework, this can look like the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;

&lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handleGetItem&lt;/span&gt;
    &lt;span class="nx"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;ssm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sr"&gt;/path/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is really convenient because the Serverless Framework takes care of retrieving the value with little code (&lt;a href="https://www.serverless.com/framework/docs/providers/aws/guide/variables#reference-variables-using-the-ssm-parameter-store" rel="noopener noreferrer"&gt;docs&lt;/a&gt;). The downside comes with secret parameters. Storing a protected secret value in code or in an environment variable leaves it open to multiple types of attack.&lt;/p&gt;

&lt;p&gt;The generated CloudFormation template for this example shows how the SSM parameter value is embedded in the JSON in clear text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;

   &lt;/span&gt;&lt;span class="nl"&gt;"Environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"Variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"SECRET_CODE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shhhhh!s3cr3t"&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The environment variable can also be seen in the Lambda function's configuration after we deploy. Once it's available as an environment variable, it can easily be discovered and exfiltrated by any malicious code running in the function or by any actor with access to the function's configuration.&lt;/p&gt;
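
&lt;p&gt;To see why this is risky, note that any code running in the function, including any third-party dependency, can read every environment variable. A small illustration (setting the variable here just simulates the deployed environment from the template above):&lt;/p&gt;

```python
import os

# Simulate the environment variable set by the deployed template.
os.environ["SECRET_CODE"] = "shhhhh!s3cr3t"

# Any imported library could do exactly this and send the value anywhere.
leaked = os.environ.get("SECRET_CODE")
print(leaked)  # the secret is available in plaintext
```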

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fese03hhkqqd4h332yf3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fese03hhkqqd4h332yf3h.png" alt="Parameter value in cleartext in the Lambda console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reading at build time also requires you to ensure that your build environment has credentials to read the parameter and that the parameter already exists, even if you are only building without deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading at Deploy Time
&lt;/h2&gt;

&lt;p&gt;If you are using AWS CloudFormation, or any of the great tools built on top of it, like &lt;a href="https://github.com/aws/aws-cdk" rel="noopener noreferrer"&gt;CDK&lt;/a&gt;, &lt;a href="https://www.serverless.com/framework" rel="noopener noreferrer"&gt;Serverless Framework&lt;/a&gt; or &lt;a href="https://github.com/aws/serverless-application-model" rel="noopener noreferrer"&gt;SAM&lt;/a&gt;, you have two options to load and use SSM Parameters at &lt;em&gt;deploy time&lt;/em&gt;. This means that responsibility for reading the parameters is deferred until after you build and when CloudFormation performs cloud-side deployment of your full stack. The advantages are that you do not need privileges to read the parameters at build time and you don't have to ensure the values exist until it comes to deployment in any given environment.&lt;/p&gt;

&lt;p&gt;The first option is using a special syntax for CloudFormation Parameters. In raw CloudFormation YAML, the declaration looks something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;UserPoolArnParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SSM::Parameter::Value&amp;lt;String&amp;gt;&lt;/span&gt;
      &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/user-service/user-pool-arn&lt;/span&gt;

  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;CognitoAuthorizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::ApiGateway::Authorizer&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="s"&gt;...&lt;/span&gt;
        &lt;span class="s"&gt;ProviderARNs&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;UserPoolArnParameter&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A complete example is available &lt;a href="https://github.com/fourTheorem/slic-starter/blob/fe1975df71e28b623f6ab5945069500833705c8a/checklist-service/serverless.yml#L72" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This syntax does not support &lt;code&gt;SecureString&lt;/code&gt; types. It is documented along with the other &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/parameters-section-structure.html" rel="noopener noreferrer"&gt;CloudFormation Parameter types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The second deploy-time option in CloudFormation uses a feature called &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/dynamic-references.html" rel="noopener noreferrer"&gt;dynamic references&lt;/a&gt; and this option &lt;em&gt;does&lt;/em&gt; support &lt;code&gt;SecureString&lt;/code&gt; types for a fixed set of services. CloudFormation dynamic references also support specific parameter versions. It looks like the following.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{{resolve:ssm:/path/to/parameter:VERSION}}&lt;/code&gt;&lt;br&gt;
or&lt;br&gt;
&lt;code&gt;{{resolve:ssm:/path/to/secure-parameter:VERSION}}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;:VERSION&lt;/code&gt; suffix is optional in both cases, with the latest version being automatically selected by default.&lt;/p&gt;

&lt;p&gt;While these options may be more verbose than the Serverless Framework's &lt;code&gt;ssm:&lt;/code&gt; variable syntax, CloudFormation syntax tends to be more stable, ensures deploy-time validation, and fits better with the cloud-side deployment model of CloudFormation stacks.&lt;/p&gt;

&lt;p&gt;AWS CDK provides &lt;a href="https://docs.aws.amazon.com/cdk/latest/guide/get_ssm_value.html" rel="noopener noreferrer"&gt;convenient functions&lt;/a&gt; for SSM parameters that dynamically generate either of these CloudFormation options for you. It might not be obvious that these are deploy-time lookups rather than lookups at CDK synthesis time, but you can check the synthesized CloudFormation output to see what is happening under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets Manager
&lt;/h2&gt;

&lt;p&gt;The focus here is on SSM Parameter Store parameters as opposed to Secrets Manager. The same principles apply, but with Secrets Manager you are more likely to be dealing with sensitive values that are rotated regularly, so the emphasis on secure, late binding of secret values is even more important! CloudFormation dynamic references support Secrets Manager as well, but there is no support in CloudFormation Parameters.&lt;/p&gt;
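
&lt;p&gt;For reference, a Secrets Manager dynamic reference follows the same &lt;code&gt;resolve&lt;/code&gt; pattern; the secret name and JSON key here are illustrative:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{{resolve:secretsmanager:my-secret:SecretString:password}}&lt;/code&gt;&lt;/p&gt;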

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The rule of thumb for reading SSM parameters is generally, read them as late as possible. This reduces the likelihood of reading stale values and of having stored secrets that can be compromised. If you need to look up the value before running your code, try one of the CloudFormation methods. Reading at build time should be avoided where possible.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>ssm</category>
      <category>secure</category>
    </item>
    <item>
      <title>Container Image Support in AWS Lambda Deep Dive</title>
      <dc:creator>Eoin Shanaghy</dc:creator>
      <pubDate>Tue, 01 Dec 2020 18:28:22 +0000</pubDate>
      <link>https://dev.to/eoinsha/container-image-support-in-aws-lambda-deep-dive-2keh</link>
      <guid>https://dev.to/eoinsha/container-image-support-in-aws-lambda-deep-dive-2keh</guid>
      <description>&lt;p&gt;&lt;em&gt;AWS today announced support for &lt;a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" rel="noopener noreferrer"&gt;packaging Lambda functions as container images&lt;/a&gt;!&lt;/em&gt; This post takes a look under the hood of this new feature from my experience during the beta period.&lt;/p&gt;

&lt;p&gt;Lambda functions started to look a bit more like container images when Lambda Layers and Custom Runtimes were announced in 2018, albeit with a very different developer experience. Today, the arrival of Container Image support for Lambda makes it possible to use actual Docker/OCI container images up to 10GB in size as the code and runtime for a Lambda function. &lt;/p&gt;

&lt;p&gt;But what about Fargate?! Wasn’t that supposed to be the serverless container service in AWS? While it might seem a bit confusing, support for Image Functions in Lambda makes sense and brings huge benefits that were probably never going to happen in the world of Fargate, ECS and EKS. Container Image deployment to Lambda enables Lambda’s incredibly rapid and responsive scaling as well as Lambda’s integrations, error handling, destinations, DLQs, queueing, throttling and metrics.&lt;/p&gt;

&lt;p&gt;Of course, Lambda functions are stateless and short-lived. That means that a lot of container workloads in their current form may still suit the Fargate/ECS/EKS camp better. Having personally spent too much time optimising Fargate task scheduling in the past, I will be glad to use Lambda for bursty batch processing workloads where the cost trade-offs work for the business. (We all want Lambda performance at Fargate Spot pricing!) Fargate will remain useful for more traditional, longer-lived workloads that don’t have a need to scale quickly to hundreds or thousands of containers.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the experience of building and deploying Lambda functions based on container images. In this post, we’ll cover development, deployment, versioning and some of the pros and cons of using image functions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Development
&lt;/h1&gt;

&lt;p&gt;Container images are typically designed to run either tasks or servers. Tasks usually take parameters in through the container’s CMD arguments and exit when complete. Servers will listen for requests and stay up until they are explicitly stopped. &lt;/p&gt;

&lt;p&gt;With Lambda functions, neither of these models applies. Instead, functions deployed from container images operate like functions packaged as ZIPs: the execution environment stays alive between invocations and handles events one at a time. To support this, a runtime fetches events from the Lambda environment and passes them to the handler function. Since this isn’t something that Docker/OCI containers support, images need to include the &lt;em&gt;Lambda Runtime Interface Client&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Images can be built with any tools that support The Open Container Initiative (OCI) Specification v1.0 or later or Docker Image Manifest V2 Schema 2.&lt;/p&gt;

&lt;p&gt;There are two options to pick from in order to build a container image for use with Lambda:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take an AWS Lambda base image and add your own layers for code, modules and data&lt;/li&gt;
&lt;li&gt;Take an existing base image and add the AWS Lambda Runtime Interface Client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk36efs7a0tmiowmuahmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk36efs7a0tmiowmuahmg.png" alt="Container Image packaging options for AWS Lambda"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS Lambda Runtime Interface Client is an open source native binary written in C++ with bindings for the supported runtimes (.NET Core, Go, Java, Node.js, Python and Ruby). Containers can use these flavours of the runtime client or implement the Lambda Runtime API to respond to and process events. This is the same API used in Custom Lambda Runtimes.&lt;/p&gt;

&lt;p&gt;Using the AWS-provided base images, the Dockerfile for building your image is relatively straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM public.ecr.aws/lambda/python:3.8
RUN mkdir -p /var/task
WORKDIR /var/task
COPY app/requirements.txt /var/task
RUN pip install -r requirements.txt
COPY app/ /var/task/app/
CMD ["app/handler.handle_event"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see how this works in practice, you can take a look at our example based on an AWS-provided Node.js base image. It uses Firefox, FFmpeg and Xvfb to capture a video of a webpage loading process and is &lt;a href="https://github.com/fourTheorem/lambda-image-page-record" rel="noopener noreferrer"&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To use your own base image instead of an AWS-provided image, you will need to add the runtime interface client. This is available for Python (&lt;a href="https://pypi.org/project/awslambdaric" rel="noopener noreferrer"&gt;PyPi&lt;/a&gt;), Node.js (&lt;a href="http://npmjs.com/package/aws-lambda-ric" rel="noopener noreferrer"&gt;NPM&lt;/a&gt;), Ruby (&lt;a href="https://rubygems.org/gems/aws_lambda_ric" rel="noopener noreferrer"&gt;Gem&lt;/a&gt;), Java (&lt;a href="http://search.maven.org/artifact/com.amazonaws/aws-lambda-java-runtime-interface-client" rel="noopener noreferrer"&gt;Maven&lt;/a&gt;), Go (&lt;a href="https://github.com/aws/aws-lambda-go" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;) and .NET (&lt;a href="https://www.nuget.org/packages/Amazon.Lambda.RuntimeSupport/" rel="noopener noreferrer"&gt;NuGet&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;An example of this can be found in our &lt;a href="https://github.com/fourTheorem/lambda-image-cxr-detection" rel="noopener noreferrer"&gt;PyTorch-based machine learning example&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deployment
&lt;/h1&gt;

&lt;p&gt;Functions deployed using container images must refer to a pre-existing repository + tag in ECR (other image repositories are not yet supported). Deployment of a function is therefore always a three-step process. This isn’t much different from functions packaged as a ZIP, where code is typically uploaded to S3 and referenced when the function is created or updated. It will however require some thought when planning your deployment. &lt;/p&gt;

&lt;p&gt;The three steps may be performed automatically by serverless packaging tools, but you may also wish to deploy the ECR repository and push the container images during separate build phases. In the latter case, there is more control but also more complexity, since the order is strict - you cannot deploy a function before a tagged image is in place in ECR. This is a consideration for organisations that want to leverage existing container image build and deployment pipelines and handle them separately from infrastructure deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0gaoz97re68pfqekoz4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0gaoz97re68pfqekoz4z.png" alt="Lambda function code and resources are deployed in three stages"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is important to note that the image tag is resolved to the image digest during function deployment time so changes to a tag after deployment have no effect.&lt;/p&gt;

&lt;p&gt;When it comes to the AWS SDK, CloudFormation and the CLI, the differences between image-packaged and ZIP-packaged functions are small.&lt;/p&gt;

&lt;p&gt;With boto3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lambda_client.create_function(
    FunctionName=name,
    PackageType=’Image’,
    Code={‘ImageUri’: ecr_repo_tag},..
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that you do not have to specify the handler when creating functions packaged as container images, since this can be configured in the image, most likely using the &lt;code&gt;CMD&lt;/code&gt; instruction. The entrypoint, command and working directory can also be overridden when the function is created or updated.&lt;/p&gt;

&lt;p&gt;With the Node.js AWS SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await lambda.updateFunctionConfiguration({
  FunctionName: functionName,
  Code: {ImageUri: ecr_repo_ui},
  PackageType: ‘Image’,
  ImageConfig: {Command: [‘index.handleEvent’]}
}).promise();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this time, once you create a function, it’s not possible to migrate to a different package type. This is set to change so you will soon be able to port existing functions packaged as a ZIP to container images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu9dpdmn93m12es6ohy63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu9dpdmn93m12es6ohy63.png" alt="Lambda Function configuration contains many properties. The code configuration references a ZIP inline or on S3 or a container image defined by a tagged ECR repository"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once a function has been deployed, it may not be available for invocation just yet! When the cold-start behaviour of Lambdas in VPCs was improved last year, you might recall that functions entered a Pending state while the VPC resources were created. You can &lt;a href="https://aws.amazon.com/blogs/compute/tracking-the-state-of-lambda-functions/" rel="noopener noreferrer"&gt;check the status&lt;/a&gt; of a function and wait for it to enter the &lt;em&gt;Active&lt;/em&gt; state. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyjhimlf84mwd2bqdgmrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyjhimlf84mwd2bqdgmrp.png" alt="A state machine for Lambda Functions as they are deployed, become inactive and fail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These states also apply to Lambdas using container images. Functions stay in the Pending state for a few seconds while the container image is “optimised” and cached for AWS Lambda.&lt;/p&gt;

&lt;h1&gt;
  
  
  Local Development and Testing
&lt;/h1&gt;

&lt;p&gt;When it comes to testing in development, I have not yet found a better experience than that provided by Docker tooling. Once you build a container image, you have an immutable artifact that you can run in development, test and production environments. You gain confidence that the runtime is consistent across all environments. When you need to make changes, you can iterate quickly, only modifying the layers that change. I would love to have similar speed and confidence in the development workflow for Lambda functions packaged as ZIPs but that has yet to materialise.&lt;/p&gt;

&lt;p&gt;Local function testing is enabled through the AWS Lambda Runtime Interface Emulator (RIE). The emulator is included in the AWS-provided base images. To test locally, you can just run the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 9000:8080 your_image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your function can then be triggered by posting an event using a HTTP request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d @test-events/event.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I found this development and testing workflow to be simple, efficient and easy to understand.&lt;/p&gt;

&lt;h1&gt;
  
  
  Versioning
&lt;/h1&gt;

&lt;p&gt;Within Lambda, versioning support remains the same. Each time you publish a function, a new version is created with a single numeric identifier that automatically increments. The most recent, unpublished code is always addressable as &lt;code&gt;$LATEST&lt;/code&gt;. Developers can also create named aliases that point to specific versions. Aliases or version numbers can form part of a fully-qualified ARN to invoke specific versions as part of a deployment strategy.&lt;/p&gt;

&lt;p&gt;A common complaint with this versioning system is that it is incompatible with semantic versioning widely used today. With container images, we can at least apply semantic version tags to images and use these tags to point to the function’s code in Lambda. Again, bear in mind that if the tag is moved to point to a different image digest, the Lambda function version will still point to the digest referenced at deployment time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv2b1ads4yi1ikekeqnhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv2b1ads4yi1ikekeqnhh.png" alt="Lambda function versions reference images defined by the digest resolve at deployment time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating a Lambda with the Serverless Framework, AWS SAM or other tools sometimes makes it feel like the “infrastructure” (resources) and code are deployed as a single unit. In reality, the code and the resources are deployed separately. Using container images with version tags will allow developers experienced with container deployment to employ a familiar versioning scheme.&lt;/p&gt;

&lt;h1&gt;
  
  
  Layers vs. Layers
&lt;/h1&gt;

&lt;p&gt;Let’s take a look at how container image layers differ from AWS Lambda layers. Lambda functions packaged as ZIPs can have up to five layers. The layers themselves are explicitly defined and packaged in a similar way to function code. When layers were introduced, they enabled teams to share pre-packaged libraries and modules or, more rarely, custom runtimes.&lt;/p&gt;

&lt;p&gt;Container image layers are very different. They are more implicitly defined and you can have as many as you need (&lt;a href="https://github.com/docker/docker.github.io/issues/8230#issuecomment-468630685" rel="noopener noreferrer"&gt;up to 127, it appears&lt;/a&gt;). Image layers are created as part of the image build and do not need to be individually deployed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lambda Layers&lt;/th&gt;
&lt;th&gt;Container Image Layers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Limited to 5&lt;/td&gt;
&lt;td&gt;Up to 127&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicitly defined&lt;/td&gt;
&lt;td&gt;Implicitly defined as part of the image build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single number versioning (though packaging as a SAR application allows semantic versioning)&lt;/td&gt;
&lt;td&gt;One or more layers can be tagged as an image using any versioning scheme&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployed as Lambda Layer resource&lt;/td&gt;
&lt;td&gt;Pushed automatically to the image registry (e.g., ECR) when an image is pushed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The simple yet powerful relationship between a Dockerfile and the layers is one of the benefits that made Docker and containers successful in the early days. It only takes a single line to add a new layer and the layer is automatically rebuilt only if that line or any of the previous layers change. Layer caching can make the development feedback loop super fast.&lt;/p&gt;

&lt;h1&gt;
  
  
  Runtimes
&lt;/h1&gt;

&lt;p&gt;Lambdas use AWS-provided runtimes by specifying one of &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html" rel="noopener noreferrer"&gt;the supported&lt;/a&gt; Node.js, Go, Java, Python, Ruby or .NET versions in the Runtime property. Custom runtimes, packaged as layers, are also possible by specifying the runtime property value “provided”.&lt;/p&gt;

&lt;p&gt;For Container Image Lambdas, the runtime is always essentially provided by the user. AWS does however provide container base images with runtimes for Java, Python, Node.js, .NET and Go. In addition, Amazon Linux base images for custom runtimes are available.&lt;/p&gt;

&lt;p&gt;To add Lambda support to existing container images, developers are required to include the Lambda Runtime Interface Client for the language of choice (Java, Python, Node.js, .NET, Go and Ruby). The runtime interface clients are open source implementations of the AWS Lambda Runtime API. Lambda functions of all types use this API to get events and provide results. The &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html" rel="noopener noreferrer"&gt;Runtime API&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html" rel="noopener noreferrer"&gt;AWS Lambda execution environment&lt;/a&gt; are nicely documented and worth reading to understand the context in which your function is invoked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flasgibb9t575bsywp7zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flasgibb9t575bsywp7zg.png" alt="The Runtime Interface Client talks to the Runtime API to pass events and responses to and from the handler"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  More Heavy Lifting
&lt;/h1&gt;

&lt;p&gt;You might notice that using container images gives you more control over the execution environment for a Lambda function. While there are clear benefits, something smells a bit unserverless about this! It is always worth &lt;a href="https://www.fourtheorem.com/blog/aws-compute" rel="noopener noreferrer"&gt;choosing the simplest option&lt;/a&gt;, the one that hands control and responsibility for maintenance and patches to the cloud provider. &lt;/p&gt;

&lt;p&gt;When you deploy a Lambda using a container image, you define the full code stack including OS, standard libraries, dependencies, runtime and application code. Even if you use an AWS-provided base image, you need a process to update the full image when that base image is patched. Make no mistake, this is extra heavy lifting that you should strive to avoid if possible. &lt;/p&gt;

&lt;p&gt;The premise of Lambda and serverless computing in general is to let you focus on the minimal amount of code needed to deploy features that are unique to you. The responsibility of managing and maintaining all these base layers is not something that comes for free. Container Image support may be a bridge to Lambda for many applications but it doesn’t mean it’s the final destination. All applications should aim to eliminate any of this maintenance burden over time. That means creating small, single-purpose Lambda functions using a supported runtime or, better still, looking for ways to eliminate that function altogether!&lt;/p&gt;

&lt;h1&gt;
  
  
  Issues Encountered
&lt;/h1&gt;

&lt;p&gt;There were only a few problems we encountered over the past few weeks of working with Container Image support. &lt;/p&gt;

&lt;p&gt;Firstly, we noticed that Billed Duration was calculated differently than it is for ZIP-packaged functions. The billed duration reported in each &lt;code&gt;REPORT&lt;/code&gt; log seemed to be the sum of Init Duration and Duration.&lt;/p&gt;

&lt;p&gt;Here is a log example for a ZIP-packaged function, where the billed duration was 300ms, even though the init duration was over 750ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: 5aa36dcc-db7b-4ce6-9132-eae75a97466f 
Duration: 292.24 ms Billed Duration: 300 ms
...
Init Duration: 758.94 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For one of our Image-packaged functions, we were being billed for 5200ms, the sum of the duration (502.81ms) and init duration (4638.39ms), rounded up to the nearest 100ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: 679c6323-7dff-434d-9b63-d9bdb054a4ba
Duration: 502.81 ms Billed Duration: 5200 ms
...
Init Duration: 4638.39 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spoke to AWS and they clarified that this billing behaviour is because we are using a &lt;em&gt;custom runtime&lt;/em&gt;, not because we are using a function packaged as an image. This is the same behaviour as with custom runtimes packaged as a ZIP.&lt;/p&gt;
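&lt;p&gt;The billing rule we observed for custom runtimes can be sketched as a quick calculation. This is a simplified model derived from the two REPORT logs above, not an official AWS formula:&lt;/p&gt;

```python
import math

def billed_duration_ms(duration_ms, init_duration_ms, custom_runtime):
    # With a custom runtime, the init duration is billed along with the
    # handler duration; with an AWS-managed runtime it is not.
    total = duration_ms + (init_duration_ms if custom_runtime else 0)
    # Rounded up to the nearest 100 ms (the billing granularity at the time).
    return math.ceil(total / 100) * 100

# The two REPORT logs above:
billed_duration_ms(292.24, 758.94, custom_runtime=False)  # 300
billed_duration_ms(502.81, 4638.39, custom_runtime=True)  # 5200
```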

&lt;p&gt;The second issue we encountered was for our machine learning case. We ran into an issue with PyTorch DataSet loaders which use Python multiprocessing Queues (and thus &lt;code&gt;/dev/shm&lt;/code&gt;) to allow parallel data fetching during model execution. Lambda does not provide &lt;code&gt;/dev/shm&lt;/code&gt;. This is a known issue with all types of Lambda functions (see &lt;a href="https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/" rel="noopener noreferrer"&gt;this article from AWS&lt;/a&gt; and StackOverflow &lt;a href="https://stackoverflow.com/questions/34005930/multiprocessing-semlock-is-not-implemented-when-running-on-aws-lambda" rel="noopener noreferrer"&gt;here&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;We had to work around it by setting the loader to fetch data in the main process rather than in separate worker processes. With Lambda’s remit expanding to handle larger modelling workloads, particularly with multiple vCPUs, issues like this are going to become more prevalent. The traceback is included here in case it helps anyone who's searching for this problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] OSError: [Errno 38] Function not implemented
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aws_lambda_powertools/logging/logger.py", line 247, in decorate
return lambda_handler(event, context)
File "/var/task/handler.py", line 12, in handle_event
result = run_test(jobs)
File "/src/aws_test_densenet.py", line 89, in run_test
for data in dataloaders[split_name]:
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 684, in __init__
self._worker_result_queue = multiprocessing_context.Queue()
File "/usr/local/lib/python3.8/multiprocessing/context.py", line 103, in Queue
return Queue(maxsize, ctx=self.get_context())
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 42, in __init__
self._rlock = ctx.Lock()
File "/usr/local/lib/python3.8/multiprocessing/context.py", line 68, in Lock
return Lock(ctx=self.get_context())
File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 162, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 57, in __init__
sl = self._semlock = _multiprocessing.SemLock(
[ERROR] OSError: [Errno 38] Function not implemented
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
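&lt;p&gt;In sketch form, the workaround looks like this: probe whether multiprocessing locks are available and fall back to loading data in the main process when they are not. The helper name is ours; pass its result as the &lt;code&gt;num_workers&lt;/code&gt; argument of the PyTorch &lt;code&gt;DataLoader&lt;/code&gt;:&lt;/p&gt;

```python
import multiprocessing

def safe_num_workers(preferred=4):
    # Lambda has no /dev/shm, so creating a multiprocessing lock raises
    # OSError: [Errno 38] Function not implemented. Probe for that and
    # fall back to 0, which makes DataLoader fetch data in the main process.
    try:
        multiprocessing.Lock()
        return preferred
    except OSError:
        return 0
```

&lt;p&gt;Used as, for example, &lt;code&gt;DataLoader(dataset, num_workers=safe_num_workers())&lt;/code&gt;, the same code then runs unchanged on Lambda and on a normal host.&lt;/p&gt;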



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;If you are comfortable with container tooling and deployment, container image support in AWS Lambda will be a big win. If, on the other hand, you are more familiar with ZIP-packaged Lambdas and see no need to use container tooling, there is no change required. This feature brings options for new use cases and new types of users with different concerns and perspectives. &lt;/p&gt;

&lt;p&gt;It feels like a lot of thought has gone into providing support for container images in a way that doesn’t disrupt the AWS Lambda experience for existing developers. There’s not a lot to learn if you are already familiar with containers and Lambda as separate topics. The addition of the open-source Runtime Interface Client and Runtime Interface Emulator is really welcome, as they let you get to grips with what’s going on under the hood. Even for a managed service, this kind of context can be valuable when unexpected problems arise.&lt;/p&gt;

&lt;p&gt;If you haven’t already, check out our high level overview of Container Image support for AWS Lambda &lt;a href="https://fourtheorem.com/blog/container-image-lambda" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>reinvent</category>
    </item>
    <item>
      <title>How to Create Secure Internal APIs on AWS without VPCs</title>
      <dc:creator>Eoin Shanaghy</dc:creator>
      <pubDate>Fri, 05 Jul 2019 11:24:46 +0000</pubDate>
      <link>https://dev.to/eoinsha/how-to-create-secure-internal-apis-on-aws-without-vpcs-5e08</link>
      <guid>https://dev.to/eoinsha/how-to-create-secure-internal-apis-on-aws-without-vpcs-5e08</guid>
      <description>&lt;p&gt;Let's say you have a serverless deployment in AWS with external, public-facing APIs and some Lambda functions behind those APIs. As your deployment grows, you are likely to need internal communication between isolated parts of your system (microservices). This can be divided into three categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Pub/Sub-style event-driven communication.&lt;/em&gt; This is where one service publishes events about what has occurred. Other, subscribing services react to events as required according to their responsibility.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Point-to-point event-driven communication.&lt;/em&gt; This is where you have a queue as part of a service so it can receive messages to be processed. For example, an Email Service might receive a message containing &lt;code&gt;to&lt;/code&gt;, &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;subject&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; fields and use these fields to construct and send an email using &lt;a href="https://aws.amazon.com/ses/" rel="noopener noreferrer"&gt;SES&lt;/a&gt; or &lt;a href="https://sendgrid.com" rel="noopener noreferrer"&gt;SendGrid&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Point-to-point synchronous communication&lt;/em&gt;. This is when a calling service needs to make a request to another service (which may be internal or external) and blocks waiting on the response. An example of this could be a request made to the User Service to find a user's email address based on a user ID found in an authorization header.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frstpbk5m6u4o7flh3agj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frstpbk5m6u4o7flh3agj.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Synchronous Calls Between Services
&lt;/h1&gt;

&lt;p&gt;This post is about the third case; point-to-point synchronous communication. In a serverless context, you could do this with a function-to-function invocation. In AWS, this would use &lt;a href="https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html#invoke-property" rel="noopener noreferrer"&gt;Lambda.invoke&lt;/a&gt;. The calling service could grant itself the right IAM permissions to invoke the target Lambda defined by its ARN. However, this can be seen as a &lt;a href="https://en.wikipedia.org/wiki/Code_smell" rel="noopener noreferrer"&gt;bad smell&lt;/a&gt;! It &lt;em&gt;leaks&lt;/em&gt; implementation details about the service you are calling. If you want to replace that Lambda service with something else, like an external web-service, a container-based implementation or even a managed AWS service with no Lambda code required, you have a problem. You would then have to replace the implementation and change the code in each individual calling service.&lt;/p&gt;

&lt;p&gt;Instead, a simple abstraction can be achieved by putting the implementation behind an HTTP interface. HTTP is ubiquitous, well understood and can be maintained while you change the underlying implementation without any significant drama. For our serverless, Lambda-backed implementation, this means putting the functions behind an &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;API Gateway&lt;/a&gt;, just like our external APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Options for Securing Internal APIs
&lt;/h1&gt;

&lt;p&gt;Then comes the crucial question: how do we secure them? We only want our internal APIs to be accessed internally. The way I see it, these are your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use VPCs&lt;/li&gt;
&lt;li&gt;Use an API Key to secure the API Gateway&lt;/li&gt;
&lt;li&gt;Use AWS IAM authorization on the API Gateway&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A VPC approach would require putting your invoking Lambdas in a VPC and defining the API Gateway as a private API with a VPC endpoint. &lt;strong&gt;Avoid VPCs for your Lambdas if at all possible!&lt;/strong&gt; A VPC restricts your Lambda scalability and adds an extra (8-10 second) cold start time. &lt;a href="https://twitter.com/theburningmonk" rel="noopener noreferrer"&gt;Yan Cui&lt;/a&gt; covers this topic really well in the &lt;a href="https://www.manning.com/livevideo/production-ready-serverless" rel="noopener noreferrer"&gt;Production-Ready Serverless&lt;/a&gt; video course. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UPDATE 2019/10/17&lt;/strong&gt;: &lt;em&gt;Since this post was first written, AWS have reworked the ENI allocation method for VPCs so the cold start penalty is starting to go away for Lambdas in VPCs. This is yet to be rolled out to all major regions but it will change the picture significantly. I would still avoid VPCs in many cases unless necessary because of the additional complexity of managing VPCs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;An API Key approach seems reasonable. You can generate an API Gateway API Key for the internal API service and share it with the invoking service. The key is added to the authorization header when making the request. I have done this and it works. The challenge was sharing the API Key. In order to do this, I had to create a custom CloudFormation resource to store the API Key in SSM Parameter Store so it could be discovered by other internal services with permissions to access the key. API Keys are intended for controlling access to external APIs with quotas, so internal APIs are not really their intended purpose. If you are interested in understanding how to do this, take a look at my solution &lt;a href="https://github.com/fourTheorem/slic-starter/blob/8e29ec7e688061f094cddd2357867a63ae470416/user-service/sls-resources.yml" rel="noopener noreferrer"&gt;here&lt;/a&gt;. For any questions, comment below or &lt;a href="https://twitter.com/eoins" rel="noopener noreferrer"&gt;tweet me&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Internal API Gateway Security with IAM
&lt;/h1&gt;

&lt;p&gt;The best practice recommendation is to use IAM authorization on the APIs. If you are using the Serverless Framework, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services/users/get.main&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user/{id}&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;authorizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws_iam&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will result in the &lt;code&gt;AuthorizationType&lt;/code&gt; being set to &lt;code&gt;AWS_IAM&lt;/code&gt; in the associated &lt;code&gt;ApiGateway::Method&lt;/code&gt; &lt;a href="https://docs.aws.amazon.com/apigateway/api-reference/resource/method/#authorizationType" rel="noopener noreferrer"&gt;CloudFormation resource&lt;/a&gt;. If you now try to invoke your API externally, you should get a &lt;code&gt;403 Forbidden&lt;/code&gt; response. Your API is now secured! So, how do we grant permissions to other internal services that need to call it?&lt;/p&gt;

&lt;p&gt;Now that we are using IAM authorization for the API Gateway, invoking services need to be granted IAM permissions to invoke it. For an AWS Lambda function, this means granting access to invoke the target API and to make requests with the relevant HTTP verbs (GET, POST, PUT, PATCH, etc.)&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
  &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;execute-api:Invoke&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;execute-api:GET&lt;/span&gt;
  &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:execute-api:#{AWS::Region}:#{AWS::AccountId}:*/${self:provider.stage}/*/user/*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This IAM Policy snippet grants a Lambda function access to invoke the GET methods of any API Gateway in the same account with &lt;code&gt;/user/&lt;/code&gt; in its path. The first wildcard (&lt;code&gt;*&lt;/code&gt;) is for the API Gateway Resource ID. This is dynamically generated, so we don't want to hard-code it here. The second wildcard is for the HTTP verb and the last is for the specific resource path.&lt;/p&gt;
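&lt;p&gt;The segment order in that ARN is easy to get wrong, so here is a small helper spelling it out. This is a sketch for illustration only, not part of any AWS SDK:&lt;/p&gt;

```python
def execute_api_arn(region, account_id, api_id="*", stage="*", verb="*", path="*"):
    # Format: arn:aws:execute-api:region:account-id:api-id/stage/http-verb/resource-path
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{verb}/{path}"

# Matches the policy resource above for a "dev" stage:
execute_api_arn("eu-west-1", "123456789012", stage="dev", verb="GET", path="user/*")
# 'arn:aws:execute-api:eu-west-1:123456789012:*/dev/GET/user/*'
```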
&lt;h1&gt;
  
  
  Providing IAM Credentials in HTTP Requests
&lt;/h1&gt;

&lt;p&gt;The final piece of the puzzle is in making the invocation with the correct credentials. When we use any HTTP request library (like &lt;a href="https://2.python-requests.org/en/master/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; in Python or &lt;a href="https://github.com/axios/axios" rel="noopener noreferrer"&gt;axios&lt;/a&gt; in JavaScript), our AWS Lambda function role credentials will not be passed by default. To add our credentials, we need to &lt;a href="https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html" rel="noopener noreferrer"&gt;sign the HTTP request&lt;/a&gt;. There is a bit of figuring out required here so I created an NPM module that wraps axios and automatically signs the request using the Lambda's role. The module is called &lt;code&gt;aws-signed-axios&lt;/code&gt; and is available &lt;a href="https://www.npmjs.com/package/aws-signed-axios" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
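&lt;p&gt;Under the hood, this signing is AWS Signature Version 4. The heart of it, the chained HMAC key derivation, is small enough to sketch here. This shows the key-derivation step only; a real signer (like the wrapper module) also builds a canonical request and string to sign:&lt;/p&gt;

```python
import hashlib
import hmac

def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key, date_stamp, region, service):
    # SigV4 derives a per-day, per-region, per-service signing key by
    # chaining HMAC-SHA256: date -> region -> service -> "aws4_request".
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")
```

&lt;p&gt;Because the derived key is scoped to a date, region and service, a leaked signature is far less useful to an attacker than a leaked long-term secret key.&lt;/p&gt;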

&lt;p&gt;To make the HTTP request with the credentials associated with the API Gateway permissions, just invoke with the wrapper library like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const signedAxios = require('aws-signed-axios')
...

async function getUser(userId) {
  ...

  const { data: result } = await signedAxios({
    method: 'GET',
    url: userUrl
  })

  return result
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. In summary, we now have a clear, repeatable approach for internal serverless APIs with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A way to secure an API Gateway without VPCs&lt;/li&gt;
&lt;li&gt;IAM permission we can use to grant to any services we wish to allow&lt;/li&gt;
&lt;li&gt;A simple way to sign HTTP requests with the necessary IAM credentials&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to explore more, take a look at our open-source serverless starter project, &lt;em&gt;SLIC Starter&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/fourTheorem" rel="noopener noreferrer"&gt;
        fourTheorem
      &lt;/a&gt; / &lt;a href="https://github.com/fourTheorem/slic-starter" rel="noopener noreferrer"&gt;
        slic-starter
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A complete, serverless starter project
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;SLIC Starter&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;ℹ️ Note&lt;/strong&gt;
SLIC Starter is an archived project. It's a useful reference for ideas on building serverless applications with AWS but is not actively maintained.  If you have any questions or need any help building modern applications on AWS, reach out to us &lt;a href="mailto:hello@fourtheorem.com" rel="noopener noreferrer"&gt;by email&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.serverless.com" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/5a0da73da7ea9cc38d18980d734be39327c534574d7038e15387ed966b7eeac5/687474703a2f2f7075626c69632e7365727665726c6573732e636f6d2f6261646765732f76332e737667" alt="serverless"&gt;&lt;/a&gt;
&lt;a href="https://standardjs.com" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/5338a68a0f130dc684279ff3e42e45c9c74006018a1bdeaac76905979b3ccd49/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64655f7374796c652d7374616e646172642d627269676874677265656e2e737667" alt="JavaScript Style Guide"&gt;&lt;/a&gt;
&lt;a href="http://commitizen.github.io/cz-cli/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1d01d79de532fa2a8b9d3e1c29e2b5d6f700b6d36f108c8416faca472cb35b6f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f6d6d6974697a656e2d667269656e646c792d627269676874677265656e2e737667" alt="Commitizen friendly"&gt;&lt;/a&gt;
&lt;a href="https://github.com/fourTheorem/slic-starter./LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e4b233858678a0999e3aa70172aca241cc4a786d9fa1aac09c8d916c6f98a5be/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f6c2f736c69632d737461727465722e737667" alt="license"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Jump to:&lt;/strong&gt; &lt;a href="https://github.com/fourTheorem/slic-starter#5-getting-started" rel="noopener noreferrer"&gt;Getting Started&lt;/a&gt; | &lt;a href="https://github.com/fourTheorem/slic-starterdocs/QUICK_START.md" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; | &lt;a href="https://github.com/fourTheorem/slic-starter#37-cicd" rel="noopener noreferrer"&gt;CI/CD&lt;/a&gt; | &lt;a href="https://github.com/fourTheorem/slic-starter#2-application-architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; | &lt;a href="https://github.com/fourTheorem/slic-starterdocs/CONTRIBUTING.md" rel="noopener noreferrer"&gt;Contributing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLIC Starter&lt;/strong&gt; is a complete starter project for production-grade &lt;strong&gt;serverless&lt;/strong&gt; applications on AWS. SLIC Starter uses an opinionated, pragmatic approach to structuring, developing and deploying a modern, serverless application with one simple, overarching goal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Get your serverless application into production fast&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/fourTheorem/slic-starter" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;I work as the CTO of &lt;a href="https://fourtheorem.com" rel="noopener noreferrer"&gt;fourTheorem&lt;/a&gt; and am the author of &lt;a href="http://www.aiasaservicebook.com/" rel="noopener noreferrer"&gt;AI as a Service&lt;/a&gt;. I'm on twitter as &lt;a href="https://twitter.com/eoins" rel="noopener noreferrer"&gt;@eoins&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>api</category>
      <category>gateway</category>
    </item>
    <item>
      <title>Find Changes Between Two Git Commits Without Cloning</title>
      <dc:creator>Eoin Shanaghy</dc:creator>
      <pubDate>Sat, 02 Mar 2019 21:34:18 +0000</pubDate>
      <link>https://dev.to/eoinsha/find-changes-between-two-git-commits-without-cloning-4kkp</link>
      <guid>https://dev.to/eoinsha/find-changes-between-two-git-commits-without-cloning-4kkp</guid>
      <description>&lt;p&gt;There are some cases where you want to find out information about changes in your Git repository without having to clone the full repository. This will usually be in your automated build environment. When I used Jenkins, Travis or Circle CI, I had access to the cloned Git repository and could use &lt;code&gt;git log&lt;/code&gt;, &lt;code&gt;git ls-remote&lt;/code&gt; and &lt;code&gt;git diff&lt;/code&gt; without any problem.&lt;/p&gt;

&lt;p&gt;Other tools, and I am talking specifically about AWS CodeDeploy, take a different approach. Instead of giving you access to a cloned repo, AWS CodeDeploy gives you a snapshot of your code without the &lt;code&gt;.git&lt;/code&gt; folder. This makes it impossible to run checks on what has changed since a previous build or even to determine what has changed in the commit that triggered your build. Some CI environments will give you a "shallow clone" without the full Git history, leaving you with a similar challenge.&lt;/p&gt;

&lt;p&gt;I wanted to run these kind of checks to determine which microservices in our monorepo had changed so I knew which ones to build and redeploy. This is a technique described well in this &lt;a href="https://blog.shippable.com/ci/cd-of-microservices-using-mono-repos" rel="noopener noreferrer"&gt;Shippable blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I looked at two options to find out folders which had seen changes since the last successful deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the full repository manually in a CodeBuild step&lt;/li&gt;
&lt;li&gt;Use the GitHub API to retrieve information about the commits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first option was one I wanted to avoid. It meant cloning a potentially large and growing repository at the start of the build. A shallow clone would not be sufficient as it would not capture the history of changes back to the previous release.&lt;/p&gt;

&lt;p&gt;The GitHub REST API includes a &lt;a href="https://developer.github.com/v3/repos/commits/#compare-two-commits" rel="noopener noreferrer"&gt;&lt;em&gt;compare&lt;/em&gt; API&lt;/a&gt; and a &lt;a href="https://developer.github.com/v3/repos/commits/#list-commits-on-a-repository" rel="noopener noreferrer"&gt;&lt;em&gt;list-commits&lt;/em&gt; API&lt;/a&gt;. The compare API is limited to 250 commits, so it couldn't be relied on. The list-commits API could work, but it means making multiple paged requests for a large amount of data just to get the changed paths. After a bit of trial and error, I ultimately abandoned the GitHub API approach.&lt;/p&gt;

&lt;p&gt;After some further digging, I came across a &lt;a href="https://stackoverflow.com/questions/3489173/how-to-clone-git-repository-with-specific-revision-changeset" rel="noopener noreferrer"&gt;StackOverflow post&lt;/a&gt; that gave me a third option. It allows me to fetch the two individual commits using the &lt;code&gt;git&lt;/code&gt; command and compare them to determine changed filenames. In this example, I'm using the public &lt;code&gt;lodash/lodash&lt;/code&gt; repository. Assuming we want to compare the changes between the tag &lt;code&gt;4.0.0&lt;/code&gt; and the &lt;code&gt;HEAD&lt;/code&gt; of the &lt;code&gt;master&lt;/code&gt; branch, the sequence of commands looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

git init &lt;span class="nb"&gt;.&lt;/span&gt;                                               &lt;span class="c"&gt;# Create an empty repository&lt;/span&gt;
git remote add origin git@github.com:lodash/lodash.git   &lt;span class="c"&gt;# Specify the remote repository&lt;/span&gt;

git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; base                                     &lt;span class="c"&gt;# Create a branch for our base state&lt;/span&gt;

git fetch origin &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 4.0.0                         &lt;span class="c"&gt;# Fetch the single commit for the base of our comparison&lt;/span&gt;
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; FETCH_HEAD                              &lt;span class="c"&gt;# Point the base branch to the commit we just fetched&lt;/span&gt;

git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; target                                   &lt;span class="c"&gt;# Create a branch for our target state&lt;/span&gt;

git fetch origin &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 master                        &lt;span class="c"&gt;# Fetch the single commit for the target of our comparison&lt;/span&gt;
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; FETCH_HEAD                              &lt;span class="c"&gt;# Point the local target to the commit we just fetched&lt;/span&gt;

git diff &lt;span class="nt"&gt;--name-only&lt;/span&gt; base target                         &lt;span class="c"&gt;# Print a list of all files changed between the two commits&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The directory size with this minimal fetching approach is 4.6M compared to 49M for the full &lt;code&gt;lodash&lt;/code&gt; repository.&lt;/p&gt;
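&lt;p&gt;With the output of &lt;code&gt;git diff --name-only&lt;/code&gt; in hand, mapping changed files to the monorepo services that need a rebuild is a simple path-prefix check. A minimal sketch, with made-up service names for illustration:&lt;/p&gt;

```python
def changed_services(changed_files, service_dirs):
    # A service needs a rebuild if any changed file lives under its
    # top-level directory in the monorepo.
    changed = set()
    for path in changed_files:
        top_level = path.split("/", 1)[0]
        if top_level in service_dirs:
            changed.add(top_level)
    return sorted(changed)

changed_services(
    ["user-service/handler.js", "docs/readme.md", "email-service/send.js"],
    {"user-service", "email-service", "checklist-service"},
)
# ['email-service', 'user-service']
```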

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwvwcjjh63shookwazrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwvwcjjh63shookwazrz.png" alt="Comparing two git commits"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I'm the CTO at &lt;a href="https://fourtheorem.com" rel="noopener noreferrer"&gt;fourTheorem&lt;/a&gt;. Follow me on twitter: &lt;a href="https://twitter.com/eoins" rel="noopener noreferrer"&gt;@eoins&lt;/a&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>microservices</category>
      <category>ci</category>
      <category>cd</category>
    </item>
  </channel>
</rss>
