<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gert Leenders</title>
    <description>The latest articles on DEV Community by Gert Leenders (@glnds).</description>
    <link>https://dev.to/glnds</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F497890%2Fb61a0570-76f9-4aee-abf4-8903ac8c0dd9.jpg</url>
      <title>DEV Community: Gert Leenders</title>
      <link>https://dev.to/glnds</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/glnds"/>
    <language>en</language>
    <item>
      <title>A Hidden Gem: Two Ways to Improve AWS Fargate Container Launch Times</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Thu, 27 Oct 2022 18:30:36 +0000</pubDate>
      <link>https://dev.to/aws-heroes/a-hidden-gem-two-ways-to-improve-aws-fargate-container-launch-times-51fo</link>
      <guid>https://dev.to/aws-heroes/a-hidden-gem-two-ways-to-improve-aws-fargate-container-launch-times-51fo</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;There’s a little gem hidden away on GitHub about boosting Fargate launch times. Kudos to &lt;a href="https://twitter.com/mreferre"&gt;Massimo Re Ferrè&lt;/a&gt;, the author of the &lt;a href="https://github.com/aws/containers-roadmap/issues/696#issuecomment-1286261454"&gt;GitHub comment in question&lt;/a&gt;. With this post, I want to bring his comment to your attention. Feel free to read the original comment instead; this blog post is a shorter version that zooms in on the core problem and the solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;On AWS Fargate, each containerized workload, Amazon ECS Task or Kubernetes Pod, runs on its own single-use, single-tenant instance that’s not reused after the workload finishes.&lt;/strong&gt; The container images required to run a workload on AWS Fargate are therefore downloaded for every Amazon ECS Task or Kubernetes Pod. This is in contrast to multi-tenant instances like Amazon ECS Container Instances or Kubernetes Nodes, where a container image may already exist on a host from a replica of the same workload.&lt;/p&gt;

&lt;p&gt;The use of a single-use, single-tenant instance makes image caching hard to implement on AWS Fargate. However, AWS is working on two alternative approaches to reduce container launch times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: Reducing AWS Fargate Startup Times with zstd Compressed Container Images
&lt;/h2&gt;

&lt;p&gt;By default, container image builders compress each container image layer using the gzip compression algorithm.  When a container runtime downloads an image layer, the runtime decompresses the contents into the container runtime storage system.&lt;/p&gt;

&lt;p&gt;Container image builders and container runtimes also support an alternative compression algorithm for image layers: &lt;a href="https://github.com/facebook/zstd"&gt;zstd&lt;/a&gt;. Benchmarks show that zstd can achieve both higher compression ratios and higher decompression speeds than gzip. &lt;strong&gt;AWS internal testing of zstd compressed container images on AWS Fargate has shown up to a 27% reduction in Amazon ECS Task or Kubernetes Pod startup times.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reduction in startup time varies by container image, with the larger container images seeing the most significant improvement. Use cases like machine learning, artificial intelligence, and data analytics traditionally have large container images. Consequently, these workloads could see the most benefit from adopting zstd compression.&lt;/p&gt;
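&lt;p&gt;Switching to zstd doesn’t require changing the image contents, only how the image is built. As a sketch, assuming a BuildKit-enabled Docker with &lt;code&gt;buildx&lt;/code&gt; and a placeholder registry, the build could look like this:&lt;/p&gt;

```shell
# Sketch: request zstd-compressed layers through BuildKit's image
# exporter options. The registry and tag below are placeholders.
IMAGE="registry.example.com/myapp:zstd"

docker buildx build \
  --output "type=image,name=${IMAGE},oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true,push=true" \
  .
```

&lt;p&gt;Note that older container runtimes may not understand zstd layers, so verify runtime support before switching a production image.&lt;/p&gt;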

&lt;p&gt;From: &lt;a href="https://aws.amazon.com/blogs/containers/reducing-aws-fargate-startup-times-with-zstd-compressed-container-images/"&gt;Reducing AWS Fargate Startup Times with zstd Compressed Container Images&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 2: Seekable OCI for lazy loading container images
&lt;/h2&gt;

&lt;p&gt;Prior research has shown that container image downloads account for 76% of container startup time, but on average only 6.4% of the data is needed for the container to start doing valuable work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Lazy_loading"&gt;Lazy loading&lt;/a&gt; is an approach where data is downloaded from the registry in parallel with the application startup.&lt;/p&gt;
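&lt;p&gt;The idea is easy to sketch: start the application as soon as the essential bits are available, and keep fetching the rest in the background. A toy illustration of the principle (not how a real snapshotter is implemented):&lt;/p&gt;

```python
import threading
import time

# Toy model of lazy loading: the app starts right away while the bulk
# of the image is still being "downloaded" in background threads.
fetched = {}

def download_layer(layer):
    time.sleep(0.05)  # stand-in for a network fetch
    fetched[layer] = True

threads = [threading.Thread(target=download_layer, args=(name,))
           for name in ["layer2", "layer3"]]
for t in threads:
    t.start()

print("app started")  # happens before the full image is local
for t in threads:
    t.join()
print(sorted(fetched))  # ['layer2', 'layer3']
```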

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ka8wr2sG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t9nfy7owwzi64t0x0l6f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ka8wr2sG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t9nfy7owwzi64t0x0l6f.jpg" alt="Lazy Loading" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/awslabs/soci-snapshotter"&gt;Seekable OCI (SOCI)&lt;/a&gt; is a technology open-sourced by AWS that enables containers to launch faster by lazily loading the container image. It’s usually not possible to fetch individual files from gzipped tar files. With SOCI, AWS borrowed some of the design principles from &lt;a href="https://github.com/containerd/stargz-snapshotter"&gt;stargz-snapshotter&lt;/a&gt;, but took a different approach. A SOCI index is generated separately from the container image and is stored in the registry as an OCI Artifact and linked back to the container image by OCI Reference Types. This means that the container images do not need to be converted, image digests do not change, and image signatures remain valid.&lt;/p&gt;
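&lt;p&gt;The core trick is that a separate index turns a tar archive into something seekable. A minimal illustration of the principle in Python, using an uncompressed tar (SOCI itself works on compressed layers and is far more involved):&lt;/p&gt;

```python
import io
import tarfile

# Build a small uncompressed tar archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("app/config.json", b'{"port": 8080}'),
                          ("app/big.bin", b"x" * 4096)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# An index, stored separately from the archive, mapping each member
# to the byte range of its data.
index = {}
with tarfile.open(fileobj=io.BytesIO(buf.getvalue())) as tar:
    for member in tar.getmembers():
        index[member.name] = (member.offset_data, member.size)

# Lazily fetch one file with a single range read, without scanning
# the rest of the archive.
offset, size = index["app/config.json"]
raw = buf.getvalue()
print(raw[offset:offset + size])  # b'{"port": 8080}'
```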

&lt;p&gt;The &lt;a href="https://github.com/awslabs/soci-snapshotter"&gt;soci-snapshotter&lt;/a&gt; project provides a CLI tool to create SOCI indices for existing OCI container images, as well as a remote snapshotter that gives containerd the ability to lazily load images that have been indexed by SOCI.&lt;/p&gt;

&lt;p&gt;From: &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/09/introducing-seekable-oci-lazy-loading-container-images/"&gt;Introducing Seekable OCI for lazy loading container images&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While image caching for Fargate is not solved, know that you have these two techniques at your disposal to reduce Fargate launch times!&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>containerapps</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Sustainability and Software Development: Properly Prioritise Your Actions</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Thu, 13 Oct 2022 19:24:41 +0000</pubDate>
      <link>https://dev.to/aws-heroes/sustainability-and-software-development-properly-prioritise-your-actions-5el6</link>
      <guid>https://dev.to/aws-heroes/sustainability-and-software-development-properly-prioritise-your-actions-5el6</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;We should take every possible action to limit our carbon footprint. Nevertheless, prioritise those actions based on impact (see the priority table below).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sustainable Pillar
&lt;/h2&gt;

&lt;p&gt;In today's world, sustainability should be everyone's concern. Information Technology (IT) follows that trend, with climate concerns on most agendas. Regarding cloud computing, it’s nice to see that &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html"&gt;AWS&lt;/a&gt;, &lt;a href="https://cloud.google.com/sustainability"&gt;Google Cloud&lt;/a&gt;, and &lt;a href="https://azure.microsoft.com/en-us/explore/global-infrastructure/sustainability/#overview"&gt;Azure&lt;/a&gt; made sustainability part of their strategy.&lt;/p&gt;

&lt;p&gt;As a cloud engineer, the following article was recently brought to my attention: &lt;a href="https://levelup.gitconnected.com/python-is-destroying-the-planet-951e83f22748"&gt;Python is Destroying the Planet&lt;/a&gt;. The article has good content, but I had many concerns after reading it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context is Everything.
&lt;/h2&gt;

&lt;p&gt;To summarize, the article is about measuring the energy efficiency of programming languages and concludes with an open question. The latter opens the door to people jumping to the wrong conclusions or, even worse, taking improper action. 😨&lt;/p&gt;

&lt;p&gt;Based on the article, I came up with the following metaphor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let’s say a parcel needs to ship from Belgium to Rome with the lowest carbon footprint. Therefore the package is loaded into a truck. Although it's a hot summer day, the driver - concerned about the planet - pushes himself to the limit. To further reduce the carbon footprint, he turns off the air-conditioning to save some gasoline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZhGHHGUp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h1sd2yxob299axsd7dpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZhGHHGUp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h1sd2yxob299axsd7dpg.png" alt="Truck Driver" width="560" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the driver’s intentions are good, I immediately ask myself the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did that parcel have to ship to Rome? Is there any value in doing this? Maybe it’s just an empty box?&lt;/li&gt;
&lt;li&gt;Did the truck take the shortest or fastest route?&lt;/li&gt;
&lt;li&gt;Was the truck loaded at full capacity?&lt;/li&gt;
&lt;li&gt;Were the truck’s tires inflated sufficiently?&lt;/li&gt;
&lt;li&gt;Given the parcel dimensions, what were the other freight options?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My point? The intentions are correct, but turning off the air-conditioning is no more than a drop in the ocean compared to the outcome of the above questions! Acting on those answers could have an impact a few orders of magnitude bigger than turning off the air-conditioning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Carbon footprint measures for IT prioritized
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With multiple options available, it’s better to apply the one with the most impact first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Concerning IT sustainability, I prioritize actions as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The business case&lt;/strong&gt;: There has to be value in the software written to begin with!  For example:

&lt;ul&gt;
&lt;li&gt;There's  no value in a Petabyte data lake without data access&lt;/li&gt;
&lt;li&gt;There's no value in calculated predictions that are applied nowhere&lt;/li&gt;
&lt;li&gt;You get the point ;-)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy input&lt;/strong&gt;: is the data center running on green energy? If yes, everything else becomes irrelevant because there’s no carbon footprint. (Yes, a bit simplistic, but again, you get the idea)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy output&lt;/strong&gt;: is the residual heat of the data center reused?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technology&lt;/strong&gt;: Cloud Native, Serverless, CQRS, ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticity&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Are things shut down when not in use?&lt;/li&gt;
&lt;li&gt;Is there a low baseline with scale-out when needed?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clean code&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programming language&lt;/strong&gt;: is the language used energy efficient?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ol&gt;
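&lt;p&gt;To make the elasticity point concrete: shutting things down when not in use can be as simple as two scheduled actions on an Auto Scaling group. A hypothetical CloudFormation fragment (the resource names, capacities, and schedule are made up):&lt;/p&gt;

```yaml
# Scale to zero outside office hours, back up in the morning (cron in UTC).
NightlyShutdown:
  Type: AWS::AutoScaling::ScheduledAction
  Properties:
    AutoScalingGroupName: !Ref AppAutoScalingGroup
    Recurrence: "0 19 * * 1-5"
    DesiredCapacity: 0

MorningStartup:
  Type: AWS::AutoScaling::ScheduledAction
  Properties:
    AutoScalingGroupName: !Ref AppAutoScalingGroup
    Recurrence: "0 6 * * 1-5"
    DesiredCapacity: 2
```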

&lt;p&gt;It would be wrong to state that the choice of programming language has no impact on carbon footprint. But before digging deeper into that, many other things could yield much more with less effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  To wrap up
&lt;/h2&gt;

&lt;p&gt;When thinking about sustainability, properly prioritize your efforts toward limiting your carbon footprint.&lt;/p&gt;

&lt;p&gt;Running your workloads in the cloud is likely already a step in the right direction because cloud providers usually have a lower carbon footprint and are more energy efficient than typical on-premises alternatives. This is because they invest in efficient power and cooling technology, operate energy-efficient server populations, and achieve high server utilization rates.&lt;/p&gt;

&lt;p&gt;Cloud workloads reduce impact by taking advantage of shared resources, such as networking, power, cooling, and physical facilities. These big providers also have a much stronger incentive to run as optimized as possible.&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>aws</category>
    </item>
    <item>
      <title>Understand, Visualize, and Lower CloudWatch Costs</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Wed, 22 Jun 2022 19:41:32 +0000</pubDate>
      <link>https://dev.to/aws-heroes/understand-visualize-and-lower-cloudwatch-costs-2206</link>
      <guid>https://dev.to/aws-heroes/understand-visualize-and-lower-cloudwatch-costs-2206</guid>
      <description>&lt;p&gt;In this post, I’ll create awareness on AWS CloudWatch costs and help you better understand what could cause these costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs &amp;amp; Metrics
&lt;/h2&gt;

&lt;p&gt;Whenever you build a service or application, logs and metrics are crucial for getting insights. In case of trouble, logs are the first place to look. Besides, metrics enable you to predict events and to get notified as soon as numbers start deviating.&lt;/p&gt;

&lt;p&gt;Often starting small, the amount of logs and metrics grows together with the service. In AWS, metrics and logs are part of Amazon CloudWatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hide &amp;amp; Seek
&lt;/h3&gt;

&lt;p&gt;Over the years, I've noticed the following regarding CloudWatch cost. Most of the time, AWS CloudWatch is not the most expensive service on the AWS bill. Usually, CloudWatch costs don't even make it into the top three. CloudWatch seems to be a master in disguise in a goody-goody way. Often it ends up in fourth or fifth place on your AWS bill. &lt;strong&gt;That place&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;-just outside the top three-&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;appears to be the perfect spot to eat up a good portion of your budget without becoming a usual suspect in the next cost-saving round&lt;/strong&gt;. 😓&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xtyj784jnq90543zc48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xtyj784jnq90543zc48.png" alt="Cash Cow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm going to be straight; I suspect AWS CloudWatch to be one of AWS’s cash cows. Why? Besides being a master at staying under the radar, CloudWatch also has costly defaults. Take log group retention as an example. The default for log groups to never expire only makes sense to AWS! &lt;strong&gt;My suspicion for CloudWatch being a cash cow is amplified by the fact that it’s damn hard (to nearly impossible) to get useful cost insights&lt;/strong&gt;.&lt;/p&gt;
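&lt;p&gt;One quick win against those defaults: set an explicit retention period on every log group. A hypothetical CloudFormation fragment (the group name and retention period are just examples):&lt;/p&gt;

```yaml
AppLogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /my-app/api
    RetentionInDays: 30  # instead of the default "Never expire"
```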

&lt;h3&gt;
  
  
  CloudWatch Log Group cost distribution
&lt;/h3&gt;

&lt;p&gt;First, regarding CloudWatch Logs, &lt;strong&gt;note that it's not data storage that’s costly. It’s data ingestion that generates costs&lt;/strong&gt;. Your first question about logs: &lt;code&gt;which log groups generate the highest cost?&lt;/code&gt; A good insight isn’t available out of the box, but I figured out an excellent way to get one. &lt;strong&gt;I’ve built a graph showing the log group data ingestion distribution. The graph shows the log groups eating up the most money from left to right&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fia8k83oirrfcu6l4pmoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fia8k83oirrfcu6l4pmoz.png" alt="Log Distribution" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Immediately you notice a significant outlier on the far left of the graph. Not a big surprise; in this case, the clear winner is AWS CloudTrail. Outliers like these make it hard to zoom in on other elements. To further drill down in the graph, start disabling the outliers on the graph’s legend to the right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6byfsgsmj16iw9gtmgbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6byfsgsmj16iw9gtmgbs.png" alt="Drill down" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Setting up the log group data ingestion distribution graph yourself is easy. In CloudWatch, create a new dashboard and add a new widget. &lt;strong&gt;The most important thing is to fill in the following query to build the graph&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekngh7wj94t5w8ujk25w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekngh7wj94t5w8ujk25w.png" alt="Query" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;
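&lt;p&gt;The query in the screenshot is hard to read at this size. A metrics query of roughly this shape (my assumption of what it boils down to) graphs ingested bytes per log group by searching the &lt;code&gt;IncomingBytes&lt;/code&gt; metric in the &lt;code&gt;AWS/Logs&lt;/code&gt; namespace:&lt;/p&gt;

```
SEARCH(' {AWS/Logs,LogGroupName} MetricName="IncomingBytes" ', 'Sum', 2592000)
```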

&lt;p&gt;To enhance the contrast between group sizes, you can play around with the graph’s period:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4eahhc1l1ksv2azvru1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4eahhc1l1ksv2azvru1.png" alt="Period" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moving the legend to the right helps to select certain groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7vpzf67yqc38tyiij5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7vpzf67yqc38tyiij5a.png" alt="Options" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The graph visualization allows you to prioritize log group optimization.&lt;/strong&gt; Ultimately, the optimization will likely come down to lowering your volume of logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you need that many log statements? Challenge yourself!&lt;/li&gt;
&lt;li&gt;Question the log level. Having a DEBUG level is sometimes great but using that log level everywhere all the time is probably overkill&lt;/li&gt;
&lt;li&gt;The CloudWatch log agent is not shy to generate a good amount of logs/costs&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I know, this is kicking in an open door 😄&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mixri0ne8tgc1ztwvz1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mixri0ne8tgc1ztwvz1.gif" alt="Open Door" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch Metrics
&lt;/h2&gt;

&lt;p&gt;If getting log insights was complex, getting metrics data is even trickier. On top of that, I have the feeling CloudWatch metrics are an even bigger money pit than logs. &lt;strong&gt;Worse, there’s currently no way to get an insight into CloudWatch Metrics costs&lt;/strong&gt;. I contacted support asking for a way, but I was told it’s impossible. 😟&lt;/p&gt;

&lt;p&gt;So all that I can do for now is create awareness about CloudWatch metrics cost. &lt;strong&gt;Remember that metric costs can become expensive very fast&lt;/strong&gt;. For example, at some point, I created &lt;a href="https://aws.amazon.com/builders-library/implementing-health-checks/" rel="noopener noreferrer"&gt;a deep health check&lt;/a&gt; on a fleet of instances with the help of some custom metrics. I assumed the cost would be negligible, only to discover those deep health checks cost around $900 a month. I can tell you that those health checks are long gone now. 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  To summarize
&lt;/h2&gt;

&lt;p&gt;My advice is to check your CloudWatch cost at least monthly. Be aware that CloudWatch costs often tend to stay under the radar and try to find a way to get a good insight into your CloudWatch costs.&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
      <category>cloudskills</category>
    </item>
    <item>
      <title>IT Professionalism: Have You Ever Wondered Why We Ship Shit?</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Tue, 22 Feb 2022 18:35:10 +0000</pubDate>
      <link>https://dev.to/glnds/it-professionalism-have-you-ever-wondered-why-we-ship-shit-2c5i</link>
      <guid>https://dev.to/glnds/it-professionalism-have-you-ever-wondered-why-we-ship-shit-2c5i</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Warning: This article contains strong language that may offend some readers. Please blame Uncle Bob.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the sentence "Don't Ship Shit" doesn't ring a bell, then be sure to watch &lt;a href="https://youtu.be/LmRl0D-RkPU?t=2111"&gt;Uncle Bob's talk about professionalism and delivering quality software&lt;/a&gt; first.&lt;/p&gt;

&lt;p&gt;The particular part of Uncle Bob's talk I'd like to discuss in this post is the one about the two sets of ears picking up the tagline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"I expect that we will not ship shit"&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bob pictures it like this: while the programmer's ear is yelling that this is insane, the normal human's ear is saying this is completely reasonable. &lt;strong&gt;Basically, Bob states that programmers believe it's impossible to deliver quality software while meeting deadlines unless they ship shit.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without any doubt, Uncle Bob is a great speaker, and the &lt;a href="https://youtu.be/LmRl0D-RkPU?t=2111"&gt;Voxxed CERN 2019 Keynote&lt;/a&gt; is an excellent presentation. The story is so recognizable, which makes the talk so enjoyable. Yet, at the same time, it's also painful because it's so true. &lt;strong&gt;As developers, we know we're often cutting corners so big that it results in poor-quality software. The boundary between cutting corners and crappy software is thin&lt;/strong&gt;. On the other hand, business seems unaware of these shortcuts and assumes developers always deliver top-notch applications and services 😦&lt;/p&gt;

&lt;p&gt;I want to elaborate a bit more on Bob's talk: &lt;strong&gt;I think there are actually two kinds of shit: Shit by Design and Shit by Incompetence.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shit by Design
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Shit by design: shit caused by overcutting corners and being too indulgent.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bad design driven by wrong estimates
&lt;/h3&gt;

&lt;p&gt;I can't repeat it enough: &lt;a href="https://www.element7.io/2021/06/the-quirks-of-software-effort-estimation/"&gt;estimations should be accurate but not precise&lt;/a&gt;. Also, estimates are non-negotiable. You should feel offended if someone questions your estimates!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't cave under pressure to make silly estimates!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bad design by bureaucratic meddling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Business defines the functionals, IT defines the non-functionals&lt;/strong&gt;.&lt;br&gt;
Compare writing software with buying a car. As a customer, you can decide upon a wide variety of options. However, a customer can't choose to leave out the airbag or seatbelt to save some dollars. Manufacturers don't offer that choice because the law doesn't allow it, and to protect customers against themselves (in case somebody still needs convincing that those things are lifesavers in an accident) 😓&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don’t let business decide on the non-functionals; IT will feel the pain when they are not properly executed. Security can be compared with a car's seatbelt: it's not optional.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bad design by feature stuffing
&lt;/h3&gt;

&lt;p&gt;Designing a system without knowing RTO and RPO inevitably results in over- or under-engineering. Besides, software quality assurance is not a one-off. &lt;strong&gt;&lt;a href="https://www.element7.io/2020/06/sprint-planning-how-to-use-rto-rpo-to-prevent-feature-stuffing/"&gt;Continuously use your SLOs as a defense shield against feature stuffing and to reserve enough time for the non-functionals&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key takeaway: be more assertive as a developer. No one expects you to take shortcuts so big that they ruin quality and SLOs!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shit by Incompetence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Shit by Incompetence: people not being aware of the mess they're making.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Bob's words: we don't know shit. 😄&lt;/p&gt;

&lt;p&gt;There is nothing as dangerous as the &lt;a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect"&gt;Dunning–Kruger effect&lt;/a&gt;: failing in self-assessment and overestimating your abilities. Personally, I like Socrates' saying: &lt;a href="https://en.wikipedia.org/wiki/I_know_that_I_know_nothing"&gt;I know that I know nothing&lt;/a&gt;. That mindset is a good starting point and half the battle!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a nutshell: don't over- or underestimate your skills and be assertive! That's the best defense against shipping poor quality software.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>agile</category>
      <category>testing</category>
    </item>
    <item>
      <title>Improving Resiliency using Immutable Infrastructure</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Mon, 31 Jan 2022 11:48:16 +0000</pubDate>
      <link>https://dev.to/aws-heroes/improving-resiliency-using-immutable-infrastructure-5gjo</link>
      <guid>https://dev.to/aws-heroes/improving-resiliency-using-immutable-infrastructure-5gjo</guid>
      <description>&lt;p&gt;The reason for writing a post about resiliency and immutable infrastructure originates from the following question recently asked to me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just read your comments about building AMIs. &lt;strong&gt;Why do you think it is best to do everything on compile time?&lt;/strong&gt; We decided to work the other way around. We mostly deploy the standard images provided by AWS and do all the customization as post-install config with Ansible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Compile-time vs. Runtime aka Baking vs. Frying
&lt;/h2&gt;

&lt;p&gt;To explain why I prefer doing things at compile-time, we need to look at the possibilities for system provisioning first. &lt;strong&gt;Generally, you have two options for system provisioning: bake all configuration into a system before launch (baking) or add configuration after launching, called frying.&lt;/strong&gt; Often you use a mixture of both, where parts of the system are baked, and parts are fried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Towards developers, I often compare the concepts of baking and frying with compile-time vs. runtime or early vs. late binding.&lt;/strong&gt; That terminology is easier to understand for developers because most of them are familiar with those concepts and their benefits and drawbacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mitigating Unpredictability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The main reason to prefer baking, early binding, or compile-time is simple: it avoids unpredictability.&lt;/strong&gt; Whenever things are being done at runtime or whenever you use late binding, there is a chance of unexpected behavior, potentially breaking things.&lt;/p&gt;

&lt;p&gt;For example, imagine the trivial case where &lt;code&gt;yum update&lt;/code&gt; is executed in &lt;a href="https://cloudinit.readthedocs.io/en/latest/"&gt;cloud-init&lt;/a&gt; on an EC2 Instance. Even executing simple code while booting can break because of unpredictable behavior. While it can look like a good idea to run &lt;code&gt;yum update&lt;/code&gt; during initialization to keep package versions evergreen, it will become clear it wasn't when you end up without healthy instances on a Sunday at 8 am, just because some yum dependency was temporarily unavailable. 😅&lt;/p&gt;
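&lt;p&gt;The baked alternative moves that same &lt;code&gt;yum update&lt;/code&gt; to image build time. A minimal sketch using a Packer HCL2 template (the source AMI, region, and package names are placeholders):&lt;/p&gt;

```hcl
source "amazon-ebs" "app" {
  region        = "eu-west-1"
  source_ami    = "ami-0123456789abcdef0"
  instance_type = "t3.micro"
  ssh_username  = "ec2-user"
  ami_name      = "app-${formatdate("YYYYMMDDhhmmss", timestamp())}"
}

build {
  sources = ["source.amazon-ebs.app"]

  # Updates and installs happen once, at build time; instances then
  # boot from an immutable, pre-tested AMI.
  provisioner "shell" {
    inline = ["sudo yum -y update", "sudo yum -y install my-app"]
  }
}
```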

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o-8xom_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y60u7v1rcmyig2foxpdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o-8xom_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y60u7v1rcmyig2foxpdq.png" alt="Image description" width="542" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Need for Speed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Besides stability, there's a second benefit of baking: speed.&lt;/strong&gt; Replacement and scale-out events will likely finish faster using baked artifacts because those require fewer steps during startup.&lt;/p&gt;

&lt;p&gt;Especially in the case of scaling out, velocity is critical. When scaling is slow, there's a bigger chance of overloading a service to the point where new resources can't come up anymore. &lt;a href="https://www.element7.io/2021/01/the-curious-case-of-autoscaling/"&gt;Think about premature starvation due to resource exhaustion and fail fast&lt;/a&gt; whenever scaling is involved. Keep in mind: a pre-built immutable artifact always wins the race to spin up over a post-build playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Immutable Infrastructure
&lt;/h2&gt;

&lt;p&gt;Doing everything at compile-time ideally results in what is called "Immutable infrastructure." If you are not familiar with the concept or want a quick refresher, I recommend first reading Martin Fowler's &lt;a href="https://martinfowler.com/bliki/ImmutableServer.html"&gt;Immutable Server&lt;/a&gt;. Drilling down further from that article will bring you to the &lt;a href="https://martinfowler.com/bliki/PhoenixServer.html"&gt;PhoenixServer&lt;/a&gt;, an interesting concept linked to immutable infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NzCOimse--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8v7fskmt7vd0sx79f5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NzCOimse--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8v7fskmt7vd0sx79f5m.png" alt="Image description" width="880" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is a good idea to virtually burn down your servers at regular intervals. A server should be like a phoenix, regularly rising from the ashes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Furthermore, I want to emphasize that &lt;a href="https://martinfowler.com/bliki/SnowflakeServer.html"&gt;Snowflake servers&lt;/a&gt; are to be avoided. To say it in Martin's words: they're good for a ski resort, bad for a data center. The problem with a snowflake server is that it's difficult to reproduce. Should you run into problems, it's hard to fire up a new server to support the same functionality. &lt;strong&gt;New resources that come up need to be exact copies and predictable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Immutable infrastructure and Infrastructure as Code (IaC)
&lt;/h3&gt;

&lt;p&gt;At the risk of repeating myself: I'm a fan of both Immutable infrastructure and Infrastructure as Code. Make sure not to mix things up, though: &lt;strong&gt;Infrastructure as Code can facilitate building Immutable Infrastructure, but it's not a necessity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rwe4y8Ov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74a8ddrxnk8bw9wv7eyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rwe4y8Ov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74a8ddrxnk8bw9wv7eyr.png" alt="Image description" width="185" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even if everything is installed manually without Infrastructure as Code, Immutable infrastructure can still be achieved as long as you "bake" the manual install steps into a machine image and follow the same baking procedure whenever changes are made. It may not be ideal, but sometimes it's the best option if time is a constraint.&lt;/p&gt;

&lt;p&gt;In any case, it is crucial to reflect your installation procedure in your Recovery Time Objective (RTO): if a setup or update is time-consuming, your RTO should increase accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embrace Chaos (aka Resilience) Engineering
&lt;/h2&gt;

&lt;p&gt;Adopting a new strategy like Immutable Infrastructure requires extra effort and perseverance at the start. Be aware that it's easy to fall back into old habits. I regularly see people log in to instances and make changes using some "&lt;a href="https://www.element7.io/2021/01/keep-up-with-the-times-forget-ssh-welcome-aws-session-manager/"&gt;last resort remote access&lt;/a&gt;". That is something you really want to avoid. For that reason, I also advise embracing Chaos Engineering. It will give you quick feedback about violations of Immutable Infrastructure. By shortening instance lifetime, people will feel the pain of manual changes after a week or two instead of after several months, which is far more desirable. 😉 Start using AWS Fault Injection Simulator, Chaos Monkey, or &lt;a href="https://www.element7.io/2021/04/aws-ec2-resilience-engineering-the-easy-way/"&gt;simply put a max lifetime on your autoscaling group&lt;/a&gt;.&lt;/p&gt;
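&lt;p&gt;&lt;em&gt;As a minimal sketch (the group name is hypothetical), capping instance lifetime with the AWS CLI forces instances to be replaced regularly:&lt;/em&gt;&lt;/p&gt;

```shell
# Recycle every instance after at most 7 days (604800 seconds), so any manual,
# non-baked change surfaces within a week instead of lingering for months.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --max-instance-lifetime 604800
```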

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>beginners</category>
      <category>aws</category>
    </item>
    <item>
      <title>One-Click Set Up: Querying AWS ALB and VPC Flow Logs Made Easy</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Wed, 12 Jan 2022 18:55:31 +0000</pubDate>
      <link>https://dev.to/aws-heroes/one-click-set-up-querying-aws-alb-and-vpc-flow-logs-made-easy-3b2j</link>
      <guid>https://dev.to/aws-heroes/one-click-set-up-querying-aws-alb-and-vpc-flow-logs-made-easy-3b2j</guid>
      <description>&lt;p&gt;Let's face it, as soon as you get in trouble with an application or infrastructure, logs are your first resort. To have the right logs at hand, VPC Flow Logs should be enabled when a VPC is part of an infrastructure. On top, a load balancer is most likely the second most common component of a setup, and it's advised to have access logging enabled on those.&lt;/p&gt;

&lt;p&gt;Because these two log streams are often quite verbose, ingesting them cost-efficiently means offloading the log data to S3 instead of CloudWatch. ALB logs don't even offer a choice; those are stored on AWS S3 anyway. The tradeoff of storing logs on S3? It makes querying harder. As an answer, AWS created AWS Athena to facilitate querying structured S3 data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Out-of-the-box ALB and Flow Log queries
&lt;/h2&gt;

&lt;p&gt;Last year, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/04/amazon-vpc-flow-logs-announces-out-of-box-integration-with-amazon-athena/"&gt;AWS released Flow Logs Athena integration&lt;/a&gt;, taking away the pain of the Athena VPC Flow Logs setup. A similar counterpart to easily query ALB logs is sadly missing for the moment... Well, until now! 😉&lt;/p&gt;

&lt;p&gt;In the background, the AWS Athena Flow Logs integration turned out to be a vanilla CloudFormation template that bootstraps some Athena resources. I simply enhanced the template by adding &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/"&gt;Athena ALB log integration following AWS best practices&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To allow your account to easily query ALB and VPC Flow Logs, all you need to do is deploy &lt;a href="https://github.com/element7-io/aws-samples/blob/main/cfn-athena-logs.yaml"&gt;this CloudFormation template&lt;/a&gt;. The details of the template are described below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L7M48I5K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3gqkszarx54lq5ufkeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L7M48I5K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3gqkszarx54lq5ufkeg.png" alt="Overview" width="880" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zcIpdsYT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hd9xdmxh0tuf9jzv1fs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zcIpdsYT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hd9xdmxh0tuf9jzv1fs9.png" alt="Queries" width="880" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CloudFormation Log Stack in Detail
&lt;/h2&gt;

&lt;p&gt;The stack parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EnvironmentName&lt;/strong&gt;: to support multiple environments (DTAP). Examples: &lt;code&gt;dev&lt;/code&gt; or &lt;code&gt;prod&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt;: The ALB log context, could be something like an application or account name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FlowLogsLocation&lt;/strong&gt;: the S3 location of the flow logs. Format: &lt;code&gt;s3://doc-example-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AlbLogsLocation&lt;/strong&gt;: the S3 location of the ALB logs. Format: &lt;code&gt;s3://your-alb-logs-directory/AWSLogs/{account_id}/elasticloadbalancing/{region_code}/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InitialPartitionDays&lt;/strong&gt;: the number of days (in the past) of log data that will be partitioned on setup.&lt;/li&gt;
&lt;/ul&gt;
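&lt;p&gt;&lt;em&gt;Putting the parameters together, a deployment could look like the sketch below (all parameter values, including the account id and region, are placeholders to adjust):&lt;/em&gt;&lt;/p&gt;

```shell
# Deploy the Athena log-querying stack with example parameter values.
aws cloudformation deploy \
  --template-file cfn-athena-logs.yaml \
  --stack-name athena-logs-dev \
  --parameter-overrides \
    EnvironmentName=dev \
    Context=my-app \
    FlowLogsLocation=s3://doc-example-bucket/prefix/AWSLogs/123456789012/vpcflowlogs/eu-west-1/ \
    AlbLogsLocation=s3://doc-example-bucket/AWSLogs/123456789012/elasticloadbalancing/eu-west-1/ \
    InitialPartitionDays=30
```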

&lt;p&gt;The most essential resources in the stack are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Athena Database&lt;/li&gt;
&lt;li&gt;An Athena workgroup&lt;/li&gt;
&lt;li&gt;An Athena Partitioned Table for the VPC Flow Logs&lt;/li&gt;
&lt;li&gt;An Athena Partitioned Table for ALB logs&lt;/li&gt;
&lt;li&gt;A Partitioner that will partition the log data for x amount of days in the past on setup&lt;/li&gt;
&lt;li&gt;A partitioner to daily partition new log data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A word on Partitioning Data in Athena
&lt;/h2&gt;

&lt;p&gt;By partitioning Athena tables, you can restrict the amount of data scanned by each Athena query, &lt;strong&gt;thus improving performance and reducing cost&lt;/strong&gt;. You can partition your data by any key. A common practice is to partition log data based on time. In this case, the data is partitioned by year, month, and day, which makes the most sense for time-based log data.&lt;/p&gt;
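&lt;p&gt;&lt;em&gt;As a sketch (the table, workgroup, and column names are assumptions based on the setup above), a query that filters on the partition keys scans only one day of data:&lt;/em&gt;&lt;/p&gt;

```shell
# Filtering on the partition keys (year/month/day) restricts the scan to a
# single day's worth of logs, which is what cuts query time and cost.
aws athena start-query-execution \
  --work-group logs \
  --query-string "SELECT srcaddr, dstaddr, action
                  FROM vpcflowlogs_dev
                  WHERE year = '2022' AND month = '01' AND day = '12'
                  LIMIT 100;"
```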

&lt;p&gt;To verify that a table is correctly partitioned, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;show&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt; &lt;span class="n"&gt;vpcflowlogs_non_prod&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gvusSCm6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ldf73p7x4oeup4dvldaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gvusSCm6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ldf73p7x4oeup4dvldaj.png" alt="Partitions" width="744" height="712"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enjoy querying your logs and until next time!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>aws</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>How-to Use Static Stability to Design a Resilient Architecture</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Thu, 23 Dec 2021 08:26:49 +0000</pubDate>
      <link>https://dev.to/aws-heroes/how-to-use-static-stability-to-design-a-resilient-architecture-17p9</link>
      <guid>https://dev.to/aws-heroes/how-to-use-static-stability-to-design-a-resilient-architecture-17p9</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;When building a resilient architecture, it can help to separate systems along the control and data plane boundary. Once separated, focus on the data plane when targeting higher availability. Because of its relative simplicity, a data plane is much better suited for High Availability than a control plane. Finally, introduce static stability by preparing for impairments before they happen; relying on reacting to impairments as they happen is less effective. A statically stable design is achieved once the overall system keeps working even when a dependency becomes impaired!&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Very recently, while browsing some online AWS documentation, I landed on a page about &lt;a href="https://aws.amazon.com/builders-library/static-stability-using-availability-zones/" rel="noopener noreferrer"&gt;Static stability using Availability Zones&lt;/a&gt;. Although that page had nothing to do with what I was looking for, I found the title quite intriguing. I had never heard of the concept before. Intrigued, I decided to read the paper. It turned out to be one of the more interesting papers I've read in a while. The remainder of this post is a brief rehash of the paper, focused on the core concept and supplemented with my own thoughts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Important note: although the paper originates from AWS, the concepts are not limited to the AWS Ecosystem but are broadly applicable in system design.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Before we start
&lt;/h3&gt;

&lt;p&gt;Because static stability is all about resilience, I need once more to rant a little on the importance of &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/disaster-recovery-dr-objectives.html" rel="noopener noreferrer"&gt;RTO and RPO&lt;/a&gt; first. I find it hard to understand that systems are sometimes designed without knowing their recovery objectives. It seems impossible to make a great design without them. A design that doesn't consider the recovery objectives from the start will inevitably result in over- or under-engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always sort out the recovery objectives first before starting a system design!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Static Stability?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In a statically stable design, the overall system keeps working even when a dependency becomes impaired.&lt;/strong&gt; Perhaps the system doesn’t see any updated information that its dependency was supposed to have delivered. &lt;strong&gt;However, everything it was doing before the dependency became impaired continues to work despite the impaired dependency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before we drill further down, we first need to talk about Availability Zones, because on AWS, Availability Zones are a main pillar of statically stable designs. Therefore, a quick refresher on the definition of an Availability Zone: &lt;strong&gt;"an Availability Zone is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over the years, when building systems on top of Availability Zones, AWS has learned to be ready for impairments before they happen&lt;/strong&gt;. A less effective approach might be to deploy to multiple Availability Zones with the expectation that, should there be an impairment within one Availability Zone, the service will scale up in other Availability Zones and be restored to full health. &lt;strong&gt;This approach is less effective because it relies on reacting to impairments as they happen, rather than being prepared for those impairments before they happen. In other words, it lacks static stability.&lt;/strong&gt; In contrast, a more effective, statically stable service would over-provision its infrastructure to the point where it would continue operating correctly without having to launch anything new, even if an Availability Zone were to become impaired.&lt;/p&gt;
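&lt;p&gt;&lt;em&gt;A back-of-the-envelope sketch of what "over-provision" means here: with N Availability Zones, each zone must be sized so that the remaining N-1 zones still cover full peak load.&lt;/em&gt;&lt;/p&gt;

```shell
# Percentage of peak capacity each AZ must carry so that losing one AZ
# still leaves 100% available, without launching anything new.
overprovision() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", 100 / (n - 1) }'
}
overprovision 3  # each of 3 AZs sized for 50.0% of peak
overprovision 2  # each of 2 AZs sized for 100.0% of peak
```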

&lt;p&gt;However, understanding static stability itself is only half the story. To properly apply the pattern to a system design, you need to understand a second concept: control and data planes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0thpbk6ujd78k58ygwa.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0thpbk6ujd78k58ygwa.jpeg" alt="Control Plane"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Control and Data plane?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A control plane is the machinery involved in making changes to a system&lt;/strong&gt; (adding resources, deleting resources, modifying resources) and getting those changes propagated to wherever they need to go to take effect. &lt;strong&gt;A data plane, in contrast, is the daily business of those resources, that is, what it takes for them to function.&lt;/strong&gt; Furthermore, it's crucial to understand that the data plane is usually far simpler than its control plane. &lt;strong&gt;As a result of its relative simplicity, a data plane is much better suited for targeting a higher availability than a control plane.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Separate systems along the control and data plane boundary
&lt;/h3&gt;

&lt;p&gt;The concepts of control planes, data planes, and static stability are broadly applicable in various system designs. If you're able to decompose a system into its control plane and data plane, that separation can be a helpful conceptual tool for designing highly available services, for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's typical for the availability of the data plane to be even more critical to the success of the customers of a service than the control plane.&lt;/li&gt;
&lt;li&gt;It's typical for the data plane to operate at a higher volume (often by orders of magnitude) than its control plane. Thus, it's better to keep them separate so that each can be scaled according to its own relevant scaling dimensions.&lt;/li&gt;
&lt;li&gt;AWS found over the years that a system's control plane tends to have more moving parts than its data plane, so it's statistically more likely to become impaired for that reason alone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting those considerations all together, it seems a best practice to separate systems along the control and data plane boundary.&lt;/p&gt;

&lt;p&gt;Now, let's zoom in on some examples of Static Stable Designs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design 1: Static Stability using an Active-Active setup on Availability Zones
&lt;/h3&gt;

&lt;p&gt;A common example of an active-active design on AZs is a load-balanced HTTPS service. The diagram below shows a public-facing Load Balancer providing an HTTPS service. The load balancer's target is an Auto Scaling group that spans three Availability Zones. This is an example of active-active high availability using Availability Zones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbj33szscxvfobxd8dl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbj33szscxvfobxd8dl0.png" alt="Active-Active setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the event of an Availability Zone impairment, this design requires no action.&lt;/strong&gt; The instances in the impaired Availability Zone will start failing health checks, and the Load Balancer will shift traffic away from them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design 2: Static Stability using an Active-standby on Availability Zones
&lt;/h3&gt;

&lt;p&gt;The next diagram shows an Amazon RDS database. In this case, the RDS database setup spans multiple Availability Zones. In a multi-AZ setup, Amazon RDS puts the standby candidate in a different Availability Zone from the primary node. When the Availability Zone with the primary node becomes impaired, no infrastructure changes are required: in the background, RDS manages the failover and repoints the DNS name to the new primary in the working Availability Zone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10pacbzzpgjx4fc0acxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10pacbzzpgjx4fc0acxd.png" alt="Active-standby"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These two patterns have in common that both of them had already provisioned the capacity they’d need in the event of an Availability Zone impairment well in advance of any actual impairment.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The principle of independent Availability Zones
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A third way to use the principle of independent Availability Zones is to design a packet flow to stay within an Availability Zone rather than crossing boundaries.&lt;/strong&gt; Keeping network traffic local to the Availability Zone is worth exploring in more detail. The following diagram illustrates a highly available external service, shown in orange, that depends on another, internal service, shown in green. A straightforward design treats both of these services as consumers of independent EC2 Availability Zones. &lt;br&gt;
Each of the orange and green services is fronted by a Load Balancer, and each service has a well-provisioned fleet of backend hosts spread across three Availability Zones. One highly available regional service calls another highly available regional service. This is a simple design, and for many of the services we’ve built, it's a good design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54dnrwqyck3m2q6yi83y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54dnrwqyck3m2q6yi83y.png" alt="HA Service"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suppose, however, that the green service is a foundational service. That is, suppose it is intended not only to be highly available but also, itself, to serve as a building block for providing Availability Zone independence. In that case, you might instead design it as three instances of a zone-local service, on which we follow Availability Zone-aware deployment practices.&lt;/strong&gt; The following diagram illustrates the design in which a highly available regional service calls a highly available zonal service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif3zjau9c9mj1ty38mgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif3zjau9c9mj1ty38mgl.png" alt="Zone local"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason to design a building-block service to be Availability Zone independent comes down to simple arithmetic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a system where one highly available regional service calls another highly available regional service and a request is sent through the system, then with some simplifying assumptions, the chance of the request avoiding the impaired Availability Zone is 2/3 * 2/3 = 4/9. That is, the request has worse than even odds of steering clear of the event.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In contrast, if you built the green service to be a zonal service&lt;/strong&gt;, then the hosts in the orange service can call the green endpoint in the same Availability Zone. With this architecture, &lt;strong&gt;the chances of avoiding the impaired Availability Zone are 2/3&lt;/strong&gt;. If N services are a part of this call path, then these numbers generalize to (2/3)^N for N regional services versus remaining constant at 2/3 for N zonal services!&lt;/p&gt;
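&lt;p&gt;&lt;em&gt;The arithmetic above is easy to check. A small sketch, assuming one of three Availability Zones is impaired and each regional hop independently has a 2/3 chance of avoiding it:&lt;/em&gt;&lt;/p&gt;

```shell
# Odds that a request avoids the impaired AZ across n regional hops: (2/3)^n.
# With zone-local routing, only the first hop matters: constant 2/3.
regional_odds() { awk -v n="$1" 'BEGIN { printf "%.4f\n", (2/3)^n }'; }
zonal_odds()    { awk 'BEGIN { printf "%.4f\n", 2/3 }'; }
regional_odds 2  # 0.4444 (= 4/9, worse than even odds)
zonal_odds       # 0.6667 (constant, regardless of call depth)
```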

&lt;p&gt;I hope you enjoyed reading about the concept of static stability and the principle of independent AZs as much as I did. 😉&lt;/p&gt;

&lt;p&gt;Kudos to &lt;a href="https://aws.amazon.com/builders-library/authors/becky-weiss/" rel="noopener noreferrer"&gt;Becky Weiss&lt;/a&gt; and &lt;a href="https://aws.amazon.com/builders-library/authors/mike-furr/" rel="noopener noreferrer"&gt;Mike Furr&lt;/a&gt;, the authors of the original AWS paper about Static Stability.&lt;/p&gt;

&lt;p&gt;Until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>AWS RDS Backup Dilemma: Why It is Hard to Do Good on RTO and RPO Simultaneously</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Thu, 09 Dec 2021 20:11:56 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-rds-backup-dilemma-why-it-is-hard-to-do-good-on-rto-and-rpo-simultaneously-24pl</link>
      <guid>https://dev.to/aws-heroes/aws-rds-backup-dilemma-why-it-is-hard-to-do-good-on-rto-and-rpo-simultaneously-24pl</guid>
      <description>&lt;h2&gt;
  
  
  Objectives and Disaster recovery
&lt;/h2&gt;

&lt;p&gt;Because a disaster event can potentially take down a workload, your objective for Disaster Recovery should be bringing your workload back up or avoiding downtime altogether. &lt;strong&gt;A recovery strategy itself is most often built upon two objectives.&lt;/strong&gt; RTO and RPO:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recovery time objective (RTO)&lt;/strong&gt; is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery point objective (RPO)&lt;/strong&gt; is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  (Database) Backups
&lt;/h2&gt;

&lt;p&gt;Needless to say, a vast majority of software projects still contain some kind of database in their setup. That persistent data store is often also one of the main subjects in the disaster recovery strategy. Backups are the answer to this issue. &lt;strong&gt;In the context of AWS RDS, there are two options for doing Database backups:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RDS DB snapshots&lt;/strong&gt;: a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDS Point-in-Time Recovery&lt;/strong&gt;: this allows you to restore a DB instance to a specific point in time, creating a new DB instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Dilemma
&lt;/h2&gt;

&lt;p&gt;So much for the introduction. To make my point, we need to set some objectives first. I don't want to push it to the extreme, so let's put some reasonable figures on RTO and RPO:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTO: 30 mins, preferably less&lt;/li&gt;
&lt;li&gt;RPO: 1h, preferably less&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No question, the less data we lose when recovering, the better. With this in mind, Point-in-Time Recovery always seems the better option.&lt;/strong&gt; But is it? Have you ever tried doing a point-in-time recovery? Although it offers a better RPO, is there any difference regarding RTO? While not all that obvious, yes there is! &lt;strong&gt;Point-in-Time Recovery requires more time to restore your data.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Point-in-Time Recovery behind the scenes
&lt;/h3&gt;

&lt;p&gt;To understand why a point-in-time restore is slower, we need to know how it works. &lt;strong&gt;To do its point-in-time magic, regular snapshots are taken from time to time. On top of that, to lose as little data as possible, a &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/point-in-time-recovery-binlog.html"&gt;binary log&lt;/a&gt; containing all operations (aka oplog or operations log) is stored (on AWS S3).&lt;/strong&gt; So whenever you run a point-in-time recovery, there's added time to replay the operations log on top of the regular snapshot restore time. Of course, the extra time will depend on the age of the snapshot and the average number of operations you run on the database. &lt;strong&gt;But inevitably, it will take you extra time.&lt;/strong&gt;&lt;/p&gt;
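&lt;p&gt;&lt;em&gt;For reference, a point-in-time restore looks like the sketch below (instance identifiers and timestamp are hypothetical); behind the scenes, AWS restores the latest snapshot before the given time and then replays the binary log up to it:&lt;/em&gt;&lt;/p&gt;

```shell
# Restore to a specific moment; the binlog replay on top of the snapshot
# is the part that adds to your RTO.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2021-12-09T19:45:00Z
```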

&lt;p&gt;So here's your dilemma: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Do you choose regular DB snapshots offering a likely lower RPO but faster RTO" vs. "Do you choose Point in time restore offering better RPO but slower RTO?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;No, you can't have the best RPO without paying a price for it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What have I learned?
&lt;/h2&gt;

&lt;p&gt;In my case, it's okay to lose some data; in fact, my service can live with an RPO of 1 hour. On the other hand, the better our RTO, the happier the business will be in case of a catastrophe. &lt;br&gt;
&lt;strong&gt;Those objectives made us drop Point-in-Time Recovery and take hourly snapshots instead of just a daily backup. This new approach offers us a one-hour RPO with the best possible RTO.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After all, testing the Disaster Recovery Plan to sort out all those assumptions taught me a lot 😉&lt;/p&gt;

&lt;h3&gt;
  
  
  A small update: DB Snapshot Restore uses lazy loading
&lt;/h3&gt;

&lt;p&gt;After reading this blog, fellow AWS Hero &lt;a href="https://aws.amazon.com/developer/community/heroes/renato-losio/"&gt;Renato Losio&lt;/a&gt; pointed out another factor to consider regarding RTO when an RDS snapshot restore is performed: &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html"&gt;Lazy loading&lt;/a&gt;. It is something I was completely unaware of, although it's all in the docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can create a new DB instance by restoring from a DB snapshot. You can use the restored DB instance as soon as its status is &lt;code&gt;available&lt;/code&gt;. However, the DB instance continues to load data in the background. This is known as lazy loading. Furthermore, If you access data that hasn't been loaded yet, the DB instance immediately downloads the requested data from Amazon S3, and then continues loading the rest of the data in the background.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There’s even a way to diminish these effects:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To help mitigate the effects of lazy loading on tables to which you require quick access, you can perform operations that involve full-table scans, such as &lt;code&gt;SELECT *&lt;/code&gt;. This allows Amazon RDS to download all of the backed-up table data from S3.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From: &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html"&gt;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kudos to Renato for making me aware of this!&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS re:Invent 2021 recap by a DevTools Hero</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Mon, 06 Dec 2021 11:02:05 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-reinvent-2021-recap-by-a-devtools-hero-147e</link>
      <guid>https://dev.to/aws-heroes/aws-reinvent-2021-recap-by-a-devtools-hero-147e</guid>
      <description>&lt;p&gt;With another amazing re:Invent edition behind us, It's time for a little recap. This edition was also my first as an AWS Hero. Looking back, I must say AWS knows how to please its community members. This year AWS even gave us our own special AWS Hero Lounge! Kudos AWS for the overall support and pampering. It was crazy. &lt;/p&gt;

&lt;p&gt;Before I start my recap, I want to say that even with all the nasty Covid-19 restrictions, I really enjoyed this year's re:Invent. Personally, I find it hard to 'virtually' attend conferences. Being at a conference in person, feeling the vibes and atmosphere, is kind of a must for me.&lt;/p&gt;

&lt;p&gt;So here's my first piece of advice: if you feel comfortable with it, I would strongly recommend attending re:Invent &lt;strong&gt;in person&lt;/strong&gt;.  All the great sessions are just a tiny part of the conference's value. What makes attending re:Invent stand out for me are all the AWS experts hanging around. The technical knowledge and expertise of all the people you can talk to at the conference are mind-blowing. It's a great place to pick some brains 😛&lt;/p&gt;

&lt;h2&gt;
  
  
  Content-wise
&lt;/h2&gt;

&lt;p&gt;Before jumping into the list of the newly announced stuff I like the most, I must stress that such a list is very personal and heavily influenced by someone's background. My own developer background and strong interest in DevOps and security are reflected in this list. On top of that, the list also matches the services I primarily work with daily.&lt;/p&gt;

&lt;p&gt;All that said, here we go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Integration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sqs-dead-letter-queue-management-experience-queues/"&gt;Amazon SQS Enhances Dead-letter Queue Management Experience For Standard Queues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-lambda-partial-batch-response-sqs-event-source/"&gt;AWS Lambda now supports partial batch response for SQS as an event source&lt;/a&gt; With this feature, when messages on an SQS queue fail to process, Lambda marks a batch of records in a message queue as partially successful and allows reprocessing of only the failed records.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Containers / Compute / Serverless
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://beabetterdev.com/2021/11/21/aws-releases-lambda-function-urls-finally/"&gt;AWS Releases Lambda Function URLs finally...NOT&lt;/a&gt;. I guess it's still only a matter of time before this one gets out.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-ecr-cache-repositories/"&gt;Amazon ECR announces pull through cache repositories&lt;/a&gt;. This new feature allows you to automatically sync images from publicly accessible registries. Yes, I was waiting on that one ;-)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-compute-optimizer-resource-efficiency-metrics/"&gt;AWS Compute Optimizer now offers resource efficiency metrics&lt;/a&gt;. AWS Compute Optimizer now helps you quickly identify and prioritize top optimization opportunities through two new sets of dashboard-level metrics: savings opportunity and performance improvement opportunity.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-compute-optimizer-enhanced-infrastructure-metrics-ec2-instances/"&gt;AWS Compute Optimizer now offers enhanced infrastructure metrics, a new feature for EC2 recommendations&lt;/a&gt;. AWS Compute Optimizer now offers enhanced infrastructure metrics, a paid feature that when activated, enhances your Amazon EC2 instance and Auto Scaling group recommendations by capturing monthly or quarterly utilization patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-dynamodb-standard-infrequent-access-table-class/"&gt;Amazon DynamoDB announces the new Amazon DynamoDB Standard-Infrequent Access table class, which helps you reduce your DynamoDB costs by up to 60 percent&lt;/a&gt; Another cool feature to help bringing those bills down effortless. 👌&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Developer Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-cloud-development-kit-cdk-generally-available/"&gt;AWS Cloud Development Kit (AWS CDK) v2&lt;/a&gt; and  &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-construct-hub-availability/"&gt;Construct Hub&lt;/a&gt; are now generally available&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-amplify-studio/"&gt;Introducing AWS Amplify Studio&lt;/a&gt;: a visual development environment that offers frontend developers new features to accelerate UI development with minimal coding.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-sdk-swift-developer-preview/"&gt;AWS SDK for Swift (Developer Preview)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-sdk-kotlin-developer-preview/"&gt;AWS SDK for Kotlin (Developer Preview)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-sdk-rust-developer-preview/"&gt;AWS SDK for Rust (Developer Preview)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Management &amp;amp; Governance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-cloudwatch-metrics-insights-preview/"&gt;Introducing Amazon CloudWatch Metrics Insights&lt;/a&gt;. As a fast, flexible, SQL based query engine, Metrics Insights enables you to identify trends and patterns across millions of metrics in real time and helps you use these insights to reduce time to resolution.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-cloudwatch-rum-applications-client-side-performance/"&gt;Introducing Amazon CloudWatch RUM for monitoring applications’ client-side performance&lt;/a&gt;. Amazon CloudWatch RUM is a real-user monitoring capability that helps you identify and debug issues in the client-side on web applications and enhance end user’s digital experience.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-cloudwatch-evidently-feature-experimentation-safer-launches/"&gt;Introducing Amazon CloudWatch Evidently for feature experimentation and safer launches&lt;/a&gt;. This is a new Amazon CloudWatch capability that makes it easy for developers to introduce experiments and feature management in their application code. CloudWatch Evidently may be used for two similar but distinct use-cases: implementing dark launches, also known as feature flags, and A/B testing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-virtual-private-cloud-vpc-announces-ip-address-manager-ipam/"&gt;Amazon Virtual Private Cloud (VPC) announces IP Address Manager (IPAM) to help simplify IP address management on AWS&lt;/a&gt; Using IPAM, you can automate IP address assignment to VPCs, eliminating the need to use spreadsheet-based or homegrown IP planning applications, which can be hard to maintain and time-consuming. This automation helps remove delays in on-boarding new applications or growing existing applications, by enabling you to assign IP addresses to your VPCs in seconds. IPAM also automatically tracks critical IP address information, including its AWS account, Amazon VPC, and routing and security domain, eliminating the need to manually track or do bookkeeping for IP addresses.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-chatbot-management-resources-slack/"&gt;AWS Chatbot now supports management of AWS resources in Slack (Preview)&lt;/a&gt;. This feature allows you to use AWS Chatbot to manage AWS resources and remediate issues in AWS workloads by running AWS CLI commands from Slack channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/aws-shield-advanced-application-layer-ddos-mitigation/"&gt;AWS Shield Advanced introduces automatic application-layer DDoS mitigation&lt;/a&gt;. AWS Shield Advanced now automatically protects web applications by blocking application layer (Layer 7) DDoS events with no manual intervention needed by you or the AWS Shield Response Team (SRT)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-vpc-network-access-analyzer/"&gt;Amazon Virtual Private Cloud (VPC) announces Network Access Analyzer to help you easily identify unintended network access&lt;/a&gt; Amazon VPC Network Access Analyzer is a new feature that enables you to identify unintended network access to your resources on AWS. Using Network Access Analyzer, you can verify whether network access for your Virtual Private Cloud (VPC) resources meets your security and compliance guidelines &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-inspector-continual-vulnerability-management/"&gt;AWS announces the new Amazon Inspector for continual vulnerability management&lt;/a&gt; This is a big one. In a nutshell the new Inspector provides:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Continual, automated assessment scans&lt;/strong&gt;—replaces periodic, manual scanning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated resource discovery&lt;/strong&gt; – once enabled, the new Amazon Inspector automatically discovers all running Amazon EC2 instances and Amazon ECR repositories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New support for container-based workloads&lt;/strong&gt;—workloads are now assessed on both EC2 and container infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with AWS Organizations—&lt;/strong&gt;allowing security and compliance teams to enable and take advantage of Amazon Inspector across all accounts in an organization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removal of the stand-alone Amazon Inspector scanning agent&lt;/strong&gt;—assessment scanning now uses the widely deployed AWS Systems Manager agent, eliminating the need for a separate agent installation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved risk scoring&lt;/strong&gt; —a highly contextualized risk score is now generated for each finding making it easier to identify the most critical vulnerabilities to address as a priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Amazon EventBridge&lt;/strong&gt; —integrate with event management and workflow systems such as Splunk and Jira. And, you can trigger automated remediation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration with AWS Security Hub&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-s3-glacier-instant-retrieval-storage-class/"&gt;Announcing the new Amazon S3 Glacier Instant Retrieval storage class.  The lowest cost archive storage with milliseconds retrieval.&lt;/a&gt;
This new Glacier archive storage class delivers the lowest cost storage for long-lived data that is rarely accessed and requires milliseconds retrieval. 
In fact, in combination with the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/s3-intelligent-tiering-archive-instant-access-tier/"&gt;Amazon S3 Intelligent-Tiering storage class this automatically saves up to 68%&lt;/a&gt; for data not accessed in the last 90 days. Really nice 💪&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-s3-object-ownership-simplify-access-management-data-s3/"&gt;Amazon S3 Object Ownership can now disable access control lists to simplify access management for data in S3&lt;/a&gt;. This new S3 Object Ownership setting, &lt;em&gt;'Bucket owner enforced'&lt;/em&gt;, disables access control lists (ACLs), simplifying access management for data stored in S3. When you apply this bucket-level setting, every object in an S3 bucket is owned by the bucket owner, and ACLs are no longer used to grant permissions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-backup-amazon-s3-backup/"&gt;Announcing preview of AWS Backup for Amazon S3&lt;/a&gt;. This allows you to create a single policy in AWS Backup to automate the protection of application data stored in S3 alone.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-s3-event-notifications-amazon-eventbridge-build-advanced-serverless-applications/"&gt;Amazon S3 Event Notifications with Amazon EventBridge help you build advanced serverless applications faster&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yeah, I'm sure I missed at least a few announcements 😉&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Ransomware Mitigation: The New Vault Lock for AWS Backup</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Fri, 05 Nov 2021 16:06:05 +0000</pubDate>
      <link>https://dev.to/aws-heroes/ransomware-mitigation-the-new-vault-lock-for-aws-backup-3f89</link>
      <guid>https://dev.to/aws-heroes/ransomware-mitigation-the-new-vault-lock-for-aws-backup-3f89</guid>
      <description>&lt;p&gt;Very recently, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/10/aws-backup-backup-protection-aws-backup-vault-lock/"&gt;AWS announced Vault Lock for AWS backup&lt;/a&gt;. This new feature enables the protection of backups from accidental or malicious actions.  Behind the scenes, this extra safeguard is made possible by storing backups using a Write-Once-Read-Many (WORM) model. &lt;br&gt;
Additionally, using a simple setting, you can now also prevent users from deleting backups or changing their retention periods, providing an additional layer of data protection!&lt;/p&gt;

&lt;p&gt;The main reason to rehash this? Unique features like this seem to stay under the radar way too often. Secondly, if you already use AWS Backup, then enabling this extra protection is almost effortless.&lt;/p&gt;

&lt;p&gt;Here's an example of AWS Backup Vault using Locks in CloudFormation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;SomeBackupVault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Backup::BackupVault&lt;/span&gt;
    &lt;span class="na"&gt;DeletionPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
    &lt;span class="na"&gt;UpdateReplacePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;BackupVaultName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SomeBackupVault&lt;/span&gt;
      &lt;span class="na"&gt;Notifications&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;BackupVaultEvents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BACKUP_JOB_FAILED&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;BACKUP_JOB_EXPIRED&lt;/span&gt;
        &lt;span class="na"&gt;SNSTopicArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;AlertSnsTopic&lt;/span&gt;
      &lt;span class="na"&gt;LockConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ChangeableForDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
        &lt;span class="na"&gt;MaxRetentionDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;180&lt;/span&gt;
        &lt;span class="na"&gt;MinRetentionDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14&lt;/span&gt;

  &lt;span class="na"&gt;SomeBackupPlan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Backup::BackupPlan&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;BackupPlan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;BackupPlanName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SomeBackupPlan&lt;/span&gt;
        &lt;span class="na"&gt;BackupPlanRule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;RuleName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Daily14DaysRetention&lt;/span&gt;
            &lt;span class="na"&gt;TargetBackupVault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SomeBackupVault&lt;/span&gt;
            &lt;span class="na"&gt;ScheduleExpression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron(0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*)"&lt;/span&gt;
            &lt;span class="na"&gt;StartWindowMinutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
            &lt;span class="na"&gt;Lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;DeleteAfterDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14&lt;/span&gt;

  &lt;span class="na"&gt;TagBasedBackupSelection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Backup::BackupSelection&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;BackupSelection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;SelectionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TagBasedBackupSelection&lt;/span&gt;
        &lt;span class="na"&gt;IamRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s"&gt;arn:${AWS::Partition}:iam::${AWS::AccountId}:role/service-role/AWSBackupDefaultServiceRole&lt;/span&gt;
        &lt;span class="na"&gt;ListOfTags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ConditionType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRINGEQUALS&lt;/span&gt;
           &lt;span class="na"&gt;ConditionKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;
           &lt;span class="na"&gt;ConditionValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily&lt;/span&gt;
      &lt;span class="na"&gt;BackupPlanId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SomeBackupPlan&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new properties to enable a Vault lock are under the &lt;code&gt;LockConfiguration&lt;/code&gt; key of the &lt;code&gt;AWS::Backup::BackupVault&lt;/code&gt; resource:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChangeableForDays: specifies the number of days before the lock date. For example, setting ChangeableForDays to 30 on Jan. 1, 2022 at 8pm UTC will set the lock date to Jan. 31, 2022 at 8pm UTC. &lt;strong&gt;AWS Backup enforces a 72-hour cooling-off period before Vault Lock takes effect and becomes immutable. Therefore, you must set ChangeableForDays to 3 or greater.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;MaxRetentionDays: specifies the maximum retention period that the vault retains its recovery points.&lt;/li&gt;
&lt;li&gt;MinRetentionDays: specifies the minimum retention period that the vault retains its recovery points.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From: &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-backup-backupvault-lockconfigurationtype.html"&gt;AWS::Backup::BackupVault LockConfigurationType&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to go more in-depth on this feature, check out: &lt;a href="https://aws.amazon.com/blogs/storage/enhance-the-security-posture-of-your-backups-with-aws-backup-vault-lock/"&gt;Enhance the security posture of your backups with AWS Backup Vault Lock&lt;/a&gt;. You'll also find a step-by-step walkthrough for enabling this feature via the AWS Web Console in that post.&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>security</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS API Gateway Best Practices in-depth</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Mon, 14 Jun 2021 15:30:01 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-api-gateway-best-practices-in-depth-4l0n</link>
      <guid>https://dev.to/aws-heroes/aws-api-gateway-best-practices-in-depth-4l0n</guid>
      <description>&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;Forgive me, the &lt;a href="https://en.wikipedia.org/wiki/Bart_Simpson" rel="noopener noreferrer"&gt;Bart Simpson&lt;/a&gt; in me couldn't resist using 'Best Practices' once again. Sure, there's a lot to be said for &lt;a href="https://www.satisfice.com/blog/archives/5164" rel="noopener noreferrer"&gt;stamping out “best practice”&lt;/a&gt;, and I agree with most arguments in that article. Framing something as a best practice is subjective, and it can come across as arrogant. But, as grown-ups, I'm convinced most of us know how to deal with the term and see why it's used. So pick the practices you agree with, the ones you consider 'best' practices yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  A front door: The importance of API Gateway
&lt;/h2&gt;

&lt;p&gt;I have the feeling that the importance of API Gateway in a setup is sometimes overlooked. AWS wrote down these practices themselves (also using the term 'Best Practices' 😉). But IMHO, their &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/security-best-practices.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is a tad too brief. It also generally lacks a 'WHY'.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 'WHY'
&lt;/h3&gt;

&lt;p&gt;Assuming the vast majority of API Gateways are public-facing, it's easy to picture an API Gateway as a front door. One of the characteristics of a front door is access control: who to let in, and how many to let in (at once). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o2512zlg1dag8q5tmgq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o2512zlg1dag8q5tmgq.jpeg" alt="Front door"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nowadays, a front door camera even provides a track record of everyone who came to your door, logging all rejected and allowed entrance calls. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That being said, API Gateway is a front door, treat it like one!&lt;/strong&gt; It's begging for attention security-wise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of Logs
&lt;/h2&gt;

&lt;p&gt;People less familiar with security easily miss the importance of logs, but whoever has encountered a security breach will endorse their significance. When something gets compromised, the investigation often starts with the access logs. &lt;strong&gt;That is why every public endpoint (web server, load balancer, API gateway, ...) should have access logs enabled. If you ask me, access logs should be mandatory and non-negotiable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Knowing this, I find it hard to understand why AWS Security Hub only recommends &lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-standards-fsbp-controls.html#apigateway-1-remediation" rel="noopener noreferrer"&gt;API logs to be enabled&lt;/a&gt;. I have no clue why access logs didn't make it into the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/06/aws-security-hub-adds-16-new-controls-to-its-foundational-securi/" rel="noopener noreferrer"&gt;Foundational Security Best Practices standard for CSPM&lt;/a&gt; 😟. I hope AWS will settle this in the next iteration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access Log Retention Period
&lt;/h3&gt;

&lt;p&gt;As a retention period for access logs, I recommend at least one year. To save costs, you could retain them in CloudWatch for only one month and, after those 30 days, transfer them to cheaper storage like S3. &lt;/p&gt;

&lt;p&gt;If a year sounds long, note that breaches sometimes stay under the radar for quite some time. Ensure your track record is long enough to allow a successful investigation in case of trouble.&lt;/p&gt;

&lt;p&gt;For the ones using CloudFormation and AWS SAM, here's the IaC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;ApiAccessLogGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Logs::LogGroup&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;servicename-api-${environment}-ApiGateway-cfn"&lt;/span&gt;
      &lt;span class="na"&gt;RetentionInDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

  &lt;span class="na"&gt;SomeServiceGateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Serverless::Api&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Someservice&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GW&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${environment}"&lt;/span&gt;
      &lt;span class="na"&gt;StageName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1"&lt;/span&gt;
      &lt;span class="na"&gt;EndpointConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REGIONAL&lt;/span&gt;
      &lt;span class="na"&gt;TracingEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
      &lt;span class="na"&gt;MethodSettings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;LoggingLevel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
          &lt;span class="na"&gt;MetricsEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
          &lt;span class="na"&gt;DataTraceEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
          &lt;span class="na"&gt;ResourcePath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/*"&lt;/span&gt;
          &lt;span class="na"&gt;HttpMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;AccessLogSetting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;DestinationArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;ApiAccessLogGroup.Arn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Protecting an Unauthenticated API Gateway
&lt;/h2&gt;

&lt;p&gt;Unauthenticated API routes are open to the world, so it's recommended to limit their use. It's important to protect these unauthenticated APIs against common risks, such as denial-of-service attacks and consumer errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS WAF
&lt;/h3&gt;

&lt;p&gt;Applying AWS WAF to API Gateway helps to protect an application from SQL injection and cross-site scripting attacks. &lt;strong&gt;It's your first line of defense.&lt;/strong&gt;&lt;/p&gt;
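
&lt;p&gt;As a minimal CloudFormation sketch, attaching an existing WAFv2 web ACL to an API Gateway stage could look like this (the &lt;code&gt;SomeWebAcl&lt;/code&gt; and &lt;code&gt;SomeRestApi&lt;/code&gt; references are hypothetical):&lt;/p&gt;

```yaml
  ApiWafAssociation:
    Type: AWS::WAFv2::WebACLAssociation
    Properties:
      # ARN of the REST API stage to protect:
      # arn:aws:apigateway:{region}::/restapis/{api-id}/stages/{stage-name}
      ResourceArn: !Sub "arn:aws:apigateway:${AWS::Region}::/restapis/${SomeRestApi}/stages/v1"
      # ARN of a REGIONAL-scoped AWS::WAFv2::WebACL defined elsewhere
      WebACLArn: !GetAtt SomeWebAcl.Arn
```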

&lt;h3&gt;
  
  
  AWS CloudFront as traffic absorber
&lt;/h3&gt;

&lt;p&gt;In case of a denial of service attack on an unauthenticated API, it’s possible to exhaust API throttling limits, Lambda concurrency, or DynamoDB provisioned read capacity on an underlying table. Putting an AWS CloudFront distribution in front of the API endpoint with an appropriate time-to-live configuration may help absorb traffic in a DoS attack without changing the underlying solution for fetching the data.&lt;/p&gt;
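
&lt;p&gt;For illustration, a minimal CloudFront distribution in front of a regional API endpoint could be sketched like this (all names are hypothetical; &lt;code&gt;ApiCachePolicy&lt;/code&gt; would be a cache policy with a TTL that fits your data):&lt;/p&gt;

```yaml
  ApiCacheDistribution:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Enabled: true
        Origins:
          - Id: api-origin
            # Regional API Gateway endpoint (hypothetical API reference)
            DomainName: !Sub "${SomeRestApi}.execute-api.${AWS::Region}.amazonaws.com"
            CustomOriginConfig:
              OriginProtocolPolicy: https-only
        DefaultCacheBehavior:
          TargetOriginId: api-origin
          ViewerProtocolPolicy: https-only
          # Cache policy (hypothetical) whose TTL lets CloudFront absorb
          # traffic spikes instead of your API
          CachePolicyId: !Ref ApiCachePolicy
```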

&lt;p&gt;See: "&lt;a href="https://aws.amazon.com/blogs/compute/operating-lambda-building-a-solid-security-foundation-part-2/" rel="noopener noreferrer"&gt;Operating Lambda: Building a solid security foundation – Part 2&lt;/a&gt;" for more information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use API keys and Throttling
&lt;/h3&gt;

&lt;p&gt;API Gateway allows per-consumer throttling, through usage plans, when API keys are used.&lt;/p&gt;

&lt;p&gt;Use API keys for unauthenticated APIs when possible, and never trust consumers 😉. Needless to say, contracts negotiated with (API) consumers change over time, and often those changes aren't communicated. Or maybe a consumer's business simply grows, together with the number of requests they send you. Either way, if a consumer sends you a few hundred thousand requests instead of the few hundred they promised... who will feel the pain? You or them?&lt;/p&gt;

&lt;p&gt;With this in mind, design (your API Gateway) for error. The last thing you want is a drowned service due to a consumer's mistake. &lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html" rel="noopener noreferrer"&gt;Throttling&lt;/a&gt; should be enabled by default on your API Gateway. It will protect you from resource exhaustion, or even worse, scaling to the moon (together with your AWS bill). If a consumer breaks their quota, they should get a &lt;code&gt;429 Too Many Requests&lt;/code&gt; for coloring outside the lines. Let them feel the pain, not you!&lt;/strong&gt;&lt;/p&gt;
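&lt;p&gt;A minimal sketch of such a setup in CloudFormation could look like the snippet below. It assumes an API resource named &lt;code&gt;MyApi&lt;/code&gt; with a deployed &lt;code&gt;prod&lt;/code&gt; stage; the rate, burst and quota numbers are illustrative:&lt;/p&gt;

```yaml
# Hypothetical sketch: an API key tied to a usage plan that throttles
# one consumer. Exceeding the limits returns 429 Too Many Requests.
ConsumerKey:
  Type: AWS::ApiGateway::ApiKey
  Properties:
    Enabled: true

ConsumerUsagePlan:
  Type: AWS::ApiGateway::UsagePlan
  Properties:
    ApiStages:
      - ApiId: !Ref MyApi
        Stage: prod
    Throttle:
      RateLimit: 100        # steady-state requests per second
      BurstLimit: 200       # short burst allowance
    Quota:
      Limit: 100000         # total requests per period
      Period: MONTH

ConsumerUsagePlanKey:
  Type: AWS::ApiGateway::UsagePlanKey
  Properties:
    KeyId: !Ref ConsumerKey
    KeyType: API_KEY
    UsagePlanId: !Ref ConsumerUsagePlan
```

&lt;p&gt;One usage plan per consumer keeps the blast radius small: a single misbehaving consumer hits their own limits without affecting anyone else.&lt;/p&gt;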

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Both throttling and logging are easy to enable but can be a real lifesaver. Forewarned is forearmed.&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Quirks of Software Effort Estimation</title>
      <dc:creator>Gert Leenders</dc:creator>
      <pubDate>Mon, 07 Jun 2021 17:46:38 +0000</pubDate>
      <link>https://dev.to/glnds/the-quirks-of-software-effort-estimation-1f10</link>
      <guid>https://dev.to/glnds/the-quirks-of-software-effort-estimation-1f10</guid>
<description>&lt;p&gt;One of the most complex parts of Software Development is effort estimation. Over the years, I encountered some interesting concepts and laws regarding this topic that I would like to share. Being aware of these mechanics has helped me make and interpret estimates.&lt;/p&gt;

&lt;p&gt;Here we go...&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Accuracy vs. Precision
&lt;/h2&gt;

&lt;p&gt;Probably the most important topic on the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt; is how close measured values are to each other: essentially, how many decimal places a measurement carries. &lt;br&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; is how close a measured value is to the true value. &lt;/p&gt;

&lt;p&gt;Precision is independent of accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimates should be accurate, not precise.&lt;/strong&gt; It’s fine to say something will take two to six weeks. Most likely that's accurate, although not precise. Being more precise often leads to being less accurate. I challenge you to make a meaningfully accurate estimate with to-the-minute precision 😉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L5TyaBD7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3e5mizuizbiw7tkiu9iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L5TyaBD7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3e5mizuizbiw7tkiu9iw.png" alt="Accuracy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The "Scotty Principle"
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://lifehacker.com/how-to-inflate-tasks-and-extend-due-dates-1455424470"&gt;How to Inflate Tasks and Extend Due Dates&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're not a Star Trek fan, you may not be familiar with the Scotty Principle, but it's fairly simple. When asked how long a job will take, estimate the time required to complete the job, add about 25%-50% onto that estimate, and commit to the longer estimate. Then, when you deliver ahead of schedule (or something else happens that throws you off, and you're forced to deliver on schedule instead of ahead), you're a miracle worker who really knows how to get the job done.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;On top of that, the Scotty Principle is often amplified as it echoes through every level it passes&lt;/strong&gt; (Developer -&amp;gt; Project Manager -&amp;gt; ... -&amp;gt; Sales). Because every level is unaware of the buffers added by the others, everyone in the chain applies the principle on top of the previous estimate: besides the developer, the project manager and the salesperson can each add another 25% to 50%. Unaware of the existing safety nets, each creates their own on top of the rest. &lt;/p&gt;

&lt;p&gt;In the end, a one-week job could end up estimated as an entire month. &lt;/p&gt;
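&lt;p&gt;The compounding is easy to verify with a quick back-of-the-envelope calculation. The 50% buffer and the three levels below are just the worst case sketched above:&lt;/p&gt;

```python
def padded_estimate(days: float, levels: int, buffer: float = 0.5) -> float:
    """Compound a safety buffer once for every reporting level."""
    for _ in range(levels):
        days *= 1 + buffer
    return days

# A one-week (5 working days) job passing through three levels
# (developer -> project manager -> sales), each padding by 50%:
print(round(padded_estimate(5, 3), 1))  # 16.9 working days: roughly a month
```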

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--peCeh_bz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inueqjzrtg205wtfapm9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--peCeh_bz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inueqjzrtg205wtfapm9.jpeg" alt="Software estimates"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Parkinson's law
&lt;/h2&gt;

&lt;p&gt;From &lt;a href=""&gt;Wikipedia, Parkinson's law&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Parkinson's law is the adage that "work expands so as to fill the time available for its completion".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Take a minute to sponge this up 😰. &lt;strong&gt;If you combine the Scotty Principle with Parkinson's law, the result is a zero-sum game.&lt;/strong&gt; Actually, it's even worse: due to Parkinson's law you somehow managed to eat up every buffer and still finish just in time... so what did you do with all that extra time 😏 !? &lt;/p&gt;

&lt;h2&gt;
  
  
  4. Unfounded Optimism
&lt;/h2&gt;

&lt;p&gt;From: Software Estimation: Demystifying the Black Art&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Optimism assails software estimates from all sources. On the developer side of the project, Microsoft Vice President Chris Peters observed that "You never have to fear that estimates created by developers will be too pessimistic because developers will always generate a too-optimistic schedule". In a study of 300 software projects, &lt;strong&gt;Michiel van Genuchten reported that developer estimates tended to contain an optimism factor of 20% to 30%. Although managers sometimes complain otherwise, developers don't tend to sandbag their estimates—their estimates tend to be too low!&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What to do with this? Should a manager be so gentle as to add another 30% on top of your estimate to counter this? Maybe he's already doing this (without telling you)?&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Cone of Uncertainty
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://en.wikipedia.org/wiki/Cone_of_Uncertainty"&gt;Wikipedia, Cone of Uncertainty&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In project management, the Cone of Uncertainty describes the evolution of the amount of best-case uncertainty during a project. At the beginning of a project, comparatively little is known about the product or work results, and so estimates are subject to large uncertainty. As more research and development is done, more information is learned about the project, and the uncertainty then tends to decrease, reaching 0% when all residual risk has been terminated or transferred. This usually happens by the end of the project i.e. by transferring the responsibilities to a separate maintenance group.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TVLOd4r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6n4j8sctmtnpmmxa8am5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TVLOd4r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6n4j8sctmtnpmmxa8am5.png" alt="Cone of Uncertainty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don't underestimate the power of uncertainty. At the start of a project, it could mean your estimate is four times off! So an estimate of one month could be finished in one week or in the worst case, it could take four months!&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Estimates are non-negotiable
&lt;/h2&gt;

&lt;p&gt;I can’t count the times someone immediately replied to my estimate with: “Could you finish it in less time?”&lt;br&gt;
Seriously? Where's your trust!? You ask me for an estimation and now you think I just made something up? &lt;/p&gt;

&lt;p&gt;Replies like those really irk me. Imagine I ask “How tall are you?” and you reply 69.5 inches. Wouldn't it be awkward if I replied “Could it be a bit less?” 😦. Estimates are like height measurements: non-negotiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;There's a lot to tell about Software Estimation. I would say that it's a science on its own, one which I don't master. Nevertheless, I always keep these six concepts in the back of my mind when making estimations, which helps.&lt;/p&gt;

&lt;p&gt;For further reading, I can recommend: &lt;a href="https://www.oreilly.com/library/view/software-estimation-demystifying/0735605351/"&gt;Software Estimation: Demystifying the Black Art&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enjoy and until next time!&lt;/p&gt;

</description>
      <category>agile</category>
      <category>productivity</category>
      <category>programming</category>
      <category>management</category>
    </item>
  </channel>
</rss>
