With this recent announcement from AWS, it is now possible to create up to 1 million Amazon S3 buckets in a single AWS account. Let's look at how that can be achieved using the AWS CDK, an infrastructure-as-code tool.
Disclaimer: each additional Amazon S3 bucket above 2,000 currently incurs a $0.02 monthly fee (details under Security & buckets), which adds up to a $19,960 monthly fee if 1 million Amazon S3 buckets are created in a single AWS account.
Motivation
The amount of data created each year is growing at a rapid pace, with exponential growth predicted between 2025 and 2035. This is reflected in the revenue generated by the storage market worldwide. Data storage systems need to adapt to that scale, and cloud computing infrastructure, with scalability at its core, is an alternative to consider when designing systems for the future.
In a scenario where storage needs are growing, logically grouping the data according to its type or source could be an additional requirement. In AWS, this requirement can be implemented by keeping each category in a distinct Amazon S3 bucket. An example is a data lake platform where the architects have decided to place each dataset in a separate Amazon S3 bucket for cost and governance reasons. A large organisation that generates many datasets daily could end up creating thousands of Amazon S3 buckets.
Solution
For solutions of this scale, examining organisational best practices is essential, as highlighted in this article. A dedicated platform engineering team could be assigned to concentrate on architecting and implementing complex solutions like the one presented here.
Initial Architecture
The platform engineering team might start by placing all Amazon S3 buckets in top-level AWS CDK stacks. It's a good start. However, the team quickly realises there is an issue with this approach. AWS CDK uses AWS CloudFormation for deployment, and each AWS CDK stack corresponds to a CloudFormation stack. The team discovers a current limit of 500 resources per AWS CloudFormation stack (details under Resources). This is a hard limit, and it can't be increased by a request to AWS. Creating 1 million resources therefore means spreading them across 2,000 AWS CloudFormation stacks, each composed of 500 resources, as in the `KodlotS3Stack` example below.
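A minimal sketch of what such a stack could look like; the class name, construct IDs, and loop bounds follow the article's naming but are assumptions, not the original listing:

```typescript
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// One top-level stack filled to the CloudFormation limit of 500 resources.
class KodlotS3Stack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // 500 buckets fill the stack completely; CDK generates unique
    // physical bucket names when none are specified.
    for (let slot = 1; slot <= 500; slot++) {
      new Bucket(this, `KodlotBucket${slot}`);
    }
  }
}

const app = new App();
// 2,000 such stacks would be needed to reach 1 million buckets.
for (let i = 1; i <= 2000; i++) {
  new KodlotS3Stack(app, `KodlotS3Stack${i}`);
}
app.synth();
```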
It turns out that the 2,000 AWS CloudFormation stacks this solution requires exactly match the limit on the number of stacks that can be created in a single AWS account (details under Stack). In this case, it is a soft limit, and it can be increased by a request to AWS. Still, if the AWS account is to be used for purposes other than storage, creating 2,000 top-level stacks can make finding other stacks difficult. The team starts to explore what alternative options they have to refine the initial architecture.
Refinement 1 - Using Nested Stacks
With AWS CloudFormation nested stacks (and the corresponding AWS CDK nested stacks), a top-level stack can contain up to 500 nested stacks, and each nested stack can in turn contain up to 500 resources, as presented in the `KodlotS3NestedStack` example below.
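A minimal sketch of such a nested stack; the class name, construct IDs, and loop bound are assumptions in place of the original listing:

```typescript
import { NestedStack, NestedStackProps } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// A nested stack holding 500 buckets, the per-stack
// CloudFormation resource limit.
export class KodlotS3NestedStack extends NestedStack {
  constructor(scope: Construct, id: string, props?: NestedStackProps) {
    super(scope, id, props);
    for (let slot = 1; slot <= 500; slot++) {
      new Bucket(this, `KodlotBucket${slot}`);
    }
  }
}
```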
With the `KodlotS3NestedStack` definition introduced, the original `KodlotS3Stack` can now be defined as presented below.
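A sketch of the refined top-level stack, assuming a `KodlotS3NestedStack` construct that creates 500 buckets (names and structure are assumptions):

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// The top-level stack now holds 500 nested stacks instead of raw buckets,
// for 500 x 500 = 250,000 buckets per top-level stack.
class KodlotS3Stack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    for (let i = 1; i <= 500; i++) {
      new KodlotS3NestedStack(this, `KodlotS3NestedStack${i}`);
    }
  }
}
```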
With this refined architecture, one top-level AWS CDK stack contains 500 AWS CDK nested stacks. With each nested stack containing 500 resources, a single top-level stack can hold 250,000 Amazon S3 buckets. The team can reach the original goal by creating four top-level stacks. The result is much simpler management, as the AWS CloudFormation console offers filtering of nested stacks, which at this point are an implementation detail of little interest in the daily operations of the solution.
However, platform engineers encounter an error when attempting to deploy this solution. The next section describes the resolution step that the team must take.
Refinement 2 - Deployable State
With the top-level stack hierarchy introduced, AWS CloudFormation attempts to create 250,000 AWS resources in parallel. The team finds out that this is 100 times over the hard limit of 2,500 resources that nested stacks can create, update, or delete per operation (details under Nested stacks). To stay within this limit while keeping the current stack structure, the platform engineers make the nested stacks depend on each other, as presented in the diagram below by the arrows between each nested stack. A dependency from one AWS CDK stack to another can be added using the `stack.addDependency(stack)` API.
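A sketch of the dependency chain, assuming the `KodlotS3NestedStack` construct described in the article (class and construct IDs are assumptions):

```typescript
import { NestedStack, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Chain the nested stacks so CloudFormation processes them one at a time,
// keeping each operation within the 2,500-resources-per-operation limit.
class KodlotS3Stack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    let previous: NestedStack | undefined;
    for (let i = 1; i <= 500; i++) {
      const nested = new KodlotS3NestedStack(this, `KodlotS3NestedStack${i}`);
      if (previous) {
        nested.addDependency(previous); // nested stack i waits for stack i-1
      }
      previous = nested;
    }
  }
}
```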
With the dependencies in place between the AWS CDK nested stacks, each nested stack is created and updated sequentially, starting with the first one. Only after the first one finishes processing does the second one proceed, and so on, until the last one. This means at most 500 resources are created, updated, or deleted at any given time, staying within the 2,500 hard limit. The team can now successfully deploy the solution.
Refinement 3 - Production Readiness
The platform engineering team can take the solution one step further to prepare it for real-world scenarios where Amazon S3 buckets will be removed and created regularly. From the original example, imagine a dataset that is no longer needed. In such a case, the corresponding Amazon S3 bucket can be deleted. However, if such a deletion occurs, it creates an empty slot in the AWS CDK stack hierarchy, and it would be optimal to use that slot the next time a new Amazon S3 bucket needs to be created.
To dynamically discover empty slots in the AWS CDK stack hierarchy, Amazon DynamoDB can be utilised to create a lookup table that records where each Amazon S3 bucket is currently deployed.
| Nested Stack | Slot | Bucket Name |
|---|---|---|
| KodlotS3NestedStack1 | 1 | KodlotBucket1 |
| KodlotS3NestedStack1 | 2 | KodlotBucket2 |
| KodlotS3NestedStack1 | 3 | KodlotBucket3 |
| ... | ... | ... |
| KodlotS3NestedStack1 | 500 | KodlotBucket500 |
| ... | ... | ... |
| KodlotS3NestedStack500 | 1 | KodlotBucket249501 |
| ... | ... | ... |
| KodlotS3NestedStack500 | 500 | KodlotBucket250000 |
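The lookup table itself could be defined with AWS CDK as well; a hypothetical sketch using the composite key above (stack, table, and attribute names are assumptions):

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { AttributeType, BillingMode, Table } from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

// Hypothetical stack defining the slot lookup table, keyed by
// (nested stack, slot) so items can be removed when buckets are deleted.
class KodlotLookupStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    new Table(this, 'KodlotSlotLookup', {
      partitionKey: { name: 'NestedStack', type: AttributeType.STRING },
      sortKey: { name: 'Slot', type: AttributeType.NUMBER },
      billingMode: BillingMode.PAY_PER_REQUEST,
    });
  }
}
```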
Assuming that `KodlotBucket3` gets decommissioned, the lookup table reflects that by removing the item with the composite key `Nested Stack=KodlotS3NestedStack1`, `Slot=3`. The team needs to add a script that executes after the deployment and updates the lookup table with the current state of the deployed resources. They also adjust the AWS CDK code to use the lookup table when allocating resources. This makes the infrastructure code more complex and adds a network lookup to the AWS CDK stack synthesis. Without a network connection and valid AWS credentials, platform engineers can't synthesise the stacks with this solution. This makes local development harder, and as the team matures, they start exploring how to address it.
Refinement 4 - Removing the AWS Account Lookup
To remove the dependency on a connection to the AWS account in the infrastructure code, the platform engineers substitute the Amazon DynamoDB lookup table with AWS SSM parameters. This is because AWS CDK can resolve AWS SSM parameters without connecting to the AWS account.
The team can build a lookup mechanism based on the hierarchical structure of AWS SSM parameter names:

```
/kodlot/nestedstack1/slot1/bucketname = KodlotBucket1
/kodlot/nestedstack1/slot2/bucketname = KodlotBucket2
/kodlot/nestedstack1/slot3/bucketname = KodlotBucket3
...
/kodlot/nestedstack1/slot500/bucketname = KodlotBucket500
...
/kodlot/nestedstack500/slot1/bucketname = KodlotBucket249501
...
/kodlot/nestedstack500/slot500/bucketname = KodlotBucket250000
```
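A hedged sketch of how a bucket name could be resolved from such a parameter; `StringParameter.valueForStringParameter` emits a CloudFormation dynamic reference resolved at deploy time, so synthesis needs no AWS connection. Stack, construct, and parameter names are assumptions:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { StringParameter } from 'aws-cdk-lib/aws-ssm';
import { Construct } from 'constructs';

// Hypothetical stack that reads a slot's bucket name from SSM at deploy time.
class KodlotSlotStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // Returns a token; CloudFormation resolves the parameter during deployment.
    const bucketName = StringParameter.valueForStringParameter(
      this, '/kodlot/nestedstack1/slot3/bucketname');
    new Bucket(this, 'KodlotSlot3Bucket', { bucketName });
  }
}
```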
The platform engineering team discovers that this approach has limits as well. AWS SSM has two types of parameters: standard and advanced. Currently, there can be 10,000 standard and 100,000 advanced parameters per AWS account and AWS Region (details here). When creating 1 million Amazon S3 buckets, assigning an AWS SSM parameter to each bucket is therefore impossible. Since AWS SSM parameters allow values of up to 4 KB for standard and 8 KB for advanced parameters, the solution would need to bundle, for example, all slots of a nested stack into a single parameter, making it harder to manage. This path is left for those platform engineers who want to explore the possibility, especially if the requirement is to create more than 100,000 Amazon S3 buckets, crossing the limit on the number of AWS SSM parameters. It is worth it, as it will bring valuable lessons for the team.
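A quick back-of-the-envelope check illustrates the bundling constraint; the bucket names follow the article's example pattern and are an assumption (real bucket names can be up to 63 characters, so actual payloads may be larger):

```typescript
// Can one nested stack's 500 bucket names be bundled into a single
// SSM parameter? Build the names for nested stack 1 and measure the
// size of a JSON-serialised bundle.
const names: string[] = [];
for (let i = 1; i <= 500; i++) {
  names.push(`KodlotBucket${i}`);
}
const payload = JSON.stringify(names);
// Standard parameters hold up to 4 KB and advanced up to 8 KB; even this
// simple bundle of short names already exceeds both limits, so a finer
// bundling scheme (or compression) would be needed.
console.log(payload.length);
```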
Conclusions
Engineering scalable infrastructure-as-code solutions entails various challenges that can be seen as opportunities for learning when architecting similar solutions. Platform engineering teams will face the challenges described here, and similar ones, as they support continuously growing needs. Through this experience, mature solutions emerge and the teams' problem-solving abilities grow. A creative mindset is key for exploring the available options, combined with knowledge of the infrastructure-as-code tools and cloud services that can support the desired outcome. The right investments in ongoing learning for the teams are crucial, topped with clear support for exploration and treating failures as a feedback loop for the future.