<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Keisuke FURUYA</title>
    <description>The latest articles on DEV Community by Keisuke FURUYA (@saramune).</description>
    <link>https://dev.to/saramune</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1069512%2Fc19ecf39-926a-41de-9565-35b0ff43aadd.png</url>
      <title>DEV Community: Keisuke FURUYA</title>
      <link>https://dev.to/saramune</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saramune"/>
    <language>en</language>
    <item>
      <title>In LLM applications with limited concurrency, can ALB Target Optimizer enable concurrency control and corresponding scaling?</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Thu, 22 Jan 2026 13:44:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/in-llm-applications-with-limited-concurrency-can-alb-target-optimizer-enable-concurrency-control-b80</link>
      <guid>https://dev.to/aws-builders/in-llm-applications-with-limited-concurrency-can-alb-target-optimizer-enable-concurrency-control-b80</guid>
      <description>&lt;h2&gt;
  
  
  Answer
&lt;/h2&gt;

&lt;p&gt;Yes, it is possible. However, since scaling is reactive—occurring only after requests are rejected—some clever workarounds might be necessary for production use. I conducted some tests but haven't found a perfect solution yet, so if anyone has any insights, please let me know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application Load Balancer Target Optimizer
&lt;/h2&gt;

&lt;p&gt;Announced in November 2025, this feature controls traffic by running an "ALB Agent" on the target side, allowing for information exchange between the ALB and the Agent. A primary use case is for LLM applications where a target instance can only handle one or two requests simultaneously. By controlling concurrency, it aims to prevent excessive load on individual instances.&lt;/p&gt;

&lt;p&gt;This got me thinking: the feature reduces the load on the target by having the ALB return errors once the set concurrency limit is reached. In such a scenario, there’s naturally a need to scale the targets. I wondered if there was a way to auto-scale precisely when that concurrency limit is hit.&lt;/p&gt;

&lt;p&gt;For instance, in Google Cloud Run, you can explicitly control concurrency per instance, and it automatically scales when an instance can no longer handle the load. My understanding was that we can now achieve similar control for instances behind an ALB based on per-instance concurrency. So, I decided to experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1. Preparation
&lt;/h2&gt;

&lt;p&gt;To implement an intentional concurrency "overflow," I set up the environment referring to the following blog posts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/jp/blogs/networking-and-content-delivery/drive-application-performance-with-application-load-balancer-target-optimizer/" rel="noopener noreferrer"&gt;https://aws.amazon.com/jp/blogs/networking-and-content-delivery/drive-application-performance-with-application-load-balancer-target-optimizer/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.classmethod.jp/articles/try-aws-alb-target-optimizer/" rel="noopener noreferrer"&gt;https://dev.classmethod.jp/articles/try-aws-alb-target-optimizer/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, I configured three instances with TARGET_CONTROL_MAX_CONCURRENCY set to 1 and verified the setup by running parallel load tests against the ALB.&lt;/p&gt;
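
&lt;p&gt;As a minimal sketch, parallel load can be generated with a one-liner like the following (the ALB DNS name and request count are placeholder assumptions, not from my actual setup):&lt;/p&gt;

```shell
# Fire 10 requests in parallel against the ALB (placeholder DNS name)
# and print each HTTP status code. With TARGET_CONTROL_MAX_CONCURRENCY=1
# per target and three targets, excess requests should come back as 503s.
seq 1 10 | xargs -P 10 -I{} curl -s -o /dev/null -w "%{http_code}\n" "http://my-alb-123456789.us-west-2.elb.amazonaws.com/"
```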

&lt;h2&gt;
  
  
  Step 2. Finding Metrics for Scaling
&lt;/h2&gt;

&lt;p&gt;Quoting from the AWS Blog:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can troubleshoot using the following metrics in CloudWatch:&lt;br&gt;
TargetControlRequestCount: Number of requests forwarded by ALB to the agent.&lt;br&gt;
TargetControlRequestRejectCount: Number of requests rejected by ALB due to no targets being ready to receive requests. This metric shows an uptick when TargetControlWorkQueueLength is zero.&lt;br&gt;
TargetControlActiveChannelCount: Number of active control channels between the ALB and agents. Ideally, this should be equal to the number of agents.&lt;br&gt;
TargetControlNewChannelCount: Number of new channels created between the ALB and agents.&lt;br&gt;
TargetControlChannelErrorCount: Number of control channels between ALB and agents that failed to establish or experienced an unexpected error.&lt;br&gt;
TargetControlWorkQueueLength: Number of signals received by the ALB from agents asking for requests.&lt;br&gt;
TargetControlProcessedBytes: Number of bytes processed by ALB for traffic going to target groups that enable target optimizer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An increase in TargetControlRequestRejectCount suggests that the system can no longer process requests and needs to scale. However, by that point 503 errors are already being returned to users, so you would need client-side retries or other workarounds. TargetControlWorkQueueLength dropping to zero also signals that rejections are about to start, but in practice both metrics trigger at essentially the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3. Testing Actual Scaling
&lt;/h2&gt;

&lt;p&gt;Next, let's set up a CloudWatch Alarm using the TargetControlRequestRejectCount value to trigger auto-scaling. First, I created the Alarm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvjehzh31odmtsnyq268.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvjehzh31odmtsnyq268.png" alt="CloudWatch Alarm" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, I linked it to the Auto Scaling Group. With these settings in place, I applied parallel load as in Step 1 and confirmed that the system does indeed scale out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm1cgwf14lqed98auipq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm1cgwf14lqed98auipq.png" alt="Auto Scaling Group Scaling Policy" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;
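
&lt;p&gt;The console setup above can also be expressed with the AWS CLI. This is a sketch under assumed names (the ASG name, policy name, and load balancer identifier are placeholders); the dimension follows the usual AWS/ApplicationELB convention:&lt;/p&gt;

```shell
# Create a simple scale-out policy on the ASG (names are placeholders)
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name llm-asg \
  --policy-name scale-out-on-reject \
  --policy-type SimpleScaling \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1 \
  --query PolicyARN --output text)

# Alarm whenever any request is rejected within a 60-second window
aws cloudwatch put-metric-alarm \
  --alarm-name alb-target-control-rejects \
  --namespace AWS/ApplicationELB \
  --metric-name TargetControlRequestRejectCount \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --statistic Sum --period 60 --evaluation-periods 1 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions "$POLICY_ARN"
```

&lt;p&gt;A threshold of 0 with GreaterThanThreshold means a single rejected request triggers the scale-out, which matches the "reactive" behavior discussed below.&lt;/p&gt;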

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Using ALB Target Optimizer, you can scale compute resources when pre-set concurrency limits are exceeded. For use cases like LLM applications, where a single request is heavy and uneven load distribution across instances is problematic, this mechanism works well by letting the Agent signal the ALB whether it can handle more traffic.&lt;/p&gt;

&lt;p&gt;In this experiment, I used TargetControlRequestRejectCount to trigger a scale-out, but this results in "reactive" scaling—scaling only after the user has already been impacted. To ensure a smooth user experience, further refinement is needed. If anyone has any ideas, I'd love to hear them.&lt;/p&gt;

&lt;p&gt;The fact that a dedicated Agent communicates with the ALB to allow for more granular traffic control is a mechanism previously unseen in ALB. I'm excited to see how this functionality evolves in the future.&lt;/p&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>What would you do if you were asked to expand the subnets where ECS and Aurora clusters are running?</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Wed, 31 Dec 2025 13:39:38 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-would-you-do-if-you-were-asked-to-expand-the-subnets-where-ecs-and-aurora-clusters-are-running-3b6c</link>
      <guid>https://dev.to/aws-builders/what-would-you-do-if-you-were-asked-to-expand-the-subnets-where-ecs-and-aurora-clusters-are-running-3b6c</guid>
      <description>&lt;p&gt;I would like to share the process and lessons learned from a project involving subnet reconfiguration. While subnet modification is a task best avoided if possible, I hope this serves as a helpful guide for those facing similar challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Due to challenges in the VPC network architecture, I migrated the subnets for the ALB, ECS cluster, and Aurora cluster.&lt;/li&gt;
&lt;li&gt;All migrations were completed successfully with zero or minimal downtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;In a system I was responsible for, we faced the following issues regarding network configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP address exhaustion: Due to service growth, we were running out of available IP addresses in our subnets.&lt;/li&gt;
&lt;li&gt;Batch processing impact: Specifically, when a large number of batch tasks were executed simultaneously, the lack of available IPs became a critical bottleneck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To resolve these issues, we decided to secure future scalability by expanding the subnet ranges.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ideal State
&lt;/h2&gt;

&lt;p&gt;The system in question has a simple architecture: an API server and batch jobs running on ECS (Fargate) behind an ALB, with an Aurora (MySQL) database. Although we were already using public/private subnets, we reorganized them into the following structure based on AWS best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Subnets: Housing the ALB and NAT Gateways.&lt;/li&gt;
&lt;li&gt;Private Subnets: Housing the ECS clusters.&lt;/li&gt;
&lt;li&gt;Isolated Subnets: Housing the Aurora clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We secured these across three Availability Zones (AZs) using significantly wider CIDR ranges than before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88dps3v9awf27slnm6bl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88dps3v9awf27slnm6bl.png" alt="Subnet Before After" width="721" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: ALB Migration
&lt;/h2&gt;

&lt;p&gt;To create subnets with wider CIDR ranges, the existing subnets needed to be deleted. This required moving the ALB to a temporary "evacuation" subnet first. I considered two methods for the ALB subnet migration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new ALB in the destination subnets and switch traffic using Route 53 weighted records.

&lt;ul&gt;
&lt;li&gt;Pros: Gradual traffic transition; fast rollback if issues occur.&lt;/li&gt;
&lt;li&gt;Cons: If clients have DNS caching enabled, they might continue hitting the old ALB after the switch, leading to errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Modify the existing ALB's subnet settings.

&lt;ul&gt;
&lt;li&gt;Pros/Cons: Essentially the inverse of the above.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose Method 2 (modifying the existing ALB's subnet settings) because we could not fully control the errors caused by client-side DNS caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons &amp;amp; Tips
&lt;/h3&gt;

&lt;p&gt;Even when changing subnet settings, AWS performs a rolling update internally, allowing us to migrate the ALB without downtime. You can see exactly what is happening by monitoring the Elastic Network Interfaces (ENI). First, new ENIs are created for the ALB in the new subnets; once they reach a "Ready" state, the old ENIs are deleted. (I previously wrote a blog post about how understanding ENI behavior is the key to mastering VPC networking: "&lt;a href="https://dev.to/aws-builders/mastering-eni-is-mastering-your-vpc-network-426f"&gt;Mastering ENIs is mastering your VPC network&lt;/a&gt;.")&lt;/p&gt;

&lt;p&gt;Note: Behavior may vary depending on your application characteristics, so always verify this in a staging environment first.&lt;/p&gt;
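
&lt;p&gt;For reference, the same change can be scripted and the ENI replacement observed from the CLI; a sketch with placeholder identifiers:&lt;/p&gt;

```shell
# Move the existing ALB to the new subnets (ARN and subnet IDs are placeholders)
aws elbv2 set-subnets \
  --load-balancer-arn "$ALB_ARN" \
  --subnets subnet-newa subnet-newb subnet-newc

# Watch the rolling ENI replacement: new ENIs appear in the new subnets,
# then the old ones are deleted once the new ones are in service
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=ELB app/my-alb/*" \
  --query "NetworkInterfaces[].[SubnetId,Status,PrivateIpAddress]" \
  --output table
```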

&lt;h2&gt;
  
  
  Step 2: Aurora Migration
&lt;/h2&gt;

&lt;p&gt;After securing the wide isolated subnets, I migrated the Aurora cluster. Again, there are several ways to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Schedule a maintenance window, create a snapshot, restore it to a new cluster in the new subnets, and update the application connection strings.&lt;/li&gt;
&lt;li&gt;Create a Reader instance in the new subnet, scale down the cluster to one old Writer and one new Reader, and perform a failover.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since Aurora failover typically completes in under 60 seconds, I chose Method 2 to minimize downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.AuroraHighAvailability.html#Aurora.Managing.FaultTolerance" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.AuroraHighAvailability.html#Aurora.Managing.FaultTolerance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While often overlooked, a DB Subnet Group can include multiple subnets within the same AZ. For example, you could set up a configuration like this (assuming us-west-2d was not previously used):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;us-west-2a: Old Subnet A&lt;/li&gt;
&lt;li&gt;us-west-2b: Old Subnet B&lt;/li&gt;
&lt;li&gt;us-west-2c: Old Subnet C&lt;/li&gt;
&lt;li&gt;us-west-2d: New Subnet D&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this state, you can create a new Aurora instance and explicitly select us-west-2d. Once the new instance is ready, reduce the cluster to one old Writer and the one new Reader, then fail over.&lt;/p&gt;
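
&lt;p&gt;Sketched with the AWS CLI (the cluster and instance names and the instance class are placeholder assumptions):&lt;/p&gt;

```shell
# Add a new reader in the newly added AZ/subnet (us-west-2d)
aws rds create-db-instance \
  --db-cluster-identifier my-aurora-cluster \
  --db-instance-identifier my-aurora-reader-new \
  --db-instance-class db.r6g.large \
  --engine aurora-mysql \
  --availability-zone us-west-2d

# Once the reader is available, promote it by failing over the cluster
aws rds failover-db-cluster \
  --db-cluster-identifier my-aurora-cluster \
  --target-db-instance-identifier my-aurora-reader-new
```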

&lt;h3&gt;
  
  
  Lessons &amp;amp; Tips
&lt;/h3&gt;

&lt;p&gt;By performing the subnet switch via failover, we didn't need to change the Aurora cluster endpoints, allowing us to complete the change with a very short maintenance window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: ECS Service &amp;amp; Task Migration
&lt;/h2&gt;

&lt;p&gt;The API servers running on the ECS cluster also needed to move. Since we already use rolling updates for our deployments, we were able to transition to the new subnets without downtime simply by updating the service's network configuration.&lt;/p&gt;
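
&lt;p&gt;As a sketch (the cluster, service, subnet, and security group identifiers are placeholders), the switch is a single service update followed by the usual rolling deployment:&lt;/p&gt;

```shell
# Point the service at the new private subnets; a rolling deployment
# replaces the tasks without downtime
aws ecs update-service \
  --cluster my-cluster \
  --service api \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-newa,subnet-newb,subnet-newc],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}" \
  --force-new-deployment
```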

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I have shared the process and lessons learned while reorganizing subnets for future scalability. For the services discussed here—ALB, ECS, and Aurora—it is possible to switch subnets with zero or minimal downtime. I hope this article provides a helpful reference for your own infrastructure projects.&lt;/p&gt;

</description>
      <category>ecs</category>
      <category>aurora</category>
      <category>vpc</category>
    </item>
    <item>
      <title>Mastering ENI is Mastering Your VPC Network</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Mon, 22 Dec 2025 13:22:51 +0000</pubDate>
      <link>https://dev.to/aws-builders/mastering-eni-is-mastering-your-vpc-network-426f</link>
      <guid>https://dev.to/aws-builders/mastering-eni-is-mastering-your-vpc-network-426f</guid>
      <description>&lt;p&gt;Understanding the Elastic Network Interface (ENI) is essential for grasping AWS networking. This is because &lt;strong&gt;everything launched within a VPC (like EC2 instances) must have an ENI.&lt;/strong&gt; While you might not think about them often, knowing they exist is incredibly helpful for troubleshooting and verifying behavior. Let's dive into some specific examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an ENI?
&lt;/h2&gt;

&lt;p&gt;First, let's quickly review what an ENI is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An elastic network interface is a logical networking component in a VPC that represents a virtual network card. You can create and configure network interfaces and attach them to instances that you launch in the same Availability Zone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Essentially, it's a virtual network card. It's most commonly discussed in the context of being attached to EC2 instances. In most cases, its lifecycle matches that of the EC2 instance, so you rarely need to think about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ENIs are Useful
&lt;/h2&gt;

&lt;p&gt;Consider load balancers like an Application Load Balancer (ALB). You can't see their underlying infrastructure (the instances) from the console. However, as mentioned earlier, anything launched in a VPC has an ENI. This means you can see the ENIs attached to the load balancer's underlying instances in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed99v8gbqaxp9nij60jj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed99v8gbqaxp9nij60jj.png" alt="Try searching for 'elb' in the ENI console" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By looking here, you can determine how many instances the load balancer is running and which Availability Zones (AZs) they are deployed in. It might not be a common task, but if you're changing the subnets for your load balancer, you can watch these ENIs to confirm the replacement process.&lt;/p&gt;
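
&lt;p&gt;The same lookup works from the CLI; a sketch that lists load balancer ENIs by their description prefix:&lt;/p&gt;

```shell
# ELB-owned ENIs carry a description like "ELB app/my-alb/1234567890abcdef"
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=ELB *" \
  --query "NetworkInterfaces[].[AvailabilityZone,SubnetId,PrivateIpAddress]" \
  --output table
```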

&lt;p&gt;Let's look at another example. Some services have special VPC connection settings. Amazon Quick Sight is one of them. You need Quick Sight permissions to check its VPC connection settings, but &lt;strong&gt;you &lt;em&gt;can&lt;/em&gt; see all the relevant information just by looking at the ENI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhwy02dr5vibun4eoifi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhwy02dr5vibun4eoifi.png" alt="Try searching for 'quicksight' in the ENI console" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connectivity issues are common with these types of inter-service connections, so remembering that you can see valuable information on the ENI side is useful.&lt;/p&gt;

&lt;p&gt;To summarize, the great thing about ENIs is that &lt;strong&gt;they allow you to indirectly observe resources that aren't directly visible in the console.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Application: Accessing the Internet from AWS CloudShell in a VPC
&lt;/h2&gt;

&lt;p&gt;This might feel a bit like a "hack," but it's an idea I had during a recent troubleshooting session. CloudShell can be launched within a VPC. It can access the internet if it's in a private subnet that has a route to a NAT Gateway in a public subnet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your AWS CloudShell environment can only connect to the internet if it is in a private VPC subnet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/cloudshell/latest/userguide/using-cshell-in-vpc.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cloudshell/latest/userguide/using-cshell-in-vpc.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is because a public IP is not automatically assigned to its ENI.&lt;/p&gt;

&lt;p&gt;Recently, I was troubleshooting and launched CloudShell in a public subnet. I needed to access the internet from this CloudShell session, and that's when I remembered the ENI. I thought, "If it doesn't have a public IP, why not just attach one?" I allocated an Elastic IP (EIP) and attached it to the CloudShell's ENI. &lt;strong&gt;It worked perfectly, and I was able to connect to the internet.&lt;/strong&gt;&lt;/p&gt;
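
&lt;p&gt;The steps above, sketched with the AWS CLI (the ENI ID is a placeholder; locate the CloudShell ENI in the console or via describe-network-interfaces first):&lt;/p&gt;

```shell
# Allocate an Elastic IP and attach it to the CloudShell session's ENI
ALLOC_ID=$(aws ec2 allocate-address --domain vpc --query AllocationId --output text)
aws ec2 associate-address \
  --allocation-id "$ALLOC_ID" \
  --network-interface-id eni-0123456789abcdef0
```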

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By looking at ENIs, you can start to see things that were previously invisible. Knowing that "everything launched in a VPC has an ENI" gives you a real edge in network troubleshooting.&lt;/p&gt;

&lt;p&gt;Remember: &lt;strong&gt;Mastering ENI is mastering your VPC network.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>eni</category>
      <category>vpc</category>
    </item>
    <item>
      <title>Considerations for Setting ReadonlyRootFilesystem to true in ECS Task Definitions</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Sat, 23 Aug 2025 02:20:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/considerations-for-setting-readonlyrootfilesystem-to-true-in-ecs-task-definitions-1e7</link>
      <guid>https://dev.to/aws-builders/considerations-for-setting-readonlyrootfilesystem-to-true-in-ecs-task-definitions-1e7</guid>
      <description>&lt;p&gt;When enhancing container security on ECS, you might encounter the following finding from AWS Security Hub's Cloud Security Posture Management (CSPM):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[ECS.5] ECS containers should be restricted to read-only access to their root file system.&lt;/p&gt;

&lt;p&gt;This control checks whether an Amazon ECS container has read-only access to its root file system. The control fails if the readonlyRootFilesystem parameter is set to false, or the parameter doesn't exist in the container definition within the task definition. This control evaluates only the latest active revision of an Amazon ECS task definition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/ecs-controls.html#ecs-5" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/securityhub/latest/userguide/ecs-controls.html#ecs-5&lt;/a&gt;&lt;br&gt;
While it's a straightforward process to improve security by setting the readonlyRootFilesystem parameter to true in your task definition, enforcing a universal write prohibition can cause certain features to malfunction. Below are a couple of examples.&lt;/p&gt;
&lt;h2&gt;
  
  
  ECS Exec
&lt;/h2&gt;

&lt;p&gt;ECS Exec allows you to connect to a running container and leverages the AWS Systems Manager (SSM) framework. This feature requires write permissions to specific volumes to function correctly. Specifically, you need to allow write access to the &lt;code&gt;/var/lib/amazon&lt;/code&gt; and &lt;code&gt;/var/log/amazon&lt;/code&gt; directories. Here is an example of how to configure this in your task definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"containerDefinitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mountPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var-lib-amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/var/lib/amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var-log-amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/var/log/amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volumes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var-lib-amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var-log-amazon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Datadog Agent
&lt;/h2&gt;

&lt;p&gt;Similarly, the Datadog Agent will not start if it lacks write permissions to certain directories. Additionally, to properly collect Fargate-specific metrics like &lt;code&gt;ecs.fargate.cpu.usage&lt;/code&gt;, you must also set specific dockerLabels (a requirement confirmed with Datadog support).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"containerDefinitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mountPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-etc-datadog-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/etc/datadog-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-opt-datadog-agent-run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/opt/datadog-agent/run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-var-run-datadog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/var/run/datadog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"dockerLabels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"com.datadoghq.ad.init_configs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[{}]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"com.datadoghq.ad.instances"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[{}]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"com.datadoghq.ad.check_names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ecs_fargate&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volumes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-etc-datadog-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-opt-datadog-agent-run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dd-agent-var-run-datadog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Enabling ReadonlyRootFilesystem is a vital step for enhancing container security. However, it's crucial to be aware that some applications and features require specific directories to remain writable to function correctly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>securityhub</category>
    </item>
    <item>
      <title>Transforming AWS Liabilities into Assets on a Container Foundation</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Wed, 29 Nov 2023 03:54:42 +0000</pubDate>
      <link>https://dev.to/aws-builders/transforming-aws-liabilities-into-assets-on-a-container-foundation-3iob</link>
      <guid>https://dev.to/aws-builders/transforming-aws-liabilities-into-assets-on-a-container-foundation-3iob</guid>
      <description>&lt;p&gt;Periodically running small scripts, serving auxiliary roles, EC2, Lambda, and the likes - I believe many of you have numerous instances of these in your environments. They are, of course, excellent services that support our daily operations. However, having proper maintenance flows and mechanisms for them would be ideal. Unfortunately, in our case, that was lacking, leading to them becoming liabilities.&lt;/p&gt;

&lt;p&gt;To mitigate these liabilities, we've been continuously working on improving operability and maintenance by containerizing these services and migrating them onto the execution foundation of our applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you have these?
&lt;/h2&gt;

&lt;p&gt;Do any of you have these 'legendary' beings in your AWS accounts?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The legendary EC2 triggering SQL queries to RDS via cron for metrics&lt;/li&gt;
&lt;li&gt;The legendary Lambda periodically merging OpenSearch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They're the ones handling the small administrative tasks alongside the main infrastructure supporting your services. Nowadays, Lambda is the more common choice for this kind of functionality, since once the code is written it runs serverless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with EC2 or Lambda for Administrative Tasks
&lt;/h2&gt;

&lt;p&gt;We had numerous EC2 instances and Lambdas serving such purposes in our company. However, when it came time to update them, deployment was often manual despite the source code being available. Whether deploying by hand or running the Serverless Framework's &lt;code&gt;sls&lt;/code&gt; command from a local PC, it required an extra step. Lambda was especially problematic: runtime updates occur periodically, so manual function updates were sometimes necessary whenever the code needed modification. Because these updates weren't frequent, the functions fell into a cycle of neglect without improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Will Containerization Solve This?
&lt;/h2&gt;

&lt;p&gt;One might suggest, "Let's containerize them!" But containerization alone won't solve the issue. A portable container certainly eases development, modification, and testing compared with depending on a runtime such as Lambda's, but the real benefit comes from running these jobs alongside the main application on the same infrastructure and unifying the deployment methods, which keeps maintenance easy even when future changes arrive. In other words, the point isn't to containerize blindly; it presupposes that the main application already runs on containers and that this foundation is established.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Replacing Lambda Python Script with EKS Cronjob (Shell)
&lt;/h2&gt;

&lt;p&gt;Let me share an actual migration story from our company. We had one such being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A legendary Lambda periodically merging OpenSearch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running on Python 3.6, it required an upgrade because runtime support was reaching end of life. Upgrading to 3.7 was an option, but the CD side wasn't well structured, which made the change hard to verify. We therefore decided to migrate it to EKS, where our main application runs. The process itself was simple: since the job didn't strictly need to be in Python, we rewrote it in shell and scheduled it as a Kubernetes CronJob. We promote GitOps for applications using helmfile and Argo CD within our company, and by bringing this OpenSearch management script into GitOps, deploying it for verification became simple. It also made it clearer where the scripts live, where to make changes, and how to deploy them.&lt;/p&gt;
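&lt;p&gt;The shell rewrite itself can be sketched roughly as follows. This is a hypothetical minimal version, not our actual script: the &lt;code&gt;logs-*&lt;/code&gt; index pattern is an assumption, and &lt;code&gt;ES_DOMAIN_ENDPOINT&lt;/code&gt; is supplied through the CronJob's environment as shown later.&lt;/p&gt;

```shell
#!/bin/bash
# merge.sh -- hypothetical sketch of a periodic OpenSearch force merge.
# ES_DOMAIN_ENDPOINT comes from the CronJob env; "logs-*" is an assumed index pattern.
set -euo pipefail

ES_DOMAIN_ENDPOINT="${ES_DOMAIN_ENDPOINT:-localhost:9200}"
MERGE_URL="https://${ES_DOMAIN_ENDPOINT}/logs-*/_forcemerge?max_num_segments=1"

echo "Requesting force merge: ${MERGE_URL}"
# curl -sS -XPOST "${MERGE_URL}"   # uncomment to run against a real domain
```

&lt;p&gt;The &lt;code&gt;_forcemerge&lt;/code&gt; call itself is a single POST, which is exactly why a shell script was sufficient here.&lt;/p&gt;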

&lt;p&gt;The specific structure using helmfile was as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── environments
│   └── values.yaml
├── helmfile.yaml
└── manifests
    ├── merge.sh
    └── manifest.yaml.gotmpl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We store Kubernetes manifests in a 'manifests' directory, utilizing gotmpl format for extensibility, containing both Kubernetes manifests and the actual shell script for execution. Next up is the 'helmfile.yaml'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environments:
  {{ .Environment.Name }}:
    values:
    - environments/values.yaml

---

releases:
  - name: "merge"
    namespace: "{{ .Namespace | default .Values.merge.namespace }}"
    chart: "./manifests"
    installedTemplate: "{{ .Values.merge.installed }}"
    labels:
      namespace: "{{ .Values.merge.namespace }}"
      default: "false"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Helmfile treats the files under the 'manifests' directory as a chart and renders manifests from the following gotmpl. The pod mounts the shell script from a ConfigMap, and a CronJob runs it on schedule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: merge
data:
  merge.sh: |
    {{- readFile "merge.sh" | nindent 6 }}
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: merge
spec:
  schedule: "5 01 * * *"
  timeZone: "Asia/Tokyo"
  concurrencyPolicy: "Forbid"
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: {{ .Values.merge.serviceAccountName }}
          containers:
          - name: merge
            image: chatwork/aws:2.8.10
            env:
              - name: ES_DOMAIN_ENDPOINT
                value: {{ .Values.merge.esDomainEndpoint }}
            command:
              - "/bin/bash"
              - "/tmp/merge.sh"
            volumeMounts:
            - name: merge
              mountPath: /tmp
          volumes:
          - name: merge
            configMap:
              name: merge
              items:
                - key: merge.sh
                  path: merge.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benefits of Organizing the Deployment Flow
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Running these alongside the main application on the same infrastructure and unifying deployment methods make maintenance easier even in the event of future changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We reaped exactly those benefits. The helmfile manifests mechanism in particular proved powerful: for simple scripts, consolidating them here and enabling GitOps made it feel like this could handle everything (a radical thought, admittedly). Moreover, such administrative code previously lived in repositories accessible only to the infrastructure and SRE teams, which closed off visibility. By shifting the deployment flow to the application side, developers gained visibility, enabling everyone to contribute to maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While we tailored the setup to operate on Kubernetes due to using EKS, similar configurations could be implemented even if you use ECS. From an EKS perspective, Argo CD's GitOps deployment method is exceptionally robust, and I genuinely believe there's no reason not to leverage this setup. Nonetheless, what matters most is leveraging the advantages of containerization while utilizing the infrastructure and deployment mechanisms you're accustomed to. If this helps shed light on those resources silently operating as liabilities in your AWS accounts, I'd consider it fortunate.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>container</category>
    </item>
    <item>
      <title>Creating Network Load Balancer (SG supported) with AWS Load Balancer Controller</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Fri, 22 Sep 2023 08:05:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/creating-network-load-balancer-sg-supported-with-aws-load-balancer-controller-168p</link>
      <guid>https://dev.to/aws-builders/creating-network-load-balancer-sg-supported-with-aws-load-balancer-controller-168p</guid>
      <description>&lt;p&gt;AWS Network Load Balancer (NLB) has finally added support for Security Groups. &lt;br&gt;
&lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/08/network-load-balancer-supports-security-groups/" rel="noopener noreferrer"&gt;https://aws.amazon.com/about-aws/whats-new/2023/08/network-load-balancer-supports-security-groups/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This significant update means that AWS Load Balancer Controller now automatically attaches Security Groups to NLBs by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Has Changed
&lt;/h2&gt;

&lt;p&gt;Detailed behavior changes can be found in the release notes for version 2.6.0 of the AWS Load Balancer Controller. &lt;br&gt;
&lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases" rel="noopener noreferrer"&gt;https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases&lt;/a&gt;&lt;br&gt;
The key points include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to Application Load Balancers (ALB) in Ingress, if you do not explicitly specify the attached Security Group, two Security Groups are automatically created and attached:

&lt;ul&gt;
&lt;li&gt;One for receiving external traffic (frontend SG)&lt;/li&gt;
&lt;li&gt;One for communicating with the backend Node Groups (backend SG), which is automatically added to the Node Group's Security Group ingress rules.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;You can explicitly specify the frontend SG using the &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-security-groups&lt;/code&gt; annotation.&lt;/li&gt;

&lt;li&gt;Existing NLBs without attached Security Groups can still be managed using AWS Load Balancer Controller.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Creating a New NLB with AWS Load Balancer Controller
&lt;/h2&gt;

&lt;p&gt;I confirmed the behavior of Security Groups created using annotations. First, prepare an EKS cluster with AWS Load Balancer Controller installed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

eksctl create cluster --name new-nlb


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Installation instructions for AWS Load Balancer Controller can be found here.&lt;br&gt;
&lt;a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.6/deploy/installation/" rel="noopener noreferrer"&gt;https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.6/deploy/installation/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, prepare the application (nginx) to connect to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25.2
        ports:
        - containerPort: 80


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To create an NLB, create a Service resource. Initially, deploy it with the default settings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: '10'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
    service.beta.kubernetes.io/aws-load-balancer-type: external
  labels:
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/name: nginx
    app.kubernetes.io/version: 1.25.2
  name: nginx
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
  loadBalancerSourceRanges:
    - [your ip]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will create an NLB with automatically generated Security Groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbanto1qariucq4d4jqrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbanto1qariucq4d4jqrv.png" alt="Security Groups attached to NLB"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5xsfwamtl5yoby0y6mj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5xsfwamtl5yoby0y6mj.png" alt="automatically generated frontend SG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe350lyrz5kqvpivohi1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe350lyrz5kqvpivohi1y.png" alt="automatically generated backend SG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, the backend SG rules (TCP:80) allowing communication from the NLB's backend SG to the Node Group's Security Group are automatically added.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3da1thjbyg4hrhbxtxnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3da1thjbyg4hrhbxtxnr.png" alt="backend SG rule added"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, I specify an existing Security Group using the &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-security-groups&lt;/code&gt; annotation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: '10'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-security-groups: [sg id]
  labels:
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/name: nginx
    app.kubernetes.io/version: 1.25.2
  name: nginx
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will attach the specified Security Group (frontend SG) to the NLB. However, in this case, the backend SG is not created automatically. You will need to manually update the Node Group's Security Group to allow communication from the frontend SG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyume9brhxmebwn2aj88j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyume9brhxmebwn2aj88j.png" alt="frontend SG only"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To automate creating the backend SG, attaching it to the NLB, and updating the Node Group's Security Group, set the &lt;code&gt;service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules&lt;/code&gt; annotation to 'true'. This annotation defaults to 'false', meaning the backend SG is not managed when you explicitly specify the frontend SG. Let's delete the NLB and deploy it again:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: '10'
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: '2'
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-security-groups: [sg id]
    service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: 'true'
  labels:
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/name: nginx
    app.kubernetes.io/version: 1.25.2
  name: nginx
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, the specified frontend SG is attached, the backend SG is automatically generated, and the Node Group's Security Group is updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68cykh6jv5nqgxo57wtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68cykh6jv5nqgxo57wtw.png" alt="NLB's Security Groups"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jkst6huxepiseolf50l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jkst6huxepiseolf50l.png" alt="Node Group's Security Group"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The addition of Security Group support for NLB has brought about changes in how NLBs created with AWS Load Balancer Controller behave. While existing setups remain unaffected, for future NLB creations, it's essential to be mindful of the specifications outlined above.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Managing AWS Resources with ACK and helmfile</title>
      <dc:creator>Keisuke FURUYA</dc:creator>
      <pubDate>Tue, 27 Jun 2023 08:09:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/managing-aws-resources-with-ack-and-helmfile-2mo6</link>
      <guid>https://dev.to/aws-builders/managing-aws-resources-with-ack-and-helmfile-2mo6</guid>
      <description>&lt;p&gt;AWS Controllers for Kubernetes (ACK) is a Kubernetes Custom Controller that allows you to manage AWS resources through Kubernetes manifests. By using ACK, you can easily manage AWS resources in the same lifecycle as your applications. It is suitable for scenarios where you need to create AWS resources alongside your applications for testing purposes and delete them together with the application when no longer needed. This article explains how to manage AWS resources using Kubernetes manifests. While there are various ways to manage Kubernetes manifest files, this article focuses on using helmfile for management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Please install the following on your local machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eksctl&lt;/li&gt;
&lt;li&gt;kubectl&lt;/li&gt;
&lt;li&gt;helm&lt;/li&gt;
&lt;li&gt;helmfile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set up the following in your AWS account. This IAM policy has the necessary permissions for the IAM Controller of ACK to create IAM resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM policy

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/aws-controllers-k8s/iam-controller/blob/main/config/iam/recommended-inline-policy" rel="noopener noreferrer"&gt;https://github.com/aws-controllers-k8s/iam-controller/blob/main/config/iam/recommended-inline-policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;name: ack-iam-controller-policy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;First, let's create an EKS cluster. In this example, we will deploy it with minimal configuration using eksctl.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster --name ack-helmfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create an OIDC provider along with the IAM roles and service accounts for the controllers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl utils associate-iam-oidc-provider --cluster=ack-helmfile --approve
eksctl create iamserviceaccount --cluster=ack-helmfile --name=ack-dynamodb-controller --namespace=ack-system --attach-policy-arn=arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess --approve
eksctl create iamserviceaccount --cluster=ack-helmfile --name=ack-iam-controller --namespace=ack-system --attach-policy-arn=arn:aws:iam::[your AWS account ID]:policy/ack-iam-controller-policy --approve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install ACK on the created EKS cluster. This time, we will install the IAM and DynamoDB controllers. Below is a sample configuration for the ACK controllers using helmfile.&lt;/p&gt;

&lt;p&gt;helmfile.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repositories:
  - name: aws-controllers-k8s
    url: public.ecr.aws/aws-controllers-k8s
    oci: true

environments:
  {{ .Environment.Name }}:
    values:
    - environments/{{ .Environment.Name }}/values.yaml

---

releases:
  - name: "ack-dynamodb-controller"
    namespace: "{{ .Namespace | default .Values.awsControllersK8s.namespace }}"
    createNamespace: true
    chart: "aws-controllers-k8s/dynamodb-chart"
    version: "{{ .Values.awsControllersK8s.dynamodb.version }}"
    installedTemplate: "{{ .Values.awsControllersK8s.dynamodb.installed }}"
    values:
    - values/dynamodb.yaml.gotmpl

  - name: "ack-iam-controller"
    namespace: "{{ .Namespace | default .Values.awsControllersK8s.namespace }}"
    createNamespace: true
    chart: "aws-controllers-k8s/iam-chart"
    version: "{{ .Values.awsControllersK8s.iam.version }}"
    installedTemplate: "{{ .Values.awsControllersK8s.iam.installed }}"
    values:
    - values/iam.yaml.gotmpl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;environments/test/values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;awsControllersK8s:
  namespace: ack-system
  region: ap-northeast-1

  dynamodb:
    installed: true
    version: 1.1.1

  iam:
    installed: true
    version: 1.2.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;values/dynamodb.yaml.gotmpl&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws:
  region: {{ .Values.awsControllersK8s.region }}

serviceAccount:
  create: false
  name: ack-dynamodb-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;values/iam.yaml.gotmpl&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws:
  region: {{ .Values.awsControllersK8s.region }}

serviceAccount:
  create: false
  name: ack-iam-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying helmfile will create the IAM and DynamoDB controllers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helmfile -e test apply .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment -n ack-system
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
ack-dynamodb-controller-dynamodb-chart   1/1     1            1           79s
ack-iam-controller-iam-chart             1/1     1            1           79s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying Required AWS Resources in the Application
&lt;/h2&gt;

&lt;p&gt;Now that the DynamoDB controller is ready, let's use it to create a DynamoDB resource. Here, we will deploy a simple table called "ack-dynamodb-table".&lt;/p&gt;

&lt;p&gt;helmfile.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environments:
  {{ .Environment.Name }}:
    values:
    - environments/{{ .Environment.Name }}/values.yaml

---

releases:
  - name: "ack-dynamodb-table"
    namespace: default
    installedTemplate: "{{ .Values.dynamodb.installed }}"
    chart: "./manifests"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;environments/test/values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dynamodb:
  installed: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;manifests/table.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: dynamodb.services.k8s.aws/v1alpha1
kind: Table
metadata:
  name: ack-dynamodb-table
spec:
  tableName: ack-dynamodb-table
  attributeDefinitions:
    - attributeName: id
      attributeType: "N"
  keySchema:
    - attributeName: id
      keyType: "HASH"
  provisionedThroughput:
    writeCapacityUnits: 1
    readCapacityUnits: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the helmfile again:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helmfile -e test apply .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying it will create the DynamoDB table defined in the manifest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohp9q2xqzptxky3p86g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohp9q2xqzptxky3p86g9.png" alt="Created DynamoDB Table"&gt;&lt;/a&gt;&lt;/p&gt;
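&lt;p&gt;Instead of checking the console, the table can also be confirmed from the cluster side. A quick sketch, assuming the table name and region used above:&lt;/p&gt;

```shell
# Check the ACK Table custom resource; its status should report it as synced
kubectl get tables.dynamodb.services.k8s.aws ack-dynamodb-table

# Cross-check with the AWS CLI that the table itself is ACTIVE
aws dynamodb describe-table \
  --table-name ack-dynamodb-table \
  --region ap-northeast-1 \
  --query "Table.TableStatus"
```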

&lt;h2&gt;
  
  
  Deploying Role and Service Account for IRSA
&lt;/h2&gt;

&lt;p&gt;To access DynamoDB from the application, we will use IRSA (IAM Roles for Service Accounts). IRSA associates a Kubernetes service account with an IAM role. Concretely, it requires an IAM role with an appropriate trust relationship and a service account annotated with the role's ARN. While eksctl can set this up with a single command (as we did earlier), this time we will create these resources with ACK.&lt;/p&gt;
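&lt;p&gt;For reference, the eksctl equivalent would look roughly like this. The cluster name and policy ARN are illustrative; eksctl creates the role, trust policy, and annotated service account in one step:&lt;/p&gt;

```shell
# Illustrative eksctl one-liner covering what we build with ACK below
# (cluster name and account ID are placeholders)
eksctl create iamserviceaccount \
  --cluster ack-helmfile \
  --namespace default \
  --name ack-app-serviceaccount \
  --attach-policy-arn arn:aws:iam::<account-id>:policy/ack-dynamodb-policy \
  --approve
```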

&lt;p&gt;helmfile.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environments:
  {{ .Environment.Name }}:
    values:
    - environments/{{ .Environment.Name }}/values.yaml

---

releases:
  - name: "ack-iam-role"
    namespace: default
    installedTemplate: "{{ .Values.iam.installed }}"
    chart: "./manifests"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will obtain the OIDC issuer URL required for the trust relationship of the role used by the service account from the EKS management console and add it to the values.yaml file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof6ool7z0l6tnzvih41x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof6ool7z0l6tnzvih41x.png" alt="EKS OIDC info"&gt;&lt;/a&gt;&lt;/p&gt;
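&lt;p&gt;If you prefer the CLI over the console, the issuer URL can also be fetched like this (the cluster name is an assumption):&lt;/p&gt;

```shell
# Fetch the OIDC issuer and strip the https:// prefix,
# which is the form the trust policy conditions expect
aws eks describe-cluster \
  --name ack-helmfile \
  --region ap-northeast-1 \
  --query "cluster.identity.oidc.issuer" \
  --output text | sed 's|https://||'
```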

&lt;p&gt;environments/test/values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iam:
  installed: true
  account_id: [your AWS account ID]
  oidc_url: [your EKS oidc url]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the manifest, we define an IAM policy that grants access to DynamoDB, along with the role it is attached to and the service account that assumes that role.&lt;/p&gt;

&lt;p&gt;manifests/role.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: iam.services.k8s.aws/v1alpha1
kind: Policy
metadata:
  name: ack-dynamodb-policy
spec:
  policyDocument: |
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Action": [
                    "dynamodb:UpdateTable",
                    "dynamodb:UpdateItem",
                    "dynamodb:Scan",
                    "dynamodb:Query",
                    "dynamodb:PutItem",
                    "dynamodb:ListTables",
                    "dynamodb:GetItem",
                    "dynamodb:DescribeTable",
                    "dynamodb:DeleteItem",
                    "dynamodb:BatchWriteItem",
                    "dynamodb:BatchGetItem"
                ],
                "Resource": [
                    "arn:aws:dynamodb:ap-northeast-1:{{ .Values.iam.account_id }}:table/ack-dynamodb-table"
                ]
            }
        ]
    }
  name: ack-dynamodb-policy
---
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: ack-app-role
spec:
  assumeRolePolicyDocument: |
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::{{ .Values.iam.account_id }}:oidc-provider/{{ .Values.iam.oidc_url }}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        "{{ .Values.iam.oidc_url }}:aud": "sts.amazonaws.com",
                        "{{ .Values.iam.oidc_url }}:sub": "system:serviceaccount:default:ack-app-serviceaccount"
                    }
                }
            }
        ]
    }
  name: ack-app-role
  policies:
  - "arn:aws:iam::{{ .Values.iam.account_id }}:policy/ack-dynamodb-policy"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::{{ .Values.iam.account_id }}:role/ack-app-role
    eks.amazonaws.com/sts-regional-endpoints: "true"
    eks.amazonaws.com/token-expiration: "86400"
    argocd.argoproj.io/sync-wave: "1"
  name: ack-app-serviceaccount
  namespace: default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the helmfile once more:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helmfile -e test apply .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying it will create the IAM policy, IAM role with the attached policy, and Kubernetes service account defined in the manifest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5ct26bpkxvs1iacfwpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5ct26bpkxvs1iacfwpf.png" alt="Created IAM Role"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get serviceaccount

NAME                     SECRETS   AGE
ack-app-serviceaccount   0         7m57s
default                  0         49m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing Access to DynamoDB
&lt;/h2&gt;

&lt;p&gt;With the service account in place, the application is now permitted to access DynamoDB. Let's verify this with the AWS CLI. First, let's execute a PutItem operation against DynamoDB without specifying the service account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run awscli --image=amazon/aws-cli -- dynamodb put-item --table-name ack-dynamodb-table --item '{ "id": { "N": "1" }, "value": { "S": "test" }}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will result in an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs awscli

An error occurred (AccessDeniedException) when calling the PutItem operation: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/eksctl-ack-helmfile-nodegroup-ng-NodeInstanceRole-EVFGB3QY0C29/i-097e6311fa2cc1da1 is not authorized to perform: dynamodb:PutItem on resource: arn:aws:dynamodb:ap-northeast-1:xxxxxxxxxxxx:table/ack-dynamodb-table because no identity-based policy allows the dynamodb:PutItem action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now execute PutItem on DynamoDB using the service account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run awscli --image=amazon/aws-cli --overrides='{ "spec": { "serviceAccountName": "ack-app-serviceaccount" } }' -- dynamodb put-item --table-name ack-dynamodb-table --item '{ "id": { "N": "1" }, "value": { "S": "test" }}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PutItem operation is now successful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrg83cc2e9wm9rlrum4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrg83cc2e9wm9rlrum4m.png" alt="Succeed PutItem"&gt;&lt;/a&gt;&lt;/p&gt;
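&lt;p&gt;To double-check which identity the pod actually assumed, a sketch like the following could be run as well (pod name is illustrative):&lt;/p&gt;

```shell
# The logged ARN should be an assumed-role ARN containing ack-app-role,
# confirming the IRSA credentials were picked up
kubectl run awscli-whoami --image=amazon/aws-cli \
  --overrides='{ "spec": { "serviceAccountName": "ack-app-serviceaccount" } }' \
  -- sts get-caller-identity
kubectl logs awscli-whoami
```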

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we used AWS Controllers for Kubernetes (ACK) to manage AWS resources with Kubernetes manifests. We covered the configuration needed for the application to use those AWS resources, and explored how to manage the Kubernetes manifests with helmfile.&lt;/p&gt;

&lt;p&gt;With ACK and this workflow, AWS resources can be managed in step with the application's lifecycle, making it easy to create and delete them together with the application itself.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>helmfile</category>
      <category>ack</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
