DEV Community: Matt Lewis

Implement multi-tenancy on Amazon EKS Auto Mode clusters

Matt Lewis — Fri, 10 Jul 2026 09:15:57 +0000

The latest CNCF Survey in 2025 highlighted the increasing adoption of containers within the enterprise, with 92% of organisations now using containers in production. Of these, 82% of them run Kubernetes, which has rapidly become the industry standard. Overall, 93% of organisations are now using, piloting, or evaluating Kubernetes, with 79% of these running managed services from the hyperscalers.

In an enterprise with many engineering teams looking to containerise their applications, a common challenge is whether to run these on a single Kubernetes cluster. This blog post focuses on the options that exist for implementing multi-tenancy on an Amazon EKS Auto Mode cluster.

Hard versus Soft Isolation Boundaries

In this blog post, we want to look at the options for two workload teams to deploy their containerised application onto Amazon EKS. Amazon EKS is a fully upstream and certified conformant and compliant version of Kubernetes. Kubernetes provides a single shared control plane and supports soft multi-tenancy through a number of isolation mechanisms. This means that a single instance of the control plane is shared among all the tenants within a cluster.

The diagram below shows the main components that are running when two pods are deployed on the same worker node on Amazon EKS.

This highlights that containers on the same worker node share much of the underlying operating system, including the Linux kernel, node networking and storage. However, even if these pods were deployed on different nodes, they will still be sharing the same control plane and therefore components like etcd which provides the backing datastore for Kubernetes state across the cluster.

The only way to guarantee hard isolation between the applications of the two workload teams is to deploy them onto separate Amazon EKS clusters in separate AWS accounts. By default, resources in one AWS account cannot access resources in another AWS account, limiting the blast radius of a misconfiguration or malicious action.

However, there is a significant operational and cost overhead of having each workload team manage its own Amazon EKS cluster. This has led to many organisations establishing a platform team, and allowing applications from different workload teams to be hosted on this central Amazon EKS platform. The rest of this blog post looks at the controls that can be put in place to achieve logical soft isolation between these applications. These controls are part of a defence in depth strategy.

This post breaks down this defence in depth strategy into the following layers:

Layer 1 — Account boundary and Service Control Policies
Layer 2 — Namespace, Quotas, and Pod Security isolation
Layer 3 — Pod Identity, role chaining, and ABAC
Layer 4 — Network isolation
Layer 5 — Policy as code with Kyverno
Layer 6 — GitOps deployment isolation
Layer 7 — Observability and audit isolation

Layer 1 — Account boundary and Service Control Policies

AWS recommend using a multi-account strategy with AWS Organizations to help isolate and manage business applications and data.

Each of the workload teams have their own set of AWS resources required as part of their overall application e.g. Amazon S3 buckets, RDS databases and DynamoDB tables. These resources are hosted within the AWS account of the individual workload team. This limits the shared resources to the compute capability of Amazon EKS.

This is reinforced by the use of Service Control Policies (SCP). SCPs offer central control over the maximum available permissions for users and roles in an AWS organization. We create and apply the following two SCPs as examples:

Prevent any non-EKS principal from setting the Pod Identity session tags (kubernetes-service-account, kubernetes-namespace, eks-cluster-arn) that our access model depends on
Deny EKS cluster creation in any workload account, ensuring that clusters are centralised

Layer 2 — Namespace, Quotas, and Pod Security isolation

The standard practice to support soft multi-tenancy on EKS is to align with Kubernetes namespaces as a mechanism for isolating groups of resources. Namespaces allow you to divide the cluster into logical partitions. Quotas, network policies, service accounts and several other objects are all scoped to a namespace. In our example, workload team 1 and workload team 2 are assigned their own dedicated namespace.

Pod Security Admission (PSA) is a Kubernetes built-in admission controller that enforces Pod Security Standards. It has three levels:

privileged - largely unrestricted and intended for trusted system workloads.
baseline - blocks known privilege escalations
restricted - the most hardened built-in profile

PSA is enabled per namespace via labels. We apply this when creating the namespace as follows:

apiVersion: v1
kind: Namespace
metadata:
  name: team-1
  labels:
    team: team-1
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

PSA governs what a pod is allowed to do when it is admitted to the cluster. It does not control which workloads can communicate with one another or what AWS resources they can access. In addition to PSA, we also extend the use of compliance-as-code with Kyverno, which we discuss later on in the blog.

We also define ResourceQuota and LimitRange to prevent the noisy neighbour problem where one team could scale their deployment and starve other workloads of nodes and memory.

ResourceQuota provides a hard ceiling on how much of the cluster's shared resources this namespace is allowed to consume. The request values (requests.cpu and requests.memory) limit the aggregate CPU and memory requests that all pods in the namespace may declare. The limits values (limits.cpu and limits.memory) define the maximum value for the sum of all limits in the namespace. In our example we allow workloads to burst above their guaranteed resources while still reserving only 8 vCPUs of scheduler capacity. We also show how we can limit the maximum number of pods that can exist and the maximum number of Kubernetes Services that can exist at any one time.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-1-quota
  namespace: team-1
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
    services: "10"

LimitRange defines the minimum, maximum and default request/limit values per container, preventing workloads from requesting unreasonably small or large amounts of CPU and memory. The default values apply if a pod has not specified any limits.

apiVersion: v1
kind: LimitRange
metadata:
  name: team-1-limits
  namespace: team-1
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi

In addition, we also create a Kubernetes ServiceAccount for each workload within the namespace. A service account is scoped to a namespace and not a cluster. This means in our example below, the "team-1-sa" service account only exists in the "team-1" namespace, and cannot be used by a pod in a different namespace.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-1-sa
  namespace: team-1
  labels:
    team: team-1

Every namespace automatically gets a default service account created for it. If you don't specify one, pods will use default. We enforce the use of named service accounts with Kyverno which we cover later in the blog.

Layer 3 - Pod Identity, role chaining and ABAC

Each application developed by the workload team and running on the central EKS cluster needs to access AWS resources that are running in the workload team's AWS account. Historically, this was achieved using IAM Roles for Service Accounts (IRSA), which allowed you to deliver temporary AWS credentials to workloads running on EKS. This also requires enabling cross-account access and setting up an IAM OIDC provider.

At re:Invent 2023, AWS launched EKS Pod Identities as a simpler way of delivering temporary AWS credentials to your pods running on EKS. EKS Pod Identities integrate with the EKS control plane and on-cluster agent so that pods receive credentials without requiring you to create or manage an IAM OIDC identity provider. EKS Pod Identities are the AWS recommended approach for new workloads on supported node types, and is the approach adopted in this blog post.

With EKS Pod Identity, you associate a Kubernetes service account in your cluster with an IAM role in the same AWS account as the cluster. EKS uses this association to obtain temporary credentials on behalf of the pod for that IAM role and securely deliver them to pods to use the service account.

EKS Pod Identities natively support cross account access by using a target IAM role in the workload account and IAM role chaining. When you create a Pod Identity association for a Kubernetes service account, you specify both a pod IAM role in the cluster account and a target IAM role in the workload account. EKS Pod Identity uses the pod role to assume the target role and returns temporary credentials for the target role to the pod.

We set this up in terraform in the platform account as follows:

Firstly we create an IAM role called "pod_role_team_1". This has a trust policy that means only the EKS Pod Identity service is allowed to assume the role and attach session tags.

resource "aws_iam_role" "pod_role_team_1" {
  name = "pod-role-team-1"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "pods.eks.amazonaws.com" }
      Action    = ["sts:AssumeRole", "sts:TagSession"]
    }]
  })
  tags = { Team = "team-1" }
}

We then create a policy that has only the permissions to assume the "Team1PodTargetRole" in the workload account and attach session tags. This permission is intentionally minimal to restrict the blast radius of it being compromised.

resource "aws_iam_policy" "pod_role_team_1_assume" {
  name = "pod-role-team-1-assume-target"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sts:AssumeRole", "sts:TagSession"]
      Resource = "arn:${local.partition}:iam::${var.workload_team_1_account_id}:role/Team1PodTargetRole"
    }]
  })
}

The next step is to attach the policy to the "pod_role_team_1" role.

resource "aws_iam_role_policy_attachment" "pod_role_team_1_assume" {
  role       = aws_iam_role.pod_role_team_1.name
  policy_arn = aws_iam_policy.pod_role_team_1_assume.arn
}

Finally, we create a Pod Identity association. This maps the "team-1-sa" service account in the "team-1" namespace to the "pod_role_team_1" role, and crucially also specifies the target role in the workload account. Because we set target_role_arn, EKS Pod Identity performs the role chaining itself — it assumes the pod role, then assumes the target role, and delivers credentials for the target role directly to the pod.

resource "aws_eks_pod_identity_association" "team_1" {
  cluster_name    = module.eks.cluster_name
  namespace       = "team-1"
  service_account = "team-1-sa"
  role_arn        = aws_iam_role.pod_role_team_1.arn
  # Native cross-account role chaining — EKS assumes this target role for the pod
  target_role_arn = "arn:${local.partition}:iam::${var.workload_team_1_account_id}:role/Team1PodTargetRole"
}

This works by creating an association stored by the EKS control plane. On standard EKS clusters, Pod Identity requires the EKS Pod Identity Agent add-on. With EKS Auto Mode, this capability is built into the managed nodes, so there is no add-on or DaemonSet to install or manage. When a pod makes an AWS SDK request and requires credentials, the agent reads the pod's service account and namespace and queries EKS for a matching association. Finding one, EKS performs two assume-role calls in sequence on the pod's behalf:

It assumes pod-role-team-1 (the pod role in the cluster account).
Using those credentials, it assumes Team1PodTargetRole in the workload account, attaching the Kubernetes session tags as it does so.

The pod never makes either call — it simply receives the final target-role credentials. The second assume-role looks effectively like:

sts:AssumeRole(
  RoleArn = "arn:aws:iam::<workload-account>:role/Team1PodTargetRole",
  RoleSessionName = "eks-...",
  Tags = [                       # set by EKS via sts:TagSession on this call
    { Key: "kubernetes-namespace",       Value: "team-1" },
    { Key: "kubernetes-service-account", Value: "team-1-sa" },
    { Key: "eks-cluster-name",           Value: "shared-platform-cluster" },
    { Key: "eks-cluster-arn",            Value: "arn:aws:eks:..." }
  ]
)

As EKS sets these tags on the call the assumes the target role, the workload account validates them as aws:requestTag/.. (the tags on the incoming request), and locks the caller down to the pod role with aws:PrincipalArn.

In our workload account, we create the target role that EKS Pod Identity chains into. Its trust policy allows only the pod-role-team-1 role (via aws:PrincipalArn) to assume it — no other AWS principal can. The condition then validates the session tags that EKS attaches when it assumes the role: the pod must be running under the specific service account, in the specified namespace, on the exact EKS cluster. Because EKS sets these tags on the assume-role request itself, we match them with aws:RequestTag/... (not aws:PrincipalTag). This is what makes the authorization attribute-based (ABAC) rather than identity-based.

resource "aws_iam_role" "pod_target_role" {
  name = "Team1PodTargetRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        AWS = "arn:${local.partition}:iam::${var.platform_account_id}:root"
      }
      Action = [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
      Condition = {
        # ABAC: validate the session tags EKS sets when assuming this role
        StringEquals = {
          "aws:RequestTag/kubernetes-service-account" = "${var.team_name}-sa"
          "aws:RequestTag/kubernetes-namespace"       = var.team_name
          "aws:RequestTag/eks-cluster-arn"            = var.cluster_arn
        }
        # Lock the caller down to exactly the pod role
        ArnEquals = {
          "aws:PrincipalArn" = var.pod_role_arn
        }
      }
    }]
  })
}

In our workload account, we can then assign a policy to this role to give it the permissions required to interact with any required AWS resources. In this case, it is just to a DynamoDB table.

The following sequence diagram highlights how this all hangs together.

With the native target-role flow, the application code needs no STS calls at all — boto3 picks up the target-role credentials from Pod Identity automatically, and the pod talks to DynamoDB directly.

Layer 4 — Network isolation

By default, all pods in a Kubernetes cluster can communicate freely — there is no restriction on pod-to-pod traffic until you add a NetworkPolicy. As soon as a policy selects a pod, the default flips: anything not explicitly allowed is denied. NetworkPolicies operate at L3/L4 (IP, port, and pod/namespace selectors)
— they cannot match an AWS resource identity such as a security group or an ALB.

One thing we discovered was that with EKS Auto Mode clusters, the Network Policy Controller is off by default. This means NetworkPolicy objects are accepted by the Kubernetes API but not enforced by the data plane, allowing cross-namespace traffic to flow freely and isolation silently fails. We enable the Network Policy Controller by applying a ConfigMap. Enforcement is then handled on each mode by an eBPF agent.

apiVersion: v1
kind: ConfigMap
metadata:
  name: amazon-vpc-cni
  namespace: kube-system
data:
  enable-network-policy-controller: "true"

Ingress — who can reach team-1 pods

We apply a Network Policy that allows pods within team-1 to talk to each other. Everything else — including other namespaces — is denied. In our example, we have a Deployment that tells Kubernetes to run 2 replicas of the team 1 app container. We expose the Pods through a Service, giving the workload a stable ClusterIP and DNS name. This is carried out as we test the applications locally using port forwarding.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace-ingress
  namespace: team-1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}

If we wanted to expose the workload externally via an Application Load Balancer, we would need to allow ingress from the ALB. To enforce that only the ALB could reach these pods in this namespace, we would use Security Groups for Pods.

This is carried out by putting the pods behind their own EC2 security group, so the network interface becomes governed by security group rules just like an EC2 instance. You can achieve this with EKS Auto Mode by setting the podSecurityGroupSelectorTerms on the NodeClass, and then EKS Auto Mode will attach the selected group to the pods branch network interface. This looks as follows:

apiVersion: eks.amazonaws.com/v1
kind: NodeClass
metadata:
  name: team-workloads
spec:
  # ... subnet / role config ...
  podSecurityGroupSelectorTerms:
    - matchLabels:
        Name: team-1-pod-sg     # selects the SG that allows ingress only from the ALB SG

The selected security group would allow inbound 8080 only from the ALB's security group. This provides the identity-based control that you cannot express with just NetworkPolicy. This means you can separate the division of responsibility as:

NetworkPolicy — pod-to-pod and namespace isolation inside the cluster (L3/L4).
Security groups for pods (via NodeClass on Auto Mode) — AWS-resource-level control (ALB→pod, pod→RDS), using security group identity rather than IP ranges.

Egress — what can team-1 pods reach out too

Locking down egress for the team-1 pods turned out to be an interesting experience. Originally, a default-deny egress policy still allowed team-1 pods to access DynamoDB in the workload account. However, this was before the Network Policy Controller had been enabled. This broke the application in a series of timeout errors, which we were slowly able to work through. The fix was to treat egress as an explicit allow-list of the platform plumbing the pod depends on, and then layer the applications own destinations on top.

The Network Policy for egress (ignoring DynamoDB which we deal with separately) is as follows:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-aws-egress
  namespace: team-1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # DNS — the cluster DNS service IP (CoreDNS is node-local; see note)
    - to:
        - ipBlock:
            cidr: 172.20.0.10/32
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # EKS Pod Identity Agent — node-local, link-local. REQUIRED for any AWS
    # access: the SDK fetches credentials from http://169.254.170.23/v1/credentials
    # (ports 80 and 2703). Without this the whole credential chain fails (see note).
    - to:
        - ipBlock:
            cidr: 169.254.170.23/32
      ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 2703
    # Same-namespace pod-to-pod, incl. the team's own ClusterIP service. The
    # service CIDR is needed because the eBPF egress hook evaluates the pre-DNAT
    # service IP; cross-namespace is still blocked at the destination's ingress.
    - to:
        - podSelector: {}
        - ipBlock:
            cidr: 172.20.0.0/16       # cluster service CIDR
    # AWS Interface Endpoints (STS, ECR, CloudWatch Logs) live in the private
    # subnets. With private DNS enabled, sts.<region>.amazonaws.com resolves to a
    # private VPC IP, so this stays internal — no internet egress.
    - to:
        - ipBlock:
            cidr: 10.0.0.0/16         # platform VPC — where endpoint ENIs live
      ports:
        - protocol: TCP
          port: 443

Before EKS Auto Mode, CoreDNS would run as a regular Kubernetes pod in the kube-system namespace and labelled as k8s-app: kube-dns. This meant you could use a pod selector such as k8s-app: kube-dns as a rule to allow workloads to perform DNS lookups while still having a default-deny egress rule. With EKS Auto Mode, the Auto Mode nodes use CoreDNS running as a system service directly on each node. Each pod is configured through /etc/resolv.conf to send DNS queries to the cluster DNS service IP, which in our case was 172.20.0.10. Every Pod receives a /etc/resolv.conf file when it starts, which tells the operating system's DNS resolver which nameserver to use. We match that in our Network Policy using an IP block and allowing traffic via port 53 for DNS queries.

With EKS Auto Mode, the Pod Identity capability is provided as a built-in node-local component rather than a DaemonSet you install or manage. It exposes a link-local HTTP endpoint at 169.254.170.23 on ports 80 and 2703 that is reachable only by pods on the same node. Because the address is link-local, the request never leaves the node. Every node runs its own instance of this capability, so each pod only ever talks to the component on the node it is scheduled on. We need to allow egress to this address to allow the AWS SDK in our application code to fetch credentials.

Our team-1 app needs to access DynamoDB which is exposed via a Gateway Endpoint. The pod resolves the public DynamoDB hostname, so the initial packet is destined for one of DynamoDB's public IP addresses. The eBPF egress hook evaluates that packet before the VPC route table redirects it via the Gateway Endpoint. These IP's rotate, so there is no stable CIDR range to match. EKS Auto Mode has an ApplicationNetworkPolicy (an AWS CRD that extends NetworkPolicy with a domainNames filter). EKS Auto Mode uses the domainNames rule to allow egress to the IP addresses resolved for that FQDN.

apiVersion: networking.k8s.aws/v1alpha1
kind: ApplicationNetworkPolicy
metadata:
  name: allow-dynamodb-egress       # must NOT clash with the NetworkPolicy name
  namespace: team-1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - domainNames:
            - "dynamodb.eu-west-1.amazonaws.com"
      ports:
        - protocol: TCP
          port: 443

Layer 5 — Policy as code with Kyverno

Pod Security Admission (PSA) provides Kubernetes built-in enforcement of the Pod Security Standards, protection against common privilege escalation risks such as privileged containers, host networking and hostPath mounts. This sets out a minimum security baseline, on top of which we add Kyverno. Kyverno is a general-purpose policy engine. We use it to enforce specific standards that PSA cannot express:

Restricting container images to approved registries (e.g. Amazon ECR)
Mandating CPU/memory limits
Requiring team/app labels
Forbidding the use of the default service account so every workload runs under an explicit least-privilege identity

We also use it as defense-in-depth, re-asserting key pod-security controls (non-root, no privileged containers, no host namespaces) so we are not relying on a single enforcement mechanism.

An example of one of our Kyverno policies is shown below:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-default-serviceaccount
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-default-sa
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [team-1, team-2]
      validate:
        message: "Pods must not use the 'default' service account"
        pattern:
          spec:
            =(serviceAccountName): "!default"

This policy is used to ensure all pods in the team-1 and team-2 namespace name an explicit service account. Any pod that requested the default service account (or did not specify one at all) would be rejected before it was ever scheduled. Enforcing named ServiceAccounts ensures every workload has an explicit identity, allowing us to map it to the correct IAM role using EKS Pod Identity. EKS Pod Identity maps a Kubernetes ServiceAccount (within a namespace) to a per-team IAM role.

The use of Kyverno is an example of policy as code. By expressing these requirements as policies rather than documentation, every deployment is validated consistently by the Kubernetes API server, preventing non-compliant workloads from ever entering the cluster.

Layer 6 — GitOps deployment isolation

With a multi-tenant EKS cluster running in a platform account, it was critical to limit access to the account. Workload teams are given no capability to run kubectl against the cluster, or to assume a role into the platform account. They deploy to the cluster by committing manifests to their own Git repository, and Argo CD (running as a managed EKS capability) reconciles the cluster to match. Argo CD implements a GitOps workflow where you define your application configurations in Git repositories and Argo CD automatically syncs your applications to match the desired state. For application deployments, the only identity that applies workload manifests to the cluster is the EKS Capability for Argo CD. The workload team is restricted to the Argo CD AppProject. This is where the isolation lives, and it constrains the workload team as follows:

Source - the AppProject only permits applications sourced from the approved "team-1-gitops" repository. A team cannot point Argo CD at an arbitrary repository, and cannot sync a manifest that lives in someone else's.
Destination - each project is pinned to a single namespace on the one registered cluster. Even if team-2 specified a namespace of team-1, it would be refused when it was synced under their project.
Resource kinds - each project allow-lists only the resource kinds that an application requires such as Deployments, Services and ConfigMaps. Cluster-scoped resources are locked down, so a workload team cannot create their own ClusterRole or Namespace, or a CRD to escalate their privileges.

The applications sync with selfHeal: true and prune: true, which means Git acts as the source of truth. selfHeal automatically corrects drift if resources are modified outside Git, while prune removes Kubernetes resources that have been deleted from the Git repository. If the live state drifts from what has been committed to Git, Argo CD automatically reconciles it back to the desired state. This means every application change is auditable as a Git commit, with Git acting as the single source of truth for the deployed configuration.

Because we use the Amazon EKS Capability for Argo CD, AWS manages the operational aspects of Argo CD—including upgrades, high availability and scaling—allowing the platform team to focus on defining deployment policies rather than operating the GitOps platform itself

Layer 7 — Observability and audit isolation

Alongside preventing different workloads from affecting each other, it is equally important to ensure that operational data such as logs, metrics, traces and audit records are also appropriately isolated. Each workload team should be able to troubleshoot their own application without gaining visibility into another team's logs or operational data.

Application telemetry should be partitioned by tenant or namespace to ensure each team only has visibility into its own workloads. Application telemetry is commonly collected using Fluent Bit for logs and the AWS Distro for OpenTelemetry (ADOT), or another OpenTelemetry Collector, for metrics and traces. These collectors can route telemetry to a variety of backends, including Amazon CloudWatch, Amazon Managed Service for Prometheus, Amazon OpenSearch Service, Grafana, Elastic or third-party observability platforms. Regardless of the backend, logs, metrics and traces should be partitioned using Kubernetes metadata attributes such as namespace, workload, service or tenant identifiers, with access enforced by the observability platform's authorisation model.

At the platform level, Amazon EKS control plane audit logging provides a complete audit trail of every request made to the Kubernetes API server. These audit logs capture actions such as creating or deleting workloads, modifying namespaces, changing RBAC policies and accessing Kubernetes resources. Unlike application logs, audit logs are intended for the platform and security teams, providing cluster-wide visibility for operational monitoring, compliance and forensic investigations. AWS CloudTrail complements Kubernetes audit logging by recording AWS API activity, including IAM role assumptions made through EKS Pod Identity and access to AWS services. Together, Kubernetes audit logs and CloudTrail provide a complete audit trail spanning both the Kubernetes control plane and the AWS control plane.

Testing our Isolation Constraints

We are able to run a series of tests, to prove how effective a number of these layers are in providing secure isolation between tenants. Each of the following tests deliberately attempts to violate one of the isolation boundaries described earlier. The expected outcome is that the request is denied by the appropriate layer, demonstrating that no single control is relied upon in isolation.

Test 1a: Cross network traffic between namespace (expected failure)

This test checks that a pod within the team-1 namespace cannot open a TCP socket to a service inside the team-2 namespace. This results in a timeout error as the connection is dropped by the NetworkPolicy that denies all cross namespace ingress. Because a NetworkPolicy drop is silent, the connection attempt hangs until it is finally times out.

multi-account-eks-demo % kubectl exec -n team-1 deploy/team-1-app -i -- python - <<'PY'
import socket
socket.setdefaulttimeout(5)
socket.create_connection(('team-2-app.team-2.svc.cluster.local', 80))
print('CONNECTED')
PY
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/local/lib/python3.12/socket.py", line 865, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.12/socket.py", line 850, in create_connection
    sock.connect(sa)
TimeoutError: timed out
command terminated with exit code 1

Test 1b: Cross network traffic within namespace (expected success)

This test runs the same code as the previous test, but checking whether a pod within the team-1 namespace can open a TCP socket to a service inside its own namespace. This results in a "CONNECTED" response.

multi-account-eks-demo % kubectl exec -n team-1 deploy/team-1-app -i -- python - <<'PY'
import socket
socket.setdefaulttimeout(5)
socket.create_connection(('team-1-app.team-1.svc.cluster.local', 80))
print('CONNECTED')
PY
CONNECTED

Test 2a: Cross account IAM role chaining to own workload account (expected success)

This test confirms that the cross account IAM role chaining detailed in Layer 3 works as described. The test executed a script in a pod running in the "team-1" namespace, and returns the IAM identity being used to make the call. The returned identity shows that the pod is operating as the target role in the Team 1 workload account (not the platform account)

multi-account-eks-demo % kubectl exec -n team-1 deploy/team-1-app -i -- python - <<'PY'
import boto3, os
# Use the REGIONAL STS endpoint. A bare boto3.client('sts') targets the global
# sts.amazonaws.com (a public IP) which egress intentionally blocks, so it hangs.
# The regional host resolves to the private STS interface endpoint inside the VPC.
sts = boto3.client('sts', region_name=os.environ['AWS_REGION'])
print('IDENTITY:', sts.get_caller_identity()['Arn'])
PY
IDENTITY: arn:aws:sts::169928422290:assumed-role/Team1PodTargetRole/eks-shared-pla-team-1-app-51fa2855-9605-4a5e-95bd-3be8e8754e6a

Test 2b: Cross account IAM role chaining to another workload account (expected failure)

This test attempts to assume the Team 2 target role in the Team 2 workload account. It returns with Access Denied. This request is denied by the trust policy on the target role which will only allow an assume role if it comes from the Team 2 pod role. The trust policy also requires ABAC session tags that identify the caller as coming from the "team-2" namespace and the "team-2-sa" ServiceAccount.

multi-account-eks-demo % kubectl exec -n team-1 deploy/team-1-app -i -- python - <<PY
import boto3, botocore, os
try:
    boto3.client('sts', region_name=os.environ['AWS_REGION']).assume_role(
        RoleArn='arn:aws:iam::${ACCOUNT_3_ID}:role/Team2PodTargetRole',
        RoleSessionName='cross-team-attempt')
    print('ASSUMED - unexpected')
except botocore.exceptions.ClientError as e:
    print('DENIED:', e.response['Error']['Code'])
PY
DENIED: AccessDenied

Test 3: GitOps deployment into another team's namespace

This test attempts to deploy into the Team 2 namespace through the GitOps pipeline. It creates an Argo CD Application under the "team-1" project, sourced from Team 1's own repo, but with its destination set to the "team-2" namespace. The Application object is created successfully, but Argo CD refuses to sync it and marks it InvalidSpecError. The block comes from the Team 1 AppProject, whose destinations list only permits the "team-1" namespace on the shared cluster. A destination of "team-2" matches no allowed destination and is rejected before anything lands in Team 2's namespace.

multi-account-eks-demo % kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cross-team-attempt
  namespace: argocd
spec:
  project: team-1
  source:
    repoURL: "https://github.com/mlewis7127/team-1-gitops"
    targetRevision: main
    path: manifests/
  destination:
    server: arn:aws:eks:eu-west-1:424727766526:cluster/shared-platform-cluster
    namespace: team-2
EOF
application.argoproj.io/cross-team-attempt created
(.venv) multi-account-eks-demo % kubectl get application cross-team-attempt -n argocd -o jsonpath='{.status.conditions}'
[{"lastTransitionTime":"2026-06-30T10:55:42Z","message":"application destination server 'arn:aws:eks:eu-west-1:424727766526:cluster/shared-platform-cluster' and namespace 'team-2' do not match any of the allowed destinations in project 'team-1'","type":"InvalidSpecError"}]%

Test 4a: Deploying an image from an unapproved registry

This test attempts to deploy a Pod whose image comes from Docker Hub (docker.io/library/nginx) rather than an approved AWS registry. It is rejected at admission by Kyverno. The restrict-image-registries policy requires every container image to come from an approved ECR registry (*.dkr.ecr.*.amazonaws.com/* or public.ecr.aws/*), so the Docker Hub image fails validation and the Pod is never created.

multi-account-eks-demo % kubectl apply -n team-1 -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rogue-registry
  labels: { team: team-1, app: rogue }
spec:
  serviceAccountName: team-1-sa
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: c
      image: docker.io/library/nginx:latest      # <-- not an approved registry
      resources:
        requests: { cpu: 50m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities: { drop: ["ALL"] }
EOF
Error from server: error when creating "STDIN": admission webhook "validate.kyverno.svc-fail" denied the request: 

resource Pod/team-1/rogue-registry was blocked due to the following policies 

restrict-image-registries:
  validate-image-registry: 'validation error: Images must come from approved ECR registries.
    rule validate-image-registry failed at path /spec/containers/0/image/'

Test 4b: Deploying a Pod without required team labels

This test attempts to deploy a Pod that is missing the mandatory team and app labels. It is rejected at admission by Kyverno. The require-team-labels policy requires every Pod in the tenant namespaces to carry both a team and an app label (used for ownership, cost allocation, and workload selection) so a Pod with no labels fails validation and is never created.

multi-account-eks-demo % kubectl apply -n team-1 -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rogue-labels
spec:
  serviceAccountName: team-1-sa
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: c
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests: { cpu: 50m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities: { drop: ["ALL"] }
EOF
Error from server: error when creating "STDIN": admission webhook "validate.kyverno.svc-fail" denied the request: 

resource Pod/team-1/rogue-labels was blocked due to the following policies 

require-team-labels:
  require-labels: 'validation error: Pods must have ''team'' and ''app'' labels. rule
    require-labels failed at path /metadata/labels/app/'

Test 4c: Deploying a Pod using the default service account

This test attempts to deploy a Pod that runs under the default service account. It is rejected at admission by Kyverno. The disallow-default-serviceaccount policy forbids the default service account in the tenant namespaces, because every workload must run under an explicit, named service account. The service account is the anchor for the entire Pod Identity to IAM role chain. A Pod using the default ServiceAccount would not be associated with the intended Pod Identity mapping, so it is never created.

multi-account-eks-demo % kubectl apply -n team-1 -f - <<'EOF'           
apiVersion: v1
kind: Pod
metadata:       
  name: rogue-sa                      
  labels: { team: team-1, app: rogue }
spec:                                                              
  serviceAccountName: default                     # <-- not allowed
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: c
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests: { cpu: 50m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities: { drop: ["ALL"] }
EOF
Error from server: error when creating "STDIN": admission webhook "validate.kyverno.svc-fail" denied the request: 

resource Pod/team-1/rogue-sa was blocked due to the following policies 

disallow-default-serviceaccount:
  deny-default-sa: 'validation error: Pods must not use the ''default'' service account.

Test 4d: Deploying a container that does not enforce non-root at the container level

This test attempts to deploy a Pod that sets runAsNonRoot at the Pod level (enough to satisfy Pod Security Admission) but omits it on the container's own securityContext. It is rejected at admission by Kyverno. The require-non-root policy demands runAsNonRoot on the container itself, so this Pod fails validation and is never created. This is a good example of Kyverno layering a stricter check on top of the PSA baseline. The manifest would pass PSA, but our organisational policy requires the guarantee to be explicit at the container level.

multi-account-eks-demo % kubectl apply -n team-1 -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rogue-nonroot
  labels: { team: team-1, app: rogue }
spec:
  serviceAccountName: team-1-sa
  securityContext:
    runAsNonRoot: true                            # pod-level satisfies PSA
    runAsUser: 1000
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: c
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests: { cpu: 50m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }
      securityContext:                            # no runAsNonRoot here
        allowPrivilegeEscalation: false
        capabilities: { drop: ["ALL"] }
EOF
Error from server: error when creating "STDIN": admission webhook "validate.kyverno.svc-fail" denied the request: 

resource Pod/team-1/rogue-nonroot was blocked due to the following policies 

require-non-root:
  run-as-non-root: 'validation error: Containers must run as non-root. rule run-as-non-root
    failed at path /spec/containers/0/securityContext/runAsNonRoot/'

Testing Results

Attempt	Layer that blocked it	Result
Connect to another namespace	NetworkPolicy	Timed out
Assume another team's role	IAM trust policy + ABAC	AccessDenied
Deploy to another namespace	Argo CD AppProject	InvalidSpecError
Use Docker Hub	Kyverno (restrict-image-registries)	Denied at admission
Use default ServiceAccount	Kyverno (disallow-default-serviceaccount)	Denied at admission
Omit non-root	Kyverno (require-non-root)	Denied at admission

Additional Considerations

All of the layers we have provided so far give a level of isolation between tenants on a shared EKS cluster. If you need to go even further, there are two other directions. The first is to offer greater network isolation beyond what a NetworkPolicy can provide using a service mesh. The second is to offer greater compute isolation for when your threat model includes untrusted code or you don't want to rely on a shared kernel.

Option 1 - Adopting a Service Mesh (Zero Trust Networking)

By default, pod-to-pod communication in Amazon EKS Auto Mode is not protected with application-layer encryption. Pods communicate using native VPC networking, so end-to-end encryption between workloads requires application-layer TLS or a service mesh providing mutual TLS (mTLS). With VPC Encryption Controls, traffic between pods on different Nitro-based worker nodes can be encrypted in transit at the AWS networking layer. This protects packets while they traverse the AWS network but does not provide workload authentication, certificate identity or mTLS between services.

A service mesh provides each workload with a cryptographic identity. Services authenticate each other using certificates rather than relying solely on network location, IP addressing or namespace-based trust. It also allows for application-aware (Layer 7) authorisation policies based on HTTP methods, paths, or headers, rather than just a layer 4 (Transport) connection policy. Istio is the most widely adopted full-featured service mesh for Kubernetes. Adopting a service mesh brings with it additional operational complexity and infrastructure overhead. Traditionally, service meshes such as Istio injected a sidecar proxy into every pod, increasing CPU and memory consumption as well as pod startup times. More recently, Istio introduced Ambient Mesh, which replaces per-pod sidecars with shared node-level ztunnel proxies for Layer 4 functionality and optional waypoint proxies for Layer 7 policies, significantly reducing this overhead.

Option 2 - Adopting Fargate

The hardware virtualisation boundary provided by a hypervisor offers significantly stronger isolation than Linux namespaces and cgroups alone. One option to achieve this is to run pods on AWS Fargate rather than on shared EC2 worker nodes. Fargate is a separate serverless compute option (configured through Fargate profiles) rather than part of EKS Auto Mode, so adopting it means running those workloads on a different data plane from the Auto Mode managed nodes used elsewhere in this post. Fargate runs each pod within its own Firecracker microVM, providing a dedicated kernel and virtual machine isolation boundary for that pod. Unlike EC2-backed worker nodes, pods do not share a Linux kernel with other workloads, reducing the impact of a potential container escape.

There are trade-offs. Fargate does not support DaemonSets or privileged containers, making it unsuitable for some infrastructure agents and security tooling. Resources are also sized per pod, so you cannot take advantage of the workload bin-packing that EKS Auto Mode performs across managed EC2 instances.

Note that AWS now recommend EKS Auto Mode with EC2 managed instances over EKS Fargate link

Option 3 - Deploying workloads on separate nodes

EKS Auto Mode allows customers to achieve a similar workload isolation model to Fargate, using standard Kubernetes scheduling capabilities to ensure each EC2 instance runs a single application pod. To replicate Fargate’s pod isolation model where each pod runs on its own dedicated instance, you can use Kubernetes topology spread constraints. This is the recommended approach for controlling pod distribution across nodes. EKS Auto Mode will automatically provision new EC2 instances as needed to satisfy this constraint, providing a similar isolation model as Fargate while giving you access to the full range of EC2 instance types and purchasing options. In this example setting a maxSkew of 1 ensures the difference in pod count between any two nodes is at most 1, effectively resulting in one pod per node as EKS Auto Mode provisions additional instances to satisfy the scheduling constraint. The whenUnsatisfiable: DoNotSchedule attribute prevents scheduling if the constraint can't be made.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: isolated-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: isolated-app
  template:
    metadata:
      labels:
        app: isolated-app
      annotations:
        eks.amazonaws.com/compute-type: ec2
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: isolated-app
        minDomains: 1
      containers:
      - name: app
        image: nginx
        ports:
        - containerPort: 80

You can also use use pod anti-affinity rules for stricter isolation. The podAntiAffinity rule with requiredDuringSchedulingIgnoredDuringExecution ensures that no two pods with the same label can be scheduled on the same node. This approach provides a stronger scheduling guarantee than topology spread constraints by preventing matching pods from being scheduled onto the same node.

The trade-off for this approach is efficiency and cost. Kubernetes achieves high utilisation through bin-packing multiple workloads onto shared nodes. Scheduling a single workload per EC2 instance reduces that density, resulting in more instances and higher infrastructure costs, although EKS Auto Mode continues to manage the provisioning, scaling and lifecycle of those instances automatically. This approach also preserves support for DaemonSets, privileged workloads and the broader EC2 feature set that are not available on Fargate.

Additional Security Considerations

Kubernetes Secrets

A strong security posture is to minimise the number of long-lived secrets stored within the cluster. Kubernetes Secrets are namespace-scoped, so a secret in "team-1" cannot be read by a pod in "team-2". Access is controlled through Kubernetes RBAC. EKS already encrypts etcd at rest using an AWS-owned key by default, and production clusters should add a further layer by enabling envelope encryption with a customer-managed AWS KMS key. Either way, every namespace's secrets remain stored within the same shared etcd datastore that forms part of the EKS control plane.

In our architecture, we used Pod Identity with short-lived role-based credentials. When we needed to interact with an RDS instance in a workload account, we used RDS IAM authentication, so there was no database password to hold.

The recommended approach when a secret is required is to store it in AWS Secrets Manager, with the secret scoped to the specific team that needs it. AWS Secrets Manager authorises access using IAM policies rather than Kubernetes RBAC. EKS Pod Identity allows pods to obtain AWS credentials without embedding static credentials in the cluster. This also provides additional capabilities such as automatic rotation, CloudTrail auditing, resource policies and centralised IAM-based access control.

In practice, many AWS-native workloads require few or no application secrets. Services such as Amazon S3, DynamoDB, SQS, SNS and RDS IAM authentication can all be accessed using short-lived IAM credentials delivered through EKS Pod Identity.

Image Provenance

Image provenance is another important consideration. Restricting workloads to approved registries ensures images originate from trusted locations, but it does not prove that an image has not been tampered with.

Amazon ECR supports managed container image signing using AWS Signer, automatically generating cryptographic signatures as images are pushed to the registry. Amazon EKS can then verify these signatures during deployment using its native image signature verification capability. Trusted signing profiles and verification policies define which signed images are permitted to run, helping ensure only trusted workloads are deployed. Combined with Amazon ECR image scanning, this establishes a trusted software supply chain from build through to deployment. This reduces the risk of one tenant deploying a malicious or tampered container image that could compromise the shared platform.

For organisations using a broader Kubernetes ecosystem, Sigstore Cosign remains a widely adopted alternative. Native Amazon EKS image verification currently validates Notation (Notary v2) signatures produced by AWS Signer, whereas Cosign signatures are typically verified during admission using a policy engine such as Kyverno's verifyImages.

Runtime threat detection

All of our controls described so far are preventative, and aim to stop a non-compliant workload from carrying out an activity it should not. To provide defence in depth, we are also interested in detective controls, and this is where Amazon GuardDuty EKS Protection comes in.

Amazon GuardDuty Runtime Monitoring supports Amazon EKS clusters running on Amazon EC2 instances and Amazon EKS Auto Mode. An automated agent configuration approach is available which allows GuardDuty to manage the deployment of the security agent on your behalf. With this approach, the VPC endpoint is created for you, and this is used to deliver the runtime events to the security agent. The security agent monitors process execution, file access and network activity to detect compromised containers and privilege escalation attempts, allowing them to be investigated and remediated.

Adding Memory to the Agent

Matt Lewis — Tue, 16 Jun 2026 08:54:17 +0000

This is the fourth in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

As mentioned in the first blog post, each session on AgentCore Runtime is assigned a dedicated Firecracker microVM with isolated CPU, memory and filesystem resources. When the session finishes, the entire microVM is destroyed. There is no shared state between sessions, which prevents any cross-session data leakage.

When a user accesses our AWS Briefing Agent service for the first time, they are asked a number of questions.

This includes asking about the primary AWS services the user is interested in, their experience level in AWS, and if there are specific AWS areas they want to track closely.

Without any memory capability, the user will have to provide the same information each time they start a new session. This is where AgentCore Memory comes into play. This post walks through setting up AgentCore Memory using Strands Agents.

Configuring Memory in AgentCore

The agentcore.json file is the primary configuration file used in Amazon Bedrock AgentCore to define and manage AI agents, gateways, memory stores and datasets. It acts as the central orchestrator to package up the agents infrastructure.

When we run the agentcore deploy command, the CLI reads this file and uses the AWS CDK to synthesize and deploy CloudFormation resources. We add long term to our agent in the memory section using a resource identifier of "BriefingAgentMemory". This is the identifier that is referenced in our handler.

AgentCore Memory itself consists of several key components that work together to provide both short-term context and long-term intelligence for agents as shown in the diagram below:

The interactions with the user are stored in short term for 90 days, as specified in the event expiry duration attribute. We then specify two distinct memory strategies that transform these short term raw events into long-term memory. Note that all strategies by default ignore personally identifiable information (PII) data from long-term memory records.

We define the following strategies in the agentcore.json file:

SEMANTIC - this memory strategy identifies and extracts key pieces of factual information and contextual knowledge from conversational data. For example, a user is running AWS Lambda in production.
USER_PREFERENCE - this memory strategy is designed to automatically identify and extract user preferences, choices and styles from conversations. For example, a user is interested in serverless and containers.

Each strategy stores its long-term memory in a hierarchical structure within a namespace. These namespaces act as distinct logical containers. We segregate them using the special {actorId} placeholder variable, so that we guarantee separation between each user.

The complete relevant memory section in our agentcore.json file is shown below:

  "memories": [
    {
      "name": "BriefingAgentMemory",
      "eventExpiryDuration": 90,
      "strategies": [
        {
          "type": "SEMANTIC",
          "name": "semantic_facts",
          "namespaces": [
            "/users/{actorId}/facts"
          ]
        },
        {
          "type": "USER_PREFERENCE",
          "name": "user_preferences",
          "namespaces": [
            "/users/{actorId}/preferences"
          ]
        }
      ]
    }
  ],

Integrating Cognito and AgentCore Runtime

At this point, we need to do a segway into how we authenticate requests to our agent. AgentCore Runtime supports two inbound authentication mechanisms:

AWS IAM SigV4 - where the request to the InvokeAgentRuntime API is SigV4-signed with valid AWS credentials that have the bedrock-agentcore:InvokeAgentRuntime IAM permission.
JWT Bearer Token Auth - which is configured with an Inbound JWT authoriser

When our frontend invokes the agent, it is sending a request to the agent's public endpoint URL:

https://bedrock-agentcore.eu-west-1.amazonaws.com/runtimes/<arn>/invocations

This URL is a special public-facing endpoint that AgentCore Runtime exposes. We specify this in the agentcore.json file:

  "runtimes": [
    {
      "name": "AWSBriefingAgent",
      "build": "Container",
      "entrypoint": "handler.py",
      ...
      "authorizerType": "CUSTOM_JWT",
      "authorizerConfiguration": {
        "customJwtAuthorizer": {
          "discoveryUrl": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_dshjdhskj/.well-known/openid-configuration",
          "allowedClients": [
            "dhjhdjskhdjkshdjkhsd"
          ]
        }
      }
    }

The discoveryUrl points to Cognito's OpenID Connect discovery document for the AWS Cognito User Pool with the specified ID that is being used to authenticate users to the frontend. When AgentCore Runtime wants to validate the JWT token, it retrieves information from this endpoint such as the issuer and JWKS endpoint (contains the public keys used to verify the JWT signature).

The allowedClients shows the Cognito Application Client ID. When a user logs in, Cognito stamps the token with the client_id. AgentCore validates the JWT’s client_id claim, so only tokens issued for one of the permitted application clients can invoke the runtime.

When the user logs into our frontend application with their email address and password, the frontend calls Cognito directly to verify, and receives back

Access token — proves who you are and what you're allowed to do.
ID token — contains profile info (email, name). Used by the frontend to display the username.
Refresh token — used to get new access/ID tokens when they expire (usually after 1 hour). These tokens are stored by the frontend auth library.

When we send a request to the agent, the frontend attaches the access token as a bearer token

POST /invocations
Authorization: Bearer eyJraWQi...
Body: {"prompt": "Give me a briefing"}

This is the JWT token that gets validated by AgentCore Runtime.

Returning Memory records in Handler function

The following code snippet shows how we retrieve the memory records to display in the sidebar of the frontend.

@app.entrypoint
async def invoke(payload: Dict[str, Any], context: Any = None):
    message = payload.get("prompt", payload.get("message", ""))

    # Derive actor_id from the JWT 'sub' claim (source of truth)
    actor_id = _extract_sub_from_jwt(context) or payload.get("user_id", "default-user")

    # Sanitize actor_id for AgentCore Memory
    actor_id = re.sub(r"[^a-zA-Z0-9\-_/]", "_", actor_id)

    # Retrieve memory records to include in the stream
    memory_used = get_memory_records(actor_id, message)

The @app.entrypoint decorator registers a function as the handler for POST requests to /invocations. AgentCore Runtime calls this handler function when a client invokes the agent. Our handler function is an async generator, which means that it automatically streams the response as Server-Sent Events (SSE) delivered to the client in real-time (more around this in the next blog post).

Within the handler, we get the message that has been sent in the payload. We then extract the user's identity from the JWT token that Cognito issued. One of the claims in the JWT token is the sub or subject, which is the unique user ID assigned by Cognito to a user when they first register. We know that the JWT token has been cryptographically signed by Cognito and validated by AgentCore Runtime before it reaches the handler function. We assign this sub value to be the actor_id. We apply some regex to the actual value to ensure it has no characters in it that are not supported.

We then call our get_memory_records function. This function calls the AgentCore retrieve memory records API to search the long-term memory for facts and preferences relevant to the promt that has just been passed in. We retrieve the 5 highest scoring results from the vector search and store them in a records array, which is streamed back to the frontend to be displayed in the sidebar.

def get_memory_records(actor_id: str, prompt: str) -> List[Dict[str, Any]]:
    """Retrieve long-term memory records relevant to the user's prompt.

    Searches both the facts and preferences namespaces and returns
    the records the agent would have seen for this invocation.
    """
    if not MEMORY_ID:
        return []

    try:
        client = boto3.client("bedrock-agentcore", region_name=REGION)
        records = []

        for namespace in [
            f"users/{actor_id}/facts",
            f"users/{actor_id}/preferences",
        ]:
            try:
                response = client.retrieve_memory_records(
                    memoryId=MEMORY_ID,
                    namespace=namespace,
                    searchCriteria={
                        "searchQuery": prompt,
                        "topK": 5,
                    },
                    maxResults=5,
                )
                for r in response.get("memoryRecordSummaries", []):
                    records.append({
                        "memoryRecordId": r.get("memoryRecordId", ""),
                        "text": r.get("content", {}).get("text", ""),
                        "score": r.get("score"),
                        "memoryStrategyId": r.get("memoryStrategyId", ""),
                        "namespaces": r.get("namespaces", []),
                    })
            except Exception as exc:
                logger.warning("Failed to retrieve from %s: %s", namespace, exc)

        return records
    except Exception as exc:
        logger.warning("Failed to retrieve memory records: %s", exc)
        return []

We can see an example of the sidebar in the frontend below:

Setting up Memory with Strands

Both short-term and long-term memory are handled for us automatically through the AgentCore Memory session manager integration for Strands.

The memory ID is retrieved in a module-level constant:

MEMORY_ID = os.environ.get("MEMORY_BRIEFINGAGENTMEMORY_ID")

This reads the memory resource ID that AgentCore Runtime automatically injects as an environment variable into your container at runtime. The naming convention is: MEMORY__ID. Given the memory was given a name of "BriefingAgentMemory" in the agentcore.json file, AgentCore sets MEMORY_BRIEFINGAGENTMEMORY_ID to the actual memory resource ID (something like AWSBriefingAgent_BriefingAgentMemory-q2iBfL64BS).

The following function in our code is called on every request. A new stateless Strands Agent instance is created on each invocation, configured with the relevant session manager that loads conversation history from AgentCore Memory, tools and model settings.

def _create_agent(session_id: str, actor_id: str, gateway_tools: list = None) -> Agent:
    """Create a Strands Agent with KB retrieval, AgentCore Memory, and Gateway tools."""
    session_manager = None

    if MEMORY_ID:
        try:
            from bedrock_agentcore.memory.integrations.strands.config import (
                AgentCoreMemoryConfig,
                RetrievalConfig,
            )
            from bedrock_agentcore.memory.integrations.strands.session_manager import (
                AgentCoreMemorySessionManager,
            )

            config = AgentCoreMemoryConfig(
                memory_id=MEMORY_ID,
                session_id=session_id,
                actor_id=actor_id,
                retrieval_config={
                    f"users/{actor_id}/facts": RetrievalConfig(
                        top_k=5, relevance_score=0.5
                    ),
                    f"users/{actor_id}/preferences": RetrievalConfig(
                        top_k=5, relevance_score=0.5
                    ),
                },
            )
            session_manager = AgentCoreMemorySessionManager(
                agentcore_memory_config=config,
                region_name=REGION,
            )
        except Exception as exc:
            logger.warning("Failed to initialise memory session manager: %s", exc)

    tools = [retrieve, format_slack_message] + (gateway_tools or [])

    return Agent(
        system_prompt=_load_system_prompt(),
        model=_create_model(),
        tools=tools,
        session_manager=session_manager,
        conversation_manager=SlidingWindowConversationManager(
            window_size=20,
            should_truncate_results=True,
            per_turn=True,
        ),
        callback_handler=None,
    )

In our code, if memory has been set, then we import the AgentCoreMemorySessionManager. This session manager integrates Strands agents with AgentCore Memory, which synchronises the short-term and long-term memory capabilities. Some of its features include loading the conversation history from short-term memory during agent initialisation, and integrating with long-term memory for context injection into agent state.

Next we create a AgentCoreMemoryConfig configuration object which will be passed to the session manager telling it:

memory_id - which AgentCore Memory resource to connect to
session_id - the identifier for the conversation session
actor_id - the unique identifier for the user
retrieval_config - a dictionary mapping of namespaces to retrieval configurations. This tells the session manage to search the two namespaces for relevant long-term memories, and to get the 5 most relevant facts and user preferences

Our use of AgentCore Memory is now handled automatically by Strands Agents session manager. Before each turn, it will load recent events from the same session to populate the agent's conversation context. The short-term memory is the raw event stream. The agent will see the last 20 turns in its context window, as this has been configured with the Sliding Window Conversation Manager. After (and during) invocations of the agent, new conversation messages are automatically persisted to AgentCore Memory.

With this in place, we have now successfully added long-term memory to our agent, personalising the briefing for each user based on their preferences.

Biography

As Chief AWS Architect at IBM in the UK, I am responsible for growing the AWS capability and community within one of the fastest growing AWS consulting partners globally. This gives me the opportunity to try out the latest features in preview before they go into general availability. You'll often find me blogging about my experience, but please reach out if there are services you'd like to know more about.

Strands Agents + AgentCore Runtime - a perfect match

Matt Lewis — Wed, 20 May 2026 21:17:54 +0000

This is the third in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

The initial implementation of the AWS Briefing Agent called the AWS News Feed RSS feed on every invocation. After setting up an Amazon Bedrock Knowledge Base, the next step was to refactor the code to take advantage of an agentic framework. The decision was made to adopt Strands Agents SDK as an open source SDK that helps you build and run AI agents in just a few lines of code. In our case, switching to the Knowledge Base and adopting Strands Agents SDK helped us to reduce the number of lines of code in our implementation logic by 75%.

Using Strands Agents SDK

The core of the Strands Agents code is straightforward and shown in the code snippet below:

from strands import Agent
from strands.models import BedrockModel
from strands.agent.conversation_manager import SlidingWindowConversationManager
from strands_tools import retrieve
from agent.tools.slack_formatter.tool import format_slack_message

model = BedrockModel(
    guardrail_id=GUARDRAIL_ID,
    guardrail_version=GUARDRAIL_VERSION,
    guardrail_trace="enabled",
)

agent = Agent(
    system_prompt=_load_system_prompt(),
    model=model,
    tools=[retrieve, format_slack_message] + gateway_tools,
    session_manager=session_manager,
    conversation_manager=SlidingWindowConversationManager(
        window_size=20,
        should_truncate_results=True,
        per_turn=True,
    ),
    callback_handler=None,
)

result = agent(message)

We start by importing a number of classes and functions from two packages (strands-agents and strands-agents-tools) and one local module. Agent is the core class for the agent itself, BedrockModel is the model provider, SlidingWindowConversationManager controls how conversation history is trimmed, and retrieve is a pre-built tool that is used to query a Bedrock Knowledge Base. The format_slack_message is a local custom tool within this project - a Python function decorated with the @tool annotation.

We instantiate the BedrockModel() without specifying a model_id. At this point, Strands uses its default model, which is current Claude Sonnet on Bedrock. We include details of a Bedrock Guardrail when we instantiate the model, purely to demonstrate the use of guardrails which we cover this later in the blog post.

Finally, we create the agent by wiring together its core components.

Deploy to Amazon Bedrock AgentCore Runtime

The AgentCore Runtime Python SDK provides a lightweight wrapper that helps to deploy your agent function as HTTP services

# Import the runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp

# Initialise the app
app = BedrockAgentCoreApp()

# Decorate the function
@app.entrypoint
def invoke(payload: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Entry point for AgentCore Runtime."""
    message = payload.get("prompt", payload.get("message", ""))
    ...
    return response

BedrockAgentCoreApp wraps your function in an HTTP server that listens om port 8080 with two endpoints:

/invocations - a POST endpoint for agent interactions. This gets invoked when customers call the InvokeAgentRuntime action with the payload in JSON format
/ping - a GET endpoint for health checks to verify your agent is operational and ready to handle requests

The @app.entrypoint decorator registers your invoke function as the handler for incoming requests. When AgentCore Runtime receives a request, it deserialises the JSON body into payload, provides a context object (with session_id, request_headers, etc.), calls your function, and serialises the returned dict back as the HTTP response.

Using the Container Build

When using the @aws/agentcore CLI and running agentcore deploy, the CLI needs to turn the Python source code into a runnable container image on AgentCore Runtime. This is controlled by the build field in the agentcore.json file. The default setting is CodeZip, in which the CLI zips up the Python source code, uploads it, and AgentCore resolves dependencies using uv --no-build. This is fast but has a hard constraint, as every dependency must have a pre-built wheel. In our code, we have a package that only ships source distributions, which required us to switch to the Container build setting. This also makes our build more production-ready.

When you run agentcore deploy with the Container build type, the CLI synthesis a CloudFormation stack that includes a CodeBuild project, an ECR repository, the AgentCore Runtime resource, and IAM roles. The CLI packages the codeLocation directory (agent/) and uploads it to S3 as the CodeBuild source artefact. CodeBuild pulls the provided Dockerfile and builds the container image. You can see all the steps in the CodeBuild project below:

After the image builds successfully, CodeBuild tags it and pushes it to the ECR repository as shown below:

The stack updates the Runtime resource to point at the new ECR image URI. AgentCore pulls the image from ECR the next time it starts a container for an invocation.

Built-In Conversation Managers

In the Strands Agents SDK, the user messages and agent responses are all added to the context. As the conversation grows within a session, this starting having a material impact on response times. We modified the default SlidingWindowConversationManager manager:

reducing the windowSize from the default of 40 to 20. This sets the maximum number of messages to keep
setting the per_turn parameter to false. This runs the sliding window before every model call within the same invocation, rather than waiting until after the agent loop completes.

This reduced the average response time from around 80 seconds down to 15 seconds.

Adding Bedrock Guardrails

Amazon Bedrock Guardrails are designed to help you safely build and deploy responsible generative AI applications with confidence. We decided to include a guardrail in the architecture, to understand where it fits in and what it can provide.

The guardrail itself was defined in CDK with content filters (sexual, violence, hate, insults, misconduct and prompt attack), a topic policy (deny off-topic sports questions), and a managed profanity word list:

# ----------------------------------------------------------------
# Bedrock Guardrail — content safety for the agent
# ----------------------------------------------------------------
guardrail = bedrock.CfnGuardrail(
    self,
    "BriefingAgentGuardrail",
    name="briefing-agent-guardrail",
    description="Content safety guardrail for the AWS Briefing Agent",
    blocked_input_messaging="I'm sorry, I can't process that request. Please rephrase your question about AWS announcements.",
    blocked_outputs_messaging="I'm sorry, I can't provide that response. Let me try a different approach.",
    content_policy_config=bedrock.CfnGuardrail.ContentPolicyConfigProperty(
        filters_config=[
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="SEXUAL",
                input_strength="HIGH",
                output_strength="HIGH",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="VIOLENCE",
                input_strength="HIGH",
                output_strength="HIGH",
            ),
            # HATE, INSULTS, MISCONDUCT, PROMPT_ATTACK
        ],
    ),
    topic_policy_config=bedrock.CfnGuardrail.TopicPolicyConfigProperty(
        topics_config=[
            bedrock.CfnGuardrail.TopicConfigProperty(
                name="Sports",
                definition="Questions about sports scores, match results, player transfers, league standings, fixtures, or any sporting events.",
                type="DENY",
            ),
        ],
    ),
    word_policy_config=bedrock.CfnGuardrail.WordPolicyConfigProperty(
        managed_word_lists_config=[
            bedrock.CfnGuardrail.ManagedWordsConfigProperty(
                type="PROFANITY",
            ),
        ],
    ),
)

When the agent is invoked, the request first reaches the AgentCore Runtime and runs the handler code first. The guardrail itself is only applied when the handler makes the Bedrock inference call. Bedrock evaluates the input before running the model inference, and then inspects the output before returning it. We did encounter some interesting behaviour when implementing the guardrail.

IAM Permission Gap

The first invocation after adding the guardrail failed with:

AccessDeniedException: User is not authorized to perform: bedrock:ApplyGuardrail
on resource: arn:aws:bedrock:eu-west-1.xxx

The AgentCore execution role (auto-created by the @aws/agentcore-cdk construct) includes bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, but not bedrock:ApplyGuardrail. The construct doesn’t know about guardrails — they’re a Bedrock feature, not an AgentCore feature. We ended up having to use the aws iam put-role-policy CLI command to add the missing permission

Topic policies can false-positive on legitimate queries

The initial topic policy denied "questions not related to AWS services, cloud computing, or technology". The intention was that it would be easy to demonstrate, and would ensure that the user input was relevant. However, when the user asked questions such as "what are the top announcements today", the classifier ended up deciding this was a blocked topic. In the end, to demonstrate how topic policies work, we changed it to explicitly deny sporting questions.

Guardrail versions can be deleted by CDK updates

When we updated the topic policy, we changed the version description for the guardrail. The CDK stack updated the guardrail version resource, so that CloudFormation deleted version 1 and created version 2. Unfortunately, the version number is also defined in the agentcore.json file. This meant that the AgentCore Runtime container still had version 1 baked into its environment, which meant calls now failed with the following exception:

ValidationException: The guardrail identifier or version provided in the request does not exist.

In the end it was a case of having to update the version number in agentcore.json, redeploy the agent, and start a new session.

Biography

Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering

Matt Lewis — Wed, 20 May 2026 21:16:06 +0000

This is the second in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

When I started building the AWS Briefing Agent, the first version queried the AWS What's New RSS feed on every invocation. This worked in terms of showing the agent could return tailored information back to the client. However, it was costly and wasteful, with the same data fetched repeatedly, which added latency to every invocation. The RSS feed also only covers recent information, and it was likely we would want to start searching for releases that had been launched in the past 6 months or more. The next step therefore, was to separate the retrieval by the agent from the ingestion.

Amazon Bedrock Knowledge Base

One of the key design goals was to allow the agent to match a natural language query "what's new in Bedrock this week?" against a large corpus of documents to return the most semantically similar results. This is where Amazon Bedrock Knowledge Base comes into its own. It allows the agent to use RAG (Retrieval-Augmented Generation). By querying the Knowledge Base, we can retrieve relevant documents at query time, and then inject them into the prompt as context. The LLM then generates a response from this retrieved information which we know to be factual.

The python CDK code that creates the Knowledge Base is shown below:

knowledge_base = bedrock.CfnKnowledgeBase(
    self,
    "AnnouncementKnowledgeBase",
    name="aws-briefing-agent-announcements",
    ...
    knowledge_base_configuration=bedrock.CfnKnowledgeBase.KnowledgeBaseConfigurationProperty(
        type="VECTOR",
        vector_knowledge_base_configuration=bedrock.CfnKnowledgeBase.VectorKnowledgeBaseConfigurationProperty(
            embedding_model_arn=f"arn:aws:bedrock:{self.region}::foundation-model/amazon.titan-embed-text-v2:0",
        ),
    ),
    storage_configuration=bedrock.CfnKnowledgeBase.StorageConfigurationProperty(
        type="S3_VECTORS",
        s3_vectors_configuration=bedrock.CfnKnowledgeBase.S3VectorsConfigurationProperty(
            index_name="announcements",
            vector_bucket_arn=f"arn:aws:s3vectors:{self.region}:{self.account}:bucket/briefing-agent-vectors",
        ),
    ),
)

This declares the embeddings model to be used as amazon.titan-embed-text-v2:0 and the vector store as being of type S3_VECTORS. There is no code required to handle aspects such as embeddings. Instead, Bedrock manages all of this for us.

Amazon S3 Vectors

Amazon Bedrock Knowledge Bases support several vector stores. A vector store is the retrieval engine that makes RAG work. It stores documents as numerical embeddings (vectors) that are generated by an embeddings model. At query time, the user's question is embedded, and the vector store finds documents whose embeddings are closest in meaning.

The prototype uses Amazon S3 Vectors as the underlying vector store. S3 Vectors provides cost-effective, elastic, and durable vector storage at up to 90% lower costs for uploading, storing, and querying vectors than alternatives such as OpenSearch Serverless. There is no infrastructure to manage, and it still provides a sub-second query latency which is acceptable for this use case.

Scheduling the Ingestion

The ingestion pipeline is run every 6 hours using Amazon EventBridge Scheduler. This service provides capabilities such as built-in retry policies, time zone support, and dead-letter queues. The schedule triggers an AWS Lambda function that carries out the required processing. This includes:

Lists existing document hashes in S3
Fetches the AWS What’s New RSS feed (~100 announcements)
Fetches 13 AWS blog RSS feeds (aws, machine-learning, compute, security, database, containers, devops, networking, storage, infrastructure-and-automation, developer, big-data, iot)
Fetches the AWS Security Bulletins RSS feed
For each new blog post, fetches the canonical URL and extracts the full article body using a stdlib HTML parser
Parses publication dates into YYYYMMDD integers
Writes .txt and .metadata.json files per new item to S3
Triggers a Bedrock KB ingestion job

Deduplication and Incremental Writes

When the ingestion pipeline runs, most of the content in the various RSS feeds is not new. It was important to find a way to prevent re-fetching and re-writing hundreds of announcements every 6 hours.

To support this, we created an MD5 hash of the blog posts URL, truncated to 12 hex characters. This hash is used as the S3 filename. The sample code snippet is shown below:

def write_to_s3(items, existing_keys=None):
    existing = existing_keys or set()
    for item in items:
        url_hash = hashlib.md5(item["link"].encode()).hexdigest()[:12]
        if url_hash in existing:
            continue # Already in S3, skip
        # ... write doc + metadata files

At startup, get_existing_keys() lists all the .txt files in S3 and extracts the hash from each filename into a set. When processing the blog posts, the Lambda functions computes the URL hash and checks to see if it is already in the set. If it already exists, then it has been ingested in a previous run, and there is no need to re-fetch the page. If the hash does not exist, then the function fetches the page, extracts the content, and writes to S3. The hash gives a stable, deterministic filename derived from the URL. The same URL always produces the same hash.

Chunking Strategy

The chunking strategy is set on the Data Source resource in the CDK stack as shown below:

data_source = bedrock.CfnDataSource(
    self,
    "AnnouncementDataSource",
    name="aws-announcements-s3",
    knowledge_base_id=knowledge_base.attr_knowledge_base_id,
    data_source_configuration=bedrock.CfnDataSource.DataSourceConfigurationProperty(
        type="S3",
        s3_configuration=bedrock.CfnDataSource.S3DataSourceConfigurationProperty(
            bucket_arn=data_bucket.bucket_arn,
        ),
    ),
    vector_ingestion_configuration=bedrock.CfnDataSource.VectorIngestionConfigurationProperty(
        chunking_configuration=bedrock.CfnDataSource.ChunkingConfigurationProperty(
            chunking_strategy="SEMANTIC",
            semantic_chunking_configuration=bedrock.CfnDataSource.SemanticChunkingConfigurationProperty(
                breakpoint_percentile_threshold=92,
                buffer_size=1,
                max_tokens=600,
            ),
        ),
    ),
)

We utilise a SEMANTIC chunking strategy. This uses the embedding model itself to decide where to split. The following three parameters control this behaviour:

breakpoint_percentile_threshold=92 - controls the percentile threshold that will result in a split. A higher threshold requires sentences to be more distinguishable to split the document into different chunks.
max_tokens=600 - the maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
buffer_size=1 - for a given sentence, the buffer size defines the number of surrounding sentences to be added for embeddings creation. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.

Filtering by Date

One of the goals in writing the agent was that a user could ask to constrain information by how recent it is e.g. "what is new in the past 7 days?".

To help achieve this, at ingestion time for each document, we create an associated metadata.json sidecar file that attaches structured, filterable attributes to a document so the agent can narrow search results without relying only on semantic similarity. An example companion file is shown below:

{
  "metadataAttributes": {
    "published_date": 20260415,
    "service": "amazon-bedrock",
    "category": "artificial-intelligence",
    "source_type": "announcement"
  }
}

During the Knowledge Base sync, Bedrock reads this sidecar and attaches those attributes to every vector chunk generated from that document. At query time, the agent can combine semantic search with metadata filters:

"What's new in Bedrock this week?" → vector similarity for "Bedrock" + greaterThanOrEquals filter on published_date
"Show me security bulletins" → vector similarity + equals filter on source_type: "security-bulletin"
"Lambda announcements from the last month" → vector similarity + filters on both service and published_date

Without the metadata file, the agent would get the most semantically similar results regardless of date or service — so a question about "this week" might return announcements from 3 months ago that happen to be textually similar. The metadata filters let the agent constrain results to the correct time window or service before ranking by relevance.

The naming convention (.metadata.json) is a Bedrock KB convention — it automatically associates the sidecar with its parent document during ingestion. No code links them; the filename pattern is enough.

Bedrock Knowledge Base metadata supports four types: STRING, NUMBER, BOOLEAN and STRING_LIST. There is no native data type. The comparison operators (greaterThan, greaterThanOrEquals, lessThan, lessThanOrEquals) only work with NUMBER. Our original implementation stored published_date as a string ("2026-05-14"). When the agent tried to filter, we got back the following exception:

ValidationException: The filter value type provided isn't supported
for the given operation: GREATER_THAN_OR_EQUALS

The fix was to store dates as YYYYMMDD numbers (so using "20260514" instead of "2026-05-14"). We also inject today's date into the system prompt at runtime so the LLM can easily calculate relative dates.

Note that Amazon S3 Vectors has a strict 2 KB limit on filterable metadata per vector. We found the Bedrock Knowledge Base internal metadata keys (AMAZON_BEDROCK_TEXT and AMAZON_BEDROCK_METADATA) were set as filterable by default, which caused frequent ValidationException errors. The fix was mark both of these keys as non-filterable when creating the vector index:

vector_index = s3vectors.CfnIndex(
    self, "AnnouncementVectorIndex",
    index_name="announcements",
    vector_bucket_name=vector_bucket.vector_bucket_name,
    dimension=1024,  # Titan Embed Text v2
    distance_metric="cosine",
    data_type="float32",
    metadata_configuration=s3vectors.CfnIndex.MetadataConfigurationProperty(
        non_filterable_metadata_keys=[
            "AMAZON_BEDROCK_TEXT",
            "AMAZON_BEDROCK_METADATA",
        ],
    ),
)

This meant the only filterable metadata is contained in the .metadata.json fields, which are the only fields we filter on.

The next post covers how we used an agentic framework (Strands Agents SDK) in combination with AgentCore to really start bringing the briefing agent to life.

Biography

Building a Full-Stack AI Agent on Amazon Bedrock AgentCore

Matt Lewis — Wed, 20 May 2026 21:14:23 +0000

This is the first in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

Why build an agent?

The last few years have seen a rapid shift from Generative to Agentic AI. Most of us will remember our first experience with ChatGPT where we entered a prompt and got a response back. This was impressive at the time, but was reliant on a user typing a prompt and reacting to the response. We then saw the emergence of early AI agents that could break down tasks into smaller steps and execute them independently. Over the past year, this has evolved into fully autonomous multi-agent systems capable of completing complex tasks with minimal or even no human supervision.

This shift is accelerating quickly. Gartner predicts that by 2028, more than a third of all enterprise software apps will include Agentic AI, and at least 15% of day-to-day work decisions will be made autonomously by AI agents. For organisations, the question is no longer whether agents will become part of enterprise systems, but how to build them securely, reliably and operate them at scale. From an AWS perspective, Amazon Bedrock AgentCore provides a way to help enterprises achieve this goal.

I decided to build an agent utilising AgentCore and its supporting capabilities and which served a purpose ... helping me keep up to date with all the latest announcements from AWS. This agent brings together Memory, Observability, Gateway, Identity, Evaluations and Registry alongside AgentCore Runtime. It allows the agent to personalise briefings just for me from 13 different RSS feeds including What's New, Blog Posts and Security Bulletins. I can get a daily update, as well as automatically post any briefings I'm really interested in to a Slack channel. And I learnt a lot in the process. This blog series covers my experience in building out this agent.

Why AgentCore Runtime?

Amazon Bedrock AgentCore is an AWS service that has been designed specifically for the task of hosting agents. A common saying I keep on hearing is that Bedrock AgentCore is to agentic applications what AWS Lambda is to event driven applications.

At the heart is AgentCore Runtime, which provides the secure runtime for executing the agent code. AgentCore Runtime provides session-based isolation, where every session is assigned a dedicated Firecracker microVM with isolated CPU, memory and filesystem resources (the same lightweight virtualisation technology that underpins AWS Lambda and AWS Fargate). When the session finishes, the LLM's state information is copied to long-term memory and the entire microVM is destroyed. There is no shared state between sessions, which prevents any cross-session data leakage.

AgentCore Runtime is framework-agnostic and supports all popular frameworks such as Strands Agents, LangGraph and CrewAI. It also works with any LLM, such as models offered by Amazon Bedrock, Anthropic Claude, Google Gemini and OpenAI or even hosted on-premises. It supports long sessions up to 8 hours, which means it can handle complex multi-step tasks or time-consuming background processes. Unlike traditional compute services that charge for pre-allocated resources, AgentCore Runtime uses consumption-based pricing where you only pay for active CPU and memory usage. With this, I/O wait and idle time is free, and you're only charged for actual resource consumption calculated at per-second increments. The runtime automatically scales from zero to thousands of concurrent sessions on demand, with no capacity planning needed, and includes reliability features like checkpointing to recover gracefully from interruptions.

AWS Briefing Agent Architecture

A high-level architecture overview of the AWS Briefing Agent is shown below:

AWS Briefing Agent Client is a next.js static site hosted on AWS Amplify Hosting. It integrates directly with Amazon Cognito using the amazon-cognito-identity-js SDK, implementing a full sign-in, sign-up and email verification flow.
AWS Briefing Agent itself is a Python application built with the Strands Agents SDK and deployed to AgentCore Runtime as a Docker container. The @aws/agentcore CLI handles the full deployment lifecycle. When you run agentcore deploy, the CLI triggers AWS CodeBuild to build the Docker image (ARM64), pushes it to Amazon ECR, and deploys it to AgentCore Runtime.
AgentCore Memory provides persistent user knowledge across sessions using two built-in memory strategies. The SEMANTIC memory strategy extracts factual information and knowledge from conversations that have taken place e.g. that a user works with Lambda and EKS. The USER_PREFERENCE memory strategy identifies and extracts user preferences from conversations e.g. that the user prefers technical deep dives. The agent retrieves relevant memory records at the start of each invocation and injects them as context, enabling personalised briefings from the first message of a new session.
AgentCore Observability is used to instrument all Bedrock API calls, tool invocations and memory operations. This is carried out entirely by setting enableOtel: true in the runtime config and using the opentelemetry-instrument wrapper command. Spans show up in CloudWatch Transaction Search and the CloudWatch GenAI Observability dashboard is populated with the sessions and traces, and provides the ability to drill into individual invocations.
AgentCore Evaluations is configured to run online quality assessments against agent responses using built-in evaluators for Helpfulness, Goal Success Rate, and Correctness. These are shown in the front-end to give an indication on how well the agent is performing for each user.
Bedrock Knowledge Base is created and backed by Amazon S3 Vectors that stores all announcements, blog posts and security bulletins. An ingestion Lambda runs every 6 hours that writes each item as a .txt file alongside a metadata.json file to the S3 bucket, before triggering a Knowledge Base sync. The agent queries the KB via the Strands retrieve tool with metadata filters for date ranges and service names, enabling questions like "what's new in Bedrock this week?"
AgentCore Gateway exposes a managed MCP (Model Context Protocol) endpoint that the agent connects to at runtime for tool discovery. The Slack integration is defined as an OpenAPI spec pointing at the Slack chat.postMessage API, and is registered as a Gateway target. The agent discovers available tools dynamically via the MCP protocol. The Gateway handles authentication and credential injection for this integration with Slack, attaching the stored bot token as a Bearer header on outbound Slack API calls.
AgentCore Identity stores the Slack bot token as an API key credential in its token vault (encrypted at rest via Secrets Manager). When the agent calls the tool to send a briefing to Slack, AgentCore Identity retrieves the bot token and injects it into the outbound request automatically. The agent code never sees or handles the token directly.
AgentCore Registry is a governed catalog for agents, MCP servers, tools, skills, and custom resources. Teams can publish resources, control access through approval workflows, and enable both humans and AI agents to discover tools using semantic and keyword search. Once the Slack integration was working, the briefing agent and the Slack tool where registered in the AgentCore Registry. This makes the tool discoverable by other agents in the organisation.

AWS Briefing Agent in Action

We create a new user and login to the home screen for the AWS Briefing Agent front end. The first time we use the agent, we are asked to provide information about our interests and the type of briefing style we are interested in. These get added to memory, so that the agent can personalise its responses:

We can provide the details of the services we are most interested to the agent. At this point, the agent will pull back the top announcements that it has retrieved from the Knowledge Base, and display them in a briefing summary.

We have also integrated with Slack through Gateway. This means we can ask the Briefing Agent to post the details to our Slack channel:

This means that when we go to our Slack channel, we can see a new message with our briefing, alongside all the links we can click to take us to the original blog posts and announcement articles.

In the next post we cover design decisions made to ingest the data into a Bedrock Knowledge Base to support the agent

Biography

Moving from Node Groups to NodePools on Amazon EKS

Matt Lewis — Sat, 21 Feb 2026 17:38:16 +0000

Background

In November 2019, AWS introduced the concept of Amazon EKS Managed Node Groups. With this, Amazon EKS would provision and manage the underlying EC2 instances as worker nodes, as part of an EC2 Auto Scaling Group. You could create, update or terminate a node with a single operation. When updating or terminating a node, EKS would handle these operations gracefully by automatically draining nodes to ensure applications stayed available. Futher enhancements allowed for node configuration and customisation through EC2 Launch Templates and custom AMIs, alongside support for EC2 spot instances.

However, the modern trend in Kubernetes is moving away from static node groups to dynamic node provisioning with tools like Karpenter for more flexible and cost-effective infrastructure management. With Amazon EKS Auto Mode, the recommendation is no longer to create traditional node groups. Instead, you create a Karpenter NodePool that defines the compute requirements. Amazon EKS Auto Mode provides two built-in node pools - system and general-purpose - which you cannot modify, but you can enable or disable. The general-purpose node pool provides support for launching nodes for general purpose workloads. It supports only amd64 architecture and uses only on-demand EC2 capacity in the C, M or R instance families.

What happens if you want to take advantage of spot instances?
What happens if you want to take advantage of Graviton?

Let's show how you can create a node pool to do just that.

Creating a Karpenter NodePool

The complete configuration files for this post can be found in the k8s\node-pool section of the code repository here. We can create it using the following command.

$ kubectl apply -f arm-nodepool.yaml
nodepool.karpenter.sh/arm-mixed-capacity created

The start of the arm-nodepool.yaml configuration file is as follows:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: arm-mixed-capacity

This tells us we are using the NodePool API with Karpenter. This uses the nodepools.karpenter.sh CRD which is installed by default with Auto Mode. The spec element provides the contract with Karpenter. It has the following high-level structure, and we will go through each one in order.

spec:
  disruption:   # when and how nodes can be replaced
  template:     # what a node looks like
  limits:       # optional safety rails
  weight:       # optional priority

Disruption

The disruption section describes the ways in which Karpenter can disrupt and replace nodes. This is used when Karpenter wants to remove empty nodes, replace under-utilised nodes with better fitting ones, or shrink the cluster to save money.

  # Disruption settings for node lifecycle management
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m  # Wait 10 minutes before consolidating

    # Disruption budgets to control how many nodes can be disrupted
    budgets:
      # During business hours: more conservative
      - nodes: "2"
        schedule: "0 9 * * mon-fri"  # 9 AM Mon-Fri
        duration: 8h
      # Outside business hours: more aggressive
      - nodes: "10%"

The consolidationPolicy describes which types of nodes Karpenter should consider for consolidation. There are 2 options:

WhenEmptyOrUnderutilized - Karpenter will consider all nodes for consolidation and attempt to remove or replace nodes when it discovers that the node is empty or underutilised and could be changed to reduce cost
WhenEmpty - Karpenter will only consider nodes for consolidation that contain no workload pods

The consolidateAfter field is the amount of time Karpenter should wait to consolidate a node after a pod has been added or removed from the node. We set this to 10 minutes to make sure the behaviour is not too aggressive, and gives the scheduler time to stabilise.

Disruption budgets are used to control how many nodes can be disrupted. There are two rules defined in this section. The first rule states that between 09:00 and 17:00 on Monday to Friday, Karpenter may disrupt at most 2 nodes at a time. The second rule states that Karpenter may disrupt up to 10% of all nodes at any time. This will not apply between 09:00-17:00 on Monday to Friday as the first rule is more restrictive and so wins out.

Template

The template section defines the exact shape, rules and constraints of every node that Karpenter is allowed to create as part of this NodePool.

  # Node template specification
  template:
    spec:
      # Termination grace period (24 hours)
      terminationGracePeriod: 24h

      # Node requirements
      requirements:
        # ARM architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]

        # Support spot and on-demand (prefer spot for cost)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

        # ARM instance types (Graviton) - diverse selection
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            # General purpose (M7g)
            - "m7g.medium"
            - "m7g.large"
            - "m7g.xlarge"
            - "m7g.2xlarge"
            # Burstable (T4g) - cost-effective for variable workloads
            - "t4g.medium"
            - "t4g.large"
            - "t4g.xlarge"
            # Compute optimized (C7g)
            - "c7g.large"
            - "c7g.xlarge"
            # Memory optimized (R7g)
            - "r7g.large"
            - "r7g.xlarge"

      # Node class reference (Auto Mode creates this automatically)
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default


      # Taints (optional - for dedicated ARM workloads)
      taints:
        - key: arch
          value: arm64
          effect: NoSchedule

The terminationGracePeriod field defines the amount of time that a node can be draining before Karpenter forcibly cleans it up.

The spec.requirements section provides more details about the nodes that can be created. There are a specified here as an example.

The kubernetes.io/arch key sets out the architecture for the node. Karpenter supports amd64 and arm64 nodes. This is how we support Graviton.

The karpenter.sh/capacity-type key is analogous to EC2 puchase options. The general-purpose NodePool only supports on-demand as a value, whereas he we specify both spot and on-demand. As multiple capacity types are specified, Karpenter will prioritise spot where available, but fallback to on-demand.

Note: AWS automatically applies Amazon EC2 Reserved Instance discounts to matching running on-demand EC2 usage, regardless of how these instances were launched. This means that you will get these discounts for instances launched by Karpenter

There are a number of instance type options

key: node.kubernetes.io/instance-type
key: karpenter.k8s.aws/instance-family
key: karpenter.k8s.aws/instance-category
key: karpenter.k8s.aws/instance-generation
key: karpenter.k8s.aws/instance-capability-flex

Note: Generally, instance types should be a list and not a single value. Leaving these requirements undefined is recommended, as it maximizes choices for efficiently placing pods.

Each NodePool must reference a NodeClass. A Node Class defines infrastructure-level settings that apply to groups of nodes in your EKS cluster, including network configuration, storage settings, and resource tagging. When you need to customize how EKS Auto Mode provisions and configures EC2 instances beyond the default settings, creating a Node Class gives you precise control over critical infrastructure parameters. For example, you can specify private subnet placement for enhanced security, configure instance ephemeral storage for performance-sensitive workloads, or apply custom tagging for cost allocation. In this case, we just reference the default Auto Mode NodeClass.

There is also an example shown on how to apply a taint to a NodePool. When a taint is applied to a NodePool, Karpenter will only place pods on the nodes that explicitly tolerate the taint. In the example, Karpenter will only place a workload on the node that explicitly states that it supports the ARM architecture.

# Toleration for the taint (if you added one)
tolerations:
- key: arch
    operator: Equal
    value: arm64
    effect: NoSchedule

Limits

The limits section is used to constrain the total size of the NodePool. The limits that are set prevent Karpenter from creating new instances, once they have been exceeded. This is done to prevent runaway costs.

  # Limits for this node pool
  limits:
    cpu: "1000"
    memory: 1000Gi

Weight

The weight field controls prioritisation when Karpenter has multiple NodePools to choose from for scheduling a pod. When multiple NodePools can satisfy the requirements for a pod, Karpenter will give priority to the NodePool with the highest weight. If the weight attribute is not specified, it will default to 0.

  # Weight for prioritization (higher = preferred)
  weight: 10

Karpenter will look to choose the cheapest feasible instance. It prefers NodePools where it can pack the pod more efficiently with other pending pods, and minimise wasted CPU / memory on the node.

Note: Based on the way that Karpenter performs pod batching and bin packing, it is not guaranteed that Karpenter will always choose the highest priority NodePool given specific requirements. For example, if a pod can’t be scheduled with the highest priority NodePool, it will force creation of a node using a lower priority NodePool, allowing other pods from that batch to also schedule on that node. The behavior may also occur if existing capacity is available, as the kube-scheduler will schedule the pods instead of allowing Karpenter to provision a new node.

Targetting the NodePool with a Deployment

In order to test the NodePool and show it working, we created a Deployment, which is a simple Nginx container. It can be deployed using the following command from the code repository.

$ kubectl apply -f arm-deployment.yaml
deployment.apps/arm-app created

We define a Deployment and give it the name of arm-app, which is also assigned a label of the same name.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: arm-app
  labels:
    app: arm-app

The next part of the manifest file tells Kubernetes to run 3 copies of the application, and to make sure they are labelled as app=arm-app

spec:
  replicas: 3
  selector:
    matchLabels:
      app: arm-app
  template:
    metadata:
      labels:
        app: arm-app

The manifest file then defines a nodeSelector which is a rule that states that these pods can only run on nodes with an architecture type of arm64. This matches the architecture of our NodePool. Kubernetes will only schedule the Pod onto nodes that match the labels specified.

# Node selector for ARM architecture
nodeSelector:
  kubernetes.io/arch: arm64

The next part of the manifest file moves onto affinity. Node affinity functions like the nodeSelector field but is more expressive and allows you to specify soft rules. In this case, we use preferredDuringSchedulingIgnoredDuringExecution with a weight of 100 to state that we want the Pod to run on a Spot instance, but if this cannot be scheduled, then it is fine to drop back to on-demand. This means that the Pod will not remain in a pending state if a Spot instance was not available, and so it is considered a soft rule.

# Prefer spot instances for cost savings
affinity:
nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
          - key: karpenter.sh/capacity-type
            operator: In
            values: ["spot"]

Finally, we use the containers section to say that we want to run a copy of nginx in each Pod, which half a CPU and 512 MB of memory reserved, but this can grow to a whole CPU and 1 GB or memory.

containers:
  - name: nginx
    image: nginx:latest  # Multi-arch image supports ARM64
    ports:
      - containerPort: 80
        name: http
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi

We open up a number of additional terminal windows as we apply the Deployment, to give us more information on what exactly is happening in the background.

The first command lists all the nodes in the EKS cluster including a column showing their architecture and a column showing their capacity type. We can see that a node is in the Ready state which uses ARM and is a spot instance.

$ kubectl get nodes -L kubernetes.io/arch,karpenter.sh/capacity-type
NAME                  STATUS   ROLES    AGE     VERSION               ARCH    CAPACITY-TYPE
i-0d336c28e588123ae   Ready    <none>   2m22s   v1.34.3-eks-3c60543   arm64   spot

The second command lists the NodeClaim resources. A NodeClaim is a custom resource created by Karpenter. Here we can see the generated NodeClaim name is taken from the name of the NodePool with a random suffix. We can also see it is using spot capacity, and a supported instance family type.

kubectl get nodeclaims
NAME                       TYPE         CAPACITY   ZONE         NODE                  READY   AGE
arm-mixed-capacity-zw6vh   m7g.xlarge   spot       eu-west-2a   i-0d336c28e588123ae   True    3m

The next command describes all pods that have the label app=arm-app. This is the label that gets applied as part of the deployment. It filters the output to show the pod lifecycle events. Again, we can see from this that the pod is running on an ARM-based Graviton spot instance. The event timeline shows the lifecycle involved here. The pod is bound to a compatible node, it then downloads the latest nginx image from the container registry, the container is then created, and finally started.

kubectl describe pod -l app=arm-app | grep -A 20 Events

Name:             arm-app-6674bd9849-ld6fm
Namespace:        default
Priority:         0
Service Account:  default
Node:             i-0d336c28e588123ae/10.1.3.225
Start Time:       Tue, 27 Jan 2026 11:42:06 +0000
Labels:           app=arm-app
                  pod-template-hash=6674bd9849
Annotations:      <none>
Status:           Running
IP:               10.1.3.97
--
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m14s  default-scheduler  Successfully assigned default/arm-app-6674bd9849-ld6fm to i-0d336c28e588123ae
  Normal  Pulling    2m12s  kubelet            spec.containers{nginx}: Pulling image "nginx:latest"
  Normal  Pulled     2m9s   kubelet            spec.containers{nginx}: Successfully pulled image "nginx:latest" in 3.874s (3.874s including waiting). Image size: 61200811 bytes.
  Normal  Created    2m8s   kubelet            spec.containers{nginx}: Created container: nginx
  Normal  Started    2m8s   kubelet            spec.containers{nginx}: Started container nginx

We ran a similar command to list all of the pods running, and to show that the 3 replicas as specified in the deployment are running.

kubectl get pods -l app=arm-app -w
NAME                       READY   STATUS             RESTARTS   AGE
arm-app-6674bd9849-ld6fm   0/1     Pending             0          0s
arm-app-6674bd9849-76wzg   0/1     Pending             0          0s
arm-app-6674bd9849-2vnwx   0/1     Pending             0          0s
arm-app-6674bd9849-ld6fm   0/1     ContainerCreating   0          0s
arm-app-6674bd9849-76wzg   0/1     ContainerCreating   0          0s
arm-app-6674bd9849-2vnwx   0/1     ContainerCreating   0          0s
arm-app-6674bd9849-2vnwx   0/1     Running             0          7s
arm-app-6674bd9849-ld6fm   0/1     Running             0          7s
arm-app-6674bd9849-76wzg   0/1     Running             0          7s
arm-app-6674bd9849-ld6fm   1/1     Running             0          13s
arm-app-6674bd9849-76wzg   1/1     Running             0          13s
arm-app-6674bd9849-2vnwx   1/1     Running             0          13s

Get started with the Argo CD EKS Capability

Matt Lewis — Fri, 23 Jan 2026 14:49:00 +0000

Argo CD Overview

EKS Capabilities was announced at re:Invent 2025. These are Kubernetes-native platform features managed by AWS, that provide higher-level functionality. This post looks at the Argo CD capability.

Argo CD is a GitOps based continuous deployment tool. Your git repository becomes the source of truth, and Argo CD ensures that your cluster state matches what you have defined in git. AWS have been consistently guiding their customers towards GitOps for a number of years. AWS describe GitOps as being like a reference implementation of best practice with these 4 characteristics:

Desired state expressed declaratively
Desired state is immutable and versioned
Desired state is automatically applied from source
Desired state is continuously reconciled

Argo CD is used by most of AWS customers practicing GitOps in 2025, and has really emerged as its own de facto standard. More than 45% of Kubernetes end-users reported production or planned production use of Argo CD in the 2024 CNCF survey.

Setting up Argo CD Capability via Console

The quickest way to get up and running with Argo CD is via the console. In the EKS console there is a capabilities tab that shows which managed capabilities are deployed in the cluster and which are available.

From here, we can click the Create capabilities button, and tick the checkbox against Argo CD in the Deployment section, before clicking next.

This brings up the page where we configure the selected capabilities. The first thing to do is to either select an existing role for the capability role, or select the button to create a new role.

Finally, the Argo CD managed capability integrates with AWS Identity Centre for authentication, and uses RBAC roles for authorization. You select the existing instance of IAM Identity Centre which should be pre-populated when you click the drop down. The next step is to assign RBAC roles. This involves specifying an AWS user or group from AWS Identity Centre and assigning them to an Argo CD RBAC role of "Admin", "Editor" or "Viewer".

At this point, you can review and create the managed capability for Argo CD. To understand this process in more detail, we can step through how to setup the capability using IaC.

Setting up Argo CD Capability via IaC and CLI

The code samples required for this section are contained in the CloudFormation section of this GitHub repository.

Create an IAM Capability Role

The first step is to create an IAM Capability Role. EKS Capabilities use this role to act on your behalf, running controllers in EKS. EKS Capabilities introduced a new service principle called capabilities.eks.amazonaws.com. When you create the capability role, you need to ensure the trust policy trusts this new service principle. An example of the required trust policy is shown below which is available in a file named "argocd-trust-policy.json":

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "capabilities.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}

We can create an IAM role with this trust policy using the following command:

aws iam create-role \
  --role-name ArgoCDCapabilityRole \
  --assume-role-policy-document file://argocd-trust-policy.json

Next we need to attach permissions to this Capability IAM Role based on the capability needs and what integrations are required. For example, for Argo CD you may need to give permissions to ecr or codecommit or codeconnection, depending on where your source is coming from. When choosing "Create Argo CD role" through the console, an IAM role with the AWSSecretsManagerClientReadOnlyAccess managed policy pre-selected is created. This managed policy provides read access to all secrets stored in Secrets Manager in your AWS account and is intended for getting started quickly. You have the flexibility to modify these permissions by unselecting this policy or adding different policies as needed.

We can this managed policy to the created role by using the following command:

aws iam attach-role-policy \
  --role-name ArgoCDCapabilityRole \
  --policy-arn arn:aws:iam::aws:policy/AWSSecretsManagerClientReadOnlyAccess

We can achieve the same in CloudFormation but in a single template. The AWS::IAM::Role config required is shown below, although we will use this as part of the CloudFormation template that creates the EKS Capability, so we can control it in a single stack.

  # IAM Capability Role for ArgoCD
  ArgoCDCapabilityRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "RoleName"
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: capabilities.eks.amazonaws.com
            Action:
              - sts:AssumeRole
              - sts:TagSession
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AWSSecretsManagerClientReadOnlyAccess

Create EKS Capability

Finally, we can create the EKS Capability for Argo CD itself. The Argo CD capability is integrated with AWS Identity Centre (IDC). This ensures that single sign-on is enabled for the fully managed and hosted Argo UI instance and for Argo CLI. For this, you need to ensure that your identity centre configuration is passed into the capability when creating it.

The Argo CD IDC instance identifies the name of the IAM Identity Center instance that is used by your organization to get permissions for Argo CD to access your EKS cluster. Once the capability has been created, the Argo CD IDC instance cannot be edited.

We can retrieve the IDC Instance ARN directly from the console, or by running the following command:

aws sso-admin list-instances --region eu-west-2 --query 'Instances[0].InstanceArn' --output text

We need to identify the users or groups that should be assigned to be a particular Argo CD role. The users and groups you identify from your IDC instance defines the permissions that the Argo CD capability has to access your EKS cluster. In the command below I retrieve the User ID for the user called mattlewis:

aws identitystore list-users \
  --region eu-west-2 \
  --identity-store-id $(aws sso-admin list-instances --region eu-west-2 --query 'Instances[0].IdentityStoreId' --output text) \
  --query 'Users[?UserName==`mattlewis`].UserId' --output text

If I wanted to assign a Group rather than a User, then I need to retrieve a Group ID from Identity Centre. The following command retrieves the Group ID for a group called Admin:

aws identitystore list-groups \
  --region eu-west-2 \
  --identity-store-id $(aws sso-admin list-instances --region eu-west-2 --query 'Instances[0].IdentityStoreId' --output text) \
  --query 'Groups[?DisplayName==`Admin`].GroupId' \
  --output text

There is a JSON file called aws-identity-centre-configuration.json which is made available for convenience. The configuration below assigns the User ID to the Argo CD role of ADMIN, and the Group ID to the Argo CD role of VIEWER.

{
    "argoCd": {
      "awsIdc": {
        "idcInstanceArn": "REPLACE_WITH_IDC_INSTANCE_ARN",
        "idcRegion": "REPLACE_WITH_REGION"
      },
      "rbacRoleMappings": [{
        "role": "ADMIN",
        "identities": [{
          "id": "REPLACE_WITH_USER_ID",
          "type": "SSO_USER"
        }]
      },{
        "role": "VIEWER",
        "identities": [{
          "id": "REPLACE_WITH_GROUP_ID",
          "type": "SSO_GROUP"
        }]
      }
    ]}
  }

We can then create the Argo CD capability using the following AWS CLI command:

aws eks create-capability \
  --capability-name argocd-capability \
  --cluster-name eks-test-cluster\
  --type ARGOCD \
  --role-arn arn:aws:iam::{ACCOUNT_ID}:role/ArgoCDCapabilityRole \
  --delete-propagation-policy RETAIN \
  --configuration file://aws-identity-centre-configuration.json

This is equivalent to the following that can be used in a CloudFormation template.

ArgoCDCapability:
  Type: AWS::EKS::Capability
  Properties:
    CapabilityName: !Ref CapabilityName
    ClusterName: !Ref ClusterName
    Type: ARGOCD
    RoleArn: !GetAtt ArgoCDCapabilityRole.Arn
    DeletePropagationPolicy: !Ref DeletePropagationPolicy
    Configuration:
      ArgoCd:
        AwsIdc:
          IdcInstanceArn: !Ref IdentityCenterInstanceArn
          IdcRegion: !Ref IdentityCenterRegion
        RbacRoleMappings:
          - Identities:
              - Id: !Ref AdminUserId
                Type: SSO_USER
            Role: ADMIN
          - Identities:
              - Id: !Ref ViewerGroupId
                Type: SSO_GROUP
            Role: VIEWER

You can deploy the full CloudFormation stack using the following command:

aws cloudformation create-stack \
  --stack-name argocd-capability-stack \
  --template-body file://argocd-capability.yaml \
  --parameters \
    ParameterKey=ClusterName,ParameterValue=eks-test-cluster \
    ParameterKey=IdentityCenterInstanceArn,ParameterValue=arn:aws:sso:::instance/ssoins-{REPLACE} \
    ParameterKey=IdentityCenterRegion,ParameterValue=eu-west-2 \
    ParameterKey=AdminUserId,ParameterValue={REPLACE_WITH_USER_ID} \
    ParameterKey=ViewerGroupId,ParameterValue={REPLACE_WITH_GROUP_ID} \
  --capabilities CAPABILITY_NAMED_IAM \
  --region eu-west-2

Some of the properties include:

Type - This defines the type of EKS Capability to create with valid values of
- ACK - Amazon Web Services Controllers for Kubernetes (ACK), which lets you manage resources directly from Kubernetes
- ARGOCD – Argo CD for GitOps-based continuous delivery
- KRO – Kube Resource Orchestrator (KRO) for composing and managing custom Kubernetes resources
DeletePropagationPolicy - This only supported value is RETAIN, which keeps all resources managed by the capability when the capability is deleted
Configuration - This property defines the configuration settings, with the structure depending on the capability type
Role - The role under the RbacRoleMappings property defines the Argo CD role to be assigned. The value values are:
- ADMIN - Full administrative access to Argo CD
- EDITOR - Edit access to Argo CD resources
- VIEWER - Read-only access to Argo CD resources

Once the capability has been created (which will take a while), the capabilities tab for the cluster in the EKS console will provide the Argo API endpoint and a link to go to the managed hosted Argo UI:

We can click on the link to open up the managed Argo UI. At this point, we will need to click the button to LOG IN VIA SSO.

In most cases, this will automatically log you directly into the console. At this point, we can see there are no applications currently available.

We can also check the Argo CD Role Based Access Control (RCAC) assignments in the console, and make sure that it matches what we set up in the previous JSON file or in the CloudFormation template.

Argo CD Capability

Now that the Argo CD capability has been created, we can take a quick look at the same of the changes that have been made to our cluster.

Firstly, we run a command to look at the EKS Access Entries for the cluster.

aws eks list-access-entries \
  --cluster-name eks-test-cluster

EKS Access Entries are the recommended way to grant users access to the Kubernetes API. Fundamentally, it associates a set of Kubernetes permissions with an IAM identity such as an IAM Role. Running the command above generated the following output.

{
    "accessEntries": [
        "arn:aws:iam::{ACCOUNT_ID}:role/ArgoCDCapabilityRole",
        "arn:aws:iam::{ACCOUNT_ID}:role/aws-reserved/sso.amazonaws.com/eu-west-2/AWSReservedSSO_Developer_b664db2de4791f77",
        "arn:aws:iam::{ACCOUNT_ID}:role/aws-service-role/eks.amazonaws.com/AWSServiceRoleForAmazonEKS",
        "arn:aws:iam::{ACCOUNT_ID}:role/eks-test-cluster-eks-auto-20260116165710199100000002"
    ]
}

We can see that four access entries currently exist in the cluster:

The SSO Developer role entry, which is the role I assumed to create the EKS cluster
An EKS Auto Mode generated role that is used to enable Auto Mode to make authenticated Kubernetes API calls
An EKS service-linked role that is used to manage the control plane and AWS-side resources of the cluster
The ArgoCDCapabilityRole role entry, which has been created when enabling the Argo CD EKS Capability which allows the capability to authenticate to the cluster

You can see the permissions that have been automatically granted to each entry in the console.

By enabling the EKS Argo CD Capability, the following Custom Resource Definitions (CRDs) have been added to the cluster.

kubectl get crds | grep argo  

applications.argoproj.io                        2026-01-16T17:10:56Z
applicationsets.argoproj.io                     2026-01-16T17:10:56Z
appprojects.argoproj.io                         2026-01-16T17:10:57Z

Deploy a sample application

Register your EKS cluster with Argo CD

In order to deploy a sample application, the first step is to register the EKS cluster where we want to deploy an application. We do this by by creating a Kubernetes secret, and passing the label of argocd.argoproj.io/secret-type: cluster. We give the cluster a name, and this is where the mapping between the actual cluster and the ARN happens. With EKS Capabilities you only need to provide the ARN and not the Kubernetes API Server URL as with a self-managed instance. In our case, we are registering a local cluster, as it is the same one as Argo CD is running. The following manifest file is found in the code repository under k8s/argocd.

apiVersion: v1
kind: Secret
metadata:
  name: local-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  name: local-cluster
  server: arn:aws:eks:eu-west-2:{ACCOUNT_ID}:cluster/eks-test-cluster
  project: default

We then apply this to the cluster

kubectl apply -f local-cluster.yaml

Register an Argo CD Application

Now we can register an Argo CD application. In this case, we will use the guestbook example from the Argo CD project itself.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps
    targetRevision: HEAD
    path: guestbook
  destination:
    server: arn:aws:eks:eu-west-2:{ACCOUNT_ID}:cluster/eks-test-cluster
    namespace: default

This is a Kubernetes manifest file for a custom resource (an Argo CD Application). Our cluster knows how to handle this as it has installed the CRD when deploying the capability itself. This application tells Argo CD to deploy the guestbook application from the specific public Git repository into the default namespace of the EKS cluster. We can run this file using the following command.

kubectl apply -f guestbook-application.yaml

We should get back the information that the application has been created application.argoproj.io/guestbook created.

At this point we can log in to the Argo UI, and see something similar to the following:

If we go into settings, we can see there is a failed connection status with our cluster.

And if we go further and click into the application itself, we see that there are 3 errors in the application conditions.

The issue here is that Argo CD cannot build its cache of what exists in the cluster, because it does not have permission to list the cluster-scoped resource PersistentVolume. It also cannot list resource "ingressclassparams" in API group "eks.amazonaws.com" at the cluster scope to connect to the cluster.

Associate an access policy

This issue is caused because the two access policies associated with the Argo CD Capability role (AmazonEKSArgoCDPolicy and AmazonEKSArgoCDClusterPolicy) do not give the required permissions to access or mutate the cluster resources.

The best practice in this case is to determine the minimum permissions required, and add these to an access policy. However, the fastest solution is to add the AmazonEKSClusterAdminPolicy with cluster scope to the access entry, which we can do using the following command:

aws eks associate-access-policy \
  --cluster-name eks-test-cluster \
  --principal-arn arn:aws:iam::424727766526:role/ArgoCDCapabilityRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

Once we have run this command, Argo CD will now be able to connect successfully to the cluster:

And the application can be synced and is healthy.

AWS Certified Machine Learning Engineer Core Concepts

Matt Lewis — Sat, 04 Oct 2025 13:32:38 +0000

Overview

Earlier this year I passed the AWS Machine Learning Engineer - Associate exam. I spent time making sure I understood the core concepts before taking the exam, and made a lot of notes. The intent of this post is to summarise the concepts essential to pass the exam. Based on my experience, knowing these concepts will get you at least half way there. Layer on knowledge of the AWS AI Services, and focus on SageMaker and all its capabilities, and the certification will be yours.

Data Preprocessing
SageMaker Built-In Algorithms
Model Development
Evaluating Model Accuracy
Improving Model Accuracy
Additional Topics to Study
Other Study Guides

Data Preprocessing

Data preprocessing ensures that data is in the right shape and of the right quality to be used for training.

Labelling data is important for models to learn effectively, and this is where services such as Mechanical Turk and SageMaker Ground Truth come in. Mechanical Turk is an online marketplace to access an on-demand global workforce. SageMaker Ground Truth provides built-in workflows to automate data labelling, and can use your own workforce, third-party vendors from the AWS marketplace, or Mechanical Turk.

Cleaning data can include removing outliers and duplicates, replacing inaccurate or irrelevant data, and correcting missing data.

Approaches for inputting missing data include:

Mean replacement – replacing the missing values with the mean value from the rest of the column. The mean value is the average, which means it can be distorted by outliers. Therefore, the median value (which is the middle value when data is sorted) may be a better choice
KNN (K-Nearest Neighbour) – Find the ‘K’ nearest’ most similar rows and average their values. This assumes numeric data.

Balancing data (for datasets with underrepresented categories) can be achieved using one of the following methods:

Random oversampling – this method randomly duplicates samples from the minority category. For example, if you were building a fraud detection model and you had 1000 examples of normal transactions and only 50 of fraudulent transactions, you would duplicate the fraudulent transactions until you had an equal proportion
Random Undersampling – this method randomly removes samples from the overrepresented category to achieve the equal proportion. This would typically be used when you have a large dataset, or you would want to reduce the size of your dataset to make training the model quicker
Synthetic Minority Oversampling Technique (SMOTE) – this approach generates new synthetic samples of the minority category by interpolating between existing samples using nearest neighbours.

Encoding is the concept around converting data (typically categorical data where the data represents a category or group) into a numerical format that can be well understood by a model. The main types of encoding are the following:

Label Encoding – assigns each category a unique number e.g. Red=0, Green=1, Blue=2. There is no order implied by this encoding
One-Hot Encoding – creates binary columns for each category. This means if there is a category called colour there would be additional columns created for each unique value such as a column for Red and one for Blue and for Green. The value for each column would be assigned 1 if it is true, else 0.
Ordinal Encoding – this is similar to label encoding but is used when there is a ranked ordering between values in a category. For a category called ‘size’ you could map Small to 0, Medium to 1 and Large to 2.

Outliers are data points in a data set that deviate significantly from the general patterns. One way of detecting outliers in training data is to measure how many standard deviations a data point is from the mean of a dataset. This is often called a z-score or standard score. Data points that lie more than one standard deviation from the mean can be considered unusual.

Outliers can be handled in different ways:

Delete the record - if the outlier is clearly an error and there is enough training data, you can just delete that record.
Feature scaling or normalisation – this aims to transform the numeric values so that are all values are on the same scale, often between 0 and 1. This rescaling makes the values more comparable
Standardisation – is similar to normalisation but instead of scaling values from 0 to 1, it rescales the features to have a mean of 0 and standard deviation of 1
Binning – takes a continuous numerical feature and splits into a set of intervals or bins. Each value is then assigned to a bin which can cover up imprecision or uncertainty e.g. someone aged 110 could end up in a bin which is “70+”.

After data has been cleaned up and encoded, you can fine tune or create new features in your dataset through feature engineering. There are other methods such as bag-of-words and N-gram that can be used to extract key information from text data.

In machine learning, dimensionality refers to the number of features in your dataset. If you have a dataset with 3 features (age, income and height), it exists in a 3-dimensional space. The curse of dimensionality refers to the problems that arise when you have too many features. When this happens, the data becomes sparse (e.g. spread out too thinly in the feature space), and it is hard to find meaningful patterns.

There are a number of unsupervised reduction techniques that can help to distil many features into a smaller more manageable number:

Principal component analysis (PCA) – this technique retains most of the variation in the original features but reduces the overall number of features. It works by transforming features into a new set of features called principal components
K-Means – this technique uses a clustering algorithm to group similar data points into K clusters. It does not create new features but assigns a cluster label to each data point.

SageMaker Built-In Algorithms

Before you can train your model, you need to select a machine learning algorithm to use. Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly. It is worth understanding each of these algorithms at a high-level, understanding which ones are used for supervised versus unsupervised learning, and their main uses.

Linear Learner - A supervised learning algorithm suited for general classification (Logistic Regression) and regression (Linear Regression) tasks. It makes predictions based on labelled data. In simpler terms, the model is given examples where each example has some features (like size, price, or age) and an outcome (like a house price or a category). For classification tasks, the model sorts data into categories, such as whether a house is expensive or not. For regression tasks, it predicts a specific value, like the actual price of a house. It assumes linear regression and must pre-process missing values.

K-Means - An unsupervised algorithm designed for clustering or grouping data points based on their features (chosen attribute), without needing labelled data

BlazingText - A supervised algorithm used for Natural Language Processing tasks like text classification.

Seq2Seq - A supervised algorithm specifically designed for sequence-to-sequence tasks, such as predicting the next word in a sequence, making it ideal for tasks like language translation or text generation

DeepAR - A supervised algorithm used to forecast time-series predictions by using recurrent neural networks (RNN)

XGBoost - A supervised learning algorithm used for both classification and regression tasks — especially when you care about speed and performance. It is an optimized, scalable implementation of gradient boosting that builds an ensemble of decision trees in a sequential manner. Often outperforms other models in competitions (e.g., Kaggle)

Random Cut Forest - An unsupervised algorithm used to identify abnormal data points within a data set e.g. anomaly detection

Semantic Segmentation - A supervised algorithm that provides pixel-level classification but does not label objects with bounding boxes. It is typically used to classify individual pixels by tagging each pixel with a specified class

Principal Component Analysis (PCA) - An unsupervised algorithm used for dimensionality reduction

Image Classification - A supervised algorithm that is used to label entire images, not individual objects. It simply assigns a single label to an entire image, categorising it based on the predominant features. It cannot identify or count multiple objects within a single image.

Object Detection - A supervised algorithm used to identify and classify multiple objects within an image, assigning bounding boxes and confidence scores. It draws bounding boxes around detected objects and classifies them into different categories, making it very useful for tasks where you need to recognise what is in the image and determine the exact location of each object. This algorithm is well-suited for scenarios that require counting specific items, such as animals, in drone imagery, as it can distinguish between individual objects even in complex scenes.

Object2Vec - A supervised algorithm primarily used to learn vector embeddings of discrete objects. Its typically used in recommendation systems, document classification or semantic similarity tasks, not computer vision or image processing.

IP Insights - An unsupervised algorithm used to detect anomalies in IP address usage patterns. It captures associations between these IP addresses and various entities, such as user IDs or account numbers. For instance, you can use it to detect a user attempting to log into a web service from an anomalous IP address. Additionally, it helps identify accounts that create computing resources from unexpected IP addresses.

Latent Dirichlet Allocation (LDA) - An unsupervised learning technique designed to represent a collection of documents as a combination of various topics. LDA is primarily used to identify a specified number of topics within a set of text documents. The LDA algorithm is a powerful tool for text mining and natural language processing tasks. It allows companies to sift through vast amounts of textual data and discern patterns that might take time to be apparent. Since LDA is an unsupervised method, the topics are not specified up front, and the discovered topics may not necessarily match human categorisations. Instead, LDA learns the topics as a probability distribution over the words in the documents, and each document is characterised as a mixture of these topics.

Neural Topic Model (NTM) - An unsupervised algorithm used for organising documents into topics. It is just like LDA — but it's based on neural networks rather than probabilistic graphical models.

Factorization Machines - A supervised algorithm designed to handle sparse data, making it ideal for recommendation systems where user-item interactions are often sparse. It is primarily used for recommendation systems and ranking predictions.

Text Classification – TensorFlow algorithm - A supervised algorithm designed to classify text into predefined categories.

Model Development

Hyperparameters are external configuration variables used to control a training model, improve model performance and the model outcome. Hyperparameters are set before training. This can be done manually, although SageMaker offers an intelligent version of hyperparameter tuning methods based on Bayesian search theory designed to find the best model in the shortest time. Amazon SageMaker AI automatic model tuning (AMT) finds the best version of a model by running many training jobs on your dataset. Amazon SageMaker AI automatic model tuning (AMT) is also known as hyperparameter tuning and supports Hyperband, a new search strategy.

Common hyperparameters include:

Epoch – the number of times the entire training dataset is shown to the network during training. A smaller epoch value means faster training, but the model might not learn enough patterns and end up underfitting. A larger epoch value gives more opportunity to refine weights and give better convergence, but will take longer to train and may end up memorising training data and so overfitting
Learning rate – the rate at which an algorithm updates estimates. Too high a learning rate means you might overshoot the optimal solution. Too small a learning rate will take too long to find the optimal solution
Batch size – how many batch training samples are used within each batch of each epoch. Large batch sizes are faster per epoch as it will fill the GPU, but risk worse generalisation and can end up getting stuck in the wrong solution. Small batch sizes are slower per epoch but can provide better generalisation.

Note that hyperparameters are not related to inference parameters. Inference parameters are settings you can adjust during inference, that influence the response from the model. The most common are:

Temperature: Temperature is a value between 0 and 1, and it regulates the creativity of the model's responses. Use a lower temperature if you want more deterministic responses, and use a higher temperature if you want creative or different responses for the same prompt
Top K: The number of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.
Top P: The percentage of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.

Evaluating Model Accuracy

Metrics are used to measure the performance and accuracy of a machine learning model. These metrics can typically be broken down into classification metrics and regression metrics.

With classification, the goal of the model is to predict a label or class (category) for the given input. With binary classification, there are only two possible outputs (positive or negative). This is used to predict whether an image is a dog or not, or whether an email is spam or not. With multi-class classification, there are more than two possible outputs, such as predicting whether an animal is a dog, cat or cow.

With regression, the goal of the model is to predict a numerical value. This could be predicting a house or stock price, or a person's annual income given certain inputs.

Classification Metrics

The confusion matrix is a great way to help understand common classification metrics.

Recall - Recall is the percentage of positives correctly predicted. It is focused on how many of the actual positives did the model get right. You use it when you prefer to catch as many positives as possible, even if some are incorrect. It is a good metric when false negatives are costly e.g. fraud detection or cancer screening where it is better to flag more cases (even if some are wrong) than to miss a true positive

It is calculated as: Recall = TP / (TP + FN)

Precision - Precision is all around correct positives. All positive predictions include both true positives and false positives (those predicted as positive but are actually negative). This means it is a good metric when false positives are costly e.g. spam email when you don't want to mark legitimate emails as spam, or object detection in autonomous vehicles, where a false positive can induce sudden unnecessary braking.

It is calculated as: Precision = TP / (TP + FP)

A model with high precision and low recall catches few positives but is rarely wrong.

A model with high recall and low precision catches most positives but includes many false alarms.

F1 Score - The F1 Score is the harmonic mean of precision and recall. It is used when you need a balance betweeb both.

It is calculated as: F1 Score = 2 x ((Precision x Recall) / (Precision + Recall))

Accuracy - Accuracy measures overall correctness — how often the model was right, regardless of class. It considers all predictions (true positives, true negatives, false positives, and false negatives).

It is calculated as: Accuracy = TP + TN / TP + TN + FP + FN

AUC and ROC - The ROC curve is a graphical plot that helps visualise how well a binary classification model performs across different threshold values. It is a plot of true positive rate (recall) versus false positive rate and helps you see this trade off between True Positives and False Positives.

The Area under the Curve (AUC) is a single scalar value between 0 and 1 of how well the classification model can separate the positive and negative predictions. A value of 0.5 means the model performs no better than a random classifier. A value of 1.0 is a perfect classifier.

Regression Metrics

If you are using regression where you are predicting a number and not just a classification, then there are other metrics:

Mean Absolute Error (MEA) - MEA measures the average absolute difference between the predicted values and the actual values. It tells you on average, how much your models predictions are off from the true values. A lower score means a better model. It is simple to understand and is not affected by outliers. MAE is robust to outliers because it handles them linearly, not exponentially. This makes it a good choice when you don’t want a few bad predictions to dominate the error metric.

Mean Squared Error (MSE) - MSE averages the squared difference between actual and predicted values. Because it squares the values, outliers become amplified. This makes MSE more sensitive to outliers. You would choose MSE (Mean Squared Error) over MAE (Mean Absolute Error) when you want to penalize large errors more heavily and are more concerned with model performance on extreme values.

RMSE (Root Mean Square Error) - RMSE is a metric used to measure the differences between predicted values and actual values in a regression problem. It calculates the square root of the average squared differences between the predicted and actual values. A lower Root Mean Square Error value indicates better model performance. Since the errors are squared before averaging, larger errors have a bigger impact (this makes RMSE sensitive to outliers).

You would use RMSE (Root Mean Squared Error) over MSE (Mean Squared Error) when you want the error metric to be in the same units as the target variable, making it more interpretable. For example, if you're predicting house prices in dollars, RMSE is in dollars, while MSE is in squared dollars, which is less intuitive.

R Squared - R-Squared measures the square of the correlation coefficient between observed outcomes and predicted. It measures how well your regression model explains the variability of the target (dependent) variable. A score of 1 means the model explains all the variance perfectly. A score of 0 means the model explains none of the variance.

Improving Model Accuracy

Understanding model fit is important when understanding the root cause for poor model accuracy.

Two common terms that come up to describe model performance are:

Overfitting – a model that is overfitting has learned patterns in the training data that don’t generalise out to the real world. This means that it has high accuracy on the training data set, but lower accuracy on evaluation data sets
Underfitting – a model that is underfitting performs poorly on the training data and in the real world. This is because the model is unable to capture the relationship between the input examples and the target values.

Regularization techniques are intended to prevent overfitting. Common techniques include:

Dropout – this is a technique where random neurons are temporarily dropped out (e.g. ignore) during each training iteration. This means the network can’t rely too heavily on any specific neuron or connection
Early Stopping – this is a technique where you stop training the neural network before it overfits the training data. It works by monitoring validation loss and accuracy and stopping training when the model stops improving.
L1 and L2 Regularization

L1 and L2 Regularization are techniques used to prevent overfitting by penalising large model weights. In a machine learning model, a weight is a numeric parameter that connects an input feature to an output. A large weight value means the model is putting an extremely strong emphasis on that specific feature, which means the model becomes very sensitive to small changes in inputs, which can lead to overfitting. These techniques add a penalty term to the loss function to discourage large weights:

L1 Regularization (Lasso) – the penalty is the sum of the absolute value of the weights. It shrinks some weights entirely to zero to create sparse models. This is a form of feature selection (removing irrelevant features). You should use this when you suspect only a few features are important. It is computationally inefficient.
L2 Regularization (Ridge) – the penalty is the sum of the square of the weights. This shrinks the weights but does not make them zero. It helps keep the model simpler and reduces sensitivity. It is computationally efficient.

Additional Topics to Study

The two other main topic areas you need to understand are:

AWS AI Services - these are the managed AWS services that offer a simpler entry point than building your own model. You will need to a good understanding of each service and what it is used for, so you can distinguish between Amazon Lex and Amazon Polly, and between Amazon Translate and Amazon Transcribe.
Amazon SageMaker - Amazon SageMaker is a service that provides a whole host of features and capabilities you need to be aware of. You need to understand which feature you can use to detect bias; which to import, prepare and transform data; which to share curated features, and so on.

Other Study Guides

AWS Certification Page - the AWS certification home page for this exam includes the study guide and links to additional resources
AWS SkillBuilder - the AWS official learning plan which is available for free alongside an official set of practice exam questions. Additional material including longer review sections, extra questions and labs are available with a subscription.
Udemy - the certification course provided by Stephane Maarek and Frank Kane comes highly recommended
Pluralsight - this certification course by Pluralsight also includes labs. Pluralsight offer a 10 day free individual trial and monthly subscriptions which may work for some
Tutorials Dojo - this set of practice questions is a great way to get used to the style of exam in various modes

Next Gen Developer Experience with Amazon Q Developer

Matt Lewis — Thu, 08 May 2025 08:55:15 +0000

There have been massive advances in the capabilities and features supported by Amazon Q Developer over the last few months. A number of these really stood out for me:

At the beginning of March the new enhanced Amazon Q Developer CLI agent was released with the power of Claude 3.7 Sonnet step-by-step reasoning. This also gave the agent access to tools such as the AWS CLI. Read more in the blog post
At the end of April the Amazon Q Developer CLI was further enhanced with support for Model Context Protocol (MCP) to provide even more context. Read more in the blog post
At the beginning of May, Amazon Q Developer support was integrated into GitHub in preview. Read more in the blog post

With all of these improvements, I wanted to see if there was a way of bringing them together to meet a coherent use case. This use case was to:

Take a task assigned to me from a Jira Kanban board
Implement the requested functionality
Push the code up to GitHub as the source code repository
Run a check for security vulnerabilities and code quality issues
Raise a Pull Request
Move the task along on the Kanban board

The goal was to show how these tools can make life easier for a software engineer, and greatly increase their productivity. Let's see how I got on.

Step 1: Set up Jira

I am using a hosted version of Jira Cloud using the free tier provided by Atlassian. The first thing I did was to create a new Jira project that sets up a Kanban board using a software development supporting template.

Next I created a new Jira task. The task was to "create a classic snake game written in python using pygame", and I assigned it to myself. Although this is a contrived example, you could easily equate this to a new feature on an existing service.

Once created, we can see this task in to "To Do" section of the Kanban board.

I also created a new GitHub repository which is cloned to my workspace.

Step 2: Configure MCP Servers

The next step was to setup the Amazon Q CLI to access both Atlassian and GitHub. Amazon Q CLI acts as an MCP Client, and it can access MCP Servers that have been configured in a mcp.json file. This files needs to be located in ~/.aws/amazonq. You can find out more details in this blog post by Ricardo Sueiras.

I want to run these MCP Servers in a container, and without access to Docker Desktop I configure them to use Podman. My configuration is shown below.

{
  "mcpServers": {
    "github-mcp-server": {
      "command": "podman",
      "args": [
        "run",
        "--rm",
        "--interactive",
        "--env",
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "ghcr.io/github/github-mcp-server"
      ],
      "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "github_pat_xxx"}
    },
    "mcp-atlassian": {
      "command": "podman",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e", "JIRA_URL",
        "-e", "JIRA_USERNAME",
        "-e", "JIRA_API_TOKEN",
        "ghcr.io/sooperset/mcp-atlassian:latest"
      ],
      "env": {
        "JIRA_URL": "https://xxx.atlassian.net/",
        "JIRA_USERNAME": "xxx@email.com",
        "JIRA_API_TOKEN": "XXXXX"
      }
    }
  }
}

This required creating a Personal Access Token in GitHub, and an API Token in Atlassian.

Step 3: Run Amazon Q Developer CLI

After making the changes to the mcp.json configuration, I launched Amazon Q CLI from the terminal window in my Visual Studio Code IDE. You can see that the two MCP Servers have been loaded and are accessible.

I start by asking the Amazon Q CLI to “get the latest task from Jira that is assigned to me”. The Amazon Q CLI returns wanting to know more details, before it can use the configuration to retrieve information about the task.

I tell the Amazon Q CLI that I am using Jira Cloud and to search across all projects. I am then told that the Amazon Q CLI wants to run a tool provided by the mcp_atlassian MCP Server. I am prompted to either press t to always trust the tool for the session, y to allow the tool to be executed this time without trusting for the session, or n to not let the tool be executed. I will be answering y to all of these prompts.

After running the tool, the Amazon Q CLI has found the task and displays all of the details.

I ask Amazon Q CLI to help me implement the functionality, and it goes away and generates all the code required. At this point, the code is in memory, and I am asked if I want to save the code to a file in my current directory.

I tell the Amazon Q CLI to create a new branch and add the file to this new branch. After running the relevant git commands to create a new branch, the Amazon Q CLI switches to this branch, and then writes the code to a new file in this branch.

After successfully writing the code to a new branch, the Amazon Q CLI commits the changes with a message referencing the specific Jira task number that it still has in its current session context.

At this point, I could prompt to make more enhancements to the game. The Amazon Q CLI even gives suggestions of areas of the game it could improve. Instead, I just ask it to update the README file with instructions on how to play the game, and then make another commit.

The Amazon Q CLI now asks if I want to make any other changes. I'm happy with those that have been made so far, so ask it to make a pull request for these. Notice the first request to create a pull request fails. Amazon Q CLI apologises, and tries another approach, this time pushing the branch to GitHub and then creating the pull request which succeeds.

After creating the pull request, the Amazon Q CLI knows from its session context and its reasoning, that we should also update the original task in Jira. It interacts with tools in the Atlassian MCP Server to transition the task to the "In Progress" state and add a relevant comment.

Step 4: Jira Kanban Board

At this point in time, I go across to the Kanban board in Jira and can see that the task has been transitioned to "In Progress".

Clicking into the task, I can see the comment that has been added to the task. This gives details about the functionality implemented to meet the task description, alongside a working link to the open PR in GitHub.

Step 5: Code Scanning in GitHub

The final task I wanted to carry out was to run a check for security vulnerabilities and code quality issues. I have already configured my GitHub account with the Amazon Q Developer application. This means that as soon as the pull request was raised, the amazon-q-developer application automatically scanned the changes in the PR. Happily, there were no security or code quality issues found. If there were, the new application would have automatically generated code suggestions to fix the findings.

Conclusion

I can't remember the last time I tested out new features and services and genuinely felt so convinced we are now starting to see a change in how we will engineer applications in the software industry. The value of software engineering still remains, even more so when working on complex problems for which generative AI solutions do not have the corpus of knowledge to be trained on. However, this showed to me how these new capabilities can help in reducing the context switching, and the need to move between various tools, copying data between them. The next generation of developer experience is well and truly upon it, so I'd urge everyone to try it out.

Biography

As Chief AWS Architect at IBM in the UK, I am responsible for growing the AWS capability and community within one of the fastest growing AWS consulting partners globally. This often gives me the opportunity to try out the latest features in preview before they go into general availability. You'll often find me blogging about my experience, but please reach out if there are services you'd like to know more about.

Java code transformation using the Amazon Q Developer GitHub Integration

Matt Lewis — Tue, 06 May 2025 10:30:37 +0000

AWS have launched the Amazon Q Developer integration with GitHub. I was keen to try this out, and in this post, I walk through how to get started and use the integration to upgrade a Java project.

Getting Setup

The first step is to install the Amazon Q Developer application from the GitHub Marketplace found at this URL - https://github.com/apps/amazon-q-developer.

Then click on Install and select which repositories you want to allow the application to access.

You can also optionally register the application installation with your AWS account to increase your usage limits. This is a two-step process. A landing page in your AWS console allows you to authorise Amazon Q Developer to access your GitHub account.

This redirects you to GitHub to complete the authorisation process.

You are then returned the AWS Console to provide a registration name and complete the registration process.

Setup GitHub Actions

Before you can get started on running a transformation request, you need to enable GitHub Actions for the repository and ensure a runner is available online.

This involves creating a main.yml file within a .github/workflows/ folder structure. The workflow I used is shown below:

name: Q Code Transformation

on:
  push:
    branches:
      - 'Q-TRANSFORM-issue-*'

env:
   MAVEN_CLI_OPTS: >-
     -Djava.version=${{ contains(github.event.head_commit.message, 'Code transformation completed') &&  '17' || '11' }}

jobs:
  q-code-transformation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-java@v3
        with:
          java-version: ${{ contains(github.event.head_commit.message, 'Code transformation completed') && '17' || '11' }}
          distribution: 'adopt'

      - name: Build and copy dependencies
        run: |
          mvn ${{ env.MAVEN_CLI_OPTS }} clean install -U
          mvn ${{ env.MAVEN_CLI_OPTS }} verify
          mvn ${{ env.MAVEN_CLI_OPTS }} dependency:copy-dependencies -DoutputDirectory=dependencies -Dmdep.useRepositoryLayout=true -Dmdep.copyPom=true -Dmdep.addParentPoms=true

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: q-code-transformation-dependencies
          path: dependencies

This creates a workflow named Q Code Transformation that is triggered when code is pushed to a branch that matches the pattern Q-TRANSFORM-issue-*. The job itself runs on Ubuntu, and starts by checking out the code in the repository into the GitHub Action runner.

It then installs and sets up a Java installation using the AdoptOpenJDK distribution on the runner. The choice of Java version is chosen dynamically. If the commit message contains "Code transformation completed", Java version 17 is installed, else Java 11.

The script then uses maven commands to clean and rebuild the project, builds the projects and runs all tests, and finally copies all project dependencies into a specific folder. These are then uploaded as a project artifact from the runner machine into GitHub.

Triggering the Code Transformation

Triggering the Java code transformation is as simple as raising a GitHub issue, and applying the Amazon Q transform agent label to the issue.

The amazon-q-developer application then takes over, adding comments to the issue to keep you up to date. It starts by running the GitHub Actions workflow required to transform the code. You can view the progress of the runner in the Actions tab.

Once successful, your code to be transformed is uploaded, and a transformation plan created. This code transformation plan sets out the changes that the agent is initially expecting to apply.

The agent then starts upgrading the code to Java 17. Once complete, the agent updates the issue with a comment and opens a pull request.

Viewing the transformed code

The pull request opened by the agent starts off with a code transformation summary detailing the changes made during the transformation process.

One area that the transform agent has significantly improved over the past few months is in summarising these changes. Not only are these nicely laid out in a list format, but for each significant change, there are details provided why it has been made and the benefits it offers.

Security vulnerabilities and code quality issues

One benefit that the GitHub integration brings is the scanning of all pull requests for security vulnerabilities and code quality issues. The application will raise a comment in the pull request, and then highlight any findings it discovers. In my case, the application generates a high severity warning.

For each vulnerability found, a detailed description of the fix needed alongside code that can be committed is provided.

If multiple issues are discovered, you can go into the Files Changed and select to batch up all of the suggestions into a single commit.

Observations and Conclusion

There are a few areas that are worth drawing out that initially caught me out.

Supported Target Code Versions
The transform agent in the IDE currently only supports Java 17. The transform agent in the IDE has recently enabled support for Java 21 as a target code version as well as Java 17.
Code Review
The code review run on the pull request is only carried out against the changes made in the diff, and not against unchanged content in these files, or the entire repo. To do this, you will need to go back to the IDE and run the /review agent. It would be great to trigger a full code review on an entire repo through the GitHub integration
Code Suggestions for Security Vulnerabilities
Ironically, as the code review is only carried out against new changes, the vulnerability detected by the agent is against code that the agent itself created. I'm not entirely sure how I'm supposed to feel about this. In fairness, its most likely a reflection of the wider codebase and the code suggestion following the existing styling. In addition, the code suggestion made to resolve the security vulnerability which I automatically accepted failed compilation. This was detected almost straight away as it triggered another run of the GitHub Action. It turned out simpler to fix this compilation error manually and add this as a commit to the pull request. I'm not sure why this was the case, but I feel confident this will be fixed soon.

For Java transformations this now provides an alternative approach to using the agent in the IDE. Using the transform agent in the IDE feels more synchronous, as you watch the progress taking place in the Transformation Hub terminal. I really enjoyed the asynchronous nature of the GitHub integration. I can simply create a new issue, use a label to assign to the transform agent, and then wait for the email notification that it is complete. This frees up time to carry on with other value add activities.

Running the transform agent in the IDE, you are also responsible for creating a separate branch, committing the changes to this branch, and raising the PR, all of which is taken care of for you with the GitHub integration. Even better, once merged into the main branch, the full conversation history is still maintained in the closed Pull Request for transparency.

The project I used can be found here - https://github.com/mlewis7127/bicycle-licence-GH-integration. If you have a requirement to transform old Java code projects, I would definitely recommend checking out this integration, and see how it fits into your developer workflow.

Biography

Amazon Q Developer transform for .NET

Matt Lewis — Mon, 20 Jan 2025 10:19:13 +0000

Introduction

A fantastic use case for AI Coding Assistants is upgrading applications to modern and supported versions of programming languages, libraries and frameworks. All too often, engineering effort is spent building new features, whilst existing applications are left untouched, until they become unsupported with all kinds of inherent vulnerabilities.

Amazon Q Developer introduced a code transformation agent for Java when launched in preview back in November 2023. This agent has undergone multiple iterations and become more accurate and powerful since then. On 3rd December 2024 at AWS re:Invent, the public preview of new transformation capabilities for .NET, mainframe, and VMware workloads was announced.

In this blog post, I take a look at the .NET transformation capability to get a better understanding of how it works. This feature is available in the Visual Studio IDE. However, Visual Studio for Mac has been retired, which gave me an opportunity to try out the Amazon Q Developer transformation web experience.

Why port .NET Framework?

.NET Framework is the original implementation of .NET by Microsoft. It supports running websites, services, desktop apps and more, but only on Windows.

.NET (sometimes called .NET Core) is a more modern, open-source version of .NET that can run on multiple operating systems including Windows, Linux and MacOS. The reasons to continue using .NET Framework are specific and limited according to Microsoft, and relate to use cases where your application is using third-party libraries or NuGet packages or .NET Framework technologies that are not available for .NET.

Where possible you should look to utilise .NET.

Getting started with web experience

To get started with the web experience, I had to subscribe to Amazon Q Developer from your management (root) account, as it is not currently possible to use a delegated administrator account. Note that you need to be in the us-east-1 region at this point.

This takes you to a Getting started with Amazon Q page. I was prompted to switch back to my home region (eu-west-2) which then automatically connected my AWS Organization instance of IAM Identity Center to Amazon Q. At this point, I clicked the button to "subscribe" and added a user from IAM Identity Center. It is also possible to add a Group instead of an individual user or users.

This successfully created an Amazon Q Developer Pro subscription for my chosen user. At this point, I clicked on the button which took me to the Amazon Q Developer console to complete the setup.

Opening up Amazon Q Developer in the AWS Console gave the option to click on a settings button.

In the settings, I enabled the transform settings, which is required to give access to the transformation web experience.

At this point, I navigated in a browser window to https://transform.developer.q.aws.com/ and signed in using IAM Identity Centre.

Once logged in, I was presented with the option of creating my first transformation job with Amazon Q.

Running transformation for .NET

Once I asked Q to create a transformation job, I was given the choice of the type of transformation to work on. There are three options available:

Modernize .NET applications to cross-platform .NET
Migrate VMware applications to Amazon EC2
Perform mainframe modernization (z/os to AWS)

I chose the option to modernise a .NET application. Amazon Q then populated a number of details about the .NET modernisation project. I could change these details, or in this case, confirm they are correct and let Amazon Q create the job itself.

At this point I had to connect Amazon Q to my source repository for the project I want to transform. I have forked the SharpZeroLogon GitHub repository to my own profile. This is an archived repository that was a rework of the NCC Group's tool specifically for .NET Framework 3.5.

The connection is made using AWS CodeConnections. Within an AWS account, you use CodeConnections to create a connection to a third party Git-based source provider. Currently, the only supported provider is GitHub. To create a connection, you need to go to AWS CodeArtifact, click on Settings and then Connections. I am using GitHub which installs the AWS Connector for GitHub as an application in GitHub. You can configure the connector with access to only specific repositories.

To setup the Amazon Q transformation job you first specify the account number for the AWS account where the connection is configured.

You then specify the AWS CodeConnection ARN.

You then go back into the AWS console to approve this connection request.

Once the connection request has been approved, you click on Send to Q, which will allow Amazon Q to access the repositories in the connected account.

Amazon Q analyses all of the repositories it has access too, to discover which ones run a .NET Framework application that is capable of being transformed. Amazon Q Developer transformation capabilities for .NET supports porting C# code projects of the following types: console application, class library, unit tests, web API, Windows Communication Foundation (WCF) service, and business logic layers of Model View Controller (MVC) and Single Page Application (SPA). Types of jobs that Amazon Q currently cannot transform include WebUI, SQLServer and ASP.NET.

In my example, the SharpZeroLogin repository has been detected as a supported project, and I am given the option to specify a target version (.NET 8.0). I can also specify the name of the new branch that will be created or keep the default.

Note that the web experience gives you the option of carrying out a .NET transform of multiple repositories. This is something not available within the IDE, which only allows one .NET solution at a time.

Amazon Q now automatically ports the selected .NET application to the target version following a transformation plan it has created. It commits all of the transformed code to a new branch in my GitHub repository, preserving the original source code.

I can click on the Dashboard tab to monitor the progress. In this case, I am told that the application has been transformed with no issues.

I can now go to my GitHub repository and look at the new branch that has been created. I can also view the diffs to see what changes have been made. In the file below, we can see that the target framework version has been updated from "3.5" to "net8.0".

Conclusion

The goal of this blog post is to show you how to simple it is to get up and running with the new Amazon Q Developer transformation web experience. If you have existing .NET Framework applications that you want to port to .NET to gain performance improvements and cross-platform support, it is definitely worth giving this feature a go.

Amazon GuardDuty Extended Threat Detection

Matt Lewis — Mon, 02 Dec 2024 15:19:53 +0000

Introduction

I was lucky to get the opportunity to try out the new "Extended Threat Protection" feature for Amazon GuardDuty whilst in beta. With the announcement of this new feature, I wanted to share more around my experience, and the value this new feature brings. Before jumping into this, let's start by providing some background to Amazon GuardDuty and the benefits it provides, to those who may not be familiar with the service.

Why Amazon GuardDuty?

Amazon GuardDuty is a threat detection service that continuously monitors, analyses and processes AWS logs and other data sources for malicious and abnormal activity. It uses its own internal feeds, alongside other intelligence feeds from CrowdStrike and Proofpoint to detect the latest threats and attack techniques. As someone who has worked for many years in heavily regulated industries processing sensitive data sets in areas of critical national infrastructure, I have been a huge advocate of Amazon GuardDuty.

In modern cloud environments, the quantity of logs and events that is captured is enormous. When it comes to threat detection, you require real-time and accurate visibility into this data. When your workloads reside on AWS, shipping this data externally to another cloud provider or back on-premises adds significant egress costs and latency. This is why I always look to use GuardDuty, so the data can be analysed at source, and threat detection can be consumed as a managed service.

GuardDuty uses a baseline of foundational data sources, and processes these logs using independent streams of data so it does not affect existing configurations. These foundational data sources are:

AWS CloudTrail - showing a history of AWS API calls and management events.
VPC Flow Logs - showing details of IP traffic going to and from network interfaces attached to your EC2 instances.
Route 53 Resolver DNS logs - showing a history of DNS queries.

On top of this baseline, you have the option to enable protection plans, which are specialised features within GuardDuty that provide enhanced threat detection for specific AWS services, such as:

S3 Protection - helps detect risks such as data exfiltration and destruction in your S3 buckets.
EKS Protection - monitors EKS audit logs to identify potential security issues such as unauthenticated actor attempts to collect secrets or AWS credentials, and suspicious container deployments with images not commonly used in the cluster.
Runtime Monitoring - observes and analyses operating-system level, networking, and file events to help detect potential threats for EC2 instances and container workloads in EKS and ECS including Fargate.
Malware Protection for EC2 - detect the potential presence of malware by scanning the EBS volumes attached to EC2 instances.
Malware Protection for S3 - detect the potential presence or malware by scanning newly uploaded objects to selected S3 buckets.
RDS Protection - profile and monitor access activity to Aurora databases in your AWS account without impacting database performance, to detect potential threats such as high severity brute force attacks, suspicious logins, and access by known threat actors.
Lambda Protection - identifies potential security threats when an AWS Lambda gets invoked in your AWS environment by monitoring Lambda network activity logs.

Integration with AWS Services

Amazon GuardDuty is tightly integrated with other AWS services to enable fast responses.

Amazon Detective

Amazon Detective ingests GuardDuty findings and allows you to quickly analyse and investigate these events.

AWS Security Hub

AWS Security Hub is a cloud security posture management (CSPM) service. It collects findings from the security services enabled across your AWS accounts, such as intrusion detection findings from GuardDuty, vulnerability scans from Inspector, and sensitive data identification findings from Macie. It runs continuous and automated account and resource-level configuration checks against the controls in the AWS Foundational Security Best Practices standard and other supported industry best practices and standards such as NIST and PCI DSS. The screenshot below shows GuardDuty findings in Security Hub.

Amazon EventBridge

GuardDuty creates an event whenever a new finding occurs. These are routed to the default event bus in Amazon EventBridge. You can configure an EventBridge rule with a pattern that listens for GuardDuty findings in order to automatically respond to these events.

Common use cases include sending automatic alerts for high severity findings, or automating remediation (e.g. disabling a compromised access key)

Extended Threat Detection with Attack Sequences

GuardDuty Extended Threat Detection is a new feature of Amazon GuardDuty that uniquely identifies attack sequences spanning multiple AWS data sources and resources within a 24-hour time window within an AWS account.

This addresses the risk where an attack could be comprised of a number of related suspicious activities over a period of time. Each of these suspicious activities may generate their own individual finding. However, these may be of a lower severity and act as a weak signal, and so not seen as presenting a real threat. However, when these weak signals are considered together, and the sequence of these activities align to a more suspicious activity, GuardDuty will generate an attack sequence finding.

In this case, we have triggered a finding of type AttackSequence:IAM/CompromisedCredentials. We can see looking at the summary of findings that this has been given a critical severity level.

Clicking into the finding and selection "View details" brings up the overview page. This provides a compact view of the attack sequence details, including signals, MITRE tactics, and potentially impacted resources. In the screenshot below, (1) shows the signals, (2) shows the MITRE tactics, and (3) shows the indicators.

Signals displays a timeline of events that are involved in the attack sequence. Each individual signal could be an API activity or finding that GuardDuty used to detect the attack sequence. Each signal, that is a GuardDuty finding, has it's own severity level and value assigned to it. In the GuardDuty console, you can select each signal to view the associated details.

One of my favourite aspects of the new feature is the mapping of the finding to both MITRE ATT&CK(™️) tactics and techniques. This "compromised credentials" attack sequence was comprised of the following 3 MITRE ATT&CK tactics. GuardDuty uses the MITRE ATT&CK framework to add context to the entire attack sequence. The colours GuardDuty uses to specify the threat purposes used by the threat actor, align with the colours that indicate the critical, high, medium, and low findings severity level.

The indicators section shows observed data that matches the pattern of a security issue, and is the reason why this collection of signals was identified as an attack sequence. For example, the "High risk API" indicator is flagged as the cloudtrail:DeleteTrail and iam:CreateUser API calls were made, which are actions commonly used by threat actors.

I setup a rule in EventBridge to capture an attack sequence finding. A small subset of the JSON event message is shown below. This message also provides details of the associated signals.

Overall, this is a fantastic new feature in GuardDuty and I am excited to see more attack sequence detections being added over time.