<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Markus Toivakka</title>
    <description>The latest articles on DEV Community by Markus Toivakka (@markymarkus).</description>
    <link>https://dev.to/markymarkus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F694827%2Fe9f23704-4280-45f1-ba45-9bc1a02961c7.png</url>
      <title>DEV Community: Markus Toivakka</title>
      <link>https://dev.to/markymarkus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/markymarkus"/>
    <language>en</language>
    <item>
      <title>My notes on AWS IPAM</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Fri, 19 Dec 2025 05:15:08 +0000</pubDate>
      <link>https://dev.to/aws-builders/my-notes-on-aws-ipam-4m7a</link>
      <guid>https://dev.to/aws-builders/my-notes-on-aws-ipam-4m7a</guid>
      <description>&lt;p&gt;AWS IPAM helps you avoid IP address conflicts by providing a single source of truth for CIDR allocation. It also offers features such as monitoring usage and compliance across accounts and regions—but those are out of scope here.&lt;/p&gt;

&lt;p&gt;In this post, I focus on a very basic (and very common) use case: using AWS IPAM to allocate a &lt;strong&gt;non‑overlapping CIDR&lt;/strong&gt; for a newly created VPC. How hard can that be? As it turns out, not very—but there are some practical gotchas worth knowing about. This post is a collection of those lessons learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  What works well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Seamless VPC integration
&lt;/h3&gt;

&lt;p&gt;Once AWS IPAM is configured, CIDR allocation for VPCs is refreshingly simple. To appreciate this, consider some &lt;em&gt;non‑seamless&lt;/em&gt; alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A CloudFormation custom resource that triggers on VPC creation and pulls the next free CIDR from an on‑prem IPAM&lt;/li&gt;
&lt;li&gt;A mix of bash scripting in CI/CD pipelines and IaC templates&lt;/li&gt;
&lt;li&gt;Manual Excel bookkeeping combined with click‑ops (😱)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With AWS IPAM, none of that is needed. VPC creation can look as simple as this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ipv4_ipam_pool_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-ipam-pool-id"&lt;/span&gt;
  &lt;span class="nx"&gt;ipv4_netmask_length&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS IPAM automatically picks the next available &lt;code&gt;/21&lt;/code&gt; from the pool. When the VPC is deleted, the CIDR is returned to the pool (typically within ~10 minutes) and becomes available for reuse. If the account holding the VPC is closed, the VPC's allocation stays in the pool until the ~90-day post-closure period has passed.&lt;/p&gt;
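&lt;p&gt;The same allocation works outside Terraform as well. A minimal boto3 sketch (the pool ID and region are placeholders, and the argument helper is split out purely for illustration):&lt;/p&gt;

```python
def build_create_vpc_args(pool_id, netmask_length):
    """Build the create_vpc arguments for an IPAM-backed allocation."""
    return {"Ipv4IpamPoolId": pool_id, "Ipv4NetmaskLength": netmask_length}


def create_vpc_from_pool(pool_id, netmask_length=21, region="eu-west-1"):
    """Ask EC2 to carve the next free block of the given size out of the pool."""
    import boto3  # imported lazily so the argument helper stays testable offline
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc(**build_create_vpc_args(pool_id, netmask_length))
```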




&lt;h3&gt;
  
  
  IP pools
&lt;/h3&gt;

&lt;p&gt;IP pools are a core IPAM concept and work largely as expected. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a hierarchical structure (e.g. &lt;code&gt;company → department → region → environment&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use a flat structure (e.g. &lt;code&gt;all-eu-west-1-cidrs&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Combine both approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pool contains one or more provisioned CIDR blocks. When a pool runs out of free address space, you can add additional CIDRs, and AWS IPAM will automatically start using them for future allocations.&lt;/p&gt;
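&lt;p&gt;Growing a pool can be sketched with the standard library: pick the first candidate CIDR that overlaps nothing already provisioned, then hand it to &lt;code&gt;provision_ipam_pool_cidr&lt;/code&gt;. The candidate list and pool ID are assumptions for illustration:&lt;/p&gt;

```python
import ipaddress


def next_free_supernet(candidates, provisioned):
    """Return the first candidate CIDR that overlaps none of the provisioned ones."""
    prov = [ipaddress.ip_network(p) for p in provisioned]
    for cand in candidates:
        net = ipaddress.ip_network(cand)
        if not any(net.overlaps(p) for p in prov):
            return cand
    return None


def grow_pool(pool_id, candidates, provisioned, region="eu-west-1"):
    """Provision the next free candidate CIDR into an IPAM pool."""
    cidr = next_free_supernet(candidates, provisioned)
    if cidr is None:
        raise RuntimeError("No free candidate CIDR left to provision")
    import boto3  # lazy import keeps the pure CIDR logic testable offline
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.provision_ipam_pool_cidr(IpamPoolId=pool_id, Cidr=cidr)
```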




&lt;h3&gt;
  
  
  Sharing pools across accounts
&lt;/h3&gt;

&lt;p&gt;IPAM pools can be shared with other AWS accounts using AWS RAM. If your AWS Organization is structured with OUs (as it should be), this works nicely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Share production pools with production OUs&lt;/li&gt;
&lt;li&gt;Share non‑prod pools with non‑prod OUs&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Accounts that receive a shared pool can see &lt;em&gt;all&lt;/em&gt; allocations within that pool, including VPC ID, CIDR, and Account ID.&lt;/p&gt;
&lt;/blockquote&gt;
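&lt;p&gt;A hedged sketch of what the RAM share could look like in boto3 (the ARNs below are placeholders; &lt;code&gt;create_resource_share&lt;/code&gt; is the standard AWS RAM call):&lt;/p&gt;

```python
def build_share_request(share_name, pool_arn, principal_arns):
    """Assemble the AWS RAM create_resource_share request for an IPAM pool."""
    return {
        "name": share_name,
        "resourceArns": [pool_arn],
        "principals": list(principal_arns),
        "allowExternalPrincipals": False,  # keep the share inside the Organization
    }


def share_pool(share_name, pool_arn, principal_arns, region="eu-west-1"):
    """Share an IPAM pool with the given OUs or accounts via AWS RAM."""
    import boto3  # lazy import so the request builder stays testable offline
    ram = boto3.client("ram", region_name=region)
    return ram.create_resource_share(**build_share_request(share_name, pool_arn, principal_arns))
```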




&lt;h3&gt;
  
  
  Importing existing VPC CIDRs
&lt;/h3&gt;

&lt;p&gt;If you already have VPCs with manually assigned CIDRs—or CIDRs allocated via other automation—AWS IPAM can import those existing VPCs into pools.&lt;/p&gt;

&lt;p&gt;Think of this as &lt;strong&gt;syncing your current state&lt;/strong&gt; into AWS IPAM. From that point on, you can confidently allocate CIDRs for new VPCs, knowing that overlaps are prevented.&lt;/p&gt;




&lt;h2&gt;
  
  
  What doesn’t work so well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;AWS IPAM pricing is based on &lt;strong&gt;managed IP addresses per hour&lt;/strong&gt;. When integrated with AWS Organizations, you are billed for every IP address in use.&lt;/p&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; mark CIDR blocks in IPAM as ignored, in which case AWS skips managing those ranges—but any IP addresses actively in use are still billed.&lt;/p&gt;
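&lt;p&gt;A back-of-envelope estimate helps when sizing this up. The hourly rate below is an assumption for illustration only; always check the current VPC pricing page:&lt;/p&gt;

```python
def monthly_ipam_cost(active_ips, price_per_ip_hour=0.00027, hours_per_month=730):
    """Rough monthly AWS IPAM bill: managed (active) IPs x hourly rate x hours.

    The default rate is illustrative, not authoritative.
    """
    return active_ips * price_per_ip_hour * hours_per_month
```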




&lt;h2&gt;
  
  
  Missing features
&lt;/h2&gt;

&lt;p&gt;The core functionality works well, but several advanced (yet very reasonable) features are missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configurable de‑allocation grace periods:&lt;/strong&gt;&lt;br&gt;
When a VPC is deleted, its CIDR is released back to the pool after ~10 minutes. In some environments, you might want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A multi‑day grace period (e.g. 10 days)&lt;/li&gt;
&lt;li&gt;Manual approval before reuse&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;More useful pool‑level metrics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Pool utilization is reported only as a percentage. Metrics such as &lt;em&gt;"IP addresses remaining"&lt;/em&gt; would make it much easier to build meaningful alarms across pools of different sizes.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Events for allocations and de‑allocations:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
VPC CIDR allocation and release events would be extremely useful for automation and auditing. They would also make it easier to implement custom workarounds for missing features such as a de‑allocation grace period. Currently, the only option is to poll IPAM pools for changes periodically; polling is discussed further in the next section.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weighted CIDR blocks at pool level:&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
In most cases, a pool contains multiple CIDRs. Being able to control which one is used first for VPC allocations would enable more efficient CIDR usage.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
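&lt;p&gt;Until such a metric exists, an "IP addresses remaining" figure can be derived from the pool's provisioned CIDRs and the reported utilization percentage. A stdlib-only sketch:&lt;/p&gt;

```python
import ipaddress
import math


def addresses_remaining(provisioned_cidrs, utilization_pct):
    """Estimate free IPs in a pool from its CIDRs and utilization percentage."""
    total = sum(ipaddress.ip_network(c).num_addresses for c in provisioned_cidrs)
    used = math.ceil(total * utilization_pct / 100.0)
    return total - used
```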




&lt;h2&gt;
  
  
  Things to be aware of
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Regions and locale
&lt;/h3&gt;

&lt;p&gt;AWS IPAM itself is created in a &lt;strong&gt;single region&lt;/strong&gt;, but it can manage pools across multiple regions.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IPAM is created with &lt;code&gt;eu-west-1&lt;/code&gt; as its home region.&lt;/li&gt;
&lt;li&gt;Pools exist in &lt;code&gt;eu-west-1&lt;/code&gt; and &lt;code&gt;eu-north-1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In IPAM terminology, each pool has a &lt;strong&gt;locale&lt;/strong&gt; (its region).&lt;/p&gt;

&lt;h4&gt;
  
  
  UI vs API behavior
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS Console:&lt;/strong&gt;&lt;br&gt;
All pools and allocations are visible in the region where IPAM is created, regardless of pool locale. The same behaviour applies to share-recipient accounts. All pool shares are visible in RAM in the home region, but each pool can be used only in the region specified by its &lt;code&gt;locale&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS CLI / SDK:&lt;/strong&gt;&lt;br&gt;
This is where things get tricky. You must query pool allocations from the &lt;em&gt;same region as the pool’s locale&lt;/em&gt;. Pool CIDRs, on the other hand, must be queried from the &lt;em&gt;region where AWS IPAM itself is deployed&lt;/em&gt;. The result is a slightly awkward control flow: you end up juggling multiple boto3 clients, each valid only for a specific subset of API calls.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example using boto3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Works. Pool is on the same region as AWS IPAM
&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ipam_pool_cidrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpamPoolId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU-WEST-1-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ipam_pool_allocations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpamPoolId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU-WEST-1-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Does NOT work. 
&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ipam_pool_allocations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpamPoolId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU-NORTH-1-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Correct approach. Get boto3 client on the same region as pool locale.
&lt;/span&gt;&lt;span class="n"&gt;ec2_eun1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-north-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ec2_eun1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ipam_pool_allocations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpamPoolId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU-NORTH-1-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Resource discovery quirks
&lt;/h3&gt;

&lt;p&gt;AWS IPAM &lt;a href="https://docs.aws.amazon.com/vpc/latest/ipam/res-disc-work-with.html" rel="noopener noreferrer"&gt;Resource Discovery&lt;/a&gt; keeps an inventory of IP address usage across accounts, but its behavior is not always consistent.&lt;br&gt;
Overall, there is a strict separation of duties between IPAM pools and IPAM Resource Discovery. For example, you can't use Resource Discovery to query the IPAM pool from which a VPC was allocated. Conversely, VPC tags are exposed only via Resource Discovery and are not available at the IPAM pool level.&lt;/p&gt;

&lt;p&gt;Because of these inconsistencies and syncing delays, retrieving a VPC's configuration together with pool-level data requires actively polling both the IPAM pool and Resource Discovery.&lt;/p&gt;
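&lt;p&gt;A sketch of that polling: fetch allocations from the pool and tags from Resource Discovery, then join the two by resource ID. The field names follow the &lt;code&gt;get_ipam_pool_allocations&lt;/code&gt; and &lt;code&gt;get_ipam_discovered_resource_cidrs&lt;/code&gt; responses, but treat the exact shapes as assumptions to verify:&lt;/p&gt;

```python
def merge_pool_and_discovery(allocations, discovered):
    """Attach Resource Discovery tags to pool allocations, joined on ResourceId."""
    tags_by_id = {d.get("ResourceId"): d.get("ResourceTags", []) for d in discovered}
    merged = []
    for alloc in allocations:
        entry = dict(alloc)
        # VPCs not yet synced into Resource Discovery simply get no tags
        entry["ResourceTags"] = tags_by_id.get(alloc.get("ResourceId"), [])
        merged.append(entry)
    return merged
```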

&lt;p&gt;For common VPC operations IPAM behaves as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When a VPC is created using an IPAM pool, the CIDR allocation is immediately visible in the pool, but it may take several minutes before the corresponding VPC shows up in &lt;strong&gt;Resource Discovery&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When an AWS account is closed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPCs disappear from Resource Discovery almost immediately&lt;/li&gt;
&lt;li&gt;CIDR allocations remain associated with the pool&lt;/li&gt;
&lt;li&gt;After the ~90-day post-closure period, when the account is permanently deleted, the CIDRs are released back to the pool&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS IPAM solves the core problem it’s meant to solve: allocating non-overlapping CIDRs for VPCs without custom automation or human bookkeeping. However, there is still a somewhat unfinished feel, with missing events for automation and region/locale quirks that are easy to trip over. This pushes you to build workarounds for missing or incomplete features—something that ideally shouldn’t be necessary for a managed AWS service.&lt;/p&gt;

&lt;p&gt;Pricing can also become noticeable in large environments. Billing is based on managed IP addresses per hour, so organizations with many accounts and large CIDR ranges should do the math upfront.&lt;/p&gt;

&lt;p&gt;And hey, since it’s almost Christmas, here’s my wish: ditch AWS IPAM Resource Discovery altogether. Keep monitoring at the VPC/subnet level; there is no need to track ENI-level details. Attach VPC metadata (like tags) directly to pool allocations. Emit pool lifecycle and allocation events to EventBridge. Keep it simple. And please, make it free, or at least charge per VPC.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>networking</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Using AWS Identity Center (SSO) tokens to script across multiple accounts</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Sat, 11 Oct 2025 12:30:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-aws-identity-center-sso-tokens-to-script-across-multiple-accounts-352l</link>
      <guid>https://dev.to/aws-builders/using-aws-identity-center-sso-tokens-to-script-across-multiple-accounts-352l</guid>
      <description>&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt; AWS Identity Center (formerly AWS SSO) stores a short-lived &lt;code&gt;accessToken&lt;/code&gt; in a local JSON cache after you run &lt;code&gt;aws sso login&lt;/code&gt;. You can exchange that token for per-account IAM temporary credentials using the &lt;code&gt;sso.get_role_credentials&lt;/code&gt; API and then use those credentials with &lt;code&gt;boto3.Session&lt;/code&gt; to run operations across multiple accounts. This post explains how the token flow works, security implications (yes — plain text cache), and gives a hands-on Python example you can adapt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;How Identity Center tokens work (high level)&lt;/li&gt;
&lt;li&gt;Token → IAM credentials: what happens under the hood&lt;/li&gt;
&lt;li&gt;Security considerations&lt;/li&gt;
&lt;li&gt;What an attacker can do with a stolen accessToken&lt;/li&gt;
&lt;li&gt;Full example: running a ReadOnly check in all your SSO accounts using &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Hardening recommendations and alternatives&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1) How Identity Center tokens work (high level)
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;aws sso login&lt;/code&gt; the CLI performs an OIDC/OAuth flow and writes a small JSON token file to the SSO cache directory (usually &lt;code&gt;~/.aws/sso/cache/&lt;/code&gt;). The JSON contains an &lt;code&gt;accessToken&lt;/code&gt; (a Bearer token) and an &lt;code&gt;expiresAt&lt;/code&gt; timestamp.&lt;/p&gt;

&lt;p&gt;That &lt;code&gt;accessToken&lt;/code&gt; represents your authenticated Identity Center session (your user + any required MFA). It is not long-lived — the token has an expiry (1-12h, depending on your org's config).&lt;/p&gt;

&lt;p&gt;You cannot directly use that &lt;code&gt;accessToken&lt;/code&gt; like AWS credentials; instead you &lt;em&gt;exchange&lt;/em&gt; it for IAM credentials for a specific account and role via the Identity Center &lt;code&gt;sso&lt;/code&gt; API.&lt;/p&gt;

&lt;h2&gt;
  
  
  2) Token → IAM credentials: what happens under the hood
&lt;/h2&gt;

&lt;p&gt;The key API is &lt;code&gt;sso.get_role_credentials&lt;/code&gt; (available in &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sso/client/get_role_credentials.html" rel="noopener noreferrer"&gt;boto3&lt;/a&gt;). You call it with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;accountId&lt;/code&gt; — the target AWS account you want access to&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;roleName&lt;/code&gt; — the Identity Center role name assigned in that account (for example, &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt;). You can also get available roles from &lt;code&gt;sso.list_account_roles&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;accessToken&lt;/code&gt; — the cached token from &lt;code&gt;~/.aws/sso/cache/*.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;get_role_credentials&lt;/code&gt; call returns a short-lived set of credentials: &lt;code&gt;accessKeyId&lt;/code&gt;, &lt;code&gt;secretAccessKey&lt;/code&gt;, and &lt;code&gt;sessionToken&lt;/code&gt;. These are standard STS-style credentials and you use them to create a &lt;code&gt;boto3.Session&lt;/code&gt; that will act in the target account with the permissions associated to the role.&lt;/p&gt;

&lt;h2&gt;
  
  
  3) Security considerations
&lt;/h2&gt;

&lt;p&gt;Important: the JSON files under &lt;code&gt;~/.aws/sso/cache/&lt;/code&gt; are &lt;strong&gt;plain text&lt;/strong&gt; JSON. The &lt;code&gt;accessToken&lt;/code&gt; inside them is effectively a bearer token that can be exchanged for AWS credentials until it expires.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat the &lt;code&gt;accessToken&lt;/code&gt; like a password / secret bearer token.&lt;/li&gt;
&lt;li&gt;Protect the &lt;code&gt;~/.aws/sso/cache&lt;/code&gt; directory with strict permissions (e.g., &lt;code&gt;chmod 700&lt;/code&gt; on the directory and &lt;code&gt;chmod 600&lt;/code&gt; on the cache files).&lt;/li&gt;
&lt;li&gt;Monitor and rotate your sessions: use the shortest practical session duration configured in your Identity Center settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4) What an attacker can do with a stolen accessToken
&lt;/h2&gt;

&lt;p&gt;If a bad actor gains access to your SSO &lt;code&gt;accessToken&lt;/code&gt; file, they can effectively impersonate you in AWS until that token expires. Here’s what happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The attacker can use your &lt;code&gt;accessToken&lt;/code&gt; on &lt;strong&gt;their own computer&lt;/strong&gt; to call AWS SSO APIs (&lt;code&gt;ListAccounts&lt;/code&gt;, &lt;code&gt;ListAccountRoles&lt;/code&gt;, and &lt;code&gt;GetRoleCredentials&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;GetRoleCredentials&lt;/code&gt;, they can obtain temporary IAM credentials for &lt;strong&gt;every account and role&lt;/strong&gt; you have access to through Identity Center.&lt;/li&gt;
&lt;li&gt;Those temporary credentials give them the same level of access that your assigned roles do — including read or write permissions, depending on your configuration.&lt;/li&gt;
&lt;li&gt;Once they have those IAM keys, they can use standard AWS APIs (e.g., S3, EC2, IAM) from anywhere, even outside your corporate network.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the SSO token must be treated like a highly sensitive credential — it’s effectively a universal key for your Identity Center session.&lt;/p&gt;

&lt;p&gt;Even though the keys are short-lived, this can still result in data exfiltration, resource modification, or privilege escalation depending on your roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  5) Full example: running a ReadOnly check in all your SSO accounts using &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This example hardcodes the role name &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt;. It lists account IDs via SSO, then requests role credentials for &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt; in each account and performs an &lt;code&gt;sts.get_caller_identity()&lt;/code&gt; check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run a ReadOnly boto3 action in all AWS SSO accounts using a fixed role name&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="n"&gt;AWS_SSO_CACHE_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/.aws/sso/cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;FIXED_ROLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWSReadOnlyAccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# fixed role name
&lt;/span&gt;
&lt;span class="n"&gt;SSO_CLIENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sso&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_latest_sso_token&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AWS_SSO_CACHE_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No SSO cache JSON files found in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AWS_SSO_CACHE_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getmtime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accessToken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expiresAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;exp_dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Z&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+00:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp_dt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_role_credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSO_CLIENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_role_credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;accountId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;roleName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;accessToken&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;roleCredentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No roleCredentials for account &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accessKeyId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretAccessKey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_session_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sessionToken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_sso_account_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;account_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;paginator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SSO_CLIENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_paginator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_accounts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paginator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accessToken&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;acct&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accountList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;account_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acct&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accountId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;account_ids&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_latest_sso_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSO token expired — run `aws sso login` again.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;account_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list_sso_account_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;account_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No accounts discovered for this SSO session.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;acct_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;account_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_role_credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FIXED_ROLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--&amp;gt; Account &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acct_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, caller identity:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  failed to get role credentials or call STS:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Could indicate the role isn't assigned to you in that account
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  runtime error:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt; is not assigned to your Identity Center principal in a particular account, &lt;code&gt;get_role_credentials&lt;/code&gt; will fail for that account.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6) Hardening recommendations and alternatives
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File permissions&lt;/strong&gt;: make sure &lt;code&gt;~/.aws/sso/cache&lt;/code&gt; is only readable by your user (&lt;code&gt;chmod 700 ~/.aws/sso/cache &amp;amp;&amp;amp; chmod 600 ~/.aws/sso/cache/*.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid long sessions&lt;/strong&gt;: Keep AWS Identity Center &lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html" rel="noopener noreferrer"&gt;session duration&lt;/a&gt; as small as practical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit &amp;amp; monitor&lt;/strong&gt;: enable CloudTrail and monitor &lt;code&gt;GetRoleCredentials&lt;/code&gt;/&lt;code&gt;AssumeRole&lt;/code&gt; patterns for suspicious usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least privilege&lt;/strong&gt;: maybe you don’t need that &lt;code&gt;AWSAdministratorAccess&lt;/code&gt; role in every account — start with &lt;code&gt;AWSReadOnlyAccess&lt;/code&gt; and escalate only when necessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use credential helpers&lt;/strong&gt;: Consider using third-party tools for AWS credential handling, such as &lt;a href="https://github.com/fwdcloudsec/granted" rel="noopener noreferrer"&gt;granted&lt;/a&gt; or &lt;a href="https://github.com/99designs/aws-vault" rel="noopener noreferrer"&gt;aws-vault&lt;/a&gt;, to isolate and securely manage session tokens.&lt;/li&gt;
&lt;/ol&gt;
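&lt;p&gt;The first recommendation can also be scripted. Here is a minimal Python sketch (my addition, equivalent to the &lt;code&gt;chmod&lt;/code&gt; commands above) that restricts the cache to the owning user:&lt;/p&gt;

```python
# Tighten ~/.aws/sso/cache permissions (POSIX only): the equivalent of
# chmod 700 on the directory and chmod 600 on the cached JSON files.
import stat
from pathlib import Path

def tighten_sso_cache(cache_dir: Path) -> None:
    """Make the SSO token cache readable by the owning user only."""
    cache_dir.chmod(stat.S_IRWXU)  # 700
    for cached in cache_dir.glob("*.json"):
        cached.chmod(stat.S_IRUSR | stat.S_IWUSR)  # 600

# Usage: tighten_sso_cache(Path.home() / ".aws" / "sso" / "cache")
```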

</description>
      <category>aws</category>
      <category>security</category>
      <category>python</category>
    </item>
    <item>
      <title>Terraform AWS multi-region deployments: region meta-argument in Beta</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Sat, 07 Jun 2025 12:48:07 +0000</pubDate>
      <link>https://dev.to/aws-builders/terraform-aws-multi-region-deployments-region-meta-argument-in-beta-o76</link>
      <guid>https://dev.to/aws-builders/terraform-aws-multi-region-deployments-region-meta-argument-in-beta-o76</guid>
<description>&lt;p&gt;Terraform holds a solid position in the ADOPT category of &lt;a href="https://medium.com/s-group-dev/tech-radar-update-e2a56f184183" rel="noopener noreferrer"&gt;SOK Tech Radar&lt;/a&gt;, and for a good reason. Most of our teams rely on it to define and provision cloud infrastructure across AWS, Azure, and other platforms.&lt;/p&gt;

&lt;p&gt;One of Terraform’s biggest strengths lies in its modular architecture, which allows teams to use and share best practices through reusable modules.&lt;/p&gt;

&lt;p&gt;Another reason to love Terraform? The frequent updates to HCL that often make me go “wooh”! This post dives into one such update — still in beta but offering a much cleaner approach to multi-region deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Traditional Approach: Provider Aliases
&lt;/h3&gt;

&lt;p&gt;Managing AWS resources across multiple regions has traditionally required separate provider configurations for each of the regions. If you’ve ever needed to deploy something like an SNS topic in multiple regions, you probably know the spiel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  alias  = "eu_north_1"
  region = "eu-north-1"
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}

resource "aws_sns_topic" "test_eu_north_1" {
  provider = aws.eu_north_1
}

resource "aws_sns_topic" "test_eu_west_1" {
  provider = aws.eu_west_1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach works, but it’s verbose and repetitive. The real limitation becomes apparent when you try to scale: the provider argument can't be used dynamically with &lt;code&gt;for_each&lt;/code&gt;, making regional deployments an exercise in copy-pasting and configuration clutter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter the Region Meta-Argument
&lt;/h3&gt;

&lt;p&gt;The Terraform AWS Provider 6.0.0 (currently in beta) introduces a new &lt;code&gt;region&lt;/code&gt; meta-argument that overrides the provider-level region configuration. This allows you to specify the region directly at the resource level, eliminating the need for multiple provider aliases.&lt;/p&gt;

&lt;p&gt;More importantly, the &lt;code&gt;region&lt;/code&gt; meta-argument enables dynamic &lt;code&gt;for_each&lt;/code&gt; loops for regional deployments, transforming how we handle multi-region infrastructure.&lt;/p&gt;

&lt;p&gt;Here’s how you can now deploy the same SNS topics across multiple regions with significantly less boilerplate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 6.0.0-beta2" # Pin to the beta release
    }
  }
}

provider "aws" {}

locals {
  regions = ["eu-west-1", "eu-north-1"]
}

resource "aws_sns_topic" "test" {
  for_each = toset(local.regions)
  region   = each.key
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is cleaner and easier to maintain. No more juggling multiple provider aliases or duplicating resource blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;While the AWS Provider 6.0.0 is still in beta, this feature represents a significant step forward in simplifying multi-region deployments. As teams increasingly adopt multi-region architectures for resilience and compliance, having cleaner, more maintainable Terraform configurations becomes crucial. This feature is definitely worth experimenting with!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Building Bedrock Agents for AWS Account Metadata and Cost Analysis</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Sat, 04 Jan 2025 07:01:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-bedrock-agents-for-aws-account-metadata-and-cost-analysis-28c0</link>
      <guid>https://dev.to/aws-builders/building-bedrock-agents-for-aws-account-metadata-and-cost-analysis-28c0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Small language models (SLMs) often take a back seat to their larger GenAI counterparts, but I’ve been pondering how they could be used in practical, everyday scenarios like AWS account management. While it’s true that GenAI is the industry’s hype word, applying these tools to familiar problems can offer an excellent framework for learning and innovation. That’s exactly what I aim to focus on in this post.&lt;/p&gt;

&lt;p&gt;The goal? To enable the Bedrock Agent to handle queries like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Get July's costs for Bill's AWS accounts." or&lt;br&gt;
"List all prod - accounts"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s how it works: The agent will first fetch all AWS accounts where Bill is tagged as the owner using the Organizations API. It will then use the Cost Explorer API to perform cost analysis for each of those accounts. By orchestrating these steps, the Bedrock Agent acts as a practical assistant for AWS account management tasks.&lt;/p&gt;

&lt;p&gt;In the next section, we’ll take a deeper dive into both of these AWS APIs to understand how they work together to achieve our goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Organizations API
&lt;/h2&gt;

&lt;p&gt;AWS Organizations Tags offer a powerful way to structure and manage AWS accounts by attaching metadata in the form of tags. These tags can be used to categorize accounts and answer queries like, “Which department owns this account?” or “What environment does this account belong to?”&lt;/p&gt;

&lt;p&gt;By leveraging tags such as &lt;code&gt;Owner&lt;/code&gt; and &lt;code&gt;Environment&lt;/code&gt;, you can create a simple framework for managing accounts. In this POC, these tags play a central role in enabling the Bedrock Agent to fetch additional information about AWS accounts. That metadata is then translated into AWS account IDs, which is essential for the next step: cost analysis. For more details, see the &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_tagging.html" rel="noopener noreferrer"&gt;AWS Organizations Tagging&lt;/a&gt; documentation.&lt;/p&gt;
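&lt;p&gt;To make that concrete, here is a rough sketch (mine, not the code from this post's repository) of how an &lt;code&gt;Owner&lt;/code&gt; tag resolves to account IDs via the Organizations APIs. The filtering helper is pure, so it works equally well on mock data shaped like the &lt;code&gt;list_tags_for_resource&lt;/code&gt; response:&lt;/p&gt;

```python
# Resolve "accounts owned by bill" into account IDs.
def accounts_owned_by(owner, accounts_with_tags):
    """accounts_with_tags: iterable of (account_id, tags) pairs, where
    tags is the list of {'Key': ..., 'Value': ...} dicts returned by
    organizations.list_tags_for_resource."""
    return [
        acct_id
        for acct_id, tags in accounts_with_tags
        if any(t["Key"] == "Owner" and t["Value"] == owner for t in tags)
    ]

def fetch_accounts_with_tags():
    """Collect (account_id, tags) pairs from AWS Organizations."""
    import boto3  # lazy import: needs management-account credentials
    org = boto3.client("organizations")
    pairs = []
    for page in org.get_paginator("list_accounts").paginate():
        for acct in page["Accounts"]:
            tags = org.list_tags_for_resource(ResourceId=acct["Id"])["Tags"]
            pairs.append((acct["Id"], tags))
    return pairs
```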

&lt;h2&gt;
  
  
  Cost Explorer API
&lt;/h2&gt;

&lt;p&gt;For cost analysis, we’re keeping things simple. The Bedrock Agent queries the &lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html" rel="noopener noreferrer"&gt;Cost Explorer API&lt;/a&gt; for account costs over a specified period. While this method is far from comprehensive—anyone familiar with AWS cost management knows there’s much more nuance—it serves as a solid starting point.&lt;/p&gt;

&lt;p&gt;One particularly cool aspect of using a language model interface is its ability to interpret natural language into precise API parameters. For example, the Cost Explorer API expects a "Start date for filtering costs in YYYY-MM-DD format." When a query like "last month's costs" is submitted, the model intelligently converts "last month" into the correct date range and passes it as a parameter to the API. This highlights the value of combining language models with AWS APIs to streamline complex workflows.&lt;/p&gt;
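&lt;p&gt;As an illustration (my sketch, not the repository's actual action group code), the translation the model performs boils down to date arithmetic like this, and the resulting range is what ends up in the &lt;code&gt;GetCostAndUsage&lt;/code&gt; call:&lt;/p&gt;

```python
# "Last month's costs" resolved into Cost Explorer parameters.
from datetime import date

def last_month_range(today):
    """Return (start, end) of the previous calendar month as YYYY-MM-DD.
    Cost Explorer treats the End date as exclusive."""
    end = today.replace(day=1)  # first day of the current month
    if end.month == 1:
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start.isoformat(), end.isoformat()

def monthly_cost(account_id, start, end):
    import boto3  # lazy import: calling this needs AWS credentials
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": [account_id]}},
    )
    return resp["ResultsByTime"]
```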

&lt;h2&gt;
  
  
  About Foundation Models
&lt;/h2&gt;

&lt;p&gt;The first version of this solution came to life in the fall of 2024. Initially, I used Claude 3.5 Haiku and Sonnet models. While they delivered excellent performance, it felt like overkill for the small, straightforward prompts in this project. So, when Amazon introduced &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/what-is-nova.html" rel="noopener noreferrer"&gt;the Nova models&lt;/a&gt; at re:Invent 2024, I jumped at the chance to see how well they’d perform for this use case. For this demo, I opted to use the Amazon Nova Lite model, which proved to be a solid fit.&lt;/p&gt;

&lt;p&gt;However, Nova Lite's smaller size did reveal a few limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Accuracy Issues: When validating cost analysis results (e.g., "All dev accounts' costs in November") using a basic calculator, the numbers often didn’t match. Even for simpler queries like "Get account count by environment," the model sometimes returned totals that exceeded the actual number of accounts in the mock dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Token Limitations: Nova Lite has a maximum output of 5,000 tokens. For queries involving tens of accounts with metadata, the responses often exceeded this limit, causing the output to stop mid-sentence. While you can prompt the agent to continue with commands like "go on," this disrupts the workflow.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, Nova Lite's speed is a standout feature. For short prompts, the quick response time makes for a snappy and efficient experience. It also integrates seamlessly with Bedrock Agents and Lambda functions, making it an excellent choice for building lightweight solutions.&lt;/p&gt;

&lt;p&gt;For organizations with hundreds of accounts or more complex queries, however, using a larger model would make more sense, as the data volume and accuracy demands increase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;This workflow diagram demonstrates how a user's query is processed and translated into API calls.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frinbjnzvr35npf36gr6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frinbjnzvr35npf36gr6y.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;invoke_agent.py&lt;/code&gt; CLI
&lt;/h3&gt;

&lt;p&gt;To interact with the Bedrock Agent, I created a simple Python script called &lt;code&gt;invoke_agent.py&lt;/code&gt;. This script serves as a command-line interface, making it easy to submit queries to the agent. Additionally, it prints the input and output token counts for each query, offering insights into the efficiency and resource usage of the interactions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Walkthrough
&lt;/h2&gt;

&lt;p&gt;If you’d like to follow along, all source code is available in this GitHub repository: &lt;a href="https://github.com/markymarkus/bedrock-agent-accounts" rel="noopener noreferrer"&gt;https://github.com/markymarkus/bedrock-agent-accounts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To use this solution, you’ll also need access to the &lt;code&gt;amazon.nova-lite-v1:0&lt;/code&gt; Bedrock foundation model. Currently, Amazon Nova models are only available in the &lt;code&gt;us-east-1&lt;/code&gt; region, so be sure to deploy your resources there.&lt;/p&gt;
&lt;h3&gt;
  
  
  Using Mock vs. Real Data
&lt;/h3&gt;

&lt;p&gt;In my private life, I only manage a handful of accounts—and I’d rather not share their details publicly. To address this, I developed functions to generate mock account and billing data. For this POC, I worked with a simulated AWS Organization containing 30 accounts, each enriched with mock metadata and billing information to provide a realistic testing environment.&lt;/p&gt;

&lt;p&gt;By default, this solution uses mock data for account and cost queries. If you’d like to experiment with real account and cost data, you can disable the mock data by deploying the CloudFormation template with the parameter: &lt;code&gt;EnableMockData: false&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 mb s3://temp-bedrock-cf-staging

aws cloudformation package --s3-bucket temp-bedrock-cf-staging --output-template-file packaged.yaml --region us-east-1 --template-file template.yaml

aws cloudformation deploy --stack-name dev-bedrock-accounts-agent --template-file packaged.yaml --region us-east-1 --capabilities CAPABILITY_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At this point the Bedrock account agent is ready. Next, update the created agent’s ID in &lt;code&gt;invoke_agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation describe-stacks --stack-name dev-bedrock-accounts-agent --query 'Stacks[0].Outputs[?OutputKey==`AgentId`].OutputValue'

And then copy-paste the ID to `invoke_agent.py` agent_id variable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ...And ACTION!
&lt;/h2&gt;

&lt;p&gt;Finally, here is a recording of queries running against the Bedrock Agent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr03mh6i990w6715n6t7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr03mh6i990w6715n6t7.gif" alt="GIF Animation" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yr03mh6i990w6715n6t7.gif" rel="noopener noreferrer"&gt;DIRECT LINK to the image, in case dev.to doesn't allow animated GIF&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Throughout this project, several important points emerged. While seasoned AI professionals might find some of these familiar, they proved invaluable for my work and could benefit others as well:&lt;/p&gt;

&lt;h3&gt;
  
  
  LLMs Don’t Know the Current Date
&lt;/h3&gt;

&lt;p&gt;By default, Claude’s knowledge is limited to 2023, meaning prompts like "Get last month's costs" return results based on outdated information. To resolve this, I modified the invoke_agent.py script to append the current date to &lt;code&gt;promptSessionAttributes&lt;/code&gt;. &lt;/p&gt;
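&lt;p&gt;The fix looks roughly like this (a sketch; agent ID and alias handling are simplified): build a &lt;code&gt;sessionState&lt;/code&gt; carrying today's date and pass it to &lt;code&gt;invoke_agent&lt;/code&gt;:&lt;/p&gt;

```python
# Pass the current date via promptSessionAttributes so prompts like
# "last month's costs" resolve against today's date.
import uuid
from datetime import date

def date_session_state(today):
    """sessionState payload carrying the current date (pure helper)."""
    return {"promptSessionAttributes": {"currentDate": today.isoformat()}}

def ask_agent(question, agent_id, agent_alias_id):
    import boto3  # lazy import: calling this needs AWS credentials
    client = boto3.client("bedrock-agent-runtime")
    resp = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),
        inputText=question,
        sessionState=date_session_state(date.today()),
    )
    # The completion arrives as an event stream of chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in resp["completion"]
        if "chunk" in event
    )
```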

&lt;h3&gt;
  
  
  Bedrock Agents Are Token-Hungry
&lt;/h3&gt;

&lt;p&gt;It’s impressive to observe how foundation models reason through API calls, orchestrating actions like retrieving account lists and fetching costs for each account. However, each step of this reasoning process requires the model to articulate its logic in the prompt, consuming a significant number of tokens along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let the LLM Handle the Work
&lt;/h3&gt;

&lt;p&gt;In earlier iterations of the project, I required the agent to pass an exact email (e.g., "list all accounts where the owner is &lt;a href="mailto:markus_toivakka@myowndomain.fi"&gt;markus_toivakka@myowndomain.fi&lt;/a&gt;") to the Lambda function, which then filtered the accounts and returned the matching list. By shifting the filtering logic to the LLM, I enabled it to retrieve the full account list and filter it itself, allowing for more flexible queries such as "give all markus's accounts".&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This exercise gave me a clearer understanding of the key pieces of the puzzle: how AI Agents can automate tasks and how seamlessly an LLM interface can be added to an existing API. However, it also highlighted a crucial point: LLMs and AI Agents can consume a lot of data. There's an important trade-off to consider: should filtering be handled in a Python function, or should the data be ingested and processed by the LLM itself?&lt;/p&gt;

&lt;p&gt;While getting the solution described in this blog to work was fairly straightforward, optimizing it for cost efficiency and grasping the broader dimensions of data processing in AI systems presents a much more complex challenge.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rag</category>
      <category>finops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Implement mTLS on AWS ALB with Self-Signed Certificates</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Tue, 17 Sep 2024 16:34:13 +0000</pubDate>
      <link>https://dev.to/aws-builders/implement-mtls-on-aws-alb-with-self-signed-certificates-9bf</link>
      <guid>https://dev.to/aws-builders/implement-mtls-on-aws-alb-with-self-signed-certificates-9bf</guid>
<description>&lt;p&gt;In this post we'll walk through a step-by-step guide to implementing a mutual TLS (mTLS) configuration on an AWS Application Load Balancer (ALB) and verifying the setup using &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;mTLS requires both the server and the client to present certificates issued by trusted CAs. If you are building mTLS for production use, I suggest taking a look at &lt;a href="https://docs.aws.amazon.com/privateca/latest/userguide/PcaWelcome.html" rel="noopener noreferrer"&gt;AWS Private CA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll be using self-signed certificates, which are certificates generated and signed locally without the involvement of a trusted Certificate Authority (CA). This allows us to create all necessary certificates directly on your local machine and upload them to AWS services. While self-signed certificates are typically not used in production due to their lack of trust from external entities, they offer a practical way to understand how mTLS works when a client initiates a session with a server.&lt;/p&gt;

&lt;p&gt;Let's dive in and get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Generate self-signed certificates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Generate x.509v3 Configuration Files
&lt;/h3&gt;

&lt;p&gt;When working with mTLS on ALB, it's important to use &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/mutual-authentication.html" rel="noopener noreferrer"&gt;x.509 Version 3&lt;/a&gt; certificates, as they are required for proper mutual authentication. OpenSSL certificates are by default Version 1 unless you explicitly specify otherwise. To ensure compatibility with ALB, we first need to create specific configuration files, which will be referenced during the certificate creation process.&lt;/p&gt;

&lt;p&gt;Below, we'll outline the steps to create these x.509v3 configuration files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# openssl-ca.cnf
[ req ]
default_bits        = 2048
default_md          = sha256
default_keyfile     = client-private.key
prompt              = no
distinguished_name  = req_distinguished_name
x509_extensions     = v3_ca

[ req_distinguished_name ]
C                   = FI
ST                  = State
L                   = Tampere
O                   = MyOrg
OU                  = IT
CN                  = MyCertificateAuthority

[ v3_ca ]
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid:always,issuer
basicConstraints = critical, CA:true
keyUsage = critical, digitalSignature, keyCertSign, cRLSign
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# openssl-client.cnf
[ v3_req ]
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Generate x.509v3 Certificates
&lt;/h3&gt;

&lt;p&gt;In this step, we will generate the necessary certificates for both the server (ALB with Lambda function) and the client (your laptop). These certificates will enable secure communication through mutual authentication.&lt;/p&gt;

&lt;h4&gt;
  
  
  Generate Certificates:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Generate the Private Key&lt;br&gt;
The private key is used to sign the certificates.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl genrsa &lt;span class="nt"&gt;-out&lt;/span&gt; client-private.key 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the CA Certificate&lt;br&gt;
The Certificate Authority (CA) certificate will be uploaded to AWS Load Balancer's Trust Store to establish trust.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-x509&lt;/span&gt; &lt;span class="nt"&gt;-days&lt;/span&gt; 3650 &lt;span class="nt"&gt;-key&lt;/span&gt; client-private.key &lt;span class="nt"&gt;-out&lt;/span&gt; client-ca-cert.pem &lt;span class="nt"&gt;-config&lt;/span&gt; openssl-ca.cnf
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generate Client Certificates&lt;br&gt;
Now, create two client certificates. These will be signed by the CA certificate created in the previous step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a Certificate Signing Request (CSR) for each client:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-key&lt;/span&gt; client-private.key &lt;span class="nt"&gt;-out&lt;/span&gt; client1.csr &lt;span class="nt"&gt;-subj&lt;/span&gt;  &lt;span class="s2"&gt;"/C=FI/ST=State/L=Tampere/O=MyOrg/OU=IT/CN=MyClient001"&lt;/span&gt;
openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-key&lt;/span&gt; client-private.key &lt;span class="nt"&gt;-out&lt;/span&gt; client2.csr &lt;span class="nt"&gt;-subj&lt;/span&gt; 
&lt;span class="s2"&gt;"/C=FI/ST=State/L=Tampere/O=MyOrg/OU=IT/CN=MyClient002"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Sign each CSR to create the certificates:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl x509 &lt;span class="nt"&gt;-req&lt;/span&gt; &lt;span class="nt"&gt;-in&lt;/span&gt; client1.csr &lt;span class="nt"&gt;-CA&lt;/span&gt; client-ca-cert.pem &lt;span class="nt"&gt;-CAkey&lt;/span&gt; client-private.key &lt;span class="nt"&gt;-set_serial&lt;/span&gt; 01 &lt;span class="nt"&gt;-out&lt;/span&gt; client-public-1.pem &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="nt"&gt;-extensions&lt;/span&gt; v3_req &lt;span class="nt"&gt;-extfile&lt;/span&gt; openssl-client.cnf
openssl x509 &lt;span class="nt"&gt;-req&lt;/span&gt; &lt;span class="nt"&gt;-in&lt;/span&gt; client2.csr &lt;span class="nt"&gt;-CA&lt;/span&gt; client-ca-cert.pem &lt;span class="nt"&gt;-CAkey&lt;/span&gt; client-private.key &lt;span class="nt"&gt;-set_serial&lt;/span&gt; 02 &lt;span class="nt"&gt;-out&lt;/span&gt; client-public-2.pem &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="nt"&gt;-extensions&lt;/span&gt; v3_req &lt;span class="nt"&gt;-extfile&lt;/span&gt; openssl-client.cnf
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Points
&lt;/h3&gt;

&lt;p&gt;At this stage, we have successfully generated three x.509v3 certificates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CA Certificate (&lt;code&gt;client-ca-cert.pem&lt;/code&gt;): This certificate will be uploaded to AWS ALB's Trust Store to establish trust between the ALB and clients.&lt;/li&gt;
&lt;li&gt;Client Certificates (&lt;code&gt;client-public-1.pem&lt;/code&gt; and &lt;code&gt;client-public-2.pem&lt;/code&gt;): These will be used by the clients (e.g., your laptop) to authenticate with the ALB during the mutual TLS handshake.&lt;/li&gt;
&lt;/ol&gt;
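&lt;p&gt;Before moving on, it's worth sanity-checking that the signed client certificates actually chain back to the CA. A quick check with &lt;code&gt;openssl verify&lt;/code&gt;, assuming the files generated in the steps above are in your working directory:&lt;/p&gt;

```shell
# The client certificate should validate against the CA we just created
openssl verify -CAfile client-ca-cert.pem client-public-1.pem
# client-public-1.pem: OK
```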

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;
&lt;h2&gt;
  
  
  Create an Application Load Balancer configured for mTLS
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure that your AWS CLI is properly configured. You'll also need a CloudFormation template to provision the required infrastructure.&lt;/p&gt;

&lt;p&gt;You can find a ready-made CloudFormation template and the necessary resources here: &lt;a href="https://github.com/markymarkus/cloudformation/tree/master/alb_lambda_mtls" rel="noopener noreferrer"&gt;https://github.com/markymarkus/cloudformation/tree/master/alb_lambda_mtls&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Upload the CA Certificate to S3
&lt;/h3&gt;

&lt;p&gt;For mutual TLS (mTLS) authentication, ALB requires the CA certificate chain to be stored in an S3 bucket. This S3 bucket, along with the certificate object, will be referenced when the ALB's Trust Store is created. &lt;/p&gt;

&lt;p&gt;Run the following commands to create a new S3 bucket (replace &lt;code&gt;dev-trust-store-certs&lt;/code&gt; with your preferred bucket name) and copy the CA certificate into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 mb s3://dev-trust-store-certs --region eu-west-1
aws s3 cp client-ca-cert.pem s3://dev-trust-store-certs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Provision the ALB infrastructure
&lt;/h3&gt;

&lt;p&gt;Now that the CA certificate is stored in S3, we can proceed to provision the Application Load Balancer and Lambda function using a CloudFormation template.&lt;/p&gt;

&lt;p&gt;Run the following command to deploy the ALB and backing Lambda function infrastructure. Make sure to update &lt;code&gt;parameters.json&lt;/code&gt; with the necessary configuration values (e.g., VPC, certificate, and S3 bucket details).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation deploy --stack-name mtls-demo --template-file alb_lambda_mtls.yaml --parameter-overrides file://parameters.json --capabilities CAPABILITY_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;
&lt;h2&gt;
  
  
  Use cURL to Test mTLS
&lt;/h2&gt;

&lt;p&gt;The final step is to verify the mutual TLS (mTLS) handshake using cURL with the newly created ALB. We'll first attempt a standard TLS connection without client certificates, followed by a successful mTLS connection with the necessary certificates.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Test mTLS Connection Without Client Certificate
&lt;/h3&gt;

&lt;p&gt;Run the following cURL command to test a simple TLS connection to the ALB without providing a client certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://mtls-server.XXXXXXXX.fi 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will fail with the error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl: (35) Recv failure: Connection reset by peer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This failure occurs because the client did not present a certificate to prove its identity, which is required for mutual TLS authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Test mTLS with Client Certificates
&lt;/h3&gt;

&lt;p&gt;Now, let's add the client’s private key and public certificate to authenticate the client and complete the mTLS handshake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --key client-private.key --cert client-public-1.pem https://mtls-server.XXXXXXXX.fi 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, the command should succeed, and you should receive a response similar to the following, indicating that an authorized client has invoked the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello: CN=MyClient001,OU=IT,O=MyOrg,L=Tampere,ST=State,C=FI  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response from the Lambda function confirms that the client has successfully authenticated with the ALB, indicating that mTLS is functioning as expected. The response includes details from the client’s certificate, verifying that the ALB has properly validated the client's identity. &lt;br&gt;
To further test the setup, update the previous &lt;code&gt;curl&lt;/code&gt; command to use the &lt;code&gt;client-public-2.pem&lt;/code&gt; certificate. You'll notice that the response changes accordingly, reflecting the new client certificate being used for authentication.&lt;/p&gt;
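&lt;p&gt;The DN in that response should match the subject baked into the client certificate earlier. You can inspect it locally (assuming the certificate generated above is in your working directory; the exact output formatting varies by OpenSSL version):&lt;/p&gt;

```shell
# Print the subject DN that the ALB forwards to the Lambda function
openssl x509 -in client-public-1.pem -noout -subject
```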

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we walked through the process of generating x.509v3 certificates, configuring an AWS Application Load Balancer for mutual TLS (mTLS), and successfully testing the setup using cURL. By securing communication between clients and the ALB with mTLS, you ensure that both parties authenticate each other, enhancing the security of your application.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cli</category>
    </item>
    <item>
      <title>Multi-account AWS deployments with Terragrunt</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Thu, 28 Dec 2023 13:18:35 +0000</pubDate>
      <link>https://dev.to/aws-builders/multi-account-aws-deployments-with-terragrunt-4kod</link>
      <guid>https://dev.to/aws-builders/multi-account-aws-deployments-with-terragrunt-4kod</guid>
<description>&lt;p&gt;Terragrunt is a thin wrapper around Terraform that provides an extra layer for handling Terraform configurations. It makes it easier to manage remote states for your .tf code.&lt;/p&gt;

&lt;p&gt;In this blog post I'm focusing on using Terragrunt in the context of multi-account provisioning. Call it AWS account bootstrapping or a landing zone, the idea is the same: provision identical resources to multiple AWS accounts and regions. I want to do it with the least amount of copy-pasting and as dynamically as possible. &lt;/p&gt;

&lt;p&gt;I'm going to show how to use Terragrunt to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provision a resource to multiple AWS accounts.&lt;/li&gt;
&lt;li&gt;Manage all accounts' remote states in a single S3 bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm going to use the following account setup for the sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS &lt;em&gt;management account&lt;/em&gt; with AWS Organizations enabled. I'm running &lt;code&gt;terragrunt&lt;/code&gt; using short-term credentials from this account. Resources are provisioned to the member accounts, and Terraform states are stored in S3 and DynamoDB in this account.
&lt;/li&gt;
&lt;li&gt;Three &lt;em&gt;member accounts&lt;/em&gt;. Note: AWS Organizations automatically creates a &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_accounts_access.html" rel="noopener noreferrer"&gt;default role (&lt;code&gt;OrganizationAccountAccessRole&lt;/code&gt;)&lt;/a&gt; in every member account. This Organizations default role is used to provision resources with Terraform. (Note: the default role used here is an admin role. If you are considering using this in production, make sure to use a role with scoped-down access rights.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt3p36llgvyop71wtm44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt3p36llgvyop71wtm44.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Walkthrough
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;: Install &lt;a href="https://developer.hashicorp.com/terraform/install?product_intent=terraform" rel="noopener noreferrer"&gt;terraform&lt;/a&gt; and &lt;a href="https://terragrunt.gruntwork.io/docs/getting-started/install/" rel="noopener noreferrer"&gt;terragrunt&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;To get started with multi-account deployment, I'm using a very minimal Terragrunt structure. &lt;br&gt;
You can clone this project from: &lt;a href="https://github.com/markymarkus/terragrunt_aws_multi_account" rel="noopener noreferrer"&gt;https://github.com/markymarkus/terragrunt_aws_multi_account&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

├── deployment       # Terragrunt configuration files
│   ├── accounts
│   │   ├── sandbox1
│   │   │   ├── account.hcl
│   │   │   └── eu-west-1
│   │   │       ├── infra
│   │   │       │   └── terragrunt.hcl
│   │   │       └── region.hcl
│   │   ├── sandbox2
│   │   │   ├── account.hcl
│   │   │   └── eu-west-1
│   │   │       ├── infra
│   │   │       │   └── terragrunt.hcl
│   │   │       └── region.hcl
│   │   └── sandbox3
│   │       ├── account.hcl
│   │       ├── eu-north-1
│   │       │   ├── infra
│   │       │   │   └── terragrunt.hcl
│   │       │   └── region.hcl
│   │       └── eu-west-1
│   │           ├── infra
│   │           │   └── terragrunt.hcl
│   │           └── region.hcl
│   └── terragrunt.hcl
└── modules       # Terraform module for S3 - bucket
    ├── main.tf
    ├── outputs.tf
    ├── s3.tf
    └── vars.tf


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Terragrunt
&lt;/h3&gt;

&lt;p&gt;The configuration in the &lt;code&gt;/deployment&lt;/code&gt; folder defines which modules in the &lt;code&gt;/modules&lt;/code&gt; folder are deployed to which account and region. My configuration creates an S3 bucket in the &lt;code&gt;eu-west-1&lt;/code&gt; region in every sandbox account. Sandbox3 gets an additional bucket in &lt;code&gt;eu-north-1&lt;/code&gt; to show how this configuration can be extended to multiple regions. &lt;/p&gt;

&lt;p&gt;Most of the magic happens in &lt;code&gt;/deployment/terragrunt.hcl&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;generate&lt;/code&gt; block injects the Terraform provider configuration into the account modules. Variables defined in the &lt;code&gt;account.hcl&lt;/code&gt; and &lt;code&gt;region.hcl&lt;/code&gt; configuration files are used to build a dynamic &lt;code&gt;provider&lt;/code&gt; block. &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

generate "provider" {
    path    = "provider.tf"
    contents  = &amp;lt;&amp;lt;EOF
provider "aws" {
    region = "${local.aws_region}"
    assume_role {
        role_arn = "arn:aws:iam::${local.aws_account_id}:role/OrganizationAccountAccessRole"
    }
}
EOF
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
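&lt;p&gt;For reference, &lt;code&gt;local.aws_account_id&lt;/code&gt; and &lt;code&gt;local.aws_region&lt;/code&gt; come from the per-account and per-region files in the tree above. A minimal sketch of what they might contain (the account ID is a placeholder; see the linked repository for the actual files):&lt;/p&gt;

```hcl
# deployment/accounts/sandbox1/account.hcl  (account ID is a placeholder)
locals {
  account_name   = "sandbox1"
  aws_account_id = "111111111111"
}

# deployment/accounts/sandbox1/eu-west-1/region.hcl
locals {
  aws_region = "eu-west-1"
}
```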
&lt;h3&gt;
  
  
  Run
&lt;/h3&gt;

&lt;p&gt;To provision the S3 bucket from the /modules folder to all sandbox accounts in my configuration, I run the following:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

cd deployment/accounts
terragrunt run-all apply


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Terragrunt automatically runs &lt;code&gt;init&lt;/code&gt;, so it is not needed separately. &lt;/li&gt;
&lt;li&gt;If the Terraform state bucket and DynamoDB table (as defined in /deployment/terragrunt.hcl) do not exist, Terragrunt creates them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;Here is a fairly complete output from the &lt;code&gt;terragrunt&lt;/code&gt; run.&lt;/p&gt;

&lt;p&gt;Each module (here: an account/region combination) is provisioned one by one. We are not executing one &lt;strong&gt;"BIG terragrunt plan"&lt;/strong&gt; but a set of separate Terraform plans. By default, these plans do not show which AWS account is being provisioned. To make sure I understood the multi-account context correctly, I added the account ID as a &lt;a href="https://www.hashicorp.com/blog/default-tags-in-the-terraform-aws-provider" rel="noopener noreferrer"&gt;default tag&lt;/a&gt; on the provisioned resources. &lt;/p&gt;

&lt;p&gt;So, take a look:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

➜  accounts git:(master) ✗ terragrunt run-all apply                                                  &amp;lt;aws:markus-sso-master&amp;gt; &amp;lt;region:eu-west-1&amp;gt;
INFO[0000] The stack at /terragrunt_aws_multi_account/deployment/accounts will be processed in the following order for command apply:
Group 1
- Module /terragrunt_aws_multi_account/deployment/accounts/sandbox1/eu-west-1/infra
- Module /terragrunt_aws_multi_account/deployment/accounts/sandbox2/eu-west-1/infra
- Module /terragrunt_aws_multi_account/deployment/accounts/sandbox3/eu-north-1/infra
- Module /terragrunt_aws_multi_account/deployment/accounts/sandbox3/eu-west-1/infra

Are you sure you want to run 'terragrunt apply' in each folder of the stack described above? (y/n) y

Initializing the backend...
Initializing the backend...
Initializing the backend...
Initializing the backend...

Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using previously-installed hashicorp/aws v5.31.0

Terraform has been successfully initialized!

Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file

Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file

Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using previously-installed hashicorp/aws v5.31.0

Terraform has been successfully initialized!
- Using previously-installed hashicorp/aws v5.31.0

Terraform has been successfully initialized!
- Using previously-installed hashicorp/aws v5.31.0

Terraform has been successfully initialized!

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bucket will be created
  + resource "aws_s3_bucket" "bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = (known after apply)
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = "sandbox3-dev-eu-north-1"
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = true
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = {
          + "account"     = "333333333333"
          + "environment" = "dev"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + bucket_name = (known after apply)
aws_s3_bucket.bucket: Creating...

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bucket will be created
  + resource "aws_s3_bucket" "bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = (known after apply)
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = "sandbox2-dev-eu-west-1"
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = true
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = {
          + "account"     = "222222222222"
          + "environment" = "dev"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + bucket_name = (known after apply)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bucket will be created
  + resource "aws_s3_bucket" "bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = (known after apply)
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = "sandbox1-dev-eu-west-1"
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = true
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = {
          + "account"     = "111111111111"
          + "environment" = "dev"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + bucket_name = (known after apply)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bucket will be created
  + resource "aws_s3_bucket" "bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = (known after apply)
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = "sandbox3-dev-eu-west-1"
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = true
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = {
          + "account"     = "333333333333"
          + "environment" = "dev"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + bucket_name = (known after apply)
aws_s3_bucket.bucket: Creation complete after 1s [id=sandbox3-dev-eu-north-120231228125652554400000001]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

bucket_name = "sandbox3-dev-eu-north-120231228125652554400000001"
aws_s3_bucket.bucket: Creating...
aws_s3_bucket.bucket: Creating...
aws_s3_bucket.bucket: Creating...
aws_s3_bucket.bucket: Creation complete after 3s [id=sandbox1-dev-eu-west-120231228125655120500000001]
aws_s3_bucket.bucket: Creation complete after 3s [id=sandbox2-dev-eu-west-120231228125655134100000001]
aws_s3_bucket.bucket: Creation complete after 3s [id=sandbox3-dev-eu-west-120231228125655260000000001]
Releasing state lock. This may take a few moments...
Releasing state lock. This may take a few moments...

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

bucket_name = "sandbox1-dev-eu-west-120231228125655120500000001"

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

bucket_name = "sandbox2-dev-eu-west-120231228125655134100000001"
Releasing state lock. This may take a few moments...

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

bucket_name = "sandbox3-dev-eu-west-120231228125655260000000001"
➜  accounts git:(master) ✗                                                      


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After &lt;code&gt;terragrunt run-all&lt;/code&gt; finishes, every sandbox account has an S3 bucket in the specified region. &lt;/p&gt;

&lt;p&gt;Terragrunt does not print a final combined summary of how many resources were created or updated in total. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Ok, that's all! I wanted to test Terragrunt and see how it performs in a multi-account environment. Based on this trial, a few key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;(Lack of) concurrency&lt;/strong&gt;. The Terraform/Terragrunt combination handles modules (here: AWS accounts) sequentially, one after another. For a few accounts this works, but for hundreds of accounts you may want a solution with parallel deployments (AWS Control Tower, ADF, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unclear TF plan&lt;/strong&gt;. I love terraform plan. It is very precise and very clear about the changes it is going to perform. With the multi-account structure added by Terragrunt, it is not clear by default which AWS account a plan is working on. Also, running &lt;code&gt;terragrunt run-all destroy&lt;/code&gt; just warns that it is going to run destroy on every folder under &lt;code&gt;/deployment/accounts/&lt;/code&gt;. Resource-specific information is shown only after the destroy command has been approved and run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think &lt;code&gt;terragrunt&lt;/code&gt; is a very useful tool for handling, for example, separate dev/qa/prod environments in Terraform. But for multi-account management tasks, it may turn out to be too lightweight. &lt;/p&gt;

&lt;p&gt;I'm keeping an eye on the HashiCorp side and what is happening with &lt;a href="https://www.hashicorp.com/blog/terraform-stacks-explained" rel="noopener noreferrer"&gt;Terraform Stacks&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Timezone aware Lambda cron schedule</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Fri, 28 Oct 2022 11:22:07 +0000</pubDate>
      <link>https://dev.to/aws-builders/timezone-aware-lambda-cron-schedule-272k</link>
      <guid>https://dev.to/aws-builders/timezone-aware-lambda-cron-schedule-272k</guid>
<description>&lt;p&gt;The most straightforward way to trigger a Lambda function on a schedule is to create a scheduled EventBridge rule. One caveat with EventBridge rules is that all scheduled events are triggered in the &lt;code&gt;UTC +0&lt;/code&gt; time zone. In many use cases that does not matter, but there are tasks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled data loading&lt;/li&gt;
&lt;li&gt;Backups&lt;/li&gt;
&lt;li&gt;Environment scans and other "housekeeping tasks"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;where working with a static &lt;code&gt;UTC +0&lt;/code&gt; schedule is at least a little bit annoying. You know how it goes: a Lambda task that used to trigger at 8:00 local time is suddenly triggering at 9:00 local time one morning. &lt;/p&gt;

&lt;p&gt;To implement a timezone-aware cron-scheduled Lambda, a &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html" rel="noopener noreferrer"&gt;Systems Manager Maintenance Window&lt;/a&gt; can be used. A Maintenance Window is arguably technology overkill for such a simple task, and I recommend also checking the other features it offers for task scheduling. For now, however, we are happy with just basic Lambda triggering.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to
&lt;/h2&gt;

&lt;p&gt;The full CloudFormation template with a timezone-aware Lambda trigger can be downloaded &lt;a href="https://github.com/markymarkus/cloudformation/blob/master/tz-lambda-cron/tz-lambda-cron.yml" rel="noopener noreferrer"&gt;HERE&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main resources of the implementation are described below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;StartWindow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SSM::MaintenanceWindow&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="na"&gt;AllowUnassociatedTargets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;Cutoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="na"&gt;Duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronTrigger&lt;/span&gt;
      &lt;span class="na"&gt;Schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cron(0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;18&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON-FRI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*)'&lt;/span&gt;
      &lt;span class="na"&gt;ScheduleTimezone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Europe/Helsinki'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The minimum precision of &lt;code&gt;Schedule&lt;/code&gt; is one minute, and scheduled tasks use the &lt;code&gt;ScheduleTimezone&lt;/code&gt; (here Europe/Helsinki).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;StartTask&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SSM::MaintenanceWindowTask&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StartTask&lt;/span&gt;
      &lt;span class="na"&gt;Priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;TaskArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__LAMBDA_FUNCTION_ARN__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;WindowId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;StartWindow&lt;/span&gt;
      &lt;span class="na"&gt;ServiceRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;AutomationExecutionRole.Arn&lt;/span&gt;
      &lt;span class="na"&gt;TaskType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LAMBDA&lt;/span&gt;    &lt;span class="c1"&gt;# also STEP_FUNCTIONS&lt;/span&gt;
      &lt;span class="na"&gt;TaskInvocationParameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;MaintenanceWindowLambdaParameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Payload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Base64&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"message":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Hello&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;World!"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MaintenanceWindowTask&lt;/code&gt; defines a task that is performed on the &lt;code&gt;MaintenanceWindow&lt;/code&gt; cron schedule. Outside of Systems Manager automations, &lt;code&gt;TaskType&lt;/code&gt; covers integrations with &lt;strong&gt;LAMBDA&lt;/strong&gt; and &lt;strong&gt;STEP_FUNCTIONS&lt;/strong&gt;. Other AWS or external services can be triggered with &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html" rel="noopener noreferrer"&gt;Systems Manager automation runbooks&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Lambda function &lt;code&gt;Payload&lt;/code&gt; must be a Base64-encoded string.&lt;/li&gt;
&lt;/ul&gt;
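&lt;p&gt;In CloudFormation the &lt;code&gt;!Base64&lt;/code&gt; intrinsic handles the encoding. If you register the task through the API or CLI instead, you can produce the same string yourself. A minimal Python sketch, using the example payload from the template above:&lt;/p&gt;

```python
import base64
import json

# Build the same Base64-encoded payload the template produces with !Base64.
payload = {"message": "Hello World!"}
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
print(encoded)
```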

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's it! When Finland sets its clocks back one hour on 30.10.2022, so does my Lambda cron schedule.&lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>aws</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Debugging failed Eventbridge invocation</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Sat, 01 Oct 2022 14:54:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/debugging-failed-eventbridge-invocation-3ih6</link>
      <guid>https://dev.to/aws-builders/debugging-failed-eventbridge-invocation-3ih6</guid>
<description>&lt;p&gt;When EventBridge tries to send an event to a target and the delivery fails, by default the only way to notice is the &lt;code&gt;FailedInvocations&lt;/code&gt; CloudWatch metric. The metric alone does not tell you why the event delivery is failing. &lt;br&gt;
In general there are two options for debugging &lt;code&gt;FailedInvocations&lt;/code&gt; issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Debug at the resource level. If your EventBridge rule targets a Lambda function, search for failed Lambda invocations in CloudTrail logs.&lt;/li&gt;
&lt;li&gt;Forward failed deliveries to a DLQ (dead-letter queue). &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post I show how to configure a DLQ for an EventBridge target and how to write error logs to CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;You can get the full template from: &lt;a href="https://github.com/markymarkus/cloudformation/blob/master/eventbridge-debug-dlq/template.yml" rel="noopener noreferrer"&gt;https://github.com/markymarkus/cloudformation/blob/master/eventbridge-debug-dlq/template.yml&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Walkthrough
&lt;/h1&gt;

&lt;p&gt;We start with a very basic &lt;code&gt;AWS::Events::Rule&lt;/code&gt; on account 111111111111, which forwards events from &lt;code&gt;custom.source&lt;/code&gt; to an event bus on account 222222222222. The &lt;code&gt;FailedInvocations&lt;/code&gt; metric shows that all invocations are failing (see Fig. 1).&lt;/p&gt;
&lt;h2&gt;
  
  
  Enable error logging
&lt;/h2&gt;

&lt;p&gt;To better understand why events are not reaching the target event bus, the following resources are added: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure a DLQ (SQS) for the failing target.&lt;/li&gt;
&lt;li&gt;Set the EventBridge target retry count to 0. Depending on the error type, EventBridge retries delivery for up to 24 hours before giving up and sending the event to the DLQ. Setting the retry count to zero ensures that a failed event reaches the DLQ as soon as possible.&lt;/li&gt;
&lt;li&gt;Create a Lambda function that reads error messages from the DLQ (SQS) queue and writes error logs to CloudWatch Logs.&lt;/li&gt;
&lt;/ul&gt;
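&lt;p&gt;The DLQ handler from the last bullet can be very small. The sketch below is illustrative rather than the exact function from the linked template: it reads the &lt;code&gt;ERROR_CODE&lt;/code&gt;, &lt;code&gt;ERROR_MESSAGE&lt;/code&gt;, &lt;code&gt;RULE_ARN&lt;/code&gt; and &lt;code&gt;TARGET_ARN&lt;/code&gt; message attributes that EventBridge attaches to each dead-lettered event and prints them, which lands them in the function's CloudWatch log group:&lt;/p&gt;

```python
import json

def handler(event, context):
    """SQS-triggered Lambda: log why EventBridge dead-lettered each event."""
    failures = []
    for record in event.get("Records", []):
        attrs = record.get("messageAttributes", {})
        info = {
            "error_code": attrs.get("ERROR_CODE", {}).get("stringValue"),
            "error_message": attrs.get("ERROR_MESSAGE", {}).get("stringValue"),
            "rule_arn": attrs.get("RULE_ARN", {}).get("stringValue"),
            "target_arn": attrs.get("TARGET_ARN", {}).get("stringValue"),
        }
        print(json.dumps(info))  # ends up in CloudWatch Logs
        failures.append(info)
    return failures
```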

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91xms7zsrheqn6c08ezu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91xms7zsrheqn6c08ezu.png" alt="Architecture" width="402" height="291"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig.1 Architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And this is how the configuration looks in &lt;a href="https://github.com/markymarkus/cloudformation/blob/master/eventbridge-debug-dlq/template.yml" rel="noopener noreferrer"&gt;Cloudformation template&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;CustomEventsRule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Events::Rule&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;EventBusName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;CustomEventBus.Arn&lt;/span&gt;
      &lt;span class="na"&gt;EventPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;custom.source&lt;/span&gt;
      &lt;span class="na"&gt;State&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLED&lt;/span&gt;
      &lt;span class="na"&gt;Targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customtarget'&lt;/span&gt;
          &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:events:eu-west-1:222222222222:event-bus/default'&lt;/span&gt;
          &lt;span class="na"&gt;RetryPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;MaximumRetryAttempts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;DeadLetterQueue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;DLQueue.Arn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the dead-letter queue setup is in place, wait for the next failing invocation and open the DLQ handler Lambda's execution logs in CloudWatch Logs. The &lt;code&gt;ERROR_MESSAGE&lt;/code&gt; and &lt;code&gt;ERROR_CODE&lt;/code&gt; fields contain a human-readable reason why the delivery is failing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"messageAttributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"RULE_ARN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:events:eu-west-1:111111111111:rule/custom_event_bus/dev-eb-debug-CustomEventsRule-3GTDO9NDN1Q9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"binaryListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"dataType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"String"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"TARGET_ARN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:events:eu-west-1:222222222222:event-bus/default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"binaryListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"dataType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"String"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"ERROR_MESSAGE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lack of permissions to invoke cross account target."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"binaryListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"dataType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"String"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"ERROR_CODE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO_PERMISSIONS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"stringListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"binaryListValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"dataType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"String"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the delivery was failing because the EventBridge resource policy on the receiving AWS account had been removed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In general, DLQs require some logic to handle failed events. Adding an alarm for failed EventBridge invocations and logging via the DLQ is the first step toward understanding whether that logic should be developed further. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudskills</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Tracking grocery price trends on AWS - Part 2 - Analytics</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Thu, 28 Jul 2022 05:22:00 +0000</pubDate>
      <link>https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-2-analytics-g5d</link>
      <guid>https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-2-analytics-g5d</guid>
<description>&lt;p&gt;In &lt;a href="https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-1-ingestion-pipeline-3gpm"&gt;Part 1&lt;/a&gt;, we implemented an ingestion pipeline for grocery receipts. If you followed the instructions, you should now have test data from two grocery receipts extracted into JSON files in S3.&lt;/p&gt;

&lt;p&gt;In this second part, I show the basic steps for running analysis on the extracted grocery data. The AWS services we are going to use are &lt;a href="https://aws.amazon.com/athena" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; and &lt;a href="https://aws.amazon.com/quicksight/" rel="noopener noreferrer"&gt;Amazon QuickSight&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setup Athena
&lt;/h2&gt;

&lt;p&gt;First we create an Athena table for the extracted data. The data schema is a simple one. Just remember to change the bucket name to match yours and load the partitioned data with MSCK REPAIR TABLE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTERNAL TABLE grocery_items (
    name string,
    price float,
    currency string,
    unit string,
    date string
  )
PARTITIONED BY (store string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-grocery-tracking-bucket-output/';

MSCK REPAIR TABLE grocery_items;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the data with &lt;code&gt;SELECT * FROM grocery_items&lt;/code&gt; and you should get 37 rows of grocery data. Nice, that means the solution is working end to end.&lt;/p&gt;

&lt;p&gt;Next we need &lt;strong&gt;more data&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Backfill the purchase history
&lt;/h2&gt;

&lt;p&gt;If you start collecting receipts from this day onwards, it can take weeks, or more likely months, before an item's price changes. Luckily it's possible to retrieve your grocery receipt history and see the price changes leading up to the present.&lt;/p&gt;

&lt;p&gt;S-Group and Kesko, the biggest players in the Finnish grocery market, both offer applications with opt-in services for digitised receipts (apps: &lt;a href="https://www.s-kanava.fi/asiakasomistajuus/palvelut/s-mobiili/" rel="noopener noreferrer"&gt;S-Mobiili&lt;/a&gt;, &lt;a href="https://plussa.fi/k-plussa/palvelut" rel="noopener noreferrer"&gt;K-Ruoka&lt;/a&gt;). Both applications can export receipts as PDF. All we need to do is feed the PDF receipts into the ingestion pipeline and visualise the results. Scanned or photographed JPG receipts work as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some words about the data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In the analysis, I have treated the grocery item name as a &lt;em&gt;unique identifier&lt;/em&gt;. For example, &lt;code&gt;KAURAJUOMA&lt;/code&gt; (oat milk) in the following trends means &lt;em&gt;the&lt;/em&gt; oat milk. In reality, the receipt line &lt;code&gt;KAURAJUOMA&lt;/code&gt; can refer to BrandA Kaurajuoma (1.50 €) or BrandB Kaurajuoma (1.95 €), in which case the same item name would have two different prices in the data. This is a likely anomaly in the per-item price analysis if your buying pattern is not consistent. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A missing data point on a graph doesn't mean there was no price change. It just means there is no data about the change on &lt;em&gt;my&lt;/em&gt; receipts.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Query the grocery data
&lt;/h2&gt;

&lt;p&gt;I have used S-Group's app and the Omat Ostot service for about two years, and the following examples show some trends based on that data. I first exported the Omat Ostot grocery receipts to PDF and then synced everything to the ingestion pipeline's input bucket.&lt;/p&gt;

&lt;p&gt;I have about 6000 rows of data in &lt;code&gt;grocery_items&lt;/code&gt; table. &lt;/p&gt;

&lt;p&gt;The first query checks which grocery items occur most often in the data, i.e. which items are purchased most frequently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name,COUNT(name) AS item_line_count FROM items GROUP BY name ORDER BY item_line_count DESC

1   KAURAJUOMA  161
2   LUOMU RASVATON MAITO    125
3   MAITOJUOMA LAKTON RASVATO   124
4   PAPRIKA PUNAINEN IRTO   87
5   PEHMEÄ MAITORAHKA  84
6   RUISPUIKULAT    79
7   KURKKU SUOMI    79
8   BANAANI LUOMU   74
9   TOMAATTI SUOMI  73
10  AVOKADOPUSSI    70
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Huh, no beer on that list. Just milk, rye bread and vegetables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualize the price trends
&lt;/h2&gt;

&lt;p&gt;Next, &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-athena.html" rel="noopener noreferrer"&gt;I created a dataset from Athena&lt;/a&gt; grocery data in Amazon QuickSight. The following graphs are created from that dataset, with the data filtered to the S-Market Kaleva supermarket. &lt;/p&gt;

&lt;p&gt;The first price trends are for dairy and meat products:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73apxwnatxats0uxbwks.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73apxwnatxats0uxbwks.jpg" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, vegetables and fruits. Seasonal price fluctuation is clearly the trend for fresh vegetables. Bananas, though...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w23ktxwsbx3tezf1b8h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w23ktxwsbx3tezf1b8h.jpg" alt="Image description" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cereal products, bread, flour, misc. Coffee was the first product whose price increase caught my attention in early spring. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp5j9aarciinxlh6kfyl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp5j9aarciinxlh6kfyl.jpg" alt="Image description" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on these, the talk about 2022 grocery inflation is the real deal. Prices are increasing in almost every category of tracked grocery products. It will be interesting to see where the trends are a few months down the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By following the steps detailed in these posts, you can implement the described receipt ingestion pipeline and start running analyses on the data. &lt;/p&gt;

&lt;p&gt;For this quick exercise, I wanted to focus solely on grocery inflation. Another interesting angle would be to analyse prices between grocery chains. That would require understanding the product range of each market and matching item references in the data for a meaningful comparison. And many more receipts.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>analytics</category>
      <category>cloudskills</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Tracking grocery price trends on AWS - Part 1 - Ingestion pipeline</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Thu, 28 Jul 2022 05:21:00 +0000</pubDate>
      <link>https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-1-ingestion-pipeline-3gpm</link>
      <guid>https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-1-ingestion-pipeline-3gpm</guid>
<description>&lt;p&gt;During the first half of 2022, inflation has been increasing prices in every category of goods and services. Inflation is mentioned every day in the news, but without manual bookkeeping it can be hard to notice how it affects the daily cost of living. Small increases, accumulated, can make a big difference in a monthly or yearly budget.&lt;/p&gt;

&lt;p&gt;How do you track price changes for daily food essentials like milk and bread, the supplies that are bought again and again on every grocery run? That is the main question we are going to tackle in this blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;To track grocery price trends, I came up with the idea of gathering pricing data from grocery receipts. All the needed data is there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Item name&lt;/li&gt;
&lt;li&gt;Price per item&lt;/li&gt;
&lt;li&gt;Purchase date&lt;/li&gt;
&lt;li&gt;Shop name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parsing and analysing a collection of grocery receipts provides the data needed for tracking grocery price trends. All we need is an automated pipeline to extract the data from the receipts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1&lt;/strong&gt; of the blog post covers how to set up the receipt ingestion pipeline. Once we have finished, we should have automation that extracts the data from a receipt into a JSON object.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://dev.to/markymarkus/tracking-grocery-price-trends-on-aws-part-2-analytics-g5d"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;, I present some ideas and results on price trend tracking. The data used in the analysis comes from a nearby supermarket, showing price trends for the grocery items my household buys regularly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The receipt data ingestion pipeline is a serverless, event-driven workflow. An upload to the S3 input bucket triggers the receipt processing pipeline, which writes the extracted grocery item data in JSON format to the S3 output bucket. &lt;/p&gt;

&lt;p&gt;The main AWS services used are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/textract/" rel="noopener noreferrer"&gt;Amazon Textract&lt;/a&gt; detects and extracts lines of text from printed receipts.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; is used for parsing the receipt data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following actions are triggered on every receipt upload:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A receipt .jpg or .pdf is uploaded to the input bucket.&lt;/li&gt;
&lt;li&gt;The trigger Lambda passes the receipt filename and an SNS topic to Amazon Textract.&lt;/li&gt;
&lt;li&gt;When the OCR data is ready, Textract publishes the &lt;code&gt;JobId&lt;/code&gt; to the provided SNS topic.&lt;/li&gt;
&lt;li&gt;The parser Lambda reads the Textract result data, parses the pricing data and writes the result JSON to the output bucket.&lt;/li&gt;
&lt;li&gt;(Part 2) The grocery receipt JSON data is analysed with Amazon QuickSight.&lt;/li&gt;
&lt;/ol&gt;
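&lt;p&gt;Step 2 boils down to a single asynchronous Textract call. Below is a sketch of the request the trigger Lambda might build; the parameter names follow boto3's &lt;code&gt;start_document_text_detection&lt;/code&gt;, while the helper function and the ARNs in the usage comment are illustrative:&lt;/p&gt;

```python
def build_textract_request(bucket, key, topic_arn, role_arn):
    """Build kwargs for textract.start_document_text_detection (boto3).

    Illustrative helper: in the real pipeline the bucket/key come from the
    S3 trigger event, and the SNS topic/role from the stack configuration.
    """
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }

# Usage (hypothetical ARNs):
# textract = boto3.client("textract")
# textract.start_document_text_detection(**build_textract_request(
#     "my-grocery-tracking-bucket", "receipt.pdf",
#     "arn:aws:sns:eu-west-1:111111111111:textract-done",
#     "arn:aws:iam::111111111111:role/TextractPublishRole"))
```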

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddp7itxwwqqu5cwg2scz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddp7itxwwqqu5cwg2scz.jpg" alt="Image description" width="741" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing the receipt data
&lt;/h2&gt;

&lt;p&gt;Receipts from the following stores are supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S-Market&lt;/li&gt;
&lt;li&gt;Prisma&lt;/li&gt;
&lt;li&gt;Sale&lt;/li&gt;
&lt;li&gt;K-Market&lt;/li&gt;
&lt;li&gt;KCM&lt;/li&gt;
&lt;li&gt;K-Citymarket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The image below contains an example of a grocery receipt. Depending on the grocery chain or supermarket, the receipt format may have some nuances, such as commas instead of dots as the price decimal separator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdgp8hvr1o3iihrf27zh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdgp8hvr1o3iihrf27zh.jpeg" alt="Image description" width="400" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When parsing the extracted receipt data, the following variations of receipt item rows are handled:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GREEN&lt;/strong&gt;. General information about the purchase: store name and receipt date. &lt;br&gt;
&lt;strong&gt;YELLOW&lt;/strong&gt;. A basic grocery item line with the item name (MAITOJUOMA LAKTON RASVATO, i.e. non-fat lactose-free milk) and price.&lt;br&gt;
&lt;strong&gt;RED&lt;/strong&gt;. Alennus, a discount entry. A receipt item can have a reduced price for various reasons, but since this pipeline tracks grocery price trends, we are happy with the full price.&lt;br&gt;
&lt;strong&gt;BLUE&lt;/strong&gt;. Multiple items on the same entry (EUR/KPL) or EUR/KG-priced goods. The total price is on the first line, and the per-item or per-kilogram price on the second. As with RED items, our aim is to track price trends, so we read the item name from the first line and the item price from the second line. That way we can track the price trend of, for example, 1 kg of bananas rather than the cost of a daily banana purchase.&lt;/p&gt;
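&lt;p&gt;The row variations above can be sketched as a small parser. This is a heavily simplified illustration, not the parser from the repo: it keeps YELLOW lines as-is, skips RED (Alennus) lines, lets a BLUE EUR/KG or EUR/KPL line overwrite the previous item's total with the per-unit price, and normalises comma decimal separators:&lt;/p&gt;

```python
import re

# "ITEM NAME  1,25": item name plus price, comma or dot decimal separator
ITEM_RE = re.compile(r"^(.+?)\s+(\d+[.,]\d\d)$")
UNIT_PRICE_RE = re.compile(r"(\d+[.,]\d\d)\s*EUR/K")

def parse_receipt_lines(lines):
    items = []
    for line in lines:
        text = line.strip()
        upper = text.upper()
        if upper.startswith("ALENNUS"):
            continue  # RED: discount row, keep the full price already recorded
        if "EUR/KG" in upper or "EUR/KPL" in upper:
            # BLUE: per-unit price on the second line replaces the total
            m = UNIT_PRICE_RE.search(upper)
            if m and items:
                items[-1]["price"] = float(m.group(1).replace(",", "."))
            continue
        m = ITEM_RE.match(text)
        if m:  # YELLOW: plain item line
            items.append({"name": m.group(1),
                          "price": float(m.group(2).replace(",", "."))})
    return items
```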

&lt;p&gt;For the highlighted blocks on the receipt, the pipeline writes the following JSON structure to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-grocery-tracking-bucket-output/store=S-MARKET KALEVA PUH 0107671180/20191226-173900.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The grocery store name is included in the S3 prefix and used for data partitioning; more about that in the second part of the blog.&lt;br&gt;
The JSON data contains one receipt item line per JSON object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;20191226-173900.json:
{"name": "MAITOJUOMA LAKTON RASVATO", "price": 1.25, "currency": "EUR", "date": "2019-12-26 17:39:00"}
{"name": "100% KAURA 6KPL", "price": 1.59, "currency": "EUR", "date": "2019-12-26 17:39:00"}
{"name": "BANAANI LUOMU", "price": 1.79, "currency": "EUR", "date": "2019-12-26 17:39:00"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
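&lt;p&gt;Since each line is a standalone JSON object, which is also the layout the JSON SerDe in Athena expects, reading an output file back is trivial. A small illustrative sketch:&lt;/p&gt;

```python
import json

def read_receipt_items(jsonl_text):
    """Parse pipeline output: one JSON object per line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

sample = (
    '{"name": "MAITOJUOMA LAKTON RASVATO", "price": 1.25, "currency": "EUR", "date": "2019-12-26 17:39:00"}\n'
    '{"name": "BANAANI LUOMU", "price": 1.79, "currency": "EUR", "date": "2019-12-26 17:39:00"}\n'
)
items = read_receipt_items(sample)
```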



&lt;h1&gt;
  
  
  Deploy with Cloudformation
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/markymarkus/grocery-receipt-textract" rel="noopener noreferrer"&gt;Github repo for grocery-receipt-textract&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To try out the solution, you can deploy the ingestion pipeline from the CloudFormation template. The template creates all the needed AWS resources in your AWS account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/markymarkus/grocery-receipt-textract.git
cd grocery-receipt-textract
aws cloudformation package --s3-bucket cf-stage-sandbox-markus --output-template-file packaged.yaml --region eu-west-1 --template-file template.yml
aws cloudformation deploy --template-file packaged.yaml --stack-name dev-grocery-pipeline --parameter-overrides InputBucketName=my-grocery-tracking-bucket --capabilities CAPABILITY_IAM

# After the stack finishes, two buckets for receipts and pipeline outputs are created:
# Input = my-grocery-tracking-bucket
# Output = my-grocery-tracking-bucket-output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we trigger the pipeline with some grocery receipt test data, also included in the repo. Replace the bucket name with the input bucket name from the previous step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws s3 sync test_data s3://my-grocery-tracking-bucket/ 
# Wait for about 1 min and check the results:
aws s3 ls s3://my-grocery-tracking-bucket-output/
#PRE store=K-Market Domus/
#PRE store=S-MARKET KALEVA PUH 0107671180/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's it! We have successfully created the grocery receipt ingestion pipeline. In Part 2, we will put the pipeline into action and see if any hints of inflation can be found in the extracted price data.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudskills</category>
      <category>analytics</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Using Athena to query multi-account Cloudwatch Logs</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Sat, 07 May 2022 10:10:10 +0000</pubDate>
      <link>https://dev.to/markymarkus/using-athena-to-query-multi-account-cloudwatch-logs-54j</link>
      <guid>https://dev.to/markymarkus/using-athena-to-query-multi-account-cloudwatch-logs-54j</guid>
<description>&lt;p&gt;The scenario: we have multiple workloads and environments deployed to a multi-account organization in AWS Organizations. CloudWatch Logs is used to store logs from various services within the scope of a single AWS account: EC2 instances push system logs, Lambda functions push execution logs, and so on. &lt;/p&gt;

&lt;p&gt;To get a better understanding of application logs, aggregating logs from separate AWS accounts into a single service or S3 bucket can be helpful. Depending on your business, regulatory compliance may also oblige you to keep logs for a certain amount of time. Handling long-term log storage in a centralised account helps when implementing access control and retention schedules.&lt;/p&gt;

&lt;p&gt;This post demonstrates how to implement centralised log storage for a multi-account organization in AWS Organizations. The main goal is to create cloud infrastructure to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store application logs from multiple AWS accounts in a single S3 bucket.&lt;/li&gt;
&lt;li&gt;Run ad-hoc queries against the data in S3 with Amazon Athena.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov2t8gg5ilab0fob1fzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov2t8gg5ilab0fob1fzo.png" alt="Logging overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The proposed solution uses CloudWatch Logs for log ingestion in the source accounts, Amazon Kinesis Data Firehose for log delivery to centralised storage, and S3 for long-term log data storage. Finally, we use Amazon Athena to run SQL queries against these logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;CloudFormation templates for the logging solution can be found here: &lt;a href="https://github.com/markymarkus/cloudformation/tree/master/centralised-cloudwatch-logs" rel="noopener noreferrer"&gt;https://github.com/markymarkus/cloudformation/tree/master/centralised-cloudwatch-logs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;logging-account-template.yml&lt;/code&gt; provisions infrastructure for the Log Storage Account. The template parameter &lt;code&gt;OrganizationId&lt;/code&gt; is used in the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-cloudwatch-logs-aws-organizations-subscriptions/" rel="noopener noreferrer"&gt;CloudWatch Logs Destination access policy&lt;/a&gt; to allow log delivery from AWS Organizations member accounts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;member-account-template.yml&lt;/code&gt; demonstrates how to create a Subscription Filter for CloudWatch Logs. You can also create this in the same account where &lt;code&gt;logging-account-template.yml&lt;/code&gt; is deployed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log Source Accounts
&lt;/h3&gt;

&lt;p&gt;Source accounts run workloads that produce and ingest logs to CloudWatch Logs. Each log group stores logging events from a specific source. In our example:&lt;br&gt;
&lt;code&gt;/var/log/messages&lt;/code&gt; stores system logs from EC2 instances.&lt;br&gt;
&lt;code&gt;/aws/lambda/hello-world&lt;/code&gt; stores logs from the Hello World Lambda function.&lt;br&gt;
A consistent naming convention for log groups will come in handy when we run Athena queries against the data. Queries like '/var/log/messages of every EC2 instance in every AWS account of the Organization' depend on consistent naming.  &lt;/p&gt;

&lt;p&gt;Logs are sent from a member account to the receiving centralised destination in the Log Storage Account through a &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html" rel="noopener noreferrer"&gt;subscription filter&lt;/a&gt;. &lt;/p&gt;
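&lt;p&gt;As a rough sketch of what the template sets up, a subscription filter could also be attached with boto3's &lt;code&gt;put_subscription_filter&lt;/code&gt;. The filter name and destination ARN below are placeholders, not values from the templates:&lt;/p&gt;

```python
def add_subscription_filter(logs_client, log_group, destination_arn):
    """Subscribe a log group to the central CloudWatch Logs destination
    so its events flow on to the delivery stream in the Log Storage Account."""
    logs_client.put_subscription_filter(
        logGroupName=log_group,
        filterName="to-central-logging",   # hypothetical filter name
        filterPattern="",                  # empty pattern forwards every event
        destinationArn=destination_arn,
    )

# Usage (requires AWS credentials and an existing destination):
#   import boto3
#   add_subscription_filter(
#       boto3.client("logs"),
#       "/var/log/messages",
#       "arn:aws:logs:eu-west-1:999999999999:destination:central-logs")
```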

&lt;h3&gt;
  
  
  Log Storage Account
&lt;/h3&gt;

&lt;p&gt;The Log Storage Account receives logging data, prepares it for Athena and stores it in an S3 bucket. Logs are stored in a one-JSON-record-per-line format, which is supported by Athena and easy to export to other services and tools.&lt;/p&gt;
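&lt;p&gt;Each stored line is one decompressed CloudWatch Logs delivery record. A minimal sketch of one such line, with illustrative field values shaped like the table we define for Athena later:&lt;/p&gt;

```python
import json

# Illustrative record, shaped like a CloudWatch Logs subscription delivery
record = {
    "messageType": "DATA_MESSAGE",
    "owner": "111111111111",              # source AWS account id
    "logGroup": "/var/log/messages",
    "subscriptionFilters": ["to-central-logging"],
    "logEvents": [
        {"id": "1", "timestamp": "1651719000000",
         "message": "May 5 03:07:48 ip-10-0-24-56 dhclient[2085]: ..."},
    ],
}

# One JSON object per line ("JSON Lines") is the layout Athena expects here
line = json.dumps(record)
assert "\n" not in line
```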

&lt;h3&gt;
  
  
  Transforming log events
&lt;/h3&gt;

&lt;p&gt;By default, Firehose writes the JSON records in a stream to the S3 bucket without separators or newlines. This would be fine if we were not using Athena to run queries, but Athena requires each JSON record to be on a separate line. &lt;br&gt;
To split the JSON records we use a &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html" rel="noopener noreferrer"&gt;Firehose transformation Lambda&lt;/a&gt;. The Lambda function reads a batch of records and appends a newline character (&lt;code&gt;"\n"&lt;/code&gt;) after every record. The processed records are written back to the Firehose stream, which finally delivers the logs to the S3 bucket. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import base64
import gzip
from io import BytesIO

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Firehose delivers CloudWatch Logs data gzip-compressed and base64-encoded
        payload = base64.b64decode(record['data'])
        with gzip.GzipFile(fileobj=BytesIO(payload), mode='r') as f:
            payload = f.read().decode("utf-8")

        # Add a newline after each record so Athena sees one JSON object per line
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode((payload + "\n").encode("utf-8")).decode("utf-8")
        }
        output.append(output_record)

    return {'records': output}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Running Athena queries
&lt;/h2&gt;

&lt;p&gt;At this point we have the application logs in an S3 bucket in the Log Storage Account. Firehose buffers records by time, so each file contains logging events from multiple source AWS accounts and log groups: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6qbyqqwnyeqc0i1ephv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6qbyqqwnyeqc0i1ephv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get Athena queries running, first create an external table pointing to the data:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

CREATE EXTERNAL TABLE logs (
    messageType string,
    owner string,
    logGroup string,
    subscriptionFilters string,
    logEvents array&amp;lt;struct&amp;lt;id:string,
              timestamp:string,
              message:string
              &amp;gt;&amp;gt;
  )           
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://BUCKET_NAME/logs/year=2022/month=05/day=05/' 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;logEvents are stored in an array structure. In order to query CloudWatch Logs fields within the array, you need to &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html" rel="noopener noreferrer"&gt;UNNEST&lt;/a&gt; logEvents. Here are a couple of queries to get you started:  &lt;/p&gt;

&lt;p&gt;The first query returns the latest streamed logs from the Source Accounts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT owner,loggroup,n.message FROM logs
CROSS JOIN UNNEST(logs.logevents) AS t (n)
LIMIT 20
######
1   111111111111    /var/log/messages   May 5 03:07:48 ip-10-0-24-56 dhclient[2085]: XMT: Solicit on eth0, interval 119340ms.
2   222222222222    /aws/lambda/hello-world-lambda  START RequestId: fcc9e873-d4c1-4ca3-a7de-fd5490300740 Version: $LATEST
3   222222222222    /aws/lambda/hello-world-lambda  [DEBUG] 2022-05-05T03:07:50.972Z fcc9e873-d4c1-4ca3-a7de-fd5490300740 {'version': '0', 'id': ....
4   222222222222    /aws/lambda/hello-world-lambda  [DEBUG] 2022-05-05T03:07:50.973Z fcc9e873-d4c1-4ca3-a7de-fd5490300740 Hello World!
5   111111111111    /aws/lambda/lambda-writer-LambdaFunction    START RequestId: 38fb9b6b-dbea-4875-91cc-cf1dd5b36ab9 Version: $LATEST
6   111111111111    /aws/lambda/lambda-writer-LambdaFunction    [DEBUG] 2022-05-05T03:08:16.81Z 38fb9b6b-dbea-4875-91cc-cf1dd5b36ab9
7   111111111111    /var/log/messages   May 5 11:40:01 ip-10-0-24-56 systemd: Created slice User Slice of root.
8   111111111111    /var/log/messages   May 5 11:40:01 ip-10-0-24-56 systemd: Started Session 169 of user root. 



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The next SQL query returns the log event count for each logging source (log group):&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT owner,loggroup,count(*) FROM logs
CROSS JOIN UNNEST(logs.logevents) AS t (n)
GROUP BY owner,loggroup
######
1   111111111111    /var/log/secure 429
2   111111111111    /var/log/messages   1670
3   222222222222    /aws/lambda/hello-world-lambda  5764
4   111111111111    /aws/lambda/lambda-writer-LambdaFunction    7198
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you are in the process of planning and building a logging strategy, this solution can be a good starting point. You can collect CloudWatch Logs from multiple accounts and regions into a single S3 bucket and run Athena queries against the consolidated logging data. I encourage you to experiment with SQL queries on the logging data: analysing source-specific logging patterns and event volumes may help you improve your overall log management process. &lt;/p&gt;

&lt;p&gt;Thanks for reading. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudskills</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Paginate direct AWS SDK calls with AWS Step Functions</title>
      <dc:creator>Markus Toivakka</dc:creator>
      <pubDate>Fri, 18 Mar 2022 09:16:46 +0000</pubDate>
      <link>https://dev.to/markymarkus/paginate-direct-aws-sdk-calls-with-aws-step-functions-2m3l</link>
      <guid>https://dev.to/markymarkus/paginate-direct-aws-sdk-calls-with-aws-step-functions-2m3l</guid>
      <description>&lt;p&gt;AWS Step Functions &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/supported-services-awssdk.html" rel="noopener noreferrer"&gt;AWS SDK integration&lt;/a&gt; lets you call huge selection of AWS services directly from your Step Functions workflow.&lt;/p&gt;

&lt;p&gt;For API calls that can return a large list of items, the APIs return only the first set of results by default. For example, the S3 list objects response returns at most 1000 objects. The rest of the results must be requested by providing a pagination token in the request.&lt;/p&gt;
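&lt;p&gt;As a rough illustration of the token mechanics (the bucket name is a placeholder), the same loop in plain Python with boto3's &lt;code&gt;list_objects_v2&lt;/code&gt; would look like this:&lt;/p&gt;

```python
def list_all_objects(s3_client, bucket, batch_size=1000):
    """Collect every key in the bucket, one ContinuationToken page at a time."""
    keys, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "MaxKeys": batch_size}
        if token:
            kwargs["ContinuationToken"] = token
        resp = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            # Last page reached; in the ASL below this is the Choice state
            return keys
        token = resp["NextContinuationToken"]

# Usage (requires AWS credentials):
#   import boto3
#   keys = list_all_objects(boto3.client("s3"), "my-example-bucket")
```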

&lt;p&gt;For data processing, pagination is a very useful, even mandatory, pattern. Dividing a result set into fixed-size pages makes it easier to build, for example, meaningful retry logic for error handling. Also, processing partial result sets &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html" rel="noopener noreferrer"&gt;in parallel&lt;/a&gt; can improve a workflow's running time significantly.   &lt;/p&gt;

&lt;p&gt;In this example I show how to implement, in Step Functions ASL (Amazon States Language), listing objects (&lt;code&gt;arn:aws:states:::aws-sdk:s3:listObjectsV2&lt;/code&gt;) in an S3 bucket and then triggering a processing step with each batch of S3 objects. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; some AWS APIs use &lt;code&gt;NextToken&lt;/code&gt; to paginate results. The workflow with pagination stays the same as the one I cover next with &lt;code&gt;ContinuationToken&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to implement pagination with ASL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/markymarkus/ab8dd0511374eb84d0efd872dc2c0291" rel="noopener noreferrer"&gt;Cloudformation template&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This example shows a very simple flow of listing objects in an S3 bucket and then triggering a processing step. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feggsy3qc9f7j7jd3aeou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feggsy3qc9f7j7jd3aeou.png" alt="workflow" width="400" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the ASL definition. The &lt;code&gt;BatchSize&lt;/code&gt; parameter controls how many S3 objects are included in each processing batch. We keep requesting new batches as long as the response includes &lt;code&gt;IsTruncated: true&lt;/code&gt;. The last batch comes back with &lt;code&gt;IsTruncated: false&lt;/code&gt; (and at most &lt;code&gt;BatchSize&lt;/code&gt; objects), so we can finish processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Comment": "List S3 objects.",
    "StartAt": "list_s3",
    "States": {
        "list_s3": {
            "Comment": "Get first batch of objects.",
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
            "ResultPath": "$.s3_objects",
            "Parameters": {
                "Bucket": "${BucketName}",
                "MaxKeys": ${BatchSize}
            },
            "Next": "process_s3_objects"
            },
        "process_s3_objects": {
            "Comment": "Processing logic. Now we just wait.",
            "Type": "Wait",
            "Seconds": 2,
            "Next": "check_if_all_listed"
            },
        "check_if_all_listed": {
            "Type": "Choice",
            "Choices": [
                {
                "Variable": "$.s3_objects.IsTruncated",
                "BooleanEquals": false,
                "Next": "success_state"
                }
                ],
            "Default": "list_s3_with_continuation_token"
            },
        "list_s3_with_continuation_token": {
            "Comment": "Get next batch of objects. Provide ContinuationToken in the request.",
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
            "ResultPath": "$.s3_objects",
            "Parameters": {
                "Bucket": "${BucketName}",
                "MaxKeys": ${BatchSize},
                "ContinuationToken.$": "$.s3_objects.NextContinuationToken"
                },
            "Next": "process_s3_objects"
            },
        "success_state": {
            "Type": "Succeed"
            }
     }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;AWS Step Functions is a perfect fit for coordinating workflows and orchestrating AWS services. I strongly recommend building a library of good templates to get a running start when adapting it to your use cases.   &lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudskills</category>
    </item>
  </channel>
</rss>
