<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manu Muraleedharan</title>
    <description>The latest articles on DEV Community by Manu Muraleedharan (@manumaan).</description>
    <link>https://dev.to/manumaan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1060158%2Fa9062491-c06e-4f08-bcf3-b4cf9c359306.png</url>
      <title>DEV Community: Manu Muraleedharan</title>
      <link>https://dev.to/manumaan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manumaan"/>
    <language>en</language>
    <item>
      <title>Durable Functions in Lambda</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Thu, 15 Jan 2026 09:22:03 +0000</pubDate>
      <link>https://dev.to/manumaan/durable-functions-in-lambda-4bna</link>
      <guid>https://dev.to/manumaan/durable-functions-in-lambda-4bna</guid>
      <description>&lt;p&gt;What if - &lt;/p&gt;

&lt;p&gt;You could build multi-step applications and workflows - directly in Lambda? &lt;/p&gt;

&lt;p&gt;Pause your app and continue based on callbacks? &lt;/p&gt;

&lt;p&gt;All this, with the tried and tested AWS Lambda platform? &lt;/p&gt;

&lt;p&gt;All this and more is possible with the new Lambda Durable Functions, announced at re:Invent 2025. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are Durable Functions?&lt;/strong&gt;&lt;br&gt;
Lambda Durable Functions let you build resilient multi-step applications and workflows that can run for up to a year, preserving progress state across interruptions. Each run is called a Durable Execution. Checkpoints record the state at each step, so the system can automatically recover from failures by replaying the execution: it starts from the beginning but skips any work that has already been completed. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Does it Work?&lt;/strong&gt;&lt;br&gt;
Behind the scenes, durable functions are Lambda functions backed by a checkpoint-and-replay mechanism that tracks long-running executions. When a durable function resumes from a wait state (suspension) or retries after a failure, your code runs again from the beginning, but it consults the checkpoints to see which steps have already completed. Those steps are skipped and their stored results are reused. This replay mechanism preserves consistency. &lt;/p&gt;
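&lt;p&gt;The checkpoint-and-replay idea can be illustrated with a toy sketch in plain Python. This is not the AWS SDK; the in-memory dictionary stands in for the durable checkpoint store that the service manages for you:&lt;/p&gt;

```python
# Toy illustration of checkpoint-and-replay (not the AWS SDK).
# The in-memory dict stands in for the durable checkpoint store.
checkpoints = {}
calls = {"reserve": 0}  # counts real work, to show that replay skips it

def step(name, fn):
    """Run fn once; on replay, return the checkpointed result instead."""
    if name in checkpoints:
        return checkpoints[name]
    result = fn()
    checkpoints[name] = result
    return result

def reserve_inventory():
    calls["reserve"] += 1
    return "reserved"

def handler():
    # Always executes from the top, just like a replayed durable execution.
    return step("reserve_inventory", reserve_inventory)

first = handler()    # initial execution: the step actually runs
replay = handler()   # "replay" after an interruption: the step is skipped
```

Both calls return the same result, but the inventory is reserved exactly once.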

&lt;p&gt;&lt;strong&gt;Use Cases for Durable Functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many workloads that today rely on services like Step Functions, or on hand-rolled microservice orchestration, can now be built with Lambda Durable Functions. &lt;/p&gt;

&lt;p&gt;For example, an order processing system might have these steps: reserve inventory, process payment, and create shipment. &lt;/p&gt;

&lt;p&gt;Previously, if you reserved inventory and then failed in one of the later steps, you had to make sure a retry did not reserve the item again, which meant writing custom idempotency code or other scaffolding. With Durable Functions, these become three separate steps. On retry, Lambda sees the checkpoint for the reservation step and skips it automatically, without you doing anything. &lt;/p&gt;

&lt;p&gt;Payment processing could be an external API that requires you to implement retries, exponential backoff, and so on, according to the API contract. Durable Functions handle these natively; they are just parameters on the step. &lt;/p&gt;
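&lt;p&gt;For comparison, here is roughly the boilerplate you would otherwise write by hand for that payment call (plain Python, illustrative only; with Durable Functions the same policy becomes retry parameters on the step instead of code):&lt;/p&gt;

```python
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.01):
    """Retry fn with exponential backoff: 0.01s, 0.02s, 0.04s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))

attempts = {"n": 0}

def flaky_payment():
    # Simulated external payment API that fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("payment gateway timeout")
    return "paid"

result = call_with_backoff(flaky_payment)
```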

&lt;p&gt;Maybe the reservation happens in an external ERP system, asynchronously. Previously, your Lambda or custom code had to wait (an expensive waste) or poll for status. With durable functions, the execution can simply suspend and resume on a callback from the ERP system, without paying for the wait. &lt;/p&gt;

&lt;p&gt;Maybe you need a human to approve before creating the shipment. Durable functions can wait for human intervention before proceeding. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to create Durable Functions&lt;/strong&gt;&lt;br&gt;
We create Durable Functions just like any other Lambda, using the Lambda console or IaC tooling. Currently it is available for Node.js and Python, and only in select regions (Ohio only as of this writing). &lt;/p&gt;

&lt;p&gt;In Python, we use decorators to designate parts of the code as Durable Steps and as the Durable Execution. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Durable Step&lt;/em&gt;&lt;br&gt;
Executes business logic with automatic checkpointing and retry. Use steps for operations that call external services, perform calculations, or execute any logic that should be checkpointed. The SDK creates a checkpoint before and after the step, storing the result for replay.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Durable Execution&lt;/em&gt;&lt;br&gt;
The handler that orchestrates how the steps run, and where waits and callbacks happen. &lt;/p&gt;

&lt;p&gt;Sample Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from aws_durable_execution_sdk_python import (
    DurableContext,
    durable_execution,
    durable_step,
)
from aws_durable_execution_sdk_python.config import Duration

@durable_step
def validate_order(step_context, order_id):
    step_context.logger.info(f"Validating order {order_id}")
    return {"orderId": order_id, "status": "validated"}

@durable_step
def process_payment(step_context, order_id):
    step_context.logger.info(f"Processing payment for order {order_id}")
    return {"orderId": order_id, "status": "paid", "amount": 99.99}

@durable_step
def confirm_order(step_context, order_id):
    step_context.logger.info(f"Confirming order {order_id}")
    return {"orderId": order_id, "status": "confirmed"}

@durable_execution
def lambda_handler(event, context: DurableContext):
    order_id = event['orderId']

    # Step 1: Validate order
    validation_result = context.step(validate_order(order_id))

    # Step 2: Process payment
    payment_result = context.step(process_payment(order_id))

    # Wait for 10 seconds to simulate external confirmation
    context.wait(Duration.from_seconds(10))

    # Step 3: Confirm order
    confirmation_result = context.step(confirm_order(order_id))

    return {
        "orderId": order_id,
        "status": "completed",
        "steps": [validation_result, payment_result, confirmation_result]
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you run the Durable Function, its runs appear in the new Durable Executions tab in the Lambda console, which lists every execution of the function. Durable operations show the steps within a Durable Execution. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19tf0t1ocjtzvmfc631x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19tf0t1ocjtzvmfc631x.png" alt="DurableExecution" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can go to Durable Operations, where you can see the inputs, outputs, logs, etc. for the individual steps. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wysf6wk1uh13mfodqkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8wysf6wk1uh13mfodqkb.png" alt="DurableOperations" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Event History shows the events that occurred during the Durable Execution; if the execution included a wait, for example, it will show up here. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu8lwo0nfq1l0g7m7gxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu8lwo0nfq1l0g7m7gxr.png" alt="EventHistory" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If these Durable Function screens look like the Step Functions screens, you are not mistaken. The two services overlap in use cases, but they differ in how you model workflows, how you observe them, and where each works best.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Durable Functions&lt;/th&gt;
&lt;th&gt;Step Functions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Imperative&lt;/td&gt;
&lt;td&gt;Declarative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Orchestrate with one Function&lt;/td&gt;
&lt;td&gt;Orchestrate among AWS services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;CloudWatch&lt;/td&gt;
&lt;td&gt;Visual Graphs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Here is a general guideline on when to use what&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prefer Lambda Durable Functions when:​

    Workflow is primarily Lambda-based business logic.
    Team wants a code-first model and strong unit-testability.
    You need long-running, pause/resume behavior without building a full state machine.
    You do not require a visual workflow designer or many direct service integrations.

Prefer Step Functions when:

    Workflow orchestrates many AWS services and external systems.
    Visual observability and low-code configuration are important to the team.
    You have complex branching, human-in-the-loop, or very high-fanout orchestration.
    You want a clear separation between orchestration (state machine) and implementation (Lambdas/other services).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;More Durable Function Samples showing advanced features: &lt;a href="https://github.com/manumaan/durable-functions-lambda" rel="noopener noreferrer"&gt;https://github.com/manumaan/durable-functions-lambda&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>news</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Recreating a Nostalgic Game with Q CLI</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Wed, 28 May 2025 08:27:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/recreating-a-nostalgic-game-with-q-cli-pygame-3n8c</link>
      <guid>https://dev.to/aws-builders/recreating-a-nostalgic-game-with-q-cli-pygame-3n8c</guid>
      <description>&lt;p&gt;Back when I was studying at the College of Engineering, Trivandrum (Any CETians here?), during my MCA days, we had tons of fun in our hostel rooms. Only one of us had a UPS for his computer, so whenever there was a power cut, we all gathered in his room. His screen was the only one lit up, and we’d play a Flash game called Hangaroo—a Hangman clone where you had to guess the word before the kangaroo met its fate! Those memories still bring a smile to my face.&lt;/p&gt;

&lt;p&gt;When I came to know about the Q CLI Game Challenge, I immediately knew what I was going to build - a recreation of dear old Hangaroo! &lt;/p&gt;

&lt;p&gt;Q CLI is a wonderful coding assistant that natively works in your terminal. I was able to build 90% of the functionality with just one prompt. Then I spent a couple of hours tweaking the looks and gameplay to suit what I had in mind. &lt;/p&gt;

&lt;p&gt;Why would you use Q CLI, when many other coding assistants exist? Let me list the superpowers: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Natively list and describe AWS resources &lt;/li&gt;
&lt;li&gt;Multi-turn conversations &lt;/li&gt;
&lt;li&gt;Can connect to MCP servers and use tools on them &lt;/li&gt;
&lt;li&gt;Finds and uses your file system, Git, and OS tools. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do you also want to create an app using Q CLI? Below are the steps:&lt;/p&gt;

&lt;p&gt;Install Q CLI:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html" rel="noopener noreferrer"&gt;Q CLI Setup&lt;/a&gt; This should take care of most of the usual usecases. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws/the-essential-guide-to-installing-amazon-q-developer-cli-on-linux-headless-and-desktop-3bo7"&gt;Special situations&lt;/a&gt;&lt;br&gt;
Wonderful article from Ricardo, if the usual cases did not apply to you.  &lt;/p&gt;

&lt;p&gt;Use the command &lt;code&gt;q chat&lt;/code&gt; to start conversing with the Q CLI agent. I asked it to create a hangman-like game using pygame, and told the agent what I remembered about Hangaroo. &lt;/p&gt;

&lt;p&gt;In minutes I had a working game ready to play, with the core functionality up and running. Then I spent some time tweaking the looks and gameplay details that I had forgotten to mention in the first go. I believe that if you follow a plan, review, execute flow with Q, you can probably achieve even better results. I wanted to go at it raw and see how much the AI could achieve with the least amount of input, and I was impressed. &lt;/p&gt;

&lt;p&gt;Gameplay (See if you can guess the movie title before I do) :&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExNTJ2bHp2MTNhZ3l3ZHYyZWs4emRiZmwzcDduYzZpaWhmZzRydGtueiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FSXiQOrR4OUOhsCqurF%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExNTJ2bHp2MTNhZ3l3ZHYyZWs4emRiZmwzcDduYzZpaWhmZzRydGtueiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2FSXiQOrR4OUOhsCqurF%2Fgiphy.gif" width="480" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I showed the game to my son, who was instantly hooked. Now he plans to make his own version of Among Us with Q.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gamedev</category>
      <category>pygame</category>
      <category>awschallenge</category>
    </item>
    <item>
      <title>Voice to Voice AI with Amazon Nova Sonic</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Fri, 16 May 2025 18:19:59 +0000</pubDate>
      <link>https://dev.to/aws-builders/voice-to-voice-ai-with-amazon-nova-sonic-1nb5</link>
      <guid>https://dev.to/aws-builders/voice-to-voice-ai-with-amazon-nova-sonic-1nb5</guid>
      <description>&lt;p&gt;Amazon Nova Sonic is a state-of-the-art speech-to-speech model that delivers real-time, human-like voice conversations with industry-leading price performance and low latency. Available with a bidirectional streaming API on Bedrock, Nova Sonic can enable developers to create truly natural, human-like AI agents that do not require users to type in their requests. What excites me most is that this capability opens AI access to many people who otherwise might struggle to use it. &lt;/p&gt;

&lt;p&gt;Nova Sonic has both masculine-sounding and feminine-sounding voices, and can produce American and British English accents. &lt;/p&gt;

&lt;p&gt;Nova Sonic can be used in Agentic workflows. It can consult knowledge bases using RAG and ground the information it gives to the user. It can do function calling, also called tool use. Since tools are supported, we are just a step away from utilising MCP servers with Nova Sonic. &lt;/p&gt;

&lt;p&gt;Amazon Nova Sonic uses a persistent bidirectional connection that allows simultaneous event streaming in both directions. We use WebSockets in the demo below. This means the conversation can flow very naturally: we can stream the audio continuously, and input can be processed while output is being generated. Just like a human, Nova Sonic can even respond without waiting for a complete utterance from the user. &lt;/p&gt;

&lt;p&gt;Nova Sonic is event-driven: the client and the model exchange structured JSON events, and those events control the session lifecycle, audio streaming, text responses, and tool interactions. &lt;/p&gt;

&lt;p&gt;How do you use Nova Sonic? AWS SDKs in several languages, including Java, JavaScript, C++, Kotlin, and Swift, support the new bidirectional InvokeModelWithBidirectionalStream API. The Python SDK, which uses async features for this, is experimental, but it covers the basics well. &lt;/p&gt;

&lt;p&gt;You will do the following (a Python example, but the same applies elsewhere): &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Sonic client. &lt;/li&gt;
&lt;li&gt;Create function(s) that define how you will handle each event type, such as ContentStart, ContentEnd, etc. &lt;/li&gt;
&lt;li&gt;Start a session with the client. &lt;/li&gt;
&lt;li&gt;Call the invoke API above with await (in the experimental Python SDK). &lt;/li&gt;
&lt;/ol&gt;
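&lt;p&gt;The event-handling part (step 2) boils down to dispatching on the event type inside each JSON payload. A minimal sketch in plain Python; the event names (contentStart, textOutput, contentEnd) follow the Nova Sonic event model, but the handler logic here is illustrative, not the actual SDK:&lt;/p&gt;

```python
import json

def handle_event(raw_event, transcript):
    """Dispatch one structured JSON event from the bidirectional stream."""
    event = json.loads(raw_event)["event"]
    if "contentStart" in event:
        transcript.append("<start>")          # a new content block is opening
    elif "textOutput" in event:
        transcript.append(event["textOutput"]["content"])  # model text
    elif "contentEnd" in event:
        transcript.append("<end>")            # the content block is complete
    return transcript

# Simulated sequence of events as they would arrive over the stream:
transcript = []
for raw in (
    '{"event": {"contentStart": {"type": "TEXT"}}}',
    '{"event": {"textOutput": {"content": "Hello there"}}}',
    '{"event": {"contentEnd": {}}}',
):
    handle_event(raw, transcript)
```

In a real session these handlers run inside the async loop that reads from the bidirectional stream, but the dispatch-on-event-type shape is the same.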

&lt;p&gt;Demo Video Snippet:&lt;br&gt;
&lt;a href="https://youtu.be/ObiFCrq2juk" rel="noopener noreferrer"&gt;Amazon Nova Sonic Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also get started with this Nova Workshop codebase: &lt;br&gt;
&lt;a href="https://github.com/aws-samples/amazon-nova-samples" rel="noopener noreferrer"&gt;Nova Sample code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>amazon</category>
      <category>voice</category>
    </item>
    <item>
      <title>Amazon Inspector</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Wed, 05 Jun 2024 08:52:30 +0000</pubDate>
      <link>https://dev.to/aws-builders/amazon-inspector-4fk4</link>
      <guid>https://dev.to/aws-builders/amazon-inspector-4fk4</guid>
      <description>&lt;p&gt;Inspector is a Vulnerability scanning tool for AWS workloads. &lt;br&gt;
Here is an over view from AWS: &lt;a href="https://www.youtube.com/watch?v=viAn4E7uwRU" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=viAn4E7uwRU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Personal note&lt;/em&gt;: In the context of the other law-enforcement terms used to name AWS security services (Detective, GuardDuty, etc.), Inspector is a bit different. In my country, a police inspector (the inspector that comes to mind when I hear the title) is a law enforcement officer who conducts investigations. I would say AWS Inspector is more like the vehicle inspector, who verifies that your vehicle is configured properly and is not causing pollution. &lt;/p&gt;

&lt;p&gt;For an instance to be scanned by Inspector, it needs to be a managed instance in SSM, and the prerequisites below must be met: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SSM Agent is installed on the EC2 instance and running. &lt;/li&gt;
&lt;li&gt;The instance has an IAM role with the required permissions to talk to SSM. &lt;/li&gt;
&lt;li&gt;Port 443 is open outbound from the EC2 instance so it can reach the SSM service. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Steps to follow if your EC2 instance does not appear as an SSM managed instance: &lt;a href="https://repost.aws/knowledge-center/systems-manager-ec2-instance-not-appear" rel="noopener noreferrer"&gt;https://repost.aws/knowledge-center/systems-manager-ec2-instance-not-appear&lt;/a&gt;&lt;/p&gt;
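&lt;p&gt;You can confirm which instances SSM currently manages (and Inspector can therefore scan) with a quick check against the SSM API. A sketch using the standard &lt;code&gt;describe_instance_information&lt;/code&gt; call; the boto3 client, credentials, and region setup are assumed, and the stub below just demonstrates the response shape:&lt;/p&gt;

```python
def managed_instance_ids(ssm_client):
    """Return the IDs of SSM-managed instances, following NextToken pagination."""
    ids, kwargs = [], {}
    while True:
        resp = ssm_client.describe_instance_information(**kwargs)
        ids += [i["InstanceId"] for i in resp.get("InstanceInformationList", [])]
        if not resp.get("NextToken"):
            return ids
        kwargs = {"NextToken": resp["NextToken"]}

# With real credentials you would pass boto3.client("ssm"); a stub with the
# same response shape shows what the function returns:
class StubSSM:
    def describe_instance_information(self, **kwargs):
        return {"InstanceInformationList": [{"InstanceId": "i-0abc1234"}]}

ids = managed_instance_ids(StubSSM())
```

If your instance ID is missing from this list, work through the repost.aws checklist above before expecting Inspector findings.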

&lt;p&gt;&lt;strong&gt;Types of scanning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 scanning - An agent (the SSM Agent) installed on the EC2 instance scans it for vulnerabilities. &lt;/li&gt;
&lt;li&gt;ECR scanning - Scans images in ECR. &lt;/li&gt;
&lt;li&gt;Lambda scanning - Scans the packages used in a Lambda function.&lt;/li&gt;
&lt;li&gt;Lambda code scanning - Scans the code in a Lambda function. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each vulnerability found, it gives relevant details such as remediation steps, the CVSS (Common Vulnerability Scoring System) score, and the Inspector score, which indicate how critical the vulnerability is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suppression Rules&lt;/strong&gt;&lt;br&gt;
Say you have a web server. Then ports 80 and 443 being open is expected, and you don't want to see warnings for that. &lt;br&gt;
You can create suppression rules to stop specific vulnerabilities from being reported. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vulnerability Database Search&lt;/strong&gt;&lt;br&gt;
Search CVE ID in vulnerability databases to get more info on the reported vulnerability. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep scan of EC2&lt;/strong&gt;&lt;br&gt;
In addition to OS packages, application packages would be inspected for vulnerabilities. You can specify which paths you need to be scanned. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Center for Internet Security (CIS) Benchmark assessments&lt;/strong&gt;&lt;br&gt;
The Center for Internet Security offers a suite of security benchmarks that serve as authoritative guidelines for securing IT systems and are used extensively in the industry. On-demand or scheduled scans can be run against the CIS benchmark for specific operating systems. Resources are selected based on the tags you specify. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export Software BOMs&lt;/strong&gt;&lt;br&gt;
Export the bill of materials for the software packages analysed by Inspector in industry-standard formats. These BOMs contain a hierarchical list of all the individual components in a package. This helps you check for vulnerabilities on systems not reachable from AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspector Demo&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 scanning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create an EC2 instance with an older version of Debian (version 10) from the marketplace. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/marketplace/pp/prodview-vh2uh3o4pdfow#pdp-overview" rel="noopener noreferrer"&gt;https://aws.amazon.com/marketplace/pp/prodview-vh2uh3o4pdfow#pdp-overview&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(This AMI is free to use)&lt;/p&gt;

&lt;p&gt;Create a security group allowing traffic from anywhere (0.0.0.0/0) to port 22, and attach it to the EC2 instance. &lt;/p&gt;

&lt;p&gt;SSH into the EC2, then install and start the SSM Agent. &lt;br&gt;
Steps for this: &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the vulnerabilities on this EC2 instance will be detected by Inspector and shown in the console. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqb4qaiiywptw68xun5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqb4qaiiywptw68xun5v.png" alt="EC2scan" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A security group allowing SSH from anywhere will also be flagged as a vulnerability. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container scanning&lt;/strong&gt;&lt;br&gt;
Pull a Debian 10 image from Docker Hub and push it to Amazon ECR.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker pull debian:10.0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;How to push to ECR: &lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html&lt;/a&gt; &lt;/p&gt;
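&lt;p&gt;The push sequence from that guide condenses to the standard commands below; the account ID &lt;code&gt;123456789012&lt;/code&gt;, the region, and the repository name are placeholders for your own values:&lt;/p&gt;

```shell
# Placeholders: 123456789012 (account ID), us-east-1 (region), debian-scan-demo (repo)
aws ecr create-repository --repository-name debian-scan-demo --region us-east-1

# Authenticate the Docker CLI against your private registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag the pulled image with the registry URI, then push it
docker tag debian:10.0 123456789012.dkr.ecr.us-east-1.amazonaws.com/debian-scan-demo:10.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/debian-scan-demo:10.0
```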

&lt;p&gt;Now vulnerabilities from this Docker image will be detected by Inspector and shown in the console. You can view vulnerabilities per container repository or per container image. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fposlwz09pn0yc68nod98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fposlwz09pn0yc68nod98.png" alt="Containerscan" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda scanning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a Lambda function with some deliberate vulnerabilities. For example, the function below updates a reserved environment variable and binds an insecure socket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import json
import socket

def lambda_handler(event, context):

    # Vulnerability 1: overwrite a reserved Lambda runtime environment variable
    os.environ['_HANDLER'] = 'hello'

    # Vulnerability 2: bind a socket to all interfaces on an ephemeral port
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('', 0))

    return {
        'statusCode': 200,
        'body': json.dumps("Inspector Code Scanning", default=str)
    } 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code needs to use a runtime supported by Inspector for it to be scanned. Often this is not the latest version but the one before it: for Python it is 3.11 as of writing this article, whereas the latest Lambda runtime is 3.12.&lt;/p&gt;

&lt;p&gt;You can check the supported versions here:&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/inspector/latest/user/supported.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/inspector/latest/user/supported.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the vulnerabilities from the lambda code can be seen in the console. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy0u8wq5w76eqh498wqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy0u8wq5w76eqh498wqr.png" alt="Lambda" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>vulnerabilities</category>
    </item>
    <item>
      <title>AWS Detective</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Mon, 29 Apr 2024 06:13:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-detective-3p1p</link>
      <guid>https://dev.to/aws-builders/aws-detective-3p1p</guid>
      <description>&lt;p&gt;AWS Detective service helps to analyze security issues.&lt;/p&gt;

&lt;p&gt;It automatically collects and analyzes security logs such as VPC Flow Logs, CloudTrail events, and GuardDuty findings, and uses machine learning, graph theory, and visualization to help with root cause analysis (RCA). &lt;/p&gt;

&lt;p&gt;You can answer questions like: &lt;br&gt;
1) How did this security incident happen? &lt;br&gt;
2) Where was the first intrusion? &lt;br&gt;
3) How can such incidents be prevented? &lt;/p&gt;

&lt;p&gt;Amazon Detective requires that Amazon GuardDuty has been enabled on your accounts for at least 48 hours before you enable Detective on those accounts. Findings are sent from GuardDuty to Detective every 6 hours by default; this can be changed to as fast as every 15 minutes. Detective takes two weeks of data to build a historical baseline. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding Groups&lt;/strong&gt;&lt;br&gt;
Detective groups findings from various services together by incident, so you can see the related findings in one place. Each group shows the severity, the entities affected, the MITRE tactic used, and so on. The group is rendered as a graph that lets you see the relationships between the various events that occurred. By default, the graph visualization is force-directed; you can manipulate the graph to get more details or different views. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyt5wb5a28oj13k1vr2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyt5wb5a28oj13k1vr2j.png" alt="FindignGroup" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For principals, EC2 instances, and EKS clusters, you can see which entities made the most API calls, along with success and failure counts.&lt;br&gt;
Powerful search functionality lets you search the incidents in the environment through various options, so you can see whether the failures are consistent or form a suspicious pattern. &lt;/p&gt;

&lt;p&gt;Inside the investigation, you can see a visualization of how different incidents and entities are related to each other. You can manipulate this graph, and research the information that Detective gathers to gain insight into the security incident.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t54dboy2kro0iuahc8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t54dboy2kro0iuahc8a.png" alt="FindingGroup2" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigations&lt;/strong&gt;&lt;br&gt;
For findings in GuardDuty, you have the option to pivot to Detective and investigate the finding concerning the different entities involved (EC2 instance, IAM Role, Account, etc. ) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdslmmds4i5assi3fbcn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdslmmds4i5assi3fbcn5.png" alt="InvestigateFromGD" width="800" height="1004"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This runs an investigation over the data gathered so far and creates a report of all the related and relevant information. A security analyst can then use it to work out how the attack occurred (e.g. a day with many failed API calls from an SSH brute-force attempt) and which remediation actions to take (isolate the EC2 instance, revoke sessions, rotate keys, etc.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgty43wjf4d8b78xf3c04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgty43wjf4d8b78xf3c04.png" alt="Investigate2" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floo5r3q5sks4irik9jij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floo5r3q5sks4irik9jij.png" alt="Investigate3" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analysts can see how the findings map to tactics, techniques, and procedures (TTPs), each classified by severity. The console shows the techniques and actions used; selecting a specific TTP shows its details in the right pane. &lt;/p&gt;

&lt;p&gt;Once the analyst has enough information about the incident, they can take remediation steps (isolate the EC2 instance, revoke sessions, rotate keys, etc.).&lt;/p&gt;

&lt;p&gt;More information can be gained from this walkthrough of Detective (From AWS): &lt;a href="https://www.youtube.com/watch?v=Rz8MvzPfTZA"&gt;https://www.youtube.com/watch?v=Rz8MvzPfTZA&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>detective</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS Guard Duty</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Tue, 23 Apr 2024 13:51:20 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-guard-duty-1bkh</link>
      <guid>https://dev.to/aws-builders/aws-guard-duty-1bkh</guid>
      <description>&lt;p&gt;Guard Duty&lt;/p&gt;

&lt;p&gt;First off, a simple definition: &lt;br&gt;
GuardDuty is a guard that stands in front of your workload and continuously alerts you to any threats heading toward it.&lt;/p&gt;

&lt;p&gt;Now the real definition:&lt;br&gt;
Amazon GuardDuty offers threat detection enabling you to continuously monitor and protect your AWS accounts, workloads, and data stored in Amazon Simple Storage Service (Amazon S3). GuardDuty analyzes continuous metadata streams generated from your account and network activity found in AWS CloudTrail Events, Amazon Virtual Private Cloud (VPC) Flow Logs, and domain name system (DNS) Logs. GuardDuty also uses integrated threat intelligence such as known malicious IP addresses, anomaly detection, and machine learning (ML) to more accurately identify threats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features of Guard Duty&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupatwnkilj8glyzzz0b6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupatwnkilj8glyzzz0b6.png" alt="GDCONSOLE" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 Protection&lt;/strong&gt; - monitors object-level activity for suspicious access &lt;br&gt;
&lt;strong&gt;EKS Protection&lt;/strong&gt; - monitors suspicious activity on EKS clusters&lt;br&gt;
&lt;strong&gt;Runtime Monitoring&lt;/strong&gt; - uses an agent to monitor suspicious activity on ECS (Fargate), EKS, and EC2&lt;br&gt;
&lt;strong&gt;Malware Protection&lt;/strong&gt; - scans EBS volumes for malware&lt;br&gt;
&lt;strong&gt;RDS Protection&lt;/strong&gt; - monitors login activity on Aurora databases&lt;br&gt;
&lt;strong&gt;Lambda Protection&lt;/strong&gt; - monitors network traffic from Lambda executions &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suppression Rules&lt;/strong&gt; - Automatically archive findings that match criteria you define, such as low-value or false-positive findings, to reduce the noise. &lt;/p&gt;
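&lt;p&gt;As a sketch, a suppression rule can also be created programmatically with the GuardDuty CreateFilter API and the ARCHIVE action. The detector ID, rule name, and criteria below are hypothetical, and the AWS call only executes when RUN_AWS=1 is set.&lt;/p&gt;

```shell
# Hypothetical suppression rule: auto-archive findings from a known benign scanner IP.
printf '%s\n' '{"Criterion":{"service.action.networkConnectionAction.remoteIpDetails.ipAddressV4":{"Eq":["198.51.100.7"]}}}' > criteria.json

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws guardduty create-filter \
    --detector-id "12abc34d567e8fa901bc2d34e56789f0" \
    --name suppress-known-scanner \
    --action ARCHIVE \
    --finding-criteria file://criteria.json
fi
```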

&lt;p&gt;&lt;strong&gt;Threat List&lt;/strong&gt; - A list of known malicious IPs. It can be supplied in several formats, including industry-standard formats like STIX and OTX, or even plain text. Lists can be stored at any accessible internet URI, including your own S3 bucket. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trusted List&lt;/strong&gt; - A list of known trusted IPs, with the same storage and format options. GuardDuty does not generate findings for IPs on a trusted list. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Findings&lt;/strong&gt; - You can drill down into the findings to get more information about the incident, including the target of the attack, the actor behind it, and more. Each finding also provides a link to pivot to Detective and investigate the incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Malicious IP access &lt;/p&gt;

&lt;p&gt;We create a text file containing a known IP, say 8.8.8.8, upload it to an S3 bucket, and register the file as a Threat List in GuardDuty. &lt;/p&gt;
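&lt;p&gt;As a sketch (the bucket name is a placeholder, and the upload only runs when RUN_AWS=1 is set):&lt;/p&gt;

```shell
# Create a plaintext threat list containing a single known IP.
# 8.8.8.8 is Google's public DNS, which is harmless; that is what makes it a
# convenient "malicious" IP for a demo.
printf '8.8.8.8\n' > threatlist.txt

if [ "${RUN_AWS:-0}" = "1" ]; then
  # Upload to S3 (placeholder bucket), then register the object's URL as a
  # Threat List in the GuardDuty console.
  aws s3 cp threatlist.txt s3://my-guardduty-lists/threatlist.txt
fi
```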

&lt;p&gt;From an EC2 inside your account, ping this IP. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;ping 8.8.8.8&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Soon, GuardDuty finds this and you can see it in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2yfu8mlhu874fwz6rmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2yfu8mlhu874fwz6rmc.png" alt="MALICIOUSIP" width="800" height="965"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. Instance Credential Exfiltration&lt;/p&gt;

&lt;p&gt;An EC2 instance can have an IAM role (via an instance profile) that lets it make AWS API calls according to the role's permissions. We can simulate the scenario where a hacker has gained access to the EC2 instance and is using these credentials to call AWS APIs. &lt;/p&gt;

&lt;p&gt;Log in to the EC2 instance and retrieve the IAM credentials. &lt;br&gt;
The exact commands depend on the Instance Metadata Service (IMDS) version used by the instance. &lt;br&gt;
See this page for details: &lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case I am on IMDSv2, and my EC2 instance has the role ec2-admin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&amp;amp;&amp;amp; curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/ec2-admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the output of this command note the Access Key ID, Secret Access Key and Session Token. &lt;/p&gt;
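&lt;p&gt;The metadata response is JSON, and the three fields can be pulled out with a little shell. Here is a sketch that uses a sample (entirely fake) response in place of the live IMDS output:&lt;/p&gt;

```shell
# A sample security-credentials response with fake values; in the real scenario
# this would be the JSON returned by the curl command above.
printf '%s\n' '{"Code":"Success","AccessKeyId":"ASIAEXAMPLEKEYID","SecretAccessKey":"wJalrEXAMPLESECRET","Token":"IQoJEXAMPLETOKEN","Expiration":"2024-04-23T19:00:00Z"}' > creds.json

# Extract each field with sed (jq would also work, if installed).
ACCESS_KEY_ID=$(sed -n 's/.*"AccessKeyId":"\([^"]*\)".*/\1/p' creds.json)
SECRET_ACCESS_KEY=$(sed -n 's/.*"SecretAccessKey":"\([^"]*\)".*/\1/p' creds.json)
SESSION_TOKEN=$(sed -n 's/.*"Token":"\([^"]*\)".*/\1/p' creds.json)

echo "AccessKeyId: $ACCESS_KEY_ID"
```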

&lt;p&gt;Now replicate the same session in any other terminal where the AWS CLI is installed. The commands below do that: they create a profile called badbob (BAD BOB!), who is the hacker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
aws configure set profile.badbob.region us-east-1

aws configure set profile.badbob.aws_access_key_id &amp;lt;AccessKeyId&amp;gt;

aws configure set profile.badbob.aws_secret_access_key &amp;lt;SecretAccessKey&amp;gt;

aws configure set profile.badbob.aws_session_token &amp;lt;Token&amp;gt;

export AWS_DEFAULT_PROFILE=badbob

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, using the session, issue several AWS API calls; an example is below. Remember that the hacker does not know the permissions on the role, so they would try many commands across the spectrum. Simulate this by trying a wide range of calls. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws s3 ls --profile badbob&lt;/code&gt;&lt;/p&gt;
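&lt;p&gt;The "try many commands" step can be sketched as a loop over read-only recon calls. The calls only execute when RUN_AWS=1 is set; otherwise the loop just records what it would run:&lt;/p&gt;

```shell
# Read-only recon calls an attacker might try with the stolen badbob profile.
for cmd in \
  "aws s3 ls --profile badbob" \
  "aws iam list-users --profile badbob" \
  "aws ec2 describe-instances --profile badbob" \
  "aws sts get-caller-identity --profile badbob"
do
  if [ "${RUN_AWS:-0}" = "1" ]; then
    $cmd || true   # many calls will fail with AccessDenied; that is expected
  else
    echo "would run: $cmd"
  fi
done > recon.log

cat recon.log
```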

&lt;p&gt;You can see GuardDuty finds this suspicious activity and reports it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg45w4vhqtr2z2t0ehwoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg45w4vhqtr2z2t0ehwoz.png" alt="EXFILTRATION" width="800" height="1004"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that in both cases, GuardDuty provides a wealth of background information that helps the security team investigate the finding. &lt;/p&gt;

&lt;p&gt;This includes an overview, the resources involved in the finding, IAM details, network details, the action observed, the actor involved, and more. &lt;/p&gt;

&lt;p&gt;If you have also enabled another AWS tool, AWS Detective (which requires GuardDuty to have been enabled for at least 48 hours), you will see the option to investigate this finding in Detective. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4me8ogs8oc0xt90r363.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4me8ogs8oc0xt90r363.png" alt="DETECTIVE" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will continue this discussion with an article on &lt;a href="https://dev.to/manumaan/aws-detective-3p1p"&gt;AWS Detective. &lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ids</category>
      <category>security</category>
    </item>
    <item>
      <title>How do you manage IaC promotion?</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Fri, 19 Apr 2024 13:26:45 +0000</pubDate>
      <link>https://dev.to/manumaan/how-do-you-manage-iac-promotion-2g1p</link>
      <guid>https://dev.to/manumaan/how-do-you-manage-iac-promotion-2g1p</guid>
      <description>&lt;p&gt;How do you manage the changes in Infrastructure as code, with respect to testing before putting into production? Production infra might differ a lot from the lower environments. Sometimes the infra component we are making a change to, may not even exist on a non-prod environment. &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Chaos Engineering in AWS with FIS</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Wed, 06 Mar 2024 09:59:33 +0000</pubDate>
      <link>https://dev.to/aws-builders/chaos-engineering-in-aws-with-fis-3h04</link>
      <guid>https://dev.to/aws-builders/chaos-engineering-in-aws-with-fis-3h04</guid>
      <description>&lt;p&gt;&lt;strong&gt;Chaos Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. It is not about creating chaos, it is about making the inherent chaos in real-world applications visible to you.&lt;/p&gt;

&lt;p&gt;One of the most famous Chaos Engineering tools was Netflix's Chaos Monkey, which would shut down random machines in the environment to check the effect on availability. &lt;/p&gt;

&lt;p&gt;In the Well-Architected Framework, Chaos Engineering is a part of the Reliability pillar. &lt;/p&gt;

&lt;p&gt;The design principles of the Reliability pillar say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Automatically recover from failure&lt;/li&gt;
&lt;li&gt;Test recovery procedures&lt;/li&gt;
&lt;li&gt;Scale horizontally to increase aggregate workload availability&lt;/li&gt;
&lt;li&gt;Stop guessing capacity&lt;/li&gt;
&lt;li&gt;Manage change through automation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AWS FIS (Fault Injection Simulator) corresponds directly to the 2nd principle and indirectly to the 1st principle. &lt;/p&gt;

&lt;p&gt;FIS allows you to create Chaos Experiments and test your workload to see if your workload is designed reliably. It shows whether your recovery procedures work, and whether you can automatically recover from failures. It also gives you an idea of downtime that can be expected in the particular DR strategy you have chosen.&lt;/p&gt;

&lt;p&gt;We recently had this question from a customer. Their architecture uses only 2 AZs, and they have a multi-AZ RDS database with a cross-region read replica. If one of the AZs goes down, will the cross-region replica continue to function? Will it stay in sync with the DB after the multi-AZ failover happens? How long will we be unable to access the RDS DB and the read replica? We want to optimistically answer "yes" to the first questions and "a very short time" to the last one. But how can we be sure? &lt;/p&gt;

&lt;p&gt;You create an FIS experiment and test it out. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of AWS FIS&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Managed Chaos Experiments: AWS FIS provides templates for creating and managing chaos experiments, allowing teams to focus on results rather than the intricacies of setup and execution.

Broad Fault Injection Capabilities: Simulate a wide range of failures, including server outages, network latency, unavailability of EC2 resources, and throttled database access, to understand their impact on your application.

Integration with AWS Services: Seamlessly integrate with other AWS services such as Amazon EC2, Amazon RDS, Amazon ECS, AWS Lambda, VPC, etc, enabling a comprehensive testing environment.

Safety and Security: AWS FIS is built with safety in mind, offering mechanisms to limit the blast radius of experiments and ensure that your production environments remain secure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Experiment:&lt;/strong&gt; An experiment in AWS Fault Injection Simulator (FIS) is a controlled procedure designed to assess the resilience of your AWS infrastructure by intentionally introducing faults or disruptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template:&lt;/strong&gt; A template defines the actions (faults) to be executed, the targets (AWS resources) those actions will affect, and any conditions or constraints. Templates ensure experiments are reproducible and standardized.&lt;br&gt;
An experiment is a template in action. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actions:&lt;/strong&gt; Actions are the specific faults or disruptions you want to introduce. AWS FIS supports a variety of actions, such as stopping an EC2 instance, injecting latency into a network, and throttling database I/O operations, among others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Targets:&lt;/strong&gt; Targets are the AWS resources upon which actions will be performed. Targets can be specified explicitly or selected dynamically based on tags or other identifiers, allowing for flexibility in defining the scope of the experiment. For example: which RDS instance to reboot, or which subnet to disrupt the network in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop Conditions:&lt;/strong&gt; To ensure safety and prevent unintended consequences, experiments can include stop conditions. These are criteria that, when met, will automatically terminate the experiment. This can be defined as a CloudWatch Alarm. &lt;/p&gt;
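&lt;p&gt;These pieces come together in an experiment template document. The following is a minimal sketch of such a document, of the kind the console builds for you; the subnet ID, account ID, alarm ARN, and role ARN are all placeholders, and the create call only executes when RUN_AWS=1 is set.&lt;/p&gt;

```shell
# Minimal FIS experiment template sketch (all IDs and ARNs are placeholders).
printf '%s\n' \
  '{' \
  '  "description": "Disrupt network connectivity in one AZ",' \
  '  "targets": {' \
  '    "az-subnets": {' \
  '      "resourceType": "aws:ec2:subnet",' \
  '      "resourceArns": ["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0abc1234def567890"],' \
  '      "selectionMode": "ALL"' \
  '    }' \
  '  },' \
  '  "actions": {' \
  '    "disrupt-network": {' \
  '      "actionId": "aws:network:disrupt-connectivity",' \
  '      "parameters": { "duration": "PT2M", "scope": "all" },' \
  '      "targets": { "Subnets": "az-subnets" }' \
  '    }' \
  '  },' \
  '  "stopConditions": [' \
  '    { "source": "aws:cloudwatch:alarm",' \
  '      "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:fis-stop" }' \
  '  ],' \
  '  "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role"' \
  '}' > fis-template.json

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws fis create-experiment-template --cli-input-json file://fis-template.json
fi
```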

&lt;p&gt;&lt;strong&gt;Creating an Experiment Template&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Navigate to the AWS FIS console -&amp;gt; Experiment Template. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6qijv6x0wysa5kf42oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6qijv6x0wysa5kf42oj.png" alt="FIS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the account this experiment will run on&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F904slbid9w4tvqaggj1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F904slbid9w4tvqaggj1c.png" alt="fis account"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After giving a name and description, select the actions you want to include. I want to include a network disruption action that would disrupt all network connectivity in one AZ us-east-1a. &lt;/p&gt;

&lt;p&gt;Action Type: NETWORK, aws-network-disrupt-connectivity&lt;br&gt;
Target: (a target node is created automatically for the action; you will edit it later)&lt;br&gt;
Duration: I will disrupt the network for 2 minutes&lt;br&gt;
Scope: all (all types of network connectivity will be disrupted, including connectivity to regional services like S3)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u44x60kou4p1k8hoxep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u44x60kou4p1k8hoxep.png" alt="fis network disrupt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now edit the Targets node. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmppp41qu4owz7h1w6m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmppp41qu4owz7h1w6m9.png" alt="fis target subnet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can select targets using resource tags, filters or directly using resource IDs. I am selecting the subnet ID:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0zc8ghvab83pwmaaiot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0zc8ghvab83pwmaaiot.png" alt="subnet-select"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I create the action for DB instance going down:&lt;/p&gt;

&lt;p&gt;Action type: RDS aws:rds:reboot-db-instances&lt;br&gt;
Start-After: network-disruption (I want the DB to go down after the first action)&lt;br&gt;
Target: (a target node is created automatically for the action; you will edit it later)&lt;br&gt;
Force failover: Yes (this will cause a failover to the standby instance)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhpn2doa8unlw6dy7o5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhpn2doa8unlw6dy7o5v.png" alt="fis actions db"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now edit the Targets node: (Select the DB instance I have) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj97b8st19475400hptdo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj97b8st19475400hptdo.png" alt="fis actions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssw3efgcdbh8l4y72bt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssw3efgcdbh8l4y72bt9.png" alt="selected db"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actions are done, now specify the additional options:&lt;/p&gt;

&lt;p&gt;Empty target resolution mode: What happens if you specified tags to find the targets by, and at runtime no targets are found? I am specifying that the experiment should fail in that case. &lt;/p&gt;

&lt;p&gt;Service Access: Which IAM role will the experiment use? This role needs CloudWatch access if you want to send experiment logs to a CloudWatch log group. The default IAM role for FIS does not have CloudWatch access; edit the policy to add it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgsmwkepyjh56rb2eiyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgsmwkepyjh56rb2eiyd.png" alt="fis options"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stop Condition: Suppose you wanted to bring down one AZ but ended up bringing down everything. Now you are panicking and want to stop the experiment. Specifying a CloudWatch Alarm lets you stop the experiment by putting the alarm into the ALARM state. &lt;br&gt;
For example, the alarm could be configured to fire when there are more than X messages in a queue. &lt;/p&gt;
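&lt;p&gt;A stop-condition alarm of that kind can be sketched with the AWS CLI. The queue name and threshold are hypothetical, and the call only executes when RUN_AWS=1 is set.&lt;/p&gt;

```shell
# Hypothetical stop-condition alarm: fire when an SQS queue backs up.
ALARM_NAME="fis-stop-queue-backlog"
QUEUE_NAME="orders-queue"   # placeholder queue name
THRESHOLD=100               # "more than X messages" in the queue

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws cloudwatch put-metric-alarm \
    --alarm-name "$ALARM_NAME" \
    --namespace AWS/SQS \
    --metric-name ApproximateNumberOfMessagesVisible \
    --dimensions Name=QueueName,Value="$QUEUE_NAME" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold "$THRESHOLD" \
    --comparison-operator GreaterThanThreshold
fi
```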

&lt;p&gt;You can also stop the experiment from the console by clicking "stop experiment" if you have access.&lt;/p&gt;

&lt;p&gt;Logs: You can send the logs of the experiment to an S3 bucket or a CloudWatch Log Group. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziazl0e2sd7v8p43xhdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fziazl0e2sd7v8p43xhdz.png" alt="Logs_Stop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the Experiment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at our steady state. I have a multi-AZ RDS MySQL DB instance; currently the primary is in us-east-1a.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g6jl3uckrmtssuweoim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g6jl3uckrmtssuweoim.png" alt="2az-db"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This DB spans 2 AZs, as is evident from the DB's subnet group, which includes us-east-1a and us-east-1b. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht97i3vz1p15wq0jkvbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht97i3vz1p15wq0jkvbq.png" alt="multiaz-db"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please note: both subnets in the group must have the same network accessibility, which can be achieved by associating them with the same route table. Otherwise, after failover, the DB may end up in a subnet with no access. &lt;/p&gt;

&lt;p&gt;At this point, both DB and the replica are in sync:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7owo2onawvj3wqli0qnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7owo2onawvj3wqli0qnr.png" alt="in-sync"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you save the experiment template, start the experiment by clicking Start Experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2n6gef7mxzfwefpxiu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2n6gef7mxzfwefpxiu7.png" alt="StartExp"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the status changing in actions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrh8rsc627cubjsz2rha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrh8rsc627cubjsz2rha.png" alt="status"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In logs, you can see the targets that have been resolved. &lt;br&gt;
It has found the following subnets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffovie4m8ctir4jumqz57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffovie4m8ctir4jumqz57.png" alt="subnet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has found the DB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlk4evil9p4fdebnkwpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlk4evil9p4fdebnkwpc.png" alt="db target"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the network disruption is in progress, your MySQL connection will appear stuck. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltd3f9w07wj3niosaf23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltd3f9w07wj3niosaf23.png" alt="mysqlstuck"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the DB action starts, you will see the DB instance rebooting. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fool5tpijvminmu2e5sut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fool5tpijvminmu2e5sut.png" alt="reboot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it is rebooting, you won't be able to insert into the DB. &lt;/p&gt;

&lt;p&gt;When the console shows the DB as "Available" again, create a new connection to the DB. &lt;/p&gt;

&lt;p&gt;Note that the console may not show the AZ change immediately; it takes some time to reflect. But once the DB says available, it has already failed over to the other AZ. Behind the scenes, the failover is a DNS change.&lt;/p&gt;
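&lt;p&gt;Because the failover is a DNS change, you can watch it happen by resolving the endpoint before and after the reboot. The endpoint below is a placeholder, and the lookup only executes when RUN_AWS=1 is set.&lt;/p&gt;

```shell
# Resolving the RDS endpoint returns the current primary's private IP; after a
# multi-AZ failover the same name resolves to the standby's IP in the other AZ.
DB_ENDPOINT="mydb.abcdefgh1234.us-east-1.rds.amazonaws.com"  # placeholder endpoint

if [ "${RUN_AWS:-0}" = "1" ]; then
  # Run this before and after the failover and compare the answers.
  nslookup "$DB_ENDPOINT"
fi
```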

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeatl1wznqq6aon4qyc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzeatl1wznqq6aon4qyc7.png" alt="Insert"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also query from the replica and it is still in sync!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yl5vlot54ni0c666yk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yl5vlot54ni0c666yk5.png" alt="replica"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We observed a downtime of about 2 minutes before the RDS DB came back up, on our tiny db.t3.micro instance. More powerful instances take less time. &lt;/p&gt;

&lt;p&gt;After some minutes console shows the new AZ for primary: us-east-1b&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx7u80wxy2ppildd60j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx7u80wxy2ppildd60j4.png" alt="post-failover"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actions should show as completed now. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1077icq316xr3o5ycml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1077icq316xr3o5ycml.png" alt="completed1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The experiment is completed. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdpzifaaloarkbgwo51f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdpzifaaloarkbgwo51f.png" alt="completed2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By using FIS, we can test the resiliency of our applications on AWS. In this demo, we tested how a Multi-AZ database and its read replica hold up when one of the two AZs is lost. &lt;/p&gt;

&lt;p&gt;More info can be found here: &lt;a href="https://aws.amazon.com/fis/" rel="noopener noreferrer"&gt;https://aws.amazon.com/fis/&lt;/a&gt;&lt;br&gt;
A great demo from re:Invent: &lt;a href="https://www.youtube.com/watch?v=N0aZZVVZiUw" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=N0aZZVVZiUw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope this was helpful!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Installing MySQL on Amazon Linux 2023</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Thu, 15 Feb 2024 13:18:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/installing-mysql-on-amazon-linux-2023-1512</link>
      <guid>https://dev.to/aws-builders/installing-mysql-on-amazon-linux-2023-1512</guid>
<description>&lt;p&gt;MySQL does not come installed by default on Amazon Linux 2023. &lt;/p&gt;

&lt;p&gt;I ran into a few challenges installing it and had to consult multiple blogs before getting it working. &lt;/p&gt;

&lt;p&gt;Follow these steps to install it. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the RPM file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo wget https://dev.mysql.com/get/mysql80-community-release-el9-1.noarch.rpm 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Install RPM file
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo dnf install mysql80-community-release-el9-1.noarch.rpm -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Import the MySQL public GPG key, which is needed to verify and install the software.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2023
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;If you need the MySQL client:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo dnf install mysql-community-client -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you need the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo dnf install mysql-community-server -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
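&lt;p&gt;After installing the server, you will typically want to start it and fetch the temporary root password that MySQL generates on first start. These are the standard steps for MySQL Community Server on EL-based distros; log paths may differ on your system:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Start the server and enable it on boot
sudo systemctl enable --now mysqld

# MySQL writes a temporary root password to the log on first start
sudo grep 'temporary password' /var/log/mysqld.log

# Log in with it (you will be prompted to set a new password)
mysql -u root -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;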



</description>
    </item>
    <item>
      <title>Terragrunt with Terraform in AWS</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Mon, 12 Feb 2024 08:13:02 +0000</pubDate>
      <link>https://dev.to/manumaan/terragrunt-with-terraform-in-aws-2n10</link>
      <guid>https://dev.to/manumaan/terragrunt-with-terraform-in-aws-2n10</guid>
      <description>&lt;p&gt;&lt;strong&gt;Intro&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terragrunt is a utility that can be used along with Terraform, to alleviate some of the challenges that come with using Terraform at scale. Today we will see Terragrunt in action when using it to create infra on AWS. &lt;/p&gt;

&lt;p&gt;The main benefit of Terragrunt is reducing repetition and keeping the code DRY (the Don't Repeat Yourself principle). &lt;br&gt;
With a single AWS account or environment, this problem may not arise. But as soon as you start managing multiple accounts or environments, you end up copying code from one place to another, which invites manual mistakes and redundant code. Terragrunt helps with this. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
To use Terragrunt with Terraform on AWS, you will need to install the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS CLI&lt;/li&gt;
&lt;li&gt;Terraform&lt;/li&gt;
&lt;li&gt;Terragrunt &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To use other functionalities we will see, we need: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;TFlint (For before-hooks that run linting)&lt;br&gt;
&lt;a href="https://github.com/terraform-linters/tflint" rel="noopener noreferrer"&gt;tflint&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSH - Add your SSH key to your GitHub account and make sure you can pull and push from GitHub over SSH (needed for pulling Terraform code from a git repo at run time). Also add GitHub to your known hosts: &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(host=github.com; ssh-keyscan -H $host; for ip in $(dig @8.8.8.8 github.com +short); do ssh-keyscan -H $host,$ip; ssh-keyscan -H $ip; done) 2&amp;gt; /dev/null &amp;gt;&amp;gt; ~/.ssh/known_hosts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can use this git repo to go through the demo examples:&lt;br&gt;
&lt;a href="https://github.com/manumaan/my-terragrunt-demo" rel="noopener noreferrer"&gt;Terragrunt Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code will create 2 EC2 instances, one using local code and one pulling the code from a remote git repo. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6frnv8j8ynk6qwf05if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6frnv8j8ynk6qwf05if.png" alt="Ec2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow of using Terragrunt&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the prerequisites above. &lt;/li&gt;
&lt;li&gt;Create a file called terragrunt.hcl which holds the Terragrunt configuration. &lt;/li&gt;
&lt;li&gt;Run terragrunt commands instead of terraform commands:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;terraform init --&amp;gt; terragrunt init&lt;/li&gt;
&lt;li&gt;terraform plan --&amp;gt; terragrunt plan&lt;/li&gt;
&lt;li&gt;terraform apply --&amp;gt; terragrunt apply&lt;/li&gt;
&lt;li&gt;terraform destroy --&amp;gt; terragrunt destroy&lt;/li&gt;
&lt;/ul&gt;
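&lt;p&gt;For step 2, a minimal terragrunt.hcl can be as small as the sketch below. The module path and input value here are purely illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# terragrunt.hcl - point Terragrunt at the Terraform code to run
terraform {
  source = "../modules//ec2"   # hypothetical local module path
}

# Values passed to the module as Terraform variables
inputs = {
  instance_type = "t3.micro"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;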

&lt;p&gt;&lt;strong&gt;Keep your code DRY with Terragrunt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Backend configuration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Backend configuration in Terraform does not allow variables. So people copy the configuration from one environment to another and manually change values, which can lead to errors. With Terragrunt, we keep it parameterized. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Keep your backend DRY
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "tf-state-manum-0202041706"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock-table"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above code goes in terragrunt.hcl and means the state will be kept in a different path for each module. Note that the key is parameterized and changes per module. &lt;/p&gt;

&lt;p&gt;The image below shows the backend bucket, where each module's state is kept under a different prefix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo13vfr8ayfgb3fcw0izv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo13vfr8ayfgb3fcw0izv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Keep provider configuration DRY&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note the below code in terragrunt.hcl. It uses assume_role to have Terraform assume a specific IAM role in AWS. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Keep your provider DRY
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = &amp;lt;&amp;lt;EOF
provider "aws" {
  region  = "us-east-1"
  profile = "admin"
  assume_role {
    role_arn = "arn:aws:iam::644107485976:role/github_actions_role"
  }
}
EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Keep your CLI arguments dry&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many cases, you have variables to pass to Terraform that change for each account and each environment. These may be kept in different files and passed to Terraform with the CLI argument &lt;code&gt;-var-file&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is cumbersome. Terragrunt avoids it by injecting the variables with the below code in terragrunt.hcl. &lt;br&gt;
Variables from account.tfvars and region.tfvars will be injected into the modules where you run terragrunt and made available to them with the syntax var.xyz&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


#Keep your CLI DRY
terraform {
  extra_arguments "common_vars" {
    commands = ["plan", "apply"]

    required_var_files = [
     "${get_parent_terragrunt_dir()}/account.tfvars",
     "${get_parent_terragrunt_dir()}/region.tfvars"
    ]
  }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Before, After, Error Hooks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Terragrunt allows you to hook into the lifecycle of terraform runs, and you can run custom code either before, after, or in case of an error in terraform. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvzz1abtulgypsixi856.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvzz1abtulgypsixi856.png" alt="before-hook"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code below uses a before_hook to print "Start Terraform" (via the echo command) before each module's Terraform run. &lt;/p&gt;

&lt;p&gt;An after_hook prints "Finished running Terraform" when the run ends. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03qmhffmfb8a00zoudjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03qmhffmfb8a00zoudjo.png" alt="after-hook"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An error_hook executes in case of an error. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before_hook "before_hook" {
  commands = ["apply", "plan"]
  execute  = ["echo", "Start Terraform"]
}

after_hook "after_hook" {
  commands     = ["apply", "plan"]
  execute      = ["echo", "Finished running Terraform"]
  run_on_error = true
}

error_hook "import_resource" {
  commands  = ["apply"]
  execute   = ["echo", "Error Hook executed"]
  on_errors = [
    ".*",
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This brings up an interesting possibility. We could run some linters or other code validators using before_hook, like below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before_hook "before_hook" {
  commands = ["apply", "plan"]
  execute  = ["tflint"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will look for a .tflint.hcl file in the path where you are running the code. &lt;/p&gt;

&lt;p&gt;Here's a minimal .tflint.hcl &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

config {
  module = true
}

// Plugin configuration
plugin "aws" {
  enabled = true
  version = "0.29.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

plugin "terraform" {
  enabled = true
  preset  = "recommended"
  version = "0.4.0"
  source  = "github.com/terraform-linters/tflint-ruleset-terraform"
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;RUN-ALL command&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Terragrunt brings the run-all command to Terraform. Say you have 100 modules in a folder and want to create all that infra. Normally you would have to run terraform plan and apply 100 times, manually or through a script. With Terragrunt you just run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

terragrunt run-all plan 
terragrunt run-all apply


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijqx5a946xvj6jxg5ne2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijqx5a946xvj6jxg5ne2.png" alt="Run-all command"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: you can use the skip attribute to leave out some of the modules.&lt;/p&gt;
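&lt;p&gt;For example, placing this in a module's terragrunt.hcl excludes that module from run-all commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This module will be skipped by run-all plan/apply/destroy
skip = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;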

&lt;p&gt;&lt;em&gt;Different Versions of code in Different Environments&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One powerful feature of Terragrunt is pulling versioned code out of a code repository and running that Terraform code. Say production runs on stable code, and in dev you want to try out a newer 1.1 version. You can tag that version of the code as v1.1.0 on GitHub (or another repository) and pull it at run time, with code like the below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

terraform {
  source = "git::git@github.com:manumaan/Terragrunt_Demo.git//compute?ref=v1.1.0"
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above code pulls the compute module at tag v1.1.0 from the GitHub repo &lt;a href="https://www.github.com/manumaan/Terragrunt_Demo" rel="noopener noreferrer"&gt;https://www.github.com/manumaan/Terragrunt_Demo&lt;/a&gt; and runs it in that module. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Some Extra Features&lt;/em&gt;&lt;br&gt;
Terragrunt will automatically retry transient errors (what counts as transient is defined by Terragrunt). It will also automatically run init if init has not been run in that path. &lt;/p&gt;
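&lt;p&gt;The retry behavior can also be tuned in terragrunt.hcl. A sketch (the regex here is just an example of an error you might choose to treat as transient):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Treat matching errors as transient and retry up to 3 times,
# waiting 5 seconds between attempts
retryable_errors = [
  "(?s).*Failed to load state.*tcp.*timeout.*",
]
retry_max_attempts       = 3
retry_sleep_interval_sec = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;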

&lt;p&gt;&lt;em&gt;Debugging Terragrunt&lt;/em&gt; &lt;br&gt;
You can set the log level on commands to get more detailed logs. The levels are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;panic&lt;/li&gt;
&lt;li&gt;fatal&lt;/li&gt;
&lt;li&gt;error&lt;/li&gt;
&lt;li&gt;warn&lt;/li&gt;
&lt;li&gt;info (the default)&lt;/li&gt;
&lt;li&gt;debug&lt;/li&gt;
&lt;li&gt;trace&lt;/li&gt;
&lt;/ul&gt;
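&lt;p&gt;For example, to get debug output on a plan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terragrunt plan --terragrunt-log-level debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;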

&lt;p&gt;&lt;em&gt;Terraform with Multiple Accounts/Environments&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Using Terraform with multiple accounts or environments brings its own challenges. Some common approaches are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use different git branches/repos&lt;/li&gt;
&lt;li&gt;Use Terraform Cloud&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Git branches/repos add no cost or learning complexity, but they do not help with the repetitive code that has to be managed, and there is no variable management. &lt;/p&gt;

&lt;p&gt;Terraform Cloud provides all the features you want, but there is a cost. Integration with policy-as-code is a benefit. Instead of terragrunt.hcl files, here you will create workspaces and projects. Here's my article on Terraform Cloud that gives all the details: &lt;a href="https://dev.to/aws-builders/terraform-cloud-with-aws-o20"&gt;Terraform Cloud with AWS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terragrunt takes a middle path, where it provides some of the functionalities you need for this use case, for no cost, while keeping the code repetition-free. &lt;/p&gt;

&lt;p&gt;Below is a comparison of different approaches:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj86zlk7r6idlpbptys1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj86zlk7r6idlpbptys1.jpg" alt="Comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope this was helpful!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>Terraform Cloud with AWS</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Sat, 13 Jan 2024 06:06:38 +0000</pubDate>
      <link>https://dev.to/aws-builders/terraform-cloud-with-aws-o20</link>
      <guid>https://dev.to/aws-builders/terraform-cloud-with-aws-o20</guid>
<description>&lt;p&gt;&lt;strong&gt;Terraform&lt;/strong&gt;&lt;br&gt;
Terraform is an infrastructure-as-code tool to provision and tear down infrastructure on cloud or on-premises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform Cloud&lt;/strong&gt;&lt;br&gt;
Terraform Cloud is a SaaS application that helps teams use Terraform together. It aims to solve many problems that arise when many developers are using Terraform in an organization. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It provides secure &amp;amp; easy access to shared state and secret data. You no longer have to give out powerful credentials to everyone who needs to run Terraform, and there is no need for duplicated variable values. &lt;/li&gt;
&lt;li&gt;&lt;p&gt;It integrates with your version control repo and runs Terraform plans when there is a change on the repo, e.g. a pull request or commit. You can run a Terraform plan on a pull request and see the speculative changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost Estimation: Estimate the cost of cloud resources you are creating. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can apply policies to control what infrastructure changes are allowed, using policy-as-code tools like Sentinel or OPA. For example, you can fail the Terraform run if the cost exceeds a threshold. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Get your own Free Terraform Cloud Account&lt;/strong&gt;&lt;br&gt;
You can sign up with your email ID and get a free Terraform cloud account from &lt;a href="https://www.hashicorp.com/products/terraform?product_intent=terraform" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Once your email is verified you are in Terraform Cloud. Click "Create New Organization"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ocgi82obbkskjpnaz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ocgi82obbkskjpnaz8.png" alt="Create Org1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqy0b1phntpw9qpk8is8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqy0b1phntpw9qpk8is8.png" alt="Create org1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Workspace&lt;/strong&gt;&lt;br&gt;
One GitHub repository corresponds to one workspace. All Terraform runs from the repo will appear inside this workspace.&lt;/p&gt;

&lt;p&gt;Terraform Cloud has 3 different workflow types:&lt;br&gt;
VCS - Connect to your GitHub and trigger runs manually or automatically on commit/pull-request&lt;/p&gt;

&lt;p&gt;CLI - Have the code on your local system and run from the terminal, but runs will still display on the portal. &lt;/p&gt;

&lt;p&gt;API - Run terraform using API. &lt;/p&gt;

&lt;p&gt;On the below screen select "Version Control Workflow"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8hssgqgcxfibn8anelp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8hssgqgcxfibn8anelp.png" alt="Create workspace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select your Version Control System (Github in my case) then do the login into the version control system. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l8wtmf4hoe2jmchd8as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l8wtmf4hoe2jmchd8as.png" alt="Select Repo1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once logged in, you will be asked to select the repository that will be tracked by this workspace. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d6zamb42rm6rpd3avf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d6zamb42rm6rpd3avf4.png" alt="Select Repo2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, enter the Workspace name,  and you are done. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connecting Terraform Cloud to AWS&lt;/strong&gt; &lt;br&gt;
Unlike the Terraform CLI on your local system, the cloud version does not need your AWS credentials stored locally. Terraform Cloud can centrally store variables that should be available to all workspaces (or some of them), and keep them secure. &lt;/p&gt;

&lt;p&gt;You create Variable Sets (that contain variables) and specify whether the variable set would be accessible to all workspaces or some of them. &lt;/p&gt;

&lt;p&gt;Navigate to Settings - Variable Sets and create a Variable Set. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqivuxqqicno3gdjpy0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqivuxqqicno3gdjpy0z.png" alt="Variable set"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside this create 2 variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Set the value as the access key id and secret access key from AWS of an IAM user with the required permissions.&lt;/p&gt;

&lt;p&gt;Select the variable type as "environment". The other option is "Terraform variable", for variables consumed by your Terraform code. &lt;br&gt;
Mark them as sensitive, and they will no longer be visible to anyone. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nma2ilo5p9sfctq93em.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nma2ilo5p9sfctq93em.png" alt="Variable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workspace Variables &lt;/strong&gt;&lt;br&gt;
You can specify variables that are local to a specific repository in its Workspace variables. This also allows you to have variables with the same name in different workspaces. &lt;br&gt;
For eg: I have a Workspace with the below variables. These will be used by the terraform scripts in that workspace. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ukhc73x2suwu4eds5j6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ukhc73x2suwu4eds5j6.png" alt="Workspace Variables"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In effect, any workspace will get the variables in the workspace, plus the variable sets defined as global or linked to that workspace. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now let's start Terraforming!&lt;/strong&gt;&lt;br&gt;
Add the code from the below repo into your Workspace repository.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/manumaangit" rel="noopener noreferrer"&gt;
        manumaangit
      &lt;/a&gt; / &lt;a href="https://github.com/manumaangit/terraformcloud_demo" rel="noopener noreferrer"&gt;
        terraformcloud_demo
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Demo Repo for Terraform Cloud
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;This repo is for demo of Terraform Cloud&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/manumaangit/terraformcloud_demo" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Terraform does not run automatically the very first time, so trigger a run from the workspace by going to it and clicking "New Run". You can trigger a run manually anytime using this button. In the below window, I have selected to run both terraform plan and terraform apply; you can also choose to run only a plan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi1evd4rg6wx2u8ut78s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi1evd4rg6wx2u8ut78s.png" alt="New Run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once it completes planning, just like CLI it will wait for approval to apply by default. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vbu7vb2tlndovdy25dm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vbu7vb2tlndovdy25dm.png" alt="Complete Run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to, you can set an option in Workspace settings to auto-apply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nvbriqxc8uhyi8glehu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nvbriqxc8uhyi8glehu.png" alt="Apply Default"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don't want to apply, click "Discard Run".&lt;br&gt;
From now on, any commit or pull request on the repository will trigger an automatic run. The run on a pull request is only a speculative plan and cannot be applied; only changes committed to the repo can be applied. &lt;/p&gt;

&lt;p&gt;I make a small change in the main.tf file and raise a pull-request. Terraform Cloud will automatically run a plan from the changes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp2yiytyzh28vkrcapaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp2yiytyzh28vkrcapaz.png" alt="PR Run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I commit the changes. Another run happens which has the option to apply. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboqievcyd0nangw56sw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fboqievcyd0nangw56sw0.png" alt="applyrun"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Estimation&lt;/strong&gt;&lt;br&gt;
You can enable cost estimation in Terraform Cloud. Go to Organization Settings - Cost Estimation and enable it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n9f773mtn130hx95bt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n9f773mtn130hx95bt7.png" alt="costestimation1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now if you change the EC2 instance type in the tf file and commit, you can see the cost estimate in the plan run. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx6l2jzti3j3ivgf6pnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx6l2jzti3j3ivgf6pnf.png" alt="costestimation2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policies for Terraform&lt;/strong&gt;&lt;br&gt;
Policies-as-Code is a way to specify policies in the form of code, and there are multiple tools out there to do that. Terraform Cloud supports two of the most popular ones - Sentinel (by HashiCorp, the makers of Terraform) and OPA (Open Policy Agent). &lt;/p&gt;

&lt;p&gt;The Free version of Terraform Cloud puts tight limits on policies - you can have a total of 5 policies and one PolicySet as of this writing. (A PolicySet is what links a policy to one or more Workspaces in Terraform Cloud.) Also, the Free version cannot fetch policies from a version control system. &lt;/p&gt;

&lt;p&gt;From Organization Settings - Policies you can create a Policy as shown below. I have selected the Sentinel type. &lt;/p&gt;

&lt;p&gt;Note that in the Free version, you can have a maximum of one soft/hard mandatory policy and multiple advisory policies. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Soft Mandatory - This is like a warning; even if the code breaks the policy, you can override it. &lt;/li&gt;
&lt;li&gt;Hard Mandatory - This is an error; you cannot override it if the code breaks the policy.&lt;/li&gt;
&lt;li&gt;Advisory - This just prints a message, more like an INFO level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Sentinel code below checks whether the infrastructure change will incur a monthly cost increase of more than $100.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import "tfrun"
import "decimal"

# Monthly cost delta for this run, converted to a decimal
delta_monthly_cost = decimal.new(tfrun.cost_estimate.delta_monthly_cost)

# Pass only if the monthly cost increase stays below $100
main = rule {
  print("Cost change cannot be more than $100") and
  delta_monthly_cost.less_than(100)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
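&lt;p&gt;For completeness: when a PolicySet is sourced from version control (a paid feature, not available in the Free version), the enforcement levels above are declared in a &lt;code&gt;sentinel.hcl&lt;/code&gt; file next to the policy files instead of in the UI. A minimal sketch, with hypothetical policy names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sentinel.hcl - one policy block per .sentinel file in the repo
policy "restrict-cost-increase" {
  source            = "./restrict-cost-increase.sentinel"
  enforcement_level = "hard-mandatory"
}

policy "warn-on-large-instances" {
  source            = "./warn-on-large-instances.sentinel"
  enforcement_level = "advisory"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;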



&lt;p&gt;Once the policy is saved, navigate to Organization Settings - PolicySets and click "Connect a New PolicySet".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e37uplfroygyreg933b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e37uplfroygyreg933b.png" alt="Policy1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click the link that says "Create a PolicySet with individually managed policies"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq2syctu22ef3l9z25oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq2syctu22ef3l9z25oy.png" alt="Policy2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set what kind of policies are in the set (Sentinel). Select "Policies enforced globally" to apply it to all Workspaces. At the bottom, select the previously created Policy from the drop-down and add it to this PolicySet. Click "Connect PolicySet".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F329rp6kfto5vtvuuhcym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F329rp6kfto5vtvuuhcym.png" alt="Policyset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's change the EC2 type to t2.2xlarge which will surely break this policy. &lt;/p&gt;
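&lt;p&gt;For illustration, the change is just the &lt;code&gt;instance_type&lt;/code&gt; argument in the Terraform configuration (the resource name and AMI ID below are hypothetical placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI ID
  instance_type = "t2.2xlarge"             # bumped from a smaller type; breaks the $100 policy
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;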

&lt;p&gt;The cost estimate is $276.97, which triggers the Policy to fail. Note that Apply is no longer an option since we set the policy as "Hard Mandatory".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zkvzanj0gd1zsgetv6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zkvzanj0gd1zsgetv6z.png" alt="Policyfail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try out Terraform Cloud and take your Terraforming to the next level. &lt;/p&gt;

&lt;p&gt;Hope this was helpful!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Cool Announcements @ AWS ReInvent CEO KeyNote</title>
      <dc:creator>Manu Muraleedharan</dc:creator>
      <pubDate>Wed, 29 Nov 2023 08:01:31 +0000</pubDate>
      <link>https://dev.to/manumaan/cool-announcements-aws-reinvent-ceo-keynote-1bj9</link>
      <guid>https://dev.to/manumaan/cool-announcements-aws-reinvent-ceo-keynote-1bj9</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J45ZYWUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p1qed37aszkp4t0afk8w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J45ZYWUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p1qed37aszkp4t0afk8w.jpg" alt="Image description" width="750" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Express One Zone&lt;/strong&gt;&lt;br&gt;
Highest-performance, lowest-latency cloud storage&lt;br&gt;
50% less cost than S3 Standard &lt;br&gt;
10 times faster than S3 Standard&lt;br&gt;
Single-digit millisecond latency&lt;br&gt;
Millions of requests per second&lt;br&gt;
Co-locate storage and compute in the same AZ &lt;br&gt;
Pinterest saw 10x faster write speeds and 40% lower cost with S3 Express One Zone&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graviton 4&lt;/strong&gt;&lt;br&gt;
30% faster than Graviton3&lt;br&gt;
Faster and more energy-efficient&lt;br&gt;
R8g EC2 instances with Graviton4 in preview&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 UltraClusters&lt;/strong&gt;&lt;br&gt;
20,000 GPUs, connected by EFA at 32,000 Gbps &lt;br&gt;
Equal to a supercomputer = 20 Exaflops&lt;/p&gt;

&lt;p&gt;New GPU = GH200 which is 4 times faster with new LLM compilers&lt;br&gt;
Uses Grace Hopper technology to connect CPU and GPU at 1TBPS&lt;br&gt;
32 GH200 can connect via the NVLINK switch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA DGX cloud is coming to AWS&lt;/strong&gt;&lt;br&gt;
This is NVIDIA's AI Factory, connecting 16,000+ GPUs together. &lt;br&gt;
65 Exaflops of compute capacity, making LLM training 2x faster &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 Capacity Blocks for ML&lt;/strong&gt;&lt;br&gt;
Reserve EC2 Ultra Clusters for short-term usage&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trainium 2&lt;/strong&gt;&lt;br&gt;
4x faster, second-gen specially designed chips for training models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ML Customization in AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning available in:&lt;br&gt;
Titan Text Lite, Express, Cohere Command Lite, Meta Llama 2, and Anthropic Claude&lt;/p&gt;

&lt;p&gt;Retrieval-Augmented Generation with Knowledge Bases &lt;br&gt;
(Announced Sept 2023)&lt;/p&gt;

&lt;p&gt;Continued Pre-Training is available in AWS Bedrock. The technique involves using large amounts of unlabeled data before fine-tuning a model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents for Bedrock&lt;/strong&gt;&lt;br&gt;
Execute multi-step actions across company systems powered by ML&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails for Bedrock&lt;/strong&gt;&lt;br&gt;
Safeguard generative AI applications with responsible AI policies&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Education Commitment&lt;/strong&gt;&lt;br&gt;
AWS commits to training 29 million people in the cloud and 2 million in AI for free by 2025. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS CodeWhisperer Customization Capability&lt;/strong&gt;&lt;br&gt;
Provide custom code suggestions using internal SDKs, APIs, and code. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Q (Some of the features in Preview)&lt;/strong&gt;&lt;br&gt;
This is the new service I am most excited about. &lt;br&gt;
An AI assistant designed for the business world that understands your company's information. &lt;br&gt;
You can chat with Q in the AWS console, documentation, CodeWhisperer, and chat apps like Slack.&lt;br&gt;
Q is already trained with all the AWS information, Well-Architected Framework principles, etc.&lt;br&gt;
Troubleshoot errors with Q when you get an error in the AWS console. &lt;br&gt;
Get recommendations and step-by-step information&lt;br&gt;
Feature Development: Develop a new feature in AWS with Q, using prompts interactively and iteratively&lt;br&gt;
Code Transformation: Use Q to upgrade language versions in code, e.g. 1,000 Java apps upgraded in 2 days&lt;br&gt;
Business expert: Connects to over 40 data sources and answers business questions; supports RBAC&lt;br&gt;
Amazon Q inside QuickSight: Create BI reports and visualizations using generative AI powered by Q&lt;br&gt;
Amazon Q inside Amazon Connect: An AI agent stays on the call to help contact-center agents with customer interactions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-ETL integrations with Redshift in:&lt;/strong&gt;&lt;br&gt;
Aurora Postgres, RDS for MySQL, DynamoDB&lt;br&gt;
As soon as data is written into these databases, query and analyze in Redshift without ETL pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-ETL integration between DynamoDB and OpenSearch&lt;/strong&gt;&lt;br&gt;
Search DynamoDB data through OpenSearch without doing ETL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI recommendations for Amazon DataZone&lt;/strong&gt;&lt;br&gt;
Add business descriptions to data in DataZone using AI. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Kuiper&lt;/strong&gt;&lt;br&gt;
A constellation of low Earth orbit satellites that aims to provide fast, affordable, and reliable broadband to customers in areas without reliable internet connection. Private network connectivity is now available.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>reinvent</category>
    </item>
  </channel>
</rss>
