<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Monica Colangelo</title>
    <description>The latest articles on DEV Community by Monica Colangelo (@monica_colangelo).</description>
    <link>https://dev.to/monica_colangelo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F788307%2Fb3a510b1-a027-4df1-a6b0-1cb9bb179f61.JPG</url>
      <title>DEV Community: Monica Colangelo</title>
      <link>https://dev.to/monica_colangelo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/monica_colangelo"/>
    <language>en</language>
    <item>
      <title>Automated Mass Tagging in AWS Across Accounts and Organizations</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Thu, 17 Aug 2023 16:44:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/automated-mass-tagging-in-aws-across-accounts-and-organizations-2ehn</link>
      <guid>https://dev.to/aws-builders/automated-mass-tagging-in-aws-across-accounts-and-organizations-2ehn</guid>
      <description>&lt;h1&gt;
  
  
  Tagging strategy: easier said...
&lt;/h1&gt;

&lt;p&gt;In the expansive world of AWS, tagging resources stands out as both a straightforward task and an essential one. On the surface, it's about assigning a label, a seemingly simple action. Yet, the implications of this action are profound. Tags are not just mere identifiers; they're pivotal tools in organizing, managing, and optimizing your cloud environment, because they cater to a range of organizational needs, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expense Tracking&lt;/strong&gt;: for instance, with &lt;a href="https://aws.amazon.com/blogs/aws-cloud-financial-management/cost-allocation-blog-series-2-aws-generated-vs-user-defined-cost-allocation-tag/"&gt;&lt;strong&gt;Cost Allocation Tags&lt;/strong&gt;&lt;/a&gt;, you can monitor specific costs tied to a particular project or department.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure Automation&lt;/strong&gt;: tags can trigger &lt;strong&gt;Automated Infrastructure Activities&lt;/strong&gt;. Think of an instance that's tagged as 'development' being automatically shut down outside of working hours to save costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Project Phases&lt;/strong&gt;: with &lt;strong&gt;Workload Lifecycle&lt;/strong&gt; tags, you can easily identify whether a particular resource is in the 'testing', 'development', or 'production' phase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Issue Resolution&lt;/strong&gt;: &lt;strong&gt;incident management&lt;/strong&gt; tags can help in quickly identifying resources that might be affected during an outage or incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;: &lt;strong&gt;update management&lt;/strong&gt; tags can indicate when a resource was last patched or updated, ensuring timely maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational Insights&lt;/strong&gt;: for a clear view of your operations, &lt;strong&gt;Operational Observability&lt;/strong&gt; tags can denote the health or status of resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Protection&lt;/strong&gt;: &lt;strong&gt;risk and security management&lt;/strong&gt; tags can highlight resources that contain sensitive data, ensuring they have tighter security controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Management&lt;/strong&gt;: &lt;strong&gt;identity and access tags&lt;/strong&gt; can dictate who within your organization can access specific resources, reinforcing security protocols.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
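&lt;p&gt;To make the automation example concrete, here's a minimal boto3 sketch of the "shut down development instances" idea. It's only an illustration: the tag key &lt;code&gt;environment&lt;/code&gt;, the value &lt;code&gt;development&lt;/code&gt;, and the region are assumptions, not a prescribed scheme.&lt;/p&gt;

```python
def find_dev_instances(pages):
    """Collect IDs of running instances tagged environment=development from
    describe_instances response pages (tag key/value are assumptions)."""
    ids = []
    for page in pages:
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if (instance.get("State", {}).get("Name") == "running"
                        and tags.get("environment") == "development"):
                    ids.append(instance["InstanceId"])
    return ids

def stop_dev_instances(region="eu-west-1"):
    """Stop every running 'development' instance, e.g. from a scheduled Lambda."""
    import boto3  # lazy import keeps find_dev_instances unit-testable without AWS deps
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_instances").paginate()
    ids = find_dev_instances(pages)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

&lt;p&gt;Scheduled nightly (for example via EventBridge), a function like this turns the tag into an automatic cost-saving action.&lt;/p&gt;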

&lt;p&gt;A well-defined tagging strategy is paramount. AWS itself recognizes the significance of this and has published &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/tagging-best-practices.html"&gt;an extensive whitepaper&lt;/a&gt; detailing best practices and guidelines. This strategy isn't just about knowing what to tag, but understanding the 'why' and 'how' behind each tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  ...than done
&lt;/h2&gt;

&lt;p&gt;So, once we've established our tagging strategy, is it smooth sailing from there? Well, as the saying goes, "easier said than done." Indeed, a strategy is only as good as its &lt;strong&gt;execution&lt;/strong&gt; plan. A comprehensive strategy must be paired with a &lt;strong&gt;pragmatic&lt;/strong&gt; action plan detailing its implementation.&lt;/p&gt;

&lt;p&gt;From our list of tagging use cases, one aspect becomes abundantly clear: tags, while seemingly simple tools, cater to a &lt;strong&gt;myriad&lt;/strong&gt; of distinct purposes. These &lt;strong&gt;purposes&lt;/strong&gt;, in turn, address the needs of &lt;strong&gt;diverse teams&lt;/strong&gt; within an organization. Whether it's Finance, Operations, Security, or Development teams, each has its &lt;strong&gt;unique requirements&lt;/strong&gt;. These teams might possess different skill sets, operate on varying timelines, and even employ distinct &lt;strong&gt;tools&lt;/strong&gt; for their tagging activities. The challenge then is &lt;strong&gt;coordination&lt;/strong&gt;: how do these teams work in tandem without stepping on each other's toes?&lt;/p&gt;

&lt;h1&gt;
  
  
  Organizational complexity
&lt;/h1&gt;

&lt;p&gt;To truly grasp the intricacies of tagging, let's delve into a &lt;strong&gt;real-world scenario&lt;/strong&gt; I encountered. In this setup, there's a team, aptly named "&lt;strong&gt;Cloud Center&lt;/strong&gt;", responsible for managing the AWS &lt;strong&gt;Organizations&lt;/strong&gt;. Their tasks encompass creating, overseeing, and auditing the AWS accounts affiliated with the organization.&lt;/p&gt;

&lt;p&gt;Within this Organization, there are several Organizational Units (&lt;strong&gt;OUs&lt;/strong&gt;) mirroring the company's internal structure. Each OU houses multiple &lt;strong&gt;projects&lt;/strong&gt;, and each project might have separate &lt;strong&gt;accounts&lt;/strong&gt; for development, testing, and production.&lt;/p&gt;

&lt;p&gt;Each OU is backed by a &lt;strong&gt;Platform Team&lt;/strong&gt;, providing project teams with essential tools, such as CI/CD &lt;strong&gt;pipelines&lt;/strong&gt; and Infrastructure as Code (&lt;strong&gt;IaC&lt;/strong&gt;) execution tools like &lt;strong&gt;Terraform&lt;/strong&gt;. Then, every project has a &lt;strong&gt;separate team&lt;/strong&gt;, which is diverse, comprising roles like backend engineers, DevOps specialists, QAs, and more. Notably, many team members are often &lt;strong&gt;consultants or contractors&lt;/strong&gt; dedicated to specific projects rather than company employees.&lt;/p&gt;

&lt;p&gt;Additionally, there are &lt;strong&gt;shared OUs&lt;/strong&gt; and accounts dedicated to operational services, like &lt;strong&gt;networking&lt;/strong&gt;, Transit Gateway, Network Firewall, and DNS, or &lt;strong&gt;security&lt;/strong&gt;-centric tasks: each of these is managed by a different team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iLvxYVnm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59q4fwhdatzc62s82cf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iLvxYVnm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59q4fwhdatzc62s82cf2.png" alt="" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;top&lt;/strong&gt; of all that, this Organization belongs to a &lt;strong&gt;larger corporate group&lt;/strong&gt;. At the group level, there's a need to monitor the spending of each subsidiary company. This requires &lt;strong&gt;data extraction&lt;/strong&gt; from each Organization, necessitating the &lt;strong&gt;tagging of AWS accounts themselves&lt;/strong&gt;, with unique labels and values consistent across the entire corporate group.&lt;/p&gt;
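&lt;p&gt;Tagging an account itself boils down to a single Organizations API call. Here's a minimal sketch; the tag keys and helper names are placeholders, not the corporate group's real scheme:&lt;/p&gt;

```python
def format_tags(tags):
    """Convert a plain {key: value} dict into the [{'Key': ..., 'Value': ...}]
    shape the Organizations API expects, sorted for deterministic output."""
    return [{"Key": k, "Value": v} for k, v in sorted(tags.items())]

def tag_account(account_id, tags):
    """Attach tags to an AWS account via the Organizations API.
    Example (hypothetical keys): tag_account("111122223333",
    {"cost_centre": "CC-1234", "subsidiary": "acme-cloud"})."""
    import boto3  # lazy import keeps format_tags unit-testable without AWS deps
    org = boto3.client("organizations")
    org.tag_resource(ResourceId=account_id, Tags=format_tags(tags))
```

&lt;p&gt;This call must run from the Organization's management account (or a delegated administrator), which is exactly why it sits with the Cloud Center described below.&lt;/p&gt;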

&lt;h1&gt;
  
  
  The mass-tagging hierarchy
&lt;/h1&gt;

&lt;p&gt;In such a multifaceted environment, expecting every individual to simply read a tagging strategy manual and apply it flawlessly is &lt;strong&gt;wishful thinking&lt;/strong&gt;. Each team has its tagging objectives aligned with its goals. However, keeping track of everyone's tagging needs would be a Herculean task. While enforcing a &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_tag-policies.html"&gt;Tagging Policy&lt;/a&gt; can provide some structure, manually &lt;strong&gt;reconciling&lt;/strong&gt; the requirements of so many teams would be a colossal &lt;strong&gt;drain&lt;/strong&gt; on time and resources.&lt;/p&gt;

&lt;p&gt;The reality of the matter is that tags, in many instances, aren't particularly volatile entities. In an ideal setup, they act as labels assigned during the creation of a resource. Once in place, these tags seldom change, except for specific use cases. Given this nature, &lt;strong&gt;it's counterproductive to burden individuals&lt;/strong&gt; with a task that, with the right precautions, can be seamlessly automated. After all, machines are inherently better suited for &lt;strong&gt;repetitive&lt;/strong&gt; and mundane tasks than humans.&lt;/p&gt;

&lt;p&gt;This realization led to the adoption of a &lt;strong&gt;multi-tiered mass-tagging strategy&lt;/strong&gt;. Each "tier" or "level" employed a recurring &lt;strong&gt;Lambda&lt;/strong&gt; function to tag all resources under its &lt;strong&gt;purview&lt;/strong&gt;. Care was taken to ensure that tags from one level didn't overwrite or remove those from another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Center Command
&lt;/h2&gt;

&lt;p&gt;At the &lt;strong&gt;topmost tier&lt;/strong&gt;, the Cloud Center took on the responsibility of tagging AWS accounts directly. This was primarily for billing, finance, and cost allocation purposes. They utilized tags that were &lt;strong&gt;globally unique&lt;/strong&gt; within the corporate group, along with additional tags indicating the &lt;strong&gt;OU, project, and environment&lt;/strong&gt; specifics. Here's how the process was streamlined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every night, an &lt;strong&gt;EventBridge&lt;/strong&gt;-scheduled &lt;strong&gt;Lambda&lt;/strong&gt; function would activate. This function would:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;read&lt;/strong&gt; all the tags (both key and value) for each account&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def list_accounts():
    existing_accounts = [
        account
        for accounts in _org_client.get_paginator("list_accounts").paginate()
        for account in accounts['Accounts']
    ]
    return existing_accounts

def get_account_tags(account_id):
    formatted_tags = _org_client.list_tags_for_resource(
        ResourceId=account_id)
    return formatted_tags

def handler(event, context):
    for account in list_accounts():
        account_id = account.get('Id')

        try:
            tags = get_account_tags(account_id)
        except Exception as ce:
            logger.error(
                f'Exception retrieving tags in Organization for account {account_id}: {ce}')
            continue
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;create a message for each account in an &lt;strong&gt;SQS&lt;/strong&gt; queue&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqs.send_message(
    QueueUrl=sqs_queue_url,
    DelaySeconds=15,
    MessageAttributes={
        'Account': {
            'DataType': 'String',
            'StringValue': account_id
        },
        'Tags': {
            'DataType': 'String',
            'StringValue': json.dumps(response_json)
        },
        'Region': {
            'DataType': 'String',
            'StringValue': reg
        }
    },
    MessageBody=(
        f'Tag value for account {account_id} in region {reg}'
    )
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;each message in the queue would then &lt;strong&gt;trigger a second Lambda&lt;/strong&gt; function. This function would:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;read the message content&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for record in event['Records']:
    account_id = record['messageAttributes']['Account']['stringValue']
    tags_raw = record['messageAttributes']['Tags']['stringValue']
    reg = record['messageAttributes']['Region']['stringValue']
    receipt_handle = record['receiptHandle']
    tags = json.loads(tags_raw)['Tags']
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;assume an IAM role in the target account&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;save the tag list in an &lt;strong&gt;SSM parameter&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;list resources to be tagged using the &lt;code&gt;resourcegroupstaggingapi&lt;/code&gt;&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = create_boto3_client(account_id, 'resourcegroupstaggingapi', assume_role(account_id), reg)
map = client.get_resources(ResourcesPerPage=50)
list = get_resources_to_tag(map['ResourceTagMappingList'], tagkey, tagvalue)

[...]

def get_resources_to_tag(map, tagkey, tagvalue):
    resourcelist = []
    for resource in map:
        logger.debug(f'Resource: {resource}')
        if resource['ResourceARN'].startswith('arn:aws:cloudformation'):
            logger.debug(
                f'Resource {resource} is a cloudformation stack, we do not need to tag it')
            continue
        to_be_tagged = True
        for tag in resource['Tags']:
            if tag['Key'] == tagkey and tag['Value'] == tagvalue:
                to_be_tagged = False
                logger.debug(
                    f'Found tag {tagkey} with value {tagvalue} in resource, no need to retag')
                break
        if to_be_tagged == True:
            logger.debug(
                f'NOT FOUND tag {tagkey} with value {tagvalue} in resource, need to tag')
            resourcelist.append(resource['ResourceARN'])
    return resourcelist
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;finally, apply the tags.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
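&lt;p&gt;Steps 2, 3, and 5 are not shown above; here's a condensed sketch of how they might fit together. The role name, SSM parameter name, and helper names are illustrative assumptions, not the actual setup:&lt;/p&gt;

```python
import json

def chunked(items, size):
    """Split a list into batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def apply_central_tags(account_id, reg, tags, resource_arns):
    """Assume a role in the target account, record the centrally managed tag
    keys in SSM, then tag the listed resources. `tags` is a {key: value} dict;
    the role and parameter names are hypothetical."""
    import boto3  # lazy import keeps chunked unit-testable without AWS deps
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/central-tagging-role",
        RoleSessionName="mass-tagging",
    )["Credentials"]
    session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
        region_name=reg,
    )
    # Step 3: persist the centrally managed tag keys for the Platform Teams to read
    session.client("ssm").put_parameter(
        Name="/cloud-center/managed-tag-keys",
        Value=json.dumps(sorted(tags)),
        Type="String",
        Overwrite=True,
    )
    # Step 5: TagResources accepts at most 20 ARNs per call, so batch the list
    tagging = session.client("resourcegroupstaggingapi")
    for batch in chunked(resource_arns, 20):
        tagging.tag_resources(ResourceARNList=batch, Tags=tags)
```

&lt;p&gt;The batching matters because &lt;code&gt;TagResources&lt;/code&gt; rejects requests with more than 20 ARNs.&lt;/p&gt;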

&lt;p&gt;This approach ensured that the Cloud Center team maintained tags in a centralized manner, eliminating the need for disparate synchronization efforts.&lt;/p&gt;

&lt;p&gt;💡&lt;br&gt;
&lt;em&gt;For those interested, you can find a Python version of these Lambda functions &lt;/em&gt;&lt;a rel="noopener noreferrer nofollow" href="https://github.com/theonlymonica/aws-multi-level-tagging"&gt;&lt;strong&gt;&lt;em&gt;here&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ixjXoGf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/226d9cwg0sik89swa49b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ixjXoGf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/226d9cwg0sik89swa49b.png" alt="" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Team Playbook
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;second "tier"&lt;/strong&gt; in this tagging hierarchy is occupied by the &lt;strong&gt;Platform Teams of each OU&lt;/strong&gt;, and their approach mirrors that of the Cloud Center, albeit with some tailored modifications.&lt;/p&gt;

&lt;p&gt;In the case of the Platform Teams, their management account has read delegation over the Organization. The nightly process for them unfolds as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Account Listing from Organization&lt;/strong&gt;: Triggered by &lt;strong&gt;EventBridge&lt;/strong&gt; at a different time than the Cloud Center's process, a &lt;strong&gt;Lambda&lt;/strong&gt; function initiates and reads the &lt;strong&gt;list of accounts&lt;/strong&gt; specific to its OU from the Organization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mass Tagging (If Necessary)&lt;/strong&gt;: If the Platform Team has its specific tags to apply, it employs a mass-tagging approach &lt;strong&gt;identical&lt;/strong&gt; to the Cloud Center's. It's worth noting that &lt;strong&gt;not all Platform Teams have this requirement&lt;/strong&gt;. Some use this technique to assign tags that indicate, for instance, which EBS volumes need &lt;strong&gt;backups&lt;/strong&gt; or which non-production EC2 instances can be &lt;strong&gt;shut down&lt;/strong&gt; during nights and weekends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Terraform Pipeline Integration&lt;/strong&gt;: Given that these Platform Teams provide project teams with Terraform execution pipelines, they adopt a methodology (detailed in &lt;a href="https://letsmake.cloud/automating-the-injection-of-cicd-runtime-information-into-terraform-code"&gt;this article&lt;/a&gt;) that &lt;strong&gt;dynamically&lt;/strong&gt; instructs the AWS provider in Terraform to "ignore" certain tags. This list of "ignored" tags is a merger of the Cloud Center's tags (read from the SSM Parameter saved by the Cloud Center's Lambda function) and their own.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {

  ignore_tags {
    keys = ["cost_centre","environment","territory","service","billing_team"]
  }

  [...]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
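&lt;p&gt;As a rough illustration of step 3, the "ignored keys" list can be assembled at pipeline runtime by merging the Cloud Center's SSM parameter with the team's own keys. The parameter name and helper names here are assumptions for the sake of the example:&lt;/p&gt;

```python
import json

def merge_keys(*key_lists):
    """Deduplicate and sort tag keys from any number of lists."""
    merged = set()
    for keys in key_lists:
        merged.update(keys)
    return sorted(merged)

def build_ignore_keys(team_keys, parameter_name="/cloud-center/managed-tag-keys"):
    """Merge the Cloud Center's tag keys (read from SSM, where its Lambda saved
    them as a JSON list) with the Platform Team's own keys, producing the list
    injected into the provider's ignore_tags block."""
    import boto3  # lazy import keeps merge_keys unit-testable without AWS deps
    ssm = boto3.client("ssm")
    central = json.loads(ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"])
    return merge_keys(central, team_keys)
```

&lt;p&gt;The pipeline then renders the merged list into the &lt;code&gt;ignore_tags&lt;/code&gt; block before running Terraform.&lt;/p&gt;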

&lt;h2&gt;
  
  
  Project Team Precision
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;final tier&lt;/strong&gt; in this tagging hierarchy is the &lt;strong&gt;project teams&lt;/strong&gt;. Their primary focus is on their specific projects, and they shouldn't be burdened with the complexities of the overarching tagging strategy. While &lt;strong&gt;transparency&lt;/strong&gt; is essential, and indeed, tag information is openly shared (given that tags are visible and not shrouded in secrecy), it's not the project teams' &lt;strong&gt;responsibility&lt;/strong&gt; to manage or be overly concerned with them.&lt;/p&gt;

&lt;p&gt;These teams have the liberty to add &lt;strong&gt;project-specific tags&lt;/strong&gt; using their primary tool, Terraform. However, there's a catch: they can &lt;strong&gt;only&lt;/strong&gt; use Terraform through the &lt;strong&gt;pipeline&lt;/strong&gt; provided by their respective Platform Team. This constraint is in place because individual user accounts have very limited permissions, typically read-only. This restriction ensures that resources are not proliferated haphazardly without version control on Git, avoiding the pitfalls of &lt;a href="https://docs.cloudposse.com/glossary/clickops/"&gt;ClickOps&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The beauty of the pipeline's design is its runtime instruction to Terraform to ignore specific tags. This feature acts as a &lt;strong&gt;safeguard&lt;/strong&gt;. Even if a team member inadvertently adds a tag in Terraform that matches an existing tag but with a different value, the pipeline ensures that the original value remains untouched and the new value is disregarded.&lt;/p&gt;

&lt;h1&gt;
  
  
  Making Sense of the Tagging Puzzle
&lt;/h1&gt;

&lt;p&gt;Tagging in AWS might seem like a small task, but as we've seen, it's a big deal. Getting from a plan on paper to actually tagging everything right is no walk in the park. But with a good system in place and everyone &lt;strong&gt;on the same page&lt;/strong&gt;, it becomes a lot easier.&lt;/p&gt;

&lt;p&gt;What's the &lt;strong&gt;main lesson&lt;/strong&gt; here? Keep things &lt;strong&gt;automated&lt;/strong&gt; and work together. Machines are great at repetitive tasks, so let's let them handle that. And when teams &lt;strong&gt;collaborate&lt;/strong&gt;, the whole tagging process becomes smoother and more efficient.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>organization</category>
      <category>lambda</category>
      <category>tagging</category>
    </item>
    <item>
      <title>Guardian of the Functions: Keeping an Eye on your Galaxy of AWS Step Functions with Custom Metrics on CloudWatch</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Tue, 18 Jul 2023 18:09:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/guardian-of-the-functions-keeping-an-eye-on-your-galaxy-of-aws-step-functions-with-custom-metrics-on-cloudwatch-4kg7</link>
      <guid>https://dev.to/aws-builders/guardian-of-the-functions-keeping-an-eye-on-your-galaxy-of-aws-step-functions-with-custom-metrics-on-cloudwatch-4kg7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Managing &lt;strong&gt;multiple AWS Step Functions&lt;/strong&gt; can quickly turn into a complex task, especially when each function forms a crucial link in a broader process. For instance, consider a data processing system where numerous files are uploaded, analyzed, and then relocated. Each step of this process could be orchestrated by its own Step Function, executing a variety of tasks in sequence.&lt;/p&gt;

&lt;p&gt;For a team &lt;strong&gt;monitoring&lt;/strong&gt; this process, an error in any of these functions could disrupt the entire sequence and halt the processing of subsequent files. Therefore, having a clear, real-time understanding of the status of each Step Function's latest execution is not just a nice-to-have—it's essential.&lt;/p&gt;

&lt;p&gt;Now, imagine a scenario where your team is handling not just one, but dozens or even hundreds of such sequences—each represented by an AWS Step Function. Manually monitoring the status of each function's latest execution becomes an incredibly time-consuming task, and the risk of missing a crucial error increases.&lt;/p&gt;

&lt;p&gt;This is where our &lt;strong&gt;Guardian&lt;/strong&gt; comes into play 🧑‍🚒&lt;/p&gt;

&lt;p&gt;Our goal is to create an intuitive &lt;strong&gt;dashboard&lt;/strong&gt; that offers an at-a-glance overview of the status of each Step Function. Think of it as a &lt;strong&gt;traffic light system&lt;/strong&gt;: green for successful executions, red for failures. At any moment, a &lt;strong&gt;quick look&lt;/strong&gt; at this dashboard will tell us if all our functions are operating correctly or if there's a hitch in our sequence that needs our immediate attention.&lt;/p&gt;

&lt;p&gt;In this blog post, we will outline how to use &lt;strong&gt;Terraform&lt;/strong&gt; and AWS &lt;strong&gt;CloudWatch&lt;/strong&gt; to achieve this. Terraform will help us set up and manage our infrastructure, while AWS CloudWatch will provide the platform for our monitoring dashboard. With these tools at our disposal, we'll turn the daunting task of overseeing a multitude of AWS Step Functions into a manageable, even effortless process.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Missing Piece in AWS&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When dealing with AWS Step Functions, one might assume that AWS would offer a native &lt;strong&gt;metric&lt;/strong&gt; in CloudWatch for monitoring the status of the most recent execution of a function. After all, AWS provides a plethora of such metrics out of the box for many of its services.&lt;/p&gt;

&lt;p&gt;Unfortunately, this isn't the case for Step Functions. While &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/procedure-cw-metrics.html#cloudwatch-step-functions-execution-metrics"&gt;AWS does offer metrics&lt;/a&gt; like the total number of executions, succeeded executions, failed executions, and throttled executions, these are all aggregate metrics. They provide a broad view of a function's performance but do not offer insight into the status of each function's &lt;strong&gt;&lt;em&gt;latest&lt;/em&gt;&lt;/strong&gt; execution.&lt;/p&gt;

&lt;p&gt;This lack of granularity can be a significant hurdle when monitoring a large number of Step Functions, especially when the status of the most recent execution is the key metric we're interested in.&lt;/p&gt;

&lt;p&gt;So how can we fill this gap? The solution is to create our own &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html"&gt;&lt;strong&gt;custom metric&lt;/strong&gt;&lt;/a&gt;, and in the next section, we'll dive into how we can use AWS Lambda and CloudWatch to do just that.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Creating a Custom Metric with AWS Lambda&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Since AWS doesn't offer a native metric for the &lt;strong&gt;status of the latest execution&lt;/strong&gt; of a Step Function, we need to create this metric ourselves. To do this, we'll use AWS &lt;strong&gt;Lambda&lt;/strong&gt;, a service that lets you run your code without provisioning or managing servers.&lt;/p&gt;

&lt;p&gt;The idea is straightforward: we'll create a Lambda function that periodically checks the status of the latest execution of each of our Step Functions and then publishes this information as a custom metric to CloudWatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring IAM permissions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first thing we need to do is ensure our Lambda function has the necessary permissions to both read the status of our Step Functions and publish custom metrics to CloudWatch. To do this, we can create an IAM role with the following policy (or see how I create it with Terraform in the next chapter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"states:ListStateMachines"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"states:ListExecutions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"states:DescribeExecution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:PutMetricData"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy allows our function to list all state machines (i.e., Step Functions), describe their executions, and put metric data into CloudWatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating the Lambda function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With our permissions configured, we can now create our Lambda function. Here's a high-level overview of what our function will do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;List all Step Functions in our account using the &lt;code&gt;ListStateMachines&lt;/code&gt; API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each Step Function, fetch its most recent execution using the &lt;code&gt;ListExecutions&lt;/code&gt; API and read that execution's status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Map the status of each execution to a numerical value with a function &lt;code&gt;status_to_number&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publish these numerical statuses as custom metrics to CloudWatch using the &lt;code&gt;PutMetricData&lt;/code&gt; API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's an example of what the Python code for this Lambda function might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Initialize clients
&lt;/span&gt;    &lt;span class="n"&gt;sf_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'stepfunctions'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cw_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'cloudwatch'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# List all state machines
&lt;/span&gt;    &lt;span class="n"&gt;state_machines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_state_machines&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s"&gt;'stateMachines'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Loop through all state machines
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state_machines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Get latest execution status
&lt;/span&gt;        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;executionArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'stateMachineArn'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'status'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Map status to a numerical value
&lt;/span&gt;        &lt;span class="n"&gt;status_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status_to_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Publish custom metric to CloudWatch
&lt;/span&gt;        &lt;span class="n"&gt;cw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'StepFunctions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s"&gt;'MetricName'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;status_value&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;status_to_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'RUNNING'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'SUCCEEDED'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'FAILED'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'TIMED_OUT'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'ABORTED'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# return 0 if status is not in the mapping
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, each state is represented by a unique numerical value, providing more granular information about the status of your Step Function executions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scheduling the Lambda function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final piece of the puzzle is to ensure our Lambda function &lt;strong&gt;runs periodically&lt;/strong&gt; to keep our custom metrics up-to-date. To provide the most recent status of our Step Functions, we will &lt;strong&gt;schedule&lt;/strong&gt; our Lambda function to run at regular intervals using Amazon &lt;strong&gt;EventBridge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;How do we ensure that our CloudWatch alarm reflects only the most recent state of the Step Function, not past states? This is a valid concern, as CloudWatch alarms often aggregate data over a certain time period, potentially mixing up the statuses of different Step Function executions.&lt;/p&gt;

&lt;p&gt;This is where choosing the &lt;strong&gt;right statistic&lt;/strong&gt; for our CloudWatch alarm comes into play. We will use the '&lt;strong&gt;&lt;em&gt;Maximum&lt;/em&gt;&lt;/strong&gt;' statistic with a period of 1 hour for our alarm. This ensures that the alarm state always reflects the highest (i.e., most severe) status reported by the Lambda function in the past hour.&lt;/p&gt;

&lt;p&gt;Why 'Maximum', and why a period of 1 hour? The 'Maximum' statistic ensures that if there is any failed execution (which we mapped to a higher value), that is the status the alarm takes into account. The 1-hour period is shorter than the interval at which our Lambda function runs (every 3 hours), so each evaluation period of the alarm contains at most one data point: the most recent execution status.&lt;/p&gt;

&lt;p&gt;💡&lt;br&gt;
&lt;em&gt;Remember, the right frequency and period may depend on your use case, and you may need to adjust these values to fit your specific needs.&lt;/em&gt;&lt;/p&gt;
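The effect of choosing 'Maximum' can be sketched in a few lines of Python. This is illustrative only; the mapping mirrors the `status_to_number` function shown earlier, and `most_severe` stands in for what CloudWatch's Maximum statistic computes over the data points of one period:

```python
# Same mapping the Lambda publishes: higher number = more severe state
STATUS_VALUES = {
    'RUNNING': 1,
    'SUCCEEDED': 2,
    'FAILED': 3,
    'TIMED_OUT': 4,
    'ABORTED': 5,
}

def most_severe(statuses):
    """Value CloudWatch's 'Maximum' statistic would report for the period."""
    return max(STATUS_VALUES.get(s, 0) for s in statuses)

# A single FAILED execution dominates the period, even among successes,
# so the alarm (threshold 3) fires:
print(most_severe(['SUCCEEDED', 'FAILED', 'SUCCEEDED']))  # prints 3
```

Because failures map to the largest values, a failed run can never be masked by a later successful data point inside the same period.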

&lt;h2&gt;
  
  
  Step-by-Step: Creating our Monitoring Tool with Terraform
&lt;/h2&gt;

&lt;p&gt;You have many AWS Step Functions to monitor and you're probably thinking, 'Surely, I don't have to set all of this up manually, right?' Fear not, because that's where &lt;strong&gt;Terraform&lt;/strong&gt; comes in. By leveraging Infrastructure as Code, we can automate the process of creating our monitoring dashboard, saving time and ensuring consistent configuration. Let's dive into how we can use Terraform to solve our monitoring problem without having to resort to endless manual setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraforming the Custom Metrics Lambda function
&lt;/h3&gt;

&lt;p&gt;Let's first create the Lambda function using the &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/lambda/aws/latest"&gt;Terraform AWS Lambda Module&lt;/a&gt;. The AWS Lambda function will be responsible for checking the status of each Step Function and then pushing the corresponding status value to CloudWatch.&lt;/p&gt;

&lt;p&gt;Below is a possible Terraform configuration that creates the Lambda function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda_step_function_status"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;source&lt;/span&gt;&lt;span class="w"&gt;                   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraform-aws-modules/lambda/aws"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;version&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.16.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;function_name&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${local.project}-${var.env}-step-function-status-check"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;handler&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_function_status.lambda_handler"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;runtime&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3.8"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;memory_size&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;timeout&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;architectures&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"x86_64"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;publish&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;source_path&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/../source/lambda/step_function_status.py"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;artifacts_dir&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${path.root}/.terraform/lambda-builds/"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;attach_policy_statements&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;policy_statements&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;step_functions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;effect&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;actions&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"states:ListStateMachines"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"states:DescribeStateMachine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"states:DescribeStateMachineForExecution"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;resources&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;cloudwatch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;effect&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;actions&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:PutMetricData"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;resources&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Terraform configuration creates a new AWS Lambda function whose name ends in &lt;code&gt;step-function-status-check&lt;/code&gt; (prefixed with the project and environment). The function uses Python 3.8 as its runtime, and its handler is set to &lt;code&gt;step_function_status.lambda_handler&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;source_path&lt;/code&gt; parameter specifies the location of the Python script that checks the status of the Step Functions and pushes the results to CloudWatch. The Lambda function is granted the permissions it needs to enumerate state machines, read their execution status, and publish metric data to CloudWatch.&lt;/p&gt;

&lt;p&gt;We can then reference this module's outputs (such as the function ARN) in subsequent steps, for example when wiring up the scheduled trigger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraforming a trigger event for Lambda in Eventbridge
&lt;/h3&gt;

&lt;p&gt;Now, let's use an &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/eventbridge/aws/latest"&gt;EventBridge module&lt;/a&gt; to schedule our Lambda function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_function_status_cron_event"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;source&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraform-aws-modules/eventbridge/aws"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.17.2"&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;create_bus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;bus_name&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;rules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;step_function_status_cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;description&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Trigger to Step Function Status Check Lambda"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;schedule_expression&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cron(0 */3 * * ? *)"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Every&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hours&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;targets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;step_function_status_cron&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda_step_function_status_check_cron"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;arn&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;module.lambda_step_function_status.lambda_function_arn&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;jsonencode(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"trigger"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cron"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;create_role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we're creating an &lt;strong&gt;EventBridge rule&lt;/strong&gt; that triggers our Lambda function every 3 hours. The schedule expression, &lt;code&gt;cron(0 */3 * * ? *)&lt;/code&gt;, translates to "at minute 0 past every 3rd hour."&lt;/p&gt;

&lt;p&gt;The target of this rule is the Lambda function we created earlier, referenced through &lt;code&gt;module.lambda_step_function_status.lambda_function_arn&lt;/code&gt;. When the rule fires, EventBridge invokes that function. The &lt;code&gt;input&lt;/code&gt; is optional and can be used to pass specific event data to the Lambda function.&lt;/p&gt;
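On the Lambda side, that payload arrives as the handler's event object. A minimal, hypothetical sketch of how a handler could inspect it (the `"trigger"` key matches the `jsonencode()` input set on the EventBridge target above):

```python
# Illustrative only: reading the EventBridge-supplied input inside a handler.
def lambda_handler(event, context):
    # EventBridge delivers the configured "input" JSON as the event object
    trigger = event.get("trigger", "unknown")
    print(f"Invoked via: {trigger}")
    return trigger
```

This can be handy for logging, or for branching if the same function is ever invoked from more than one source.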

&lt;p&gt;We set &lt;code&gt;create_role&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; because EventBridge does not use an IAM role to invoke a Lambda target; instead, the Lambda function must allow invocation from EventBridge through its resource-based policy (for example via an &lt;code&gt;aws_lambda_permission&lt;/code&gt; resource, or the Lambda module's &lt;code&gt;allowed_triggers&lt;/code&gt; argument).&lt;/p&gt;
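For a Lambda target, EventBridge is authorized through the function's resource-based policy, so the rule must be granted invoke permission explicitly. A minimal, hypothetical sketch (the output names `lambda_function_name` and `eventbridge_rule_arns` are assumed from the respective terraform-aws-modules modules; adjust to your setup):

```hcl
# Illustrative sketch: allow the EventBridge rule to invoke the Lambda.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = module.lambda_step_function_status.lambda_function_name
  principal     = "events.amazonaws.com"
  source_arn    = module.step_function_status_cron_event.eventbridge_rule_arns["step_function_status_cron"]
}
```

The Lambda module's `allowed_triggers` argument can achieve the same result without a standalone resource.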

&lt;h3&gt;
  
  
  Terraforming the Step Functions
&lt;/h3&gt;

&lt;p&gt;For this example, I'll assume we have a list of step function names stored in a Terraform variable. This list will be used to generate each step function and its corresponding alarm. Here's a simplified example, using an &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/step-functions/aws/latest"&gt;AWS Step Functions module&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;variable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_functions"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;description&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A list of step function names"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;list(string)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;default&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"step1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step3"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_functions"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;source&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraform-aws-modules/step-functions/aws"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.7.3"&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;for_each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;toset(var.step_functions)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${each.value}-step-function"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;definition&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;file(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/definitions/${each.value}.json"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;logging_configuration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;include_execution_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;level&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ALL"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;cloudwatch_log_group_name&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/aws/stepfunctions/${each.value}-step-function"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;cloudwatch_log_group_retention_in_days&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we're using the &lt;code&gt;for_each&lt;/code&gt; construct in Terraform to iterate over the list of step function names and create a step function for each. The definition for each step function is assumed to be stored in a separate JSON file in the &lt;code&gt;definitions&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;The output of this module is a map of step function resources, indexed by their name.&lt;/p&gt;
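For example, assuming the module exposes a `state_machine_arn` output (as the terraform-aws-modules implementation does), a single state machine can be looked up by its key:

```hcl
# Illustrative: reference one state machine out of the for_each map
output "step1_state_machine_arn" {
  value = module.step_functions["step1"].state_machine_arn
}
```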

&lt;p&gt;💡&lt;br&gt;
&lt;em&gt;Please remember to replace the placeholders with your actual step function definitions and settings. This is a simplified example, and in a real-world scenario you would probably need to customize this further to match your actual infrastructure and business needs.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraforming Cloudwatch Alarms based on the Custom Metrics
&lt;/h3&gt;

&lt;p&gt;Let's proceed to the &lt;strong&gt;CloudWatch metric and alarm&lt;/strong&gt; setup. For this, we will use the &lt;code&gt;aws_cloudwatch_metric_alarm&lt;/code&gt; resource, which will create an alarm for each of our Step Functions. We will use the &lt;em&gt;Maximum&lt;/em&gt; statistic of our custom metric and set a &lt;strong&gt;threshold&lt;/strong&gt;, so that an alarm is triggered if the Step Function fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_cloudwatch_metric_alarm"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_function_alarm"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;for_each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;module.step_functions&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;alarm_name&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StepFunctionStatusAlarm-${each.key}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;comparison_operator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GreaterThanOrEqualToThreshold"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;evaluation_periods&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;metric_name&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${each.key}_Status"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;namespace&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StepFunctions"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;period&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3600"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;less&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;than&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;execution&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;time&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Step&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Function&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;statistic&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Maximum"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;threshold&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;status&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;code&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;failure&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;alarm_description&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This metric checks status of Step Function ${each.key}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;alarm_actions&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Add&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;any&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;actions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;want&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;triggered&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;when&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;alarm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;goes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;off&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;treat_missing_data&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"missing"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create an &lt;strong&gt;alarm&lt;/strong&gt; for each of the Step Functions, and trigger it if the status of the last execution is higher than the number corresponding to 'SUCCEEDED'.&lt;/p&gt;

&lt;p&gt;💡&lt;br&gt;
&lt;em&gt;Please be aware that the period of the alarm should be set to a value that is less than the execution time of the Step Function. This is to ensure that the alarm always considers only the last execution of the Step Function.&lt;/em&gt;&lt;/p&gt;
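&lt;p&gt;As a quick sanity check of the alarm logic, here is a small Python sketch. The numeric encoding of execution statuses below is an &lt;em&gt;assumption&lt;/em&gt; made purely for illustration (the setup only guarantees that 'SUCCEEDED' maps to a value at or below the threshold of 3); adjust it to whatever your metric actually publishes:&lt;/p&gt;

```python
# Hypothetical numeric encoding of Step Function execution statuses.
# Assumed convention: values above the alarm threshold (3) indicate
# a non-successful last execution.
STATUS_VALUES = {
    "SUCCEEDED": 0,
    "RUNNING": 1,
    "TIMED_OUT": 4,
    "ABORTED": 5,
    "FAILED": 6,
}

ALARM_THRESHOLD = 3  # mirrors `threshold = 3` in the Terraform alarm


def alarm_fires(last_execution_status: str) -> bool:
    """Return True when the encoded status breaches the alarm threshold."""
    return STATUS_VALUES[last_execution_status] > ALARM_THRESHOLD
```

&lt;p&gt;With this encoding, a 'SUCCEEDED' or 'RUNNING' execution stays below the threshold and keeps the alarm quiet, while any failure state breaches it.&lt;/p&gt;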

&lt;h3&gt;
  
  
  Terraforming the Guardian dashboard
&lt;/h3&gt;

&lt;p&gt;Finally, let's create our &lt;strong&gt;dashboard&lt;/strong&gt; to keep an eye on all our Step Functions. We will use the &lt;code&gt;aws_cloudwatch_dashboard&lt;/code&gt; resource to do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws_cloudwatch_dashboard"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"step_function_dashboard"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;dashboard_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"StepFunctionStatusDashboard"&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;dashboard_body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;jsonencode(&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;widgets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alarm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Step Functions Last Execution Status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"alarms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;values(aws_cloudwatch_metric_alarm.step_function_alarm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;.arn)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a CloudWatch Dashboard with a single widget that shows the status of all our Step Function alarms. This way, you can quickly glance at the dashboard and see if there are any issues with your Step Functions, as shown in the screenshot below.&lt;/p&gt;
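&lt;p&gt;If you want to preview the &lt;code&gt;dashboard_body&lt;/code&gt; payload outside of Terraform, the same JSON that &lt;code&gt;jsonencode()&lt;/code&gt; produces can be assembled in plain Python; the alarm ARNs here are placeholders standing in for the ones Terraform resolves:&lt;/p&gt;

```python
import json

# Placeholder ARNs standing in for
# values(aws_cloudwatch_metric_alarm.step_function_alarm)[*].arn
alarm_arns = [
    "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:step-function-a",
    "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:step-function-b",
]

# Same shape as the jsonencode(...) argument in the Terraform resource
dashboard_body = json.dumps({
    "widgets": [
        {
            "type": "alarm",
            "x": 0,
            "y": 0,
            "width": 24,
            "height": 6,
            "properties": {
                "title": "Step Functions Last Execution Status",
                "alarms": alarm_arns,
            },
        }
    ]
})
```

&lt;p&gt;The resulting string is exactly what ends up as the dashboard definition, which makes it easy to eyeball the widget layout before applying the Terraform.&lt;/p&gt;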

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1_euDIYY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rw6kzkb2x6relu7ykwpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1_euDIYY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rw6kzkb2x6relu7ykwpn.png" alt="" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, we've crafted an efficient, &lt;strong&gt;automated system to monitor&lt;/strong&gt; the status of numerous AWS Step Functions, and &lt;strong&gt;visualized&lt;/strong&gt; this data in an easily &lt;strong&gt;digestible&lt;/strong&gt; dashboard. This solution not only saves you time by avoiding manual checks but also provides a real-time representation of the health of your processes.&lt;/p&gt;

&lt;p&gt;This system is flexible, customizable, and can be adapted to monitor different types of Step Functions or to include multiple alarms per function. We've utilized AWS services and Terraform to ensure it can keep up with dynamic cloud environments and be easily &lt;strong&gt;adjustable&lt;/strong&gt; to meet your specific needs.&lt;/p&gt;

&lt;p&gt;By 'keeping an eye on the &lt;strong&gt;herd&lt;/strong&gt;', we emphasize the importance of reliable, automated monitoring in today's complex IT landscapes. The goal of this solution is to enhance &lt;strong&gt;operational efficiency&lt;/strong&gt;, aid in troubleshooting, and ensure the smooth running of your business processes. Keep coding and monitoring smart!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>stepfunctions</category>
      <category>monitoring</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Push the Green Button: Creating Event Gadgets with IoT and Serverless Architecture</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Thu, 22 Jun 2023 20:32:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/push-the-green-button-creating-event-gadgets-with-iot-and-serverless-architecture-3dlk</link>
      <guid>https://dev.to/aws-builders/push-the-green-button-creating-event-gadgets-with-iot-and-serverless-architecture-3dlk</guid>
      <description>&lt;p&gt;Preparing for the &lt;a href="https://aws.amazon.com/it/events/summits/milano/"&gt;AWS Summit Milano 2023&lt;/a&gt; not only as an &lt;a href="https://aws.amazon.com/developer/community/community-builders/community-builders-directory/"&gt;AWS Community Builder&lt;/a&gt; but as a representative of my company (sponsor of the event), I found myself grappling with an issue that seems small but is indeed profound - the ubiquitous, forgettable, and somewhat outdated practice of &lt;strong&gt;corporate giveaways&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of contributing to the mountain of corporate freebies that usually end up in a drawer somewhere, I wondered, why not leverage my tech know-how for something more meaningful and sustainable? Thus was born the idea to create a simple but compelling swag: a unique, sustainable memento, in the form of &lt;strong&gt;a tree planted for our visitors.&lt;/strong&gt; 🌱&lt;/p&gt;

&lt;h2&gt;
  
  
  Greening the Gadget: An Unconventional Approach to Event Giveaways
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: This chapter details my journey to develop an idea for an eco-friendly gadget for the AWS Summit. The project involves creating a physical button that, when pressed, starts a process to plant a tree through Tree-nation, yielding a unique URL for each tree. This URL is then transformed into a QR code through AWS Lambda, giving me a tangible, scannable memento that participants can take home and redeem at their convenience. If you're primarily interested in the technical side of this endeavor, feel free to skip ahead to the next section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Tree-gifting Made Easy
&lt;/h3&gt;

&lt;p&gt;As I explored potential platforms for my project, I found many that offered the opportunity to &lt;strong&gt;fund reforestation&lt;/strong&gt; efforts across the globe. Some even had options for &lt;strong&gt;gifting trees&lt;/strong&gt; - a sweet gesture, isn't it? But here's where the challenge arose: nearly all required prior registration of the gift recipient, complete with name and email address.&lt;/p&gt;

&lt;p&gt;This didn't quite fit the &lt;strong&gt;vision&lt;/strong&gt; I had. I wanted the process to be swift, convenient and not involve me handling personal data or permissions. Who wants to fill out forms in a crowded event?&lt;/p&gt;

&lt;p&gt;Luckily, I found &lt;a href="https://tree-nation.com/"&gt;Tree-nation&lt;/a&gt;. Not only do they have &lt;a href="https://kb.tree-nation.com/knowledge/api-manual"&gt;a well-documented API platform&lt;/a&gt; to interact with, but they also offer the option to gift trees 'anonymously'. This meant that I could buy a tree as a gift and provide my user a URL; the user could then independently redeem their tree. No exchange of personal information and no long sign-ups at my booth during the event.&lt;/p&gt;

&lt;p&gt;Now, having a URL for each tree was handy, but the challenge was: how to efficiently share these URLs at an event? The obvious answer is a &lt;strong&gt;QR code&lt;/strong&gt;, printed and handed out to the user, so that it would serve as the physical gadget from the event. An eco-friendly token of their contribution to the greener cause, that they could hold in their hands, take home, and scan whenever they chose. No need to scan it right then and there, no strings attached. They could redeem their tree whenever they were ready.&lt;/p&gt;

&lt;p&gt;Turning a URL into a QR code seemed like a task tailor-made for an &lt;strong&gt;AWS Lambda function&lt;/strong&gt;. So, armed with the plan to use QR codes and AWS Lambda, the user journey was starting to shape up quite nicely.&lt;/p&gt;
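&lt;p&gt;As a rough sketch of that Lambda's shape (the event format and helper names here are &lt;em&gt;hypothetical&lt;/em&gt;, and the fake QR encoder merely stands in for a real one such as the third-party &lt;code&gt;qrcode&lt;/code&gt; package):&lt;/p&gt;

```python
import base64


def make_qr_png(url: str) -> bytes:
    """Stand-in for a real QR encoder; the actual function could use the
    third-party `qrcode` package. Here we just return fake PNG bytes."""
    return b"\x89PNG..." + url.encode()


def lambda_handler(event, context, qr_encoder=make_qr_png):
    """Hypothetical handler: take a tree URL, hand back a base64 PNG."""
    tree_url = event["tree_url"]  # assumed event shape
    png = qr_encoder(tree_url)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "image/png"},
        "isBase64Encoded": True,
        "body": base64.b64encode(png).decode("ascii"),
    }
```

&lt;p&gt;Injecting the encoder as a parameter keeps the handler testable without the imaging dependency installed.&lt;/p&gt;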

&lt;h3&gt;
  
  
  Button Chronicles: Swapping Digital for the Real Deal
&lt;/h3&gt;

&lt;p&gt;With the destination locked in thanks to Tree-nation, it was time to figure out the journey. The original plan? A virtual button on a &lt;strong&gt;web page or app&lt;/strong&gt;, ready to be displayed on a tablet at the AWS Summit. With just a tap from a visitor, the tree-planting process would be set in motion. I quickly started sketching out a prototype to bring this concept to life.&lt;/p&gt;

&lt;p&gt;However, as the prototype took shape, I realized it wasn't hitting the mark. It was essentially a webpage interacting with an API, which felt a tad &lt;strong&gt;ordinary&lt;/strong&gt;. I was looking for something with a bit more spark, a little more 'wow' factor, or at least more unusual.&lt;/p&gt;

&lt;p&gt;So, the virtual button was out, and a real, physical button took its place in my idea. It felt risky as I hadn't ventured into programming an electronic board before. But it was a thrilling challenge and an opportunity to break new ground. Anyway, this switch wasn't just about making the project more exciting; it was about adding a tangible touch to the user experience, and giving the project an original edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pressing Forward: The Button and Code Considerations
&lt;/h2&gt;

&lt;p&gt;Navigating through my idea, I immediately liked an &lt;a href="https://www.rowse-automation.co.uk/abb-1sfa619101r1022"&gt;ABB normally-open push button&lt;/a&gt; that belonged to my household. To connect this button to the digital world, I opted for an &lt;a href="https://en.wikipedia.org/wiki/ESP8266"&gt;ESP8266 electronic board&lt;/a&gt; and used &lt;a href="https://support.arduino.cc/hc/en-us/articles/360019833020-Download-and-install-Arduino-IDE"&gt;Arduino IDE&lt;/a&gt; as the development platform to create and flash the necessary code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JSkVXgSw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk9s9g6jbi5jati73d1t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JSkVXgSw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk9s9g6jbi5jati73d1t.jpg" alt="" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From an architectural perspective, I ventured down several paths before finding the best solution for my specific use case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simplified Approach with Hardcoded Credentials&lt;/strong&gt;: The original blueprint of my project was pretty straightforward. I created an &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/urls-invocation.html"&gt;AWS Lambda function directly exposed to the Internet&lt;/a&gt;, acting as an endpoint for a simple HTTPS request.&lt;/p&gt;

&lt;p&gt;In this initial "&lt;strong&gt;No-Auth&lt;/strong&gt;" setup, a user/password combination was directly verified by the Lambda code. &lt;strong&gt;AWS API Gateway&lt;/strong&gt; was also considered in this scenario, since it could serve as a robust and secure front door to manage the Lambda function. However, it seemed like an &lt;strong&gt;overkill&lt;/strong&gt; for a single-function project of this nature, pushing me to consider a more streamlined approach.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;IAM-Authenticated Approach&lt;/strong&gt;: I then evaluated an approach that hinged on &lt;strong&gt;AWS IAM&lt;/strong&gt; to authenticate requests. This would ensure robust security but would also add a layer of complexity to the solution.&lt;/p&gt;

&lt;p&gt;Whether directly invoking the AWS Lambda function or sending a message to an &lt;strong&gt;Amazon SQS queue&lt;/strong&gt;, the device would need to include &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html"&gt;a Sigv4 signature in its request&lt;/a&gt;. This signature, generated using an Access Key and Secret Key combination, would be verified by AWS before proceeding with the request.&lt;/p&gt;

&lt;p&gt;While this approach certainly enhanced security, it did not come without drawbacks. Notably, the time taken to verify the Sigv4 signature &lt;strong&gt;significantly affected performance&lt;/strong&gt;. The authentication process alone took around 10 seconds, and when adding the 2-second execution time of the Lambda function, the total operation time ballooned to 12 seconds. Given this drawback, I moved on to explore another option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS IoT Core:&lt;/strong&gt; Given that my system was physical, I turned to &lt;strong&gt;AWS IoT Core&lt;/strong&gt;. Devices registered on IoT Core have the native ability to communicate through queues with MQTT protocol (both publish and subscribe) and HTTPS (publish only). This interaction occurs through &lt;strong&gt;SSL certificates&lt;/strong&gt; signed by AWS and installed on the device. Initially, I thought the MQTT protocol wasn't suitable for my use case because it required a constantly open connection. My button was a one-shot device and thus &lt;strong&gt;HTTPS seemed more appropriate&lt;/strong&gt;. However, after rewriting the code, I found that inserting a message into the queue took around 7 seconds, which was better than before, but &lt;strong&gt;not exactly impressive&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MQTT with open connection:&lt;/strong&gt; Finally, I wrote another version of the code, which established an &lt;strong&gt;MQTT connection&lt;/strong&gt; authenticated through certificates at device startup and kept it open. With this approach, sending a message at the button press was almost &lt;strong&gt;instantaneous&lt;/strong&gt;! This was a significant improvement. The execution time for Lambda remained the same (2 seconds), but with a better authentication system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After weighing the pros and cons, I settled on the MQTT solution. Some might argue that a button should not maintain an open connection, as it would be more suitable for devices sending continuous sensor data. However, for my particular use case, I deemed it an &lt;strong&gt;acceptable compromise&lt;/strong&gt;. As for the HTTPS solutions, while I could have implemented better systems (JWT or other types of authentication), I found it &lt;strong&gt;out of scope&lt;/strong&gt; for my project and wanted to use something &lt;strong&gt;readily available&lt;/strong&gt;, such as IAM or SSL certificates.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In light of the performance measured, I chose the compromise that seemed most acceptable to me. This doesn't mean it will work for everyone. I would always recommend considering your specific use case and determining what would work best for your situation.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a visitor to physically take away a token of their participation, the project needed to incorporate a mechanism to create a &lt;strong&gt;physical output&lt;/strong&gt;. This requirement introduced a printer into the mix. More specifically, a printer that could generate QR codes linked to the newly planted trees. To orchestrate this, I decided to leverage AWS IoT again, by registering a &lt;strong&gt;label printer&lt;/strong&gt; as a device and using the AWS IoT Jobs service to send print requests as needed. Thus, the final step of our Lambda function involves the creation of an &lt;a href="https://docs.aws.amazon.com/iot/latest/developerguide/iot-jobs.html"&gt;&lt;strong&gt;AWS IoT Job&lt;/strong&gt;&lt;/a&gt; that instructs the printer to print the respective QR code.&lt;/p&gt;
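&lt;p&gt;To make that last step concrete, here is a hedged Python sketch of how the print request could be assembled as an AWS IoT job. Note that IoT Jobs does not impose a schema on the job document, so the &lt;code&gt;action&lt;/code&gt;/&lt;code&gt;url&lt;/code&gt; keys below are an &lt;em&gt;assumption&lt;/em&gt;: the printer-side code defines what it parses. In the real Lambda this dict would be passed to boto3's &lt;code&gt;iot.create_job&lt;/code&gt;:&lt;/p&gt;

```python
import json
import uuid


def build_print_job(qr_png_url: str, printer_thing_arn: str) -> dict:
    """Assemble the arguments for a hypothetical iot.create_job call.

    The job document keys ("action", "url") are an assumption: AWS IoT
    Jobs accepts any JSON document, and the printer firmware decides
    how to interpret it.
    """
    return {
        "jobId": f"print-qr-{uuid.uuid4()}",
        "targets": [printer_thing_arn],
        "document": json.dumps({"action": "print_qr", "url": qr_png_url}),
    }
```

&lt;p&gt;A fresh &lt;code&gt;jobId&lt;/code&gt; per press keeps every print request independent, so a second visitor never collides with a job that is still in flight.&lt;/p&gt;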

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--olNDKf16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrq6z8ghv49dculqqndw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--olNDKf16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrq6z8ghv49dculqqndw.jpg" alt="" width="800" height="865"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Green Code Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unrolling the AWS infrastructure
&lt;/h3&gt;

&lt;p&gt;To manage the AWS infrastructure, I opted for &lt;a href="https://aws.amazon.com/cdk/"&gt;AWS Cloud Development Kit (AWS CDK)&lt;/a&gt; in Python. You can find the &lt;a href="https://github.com/theonlymonica/aws-iot-tree-button"&gt;complete solution on my Github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I started by creating an Internet of Things (IoT) &lt;strong&gt;thing&lt;/strong&gt;, which is a representation of a specific device or logical entity. In this case, it represents the button in my system.&lt;/p&gt;

&lt;p&gt;Next, I generated a certificate signing request (CSR) for my IoT thing, which is used to request a device &lt;strong&gt;certificate&lt;/strong&gt;. This certificate allows the device to connect to AWS IoT. To manage permissions, I created an IoT policy and attached it to my IoT thing.&lt;/p&gt;

&lt;p&gt;After creating the virtual resource on AWS, I can &lt;strong&gt;download&lt;/strong&gt; the SSL certificate that AWS generated through the CSR I provided.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Diving into the Device Code: AWS IoT Connectivity with ESP8266&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, we move on to the C++ code written for the &lt;strong&gt;ESP8266&lt;/strong&gt; board. I started by creating a &lt;code&gt;Secrets.h&lt;/code&gt; file where I inserted the certificate we just downloaded, the corresponding private key (created by CDK), and AWS's root CA (&lt;a href="https://www.amazontrust.com/repository/AmazonRootCA1.pem"&gt;download it here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;pgmspace.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="cp"&gt;#define SECRET
&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;WIFI_SSID&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"XXXXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;WIFI_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"YYYYYYYYY"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cp"&gt;#define THINGNAME "button"
&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;MQTT_HOST&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"abcdefghijkl-ats.iot.eu-west-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Amazon Root CA 1&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;cacert&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;PROGMEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;R"EOF(
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
)EOF"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Device Certificate&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;client_cert&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;PROGMEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;R"KEY(
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
)KEY"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Device Private Key&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;privkey&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;PROGMEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;R"KEY(
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
)KEY"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;MQTT_HOST&lt;/code&gt; value can be retrieved by executing &lt;code&gt;aws iot describe-endpoint --endpoint-type iot:Data-ATS&lt;/code&gt; in the AWS CLI.&lt;/p&gt;

&lt;p&gt;Several key functions that power the IoT button application are included in the &lt;code&gt;.ino&lt;/code&gt; file that I flash on my device:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The setup() Function&lt;/strong&gt;: this function is called once at the beginning when the device is powered up:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Serial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;115200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;pinMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BUTTON_PIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INPUT_PULLUP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;connectAWS&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The function begins by setting up the serial communication for debugging purposes and configuring the &lt;strong&gt;button's input pin&lt;/strong&gt; (corresponding to the physical pin on the device where the button is wired). Then, it invokes the &lt;code&gt;connectAWS()&lt;/code&gt; function.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The connectAWS() Function&lt;/strong&gt;: this function is responsible for establishing a connection to the AWS IoT Core service:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;connectAWS&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;WiFi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WIFI_STA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;WiFi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WIFI_SSID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WIFI_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;//... Connecting to WiFi ...&lt;/span&gt;
  &lt;span class="n"&gt;NTPConnect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setTrustAnchors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cert&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setClientRSACert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;client_crt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MQTT_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8883&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageReceived&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="n"&gt;reconnectAWS&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This function connects the ESP8266 to the WiFi network, synchronizes the device's clock via NTP, &lt;strong&gt;sets the trust anchors and the client certificate/key&lt;/strong&gt; on the secure WiFi client &lt;code&gt;net&lt;/code&gt;, and configures the MQTT server host and port on the MQTT client &lt;code&gt;client&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The reconnectAWS() Function&lt;/strong&gt;: as the name suggests, this function establishes (and, when needed, re-establishes) the connection to the AWS IoT Core service:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;reconnectAWS&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connected&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;THINGNAME&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AWS_IOT_SUBSCRIBE_TOPIC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This function keeps trying to reconnect to the AWS IoT Core service as long as the client is not connected. If the connection is successful, it subscribes to the MQTT topic defined by &lt;code&gt;AWS_IOT_SUBSCRIBE_TOPIC&lt;/code&gt; (this serves as the first "connection" with the platform, even if we are not waiting for any messages). If the connection is not successful, the function waits for 5 seconds before retrying the connection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The loop() Function&lt;/strong&gt;: The &lt;code&gt;loop()&lt;/code&gt; function runs in a loop after the &lt;code&gt;setup()&lt;/code&gt; function completes:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connected&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; 
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reconnectAWS&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ... button reading and debouncing code ...&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button_state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;LOW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;message_sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;publishMessage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="n"&gt;message_sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;message_sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In this function, I first check if the device is still connected to AWS IoT Core. If not, it tries to reconnect using the &lt;code&gt;reconnectAWS()&lt;/code&gt; function. After that, it checks the button's state. If the button is &lt;strong&gt;pressed&lt;/strong&gt; (indicated by a state of &lt;strong&gt;LOW&lt;/strong&gt;), a message is published to AWS IoT Core if it hasn't been sent already. This message sending is guarded by the &lt;code&gt;message_sent&lt;/code&gt; variable to ensure that only &lt;strong&gt;one message is sent per button press&lt;/strong&gt;. After the button is released, &lt;code&gt;message_sent&lt;/code&gt; is reset to false, enabling the next button press to send a message. Lastly, &lt;code&gt;client.loop()&lt;/code&gt; is called to allow the MQTT client to &lt;strong&gt;maintain&lt;/strong&gt; its connection to the server.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
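&lt;p&gt;The one-message-per-press guard is easy to model in isolation. Here is a minimal Python sketch (not the firmware itself; the function name and the 0/1 encoding of LOW/HIGH are illustrative) of the same edge-detection logic:&lt;/p&gt;

```python
# Sketch of the "one message per press" guard from loop(),
# modeled as a pure function over a sequence of button readings.
# LOW (pressed) is represented as 0, HIGH (released) as 1.

def count_publishes(button_states):
    """Return how many messages would be published for the readings."""
    message_sent = False
    published = 0
    for state in button_states:
        if state == 0:          # button pressed (LOW)
            if not message_sent:
                published += 1  # publishMessage() would fire here
                message_sent = True
        else:                   # button released: re-arm the guard
            message_sent = False
    return published

# A long press publishes once; releasing and pressing again publishes again.
print(count_publishes([1, 0, 0, 0, 1, 0, 1]))  # 2
```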

&lt;h3&gt;
  
  
  Lambda in the middle
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;workhorse&lt;/strong&gt; responsible for executing my project's logic is an &lt;strong&gt;AWS Lambda function&lt;/strong&gt; that sends a request to the &lt;a href="https://documenter.getpostman.com/view/6643991/S17m1X5P#4d57b1af-95c4-4b62-9644-a719d49d32ce"&gt;Tree-nation API to plant a tree&lt;/a&gt;. The request includes the Tree-nation &lt;code&gt;planter_id&lt;/code&gt;, the ID associated with the Tree-nation &lt;strong&gt;account&lt;/strong&gt; (provided by Tree-nation's support team); together with the token obtained from Tree-nation and stored as an SSM parameter, it is used to authenticate API requests. The request also includes the selected &lt;code&gt;species_id&lt;/code&gt;: I picked several &lt;strong&gt;tree species&lt;/strong&gt; from the Tree-nation catalogue, and on each operation one species ID is &lt;strong&gt;randomly&lt;/strong&gt; selected from that list. If the request is not successful, the function retries after a brief wait.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s"&gt;"recipients"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s"&gt;"internal_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="s"&gt;"planter_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;planter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"species_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Content-Type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'application/json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Authorization'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Bearer '&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
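&lt;p&gt;The retry behaviour mentioned above can be sketched as a small wrapper. This is a conceptual Python sketch, not the exact Lambda code; &lt;code&gt;send_request&lt;/code&gt;, the attempt count and the wait time are assumptions:&lt;/p&gt;

```python
import time

def post_with_retry(send_request, max_attempts=3, wait_seconds=2):
    """Call send_request() until it reports success, waiting briefly
    between attempts. send_request must return an object with a
    status_code attribute, like a requests.Response."""
    response = None
    for _ in range(max_attempts):
        response = send_request()
        if response.status_code == 200:
            return response
        time.sleep(wait_seconds)  # brief wait before the next attempt
    return response  # last failed response after exhausting attempts
```

In the real function, the wrapped call is the `requests.request("POST", ...)` shown above.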



&lt;p&gt;Upon receiving a successful response from Tree-nation, the function generates a &lt;strong&gt;QR code&lt;/strong&gt; that links to the newly planted tree's page on the Tree-nation website.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;collect_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'trees'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'collect_url'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collect_url&lt;/span&gt;
&lt;span class="n"&gt;qr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qrcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QRCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;box_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;border&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;make_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'black'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;back_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'white'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated QR code is then saved as an &lt;strong&gt;image&lt;/strong&gt; in an Amazon S3 bucket. Additionally, the function stores the image's S3 URL in a DynamoDB table for future reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'ID'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'treenation_id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;treenation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'payment_id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payment_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'URL'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".s3.amazonaws.com/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
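&lt;p&gt;Note that the &lt;code&gt;URL&lt;/code&gt; attribute is assembled by plain string concatenation into the virtual-hosted-style S3 object URL (the bucket name below is a made-up example; the URL only resolves publicly if the bucket policy allows it):&lt;/p&gt;

```python
def object_url(bucket, imageid):
    # Virtual-hosted-style S3 URL, as stored in the DynamoDB item
    return "https://" + bucket + ".s3.amazonaws.com/" + imageid + ".png"

print(object_url("my-qrcode-bucket", "abc123"))
# https://my-qrcode-bucket.s3.amazonaws.com/abc123.png
```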



&lt;p&gt;Lastly, the function creates an &lt;strong&gt;AWS IoT Job&lt;/strong&gt; that sends a command to the registered printer device. For this purpose, a &lt;a href="https://docs.aws.amazon.com/iot/latest/developerguide/job-templates.html"&gt;&lt;strong&gt;job template&lt;/strong&gt;&lt;/a&gt; provided by AWS is used, which runs a command — in this case, a shell script hosted on the device with a parameter value corresponding to the S3 URL of the image to print.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;jobId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'RunCommand-'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;jobTemplateArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'arn:aws:iot:'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'::jobtemplate/AWS-Run-Command:1.0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documentParameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'command'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"/opt/print.sh,s3://"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;imageid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".png"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
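&lt;p&gt;Note the format of the &lt;code&gt;command&lt;/code&gt; parameter: the &lt;code&gt;AWS-Run-Command&lt;/code&gt; job template packs the executable and its arguments into a single comma-separated string. On the device, this is split back into an argv-style command line, conceptually like this (bucket and image ID are made-up examples):&lt;/p&gt;

```python
# The job document's "command" value packs the script path and its
# argument into one comma-separated string; the device-side handler
# splits it back into a list before executing it.
command = "/opt/print.sh,s3://my-qrcode-bucket/abc123.png"
argv = command.split(",")
print(argv)  # ['/opt/print.sh', 's3://my-qrcode-bucket/abc123.png']
```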



&lt;h3&gt;
  
  
  Printing the QR Code automatically
&lt;/h3&gt;

&lt;p&gt;The final step in our process is the physical output: the printed QR code that we hand to our event attendee. For this task, I borrowed a &lt;strong&gt;Brother QL-500 label printer&lt;/strong&gt; from the equipment available at the company. An open-source Python library on GitHub, &lt;a href="https://github.com/pklaus/brother_ql"&gt;brother_ql&lt;/a&gt;, makes interacting with this printer incredibly straightforward from a Linux command line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--frYlI8Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1sryowgrkw1a70u2pqlp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--frYlI8Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1sryowgrkw1a70u2pqlp.jpg" alt="" width="500" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My interaction device was an old &lt;strong&gt;Raspberry Pi 2&lt;/strong&gt; equipped with a &lt;strong&gt;Wi-Fi dongle&lt;/strong&gt;, connected to the Brother QL-500 label printer with a &lt;strong&gt;standard USB cable&lt;/strong&gt;. Since the &lt;code&gt;brother_ql&lt;/code&gt; library enables easy command-line usage, the Raspberry Pi was a perfect choice for this setup: it allowed the device to receive AWS IoT Jobs, process the commands, and consequently drive the printer to produce the QR code labels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IPfgs1YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hx115mhymahjvy3i5smo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IPfgs1YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hx115mhymahjvy3i5smo.jpg" alt="" width="500" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used the AWS CDK again to create the necessary resources on &lt;strong&gt;AWS IoT&lt;/strong&gt;. The setup for this process was very similar to the one I performed for the button. I saved the &lt;strong&gt;SSL certificates&lt;/strong&gt; onto the Raspberry Pi to make them available for the device's connection with AWS IoT.&lt;/p&gt;

&lt;p&gt;To receive commands from AWS IoT and translate them into actions performed by the printer, the Raspberry Pi uses the &lt;a href="https://docs.aws.amazon.com/iot/latest/developerguide/iot-sdks.html"&gt;aws-iot-device-client&lt;/a&gt;, which I downloaded and installed on it. You can follow &lt;a href="https://github.com/awslabs/aws-iot-device-client"&gt;these instructions&lt;/a&gt; to do the same.&lt;/p&gt;

&lt;p&gt;This installation includes a &lt;strong&gt;systemd&lt;/strong&gt; service that makes your device start communicating with AWS IoT automatically whenever it boots up.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;configuration file&lt;/strong&gt; &lt;code&gt;/etc/.aws-iot-device-client/aws-iot-device-client.conf&lt;/code&gt; sets up the connection between the Raspberry Pi device and the AWS IoT Core, enabling the necessary functionalities for the system. It uses the SSL certificates previously saved onto the Raspberry Pi to establish a secure connection. &lt;strong&gt;Jobs are enabled as the primary functionality&lt;/strong&gt;, as that's how the printing instructions are received from the Lambda function.&lt;/p&gt;

&lt;p&gt;A sample setup can be seen in the official configuration file template for the AWS IoT Device Client, available at this &lt;a href="https://github.com/awslabs/aws-iot-device-client/blob/main/config-template.json"&gt;link&lt;/a&gt;: it is fully annotated, with each section's purpose explained in detail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abcdefghijkl-ats.iot.eu-west-1.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cert"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/opt/certs/greengadget_printer.public.crt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/opt/certs/greengadget_printer.private.key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"root-ca"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/opt/certs/root-CA.crt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thing-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"printer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logging"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enable-sdk-logging"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DEBUG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STDOUT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"handler-directory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, the device is ready to securely connect to AWS IoT and receive the Jobs containing the instructions to print the QR codes.&lt;/p&gt;

&lt;p&gt;As a final detail, the &lt;code&gt;print.sh&lt;/code&gt; shell script that the Job is instructed to run is exceptionally straightforward. Its function is to set the necessary parameters for the Brother printer, &lt;strong&gt;fetch&lt;/strong&gt; the QR code image from the provided S3 URL, print it using the &lt;code&gt;brother_ql&lt;/code&gt; command line tool, and finally remove the temporary file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BROTHER_QL_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;QL-500
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BROTHER_QL_PRINTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;file:///dev/usb/lp0
&lt;span class="nv"&gt;TMP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/file.png
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt; &lt;span class="nv"&gt;$TMP_FILE&lt;/span&gt;
brother_ql print &lt;span class="nt"&gt;-l&lt;/span&gt; 38 &lt;span class="nv"&gt;$TMP_FILE&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nv"&gt;$TMP_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;BROTHER_QL_MODEL&lt;/code&gt; and &lt;code&gt;BROTHER_QL_PRINTER&lt;/code&gt; environment variables are set to specify the printer model and its connection interface respectively. Here, &lt;code&gt;file:///dev/usb/lp0&lt;/code&gt; represents a connection through the first USB printer device file in a Linux system.&lt;/p&gt;

&lt;p&gt;With this script, &lt;strong&gt;the printing process becomes fully automated&lt;/strong&gt;. Every time a new Job is created by the Lambda function, the device receives the Job, runs the script, fetches the correct QR code from S3, and prints it out. It's a seamless, hands-off process, right from the button press to the physical QR code label in hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;BONUS Section: The Art of DIY&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In addition to the technical details, there is an equally fun part of the project: &lt;strong&gt;DIY crafting&lt;/strong&gt;! After all, for the event, I didn't want the button and its electronic board to be simply left unprotected on a table. It needed a touch of &lt;strong&gt;design&lt;/strong&gt;, albeit delivered with a playful tone.&lt;/p&gt;

&lt;p&gt;The idea was to construct a &lt;strong&gt;small box&lt;/strong&gt; to house the electronic board, with a hole for the button to protrude through. 3D printing was the first idea that came to my mind; however, it soon dawned on me that producing a &lt;strong&gt;plastic&lt;/strong&gt; object was at odds with our goal of &lt;strong&gt;sustainable&lt;/strong&gt; gifting. It was then that I pivoted to a more eco-friendly material - &lt;strong&gt;wood&lt;/strong&gt;. This choice added a touch of warmth and charm to the final product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zQLf7xlr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bsf3mict61g27873slm1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zQLf7xlr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bsf3mict61g27873slm1.jpg" alt="" width="500" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The box ended up with a drawer that can be opened to reveal the board:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ba29saRy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff8h4njgpgdy3bk9eoya.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ba29saRy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff8h4njgpgdy3bk9eoya.jpg" alt="" width="500" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of this was painted with water-based paint:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xqSuHONZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uyveiiv2e8psr0qpxx1c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xqSuHONZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uyveiiv2e8psr0qpxx1c.jpg" alt="" width="500" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I carved out a pocket in the wood to fit another piece of wood that symbolizes a stylized tree (one I already owned and simply repainted to match).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the final result!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OP0EIDZ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p9wdbxtxqxffrbjys9v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OP0EIDZ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p9wdbxtxqxffrbjys9v.jpg" alt="" width="500" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm not a crafting pro, but I had immense &lt;strong&gt;fun&lt;/strong&gt; with this DIY part of the project; it's incredibly rewarding work! Do you like it?&lt;/p&gt;

&lt;p&gt;So, I'm happy to think that our job is not just about creating high-tech applications or services - it can also be a way to promote &lt;strong&gt;sustainability&lt;/strong&gt; in creative and fun ways. I hope this journey at the crossroads of technology and the environment has inspired you as much as it did me. In the end, I transformed a simple button press into a &lt;strong&gt;real-world positive impact&lt;/strong&gt;. And let me tell you, there's nothing quite like seeing your code come to life... and plant a tree! 🌱&lt;/p&gt;

</description>
      <category>aws</category>
      <category>iot</category>
      <category>serverless</category>
      <category>greentech</category>
    </item>
    <item>
      <title>Automating the injection of CI/CD runtime information into Terraform provider</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Fri, 31 Mar 2023 18:35:29 +0000</pubDate>
      <link>https://dev.to/aws-builders/automating-the-injection-of-cicd-runtime-information-into-terraform-code-3ph4</link>
      <guid>https://dev.to/aws-builders/automating-the-injection-of-cicd-runtime-information-into-terraform-code-3ph4</guid>
      <description>&lt;p&gt;As a DevOps engineer or software developer, you may have encountered scenarios where you must inject CI/CD &lt;strong&gt;runtime information into your Terraform provider code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This information could be anything from &lt;strong&gt;environment-specific variables&lt;/strong&gt; to runtime configuration values &lt;strong&gt;only available during the CI/CD process&lt;/strong&gt;. However, provider usage is evaluated very early on in the Terraform run, before we have enough context to do variable interpolations, so you can't use variables there (like you can normally do with standard resources and &lt;a href="https://developer.hashicorp.com/terraform/language/values/variables#environment-variables"&gt;environment variables&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;I stumbled upon a real-world use case dealing with some Terraform code to create &lt;strong&gt;AWS&lt;/strong&gt; resources, where I need to add information about the role ARN to be assumed, and this information &lt;strong&gt;cannot be statically inserted&lt;/strong&gt; in the code, because this code needs to be executed with different roles depending on some condition and/or constraints.&lt;/p&gt;

&lt;p&gt;Another use case is to add default tags to all providers, such as the &lt;strong&gt;build number&lt;/strong&gt;, to ensure consistency across all created AWS resources (and maybe you can't be sure that a &lt;code&gt;default_tags&lt;/code&gt; entry is present in every provider).&lt;/p&gt;

&lt;p&gt;To automate the process of injecting CI/CD runtime information into our Terraform providers, we'll introduce the tool &lt;a href="https://github.com/minamijoyo/hcledit"&gt;hcledit&lt;/a&gt;. With &lt;code&gt;hcledit&lt;/code&gt; and a bit of extra shell manipulation, we can insert data into the Terraform code, using &lt;code&gt;grep&lt;/code&gt; with a regular expression to find the correct place to add it.&lt;/p&gt;

&lt;p&gt;How did I do it? Let's see the code! Here's my Bash function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;initialize_provider&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;# allows the last command in a pipeline to be executed in the current shell environment, rather than a subshell&lt;/span&gt;
  &lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; lastpipe
  &lt;span class="c"&gt;# this is a simple regex to match every provider&lt;/span&gt;
  &lt;span class="nv"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'provider[[:blank:]]\+"\([[:alnum:]_-]\+\)"[[:blank:]]*{'&lt;/span&gt;

  &lt;span class="c"&gt;# Search for files in the current directory that contain the regular expression&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;file &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rl&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$regex&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; .&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Processing file: &lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
      &lt;span class="c"&gt;# Create temporary file with the same name as the original file&lt;/span&gt;
      &lt;span class="nv"&gt;temp_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;file&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tmp"&lt;/span&gt;
      &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
      &lt;span class="c"&gt;# declare an array variable&lt;/span&gt;
      &lt;span class="nb"&gt;declare&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; arr

      &lt;span class="c"&gt;# Modify file contents and write to temporary file&lt;/span&gt;
      &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$regex&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="c"&gt;# Extract the provider name, which is the second word without the quotes and braces&lt;/span&gt;
          &lt;span class="nv"&gt;pn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s2"&gt;"s/&lt;/span&gt;&lt;span class="nv"&gt;$regex&lt;/span&gt;&lt;span class="s2"&gt;/ &lt;/span&gt;&lt;span class="se"&gt;\1&lt;/span&gt;&lt;span class="s2"&gt;/g"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'"{}'&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
          &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Found provider: &lt;/span&gt;&lt;span class="nv"&gt;$pn&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
          &lt;span class="c"&gt;#add the provider name to the array&lt;/span&gt;
          arr+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pn&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;done&lt;/span&gt;

      &lt;span class="c"&gt;# Iterate over the array&lt;/span&gt;
      &lt;span class="k"&gt;for &lt;/span&gt;provider_name &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="c"&gt;# Add the assume_role block to the provider&lt;/span&gt;
          hcledit block append provider.&lt;span class="nv"&gt;$provider_name&lt;/span&gt; assume_role &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;--newline&lt;/span&gt;
          hcledit attribute append provider.&lt;span class="nv"&gt;$provider_name&lt;/span&gt;.assume_role.role_arn &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$ROLE_ARN&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;
          &lt;span class="c"&gt;# Add the default_tags block to the provider&lt;/span&gt;
          hcledit block append provider.&lt;span class="nv"&gt;$provider_name&lt;/span&gt; default_tags &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;--newline&lt;/span&gt;
          hcledit attribute append provider.&lt;span class="nv"&gt;$provider_name&lt;/span&gt;.default_tags.tags &lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$BUILD_ID&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;
      &lt;span class="k"&gt;done
      &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt;

  &lt;span class="c"&gt;# Move temporary files to original file names&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;temp_file &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.tmp&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
      &lt;/span&gt;&lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$temp_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;temp_file&lt;/span&gt;&lt;span class="p"&gt;%.tmp&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's clarify what it does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first line enables the &lt;code&gt;lastpipe&lt;/code&gt; shell option, which allows the last command in a pipeline to run in the current shell environment rather than in a subshell. Without it, the provider names collected inside the &lt;code&gt;while&lt;/code&gt; loop would be lost when the pipeline ends. Note that &lt;code&gt;lastpipe&lt;/code&gt; only takes effect when job control is off, so it works best when this function is called from another process (e.g. your CI/CD pipeline).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;regex&lt;/code&gt; is a regular expression that matches &lt;strong&gt;every&lt;/strong&gt; provider in the Terraform code. I need to match multiple providers because you can have more than one, for example when you deploy in different regions. &lt;strong&gt;Beware&lt;/strong&gt; that, if you use non-AWS providers, you may need to change this regex to exclude them or, more generally, to better match your needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The script searches for files in the current directory that contain the &lt;code&gt;regex&lt;/code&gt; regular expression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each file found, the script creates a temporary file with the same name as the original file and copies the contents of the original file to the temporary file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The script declares an empty array called &lt;code&gt;arr&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The script reads through the contents of the original file and extracts the &lt;strong&gt;names of all the providers&lt;/strong&gt; matching the &lt;code&gt;regex&lt;/code&gt; regular expression. Each provider name is added to the &lt;code&gt;arr&lt;/code&gt; array.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The script iterates over the &lt;code&gt;arr&lt;/code&gt; array and appends an &lt;code&gt;assume_role&lt;/code&gt; block to each provider in the Terraform code, using &lt;code&gt;hcledit&lt;/code&gt;, which is much more convenient than manipulating HCL with Bash directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each &lt;code&gt;assume_role&lt;/code&gt; block, the script appends a &lt;code&gt;role_arn&lt;/code&gt; attribute with the value of &lt;code&gt;$ROLE_ARN&lt;/code&gt;, using &lt;code&gt;hcledit&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then the script uses &lt;code&gt;hcledit&lt;/code&gt; again to add a &lt;code&gt;default_tags&lt;/code&gt; block to each provider in the Terraform code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each &lt;code&gt;default_tags&lt;/code&gt; block, the script appends a &lt;code&gt;tags&lt;/code&gt; attribute with the value of &lt;code&gt;$BUILD_ID&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, the script moves the temporary files to their original file names by removing the &lt;code&gt;.tmp&lt;/code&gt; extension.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
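&lt;p&gt;To see steps 2 and 6 in isolation, here is a minimal, self-contained sketch of the provider-name extraction, run against a made-up two-provider snippet (the sample input is mine, not from a real project):&lt;/p&gt;

```shell
# Same BRE the function uses to match provider blocks
regex='provider[[:blank:]]\+"\([[:alnum:]_-]\+\)"[[:blank:]]*{'

# Hypothetical sample: two provider blocks in one file
sample='provider "aws" {
  region = "eu-west-1"
}
provider "random" {
}'

# grep keeps only the matching lines; sed captures the name; tr/cut clean it up
names=$(printf '%s\n' "$sample" | grep "$regex" \
  | sed "s/$regex/ \1/g" | tr -d '"{}' | cut -d' ' -f2)
echo "$names"
```

&lt;p&gt;Running this prints &lt;code&gt;aws&lt;/code&gt; and &lt;code&gt;random&lt;/code&gt;, one per line: exactly the values that end up in the &lt;code&gt;arr&lt;/code&gt; array.&lt;/p&gt;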

&lt;p&gt;In conclusion, by automating the injection of CI/CD runtime information into your Terraform code with tools like &lt;code&gt;hcledit&lt;/code&gt; and a little scripting know-how, you can add environment-specific variables and runtime configuration values reliably, making your deployments more efficient and less error-prone.&lt;/p&gt;

&lt;p&gt;Happy automating!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
      <category>bash</category>
    </item>
    <item>
      <title>Continuous Delivery for the rest of us</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Wed, 04 Jan 2023 11:16:21 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/continuous-delivery-for-the-rest-of-us-37ik</link>
      <guid>https://dev.to/monica_colangelo/continuous-delivery-for-the-rest-of-us-37ik</guid>
      <description>&lt;h2&gt;
  
  
  Getting the bigger picture
&lt;/h2&gt;

&lt;p&gt;DevOps methodologies have now taken hold; the use of &lt;strong&gt;pipelines&lt;/strong&gt; for code builds is a well-established practice adopted by any modern development team. But when we change the conversation from &lt;strong&gt;Integration&lt;/strong&gt; to &lt;strong&gt;Deployment&lt;/strong&gt;, I often find myself looking at extremely simplified examples, where newly built code is released into production with a single command at the end of the build and test stages. Which is, of course, the definition of &lt;a href="https://www.atlassian.com/continuous-delivery/principles/continuous-integration-vs-delivery-vs-deployment" rel="noopener noreferrer"&gt;Continuous Deployment&lt;/a&gt;. But in my daily work, I have learned that reality is rarely that simple.&lt;/p&gt;

&lt;p&gt;If your team is super smart and has every possible test in place, and your business structure allows you to apply Continuous Deployment directly from commit to production, first of all, congratulations! Unfortunately, this is not the case everywhere. The rest of us often need to deploy our code to different environments at different moments in time (at intervals of days, or weeks! 😰 &lt;em&gt;I know, I know...&lt;/em&gt;); there are &lt;strong&gt;release windows&lt;/strong&gt; to meet before releasing to production; there are &lt;strong&gt;acceptance tests&lt;/strong&gt; performed by other teams, so the testing environment can only be updated at agreed-upon times; and there is a variety of other constraints, especially in large and very structured companies. Still, we don't want to give up the benefits of automation.&lt;/p&gt;

&lt;p&gt;Automation, yes, I said it. But, how? When your artefacts need to be deployed in &lt;strong&gt;multiple environments,&lt;/strong&gt; you can't just repeat the same process for each environment: you need to deploy the software &lt;strong&gt;without rebuilding&lt;/strong&gt; it. This aligns with the &lt;strong&gt;&lt;em&gt;"build once, deploy anywhere"&lt;/em&gt;&lt;/strong&gt; principle, which states that once a release candidate for a software component has been created, it should not be altered in any way before it is deployed to production. And if you find yourself in the situation that I just described, with different timing for each environment, you can't just put your deployments in line and execute them one after another in the same pipeline execution.&lt;/p&gt;

&lt;p&gt;I have come across various articles on the Web about &lt;strong&gt;GitOps&lt;/strong&gt;, and while they can be useful, they focus on specific, isolated aspects of configuration, or they often oversimplify, leaving me feeling like I'm missing the &lt;strong&gt;bigger picture&lt;/strong&gt;: which is, of course, the &lt;strong&gt;process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I want to illustrate an approach that I have successfully applied in several projects, &lt;strong&gt;combining&lt;/strong&gt; a classic Continuous Integration &lt;strong&gt;pipeline&lt;/strong&gt; with the Continuous Deployment practices enabled by &lt;strong&gt;GitOps&lt;/strong&gt;. The workflow goes directly from committing the application code to Continuous Deployment in a development environment, while also managing &lt;strong&gt;multiple Kubernetes environments&lt;/strong&gt; where you can release your code at different moments in time. Or, as I like to call it: &lt;strong&gt;Continuous Delivery for the rest of us&lt;/strong&gt; 🤓&lt;/p&gt;

&lt;h2&gt;
  
  
  A brief definition of GitOps
&lt;/h2&gt;

&lt;p&gt;If you have never heard of GitOps before: it is a way of implementing Continuous Deployment for cloud-native applications.&lt;/p&gt;

&lt;p&gt;The term "&lt;strong&gt;GitOps&lt;/strong&gt;" refers to the use of &lt;a href="https://git-scm.com/" rel="noopener noreferrer"&gt;Git&lt;/a&gt; as a single source of truth for &lt;strong&gt;declarative infrastructure&lt;/strong&gt; and application code in a continuous deployment workflow; it reflects the central role that Git plays in this approach to Continuous Deployment. By using Git as the foundation for their deployment process, teams can leverage the power and flexibility of Git to manage and deploy their applications and infrastructure in a reliable and scalable way.&lt;/p&gt;

&lt;p&gt;In a GitOps &lt;strong&gt;workflow&lt;/strong&gt;, developers commit code changes to a Git repository, and &lt;strong&gt;automated processes pull those changes&lt;/strong&gt; and deploy them in a reliable and repeatable manner. This approach enables teams to deploy applications and infrastructure changes with confidence, as the entire deployment process is &lt;strong&gt;version controlled and auditable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the following sections I will explain the various steps in detail, starting from the end: it may seem counter-intuitive, but I find it more useful to start from the final goal and work backwards to "how to get there".&lt;/p&gt;

&lt;h2&gt;
  
  
  Argo CD: the GitOps tool
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://argo-cd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Argo CD&lt;/a&gt; is a Continuous Deployment tool for Kubernetes. It helps developers and operations teams &lt;strong&gt;automate&lt;/strong&gt; the deployment of applications to Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;You define your application's desired state in a &lt;strong&gt;declarative configuration file&lt;/strong&gt;, usually written in the Kubernetes resource manifest format (e.g., YAML).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You commit this configuration file to a &lt;strong&gt;Git repository&lt;/strong&gt;, which serves as the source of truth for your application's desired state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Argo CD &lt;strong&gt;monitors&lt;/strong&gt; the Git repository for changes to the configuration file. When it detects a change, it &lt;strong&gt;synchronizes&lt;/strong&gt; the desired state of the application with the actual state of the application in the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the actual state of the application differs from the desired state, Argo CD will apply the necessary &lt;strong&gt;changes&lt;/strong&gt; to bring the application back into alignment. This includes creating, updating, or deleting resources in the cluster as needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
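&lt;p&gt;The heart of steps 3 and 4 can be boiled down to a toy comparison (a conceptual sketch only, nothing like Argo CD's actual implementation; the states are illustrative):&lt;/p&gt;

```shell
# Toy model of reconciliation: desired state comes from Git,
# live state from the cluster; any difference means OutOfSync.
desired='replicas: 2'   # what the manifest in Git declares
live='replicas: 1'      # what is currently running

if [ "$desired" = "$live" ]; then
  status='Synced'
else
  status='OutOfSync'    # Argo CD would now apply the manifest to converge
fi
echo "$status"
```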

&lt;p&gt;I will not cover the details of the Argo CD installation or configuration procedure here; you can easily find &lt;a href="https://www.eksworkshop.com/intermediate/290_argocd/install/" rel="noopener noreferrer"&gt;many&lt;/a&gt; &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/" rel="noopener noreferrer"&gt;guides&lt;/a&gt; on this.&lt;/p&gt;

&lt;p&gt;In this discussion, I will use a configuration consisting of a single cluster, with Argo CD installed in a dedicated namespace, and three environments, &lt;em&gt;develop&lt;/em&gt;, &lt;em&gt;staging&lt;/em&gt; and &lt;em&gt;production&lt;/em&gt;, each installed in its own namespace. Depending on your level of experience and the needs of your use case, your topology may vary.&lt;/p&gt;

&lt;p&gt;Argo CD is itself configured via a dedicated Git repository and a pipeline that performs configuration synchronization. These configurations, specifically, include three very important pieces of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;the &lt;strong&gt;repository&lt;/strong&gt; containing the Kubernetes configurations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the &lt;strong&gt;directory&lt;/strong&gt; within that repository (we'll learn more in the Kustomize chapter)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the repository &lt;strong&gt;branch or tag&lt;/strong&gt; to use, which will be the only information to be updated when a release needs to be delivered in an environment (we'll learn more in the Release Captain chapter)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So for example, a configuration for &lt;em&gt;develop&lt;/em&gt; environment (my environment is an "application" in Argo CD terms) can be like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;develop&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;develop&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/mysupercoolk8srepository&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;develop&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;develop&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PrunePropagationPolicy=foreground&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PruneLast=true&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
        &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see the three pieces of information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;    
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/mysupercoolk8srepository&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;overlays/develop&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;targetRevision&lt;/code&gt; is, in my case, the &lt;em&gt;main&lt;/em&gt; branch of the Kubernetes YAML files repository, where the integration pipeline of each microservice pushes its updated container image tag (which corresponds to the commit hash).&lt;/p&gt;

&lt;p&gt;In my topology, the staging and production environments have the very same configuration, except for the &lt;code&gt;targetRevision&lt;/code&gt; and &lt;code&gt;path&lt;/code&gt; properties. These configurations are saved, as I mentioned, in a dedicated repository, with a corresponding very simple pipeline that runs &lt;code&gt;kubectl apply -f&lt;/code&gt; against the cluster whenever a commit is made.&lt;/p&gt;

&lt;p&gt;As for the &lt;em&gt;develop&lt;/em&gt; environment, its configuration on Argo CD always points to the default branch (&lt;em&gt;main&lt;/em&gt; in this case) used by the integration pipeline to update container image tags. In this way, this environment follows a continuous deployment approach: a new image version is deployed as soon as it is available.&lt;/p&gt;

&lt;p&gt;For the other environments, however, I explained above why I've decided to take a more conservative approach and make releases in a controlled manner, applying continuous delivery. Therefore, the configuration of these environments on Argo CD points to a &lt;strong&gt;specific tag&lt;/strong&gt; applied to the commit on the default branch when an application version, as a whole, is considered ready to be promoted to the next environment (as we'll see in the Release Captain chapter).&lt;/p&gt;

&lt;p&gt;So for example, the &lt;em&gt;staging&lt;/em&gt; environment is configured as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;    
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/mysupercoolk8srepository&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;release/2.7.0&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;overlays/staging&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and similarly for the &lt;em&gt;production&lt;/em&gt; environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;    
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/mysupercoolk8srepository&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;release/2.6.3&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;overlays/production&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Argo CD &lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/tracking_strategies/#git" rel="noopener noreferrer"&gt;supports three kinds of reference&lt;/a&gt; as &lt;code&gt;targetRevision&lt;/code&gt;: a &lt;strong&gt;tag&lt;/strong&gt;, a &lt;strong&gt;branch&lt;/strong&gt;, or a &lt;strong&gt;commit&lt;/strong&gt;. Pinning a commit hash gives the strongest assurance of &lt;strong&gt;immutability&lt;/strong&gt;; however, it makes releases harder to &lt;strong&gt;track&lt;/strong&gt;, which a release branch or tag makes easier. Neither of those is immutable, though, so it is important for the team to be &lt;strong&gt;disciplined&lt;/strong&gt; and respect the process, i.e., never move or rewrite tags and release branches. At the end of the day, the choice is up to you and what works better for your team.&lt;/p&gt;
&lt;/blockquote&gt;
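&lt;p&gt;Promoting a release to an environment then boils down to changing &lt;code&gt;targetRevision&lt;/code&gt; in that environment's Application manifest and committing the change. A minimal sketch with &lt;code&gt;sed&lt;/code&gt; (the manifest line and version numbers are illustrative, not from a real repository):&lt;/p&gt;

```shell
# Hypothetical promotion step: point staging at the new release ref.
new_release='release/2.7.0'
manifest='    targetRevision: release/2.6.3'

# Rewrite the targetRevision line; in a real pipeline this would be a
# sed -i on the Application file, followed by git commit and push.
updated=$(printf '%s\n' "$manifest" \
  | sed "s|targetRevision: .*|targetRevision: $new_release|")
echo "$updated"
```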

&lt;h2&gt;
  
  
  Kustomize: managing multiple environments without duplicating code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; is a tool that allows developers to customize and deploy their Kubernetes applications, creating &lt;strong&gt;customized versions of their applications&lt;/strong&gt; by modifying and extending existing resources, without having to write new YAML files from scratch. This can be useful in a variety of scenarios, such as creating &lt;strong&gt;different environments&lt;/strong&gt; (e.g. staging, production), or deploying the same application to different clusters with &lt;strong&gt;slight variations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To use Kustomize, you create a &lt;strong&gt;base&lt;/strong&gt; directory containing your Kubernetes resources and then create one or more &lt;strong&gt;overlays&lt;/strong&gt; that contain the customizations you want to apply. Kustomize then &lt;strong&gt;merges&lt;/strong&gt; the overlays with the base resources to generate the final, customized resources that can be deployed to your cluster. You can find more information about Kustomize logic and syntax &lt;a href="https://kubectl.docs.kubernetes.io/guides/introduction/kustomize/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How is Kustomize configured in my use case? My filesystem structure for the Kubernetes files repository using Kustomize is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
|-- base
|   |-- microservice1
|   |   |-- deployment.yaml
|   |   |-- kustomization.yaml
|   |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; service.yaml
|   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; microservice2
|       |-- deployment.yaml
|       |-- kustomization.yaml
|       &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; service.yaml
&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; overlays
    |-- develop
    |   |-- kustomization.yaml
    |   |-- microservice1
    |   |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
    |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; microservice2
    |       &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
    |-- production
    |   |-- kustomization.yaml
    |   |-- microservice1
    |   |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
    |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; microservice2
    |       &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
    &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; staging
        |-- kustomization.yaml
        |-- microservice1
        |   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
        &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; microservice2
            &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt; deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
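&lt;p&gt;If you want to reproduce this skeleton locally, a few lines of shell are enough (microservice and environment names as in the example above):&lt;/p&gt;

```shell
# Scaffold the base/overlays layout shown above
for ms in microservice1 microservice2; do
  mkdir -p "base/$ms"
  touch "base/$ms/deployment.yaml" "base/$ms/kustomization.yaml" "base/$ms/service.yaml"
  for env in develop staging production; do
    mkdir -p "overlays/$env/$ms"
    touch "overlays/$env/kustomization.yaml" "overlays/$env/$ms/deployment.yaml"
  done
done
ls overlays
```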



&lt;p&gt;Let's say that I have a microservice2 &lt;code&gt;deployment.yaml&lt;/code&gt; like this (some properties are omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;cat base/microservice2/deployment.yaml&lt;/span&gt; 
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-super-ms&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can notice that I did not put any container &lt;strong&gt;image tag&lt;/strong&gt; here. That's because it is a piece of information that will come &lt;strong&gt;from the build pipeline&lt;/strong&gt; when an image is actually built.&lt;/p&gt;

&lt;p&gt;To better understand this concept, let's see the corresponding &lt;code&gt;kustomization.yaml&lt;/code&gt; file in the base directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;cat base/microservice2/kustomization.yaml&lt;/span&gt; 
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;
&lt;span class="na"&gt;commonLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;service.yaml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment.yaml&lt;/span&gt;
&lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-super-ms&lt;/span&gt;
  &lt;span class="na"&gt;newTag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6ce74723&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What are those last three lines? Well, those are the &lt;strong&gt;changes made by the build pipeline&lt;/strong&gt; when a new container image is created (as we'll see in the next chapter). In the initial version of this file, when I created it, I didn't include them, but merely indicated which files inside the directory to consider. So &lt;strong&gt;this change is exactly what the build pipeline does&lt;/strong&gt; as its last action.&lt;/p&gt;
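&lt;p&gt;For clarity, this is what the initial version of the file looked like before any build ran, with just the resource list and no &lt;code&gt;images&lt;/code&gt; block:&lt;/p&gt;

```yaml
# base/microservice2/kustomization.yaml - initial version, before any build
# (the images block is appended later by the build pipeline)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
commonLabels:
  app: my-super-ms
resources:
- service.yaml
- deployment.yaml
```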

&lt;p&gt;What about the &lt;strong&gt;overlays&lt;/strong&gt;? As I said before, a Kustomize overlay is a directory that contains customizations that you want to apply to your Kubernetes resources. It is called an "overlay" because it is layered on top of a base directory containing your base resources.&lt;/p&gt;

&lt;p&gt;An overlay directory typically contains one or more Kubernetes resource files, as well as a Kustomization file. The resource files in the overlay directory contain the customizations that you want to apply to your resources, such as &lt;strong&gt;changing the number of replicas&lt;/strong&gt; for a deployment, adding a label to a pod, or maybe having &lt;strong&gt;different ConfigMap contents&lt;/strong&gt; because some parameters differ among environments. The Kustomization file is a configuration file that specifies how the customizations in the overlay should be applied to the base resources.&lt;/p&gt;

&lt;p&gt;Let's see a simple example: as seen before I've set a &lt;code&gt;replicas: 1&lt;/code&gt; spec in my &lt;code&gt;deployment.yaml&lt;/code&gt;, but let's say I want to change this property in the &lt;em&gt;staging&lt;/em&gt; environment to test HA.&lt;/p&gt;

&lt;p&gt;My overlay configuration will be like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;cat overlays/staging/microservice2/deployment.yaml&lt;/span&gt; 
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-super-ms&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, this is my &lt;strong&gt;complete&lt;/strong&gt; file: I don't need to replicate my entire Deployment. I just provide different values for the parameters I'd like to change.&lt;/p&gt;
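&lt;p&gt;To check the result, you can render the overlay with &lt;code&gt;kustomize build overlays/staging&lt;/code&gt;; the merged Deployment would look roughly like this (a sketch based on the base manifests shown earlier, not verbatim tool output):&lt;/p&gt;

```yaml
# Sketch of the merged output: the base deployment.yaml plus the staging
# patch. Only replicas changes; everything else comes from the base.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-super-ms
  namespace: staging        # injected by the overlay kustomization
  labels:
    app: my-super-ms
spec:
  replicas: 3               # overridden by the overlay (the base had 1)
  selector:
    matchLabels:
      app: my-super-ms
  template:
    metadata:
      labels:
        app: my-super-ms
    spec:
      containers:
      - name: my-super-ms
        image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-super-ms:6ce74723
```

&lt;p&gt;Note how the image tag comes from the &lt;code&gt;images&lt;/code&gt; block in the base kustomization, while the replica count comes from the overlay patch.&lt;/p&gt;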

&lt;p&gt;What about my overlay Kustomize file? It just needs to know which files have to be merged. In my case it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;cat overlays/staging/kustomization.yaml&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;                
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;

&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;

&lt;span class="na"&gt;bases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../base/microservice1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../base/microservice2&lt;/span&gt;

&lt;span class="na"&gt;patchesStrategicMerge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;microservice1/deployment.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;microservice2/deployment.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
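&lt;p&gt;A note for newer setups: recent Kustomize releases deprecate the &lt;code&gt;bases&lt;/code&gt; and &lt;code&gt;patchesStrategicMerge&lt;/code&gt; fields; an equivalent overlay written against current versions would use &lt;code&gt;resources&lt;/code&gt; and &lt;code&gt;patches&lt;/code&gt; instead:&lt;/p&gt;

```yaml
# Equivalent overlays/staging/kustomization.yaml for newer Kustomize
# versions, where bases and patchesStrategicMerge are deprecated in
# favour of resources and patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: staging

resources:
  - ../../base/microservice1
  - ../../base/microservice2

patches:
  - path: microservice1/deployment.yaml
  - path: microservice2/deployment.yaml
```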



&lt;blockquote&gt;
&lt;p&gt;For this YAML code, a &lt;strong&gt;pipeline&lt;/strong&gt; is perhaps not strictly necessary for the process to work, but I &lt;strong&gt;recommend&lt;/strong&gt; providing one.&lt;/p&gt;

&lt;p&gt;Specifically, a pipeline should be run every time a &lt;strong&gt;pull request&lt;/strong&gt; is opened, and it should check the code for errors and security bugs; you can use tools such as &lt;a href="https://www.checkov.io/" rel="noopener noreferrer"&gt;Checkov&lt;/a&gt; or similar.&lt;/p&gt;

&lt;p&gt;In fact, in this case, the default branch should be "armoured" and &lt;strong&gt;no one should push directly to it, except the build pipeline&lt;/strong&gt;. A developer who intends to make additions or changes to Kubernetes files should commit them to a &lt;strong&gt;temporary branch&lt;/strong&gt; (NOT a release branch) and open a pull request, triggering the pipeline execution. If the checks pass, the pull request can be accepted and the code merged, making it immediately available to the &lt;em&gt;develop&lt;/em&gt; environment (which, remember, is configured via Argo CD to stay constantly aligned with the default branch).&lt;/p&gt;

&lt;p&gt;The build pipeline, in my opinion, can instead write its changes directly to the &lt;em&gt;main&lt;/em&gt; branch, since the only detail it changes is the container image tag, and there is no point in running checks on Kubernetes files for such changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Building bridges
&lt;/h2&gt;

&lt;p&gt;Going backwards, we have finally arrived at the starting point: the &lt;strong&gt;build pipeline&lt;/strong&gt;. As I said before, I will not explain here what a Continuous Integration pipeline is - there are plenty of examples and explanations on the Web, and I assume that, if you've made it this far, you probably already know. For our process, what matters is that this pipeline, after pushing the new container image to the registry, updates the Kubernetes file repository to communicate the new tag.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Although I used AWS in my design to illustrate this approach, the process is usable with &lt;strong&gt;any CI/CD platform&lt;/strong&gt; and wherever Kubernetes is hosted. I have used this approach in different projects, with AWS Code Suite and EKS as well as with GitLab or Bitbucket and Rancher. The technicalities don't matter; what really matters is applying a &lt;strong&gt;structured process&lt;/strong&gt;, whatever software products you choose to use and whatever constraints you happen to have.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my example, using &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/getting-started.html" rel="noopener noreferrer"&gt;CodeBuild&lt;/a&gt; as executor and &lt;a href="https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html" rel="noopener noreferrer"&gt;CodeCommit&lt;/a&gt; as a repository, this last stage is run by this &lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html" rel="noopener noreferrer"&gt;&lt;code&gt;buildspec.yaml&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;git-credential-helper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yes&lt;/span&gt;

&lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pre_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TAG=`echo $CODEBUILD_RESOLVED_SOURCE_VERSION | head -c 8`&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$TAG&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cd $CODEBUILD_SRC_DIR_k8s_repo&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cd base/$IMAGE_REPO_NAME&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kustomize edit set image $REPOSITORY_URI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;git config --global user.email "noreply@codebuild.codepipeline"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;git config --global user.name "CodeBuild"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;git commit -am "updated image $IMAGE_REPO_NAME with tag $TAG"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;git push origin HEAD:main&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;I want to emphasize that the update of the image tag is done in the &lt;em&gt;base&lt;/em&gt; directory of the Kubernetes repository, and &lt;strong&gt;not in the overlays&lt;/strong&gt;: the management of different versions of the application in different environments is done with release branches, as we see in the next chapter. The overlays are only meant to allow for small differences in configurations, not versions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To summarize, the whole process works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A developer makes &lt;strong&gt;changes&lt;/strong&gt; to an application and pushes a new version of the software to a Git repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A continuous integration &lt;strong&gt;pipeline&lt;/strong&gt; is triggered, which results in a new container image being saved to a &lt;strong&gt;registry&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The last step of the integration pipeline changes the Kubernetes manifests hosted in a dedicated Git repository, automatically updating the specific image with the &lt;strong&gt;newly created tag&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Argo CD constantly &lt;strong&gt;compares&lt;/strong&gt; the application state with the current state of the Kubernetes cluster. It then applies the necessary changes to the cluster configuration; Kubernetes uses its controllers to reconcile the cluster resources until the desired configuration is reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All this works seamlessly in the development environment. But &lt;strong&gt;how do you deliver to environments that need to be updated at different times?&lt;/strong&gt; This is where the important role of the Release Captain comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Release Captain👩‍✈️: coordinating releases
&lt;/h2&gt;

&lt;p&gt;A so-called "Release Captain" is a &lt;strong&gt;role&lt;/strong&gt; within a software development team that is responsible for coordinating the release of code from &lt;em&gt;develop&lt;/em&gt; to &lt;em&gt;production&lt;/em&gt; (and every other environment in between).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In teams capable of doing Continuous Deployment directly to production, this role is played entirely by one or more pipelines that execute extensive tests, automatically open and approve merge requests, and tag commits properly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ideally, every team member should be able to act as a Release Captain on a rotating basis. What is the task of the Release Captain within our GitOps process? It is a relevant task, but fortunately not too onerous.&lt;/p&gt;

&lt;p&gt;Let's say that, up to a certain point, development has been focused on the &lt;em&gt;develop&lt;/em&gt; environment. At some point, a release to the &lt;em&gt;staging&lt;/em&gt; environment must finally be established and scheduled. This is where the Release Captain comes in: they assign a &lt;strong&gt;release number and tag the commit&lt;/strong&gt; in the default branch of the Kubernetes file repository. Once this tag is created, the Release Captain will &lt;strong&gt;modify the configuration&lt;/strong&gt; of the &lt;em&gt;staging&lt;/em&gt; environment in the Argo CD repository, replacing the previous value of &lt;code&gt;targetRevision&lt;/code&gt; with the new tag. Once this change is pushed, the triggered pipeline will apply the configuration change directly to Argo CD, which, in turn, will &lt;strong&gt;synchronize&lt;/strong&gt; with the contents of the new tag, effectively deploying the new release.&lt;/p&gt;
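&lt;p&gt;To make the Release Captain's two actions concrete, here is a sketch in shell (the repository layout, the &lt;code&gt;staging-app.yaml&lt;/code&gt; file and the &lt;code&gt;v1.2.0&lt;/code&gt; tag are made-up examples, with throwaway local repositories standing in for the real remotes):&lt;/p&gt;

```shell
# Sketch of the Release Captain's two actions, using throwaway local
# repositories in place of the real remotes (all names are made up).
set -e
workdir=$(mktemp -d)
cd "$workdir"

# 1) Tag the current commit of the Kubernetes file repository
git init -q k8s-repo
cd k8s-repo
git -c user.email=rc@example.com -c user.name=RC commit -q --allow-empty -m "state to release"
git tag v1.2.0

# 2) Point the staging Application at the new tag in the Argo CD repository
cd "$workdir"
mkdir -p argocd-repo
printf 'spec:\n  source:\n    targetRevision: v1.1.0\n' | tee argocd-repo/staging-app.yaml
sed -i 's/targetRevision: .*/targetRevision: v1.2.0/' argocd-repo/staging-app.yaml
cat argocd-repo/staging-app.yaml
```

&lt;p&gt;In a real setup the tag and the modified file would of course be pushed, and the push on the Argo CD repository would trigger the pipeline described earlier.&lt;/p&gt;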

&lt;blockquote&gt;
&lt;p&gt;This approach treats microservice applications as a single block to be released &lt;strong&gt;all at once&lt;/strong&gt; in a given release. This may seem superfluous in many circumstances, especially if there are only a few microservices, but, in my opinion, it is important for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;if acceptance testing is done on a given release, you are assured of passing to production &lt;strong&gt;exactly the same versions&lt;/strong&gt; of all microservices that have been certified as inter-working. In other words, if I have certified that microservice A in version 1.2.3 works with microservice B in version 4.5.6, at the time of promotion to the next environment I need to be sure that I release exactly the same versions &lt;strong&gt;together&lt;/strong&gt;;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one thing that is often underestimated from a development point of view, but very problematic from an operations point of view, is the &lt;strong&gt;rollback process&lt;/strong&gt;. In case of problems, rolling back to the previous version by returning &lt;code&gt;targetRevision&lt;/code&gt; to its previous value is extremely quick and safe, and saves a lot of headaches.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it! For each release to the &lt;em&gt;staging&lt;/em&gt; environment, simply repeat this process. For releases to the &lt;em&gt;production&lt;/em&gt; environment, it is even simpler: once a release has been tested and judged suitable for deployment to production, there is no new tag to create. You reuse the tag that has already been tested, and the only action to take is to &lt;strong&gt;edit the production environment configuration file&lt;/strong&gt; in the Argo CD repository.&lt;/p&gt;

&lt;p&gt;The following drawing summarizes the entire workflow, starting from the push of application code and ending with deployment to the various environments in the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cx727ochy7na4fqd9tz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cx727ochy7na4fqd9tz.png" alt="gitops-workflow" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Several aspects of this process can be adapted to the needs of the team; however, I have found it effective in quite different situations, especially, as I mentioned, for projects within large companies that have constraints and yet do not want to give up the benefits of automation.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>A Year of Growth and Impact: Reflections on 2022 as a Woman in Tech</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Tue, 27 Dec 2022 10:30:16 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/a-year-of-growth-and-impact-reflections-on-2022-as-a-woman-in-tech-5go</link>
      <guid>https://dev.to/monica_colangelo/a-year-of-growth-and-impact-reflections-on-2022-as-a-woman-in-tech-5go</guid>
      <description>&lt;p&gt;As the year 2022 comes to a close, it's time to reflect on the accomplishments and successes of the past year. As a &lt;strong&gt;female&lt;/strong&gt; voice in the tech industry, I have made it my mission to inspire, educate, and empower others through my blog and newsletter.&lt;/p&gt;

&lt;p&gt;Throughout the year, I have written a number of &lt;strong&gt;technical articles&lt;/strong&gt; that have received a great deal of positive feedback. These articles have aimed to demystify &lt;strong&gt;complex topics&lt;/strong&gt; and make them accessible to a wider audience.&lt;/p&gt;

&lt;p&gt;In addition to sharing my knowledge through my writing, I have also launched a &lt;a href="https://letsmakecloud.beehiiv.com/subscribe" rel="noopener noreferrer"&gt;&lt;strong&gt;newsletter&lt;/strong&gt;&lt;/a&gt; to further disseminate my ideas and insights. This has allowed me to reach an even larger audience and help more people learn about the exciting world of technology.&lt;/p&gt;

&lt;p&gt;As a woman in tech, I understand the importance of &lt;strong&gt;representation&lt;/strong&gt; and &lt;strong&gt;diversity&lt;/strong&gt; in the industry. It's vital that all voices are &lt;strong&gt;heard&lt;/strong&gt; and that everyone has the &lt;strong&gt;opportunity&lt;/strong&gt; to learn and grow. That's why I strive to be a positive role model and to use my platform to &lt;strong&gt;inspire&lt;/strong&gt; and &lt;strong&gt;empower&lt;/strong&gt; others, particularly other &lt;strong&gt;women&lt;/strong&gt; and &lt;strong&gt;girls&lt;/strong&gt; interested in tech.&lt;/p&gt;

&lt;p&gt;Overall, the year 2022 has been a fulfilling and successful one, and I am grateful for the opportunity to &lt;strong&gt;share&lt;/strong&gt; my knowledge and insights with the world. I look forward to continuing this work in the coming year and beyond, and to helping make the tech industry a more &lt;strong&gt;inclusive&lt;/strong&gt; and &lt;strong&gt;equitable&lt;/strong&gt; place for all.&lt;/p&gt;

</description>
      <category>security</category>
    </item>
    <item>
      <title>4 ultimate reasons to prefer AWS CDK over Terraform</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Mon, 05 Dec 2022 14:38:29 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/4-ultimate-reasons-to-prefer-aws-cdk-over-terraform-34pf</link>
      <guid>https://dev.to/monica_colangelo/4-ultimate-reasons-to-prefer-aws-cdk-over-terraform-34pf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;There is an Italian version of this article; if you'd like to read it &lt;a href="https://letsmake.cloud/4-motivi-fondamentali-per-preferire-aws-cdk-a-terraform"&gt;click here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Over the past few months I have been using &lt;a href="https://aws.amazon.com/cdk/"&gt;AWS CDK&lt;/a&gt; for some projects, and every time I started talking about it, someone would ask: why should I abandon the tool I am using and switch to CDK? What advantages does it offer?&lt;/p&gt;

&lt;p&gt;I will not dwell on implementation details in this post; there are many useful resources to be found online, from &lt;a href="https://cdkworkshop.com/"&gt;tutorials for beginners&lt;/a&gt; to very advanced articles.&lt;/p&gt;

&lt;p&gt;Instead, I want to summarise what I consider to be very interesting features of the framework.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am a passionate advocate of Infrastructure as Code and have been using it extensively since the earliest versions of the tools that have become established leaders in this field today. What you learn with experience is that there is no such thing as the perfect tool that solves every problem or that fits all occasions; there are tools that are adapted to many different situations, or that are selected for certain specific characteristics of the company you work for, its processes, the risks you accept to face, the problems you take on, and so on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to explain the advantages (and limitations) I have found in CDK, it is necessary to take a step back and recall the characteristics of some of the most widely used Infrastructure as Code tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudFormation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/cloudformation/"&gt;Cloudformation&lt;/a&gt; is the Infrastructure as Code service of AWS. It has been active since 2011 (it seems like yesterday, but in the cloud era we are talking about geological eras before that), free of charge, and uses descriptive languages such as JSON and YAML (the latter as of 2016, to the relief of many) to create templates in which the resources to be created on AWS are defined. These templates are processed by the Cloudformation service, which creates the resources as described. If we want to change our infrastructure, we simply re-execute the modified template.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;p&gt;The unbeatable advantage of CloudFormation is its &lt;strong&gt;automatic rollback management&lt;/strong&gt;. If my template contains errors, CloudFormation stops the infrastructure update and automatically &lt;strong&gt;returns&lt;/strong&gt; to the previous state, i.e. to the last "working" version of my template.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Over the years, CloudFormation has undergone many evolutions and introduced new features, cross-account usage and more... and yet, nobody loves it. At most, it is tolerated. Why? Because of the languages it uses. JSON and YAML are essentially data serialisation formats: they work well with machines... less well with humans. They are certainly easy to read, but extremely tedious to write. Since they are not programming languages, there are no practical (as well as basic) mechanisms such as loops for repetitive operations: if I need to create 10 security groups, I have to list them all, one by one, without fail. If you have ever used CloudFormation, you know what I am talking about.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It works exclusively on AWS.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Terraform
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; is an open-source tool from Hashicorp for Infrastructure as Code, initially released in 2014. It uses the declarative HashiCorp Configuration Language (&lt;strong&gt;HCL&lt;/strong&gt;), which from the earliest releases immediately seemed friendlier to the writing of infrastructure. Once a user invokes Terraform on a given resource, Terraform performs CRUD actions via the cloud provider's API to obtain the desired state. The code can be factored into modules, promoting reusability and maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Terraform manages external resources with 'providers'. Users can interact with Terraform providers by declaring resources or using data sources; there are many providers maintained both by HashiCorp and by the community, and AWS is one of them. The first advantage is therefore that it is a cross-provider tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As the HCL language has evolved over the years, Terraform allows the use of several constructs that function as loops in order to shorten the repetitive writing of similar resources. For example, one of the most common constructs is to cycle through a list:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecr_repository" "ecr_repo" {
  count                = length(local.repo_list)
  name                 = local.repo_list[count.index]
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but with the latest versions of Terraform, it is possible to use more complex constructs, such as extracting keys and values from a &lt;strong&gt;map&lt;/strong&gt; to be used as required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  dynamic "predicates" {
    for_each = [for k, v in each.value["sets"] : {
      set = v
    } if contains(keys(aws_waf_ipset.waf_ipset), v)]
    content {
      data_id = aws_waf_ipset.waf_ipset[predicates.value.set].id
      negated = false
      type    = "IPMatch"
    }
  }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Probably not the clearest code in the world, though still better than the endless lists of attributes in CloudFormation...&lt;/p&gt;

&lt;h3&gt;
  
  
  Limits
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The infamous &lt;strong&gt;state file&lt;/strong&gt;! Terraform saves the state of the infrastructure in a JSON file that is updated at each execution. Keeping this file is extremely important because it is the "source of truth": Terraform consults it before each run to establish the discrepancy between the desired state (i.e. the code we want to execute) and the current state, and from this comparison decides what actions to take to close the gap. If the state file is lost, Terraform cannot know that part of the infrastructure was already created and will try to create everything from scratch.&lt;/p&gt;

&lt;p&gt;Furthermore, keeping all the code of a very large infrastructure together is a bad practice, for several reasons: operational risk, shared management, handovers, and general maintainability of the code. Typically, each infrastructure 'stack' is created with blocks of code executed separately: this means that each stack will have its own state file, and consequently the preservation of these state files, in the long run, with large teams and very large infrastructures, becomes a very important and delicate issue.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No rollback management&lt;/strong&gt;. The CRUD operations performed by Terraform, as I mentioned earlier, are sequential calls to the cloud provider's API; if for some reason a call fails mid-execution... Terraform stops and leaves it to the user to clean up the changes left half-applied. Not the best behaviour, especially in production environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
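&lt;p&gt;For completeness: the usual mitigation for the state-file risk described above is a remote backend with locking. A typical AWS setup (the bucket and table names here are made up) looks roughly like this:&lt;/p&gt;

```hcl
# Remote state with locking: the state file lives in S3 and a DynamoDB
# table prevents two concurrent runs from corrupting it.
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

&lt;p&gt;This removes the risk of losing a local file, but the operational burden of organising one state per stack remains.&lt;/p&gt;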

&lt;h2&gt;
  
  
  CDK
&lt;/h2&gt;

&lt;p&gt;OK, finally to the point: what are the characteristics of CDK that make it preferable to the instruments just mentioned? Personally, I see at least four! Let's look at them in order of importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantage #1: Rollback
&lt;/h3&gt;

&lt;p&gt;CDK is a framework that, when executed, "synthesises" a CloudFormation template and then applies it. Consequently, it inherits all the positive features of CloudFormation and, in particular, the ability to &lt;strong&gt;automatically roll back&lt;/strong&gt; to the previous state. This is a very important feature in my opinion, above all when making changes to previously created stacks in a production environment. Rollback is a step too often underestimated... until something goes wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantage #2: No state file
&lt;/h3&gt;

&lt;p&gt;As I said, since the framework synthesises CloudFormation templates, the management of the infrastructure state is left to CloudFormation itself, and there are no state files to manage. In addition, it is much easier to inspect the status of resources directly from the console of the same AWS account. Given the risks I listed earlier regarding state file management, this is no small advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantage #3: Friendly/familiar programming language
&lt;/h3&gt;

&lt;p&gt;AWS CDK is available for the most popular languages: TypeScript, Python, Java, .NET, and Go. There are no particular differences between these implementations: the choice can be based solely on the user's familiarity with one language or another. In my case, I used Python and my experience was pleasantly simple and smooth, thanks also to extremely comprehensive documentation and support for the main IDEs.&lt;/p&gt;

&lt;p&gt;The use of an actual programming language also has the considerable advantage of being able to perform any type of operation not necessarily linked to CDK, such as requests to external APIs to retrieve information or notifications, manipulation of strings, files, JSON and so on... the limit is your imagination!&lt;/p&gt;
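&lt;p&gt;To make the contrast with plain JSON/YAML templates concrete, here is the idea in plain Python (CDK construct calls are deliberately left out so the sketch runs standalone; in real CDK code each dict would be an &lt;code&gt;ecr.Repository&lt;/code&gt; construct instead): the ten near-identical resources that CloudFormation forces you to list by hand become a simple loop.&lt;/p&gt;

```python
# The repetition problem of plain CloudFormation YAML disappears with a
# real language: generate ten near-identical resource definitions in a
# loop. (Plain Python here so the sketch runs standalone; repo names
# are made up for illustration.)
import json

def make_repo(name: str) -> dict:
    """Build one ECR-repository resource definition."""
    return {
        "Type": "AWS::ECR::Repository",
        "Properties": {
            "RepositoryName": name,
            "ImageScanningConfiguration": {"ScanOnPush": True},
        },
    }

repos = [f"team-repo-{i}" for i in range(1, 11)]
template = {"Resources": {f"Repo{i}": make_repo(name)
                          for i, name in enumerate(repos, start=1)}}

print(len(template["Resources"]))   # ten resources from a few lines of logic
print(json.dumps(template["Resources"]["Repo1"], indent=2))
```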

&lt;h3&gt;
  
  
  Advantage #4: Automatic generation of IAM policies
&lt;/h3&gt;

&lt;p&gt;Finally, there is hardly any need to write IAM roles and policies at all. Based on the relationships between the resources declared in the code, the framework is able to automatically calculate the necessary permissions and create the roles and policies itself, following the principle of least privilege: only the permissions that are strictly necessary are assigned.&lt;/p&gt;

&lt;p&gt;This is by no means a trivial advantage, considering that this mechanism ensures that you do not forget any permissions and, above all, avoid assigning more permissions than you need, either by mistake or out of haste.&lt;/p&gt;

&lt;p&gt;Of course, it is always possible to add permissions that the framework is unable to calculate. For example, it may happen that a Lambda function is created that internally makes API calls to AWS services, in which case the Lambda code is not part of the CDK code and is therefore excluded from the 'calculation'. The permissions required by the function for its calls must therefore be added to the role that the CDK automatically creates.&lt;/p&gt;

&lt;p&gt;In addition to the advantage from the point of view of security, there is also the enormous time-saving in the development of infrastructure code. An example? The creation of a CodePipeline resource with its CodeCommit repository and CodeBuild stage required me to write about 500 lines of Terraform code; in CDK, the IAM part is about ten lines. Impressive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final considerations
&lt;/h2&gt;

&lt;p&gt;AWS CDK is a tool that solves the problems of CloudFormation without losing its positive features, while adding further advantages over other tools. Its greatest limitation, however, is that it can, of course, only be used on AWS.&lt;/p&gt;

&lt;p&gt;There are other tools that use programming languages for writing infrastructure code and that are available on other cloud providers: for example, &lt;a href="https://www.pulumi.com/"&gt;Pulumi&lt;/a&gt; or &lt;a href="https://developer.hashicorp.com/terraform/cdktf"&gt;cdktf&lt;/a&gt;. However, these tools do not offer the same advantages, as they still rely on direct API calls (so there is no rollback) and save the state of the infrastructure in dedicated state files that must be managed.&lt;/p&gt;

&lt;p&gt;The persistence of these limitations has always put me off the idea of changing Infrastructure as Code tools because the change of habits, paradigm and especially code base seemed not worth it. AWS CDK, on the other hand, has such advantages that I would seriously consider abandoning other tools.&lt;/p&gt;

&lt;p&gt;And what do you think? Have you tried AWS CDK? Would you consider switching tools in light of the advantages? Let me know in the comments!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
      <category>programming</category>
    </item>
    <item>
      <title>The only newsletter you’ll ever need to read</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Tue, 22 Nov 2022 17:36:51 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/the-only-newsletter-youll-ever-need-to-read-2pom</link>
      <guid>https://dev.to/monica_colangelo/the-only-newsletter-youll-ever-need-to-read-2pom</guid>
      <description>&lt;p&gt;Okay, maybe I was a bit bold in the title... but maybe not, and it's up to you!&lt;/p&gt;

&lt;p&gt;I had been thinking about creating my own &lt;strong&gt;newsletter&lt;/strong&gt; for a while, but couldn't make up my mind. &lt;/p&gt;

&lt;p&gt;Then I joined &lt;strong&gt;Mastodon&lt;/strong&gt; recently (you can follow me here: &lt;a href="https://hachyderm.io/@monica"&gt;https://hachyderm.io/@monica&lt;/a&gt;). Now that I am experiencing a new, much less toxic social network, the idea of creating connections and community has finally made me decide!&lt;/p&gt;

&lt;p&gt;My idea is to periodically share some interesting readings about Cloud, DevOps, and Architecture that I find online (of course, on Dev.to too!). Nothing for sale, just knowledge sharing. &lt;/p&gt;

&lt;p&gt;If you like, you can also share with me some brilliant content you may find online, and you could see it in the newsletter itself.&lt;/p&gt;

&lt;p&gt;If you like the idea, you're very welcome!&lt;/p&gt;

&lt;p&gt;Please visit &lt;a href="https://letsmake.cloud/newsletter-subscription"&gt;my newsletter subscription page&lt;/a&gt; and let's start making cloud together!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>aws</category>
      <category>newsletter</category>
    </item>
    <item>
      <title>Including an existing virtual machine in a CI/CD pipeline</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Sat, 20 Aug 2022 15:51:00 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/include-an-existing-virtual-machine-in-a-cicd-pipeline-52do</link>
      <guid>https://dev.to/monica_colangelo/include-an-existing-virtual-machine-in-a-cicd-pipeline-52do</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published at &lt;a href="https://letsmake.cloud/pipeline-with-vm-step"&gt;https://letsmake.cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this article, we will see how to execute one or more steps of a CI/CD pipeline directly on a "traditional" virtual machine.&lt;/p&gt;

&lt;p&gt;Think of legacy and/or proprietary applications, with license or support &lt;strong&gt;constraints&lt;/strong&gt;, or that you cannot or do not want to re-engineer for any other reason, but which are necessary to perform specific tests or analyses with particular software: for example, scans with software belonging to the security team that prefers to centralize information in a hybrid environment, running simulations with software such as &lt;a href="https://www.mathworks.com/help/simulink/"&gt;MATLAB and Simulink&lt;/a&gt; installed centrally for cross-team use, and so on.&lt;/p&gt;

&lt;p&gt;Being constrained to use this software does not mean that modern DevOps methodologies, such as CI/CD pipelines, cannot be applied to code development. As we will see, a pipeline can include a step in which commands or scripts are executed directly &lt;strong&gt;on a virtual machine&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Solution architecture
&lt;/h1&gt;

&lt;p&gt;In today's use case, the virtual machine is a Windows EC2 instance on AWS; the goal is to run some commands on the instance every time my code is modified and pushed to a Git repository.&lt;/p&gt;

&lt;p&gt;It is worth mentioning, however, that the &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-managedinstances.html"&gt;AWS Systems Manager Agent can also be installed on on-premises machines or machines hosted elsewhere&lt;/a&gt;. This solution can therefore be extended to many applications even if they are not hosted directly on AWS.&lt;/p&gt;
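&lt;p&gt;For completeness, registering an external machine goes through an SSM hybrid activation. A minimal sketch with boto3 (role and instance names are placeholders; the client is passed in to keep the example self-contained):&lt;/p&gt;

```python
def create_hybrid_activation(ssm_client, iam_role: str, name: str, limit: int = 1):
    """Create an SSM hybrid activation for machines outside EC2.

    `ssm_client` would be boto3.client("ssm"); it is passed in so the
    function can be exercised without AWS credentials.
    """
    response = ssm_client.create_activation(
        DefaultInstanceName=name,
        IamRole=iam_role,  # a service role assumable by ssm.amazonaws.com
        RegistrationLimit=limit,
    )
    return response["ActivationId"], response["ActivationCode"]
```

&lt;p&gt;The returned activation ID and code are then used by the SSM Agent on the machine to register itself with the service.&lt;/p&gt;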

&lt;p&gt;Solution architecture details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a CodeCommit repository contains both the application code and a JSON file that includes information about the commands to be executed on the virtual machine;&lt;/li&gt;
&lt;li&gt;a push on this repository will trigger the execution of a CodePipeline, which in turn &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-step-functions.html"&gt;calls a StepFunction&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;the StepFunction initializes a workflow to execute a command specified in an SSM Document;&lt;/li&gt;
&lt;li&gt;the Document is "sent" from the StepFunction to the virtual machine through a Lambda function;&lt;/li&gt;
&lt;li&gt;a second Lambda, also controlled by the StepFunction, verifies the outcome of the execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X8uUYrc---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mz17x30c9jgndiszdonk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X8uUYrc---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mz17x30c9jgndiszdonk.png" alt="pipeline-stepfunction" width="880" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;The virtual machine is already installed and configured with the software to run, and it has an Instance Profile with the necessary permissions to allow the SSM Document to run (and possibly to access the AWS services needed by the use case).&lt;/p&gt;

&lt;p&gt;In my case, the policies associated with the Instance Profile are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Managed Policy &lt;code&gt;AmazonSSMManagedInstanceCore&lt;/code&gt; (to be managed by SSM)&lt;/li&gt;
&lt;li&gt;AWS Managed Policy &lt;code&gt;AWSCodeCommitReadOnly&lt;/code&gt; (to access the code repository)&lt;/li&gt;
&lt;li&gt;custom policy to allow &lt;code&gt;s3:PutObject&lt;/code&gt; on the output bucket&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  AWS Cloud Development Kit (AWS CDK)
&lt;/h1&gt;

&lt;p&gt;To create this architecture, I used &lt;a href="https://aws.amazon.com/cdk/"&gt;AWS CDK for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AWS CDK is an open-source software development framework, introduced in July 2019, for defining AWS cloud infrastructure. Since AWS CDK uses CloudFormation as its foundation, it retains all the benefits of CloudFormation while allowing you to provision cloud resources using modern programming languages such as TypeScript, C#, Java, and Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are not familiar with AWS CDK, you can follow a great &lt;a href="https://cdkworkshop.com/30-python.html"&gt;tutorial here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using AWS CDK is also advantageous because it allows you to write less code than other "classic" Infrastructure as Code tools (in my example, my approximately 150 lines of Python code generate 740 CloudFormation YAML lines); in particular, many IAM roles and policies are &lt;strong&gt;deduced&lt;/strong&gt; directly from the framework without having to write them explicitly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can find the complete example &lt;a href="https://github.com/theonlymonica/pipeline-existing-vm-integration-examples"&gt;at this link&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  SSM Document
&lt;/h2&gt;

&lt;p&gt;To start developing my solution, I first create an SSM Document for my EC2 Windows, which is the script that needs to be run on the virtual machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schemaVersion: "2.2"
description: "Example document"
parameters:
  Message:
    type: "String"
    description: "Message to write"
  OutputBucket:
    type: "String"
    description: "Bucket to save output"
  CodeRepository:
    type: "String"
    description: "Git repository to clone"
mainSteps:
  - action: "aws:runPowerShellScript"
    name: "SampleStep"
    precondition:
      StringEquals:
        - platformType
        - Windows
    inputs:
      timeoutSeconds: "60"
      runCommand:
        - Import-Module AWSPowerShell
        - Write-Host "Create temp dir"
        - $tempdir = Join-Path $env:temp (-join ((48..57) + (97..122) | Get-Random -Count 32 | % {[char]$_}))
        - New-Item $tempdir -ItemType Directory
        - Write-Host "Cloning repository"
        - "git clone {{CodeRepository}} $tempdir"
        - $fname = $(((get-date).ToUniversalTime()).ToString("yyyyMMddTHHmmssZ"))
        - Write-Host "Writing file on S3"
        - "Write-S3Object -BucketName {{OutputBucket}} -Key ($fname + '.txt') -Content {{Message}}"
        - Write-Host "Removing temp dir"
        - Remove-Item -path $tempdir -Recurse -Force -EA SilentlyContinue
        - Write-Host "All done!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example script has 3 parameters: the Git repository to clone, the message to write to the output file, and the S3 bucket to save that file; and then it uses these parameters with the commands to execute. Of course, this is a straightforward example, which can be modified as needed.&lt;/p&gt;

&lt;p&gt;I use this YAML file directly in my Python code to create the Document on AWS SSM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open("ssm/windows.yml") as openFile:
    documentContent = yaml.load(openFile, Loader=yaml.FullLoader)
    cfn_document = ssm.CfnDocument(self, "MyCfnDocument",
        content=documentContent,
        document_format="YAML",
        document_type="Command",
        name="pipe-sfn-ec2Win-GitS3",
        update_method="NewVersion",
        target_type="/AWS::EC2::Instance"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lambda
&lt;/h2&gt;

&lt;p&gt;I create the CodeCommit repository where I will save the application code, the S3 bucket to write the processing results, and then the two Lambdas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repo = codecommit.Repository(self, "pipe-sfn-ec2Repo",
            repository_name="pipe-sfn-ec2-repo"
        )

output_bucket = s3.Bucket(self, 'ExecutionOutputBucket')

submit_lambda = _lambda.Function(self, 'submitLambda',
                    handler='lambda_function.lambda_handler',
                    runtime=_lambda.Runtime.PYTHON_3_9,
                    code=_lambda.Code.from_asset('lambdas/submit'),
                    environment={
                        "OUTPUT_BUCKET": output_bucket.bucket_name,
                        "SSM_DOCUMENT": cfn_document.name,
                        "CODE_REPOSITORY": repo.repository_clone_url_http
                        })

status_lambda = _lambda.Function(self, 'statusLambda',
                    handler='lambda_function.lambda_handler',
                    runtime=_lambda.Runtime.PYTHON_3_9,
                    code=_lambda.Code.from_asset('lambdas/status'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the Lambda "submit" has 3 environment variables that will serve as parameters for the commands to be executed on the virtual machine. The Lambda code is also in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import os
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ssm_client = boto3.client('ssm')

document_name = os.environ["SSM_DOCUMENT"]
output_bucket = os.environ["OUTPUT_BUCKET"]
code_repository = os.environ["CODE_REPOSITORY"]

def lambda_handler(event, context):
    logger.debug(event)

    instance_id = event["instance_id"]
    message = event["message"]

    response = ssm_client.send_command(
                InstanceIds=[instance_id],
                DocumentName=document_name,
                Parameters={
                    "Message": [message],
                    "OutputBucket": [output_bucket],
                    "CodeRepository": [code_repository]})

    logger.debug(response)

    command_id = response['Command']['CommandId']
    data = {
        "command_id": command_id, 
        "instance_id": instance_id
    }

    return data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This first Lambda "submit" output becomes the second Lambda "status" input: it checks the status of the just started execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ssm_client = boto3.client('ssm')

def lambda_handler(event, context):

    instance_id = event['Payload']['instance_id']
    command_id = event['Payload']['command_id']

    logger.debug(instance_id)
    logger.debug(command_id)

    response = ssm_client.get_command_invocation(CommandId=command_id, InstanceId=instance_id)

    logger.debug(response)

    execution_status = response['StatusDetails']
    logger.debug(execution_status)

    if execution_status == "Success":
        return {"status": "SUCCEEDED", "event": event}
    elif execution_status in ('Pending', 'InProgress', 'Delayed'):
        data = {
            "command_id": command_id, 
            "instance_id": instance_id,
            "status": "RETRY", 
            "event": event
        }
        return data
    else:
        return {"status": "FAILED", "event": event}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Lambda "status" output will determine the StepFunction &lt;strong&gt;workflow&lt;/strong&gt;: if the SSM Document execution is completed, the StepFunction will terminate with a corresponding status (&lt;em&gt;Success&lt;/em&gt; or &lt;em&gt;Failed&lt;/em&gt;); instead, if the execution is still &lt;em&gt;in progress&lt;/em&gt;, the StepFunction will wait some time and then will re-execute the Lambda again for a new status check.&lt;/p&gt;

&lt;p&gt;I also need to grant the necessary permissions. The first Lambda must be able to launch the execution of the SSM Document on EC2; the second Lambda instead needs the permissions to consult the SSM executions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ec2_arn = Stack.of(self).format_arn(
    service="ec2",
    resource="instance",
    resource_name="*"
)

cfn_document_arn = Stack.of(self).format_arn(
    service="ssm",
    resource="document",
    resource_name=cfn_document.name
)

ssm_arn = Stack.of(self).format_arn(
    service="ssm",
    resource="*"
)

submit_lambda.add_to_role_policy(iam.PolicyStatement(
    resources=[cfn_document_arn, ec2_arn],
    actions=["ssm:SendCommand"]
))

status_lambda.add_to_role_policy(iam.PolicyStatement(
    resources=[ssm_arn],
    actions=["ssm:GetCommandInvocation"]
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that these are the only permissions I have explicitly written in my code, as these are capabilities deriving from Lambdas' internal logic. All other permissions (for example, reading the CodeCommit repository, executing the StepFunction, triggering the CodePipeline, etc.) are &lt;strong&gt;implicitly inferred&lt;/strong&gt; from the CDK framework, greatly shortening the writing of my IaC code.&lt;/p&gt;

&lt;h2&gt;
  
  
  StepFunction
&lt;/h2&gt;

&lt;p&gt;The StepFunction workflow is shown in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9y8qs3au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a6h8j2nhfobcovjtg2a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9y8qs3au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a6h8j2nhfobcovjtg2a2.png" alt="stepfunction-graph" width="203" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is its definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;submit_job = _aws_stepfunctions_tasks.LambdaInvoke(
    self, "Submit Job",
    lambda_function=submit_lambda
)

wait_job = _aws_stepfunctions.Wait(
    self, "Wait 10 Seconds",
    time=_aws_stepfunctions.WaitTime.duration(
        Duration.seconds(10))
)

status_job = _aws_stepfunctions_tasks.LambdaInvoke(
    self, "Get Status",
    lambda_function=status_lambda
)

fail_job = _aws_stepfunctions.Fail(
    self, "Fail",
    cause='AWS SSM Job Failed',
    error='Status Job returned FAILED'
)

succeed_job = _aws_stepfunctions.Succeed(
    self, "Succeeded",
    comment='AWS SSM Job succeeded'
)

definition = submit_job.next(wait_job)\
    .next(status_job)\
    .next(_aws_stepfunctions.Choice(self, 'Job Complete?')
            .when(_aws_stepfunctions.Condition.string_equals('$.Payload.status', 'FAILED'), fail_job)
            .when(_aws_stepfunctions.Condition.string_equals('$.Payload.status', 'SUCCEEDED'), succeed_job)
            .otherwise(wait_job))

sfn = _aws_stepfunctions.StateMachine(
    self, "StateMachine",
    definition=definition,
    timeout=Duration.minutes(5)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CodePipeline
&lt;/h2&gt;

&lt;p&gt;In this example, for the sake of simplicity, I define my pipeline by creating only two steps, the source one and the StepFunction execution one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline = codepipeline.Pipeline(self, "pipe-sfn-ec2Pipeline",
    pipeline_name="pipe-sfn-ec2Pipeline",
    cross_account_keys=False
)

source_output = codepipeline.Artifact("SourceArtifact")

source_action = codepipeline_actions.CodeCommitSourceAction(
    action_name="CodeCommit",
    repository=repo,
    branch="main",
    output=source_output
)

step_function_action = codepipeline_actions.StepFunctionInvokeAction(
    action_name="Invoke",
    state_machine=sfn,
    state_machine_input=codepipeline_actions.StateMachineInput.file_path(source_output.at_path("abc.json"))
)

pipeline.add_stage(
    stage_name="Source",
    actions=[source_action]
)

pipeline.add_stage(
    stage_name="StepFunctions",
    actions=[step_function_action]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I draw your attention to the &lt;code&gt;state_machine_input&lt;/code&gt; definition: in this code, I have indicated that the StepFunction input parameters must be read from the &lt;code&gt;abc.json&lt;/code&gt; file contained directly in the CodeCommit repository.&lt;/p&gt;

&lt;h1&gt;
  
  
  Execution
&lt;/h1&gt;

&lt;p&gt;To test the solution, push the &lt;code&gt;abc.json&lt;/code&gt; file with the following content into the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "instance_id": "i-1234567890abcdef",
    "message": "aSampleMessage"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, the developer who writes the code and has to execute commands on the virtual machine can specify both the target machine and the execution parameters.&lt;/p&gt;

&lt;p&gt;That's all! Once pushed, the pipeline starts automatically, downloads the code from the repository and launches the StepFunction:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BdTmJN4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/38exp7jhrfdii0l8oyp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BdTmJN4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/38exp7jhrfdii0l8oyp2.png" alt="codepipeline" width="854" height="1442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is possible to consult the flow of the StepFunction execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Zh0UYMp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzamt5qwtyhwnsihdnn2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Zh0UYMp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xzamt5qwtyhwnsihdnn2.png" alt="stepfunction-list" width="880" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also consult the execution of the SSM Document:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TwFQwrAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bsoqzijs1ezfter78emh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TwFQwrAw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bsoqzijs1ezfter78emh.png" alt="ssm-document" width="880" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Considerations
&lt;/h1&gt;

&lt;p&gt;Having &lt;strong&gt;constraints&lt;/strong&gt;, imposed by organizational choices or by some kinds of software, is a very common situation, especially in large companies: this should not discourage the introduction of modern methodologies and technologies, because these technologies allow solutions for (almost) any &lt;strong&gt;integration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The introduction of a StepFunction into the pipeline, which may seem like overengineering when commands take only a few seconds to execute on the virtual machine, is actually indispensable when execution takes a relatively long time.&lt;/p&gt;

&lt;p&gt;Using AWS CDK dramatically shortens code writing time, as long as you are familiar with one of the supported programming languages.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>serverless</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Blue/green deployment of a web server on ECS Fargate</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Sat, 20 Aug 2022 15:44:00 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/bluegreen-deployment-of-a-web-server-on-ecs-fargate-2kif</link>
      <guid>https://dev.to/monica_colangelo/bluegreen-deployment-of-a-web-server-on-ecs-fargate-2kif</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published at &lt;a href="https://letsmake.cloud/bluegreen-fargate"&gt;https://letsmake.cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Seamless technological upgrades of legacy infrastructures
&lt;/h1&gt;

&lt;p&gt;A frontline &lt;strong&gt;web server&lt;/strong&gt; exposing a backend application - who hasn't seen one?&lt;/p&gt;

&lt;p&gt;This apparently simple logical architecture is obviously based on multiple instances that can guarantee high reliability and load balancing.&lt;/p&gt;

&lt;p&gt;It is a model that has existed for decades, but the technologies that implement it evolve. Sometimes, for time, budget, or organizational reasons, it is not possible to &lt;strong&gt;modernize&lt;/strong&gt; specific applications: for example, because they often belong to different teams with different priorities, or because the project group has been dissolved and the application must be "kept alive" as it is.&lt;/p&gt;

&lt;p&gt;These situations are super common, and the result is that many old configurations are never deleted from web servers but instead continue to &lt;strong&gt;stratify&lt;/strong&gt; more and more, even becoming extremely complex.&lt;/p&gt;

&lt;p&gt;This complexity exponentially increases the &lt;strong&gt;risk&lt;/strong&gt; of making a mistake when introducing a change, and in this situation the &lt;strong&gt;&lt;em&gt;blast radius&lt;/em&gt;&lt;/strong&gt; is potentially enormous.&lt;/p&gt;

&lt;p&gt;To summarize, the use case I want to describe has the &lt;strong&gt;constraint&lt;/strong&gt; of not touching the configurations and of maintaining the logical architecture, but we want to act at the technological level to improve security and reliability and minimize operational risk.&lt;/p&gt;

&lt;p&gt;The solution I created is based on &lt;strong&gt;ECS Fargate&lt;/strong&gt;, where I transformed old virtual machines into containers, and uses the same methodology that usually applies to backend applications, that is the &lt;strong&gt;blue/green deployment&lt;/strong&gt; technique, with the execution of tests to decide whether a new configuration can go online safely.&lt;/p&gt;

&lt;h1&gt;
  
  
  Blue/green deployment
&lt;/h1&gt;

&lt;p&gt;Blue/green deployment is an application &lt;strong&gt;release model&lt;/strong&gt; that swaps traffic from an older version of an app or microservice to a new release. The previous version is called the blue environment, while the new version is called the green environment.&lt;/p&gt;

&lt;p&gt;In this model, it is essential to &lt;strong&gt;test&lt;/strong&gt; the green environment to ensure its readiness to handle production traffic. Once the tests are passed, this new version is &lt;strong&gt;promoted&lt;/strong&gt; to production by reconfiguring the load balancer to transfer the incoming traffic from the blue environment to the green environment, running the latest version of the application at last.&lt;/p&gt;

&lt;p&gt;Using this strategy increases application &lt;strong&gt;availability&lt;/strong&gt; and reduces &lt;strong&gt;operational risk&lt;/strong&gt;, and it also simplifies the rollback process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fully managed updates with ECS Fargate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS CodePipeline&lt;/strong&gt; &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/action-reference-ECSbluegreen.html"&gt;supports&lt;/a&gt; fully automated blue/green releases on &lt;strong&gt;Amazon Elastic Container Service (ECS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Normally, when you create an ECS service with an &lt;strong&gt;Application Load Balancer&lt;/strong&gt; in front of it, you need to designate a target group that contains the microservices to receive the requests. The blue/green approach involves the creation of &lt;strong&gt;two&lt;/strong&gt; target groups: one for the blue version and one for the green version of the service. It also uses a different listening port for each target group, so that you can test the green version of the service using the same path as the blue version.&lt;/p&gt;

&lt;p&gt;With this configuration, you run both environments in &lt;strong&gt;parallel&lt;/strong&gt; until you are ready to switch to the green version of the service.&lt;/p&gt;

&lt;p&gt;When you are ready to replace the old blue version with the new green version, you swap the listener rules with the target group rules; this change takes place in seconds. At this point, the green service runs in the target group behind the listener on the "original" port (which previously belonged to the blue version), while the blue service runs in the target group behind the listener on the port previously used by the green version, until it is terminated.&lt;/p&gt;
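&lt;p&gt;Under the hood, the swap boils down to repointing the listeners' default actions. CodeDeploy does this for you, but a boto3 sketch of the operation helps to see the mechanism (ARNs are placeholders, and the client is passed in to keep the example self-contained):&lt;/p&gt;

```python
def swap_listener_target(elbv2_client, listener_arn: str, green_tg_arn: str):
    """Repoint a listener's default action to the green target group.

    This is, in essence, the operation CodeDeploy performs during the
    blue/green swap; `elbv2_client` would be boto3.client("elbv2").
    """
    return elbv2_client.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": green_tg_arn}],
    )
```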

&lt;p&gt;At this point, the real &lt;strong&gt;question&lt;/strong&gt; is: how can this system decide if and when the green version is ready to replace the blue version?&lt;/p&gt;

&lt;p&gt;You need a control logic that executes &lt;strong&gt;tests&lt;/strong&gt; to evaluate whether the new version can replace the old one with a high degree of confidence. Swapping from the old to the new version is only allowed after passing these tests.&lt;/p&gt;

&lt;p&gt;All these steps are &lt;strong&gt;fully automatic&lt;/strong&gt; on ECS thanks to the complete integration of AWS CodePipeline + CodeBuild + CodeDeploy services. The control tests in my case are performed by a Lambda.&lt;/p&gt;
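&lt;p&gt;With this setup, the test Lambda is invoked by CodeDeploy as a lifecycle hook (for example on the &lt;em&gt;AfterAllowTestTraffic&lt;/em&gt; event) and must report a verdict back before the deployment proceeds. A minimal handler sketch follows; &lt;code&gt;run_checks&lt;/code&gt; is a hypothetical stand-in for the real tests, and the client is injectable so the example stays self-contained:&lt;/p&gt;

```python
def lambda_handler(event, context, codedeploy_client=None, run_checks=lambda: True):
    """Report the green environment's test outcome back to CodeDeploy.

    CodeDeploy pauses the deployment at the lifecycle hook until this
    handler reports Succeeded or Failed. In a real deployment,
    `codedeploy_client` would be boto3.client("codedeploy").
    """
    status = "Succeeded" if run_checks() else "Failed"
    codedeploy_client.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status
```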

&lt;p&gt;The following diagram illustrates the approach described.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--plBkiU5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c5uom73lr2fkpiaz919e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--plBkiU5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c5uom73lr2fkpiaz919e.png" alt="bluegreen" width="825" height="777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Using Terraform to build a blue/green deployment system on ECS
&lt;/h1&gt;

&lt;p&gt;Creating an ECS cluster and a pipeline that builds the new version of the container image to deploy in blue/green mode is not difficult in itself but requires creating many cloud resources to coordinate. Below we will look at some of the key details in creating these assets with Terraform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can find the complete example &lt;a href="https://github.com/theonlymonica/bluegreen-ecs-fargate-examples"&gt;at this link&lt;/a&gt;. Here are just some code snippets useful for examining the use case.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Application Load Balancer
&lt;/h2&gt;

&lt;p&gt;One of the basic resources for our architecture is the load balancer. I enable access logs stored in a bucket because they will be used indirectly for the testing Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_alb" "load_balancer" {
  name            = replace(local.name, "_", "-")
  internal        = false
  access_logs {
    bucket  = aws_s3_bucket.logs_bucket.bucket
    prefix  = "alb_access_logs"
    enabled = true
  }
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I create two target groups, identical to each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_alb_target_group" "tg_blue" {
  name        = join("-", [replace(local.name, "_", "-"), "blue"])
  port        = 80
  protocol    = "HTTP"
  target_type = "ip"
  ...
}

resource "aws_alb_target_group" "tg_green" {
  name        = join("-", [replace(local.name, "_", "-"), "green"])
  port        = 80
  protocol    = "HTTP"
  target_type = "ip"
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned before, I create two listeners on two different ports, 80 and 8080 in this example. The &lt;code&gt;ignore_changes&lt;/code&gt; meta-argument makes Terraform ignore future changes to the &lt;code&gt;default_action&lt;/code&gt;, which will be modified by the blue/green deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_alb_listener" "lb_listener_80" {
  load_balancer_arn = aws_alb.load_balancer.id
  port              = "80"
  protocol          = "HTTP"

  default_action {
    target_group_arn = aws_alb_target_group.tg_blue.id
    type             = "forward"
  }

  lifecycle {
    ignore_changes = [default_action]
  }
}

resource "aws_alb_listener" "lb_listener_8080" {
  load_balancer_arn = aws_alb.load_balancer.id
  port              = "8080"
  protocol          = "HTTP"

  default_action {
    target_group_arn = aws_alb_target_group.tg_green.id
    type             = "forward"
  }

  lifecycle {
    ignore_changes = [default_action]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ECS
&lt;/h2&gt;

&lt;p&gt;In the ECS cluster configuration, the service definition specifies the target group to associate at creation time. This setting will be ignored in any subsequent Terraform run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_ecs_service" "ecs_service" {
  name             = local.name
  cluster          = aws_ecs_cluster.ecs_cluster.id
  task_definition  = aws_ecs_task_definition.task_definition.arn
  desired_count    = 2
  launch_type      = "FARGATE"

  deployment_controller {
    type = "CODE_DEPLOY"
  }

  load_balancer {
    target_group_arn = aws_alb_target_group.tg_blue.arn
    container_name   = local.name
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [task_definition, load_balancer, desired_count]
  }
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CodeCommit
&lt;/h2&gt;

&lt;p&gt;The application code - in my case, the webserver configurations and the &lt;code&gt;Dockerfile&lt;/code&gt; used to build the image - is stored in a Git repository on CodeCommit. An EventBridge rule associated with this repository intercepts every push event and triggers the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codecommit_repository" "repo" {
  repository_name = local.name
  description     = "${local.name} Repository"
}

resource "aws_cloudwatch_event_rule" "commit" {
  name        = "${local.name}-capture-commit-event"
  description = "Capture ${local.name} repo commit"

  event_pattern = &amp;lt;&amp;lt;EOF
{
  "source": [
    "aws.codecommit"
  ],
  "detail-type": [
    "CodeCommit Repository State Change"
  ],
  "resources": [
   "${aws_codecommit_repository.repo.arn}"
  ],
  "detail": {
    "referenceType": [
      "branch"
    ],
    "referenceName": [
      "${aws_codecommit_repository.repo.default_branch}"
    ]
  }
}
EOF
}

resource "aws_cloudwatch_event_target" "event_target" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.commit.name
  arn       = aws_codepipeline.codepipeline.arn
  role_arn  = aws_iam_role.codepipeline_role.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Dockerfile&lt;/code&gt; that will be saved on this repository depends of course on the application. In my case I have a code structure like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── version.txt
├── Dockerfile
└── etc
    └── nginx
        ├── nginx.conf
        └── conf.d
            ├── file1.conf
            └── ...
        └── projects.d
            ├── file2.conf
            └── ...
        └── upstream.d
            └── file3.conf
            └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My &lt;code&gt;Dockerfile&lt;/code&gt; will be therefore very simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM nginx:latest

COPY etc/nginx/nginx.conf /etc/nginx/nginx.conf 
COPY etc/nginx/conf.d /etc/nginx/conf.d 
COPY etc/nginx/projects.d /etc/nginx/projects.d/ 
COPY etc/nginx/upstream.d /etc/nginx/upstream.d/
COPY version.txt /usr/share/nginx/html/version.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
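&lt;p&gt;Since &lt;code&gt;version.txt&lt;/code&gt; ends up in the nginx document root, it gives a quick way to check which build each listener is serving during a deployment. A minimal sketch (the endpoint names are hypothetical, and the &lt;code&gt;opener&lt;/code&gt; parameter exists only to make the function testable offline):&lt;/p&gt;

```python
import urllib.request

def get_deployed_version(endpoint, opener=urllib.request.urlopen):
    """Fetch the contents of version.txt served by nginx at the given endpoint."""
    with opener("http://" + endpoint + "/version.txt") as resp:
        return resp.read().decode().strip()

# During a blue/green deployment, the test listener (port 8080) serves the
# green tasks while port 80 still serves blue, so the two can be compared:
#   prod = get_deployed_version("my-alb-1234.eu-west-1.elb.amazonaws.com")
#   test = get_deployed_version("my-alb-1234.eu-west-1.elb.amazonaws.com:8080")
```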



&lt;h2&gt;
  
  
  CodeBuild
&lt;/h2&gt;

&lt;p&gt;I then configure the CodeBuild step that generates the new version of the container image. The &lt;code&gt;privileged_mode = true&lt;/code&gt; property enables the Docker daemon within the CodeBuild container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codebuild_project" "codebuild" {
  name          = local.name
  description   = "${local.name} Codebuild Project"
  build_timeout = "5"
  service_role  = aws_iam_role.codebuild_role.arn

  artifacts {
    type = "CODEPIPELINE"
  }

  environment {
    compute_type                = "BUILD_GENERAL1_SMALL"
    image                       = "aws/codebuild/standard:6.0"
    type                        = "LINUX_CONTAINER"
    image_pull_credentials_type = "CODEBUILD"
    privileged_mode             = true

    environment_variable {
      name  = "IMAGE_REPO_NAME"
      value = aws_ecr_repository.ecr_repo.name
    }

    environment_variable {
      name  = "AWS_ACCOUNT_ID"
      value = data.aws_caller_identity.current.account_id
    }
  }

  source {
    type      = "CODEPIPELINE"
    buildspec = "buildspec.yml"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;buildspec.yml&lt;/code&gt; file in the CodeBuild configuration is used to define how to generate the container image. This file is included in the repository along with the code, and it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 0.2
env:
  shell: bash

phases:
  install:
    runtime-versions:
      docker: 19

  pre_build:
    commands:
      - IMAGE_TAG=$CODEBUILD_BUILD_NUMBER
      - echo Logging in to Amazon ECR...
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com

  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:latest

  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing to repo
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:latest
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CodeDeploy
&lt;/h2&gt;

&lt;p&gt;The CodeDeploy configuration includes three different resources. The first two are relatively simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codedeploy_app" "codedeploy_app" {
  compute_platform = "ECS"
  name             = local.name
}

resource "aws_codedeploy_deployment_config" "config_deploy" {
  deployment_config_name = local.name
  compute_platform       = "ECS"

  traffic_routing_config {
    type = "AllAtOnce"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I finally configure the blue/green deployment. With this code I instruct CodeDeploy to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;perform an automatic rollback in case of deployment failure&lt;/li&gt;
&lt;li&gt;if the deployment succeeds, terminate the old version after 5 minutes&lt;/li&gt;
&lt;li&gt;use the listeners for "normal" (prod) traffic and for test traffic
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codedeploy_deployment_group" "codedeploy_deployment_group" {
  app_name               = aws_codedeploy_app.codedeploy_app.name
  deployment_group_name  = local.name
  service_role_arn       = aws_iam_role.codedeploy_role.arn
  deployment_config_name = aws_codedeploy_deployment_config.config_deploy.deployment_config_name

  ecs_service {
    cluster_name = aws_ecs_cluster.ecs_cluster.name
    service_name = aws_ecs_service.ecs_service.name
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }

  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout    = "CONTINUE_DEPLOYMENT"
      wait_time_in_minutes = 0
    }

    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }

  load_balancer_info {
    target_group_pair_info {
      target_group {
        name = aws_alb_target_group.tg_blue.name
      }

      target_group {
        name = aws_alb_target_group.tg_green.name
      }

      prod_traffic_route {
        listener_arns = [aws_alb_listener.lb_listener_80.arn]
      }

      test_traffic_route {
        listener_arns = [aws_alb_listener.lb_listener_8080.arn]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two additional files, essential for CodeDeploy to work correctly, must be added to the Git repository alongside the code.&lt;/p&gt;

&lt;p&gt;The first file is &lt;code&gt;taskdef.json&lt;/code&gt;, which contains the task definition for our ECS service; the &lt;code&gt;container image&lt;/code&gt;, &lt;code&gt;executionRole&lt;/code&gt; and &lt;code&gt;logConfiguration&lt;/code&gt; values must match the resources created by Terraform. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "executionRoleArn": "arn:aws:iam::123456789012:role/ECS_role_BlueGreenDemo",
  "containerDefinitions": [
    {
      "name": "BlueGreenDemo",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/BlueGreenDemo_repository:latest",
      "essential": true,
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "ulimits": [
        {
          "name": "nofile",
          "softLimit": 4096,
          "hardLimit": 4096
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/aws/ecs/BlueGreenDemo",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "nginx"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "family": "BlueGreenDemo"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second file to include in the repository is &lt;code&gt;appspec.yml&lt;/code&gt;, the file CodeDeploy uses to perform the release operations. The task definition is set with a placeholder (because the real file path is referenced in the CodePipeline configuration), and the Lambda to run for the tests is indicated by name.&lt;/p&gt;

&lt;p&gt;In our case, the lambda must be executed when the &lt;code&gt;AfterAllowTestTraffic&lt;/code&gt; event arrives, that is, when the new version is ready to receive the test traffic. Other possible hooks are documented &lt;a href="https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file-structure-hooks.html#appspec-hooks-ecs"&gt;on this page&lt;/a&gt;; my choice depended on my use case and how I decided to implement my tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "&amp;lt;TASK_DEFINITION&amp;gt;"
        LoadBalancerInfo:
          ContainerName: "BlueGreenDemo"
          ContainerPort: 80
Hooks:
  - AfterAllowTestTraffic: "BlueGreenDemo_lambda"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lambda
&lt;/h2&gt;

&lt;p&gt;The Lambda function that performs the tests is created by Terraform, and some environment variables are also configured, the purpose of which will be explained later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lambda_function" "lambda" {
  function_name = "${local.name}_lambda"
  role          = aws_iam_role.lambda_role.arn
  handler       = "app.lambda_handler"
  runtime       = "python3.8"
  timeout       = 300
  s3_bucket     = aws_s3_bucket.lambda_bucket.bucket
  s3_key        = aws_s3_object.lambda_object.key
  environment {
    variables = {
      BUCKET               = aws_s3_bucket.testdata_bucket.bucket
      FILEPATH             = "acceptance_url_list.csv"
      ENDPOINT             = "${local.custom_endpoint}:8080"
      ACCEPTANCE_THRESHOLD = "90"
    }
  }
}

resource "aws_s3_object" "lambda_object" {
  key    = "${local.name}/dist.zip"
  bucket = aws_s3_bucket.lambda_bucket.bucket
  source = data.archive_file.lambda_zip_file.output_path
}

data "archive_file" "lambda_zip_file" {
  type        = "zip"
  output_path = "${path.module}/${local.name}-lambda.zip"
  source_file = "${path.module}/../lambda/app.py"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CodePipeline
&lt;/h2&gt;

&lt;p&gt;Finally, to make all the resources seen so far interact correctly, I configure CodePipeline to orchestrate the three stages corresponding to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the source code download from CodeCommit&lt;/li&gt;
&lt;li&gt;the container image build performed by CodeBuild&lt;/li&gt;
&lt;li&gt;the release performed by CodeDeploy
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codepipeline" "codepipeline" {
  name     = local.name
  role_arn = aws_iam_role.codepipeline_role.arn

  artifact_store {
    location = aws_s3_bucket.codepipeline_bucket.bucket
    type     = "S3"
  }

  stage {
    name = "Source"
    action {
      name             = "Source"
      category         = "Source"
      owner            = "AWS"
      provider         = "CodeCommit"
      version          = "1"
      output_artifacts = ["source_output"]

      configuration = {
        RepositoryName        = aws_codecommit_repository.repo.repository_name
        BranchName            = aws_codecommit_repository.repo.default_branch
        PollForSourceChanges  = false
      }
    }
  }

  stage {
    name = "Build"
    action {
      name             = "Build"
      category         = "Build"
      owner            = "AWS"
      provider         = "CodeBuild"
      input_artifacts  = ["source_output"]
      output_artifacts = ["build_output"]
      version          = "1"

      configuration = {
        ProjectName = aws_codebuild_project.codebuild.name
      }
    }
  }

  stage {
    name = "Deploy"
    action {
      category        = "Deploy"
      name            = "Deploy"
      owner           = "AWS"
      provider        = "CodeDeployToECS"
      version         = "1"
      input_artifacts = ["source_output"]

      configuration = {
        ApplicationName                = local.name
        DeploymentGroupName            = local.name
        AppSpecTemplateArtifact        = "source_output"
        AppSpecTemplatePath            = "appspec.yml"
        TaskDefinitionTemplateArtifact = "source_output"
        TaskDefinitionTemplatePath     = "taskdef.json"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
   Purpose of the tests and implementation logic of Lambda
&lt;/h1&gt;

&lt;p&gt;The purpose of the control test on the new version of the service to be deployed is not to verify that the new configurations are working and conform to expectations, but rather that they do not introduce "&lt;strong&gt;regressions&lt;/strong&gt;" on the previous behaviour of the web server. In essence, this is a mechanism to reduce the &lt;strong&gt;risk&lt;/strong&gt; of "breaking" something that used to work - which is extremely important in the case of an infrastructure shared by many applications.&lt;/p&gt;

&lt;p&gt;The idea behind this Lambda is to make a series of requests to the new version of the service when it has been created and is ready to receive traffic, but the load balancer is still configured with the old version (trigger event of the &lt;code&gt;AfterAllowTestTraffic&lt;/code&gt; hook configured in the CodeDeploy &lt;code&gt;appspec.yml&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The list of URLs to be tested must be &lt;strong&gt;prepared&lt;/strong&gt; by a separate process. It may be a static list, but in my case, having no control or visibility over the URLs served dynamically by the backend (think of a CMS: tons of URLs that can change at any time), I created a nightly job whose starting point is the previous day's webserver access logs. I also had to account for URLs that are rarely visited and may not appear in the access logs every day. Furthermore, since the execution time of a Lambda is limited, a significant subset of URLs must be chosen carefully so as not to prolong the execution excessively and risk a timeout.&lt;/p&gt;

&lt;p&gt;The process of creating this list depends on how many and which URLs the web server serves, so it is not possible to suggest a single way to generate it: it is strictly dependent on the use case. In my example, the list of URLs to request is contained in the file &lt;code&gt;acceptance_url_list.csv&lt;/code&gt; on an S3 bucket.&lt;/p&gt;
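&lt;p&gt;As an illustration only (the log format, file name and &lt;code&gt;top_n&lt;/code&gt; value here are assumptions, and a real job would read the previous day's logs from S3), a nightly job of this kind might extract the most requested paths like this:&lt;/p&gt;

```python
import csv
from collections import Counter

def build_url_list(log_lines, top_n=100):
    """Extract the most requested paths from combined-log-style access log
    lines, where the quoted request field looks like: "GET /path HTTP/1.1"."""
    counter = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]   # the quoted request field
            path = request.split()[1]      # METHOD PATH PROTOCOL
        except IndexError:
            continue                       # skip malformed lines
        counter[path] += 1
    return [path for path, _ in counter.most_common(top_n)]

def write_url_csv(paths, filename="acceptance_url_list.csv"):
    """Write one path per row, the format read back by the testing Lambda."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        for path in paths:
            writer.writerow([path])
```

Capping the list at the `top_n` most frequent paths also keeps the Lambda execution time bounded.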

&lt;p&gt;The environment variables used by my Lambda include the bucket and path of this file, the endpoint to send requests to, and a parameter introduced to allow for a &lt;strong&gt;margin of error&lt;/strong&gt;. The applications behind the web server may change as a result of application releases, so URLs that worked the day before may no longer be reachable. Since complete control is impossible, especially in very complex infrastructures, I chose to introduce a threshold: the percentage of requests that must receive an HTTP 200 response for the test to be considered passed.&lt;/p&gt;

&lt;p&gt;Once this logic is understood, the Lambda code is not particularly complex: the function makes the requests in the list, calculates the percentage of successful responses, and finally notifies CodeDeploy of the outcome of the test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import urllib.request
import os
import csv
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event,context):
    codedeploy = boto3.client('codedeploy')

    endpoint = os.environ['ENDPOINT']
    bucket = os.environ['BUCKET']
    file = os.environ['FILEPATH']
    source_file = "s3://"+os.environ['BUCKET']+"/"+os.environ['FILEPATH']
    perc_min = os.environ['ACCEPTANCE_THRESHOLD']

    count_200 = 0
    count_err = 0

    s3client = boto3.client('s3')
    try:
        s3client.download_file(bucket, file, "/tmp/"+file)
    except Exception as e:
        # fail fast: without the URL list the test cannot run
        logger.error("Unable to download "+source_file+": "+str(e))
        raise

    with open("/tmp/"+file, newline='') as f:
        reader = csv.reader(f)
        list1 = list(reader)

    for url_part in list1:
        code = 0
        url = "http://"+endpoint+url_part[0]
        try:
            request = urllib.request.urlopen(url)
            code = request.code
            if code == 200:
                count_200 = count_200 + 1
            else:
                count_err = count_err + 1
        except:
            count_err = count_err + 1
        if code == 0:
            logger.info(url+" Error")
        else:
            logger.info(url+" "+str(code))

    status = 'Failed'
    perc_200 = int((count_200 / (count_200 + count_err)) * 100)
    logger.info("HTTP 200 response percentage: "+str(perc_200))
    if perc_200 &amp;gt;= int(perc_min):
        status = "Succeeded"

    logger.info("TEST RESULT: ")
    logger.info(status)

    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],            
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status
    )
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Integration is a "tailoring" activity
&lt;/h1&gt;

&lt;p&gt;In this article, we have seen how to integrate and coordinate many different components to make them converge towards end-to-end automation, including resources that need to be tailored to the specific use case.&lt;/p&gt;

&lt;p&gt;Automation of releases is an activity that I find very rewarding. Historically, releases have always been a thorn in the side, precisely because the activity was manual, not subject to any tests, with many unexpected variables: fortunately, as we have seen, it is now possible to rely on a well-defined, clear and repeatable process.&lt;/p&gt;

&lt;p&gt;The tools we have available for automation are very interesting and versatile, but the cloud doesn't do it all by itself: some important &lt;strong&gt;integration&lt;/strong&gt; work is necessary (the code we have seen is only a part; the &lt;a href="https://github.com/theonlymonica/bluegreen-ecs-fargate-examples"&gt;complete example is here&lt;/a&gt;), and above all you need to know how to &lt;strong&gt;adapt&lt;/strong&gt; the resources to the use case, always looking for the best solution to the specific problem.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>containers</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How to expose multiple applications on Google Kubernetes Engine with a single Cloud Load Balancer</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Sat, 20 Aug 2022 15:38:00 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/how-to-expose-multiple-applications-on-google-kubernetes-engine-with-a-single-cloud-load-balancer-43c4</link>
      <guid>https://dev.to/monica_colangelo/how-to-expose-multiple-applications-on-google-kubernetes-engine-with-a-single-cloud-load-balancer-43c4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published at &lt;a href="https://letsmake.cloud/multiple-gke-single-lb"&gt;https://letsmake.cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my &lt;a href="https://letsmake.cloud/multiple-eks-single-alb"&gt;previous article&lt;/a&gt;, I talked about how to expose multiple applications hosted on AWS EKS via a single Application Load Balancer.&lt;/p&gt;

&lt;p&gt;In this article, we will see how to do the same thing, this time not on AWS but &lt;strong&gt;Google Cloud&lt;/strong&gt;!&lt;/p&gt;

&lt;h1&gt;
  
  
  Network Endpoint Group and Container-native load balancing
&lt;/h1&gt;

&lt;p&gt;On GCP, configurations called &lt;a href="https://cloud.google.com/load-balancing/docs/negs"&gt;Network Endpoint Groups (NEGs)&lt;/a&gt; specify a group of backend endpoints or services. A common use case for NEGs is deploying containerized services and using the NEGs as backends for a load balancer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/container-native-load-balancing"&gt;Container-native load balancing&lt;/a&gt; uses &lt;code&gt;GCE_VM_IP_PORT&lt;/code&gt; NEGs (where NEG endpoints are pod IP addresses) and allows the load balancer to target pods, and distribute traffic among them directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vk-Yk-_j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivf0nn4rv8jkhxiu85jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vk-Yk-_j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivf0nn4rv8jkhxiu85jg.png" alt="neg" width="528" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Commonly, container-native load balancing is used with the GKE &lt;em&gt;Ingress&lt;/em&gt; resource. In that case, the &lt;em&gt;ingress-controller&lt;/em&gt; takes care of creating the entire chain of necessary resources, including the load balancer; this means that each application on GKE corresponds to an &lt;em&gt;Ingress&lt;/em&gt; and, consequently, a load balancer.&lt;/p&gt;

&lt;p&gt;Without the &lt;em&gt;ingress-controller&lt;/em&gt;, GCP allows you to create standalone NEGs; in that case, you have to act manually, and you lose the advantages of the elasticity and speed of a cloud-native architecture.&lt;/p&gt;

&lt;p&gt;To summarize my use case: I want a single load balancer, configured independently from GKE, that routes traffic to different GKE applications according to the rules established by my architecture; at the same time, I want to take advantage of cloud-native automation without performing manual configuration updates.&lt;/p&gt;

&lt;h1&gt;
  
  
  AWS ALB vs GCP Load Balancing
&lt;/h1&gt;

&lt;p&gt;Implementing the same use case on two different cloud providers, the most noteworthy difference is the "boundary" that Kubernetes reaches in managing resources; or, seen from the other side, the configurations that must be prepared on the cloud provider (manually or, as we will see, with Terraform).&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://letsmake.cloud/multiple-eks-single-alb"&gt;article about EKS&lt;/a&gt;, on AWS I configured, in addition to the ALB, the &lt;em&gt;target groups&lt;/em&gt;, one for each application to be exposed; these target groups were created as "empty boxes". Subsequently, I created the &lt;em&gt;deployments&lt;/em&gt; and their related &lt;em&gt;services&lt;/em&gt; on EKS; finally, I made a &lt;code&gt;TargetGroupBinding&lt;/code&gt; configuration (&lt;em&gt;lb-controller&lt;/em&gt; custom resource) to indicate to the pods belonging to a specific service which was the correct target group to register with.&lt;/p&gt;

&lt;p&gt;In GCP, the &lt;strong&gt;Backend Service&lt;/strong&gt; resource (roughly analogous to an AWS target group) cannot be created as an "empty box": it must know, at creation time, the targets to forward traffic to. As I said before, in my use case the targets are the NEGs that GKE generates automatically when a Kubernetes service is created; consequently, I will create these &lt;em&gt;services&lt;/em&gt; at the same time as the infrastructure (they will be my "empty boxes"), and I will manage only the application &lt;em&gt;deployments&lt;/em&gt; separately.&lt;/p&gt;

&lt;p&gt;This apparent difference is purely &lt;strong&gt;operational&lt;/strong&gt;: it is just a matter of configuring the Kubernetes service with different tools, and it can be noteworthy if the configuration of the cloud resources (for example, with Terraform) is made by a different team than the one that deploys the applications in the cluster.&lt;/p&gt;

&lt;p&gt;From a functional point of view, the two solutions are exactly equivalent.&lt;/p&gt;

&lt;p&gt;The other difference is that in GKE the VPC IP addresses to be assigned to the pods are managed natively, and they do not require any add-on, while on EKS the &lt;em&gt;VPC CNI&lt;/em&gt; plugin or other similar third-party plugins must be used.&lt;/p&gt;

&lt;h1&gt;
  
  
  Component configuration
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;GKE cluster and network configuration are considered a prerequisite and will not be covered here. The code shown here is partial; a complete example can be &lt;a href="https://github.com/theonlymonica/multiple-app-single-lb-examples"&gt;found here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Kubernetes Services
&lt;/h2&gt;

&lt;p&gt;In this example I create two different applications, represented by Nginx and Apache, to show traffic routing to two different endpoints.&lt;/p&gt;

&lt;p&gt;With Terraform I create the Kubernetes services related to the two applications; the use of &lt;strong&gt;annotations&lt;/strong&gt; allows the automatic creation of NEGs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "kubernetes_service" "apache" {
  metadata {
    name      = "apache"
    namespace = local.namespace
    annotations = {
      "cloud.google.com/neg" = "{\"exposed_ports\": {\"80\":{\"name\": \"${local.neg_name_apache}\"}}}"
      "cloud.google.com/neg-status" = jsonencode(
        {
          network_endpoint_groups = {
            "80" = local.neg_name_apache
          }
          zones = data.google_compute_zones.available.names
        }
      )
    }
  }
  spec {
    port {
      name        = "http"
      protocol    = "TCP"
      port        = 80
      target_port = "80"
    }
    selector = {
      app = "apache"
    }
    type = "ClusterIP"
  }
}

resource "kubernetes_service" "nginx" {
  metadata {
    name      = "nginx"
    namespace = local.namespace
    annotations = {
      "cloud.google.com/neg" = "{\"exposed_ports\": {\"80\":{\"name\": \"${local.neg_name_nginx}\"}}}"
      "cloud.google.com/neg-status" = jsonencode(
        {
          network_endpoint_groups = {
            "80" = local.neg_name_nginx
          }
          zones = data.google_compute_zones.available.names
        }
      )
    }
  }
  spec {
    port {
      name        = "http"
      protocol    = "TCP"
      port        = 80
      target_port = "80"
    }
    selector = {
      app = "nginx"
    }
    type = "ClusterIP"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  NEG
&lt;/h2&gt;

&lt;p&gt;NEG links always have the same structure, so it's easy to build a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;locals {
  neg_name_apache = "apache"
  neg_apache      = formatlist("https://www.googleapis.com/compute/v1/projects/%s/zones/%s/networkEndpointGroups/%s", module.enabled_google_apis.project_id, data.google_compute_zones.available.names, local.neg_name_apache)
  neg_name_nginx  = "nginx"
  neg_nginx       = formatlist("https://www.googleapis.com/compute/v1/projects/%s/zones/%s/networkEndpointGroups/%s", module.enabled_google_apis.project_id, data.google_compute_zones.available.names, local.neg_name_nginx)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---tfe_ZZy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r837h0z7ek4dnlob6jw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---tfe_ZZy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r837h0z7ek4dnlob6jw8.png" alt="gcp-neg" width="880" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend
&lt;/h2&gt;

&lt;p&gt;At this point it is easy to create the backend services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_backend_service" "backend_apache" {
  name    = "${local.name}-backend-apache"

  dynamic "backend" {
    for_each = local.neg_apache
    content {
      group          = backend.value
      balancing_mode = "RATE"
      max_rate       = 100
    }
  }
...
}

resource "google_compute_backend_service" "backend_nginx" {
  name    = "${local.name}-backend-nginx"

  dynamic "backend" {
    for_each = local.neg_nginx
    content {
      group          = backend.value
      balancing_mode = "RATE"
      max_rate       = 100
    }
  }
...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L5XW-R7L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8q6bdhbzdqk19miz7dey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L5XW-R7L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8q6bdhbzdqk19miz7dey.png" alt="gcp-backend" width="880" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  URL Map
&lt;/h2&gt;

&lt;p&gt;I then define the &lt;code&gt;url_map&lt;/code&gt; resource, which represents the traffic routing logic. In this example, I use a set of rules that are the same for all domains to which my load balancer responds, and I address the traffic according to the &lt;em&gt;path&lt;/em&gt;; you can customize the routing rules following the &lt;a href="https://cloud.google.com/load-balancing/docs/url-map"&gt;documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_url_map" "http_url_map" {
  project         = module.enabled_google_apis.project_id
  name            = "${local.name}-loadbalancer"
  default_service = google_compute_backend_bucket.static_site.id

  host_rule {
    hosts        = local.domains
    path_matcher = "all"
  }

  path_matcher {
    name            = "all"
    default_service = google_compute_backend_bucket.static_site.id

    path_rule {
      paths = ["/apache"]
      route_action {
        url_rewrite {
          path_prefix_rewrite = "/"
        }
      }
      service = google_compute_backend_service.backend_apache.id
    }

    path_rule {
      paths = ["/nginx"]
      route_action {
        url_rewrite {
          path_prefix_rewrite = "/"
        }
      }
      service = google_compute_backend_service.backend_nginx.id
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORB3h7NN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fnqnz7tazpk6oo311ua4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORB3h7NN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fnqnz7tazpk6oo311ua4.png" alt="gcp-frontend" width="880" height="772"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Finally, the resources that bind the created components together are a &lt;code&gt;target_http_proxy&lt;/code&gt; and a &lt;code&gt;global_forwarding_rule&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_target_http_proxy" "http_proxy" {
  project = module.enabled_google_apis.project_id
  name    = "http-proxy"
  url_map = google_compute_url_map.http_url_map.self_link
}

resource "google_compute_global_forwarding_rule" "http_fw_rule" {
  project               = module.enabled_google_apis.project_id
  name                  = "http-fw-rule"
  port_range            = 80
  target                = google_compute_target_http_proxy.http_proxy.self_link
  load_balancing_scheme = "EXTERNAL"
  ip_address            = google_compute_global_address.ext_lb_ip.address
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Use on Kubernetes
&lt;/h1&gt;

&lt;p&gt;Once the setup on GCP is complete, using this technique on GKE is even easier than on EKS. It is sufficient to add a &lt;em&gt;deployment&lt;/em&gt; resource that matches the &lt;em&gt;service&lt;/em&gt; already created on the load balancer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache
  labels:
    app: apache
spec:
  selector:
    matchLabels:
      app: apache
  strategy:
    type: Recreate
  replicas: 3
  template:
    metadata:
      labels:
        app: apache
    spec:
      containers:
        - name: httpd
          image: httpd:2.4
          ports:
            - containerPort: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From now on, each new pod belonging to the deployment associated with that service will automatically be registered in its NEG. To test it, just scale the number of replicas of the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment nginx --replicas 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within a few seconds the new pods will appear as targets of the NEG.&lt;/p&gt;
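
&lt;p&gt;To verify, you can list the endpoints registered in the NEG with the gcloud CLI; here the NEG name is the one from this example, while the zone is just an illustrative value to replace with one of your cluster's zones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute network-endpoint-groups list-network-endpoints nginx \
  --zone europe-west1-b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output lists the IP and port of each pod currently attached to the NEG in that zone.&lt;/p&gt;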

&lt;blockquote&gt;
&lt;p&gt;Thanks to &lt;a href="https://www.linkedin.com/in/cristian-conte-27a93565"&gt;Cristian Conte&lt;/a&gt; for contributing with his GCP knowledge!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>gcp</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How to expose multiple applications on Amazon EKS with a single Application Load Balancer</title>
      <dc:creator>Monica Colangelo</dc:creator>
      <pubDate>Sat, 20 Aug 2022 15:32:00 +0000</pubDate>
      <link>https://dev.to/monica_colangelo/how-to-expose-multiple-applications-on-amazon-eks-with-a-single-application-load-balancer-ond</link>
      <guid>https://dev.to/monica_colangelo/how-to-expose-multiple-applications-on-amazon-eks-with-a-single-application-load-balancer-ond</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published at &lt;a href="https://letsmake.cloud/multiple-eks-single-alb"&gt;https://letsmake.cloud&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Expose microservices to the Internet with AWS
&lt;/h1&gt;

&lt;p&gt;One of the defining moments in building a microservices application is deciding how to &lt;strong&gt;expose&lt;/strong&gt; endpoints so that a client or API can send requests and get responses.&lt;/p&gt;

&lt;p&gt;Usually, each microservice has its own endpoint. For example, each URL path will point to a different microservice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;www.example.com/service1 &amp;gt; microservice1
www.example.com/service2 &amp;gt; microservice2
www.example.com/service3 &amp;gt; microservice3
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This type of routing is known as &lt;strong&gt;&lt;em&gt;path-based routing&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This approach has the advantage of being &lt;strong&gt;low-cost&lt;/strong&gt; and simple, even when exposing dozens of microservices.&lt;/p&gt;

&lt;p&gt;On AWS, both &lt;strong&gt;Application Load Balancer (ALB)&lt;/strong&gt; and &lt;strong&gt;Amazon API Gateway&lt;/strong&gt; support this feature. Therefore, with a &lt;strong&gt;single ALB&lt;/strong&gt; or API Gateway, you can expose microservices running as containers with Amazon EKS or Amazon ECS, or serverless functions with AWS Lambda.&lt;/p&gt;

&lt;p&gt;AWS recently proposed a &lt;a href="https://aws.amazon.com/blogs/containers/how-to-expose-multiple-applications-on-amazon-eks-using-a-single-application-load-balancer/"&gt;solution to expose EKS orchestrated microservices via an Application Load Balancer&lt;/a&gt;. Their solution is based on the use of &lt;em&gt;NodePort&lt;/em&gt; exposed by Kubernetes.&lt;/p&gt;

&lt;p&gt;Instead, I want to propose a different solution that uses the EKS cluster &lt;strong&gt;VPC CNI add-on&lt;/strong&gt; and allows the pods to automatically connect to their &lt;em&gt;target group&lt;/em&gt;, without using any &lt;em&gt;NodePort&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Also, in my use case, the Application Load Balancer is managed &lt;strong&gt;independently&lt;/strong&gt; of EKS, i.e. it is not Kubernetes that has control over it. This way you can use other types of routing on the load balancer; for example, you could have an SSL certificate with more than one domain (&lt;em&gt;SNI&lt;/em&gt;) and base the routing not only on the path but also on the domain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L4kDMDMk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/69qd185pmzyrnjwcpuv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L4kDMDMk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/69qd185pmzyrnjwcpuv8.png" alt="eks-lb" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Component configuration
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;The code shown here is partial. A complete example can be found &lt;a href="https://github.com/theonlymonica/multiple-app-single-lb-examples"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  EKS cluster
&lt;/h2&gt;

&lt;p&gt;In this article, the EKS cluster is a prerequisite and is assumed to be already in place. If you want, you can read how to install an EKS cluster with Terraform in &lt;a href="https://letsmake.cloud/eks-cluster-autoscaler"&gt;my article on autoscaling&lt;/a&gt;. A complete example can be found in my &lt;a href="https://github.com/theonlymonica/multiple-app-single-lb-examples"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  VPC CNI add-on
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html"&gt;VPC CNI (Container Network Interface) add-on&lt;/a&gt; allows you to automatically assign a VPC IP address directly to a pod within the EKS cluster.&lt;/p&gt;

&lt;p&gt;Since we want pods to &lt;strong&gt;self-register&lt;/strong&gt; on their target group (which is a resource outside of Kubernetes and inside the VPC), the use of this add-on is imperative. Its installation is natively integrated on EKS, &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html"&gt;as explained here&lt;/a&gt;.&lt;/p&gt;
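
&lt;p&gt;With Terraform, the managed add-on can be enabled on an existing cluster. A minimal sketch, assuming the cluster id is exposed by an &lt;code&gt;eks_cluster&lt;/code&gt; module as in the examples below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_eks_addon" "vpc_cni" {
  cluster_name = module.eks_cluster.cluster_id
  addon_name   = "vpc-cni"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;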

&lt;h2&gt;
  
  
  AWS Load Balancer Controller plugin
&lt;/h2&gt;

&lt;p&gt;AWS Load Balancer Controller is a controller that helps manage an Elastic Load Balancer for a Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;It is typically used to provision an Application Load Balancer for an Ingress resource, or a Network Load Balancer for a Service resource.&lt;/p&gt;

&lt;p&gt;In our case provisioning is not required, because our Application Load Balancer is managed &lt;strong&gt;independently&lt;/strong&gt;. However, we will use another resource type installed by the controller's CRDs to make the pods register to their target group.&lt;/p&gt;

&lt;p&gt;This plugin is not included in the EKS installation, so it must be installed following the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html"&gt;instructions from the AWS documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you use Terraform, like me, you can consider using a module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "load_balancer_controller" {
  source  = "DNXLabs/eks-lb-controller/aws"
  version = "0.6.0"

  cluster_identity_oidc_issuer     = module.eks_cluster.cluster_oidc_issuer_url
  cluster_identity_oidc_issuer_arn = module.eks_cluster.oidc_provider_arn
  cluster_name                     = module.eks_cluster.cluster_id

  namespace = "kube-system"
  create_namespace = false
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load Balancer and Security Group
&lt;/h2&gt;

&lt;p&gt;With Terraform, I create an Application Load Balancer in the public subnets of our VPC, together with its Security Group. The VPC is the same one where the EKS cluster is installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb" "alb" {
  name                       = "${local.name}-alb"
  internal                   = false
  load_balancer_type         = "application"
  subnets                    = module.vpc.public_subnets
  enable_deletion_protection = false
  security_groups            = [aws_security_group.alb.id]
}

resource "aws_security_group" "alb" {
  name        = "${local.name}-alb-sg"
  description = "Allow ALB inbound traffic"
  vpc_id      = module.vpc.vpc_id

  tags = {
    "Name" = "${local.name}-alb-sg"
  }

  ingress {
    description = "allowed IPs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "allowed IPs"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;It is important to remember to authorize this Security Group as a source in the Security Group inbound rules of the cluster nodes.&lt;/p&gt;
&lt;/blockquote&gt;
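
&lt;p&gt;With Terraform this can be a dedicated rule. A sketch, assuming the node Security Group id is exposed by the cluster module (the output name may differ in your module):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_security_group_rule" "alb_to_nodes" {
  description              = "Allow traffic from the ALB to the cluster nodes"
  type                     = "ingress"
  from_port                = 0
  to_port                  = 65535
  protocol                 = "tcp"
  security_group_id        = module.eks_cluster.node_security_group_id
  source_security_group_id = aws_security_group.alb.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;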

&lt;p&gt;At this point, I create the target groups to which the pods will bind themselves. In this example I use two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_target_group" "alb_tg1" {
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "${local.name}-tg1"
  }

  health_check {
    path = "/"
  }
}

resource "aws_lb_target_group" "alb_tg2" {
  port        = 9090
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "${local.name}-tg2"
  }

  health_check {
    path = "/"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last configuration on the Application Load Balancer is the &lt;strong&gt;listeners&lt;/strong&gt;' definition, which contains the traffic routing rules.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;default&lt;/strong&gt; rule on each listener, which handles requests that match no other rule, returns a fixed error response; I set it as a &lt;strong&gt;security&lt;/strong&gt; measure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_listener" "alb_listener_http" {
  load_balancer_arn = aws_lb.alb.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Internal Server Error"
      status_code  = "500"
    }
  }
}

resource "aws_lb_listener" "alb_listener_https" {
  load_balancer_arn = aws_lb.alb.arn
  port              = "443"
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate.certificate.arn
  ssl_policy        = "ELBSecurityPolicy-2016-08"

  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Internal Server Error"
      status_code  = "500"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual rules are then associated with the listeners. The listener on port 80 has a simple redirect to the HTTPS listener. The listener on port 443 has rules to route traffic according to the path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_listener_rule" "alb_listener_http_rule_redirect" {
  listener_arn = aws_lb_listener.alb_listener_http.arn
  priority     = 100

  action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }

  condition {
    host_header {
      values = local.all_domains
    }
  }
}

resource "aws_lb_listener_rule" "alb_listener_rule_forwarding_path1" {
  listener_arn = aws_lb_listener.alb_listener_https.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.alb_tg1.arn
  }

  condition {
    host_header {
      values = local.all_domains
    }
  }

  condition {
    path_pattern {
      values = [local.path1]
    }
  }
}

resource "aws_lb_listener_rule" "alb_listener_rule_forwarding_path2" {
  listener_arn = aws_lb_listener.alb_listener_https.arn
  priority     = 101

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.alb_tg2.arn
  }

  condition {
    host_header {
      values = local.all_domains
    }
  }

  condition {
    path_pattern {
      values = [local.path2]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Getting things to work on Kubernetes
&lt;/h1&gt;

&lt;p&gt;Once setup on AWS is complete, using this technique on EKS is super easy! It is sufficient to insert a &lt;strong&gt;TargetGroupBinding&lt;/strong&gt; type resource for each deployment/service we want to expose on the load balancer through the target group.&lt;/p&gt;

&lt;p&gt;Let's see an example. Let's say I have a deployment with a service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app.kubernetes.io/name: nginx
spec:
  selector:
    app.kubernetes.io/name: nginx
  ports:
    - port: 8080
      targetPort: 80
      protocol: TCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only configuration I need to add is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: nginx
spec:
  serviceRef:
    name: nginx
    port: 8080
  targetGroupARN: "arn:aws:elasticloadbalancing:eu-south-1:123456789012:targetgroup/tf-20220726090605997700000002/a6527ae0e19830d2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From now on, each new pod that belongs to the deployment associated with that service will &lt;strong&gt;self-register&lt;/strong&gt; on the indicated target group. To test it, just scale the number of replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment nginx --replicas 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within a few seconds the new pods' IPs will be visible in the target group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_Lee3jOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4js3jzd413ykq8gpqi4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Lee3jOr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4js3jzd413ykq8gpqi4i.png" alt="target-groups" width="880" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
