mikayel ghazaryan

EC2 instance deployment unification across AWS Organizations

When working at scale on AWS, we aim to standardize our approach to repetitive tasks. Simultaneously, we seek to extract generalized solutions for reuse in similar scenarios. One such common task is launching and configuring EC2 instances.
When dealing with EC2 instances, we typically aim to address several key requirements:

  • Boot them from a predictable, hardened image.
  • Ensure specific software is installed and pinned to certain versions.
  • Have various agents installed and configured for instance health monitoring, logging, antivirus, etc.

At times, this needs to be done at scale across multiple accounts for numerous machines. In this scenario, assuming we have an AWS Landing Zone with an account vending machine and multiple accounts, we need a unified process that guarantees all EC2 instances require minimal attention from us.
To address this challenge, let’s break down the steps needed to run all our machines with the same configuration (allowing for small variations):

  • Base: Utilize the same Hardened Image as the source for all instances. This ensures they start from a standardized and secure foundation. By setting up a so-called Image Factory, we can produce such images with all the latest required updates and patches.
  • Boot-up: Use a Terraform module to ensure that the infrastructure around our machines is configured in the same way. This guarantees uniformity in how instances are launched and ensures our machines are reachable and manageable via AWS Systems Manager.
  • Configuration: Use AWS Systems Manager for post-deployment configuration. Some software cannot be preinstalled on all machines and requires configuration or installation after deployment. Systems Manager allows us to automate this process at scale.

This unified approach ensures consistency across EC2 instances, enhancing both manageability and scalability. 
In this article we will not dive deep into image hardening or implementing the Terraform module; instead, we will focus on setting up AWS Systems Manager across an AWS Organization.

Software installation

Before jumping into each step in detail, let's talk about how we define which software should be installed at which stage:

  • Software can be preinstalled on the hardened image.
  • Software can be installed from the Terraform module while deploying the machine.
  • Software can be installed by centralized Systems Manager after the deployment.

In general, we aim to have required software preinstalled on the hardened image as much as possible, as it allows us to save time during the machine boot-up process.

Using a Terraform module is another effective way to install the needed software. In a Terraform module, this can be accomplished in two ways: using user data or utilizing Systems Manager. Both methods have their own drawbacks. User data does not allow us to apply changes to already deployed instances, as it is executed only during the machine's initial boot-up. More generally, tying software installation and configuration to deployment means that every time we need to reconfigure an already deployed machine, we have to re-apply our Terraform configuration, which opens up the risk that Terraform decides the instance should be re-created. That is highly undesirable.

A centralized Systems Manager deployment, on the other hand, is a perfect remedy for these flaws. Imagine having one centralized place to roll out all changes to all machines from a single control plane. All our scripts must be written in an idempotent manner (a requirement that applies to all methods), and we can use protected tags to specify which software should be installed and configured on each machine. This approach essentially separates deployment from configuration: reconfiguration does not require any redeployment, as is the case with Terraform module updates.

With this centralized solution, we can not only configure individual machines but also use SSM Automation to re-run configuration scripts to correct configuration drift on running machines, or deploy additional configuration changes, targeting not one but many machines across many accounts with different targeting and deployment strategies.

Hardened golden images

Producing hardened golden images is quite a cumbersome process, and I would recommend using ready-to-go services that provide CIS-hardened images. For more information, please read the article about comparing golden image solutions.

Terraform Module

To standardize EC2 instance creation, use a Terraform module. The Terraform module guarantees that the EC2 machine boots with additional supporting infrastructure, for example:

  • Default Security Group
  • Instance is configured to be managed by SSM Fleet Manager.
  • IAM instance profile — with default set of policies attached.
  • Additional EBS Volumes.
  • Break-glass access SSH Keys.
  • etc.
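
As a minimal sketch, invoking such a module could look like the following (written in Terraform's JSON configuration syntax to stay consistent with the other examples in this article; the module source and variable names are hypothetical):

{
    "module": {
        "web_server": {
            "//": "Hypothetical wrapper module that applies the organization's EC2 standards",
            "source": "git::https://example.com/terraform-modules/ec2-standard.git",
            "name": "web-server-01",
            "ami_id": "ami-0123456789abcdef0",
            "instance_type": "t3.micro",
            "subnet_id": "subnet-0123456789abcdef0",
            "additional_ebs_volumes": [
                {
                    "device_name": "/dev/xvdf",
                    "size": 50
                }
            ],
            "enable_breakglass_ssh": false
        }
    }
}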

Centralized Systems Manager

Allows us to install additional agents, configure monitoring and logging, and handle any additional setup and configuration required.
As mentioned before, we can organize some customization using instance tags to let Systems Manager know which software is optional or mandatory to install on a given machine.
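
For example, a hypothetical tagging convention (the tag keys and values here are illustrative, not a fixed standard) could look like this:

{
    "Tags": [
        {"Key": "ssm:managed", "Value": "true"},
        {"Key": "ssm:install:cloudwatch-agent", "Value": "mandatory"},
        {"Key": "ssm:install:antivirus", "Value": "optional"}
    ]
}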

Diving into the Solution

We assume you have an SSM Delegated Management account within your AWS organization, along with Workload accounts where instances will be booted.

Our solution aims to minimize the resources deployed in workload accounts while centralizing EC2 configuration automation in the SSM Delegated Management account. To achieve this, we will use the following AWS services in that central account:

  • S3 Bucket: Stores Ansible playbooks and shell scripts, shared across the entire AWS Organization, or Organizational Unit(s) where EC2 machines are planned to be booted.
  • AWS EventBridge Bus: A dedicated EventBus created to receive signals from all accounts where a managed EC2 machine is booted.
  • Step Function: Orchestrates overall solution.
  • Lambda Function: In general, it can be used in many scenarios, but in our case it's used to organize secure, one-time access to the secrets needed by EC2 machines.
  • SSM Automation: SSM Documents (1 Automation and multiple Commands) are deployed and shared across the entire AWS Organization, or Organizational Unit(s) where EC2 machines are planned to be booted.

Solution Diagram

In the Workload account, we will need to deploy the following resources:

  • Custom EventBridge Bus Rule (and Role): Forwards EC2 state events to the central account.
  • SSM Execution Role: Allows cross-account SSM Automation execution.
  • Step Function Role: Grants the central Step Function cross-account access to the resources in the Workload account.

S3 Bucket

We will use an S3 bucket to store all our scripts. These resources will be available across the AWS Organization or the respective OUs. The scripts must be idempotent: Ansible playbooks inherently meet this requirement, but extra care is needed when writing Shell/PowerShell scripts. The scripts are downloaded and executed on the target machines, so Ansible must either be part of the original image or be preinstalled during the boot script call.
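
To make the bucket readable across the organization, a bucket policy along these lines could be attached (a sketch; the bucket name mirrors the later examples, and the organization ID is a placeholder):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowOrganizationRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::ssm-resources-<central account id>-eu-central-1/*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-xxxxxxxxxx"
                }
            }
        }
    ]
}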

EventBus

First, we need to be informed in our central account when an EC2 machine is booted. We create a dedicated EventBus in our central account to keep our EC2 event stream separate from other events. We then deploy an EventBus rule to each Workload Account to forward EC2 events to the central bus. We can use the following EventPattern:

EventPattern:
  source:
    - aws.ec2
  detail-type:
    - EC2 Instance State-change Notification
  detail:
    state:
      - running

This pattern ensures that only events indicating an EC2 machine has been booted are forwarded.
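
For the forwarding to work, the central EventBus also needs a resource policy that lets the workload accounts put events onto it. A sketch, assuming a bus named ec2-lifecycle-bus and using placeholders for the account and organization IDs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowOrganizationPutEvents",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "events:PutEvents",
            "Resource": "arn:aws:events:eu-central-1:<central account id>:event-bus/ec2-lifecycle-bus",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-xxxxxxxxxx"
                }
            }
        }
    ]
}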

Step Function

The dedicated EventBus in the central account triggers the Step Function. The Step Function prepares data, orchestrates the workflow, and triggers the SSM Automation.

Step Function example

The example Step Function has the following steps:

  • Verifies if the EC2 instance is still running.
  • Checks if the EC2 instance has the required tag.
  • Retrieves the EC2 instance's Instance Profile and its IAM Role.
  • Triggers a Lambda function to copy the secret to a temporary location (with the instance ID in the path) in the Central account and generate an inline policy.
  • Makes an AWS API call to attach the temporary secret access policy to the Instance Profile in the Workload account.
  • Triggers SSM Automation and waits for it to finish.
  • Deletes the temporary secret access policy from the Instance Profile (Workload account).
  • Deletes the temporary secret from the Secret Manager (Central account).

Of course, we need to keep error handling in mind to make sure that the policy is removed from the Instance Profile and the temporary secret is deleted in all possible failure scenarios. Alternatively, the Step Function can have fewer steps; for example, the part that organizes temporary access to the secret can be removed completely if it is considered unnecessary.
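
To illustrate one of the steps above, the tag check could be implemented as an ASL Choice state along these lines (a sketch with hypothetical state names, assuming an earlier state has written the tag value to $.ManagedByTag):

"CheckRequiredTag": {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.ManagedByTag",
            "StringEquals": "image-factory",
            "Next": "GetInstanceProfile"
        }
    ],
    "Default": "SkipUnmanagedInstance"
}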

SSM Automation

The most important step is running the SSM Automation. Key components are:

  • SSM documents
  • Ansible Playbooks & Shell/PowerShell Scripts: Stored in the S3 bucket in the Central account.
  • Cross-Account Execution

SSM Documents

All SSM Documents are centrally deployed in the central account and shared across the entire AWS Organization or specific Organizational Units (OUs).

These documents are categorized into two main types: Automation Documents and Command Documents.

We use only one Automation Document to control the execution of the SSM Automation. It contains flow instructions and tag checks, and calls the Command Documents in the correct order.

Example SSM Document of Automation type:
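A minimal sketch of what such a document could look like (the step names, parameter names, and the referenced Command Document name run_ansible_playbook_command are assumptions):

{
    "description": "Orchestrate post-boot configuration of a managed EC2 instance.",
    "schemaVersion": "0.3",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "InstanceId": {
            "type": "String"
        },
        "AutomationAssumeRole": {
            "type": "String"
        }
    },
    "mainSteps": [
        {
            "name": "get_instance_tags",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "ec2",
                "Api": "DescribeTags",
                "Filters": [
                    {
                        "Name": "resource-id",
                        "Values": ["{{ InstanceId }}"]
                    }
                ]
            },
            "outputs": [
                {
                    "Name": "Tags",
                    "Selector": "$.Tags",
                    "Type": "MapList"
                }
            ]
        },
        {
            "name": "run_ansible_playbook",
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "run_ansible_playbook_command",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {
                    "Condition": ["true"]
                }
            }
        }
    ]
}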

Command Documents simply download shell scripts and/or Ansible playbooks from the S3 bucket in the central account and run them on the target machine in the Workload accounts.

Example SSM Document of Command type:

{
    "description": "Run remote Ansible playbook.",
    "schemaVersion": "2.2",
    "parameters": {
        "Condition": {
            "type": "String",
            "description": "Pass 'true' to run the step or 'false' to skip it",
            "default": "true"
        }
    },
    "mainSteps": [
        {
            "name": "run_playbook",
            "action": "aws:runDocument",
            "precondition": {
                "StringEquals": [
                    "{{ Condition }}",
                    "true"
                ]
            },
            "inputs": {
                "documentType": "SSMDocument",
                "documentPath": "AWS-ApplyAnsiblePlaybooks",
                "documentParameters": {
                    "ExtraVariables": "Version=977a5f0c55a79cd6aa1e614e413f75aa",
                    "InstallDependencies": "False",
                    "PlaybookFile": "touch-file.yaml",
                    "SourceInfo": "{\"path\":\"https://ssm-resources-<central account id>-eu-central-1.s3.eu-central-1.amazonaws.com/playbooks/touch-file.yaml\"}",
                    "SourceType": "S3",
                    "Verbose": "-v"
                }
            }
        }
    ]
}

SSM Automation Cross-Account Execution

We follow the solution described in the AWS documentation: Running automations in multiple AWS Regions and accounts. We need to deploy two types of roles:

  • SSM-AutomationAdministrationRole in the central account.
  • SSM-AutomationExecutionRole in each workload account.

These roles should be organized so that the SSM-AutomationExecutionRole allows the SSM-AutomationAdministrationRole to assume it across AWS accounts and has sufficient permissions to perform the required actions (e.g., reading objects from the central S3 bucket, describing the EC2 instance, attaching to the EC2 machine with an SSM Session to run Ansible playbooks and shell scripts).
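
Following that documentation's pattern, the trust policy of SSM-AutomationExecutionRole could look like this (the central account ID is a placeholder):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<central account id>:role/SSM-AutomationAdministrationRole",
                "Service": "ssm.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}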
We execute the SSM Automation from the Step Function, which is configured to use these roles. It can look like this:

"StartAutomationExecution": {
    "Catch": [
        {
            "ErrorEquals": [
                "States.All"
            ],
            "Next": "DeleteRolePolicy",
            "ResultPath": "$.Input"
        }
    ],
    "InputPath": "$",
    "Next": "GetAutomationExecution",
    "Parameters": {
        "DocumentName": "linux_main_document", # SSM Automation Document name, that we created
        "DocumentVersion": "$LATEST",
        "Parameters": {
            "AutomationAssumeRole": [
                "arn:aws:iam::xxxxxxxx:role/SSM-AutomationAdministrationRole"
            ],
            "InstanceId.$": "States.Array($.InstanceId)"
        },
        "TargetLocations": [
            {
                "Accounts.$": "States.Array($.Account)",
                "ExecutionRoleName": "SSM-AutomationExecutionRole",
                "Regions.$": "States.Array($.Region)",
                "TargetLocationMaxConcurrency": "5",
                "TargetLocationMaxErrors": "1"
            }
        ]
    }
}

In this example, we pass the Account and Region where the SSM Automation should run and target a specific machine by providing its instance ID (information obtained from the EventBus event). We can organize it differently depending on our goals, such as running the SSM Automation on all machines across all target accounts or Organizational Units, or selecting instances by a specific tag.
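
For instance, to select instances by tag rather than by a single instance ID, the state's Parameters block could be adjusted along these lines (a sketch; the tag key and value are assumptions):

"Parameters": {
    "DocumentName": "linux_main_document",
    "TargetParameterName": "InstanceId",
    "Targets": [
        {
            "Key": "tag:ManagedBy",
            "Values": ["image-factory"]
        }
    ],
    "Parameters": {
        "AutomationAssumeRole": [
            "arn:aws:iam::xxxxxxxx:role/SSM-AutomationAdministrationRole"
        ]
    },
    "TargetLocations": [
        {
            "Accounts.$": "States.Array($.Account)",
            "ExecutionRoleName": "SSM-AutomationExecutionRole",
            "Regions.$": "States.Array($.Region)",
            "TargetLocationMaxConcurrency": "5",
            "TargetLocationMaxErrors": "1"
        }
    ]
}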

Summary

The article outlines a unified approach for deploying EC2 instances across AWS Organizations, emphasizing the need for standardization and automation to improve manageability and scalability. It highlights the importance of using a consistent, hardened image as the foundation for all instances and employing Terraform modules to ensure uniform infrastructure configuration.

Post-provisioning configuration tasks are automated through AWS Systems Manager, which facilitates centralized management of software installation and configuration. The solution integrates various AWS services, including EventBridge, Step Functions, and Lambda, to streamline the deployment process and reduce manual intervention. This comprehensive strategy not only ensures consistent deployment of EC2 instances across multiple accounts but also minimizes operational overhead while allowing for necessary customizations, ultimately enhancing both efficiency and security.

Potential Challenges

  • The complexity of the solution might require a learning curve for teams not familiar with all the AWS services involved.
  • Ensuring all scripts and playbooks remain idempotent could require ongoing maintenance and testing.
  • Cross-account permissions and roles need to be carefully managed to maintain security while allowing necessary access.

Future Enhancements

  • The Step Function could be expanded to include more error handling and recovery steps. The secret management process could be modified or removed based on specific security requirements.
  • The targeting strategy for SSM Automation could be adjusted to run on groups of instances, entire OUs, or based on specific tags.

Overall, this solution provides a robust framework for standardizing EC2 deployments across complex AWS environments, balancing security, manageability, and flexibility at scale.
