DEV Community

Nikolai Main

Optimizing AWS Infrastructure Deployment: Terraform, Sentinel, and CI/CD Best Practices

This project follows on from a previous post, where I built the AWS infrastructure solely in the AWS console. This post covers the following topics:

  • Centralized Terraform state management
  • Terraform code validation with Sentinel
  • CI/CD Pipeline deployment
  • AWS Infrastructure

Less focus is placed on actual application design, which may be covered in a later post.
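For context, "centralized state management" here means the Terraform state lives in a Terraform Cloud workspace rather than on a local disk, so every run (and every Sentinel check) sees the same state. A minimal sketch of that configuration looks something like this — the organization and workspace names are placeholders, not the project's actual values:

```hcl
terraform {
  cloud {
    # Placeholder names - substitute your own organization and workspace.
    organization = "my-org"

    workspaces {
      name = "backend-infra"
    }
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```

With this block in place, `terraform plan` and `apply` run remotely in the cloud workspace, which is also where the Sentinel policies below are evaluated.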

Project Overview

In my initial project, I spent about an hour building the infrastructure and quickly realized how easy it is to make even minor mistakes that can lead to system failures. This often resulted in spending an additional 10 minutes here and there, sifting through each component to identify the error.

Recognizing this challenge in a relatively small project made me acutely aware of the potential headaches that could arise when managing larger systems.

To address this issue, I turned to Terraform. I dedicated a similar amount of time — approximately 1-2 hours — to define my infrastructure. However, the benefits were substantial: instead of spending 1-2 hours each time I needed to deploy, I can now get my entire infrastructure up and running in about 10 minutes, with a comparable teardown time.

This improvement effectively reduced my deployment time by approximately 50 minutes. Additionally, I can confidently assert that my application and infrastructure are secure, thanks to the comprehensive scans conducted prior to deployment:

  • Infrastructure Validation: My infrastructure is validated and checked with Sentinel in my cloud workspace. If any misconfigurations—such as poor naming and tagging conventions, overly permissive IAM policies, or insecure VPC designs—are present in my Infrastructure as Code (IaC), the run will fail, and I will be notified of the necessary changes.
  • Application Security Scans: For my application, I utilize GitLab's built-in suite of security tools to scan for code and dependency vulnerabilities, as well as exposed secrets. If GitLab isn’t an option, there are several other security scanning tools available, such as CodeQL, SonarQube, and Trivy. Once the application image is built, it undergoes an additional scan with Trivy to ensure its security.

Infrastructure Overview

[Infrastructure diagram]

Frontend Infrastructure (Repo 1)

  • ECR (Elastic Container Registry)
  • ECS (Elastic Container Service)
  • Application Load Balancer

Backend Infrastructure (Repo 2)

  • VPC (Virtual Private Cloud)
  • RDS (Relational Database Service)
  • API Gateway
  • AWS Lambda
  • Secrets Manager
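As a sketch of how two of these backend pieces fit together — resource names and values below are illustrative, not the project's actual code — RDS credentials can be generated and stored in Secrets Manager rather than hard-coded:

```hcl
# Illustrative only: generate a password, keep it in Secrets Manager,
# and feed it to the RDS instance instead of hard-coding credentials.
resource "random_password" "db" {
  length  = 16
  special = false
}

resource "aws_secretsmanager_secret" "db" {
  name = "rds-credentials" # placeholder name
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id = aws_secretsmanager_secret.db.id
  secret_string = jsonencode({
    username = "dbadmin"
    password = random_password.db.result
  })
}

resource "aws_db_instance" "main" {
  identifier        = "app-db" # placeholder
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "dbadmin"
  password          = random_password.db.result

  storage_encrypted   = true  # satisfies the Sentinel encryption check
  publicly_accessible = false # satisfies the public-accessibility check
  skip_final_snapshot = true
}
```

This pattern is also what the Sentinel RDS and Secrets Manager checks described below are looking for.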

Security Checks and Scans

Pipeline Scans

  1. Secret Detection
  2. SAST (Static Application Security Testing) Scanning
  3. Dependency Scanning
  4. SCA (Software Composition Analysis) Scanning

Sentinel Scans

  1. Appropriate IAM Permissions
  2. General Configuration Checks
  3. VPC Traffic Flows

Deployment Workflow Overview

[Deployment workflow diagram]

Workflow 1 (Backend Configuration)

  1. Backend code is pushed to GitLab.
  2. Terraform run triggered in cloud workspace.
  3. Sentinel policies check the code for misconfigurations:
     • VPC: Naming conventions and private subnet configuration
     • Security Groups: Only allowing traffic over necessary ports
     • Lambda: IAM permissions and general configuration
     • RDS: Checks for encryption, public accessibility, and default credentials
     • Secrets Manager: Checks for secret rotation and read replicas
  4. Once validated, the infrastructure can be applied. Note the relevant outputs:
     • The RDS endpoint and secret name are needed for Lambda to work in this project.
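The outputs mentioned in the last step might be declared along these lines — the output and resource names are assumptions for illustration, not the project's actual identifiers:

```hcl
# Hypothetical output names - the frontend pipeline later reads values
# like these from the workspace via the Terraform Cloud API.
output "rds_endpoint" {
  value = aws_db_instance.main.endpoint
}

output "db_secret_name" {
  value = aws_secretsmanager_secret.db.name
}
```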

Example Sentinel Policy - VPC Checks



import "tfplan/v2" as tfplan
import "tfrun" as run
import "strings"

// Define variables

messages = []
resource = "VPC"

// Define main function
checks = func() {
  if run.is_destroy == true {
    return true
  }

  // Retrieve resource info
  vpcs = filter tfplan.resource_changes as _, rc {
    rc.mode is "managed" and
    rc.type is "aws_vpc"
  }
  subnets = filter tfplan.resource_changes as _, rc {
    rc.mode is "managed" and
    rc.type is "aws_subnet"
  }

  // Checking if the resources exist.
  if length(vpcs) == 0 {
    append(messages, "No vpc found.")
  }
  if length(subnets) == 0 {
    append(messages, "No subnets found.")
  }

  // Iterate over subnets
  for subnets as address, subnet {
    // Check number of available addresses
    if int(strings.split(subnet.change.after.cidr_block, "/")[1]) < 24 {
      append(messages, subnet.address + " CIDR range too large. Prefix must be at least /24.")
    }
    if strings.has_prefix(subnet.address, "aws_subnet.private") {

      // Check subnet CIDR block
      if subnet.change.after.cidr_block == "0.0.0.0/0" {
        append(messages, "Subnet not private. Edit CIDR block")
      }

      // Check if subnet has a public IP enabled.
      if subnet.change.after.map_public_ip_on_launch == true {
        append(messages, "Subnet not private. Public IP enabled")
      }
    }
  }

  // Run VPC checks
  for vpcs as address, vpc {

    // Retrieve the VPC's tags (empty map if none are set)
    vpc_tags = vpc.change.after.tags else {}
    vpc_name = vpc_tags["Name"] else ""

    // Check VPC name/tags
    if vpc_name == "" or vpc_name == "main-vpc" {
      append(messages, "VPC must follow proper naming conventions. Current name: " + vpc_name)
    }
  }

  // Checking if any error messages have been produced.
  // If messages is empty, the policy returns true and passes.
  if length(messages) != 0 {
    print(resource + " misconfigurations:")
    counter = 1
    for messages as message {
      print(string(counter) + ". " + message)
      counter += 1
    }
    return false
  }
  return true
}

// Main rule
main = rule {
  checks()
}


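The RDS checks mentioned in the workflow above follow the same pattern. A hedged sketch — the attribute names come from the AWS provider's plan output, but the disallowed usernames are illustrative:

```sentinel
import "tfplan/v2" as tfplan

// Illustrative sketch of the RDS checks: encryption enabled,
// not publicly accessible, and no default-looking credentials.
rds = filter tfplan.resource_changes as _, rc {
  rc.mode is "managed" and
  rc.type is "aws_db_instance"
}

main = rule {
  all rds as _, db {
    db.change.after.storage_encrypted is true and
    db.change.after.publicly_accessible is false and
    not (db.change.after.username in ["admin", "root", "postgres"])
  }
}
```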

Workflow 2 (Frontend Configuration)

  1. Application code is developed on a local machine and pushed to GitLab.
  2. The pipeline is triggered (more details below):
     • Scan the application code
     • Build the image, scan it, and push it to ECR
     • Retrieve the relevant outputs from the backend infrastructure
     • Create a TF_vars file and push it back to GitLab
  3. The second Terraform workspace is triggered by the push to the repo with a tag.
  4. A similar plan > Sentinel scan > apply process takes place.

GitLab Pipeline

Stage 1: Test - SAST, Dependency and Secret Detection Scans


image: docker:latest
services:
  - docker:dind
variables:
  DOCKER_HOST: tcp://docker:2375/
  DOCKER_DRIVER: overlay2
  REPO_NAME: gitlab-cicd

# Declaring the required GitLab scans.
include:
  - template: Jobs/Dependency-Scanning.gitlab-ci.yml
  - template: Jobs/SAST.gitlab-ci.yml
  - template: Jobs/Secret-Detection.gitlab-ci.yml

# All included templates run during the 'test' stage.
stages:
  - test
  - build-image
  - fetch-terraform-outputs
  - update-terraform

Stage 2: Build, Scan, Push



build:
  stage: build-image
  before_script:
    - apk add --no-cache aws-cli
    - apk add --no-cache curl
  script:
    # Building the Docker image
    - echo "Building Docker image..."
    - docker build -t $REPO_NAME:latest .

    # Scanning the Docker image with Trivy
    - echo "Running Trivy scan on Docker image"
    - curl -sSL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -
    - export PATH=$PATH:$(pwd)/bin
    - trivy image --exit-code 0 --severity HIGH,CRITICAL $REPO_NAME:latest || true
    - trivy image --format json --output trivy-results.json $REPO_NAME:latest

    # Retrieving ECR repo credentials
    - echo "Logging in to Amazon ECR..."
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com

    # Pushing the Docker image to ECR
    - echo "Pushing Docker image to ECR..."
    - TIMESTAMP=$(date +%Y%m%d%H%M%S)
    - IMAGE_TAG="$REPO_NAME:$TIMESTAMP"
    - docker tag $REPO_NAME:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
    - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
    - echo "TF_VAR_image_uri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG" >> build.env
  artifacts:
    paths:
      - build.env


Stage 3: Fetch TF outputs



fetch-terraform-outputs:
  stage: fetch-terraform-outputs
  image: alpine:latest
  script:
    - apk add --no-cache curl jq
    - echo "Creating variables for specific outputs..."

    # Retrieving outputs via the Terraform Cloud API and saving them as
    # environment variables to be passed to the next stage.
    - |
      curl -s -X GET \
        "https://app.terraform.io/api/v2/workspaces/${HCP_WORKSPACE_ID}/current-state-version-outputs" \
        -H "Authorization: Bearer ${HCP_TOKEN}" \
        -H 'Content-Type: application/vnd.api+json' | \
      jq -r '.data[]
        | select(.attributes.name | test("public_subnet_ids|alb-sg-id|container-sg-id|vpc_id"))
        | if .attributes.name == "public_subnet_ids" then
            "PUBLIC_SUBNET_IDS=\(.attributes.value)"
          elif .attributes.name == "alb-sg-id" then
            "ALB_SG_ID=\(.attributes.value)"
          elif .attributes.name == "container-sg-id" then
            "CONTAINER_SG_ID=\(.attributes.value)"
          elif .attributes.name == "vpc_id" then
            "VPC_ID=\(.attributes.value)"
          else
            empty
          end' > terraform_outputs.env
  artifacts:
    reports:
      dotenv: terraform_outputs.env


Stage 4: Update main.tf



update-terraform:
  stage: update-terraform
  image: alpine:latest
  dependencies:
    - build
    - fetch-terraform-outputs
  before_script:
    - apk add --no-cache git
    - git config --global user.email "${USER_EMAIL}"
    - git config --global user.name "${USER_NAME}"
  script:
    - echo "Contents of current directory:"
    - ls -la
    - echo "Contents of build.env:"
    - cat build.env || echo "build.env not found"
    - echo "Contents of terraform_outputs.env:"
    - cat terraform_outputs.env || echo "terraform_outputs.env not found"
    - export $(cat build.env | xargs)
    - export $(cat terraform_outputs.env | xargs)
    - echo "Cloning repository..."
    - git clone https://<username>:${GITLAB_PAT}@gitlab.com/<project_id>/<repo.git> || exit 1
    - cd Test

    # Create the TF_vars file
    - echo "Creating/Updating TF_vars file..."
    - |
      cat << EOF > terraform.tfvars
      image_uri = "${TF_VAR_image_uri}"
      public_subnet_ids = ${PUBLIC_SUBNET_IDS}
      alb_sg_id = "${ALB_SG_ID}"
      container_sg_id = "${CONTAINER_SG_ID}"
      vpc_id = "${VPC_ID}"
      EOF

    # Commit and push TF_vars to the repo
    - git add terraform.tfvars
    - git commit -m "Update image URI and Terraform outputs in TF_vars [ci skip]" || echo "No changes to commit"
    - TAG_NAME="$(date +%Y.%m.%d-%H%M%S)"
    - echo "Creating a new tag $TAG_NAME"

    # Creating a tag so Terraform Cloud is triggered only by pushes from this pipeline.
    # '[ci skip]' tells GitLab not to run the pipeline again on this push.
    - git tag -a $TAG_NAME -m "Release version $TAG_NAME [ci skip]"
    - git push origin HEAD:main --tags || exit 1





Final Notes

In conclusion, I now have an end-to-end deployment solution that ensures my application is both secure and robust. This streamlined process has significantly reduced my mean time to deployment, allowing me to reallocate time and resources to other areas.

By identifying potential issues much earlier in the deployment process, I can mitigate risks that previously led to delays and unnecessary costs. This proactive approach not only makes the development cycle more efficient but also improves the quality of each release.

Looking ahead, I plan to further enhance this pipeline by deploying the application to a testing environment before production. This additional step will allow me to run Dynamic Application Security Testing (DAST) scans, ensuring that any vulnerabilities are addressed prior to going live and helping maintain a high standard of security and quality across deployments.

