Aisalkyn Aidarova

Posted on May 30

Project #4: Complete DevOps Engineering Project

End-to-End Application Deployment Using GitHub, Docker, Terraform, AWS, ECS, CI/CD, Monitoring, and Security

When you used GitHub Pages (github.io), you did not deploy containers, Kubernetes, or ECS.

The workflow was much simpler:

Student
   |
   V
GitHub Repository
   |
   V
GitHub Pages
   |
   V
Public Website

What happened:

Student created website files

index.html
style.css
app.js

Student pushed code to GitHub.

git add .
git commit -m "website"
git push

GitHub Pages detected the files.
GitHub Pages copied the static files to GitHub's web hosting infrastructure.
GitHub served those files directly to visitors.

When someone visits:

https://username.github.io

GitHub simply returns:

index.html
style.css
app.js

to the browser.

Is Docker involved?

Possibly internally inside GitHub's infrastructure, but students do not use Docker.

Students never run:

docker build
docker run

Is Kubernetes involved?

Possibly internally inside GitHub's infrastructure, but students do not use Kubernetes.

Students never run:

kubectl apply
kubectl get pods

What is GitHub Pages good for?

Static websites:

HTML
CSS
JavaScript
Images

Examples:

Portfolio
School website
Documentation
Landing page

What GitHub Pages cannot do

It cannot run:

Node.js
Python
Java
.NET
Databases
APIs

For example:

Frontend
Backend API
PostgreSQL

cannot run on GitHub Pages.

If you want to learn real DevOps

The next step after GitHub Pages is:

GitHub
   |
GitHub Actions/Jenkins
   |
Docker Build
   |
ECR
   |
ECS Fargate
   |
ALB
   |
Route53

Then you will see:

Where containers come from
Why Docker is needed
Why ECS/Kubernetes is needed
How load balancers work
How production deployments happen

The actual lab starts here:

DevOps Lab 1: Containerize Your GitHub Website with Docker

Goal

A few days ago, each of you created a website using Jules, merged the code into GitHub, and viewed it using GitHub Pages.

Today, we will take the same website source code and put it inside a Docker container.

This is the first step from a simple website to real production deployment.

Part 1: What We Are Building

Current workflow

Developer writes code
        |
        V
GitHub Repository
        |
        V
GitHub Pages
        |
        V
Website in browser

GitHub Pages is good for simple static websites.

But in real companies, applications are usually deployed like this:

Developer writes code
        |
        V
GitHub Repository
        |
        V
Docker Image
        |
        V
Container
        |
        V
Cloud Platform
        |
        V
Production Website

Today we are doing this part:

GitHub Repository
        |
        V
Docker Image
        |
        V
Docker Container
        |
        V
Website in browser

Part 2: Why DevOps Engineers Use Docker

A DevOps engineer does not only write or test code. A DevOps engineer helps move application code from a developer's laptop to production safely and repeatably.

Docker helps us package the application.

Without Docker, we may have this problem:

It works on my laptop, but it does not work on the server.

With Docker, we package:

Application code
Web server
Runtime environment
Configuration

into one Docker image.

Then the same image can run on:

Student laptop
EC2
ECS
Kubernetes
Production server

This is why enterprise companies use Docker.

Part 3: What Goes Inside Docker?

For this lab, most students have a simple static website.

Example repository:

my-website/
├── index.html
├── style.css
├── script.js
├── images/
└── README.md

Inside Docker, we need the files required to run the website:

index.html
style.css
script.js
images/

We do not need to copy unnecessary files like:

.git/
README.md
notes.txt

But for beginner practice, copying the full project is acceptable.

Part 4: Important Note About Jules Agent

Some students created the website with Jules.

Jules helped generate the code, but Jules does not need to go inside the Docker container.

Docker only needs the final website files.

If the repository has files like:

index.html
style.css
script.js

then we containerize those files.

If the repository has files like:

package.json
src/
app/
next.config.js

then it may be a React or Next.js application, and the Dockerfile will be different.

For today, we will start with the simple static website version.

Part 5: Prerequisites

Each student must have:

Git installed
Docker Desktop installed and running
GitHub repository with their website
VS Code installed
Terminal access

Check Docker:

docker --version

Check Git:

git --version

Part 6: Clone Your GitHub Repository

Go to GitHub.

Open your website repository.

Click:

Code → HTTPS → Copy URL

Example:

https://github.com/username/my-website.git

Now open terminal.

Run:

cd Desktop

Clone your repository:

git clone https://github.com/username/my-website.git

Go inside the project folder:

cd my-website

Check files:

ls

You should see something like:

index.html
style.css
script.js

Part 7: Open Project in VS Code

Run:

code .

If code . does not work, open VS Code manually and open the project folder.

Part 8: Create Dockerfile

Inside the root of your project, create a new file named:

Dockerfile

Important:

The file name must be exactly:

Dockerfile

Not:

dockerfile
Dockerfile.txt
docker-file

Add this content:

FROM nginx:alpine

COPY . /usr/share/nginx/html

EXPOSE 80

Explanation:

FROM nginx:alpine

This means we are using Nginx as the web server.

Nginx will serve our website files to the browser.

COPY . /usr/share/nginx/html

This copies our website files from the current folder into the Nginx web folder inside the container.

EXPOSE 80

This documents that the container listens on port 80.

Port 80 is the default HTTP web port.

Docker’s COPY instruction is used to copy files into an image, and this is the key step that places the website source files into the container image.

Part 9: Create .dockerignore

Create another file:

.dockerignore

Add this:

.git
.github
README.md
node_modules
.DS_Store
Dockerfile
.dockerignore

Why do we need .dockerignore?

It prevents unnecessary files from going into the Docker image.

In enterprise companies, this is important because Docker images should be:

Small
Clean
Secure
Fast to build
Easy to scan

Do not put secrets inside Docker images.

Never put these inside Docker:

AWS keys
Passwords
.env files
Private tokens
SSH keys
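Before building, it can help to check the build context for files that look like secrets. The sketch below is one possible check. The function name find_risky_files and the list of file name patterns are assumptions chosen for illustration, not a complete secret scanner.

```shell
#!/bin/sh
# find_risky_files DIR
# Prints files under DIR whose names suggest they may hold secrets.
# The name patterns are illustrative assumptions, not an exhaustive list.
find_risky_files() {
  find "$1" -type f ! -path '*/.git/*' \( \
    -name '.env' -o -name '.env.*' -o -name '*.pem' \
    -o -name 'id_rsa' -o -name 'id_ed25519' \
  \) | sort
}
```

Running find_risky_files . in the project folder before docker build lists anything that should be removed or added to .dockerignore first.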

Part 10: Build Docker Image

In terminal, make sure you are inside your project folder.

Run:

pwd

Then build the Docker image:

docker build -t my-website:v1 .

Explanation:

docker build

Builds a Docker image.

-t my-website:v1

Gives the image a name and tag.

.

Means Docker should use the current folder as the build context.

Check image:

docker images

You should see:

my-website    v1

Part 11: Run Docker Container

Run:

docker run -d -p 8080:80 --name my-website-container my-website:v1

Explanation:

-d

Run container in the background.

-p 8080:80

Maps your laptop port 8080 to container port 80.

--name my-website-container

Gives the container a readable name.

my-website:v1

The image we created.

Part 12: Open Website in Browser

Open:

http://localhost:8080

You should see your website.

This means:

Your GitHub source code
        |
        V
Docker image
        |
        V
Docker container
        |
        V
Browser
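The same check can be done from the terminal instead of the browser. This is a minimal sketch that assumes curl is installed; the function names http_status and smoke_test are made up for this example.

```shell
#!/bin/sh
# http_status URL
# Prints only the HTTP status code returned for URL (for example 200).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# smoke_test URL
# Succeeds when the site answers with HTTP 200, fails otherwise.
smoke_test() {
  code=$(http_status "$1")
  if [ "$code" = "200" ]; then
    echo "OK: $1 returned $code"
  else
    echo "FAIL: $1 returned $code" >&2
    return 1
  fi
}
```

With the container running, smoke_test http://localhost:8080 should report OK. The same function can later be pointed at a cloud URL.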

Part 13: Check Running Containers

Run:

docker ps

You should see your running container.

Example:

CONTAINER ID   IMAGE           PORTS
abc123         my-website:v1   0.0.0.0:8080->80/tcp

Part 14: Stop Container

Run:

docker stop my-website-container

Check again:

docker ps

The container should not appear.

Part 15: Start Container Again

Run:

docker start my-website-container

Open again:

http://localhost:8080

Part 16: Remove Container

Stop it first:

docker stop my-website-container

Remove it:

docker rm my-website-container

Part 17: Remove Image

If you want to remove the image:

docker rmi my-website:v1

Only do this after you finish the lab.

Part 18: Common Errors and Fixes

Error 1: Docker is not running

Error:

Cannot connect to the Docker daemon

Fix:

Open Docker Desktop and wait until it says Docker is running.

Error 2: Port already in use

Error:

port is already allocated

Fix:

Use another port:

docker run -d -p 8081:80 --name my-website-container my-website:v1

Open:

http://localhost:8081

Error 3: Container name already exists

Error:

container name is already in use

Fix:

Remove old container:

docker rm my-website-container

If it is running:

docker stop my-website-container
docker rm my-website-container

Error 4: Website does not show correctly

Check if index.html is in the root folder.

Correct:

my-website/
├── index.html
├── style.css
└── script.js

Possible problem:

my-website/
└── website/
    ├── index.html
    ├── style.css
    └── script.js

If your files are inside a subfolder called website, change Dockerfile:

FROM nginx:alpine

COPY website/ /usr/share/nginx/html

EXPOSE 80

Part 19: Push Dockerfile to GitHub

Now save the Dockerfile and .dockerignore in your repository.

Run:

git status

You should see:

Dockerfile
.dockerignore

Add files:

git add Dockerfile .dockerignore

Commit:

git commit -m "Add Dockerfile for website containerization"

Push:

git push

Now your GitHub repository contains:

Website source code
Dockerfile
.dockerignore

This means another DevOps engineer can clone your repo and build the same image.

Part 20: Enterprise Explanation

In a real company, developers do not manually copy files to servers.

They follow a workflow:

Developer writes code
        |
        V
Pull Request
        |
        V
Code Review
        |
        V
Merge to main
        |
        V
CI/CD Pipeline
        |
        V
Docker Image Build
        |
        V
Image Scan
        |
        V
Push to Registry
        |
        V
Deploy to ECS or Kubernetes
        |
        V
Monitor Logs and Metrics

A DevOps engineer is responsible for making this process:

Automated
Repeatable
Secure
Observable
Reliable
Scalable

Part 21: What Students Must Pay Attention To

1. Repository structure

Make sure the application files are organized.

Bad:

my-website/
├── final-final-index.html
├── copy-style.css
├── old-script.js

Good:

my-website/
├── index.html
├── style.css
├── script.js
├── images/
├── Dockerfile
└── .dockerignore

2. Do not expose secrets

Never commit:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
OpenAI API key
Database password
Private key

Secrets must be stored in:

AWS Secrets Manager
GitHub Secrets
SSM Parameter Store
Kubernetes Secrets

3. Use tags properly

Bad:

docker build -t my-website .

Better:

docker build -t my-website:v1 .

Best in enterprise:

docker build -t my-website:git-commit-id .

Why?

Because in production, we need to know exactly which version is deployed.
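One way to get a tag that identifies the exact version is to use the short Git commit hash. The helper below is a sketch: image_ref is a hypothetical name, and it only joins a name and a tag after checking that the tag follows Docker's tag rules.

```shell
#!/bin/sh
# image_ref NAME TAG
# Prints NAME:TAG when TAG is a valid Docker tag, otherwise fails.
# Valid tags use letters, digits, '_', '.', and '-', do not start with
# '.' or '-', and are at most 128 characters long.
image_ref() {
  name=$1
  tag=$2
  case $tag in
    '' | .* | -* | *[!A-Za-z0-9_.-]*)
      echo "invalid tag: '$tag'" >&2
      return 1
      ;;
  esac
  if [ "${#tag}" -gt 128 ]; then
    echo "tag is longer than 128 characters" >&2
    return 1
  fi
  printf '%s:%s\n' "$name" "$tag"
}
```

Inside a Git repository, the build command could then look like docker build -t "$(image_ref my-website "$(git rev-parse --short HEAD)")" . so that every image maps back to one commit.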

4. Keep images small

Use:

FROM nginx:alpine

instead of:

FROM nginx:latest

Alpine-based images are usually smaller.

Smaller images are:

Faster to build
Faster to push
Faster to pull
Easier to scan

5. Test locally before pushing

Before sending the image to cloud, always run locally:

docker build -t my-website:v1 .
docker run -d -p 8080:80 my-website:v1

If it does not work locally, it will not magically work in AWS.

6. Understand the port

Inside container:

80

On laptop:

8080

This command:

docker run -p 8080:80 my-website:v1

means:

Laptop port 8080 → Container port 80
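The order of the two numbers is easy to mix up, so here is a small sketch that spells a mapping out. The function name describe_port_mapping is an assumption for this example, and it only handles the simple HOST:CONTAINER form used in this lab.

```shell
#!/bin/sh
# describe_port_mapping HOST_PORT:CONTAINER_PORT
# Explains a simple -p value; the optional IP prefix form is not handled.
describe_port_mapping() {
  host_port=${1%%:*}       # text before the first colon
  container_port=${1##*:}  # text after the last colon
  echo "http://localhost:$host_port -> container port $container_port"
}

describe_port_mapping 8080:80
describe_port_mapping 8081:80
```

The left number is always the laptop side, which is why changing 8080 to 8081 changes the browser URL but not what Nginx listens on.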

7. Understand image vs container

Docker image:

Template / package / blueprint

Docker container:

Running instance of the image

Example:

Image = class
Container = object

or:

Image = cake recipe
Container = actual cake

Part 22: Student Deliverables

Each student must submit:

GitHub repository link
Screenshot of Dockerfile
Screenshot of successful docker build
Screenshot of docker ps
Screenshot of website running at:

http://localhost:8080

Short explanation:

What is Docker?
What is Docker image?
What is Docker container?
Why do DevOps engineers use Docker?

Part 23: Final Architecture After This Lab

Student GitHub Repo
        |
        V
Dockerfile
        |
        V
Docker Image
        |
        V
Docker Container
        |
        V
Website running locally

This is the foundation for the next lab.

Part 24: Next Lab Preview

Next lab will be:

Docker Image
        |
        V
Amazon ECR
        |
        V
Amazon ECS Fargate
        |
        V
Application Load Balancer
        |
        V
Public Production Website

Amazon ECR is used to store Docker images in AWS, and before pushing an image, Docker must authenticate to the target ECR registry. AWS notes that ECR authentication tokens are temporary and valid for 12 hours.

In ECS, a task definition works like the blueprint for the application. It tells ECS which Docker image to use, how much CPU and memory to allocate, and how the container should run.

An Application Load Balancer can be used with ECS services to distribute traffic across running tasks, which is important in production when more than one container is running.

DevOps Lab 2: Deploy Dockerized Website to AWS ECS Fargate

Goal

In Lab 1, you:

GitHub Repository
        |
        V
Docker Image
        |
        V
Docker Container
        |
        V
Website running on localhost

In this lab, we move the application to AWS.

At the end of this lab:

GitHub Repository
        |
        V
Docker Image
        |
        V
Amazon ECR
        |
        V
Amazon ECS Fargate
        |
        V
Application Load Balancer
        |
        V
Public Website

You will access your website using a public AWS URL.

Learning Objectives

Students will learn:

What ECR is
What ECS is
What Fargate is
What a Task Definition is
What a Service is
What a Load Balancer is
How production applications are deployed

Enterprise Perspective

Many students think:

Docker = Production

Wrong.

Docker only creates the package.

Production requires:

Docker Image
+
Image Registry
+
Container Orchestration
+
Networking
+
Load Balancer
+
Monitoring

This is where ECS enters.

Step 1: Architecture Overview

Today we build:

Browser
   |
   V
Application Load Balancer
   |
   V
ECS Service
   |
   V
ECS Task
   |
   V
Docker Container
   |
   V
Website

Step 2: Create AWS ECR Repository

Search:

Elastic Container Registry

Open:

Amazon ECR

Click:

Create Repository

Repository name:

student-website

Visibility:

Private

Click:

Create Repository

Why ECR Exists

Without ECR:

Laptop
   |
   V
Container

AWS cannot access your laptop.

We need a central image registry.

Laptop
   |
   V
ECR
   |
   V
ECS

Think of ECR as GitHub for Docker images.

Step 3: Authenticate Docker to AWS

Open CloudShell or Terminal.

Run:

aws configure

Enter:

Access Key
Secret Key
Region

Then log Docker in to ECR, replacing ACCOUNT_ID with your 12-digit AWS account ID:

aws ecr get-login-password \
--region us-east-1 \
| docker login \
--username AWS \
--password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

Expected:

Login Succeeded

Step 4: Tag Docker Image

Check image:

docker images

Example:

my-website:v1

Tag image:

docker tag my-website:v1 \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/student-website:v1

Why Tag?

Locally:

my-website:v1

AWS requires:

ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/student-website:v1

Now AWS knows where the image belongs.
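The ECR image name always has the same parts: account ID, region, repository, and tag. The sketch below assembles it from those parts. The function name ecr_image_uri is hypothetical, and 123456789012 is a placeholder account ID, not a real one.

```shell
#!/bin/sh
# ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG
# Prints the full image name that docker tag and docker push expect for ECR.
ecr_image_uri() {
  account_id=$1
  region=$2
  repository=$3
  tag=$4
  # AWS account IDs are exactly 12 digits.
  case $account_id in
    '' | *[!0-9]*)
      echo "account ID must contain only digits" >&2
      return 1
      ;;
  esac
  if [ "${#account_id}" -ne 12 ]; then
    echo "account ID must be 12 digits" >&2
    return 1
  fi
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' \
    "$account_id" "$region" "$repository" "$tag"
}

# Placeholder account ID for illustration only.
ecr_image_uri 123456789012 us-east-1 student-website v1
```

Building the name in one place avoids typos when the same value is needed for docker tag, docker push, and the ECS task definition.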

Step 5: Push Image to ECR

Run:

docker push \
ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/student-website:v1

Wait for upload.

Refresh ECR.

You should see:

student-website:v1

Step 6: Create ECS Cluster

Search:

Elastic Container Service

Open:

Amazon ECS

Click:

Create Cluster

Cluster name:

student-cluster

Infrastructure:

AWS Fargate

Click:

Create

Why ECS?

Imagine:

100 containers

Questions:

Which server runs them?
Which one failed?
How many copies?
Who restarts failed containers?

ECS manages all of this.

Step 7: Why Fargate?

Without Fargate:

You manage EC2 servers.

You must:

Patch Linux
Upgrade OS
Replace failed servers
Manage capacity

With Fargate:

AWS manages servers.

You only deploy containers.

Step 8: Create Task Definition

Open:

Task Definitions

Click:

Create New Task Definition

Launch Type:

Fargate

Task name:

student-task

CPU:

0.5 vCPU

Memory:

1 GB

Container Section

Container Name:

website

Image URI:

ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/student-website:v1

Container Port:

80

Click:

Create

Why Task Definitions Matter

Task Definition is:

Docker Image
CPU
Memory
Port
Environment Variables

combined together.

Think:

Task Definition = Blueprint

Step 9: Create Service

Open cluster.

Click:

Create Service

Choose:

student-task

Launch Type:

Fargate

Desired Tasks:

What Is a Service?

Task:

One running copy of a task definition (here, one container)

Service:

Keeps the desired number of tasks running

If task crashes:

ECS starts another one

Automatically.

Step 10: Networking

Choose:

Default VPC

Subnets:

Select all public subnets

Assign Public IP:

Enabled

Security Group

Create:

student-web-sg

Inbound:

HTTP 80
Anywhere

Why Security Groups Matter

Security Groups are AWS firewalls.

Bad:

All ports open

Good:

Only required ports

Enterprise rule:

Least Privilege

Step 11: Create Application Load Balancer

Name:

student-alb

Scheme:

Internet-facing

Listener:

HTTP 80

Target Type:

IP

Why ALB Exists

Without ALB:

User
   |
Container

One container failure = outage.

With ALB:

User
   |
ALB
   |
Multiple Containers

Traffic distributes automatically.

Step 12: Connect Service to ALB

Target Group:

student-target-group

Container:

website

Container Port:

80

Create Service.

Wait:

2-5 minutes

Step 13: Verify Deployment

Open:

ECS Cluster

Check:

Tasks = Running

Status:

Healthy

Step 14: Test Website

Open:

Load Balancer DNS Name

Example:

http://student-alb-123456.us-east-1.elb.amazonaws.com

Website should load.

Congratulations.

Your website is now running in AWS.

What Just Happened?

You moved from:

Laptop

to:

AWS Production Environment

Architecture:

GitHub
   |
Docker Build
   |
ECR
   |
ECS Task Definition
   |
ECS Service
   |
ALB
   |
Users

DevOps Engineer Responsibilities

When deploying production applications:

Verify Image

Make sure:

Correct version
Correct tag
No vulnerabilities

Verify Networking

Check:

Security Groups
Subnets
Ports

Verify Health Checks

Make sure:

Container starts
Container stays healthy

Verify Logs

Check:

CloudWatch Logs

Verify Scaling

Can application survive:

10 users?
100 users?
1000 users?

Common Production Issues

Wrong Port

Container:

listens on port 80

ALB target group:

forwards to a different port

Result:

Application unreachable

Wrong Image

Task Definition:

v1

Expected:

v2

Result:

Old application deployed

Security Group Blocked

No inbound rule for:

HTTP 80

Result:

Website inaccessible

Health Check Failure

ALB health check:

requests a path the container does not serve

Container returns:

an error status instead of 200

Result:

Target unhealthy

Student Deliverables

Each student submits:

ECR Repository screenshot
Successful Docker Push screenshot
ECS Cluster screenshot
ECS Service screenshot
Running Task screenshot
ALB screenshot
Website URL
Screenshot of website running from ALB DNS

Final Architecture

Developer
    |
GitHub
    |
Docker Build
    |
Amazon ECR
    |
Amazon ECS Fargate
    |
Application Load Balancer
    |
Internet
    |
Users

This is the same deployment model used by thousands of enterprise applications today.

Lab 3 Preview

In Lab 3 we will automate everything.

Instead of manually:

Build
Push
Deploy

students will create:

GitHub Push
      |
      V
GitHub Actions
      |
      V
Docker Build
      |
      V
ECR Push
      |
      V
ECS Deployment

This is where you will begin learning real CI/CD and how enterprise DevOps teams deploy applications automatically.

This is where students finally start feeling like DevOps engineers.

Lab 1: Website → Docker

Lab 2: Docker → ECS

Lab 3: GitHub Push → Automatic Deployment

The goal is:

Developer changes code
       |
       V
Git Push
       |
       V
GitHub Actions
       |
       V
Docker Build
       |
       V
ECR
       |
       V
ECS Update
       |
       V
Production Updated

Before Lab 3:

Student manually:
docker build
docker push
update ECS

After Lab 3:

git push

Everything else happens automatically

LAB 3

CI/CD Pipeline with GitHub Actions

Business Problem

Imagine 50 developers.

Every day:

Developer A pushes code
Developer B pushes code
Developer C pushes code
Developer D pushes code

Should DevOps manually run:

docker build
docker push
update ecs

50 times per day?

No.

This is why CI/CD exists.

What is CI?

Continuous Integration

Whenever code changes:

Compile
Test
Validate

automatically.

What is CD?

Continuous Delivery / Deployment

Whenever code passes tests:

Build
Deploy

automatically.

Final Architecture

Developer
    |
    V
GitHub
    |
    V
GitHub Actions
    |
    +----------------+
    | Build Image    |
    | Security Scan  |
    | Push To ECR    |
    +----------------+
    |
    V
Amazon ECR
    |
    V
Amazon ECS
    |
    V
Application Load Balancer
    |
    V
Users

Step 1

Create IAM User For GitHub

Search:

IAM

Create user:

github-actions-user

Permissions:

AmazonEC2ContainerRegistryFullAccess

AmazonECS_FullAccess

For a bootcamp this is okay.

Later we will reduce permissions.

Why?

GitHub must authenticate to AWS.

GitHub needs:

Access Key
Secret Key

to:

Push images
Update ECS

Step 2

Create Access Keys

IAM User

Create:

Access Key
Secret Key

Save them.

Students will use them in GitHub Secrets.

Step 3

Configure GitHub Secrets

Open Repository.

Settings

Secrets and Variables

Actions

New Repository Secret

Create:

AWS_ACCESS_KEY_ID

Create:

AWS_SECRET_ACCESS_KEY

Create:

AWS_REGION

Example:

us-east-1

Create:

AWS_ACCOUNT_ID

Example:

123456789012

Why Secrets?

Never do this:

AWS_SECRET_ACCESS_KEY: abc123

inside source code.

If somebody gets access to the repository:

AWS account compromised

Secrets protect credentials.
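A quick local check before pushing can catch an access key that slipped into a file. This is a rough sketch: find_aws_key_ids is a hypothetical name, and the pattern (AKIA followed by 16 uppercase letters or digits) is a heuristic assumption that will not catch secret keys or other kinds of tokens.

```shell
#!/bin/sh
# find_aws_key_ids DIR
# Lists files under DIR that contain something shaped like an AWS access
# key ID. Only file names are printed, so the key itself is not echoed.
find_aws_key_ids() {
  grep -rlE --exclude-dir=.git 'AKIA[0-9A-Z]{16}' "$1"
}
```

Running find_aws_key_ids . in the repository root prints nothing when no match is found. Any file it lists should be cleaned up before git push.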

Step 4

Create GitHub Actions Folder

Inside repository:

.github/
└── workflows/

Create:

deploy.yml

Why?

GitHub automatically reads:

.github/workflows

and executes workflows.

Step 5

Create Workflow

deploy.yml

name: Deploy Website

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build Image
        run: |
          docker build -t website .
          docker tag website:latest \
            ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/student-website:latest

      - name: Push Image
        run: |
          docker push \
            ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/student-website:latest

What Does This Pipeline Do?

When student runs:

git push

GitHub:

Downloads code
Builds Docker image
Logs into AWS
Pushes image to ECR

automatically.

Step 6

Commit Workflow

git add .

git commit -m "Add CI/CD"

git push

Step 7

Watch Pipeline

GitHub

Actions

Students will see:

Workflow Running

Then:

Success

What Just Happened?

Nobody manually executed:

docker build
docker push

GitHub did it.

This is CI/CD.

Enterprise Discussion

Students should understand:

A DevOps engineer is NOT paid to click buttons.

A DevOps engineer is paid to automate.

Bad DevOps:

Build manually
Deploy manually

Good DevOps:

Push code

Everything automatic

Step 8

Deploy New Version

Modify:

<h1>Version 2</h1>

Commit:

git add .

git commit -m "version2"

git push

Pipeline runs.

New image created.

Enterprise Problem

ECR now contains:

latest

But ECS still runs:

old container

Why?

ECS only starts new containers when told.

Step 9

Force New Deployment

Pipeline:

- name: Deploy ECS
  run: |
    aws ecs update-service \
      --cluster student-cluster \
      --service student-service \
      --force-new-deployment

Now pipeline:

Build Image
Push Image
Restart ECS Service

What Happens?

ECS:

Old Task
      ↓
New Task
      ↓
Pull Latest Image
      ↓
Run New Version
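The update-service call returns as soon as the deployment is requested, not when the new task is healthy. A possible extension, sketched below, waits for the service to settle. The function name redeploy_and_wait is an assumption, and the sketch assumes the AWS CLI is installed with credentials already configured.

```shell
#!/bin/sh
# redeploy_and_wait CLUSTER SERVICE
# Forces a new deployment, then blocks until ECS reports the service as stable.
redeploy_and_wait() {
  cluster=$1
  service=$2
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --force-new-deployment &&
  aws ecs wait services-stable \
    --cluster "$cluster" \
    --services "$service"
}
```

Calling redeploy_and_wait student-cluster student-service in the pipeline makes the job fail when the service does not stabilize within the waiter's timeout, instead of reporting success too early.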

Production Deployment Flow

Student changes:

index.html

Pushes:

git push

Automatically:

GitHub Actions
      |
Build Docker
      |
Push ECR
      |
Restart ECS
      |
Pull Latest Image
      |
Deploy

No AWS Console.

No manual work.

What DevOps Engineers Monitor

Students must learn:

Pipeline Status

Success
Failed

Build Logs

Docker build errors

AWS Authentication

Credential failures

ECS Deployment

New task healthy?

Website

Application accessible?

Common Failures

Wrong Secret

Invalid AWS credentials

Pipeline fails.

Dockerfile Broken

docker build failed

Pipeline fails.

ECR Permission Missing

access denied

Pipeline fails.

ECS Service Name Wrong

service not found

Deployment fails.

Student Deliverables

Each student submits:

GitHub Actions workflow file
Successful workflow screenshot
ECR image screenshot
ECS deployment screenshot
Production URL
Screenshot showing Version 2 deployed automatically

End Result

Students now understand:

Developer
     |
GitHub
     |
GitHub Actions
     |
Docker
     |
ECR
     |
ECS
     |
Load Balancer
     |
Users

This is the first complete production-grade CI/CD pipeline.

Lab 4 is Terraform. Students stop creating ECR, ECS, the ALB, security groups, and networking by hand and instead create the entire AWS infrastructure from code. This is where they stop being "people who deploy applications" and start thinking like infrastructure engineers.

Up to now:

Lab 1

Website -> Docker

Lab 2

Docker -> ECS

Lab 3

GitHub -> CI/CD -> ECS

But there is a huge problem.

Imagine the company says:

We need 50 environments.

Dev
QA
UAT
Stage
Production

Will DevOps engineers manually click:

Create VPC
Create ALB
Create ECS
Create Security Groups
Create Target Groups
Create ECR

50 times?

No.

This is why Terraform exists.

LAB 4

Infrastructure as Code (Terraform)

Goal

Current situation:

Developer
    |
GitHub
    |
GitHub Actions
    |
ECR
    |
ECS
    |
ALB

But everything was created manually.

Goal:

Terraform
    |
    +-- VPC
    +-- Subnets
    +-- Security Groups
    +-- ALB
    +-- ECS
    +-- ECR

Everything created from code.

Enterprise Problem

Imagine:

Production breaks.

You must rebuild everything.

Without Terraform:

Nobody remembers:
- SG rules
- ECS settings
- ALB config
- Subnets

Disaster.

With Terraform:

terraform apply

Everything recreated.

Architecture

Students will build:

Terraform
    |
    V
VPC
    |
Subnets
    |
ALB
    |
ECS
    |
ECR

Step 1

Create Repository

Create new repo:

terraform-infrastructure

Structure:

terraform/
├── provider.tf
├── variables.tf
├── main.tf
├── outputs.tf
└── terraform.tfvars

Why Separate Repo?

Application repo:

Website Code

Infrastructure repo:

AWS Resources

Enterprise companies usually separate them.

Step 2

Install Terraform

Verify:

terraform version

Expected:

Terraform v1.x.x

Step 3

Create Provider

provider.tf

provider "aws" {
 region = "us-east-1"
}

Why Provider?

Terraform supports:

AWS
Azure
GCP
GitHub
Kubernetes

Provider tells Terraform:

Talk to AWS

Step 4

Configure Credentials

Never hardcode:

access_key="xxxx"
secret_key="xxxx"

Use:

aws configure

Verify:

aws sts get-caller-identity

Step 5

Create ECR

main.tf

resource "aws_ecr_repository" "website" {

  name = "student-website"
}

Why?

Before:

Student manually created ECR.

Now:

Terraform creates ECR.

Step 6

Create VPC

resource "aws_vpc" "main" {

  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "student-vpc"
  }
}

Enterprise Explanation

Everything lives inside a VPC.

Think:

AWS Datacenter
        |
        V
VPC
        |
        V
Your Private Network

Step 7

Create Public Subnets

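resource "aws_subnet" "public1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

resource "aws_subnet" "public2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1b"
}

# A subnet is only public if it routes to an internet gateway; an internet-facing ALB needs one.
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public1" {
  subnet_id      = aws_subnet.public1.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public2" {
  subnet_id      = aws_subnet.public2.id
  route_table_id = aws_route_table.public.id
}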

Why Two Subnets?

Enterprise applications require:

High Availability

If one AZ dies:

Other AZ survives

Step 8

Create Security Group

resource "aws_security_group" "web" {

 name = "web-sg"

 vpc_id = aws_vpc.main.id

 ingress {

   from_port = 80
   to_port = 80

   protocol = "tcp"

   cidr_blocks = ["0.0.0.0/0"]
 }

 egress {

   from_port = 0
   to_port = 0

   protocol = "-1"

   cidr_blocks = ["0.0.0.0/0"]
 }
}

Why Security Groups?

Security Groups:

AWS Firewall

Protect resources.

Step 9

Create Application Load Balancer

resource "aws_lb" "website" {

 name = "student-alb"

 internal = false

 load_balancer_type = "application"

 security_groups = [
   aws_security_group.web.id
 ]

 subnets = [
   aws_subnet.public1.id,
   aws_subnet.public2.id
 ]
}

Why ALB?

ALB distributes traffic:

Users
  |
ALB
  |
Containers

Step 10

Create ECS Cluster

resource "aws_ecs_cluster" "main" {

 name = "student-cluster"
}

Why ECS Cluster?

Think:

ECS Cluster

as a parking lot.

Containers park inside.

Step 11

Terraform Init

Run:

terraform init

What Happens?

Terraform downloads:

AWS Provider
Plugins
Dependencies

Step 12

Terraform Validate

terraform validate

Expected:

Success

Step 13

Terraform Plan

terraform plan

Example:

+ create VPC
+ create ALB
+ create ECS
+ create ECR

Why Plan?

Plan allows engineers to review:

What will change?

Before touching production.
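A common way to make sure the reviewed plan is exactly what gets applied is to save it to a file. The two helpers below are a sketch of that habit. The function names make_plan and apply_plan and the default file name tfplan are assumptions for this example.

```shell
#!/bin/sh
# make_plan [FILE]
# Writes the plan to FILE (default: tfplan) and prints it for review.
make_plan() {
  plan_file=${1:-tfplan}
  terraform plan -out="$plan_file" &&
  terraform show "$plan_file"
}

# apply_plan [FILE]
# Applies exactly the saved plan, not a freshly computed one.
apply_plan() {
  plan_file=${1:-tfplan}
  terraform apply "$plan_file"
}
```

Run make_plan, read the output, and only then run apply_plan. Applying a saved plan does not ask for confirmation, so the review has to happen between the two calls.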

Step 14

Terraform Apply

terraform apply

Type:

yes

Terraform creates:

VPC
Subnets
Security Group
ALB
ECS
ECR

Enterprise Rule

Never do:

terraform apply -auto-approve

against production.

Always review.

Step 15

Verify

AWS Console:

Check:

VPC
Subnets
Security Groups
ALB
ECS
ECR

Everything should exist.

Step 16

Destroy Environment

This is the coolest part.

Run:

terraform destroy

Confirm:

yes

Terraform removes:

ALB
ECS
ECR
Subnets
VPC

Why Destroy?

Cloud costs money.

DevOps engineers often create:

Temporary environments

for:

Developers
QA
Testing
Training

Destroying saves money.

Important DevOps Concepts

Desired State

Terraform says:

I want:
1 VPC
2 Subnets
1 ECS Cluster
1 ALB

Terraform compares:

Desired State
vs
Current State

and fixes differences.
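The comparison idea can be shown with two plain text lists. This is a conceptual sketch only: Terraform's real planner works from configuration, state, and provider data, not from text files, and the function name plan_changes is invented for this example.

```shell
#!/bin/sh
# plan_changes DESIRED_FILE CURRENT_FILE
# Each file lists one resource name per line. Prints what would have to be
# created or destroyed for the current list to match the desired list.
plan_changes() {
  desired_sorted=$(mktemp)
  current_sorted=$(mktemp)
  sort -u "$1" > "$desired_sorted"
  sort -u "$2" > "$current_sorted"
  # Names only in the desired list are missing and must be created.
  comm -23 "$desired_sorted" "$current_sorted" | sed 's/^/+ create /'
  # Names only in the current list are no longer wanted.
  comm -13 "$desired_sorted" "$current_sorted" | sed 's/^/- destroy /'
  rm -f "$desired_sorted" "$current_sorted"
}
```

Names that appear in both lists produce no output, which mirrors how an unchanged resource does not show up as a change in a plan.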

State File

Terraform creates:

terraform.tfstate

Think:

Database of infrastructure

Never delete it, and never commit it to a public repository, because it can contain sensitive values.

Enterprise companies store state in:

Amazon S3

and lock it using:

Amazon DynamoDB

What Students Must Understand

Without Terraform:

Click
Click
Click
Click
Click

No documentation.

No repeatability.

No automation.

With Terraform:

Code
Commit
Review
Apply

Infrastructure becomes:

Version Controlled
Auditable
Repeatable
Recoverable

Student Deliverables

Each student submits:

GitHub Repository with Terraform code
Screenshot of terraform init
Screenshot of terraform plan
Screenshot of terraform apply
Screenshot showing:

VPC
Subnets
Security Group
ECS Cluster
ALB
ECR
Screenshot of terraform destroy

Lab 5 Preview

Lab 5 is where everything becomes enterprise-grade.

Students will create:

Terraform
      |
GitHub
      |
Pull Request
      |
GitHub Actions
      |
Terraform Plan
      |
Manager Approval
      |
Terraform Apply
      |
AWS Infrastructure

At that point they will have:

Infrastructure as Code
+
CI/CD
+
Containerization
+
Cloud Deployment

which is very close to how modern DevOps teams operate in production.

Lab 5 is where students stop being "people who know tools" and start understanding how enterprise DevOps teams actually work.

Up to now they learned:

Lab 1  Website -> Docker
Lab 2  Docker -> ECS
Lab 3  GitHub Actions -> CI/CD
Lab 4  Terraform -> Infrastructure as Code

But there is still a huge problem.

Everything is being done by one person.

Real companies do not work like that.

LAB 5

Enterprise DevOps Workflow

Goal

Students will simulate a real company.

Instead of:

Developer
    |
Production

they will build:

Developer
    |
Pull Request
    |
Code Review
    |
Terraform Plan
    |
Approval
    |
Terraform Apply
    |
Production

Business Problem

Imagine:

A developer accidentally changes:

resource "aws_db_instance"

and deletes production database.

Who should stop this?

Terraform?

No.

GitHub?

No.

DevOps Process.

This is why enterprise companies use:

Pull Requests
Approvals
Code Reviews
Change Control

Architecture

Students will build:

Developer
    |
Feature Branch
    |
Pull Request
    |
GitHub Actions
    |
Terraform Plan
    |
Approval
    |
Merge
    |
Terraform Apply
    |
AWS

Learning Objectives

Students learn:

Git workflow
Pull Requests
Branching strategy
Terraform Plan
Terraform Apply
Approvals
Production deployment
Change management

Step 1

Create Branch

Never work directly on:

main

Create:

git checkout -b feature/new-alb

Verify:

git branch

Output:

* feature/new-alb
  main

Enterprise Rule

Bad:

Developer changes production directly

Good:

Developer
   |
Feature Branch

Step 2

Modify Terraform

Example:

Add new security group.

Example:

resource "aws_security_group" "web" {
 ...
}

Commit:

git add .

git commit -m "Add web security group"

Push:

git push origin feature/new-alb

Step 3

Create Pull Request

GitHub:

Compare & Pull Request

Create PR.

Title:

Add Security Group For Web Layer

Why Pull Requests?

Pull Requests allow:

Review
Discussion
Approval
Audit Trail

Enterprise Example

Developer writes:

cidr_blocks = ["0.0.0.0/0"]

Reviewer asks:

Why open to the internet?

Bug found before production.
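Part of the reviewer's question can be automated. The sketch below is a hypothetical helper (`findOpenCidrs` is not part of any tool) that only does a plain text search. It does not replace a human review or a real policy scanner.

```javascript
// Hypothetical helper: report the lines where a Terraform file
// opens a rule to the whole internet (0.0.0.0/0).
function findOpenCidrs(terraformSource) {
  const findings = [];
  terraformSource.split("\n").forEach((line, index) => {
    if (line.includes('"0.0.0.0/0"')) {
      findings.push({ line: index + 1, text: line.trim() });
    }
  });
  return findings;
}

// Made-up security group used only as input for the check.
const exampleTerraform = [
  'resource "aws_security_group" "web" {',
  "  ingress {",
  "    from_port   = 22",
  "    to_port     = 22",
  '    protocol    = "tcp"',
  '    cidr_blocks = ["0.0.0.0/0"]',
  "  }",
  "}",
].join("\n");

console.log(findOpenCidrs(exampleTerraform));
```

A check like this could run in the pull request pipeline and leave the final decision to the reviewer.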

Step 4

Terraform Plan in GitHub Actions

Create:

.github/workflows/terraform-plan.yml

Example:

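name: Terraform Plan

on:
  pull_request:

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      # Check out the pull request code
      - uses: actions/checkout@v4
      # Install the Terraform CLI on the runner
      - uses: hashicorp/setup-terraform@v3
      # AWS credentials must be available to the job (for example from repository secrets)
      - run: terraform init
      - run: terraform plan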

What Happens?

Developer opens PR.

Automatically:

Terraform Init
Terraform Plan

run. (Terraform Validate is added in Step 5.)

Why?

Before merge:

Everyone sees:

What will Terraform change?

Example Output

+ Create Security Group

+ Create ALB

~ Modify ECS Service

Management can review.

Step 5

Add Validation

Add:

- run: terraform fmt -check

- run: terraform validate

Why?

Checks:

Syntax
Formatting
Errors

before deployment.

Enterprise Rule

Bad:

Merge first
Fix later

Good:

Validate first
Merge later

Step 6

Branch Protection

GitHub

Settings

Branches

Add Rule

Protect:

main

Require:

Pull Request
Review
Successful Checks

Why?

Nobody can directly push:

git push origin main

Enterprise Example

Without protection:

Developer deletes VPC
Pushes code
Production outage

With protection:

Review required

Mistake caught.

Step 7

Reviewer Approval

Student A creates PR.

Student B reviews.

Checklist:

Terraform valid?
Naming correct?
Resources required?
Security issues?

Approve.

Why?

Four eyes principle.

At least two people see changes.

Step 8

Merge PR

After approval:

Merge Pull Request

Now code reaches:

main

Step 9

Production Pipeline

Create:

.github/workflows/terraform-apply.yml

Trigger:

on:
  push:
    branches:
      - main

Pipeline:

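terraform init

terraform validate

terraform plan -out=tfplan

terraform apply tfplan

Applying a saved plan file does not prompt for confirmation, which is what a non-interactive pipeline needs.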

What Happens?

Developer merges.

Automatically:

Terraform Apply

runs.

Infrastructure updates.

Enterprise Workflow

Feature Branch
      |
Pull Request
      |
Terraform Plan
      |
Approval
      |
Merge
      |
Terraform Apply
      |
AWS Updated

Step 10

Remote State

Current:

terraform.tfstate

stored locally.

Dangerous.

Laptop lost:

State lost

Create S3 Backend

terraform {
  backend "s3" {
    bucket = "student-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

Why?

State shared by team.

Step 11

State Locking

Create DynamoDB table.

Terraform (inside the backend "s3" block):

dynamodb_table = "terraform-locks"

Why?

Prevent:

Engineer A
Engineer B

running:

terraform apply

at same time.

Enterprise Problem

Without locking:

Corrupted state
Broken infrastructure
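What the lock does can be shown with a toy model. In the real setup the lock record lives in DynamoDB. The `StateLock` class below is made up and only imitates the behaviour in memory, so it can be read and run locally.

```javascript
// Toy in-memory lock table (illustration only, not Terraform code).
class StateLock {
  constructor() {
    this.locks = new Map(); // state key -> current owner
  }

  // Try to take the lock; fail if someone else already holds it.
  acquire(stateKey, owner) {
    if (this.locks.has(stateKey)) {
      return { ok: false, heldBy: this.locks.get(stateKey) };
    }
    this.locks.set(stateKey, owner);
    return { ok: true, heldBy: owner };
  }

  // Only the owner may release the lock.
  release(stateKey, owner) {
    if (this.locks.get(stateKey) === owner) {
      this.locks.delete(stateKey);
      return true;
    }
    return false;
  }
}

const lock = new StateLock();
const stateKey = "prod/terraform.tfstate";

const first = lock.acquire(stateKey, "engineer-a");  // gets the lock
const second = lock.acquire(stateKey, "engineer-b"); // rejected, A holds it
lock.release(stateKey, "engineer-a");
const third = lock.acquire(stateKey, "engineer-b");  // succeeds after release

console.log(first.ok, second.ok, third.ok); // true false true
```

Engineer B has to wait until Engineer A finishes, so two applies never write the same state at once.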

Step 12

Environment Strategy

Current:

One Environment

Enterprise:

Dev
QA
Stage
Production

Example

terraform/dev

terraform/qa

terraform/prod

or one configuration with a workspace per environment:

terraform workspace new dev

Why?

Never test directly in production.

Step 13

Approval Gates

Production pipeline:

Terraform Plan
      |
Manager Approval
      |
Terraform Apply

GitHub Environments:

Production

Require:

Manual Approval

before apply.

Enterprise Example

Black Friday.

Someone changes:

Load Balancer

Management wants review.

Approval gate prevents accidents.

DevOps Engineer Responsibilities

Students must understand:

A DevOps engineer is not paid for:

Creating EC2
Creating ECS
Running Terraform

They are paid for:

Preventing outages
Building automation
Reducing risk
Creating repeatable deployments

What Students Must Pay Attention To

Security

Never:

0.0.0.0/0 everywhere

Cost

Always review:

terraform plan

before apply.

Naming

Use:

dev-web-alb

qa-web-alb

prod-web-alb

not:

alb1
alb2
alb3

State

Never manually edit:

terraform.tfstate

Student Deliverables

Each student submits:

Git

Feature branch screenshot
Pull Request screenshot

GitHub Actions

Terraform Plan successful

Review

PR approved by another student

Terraform

Successful Apply

AWS

Screenshots showing:

VPC
Subnets
ALB
Security Groups
ECS
ECR

Architecture Diagram

Student must draw:

Developer
    |
Feature Branch
    |
Pull Request
    |
GitHub Actions
    |
Terraform Plan
    |
Approval
    |
Terraform Apply
    |
AWS

Lab 6 Preview

Lab 6 is where the application becomes a true enterprise application.

Students will split their website into:

Frontend
Backend API
Database

and deploy:

React Frontend
     |
NodeJS API
     |
PostgreSQL RDS

using:

Terraform
+
GitHub Actions
+
ECS Fargate
+
ALB
+
Route53

This is usually the first lab where students see a complete production architecture similar to what many companies run today.

This is the lab where students finally understand why microservices exist.

Up to Lab 5 they deployed a simple website.

The problem is:

Everything is one application.

In enterprise companies, applications are usually separated.

For example:

Netflix

Frontend
Backend APIs
Authentication
Payments
Recommendations
Database
Monitoring

all separate services.

LAB 6

Transform Website into a Microservices Application

Business Problem

Current architecture:

Browser
   |
Website

Problem:

If one piece fails:

Entire application fails

Problem:

One developer changes website.

Another changes API.

They interfere with each other.

Problem:

Cannot scale independently.

Goal

Convert student website into:

Browser
   |
Frontend
   |
Backend API
   |
Database

What Students Will Build

Architecture:

Internet
    |
Application Load Balancer
    |
Frontend Container
    |
Backend Container
    |
RDS PostgreSQL

Real Enterprise Example

Think about Amazon.

Frontend:

amazon.com

Backend:

Products API
Orders API
Users API

Database:

PostgreSQL
Aurora
DynamoDB

Separate systems.

Step 1

Create New Repositories

Current:

student-website

Split into:

frontend

backend

terraform-infra

Why?

Different teams may own:

Frontend Team

Backend Team

Infrastructure Team

Step 2

Frontend Service

Students keep:

index.html

style.css

app.js

Convert into:

frontend/
|
├── index.html
├── style.css
├── app.js
├── Dockerfile

Container:

FROM nginx:alpine

COPY . /usr/share/nginx/html

Purpose

Frontend should only:

Display information
Collect user input
Call APIs

Step 3

Create Backend API

Folder:

backend/
|
├── app.js
├── package.json
├── Dockerfile

Simple API:

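const express = require("express");

const app = express();

// Allow the frontend (served from another origin, e.g. localhost:8080) to call this API.
// "*" is fine for a lab; restrict the origin in production.
app.use((req, res, next) => {
  res.set("Access-Control-Allow-Origin", "*");
  next();
});

// Return a hardcoded list of students
app.get("/students", (req, res) => {
  res.json([{ name: "John" }]);
});

app.listen(5000);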

Why Backend?

Frontend should not contain business logic.

Bad:

HTML
|
Database

Good:

Frontend
|
Backend API
|
Database

Step 4

Backend Dockerfile

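FROM node:20-alpine

WORKDIR /app

# Copy the manifest first so the npm install layer is cached between code changes
COPY package*.json ./

RUN npm install

COPY . .

EXPOSE 5000

CMD ["node","app.js"]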

What Students Learn

One application now has:

Frontend Image

Backend Image

instead of:

One Giant Image

Step 5

Build Images

Frontend (from the frontend/ folder):

docker build -t frontend:v1 .

Backend (from the backend/ folder):

docker build -t backend:v1 .

Verify:

docker images

Step 6

Run Locally

Backend:

docker run -d -p 5000:5000 backend:v1

Frontend:

docker run -d -p 8080:80 frontend:v1

Verify API

Open:

http://localhost:5000/students

Should return:

[
 {
  "name":"John"
 }
]

Step 7

Frontend Calls API

JavaScript:

fetch("http://localhost:5000/students")

Now:

Frontend
     |
Backend

communicate.

Enterprise Concept

This is called:

Service Communication

Every modern application does this.

Step 8

Create RDS Database

Terraform:

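resource "aws_db_instance" "postgres" {
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "postgres"

  # Do not hardcode the password; declare variable "db_password" with sensitive = true
  password = var.db_password

  # Lets terraform destroy remove the lab database without a final snapshot
  skip_final_snapshot = true
}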

Why Database?

Current API:

Hardcoded Data

Bad.

Need:

Persistent Storage

Step 9

Store Data in Database

Backend:

GET /students

POST /students

DELETE /students

Data stored in:

PostgreSQL

Enterprise Concept

Application should never store data:

Inside Container

Containers die.

Data must survive.

Step 10

ECS Architecture

Current:

One ECS Service

New:

Frontend ECS Service

Backend ECS Service

Architecture

ALB
|
|--- Frontend Service
|
|--- Backend Service

Step 11

ALB Routing

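Frontend:

/* (default rule)

Backend:

/api/*

Example:

/

goes to:

Frontend

Example:

/api/students

goes to:

Backend

By default the ALB forwards the path unchanged, so the backend must serve this route as /api/students rather than /students.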

Why?

Users see:

https://website.com

But ALB routes traffic internally.
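The routing decision can be sketched in a few lines. This is a simplified model of path-based rules, not the ALB itself: the rule list, target names, and `routeTarget` function are illustrative, and only the trailing `*` pattern used in this lab is handled.

```javascript
// Listener rules are checked in priority order; the first match wins.
// Anything that matches no rule goes to the default target.
const rules = [
  { priority: 1, pathPattern: "/api/*", target: "backend" },
];
const defaultTarget = "frontend";

function matchesPattern(pattern, path) {
  // Only the trailing-* form is supported in this sketch.
  if (pattern.endsWith("*")) {
    return path.startsWith(pattern.slice(0, -1));
  }
  return path === pattern;
}

function routeTarget(path) {
  const sorted = [...rules].sort((a, b) => a.priority - b.priority);
  const rule = sorted.find((r) => matchesPattern(r.pathPattern, path));
  return rule ? rule.target : defaultTarget;
}

console.log(routeTarget("/"));             // frontend
console.log(routeTarget("/style.css"));    // frontend
console.log(routeTarget("/api/students")); // backend
```

One public address, two services behind it. The user never sees which container answered.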

Step 12

Push Images to ECR

Students now have:

frontend:v1

backend:v1

Push both.

Repositories:

frontend

backend

inside ECR.

Step 13

GitHub Actions

Frontend pipeline:

Build Frontend

Push ECR

Deploy ECS

Backend pipeline:

Build Backend

Push ECR

Deploy ECS

Separate pipelines.

Enterprise Concept

Frontend deployment should not break backend deployment.

Teams deploy independently.

Step 14

Environment Variables

Never hardcode:

password="123"

Use:

DATABASE_HOST

DATABASE_USER

DATABASE_PASSWORD

from ECS environment variables.
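On the backend side, reading these values could look like the sketch below. The function name `loadDbConfig` and the example values are assumptions; the variable names are the ones listed above.

```javascript
// Read database settings from environment variables and fail fast
// when one is missing, instead of starting with a broken config.
function loadDbConfig(env) {
  const required = ["DATABASE_HOST", "DATABASE_USER", "DATABASE_PASSWORD"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return {
    host: env.DATABASE_HOST,
    user: env.DATABASE_USER,
    password: env.DATABASE_PASSWORD,
  };
}

// Inside the container this would be loadDbConfig(process.env).
// The object below stands in for process.env with placeholder values.
const dbConfig = loadDbConfig({
  DATABASE_HOST: "db.example.internal",
  DATABASE_USER: "app",
  DATABASE_PASSWORD: "set-in-task-definition",
});

console.log(dbConfig.host, dbConfig.user);
```

Failing at startup is deliberate: a missing variable shows up in the first log line instead of as a database error later.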

Enterprise Security

Even better:

AWS Secrets Manager

Step 15

Monitoring

Add:

CloudWatch Logs

For:

Frontend Logs

Backend Logs

What DevOps Engineers Check

Frontend:

Is website loading?

Backend:

Are APIs responding?

Database:

Is RDS healthy?

Enterprise Scaling Example

Current traffic:

100 users

Need:

5000 users

Scale:

Backend x10

Keep:

Frontend x2

Microservices allow independent scaling.

Common Problems

Frontend Cannot Reach API

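Bad:

localhost

in the deployed frontend. The fetch call runs in the visitor's browser, where localhost is the visitor's own machine.

Must use:

A relative path such as /api/students, so the request goes through the ALB

For calls between services inside AWS:

Internal ALB

Service Discovery

DNS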

Database Security Group

Backend cannot connect.

Need:

5432 allowed

from backend SG.

Hardcoded Secrets

Bad:

GitHub repository

Good:

Secrets Manager

Student Deliverables

Frontend

GitHub Repository
Dockerfile
Running ECS Service

Backend

GitHub Repository
Dockerfile
Running ECS Service

Database

RDS Screenshot

Architecture Diagram

Browser
    |
ALB
    |
Frontend ECS
    |
Backend ECS
    |
PostgreSQL RDS

GitHub Actions

Frontend pipeline
Backend pipeline

What Students Learn

At the end of Lab 6 they understand:

Frontend
Backend
Database
Containers
ECS
ALB
Terraform
GitHub Actions
CI/CD
RDS

This is the first architecture that starts looking like a real enterprise application.

Lab 7 Preview

Lab 7 will introduce:

Prometheus
Grafana
CloudWatch
Loki
Alerting

Students will learn:

How to know something is broken
How to collect metrics
How to troubleshoot production issues
How DevOps engineers monitor applications

Because a production application is not finished when it is deployed. A production application is finished only when it can be monitored, troubleshot, and recovered when something goes wrong.

LAB 7 – Monitoring, Logging, Alerting & Troubleshooting

This is one of the most important DevOps labs.

Many junior engineers think:

Deploy = Job Done

In reality:

Deploy = Beginning of Operations

The first question management asks after deployment is:

"How do we know if the application is healthy?"

If your answer is:

I open browser and check.

You are not operating at enterprise level.

Business Scenario

Your architecture now looks like:

Internet
    |
ALB
    |
Frontend ECS
    |
Backend ECS
    |
RDS PostgreSQL

Imagine:

Website suddenly slow

Management asks:

Why?

Can students answer?

No.

Because they have:

No Metrics
No Logs
No Alerts

This lab fixes that.

Goal

Students will build:

Frontend ECS
        |
Backend ECS
        |
CloudWatch Logs
        |
Prometheus
        |
Grafana
        |
Alerting

Learning Objectives

Students will understand:

What monitoring is
What logging is
What metrics are
What alerts are
Difference between CloudWatch and Prometheus
Difference between logs and metrics
Root cause analysis

Enterprise Architecture

Users
   |
ALB
   |
Frontend ECS
   |
Backend ECS
   |
RDS
   |
--------------------
Monitoring Stack
--------------------
CloudWatch
Prometheus
Grafana
Alerting

Part 1 Understanding Logs

Example:

Backend error:

Database Connection Failed

How do we know?

Logs.

Example log:

2026-08-15 10:15:22 ERROR Database Connection Failed

Logs tell:

What happened
When
Where
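A log line in this shape is easy to take apart with code. The sketch below assumes the exact "date time LEVEL message" layout of the example above; `parseLogLine` and `onlyErrors` are made-up helper names, and the second sample line is invented.

```javascript
// Matches lines like: 2026-08-15 10:15:22 ERROR Database Connection Failed
const LOG_PATTERN = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$/;

function parseLogLine(line) {
  const match = LOG_PATTERN.exec(line);
  if (!match) return null; // line is not in the expected layout
  const [, timestamp, level, message] = match;
  return { timestamp, level, message };
}

// Keep only the entries a DevOps engineer usually looks at first.
function onlyErrors(lines) {
  return lines.map(parseLogLine).filter((e) => e && e.level === "ERROR");
}

const sampleLines = [
  "2026-08-15 10:15:20 INFO Server started",
  "2026-08-15 10:15:22 ERROR Database Connection Failed",
];

console.log(onlyErrors(sampleLines));
```

The parsed entry answers two of the three questions directly: what happened (message) and when (timestamp). The "where" usually comes from the log group or container name.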

What DevOps Engineers Use Logs For

Examples:

Application crash
Database failure
Memory issue
Security event

Without logs:

Guessing

With logs:

Evidence

Part 2 CloudWatch Logs

Open ECS Service.

Enable:

CloudWatch Logging

Container definition:

logConfiguration

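Example (the group, region, and prefix values are placeholders):

{
 "logDriver":"awslogs",
 "options":{"awslogs-group":"/backend","awslogs-region":"us-east-1","awslogs-stream-prefix":"ecs"}
}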

Why CloudWatch?

AWS automatically stores:

Container Logs
Application Logs
System Logs

Verify Logs

Open:

CloudWatch

Navigate:

Log Groups

Students should see:

/frontend

/backend

Exercise

Break API intentionally.

Example:

throw new Error("Database Failed");

Deploy.

Find error inside CloudWatch.

Lesson

Production engineers spend enormous amounts of time reading logs.

Part 3 Metrics

Logs tell:

What happened

Metrics tell:

How healthy system is

Examples:

CPU
Memory
Requests
Latency
Errors

Example

Website loads slowly.

Metrics show:

CPU = 98%

Problem found.

Part 4 Install Prometheus

Create ECS Service:

Prometheus

Container image:

prom/prometheus

Port:

9090

What Prometheus Does

Prometheus collects:

CPU
Memory
Requests
Response Time
Errors

every few seconds.

Think:

CloudWatch = AWS monitoring

Prometheus = Application monitoring

Architecture

Backend
   |
Prometheus

Prometheus continuously collects metrics.

Part 5 Expose Application Metrics

Backend application:

Install:

npm install prom-client

Add endpoint:

app.get("/metrics", ...)

Example:

http://backend:5000/metrics

What Happens?

Prometheus visits:

/metrics

every few seconds.

Collects:

Request Count
Errors
Response Time

Enterprise Concept

This is called:

Instrumentation

Applications must expose metrics.
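To see what Prometheus actually reads, here is a minimal sketch that builds the text format by hand using only Node's built-in http module. It is not the prom-client API. In the lab, prom-client does this work. The metric names and the `renderMetrics` and `createServer` functions are assumptions for illustration.

```javascript
const http = require("http");

// Two example counters; only the request counter is updated in this sketch.
const counters = { http_requests_total: 0, http_errors_total: 0 };

// Render counters in the Prometheus text exposition format.
function renderMetrics(values) {
  const lines = [];
  for (const [name, value] of Object.entries(values)) {
    lines.push(`# TYPE ${name} counter`);
    lines.push(`${name} ${value}`);
  }
  return lines.join("\n") + "\n";
}

// Build (but do not start) a server with /metrics next to the API route.
function createServer() {
  return http.createServer((req, res) => {
    if (req.url === "/metrics") {
      res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4" });
      res.end(renderMetrics(counters));
      return;
    }
    counters.http_requests_total += 1;
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify([{ name: "John" }]));
  });
}

// To try it locally: createServer().listen(5000) and open /metrics.
console.log(renderMetrics({ http_requests_total: 3, http_errors_total: 1 }));
```

Each scrape is just an HTTP GET that returns plain text like this. Prometheus stores the numbers with a timestamp.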

Part 6 Install Grafana

Create ECS Service:

grafana/grafana

Port:

3000

Open:

http://grafana:3000

Default:

admin
admin

What Grafana Does

Prometheus stores data.

Grafana visualizes data.

Example:

CPU Usage Chart
Memory Usage Chart
Request Count
Error Count

Architecture

Application
     |
Prometheus
     |
Grafana

Exercise

Create dashboard:

CPU
Memory
Requests
Errors

What DevOps Engineers Watch Daily

Infrastructure

CPU
Memory
Disk
Network

Application

Response Time
Errors
Requests
Availability

Database

Connections
Latency
Storage

Part 7 Alerting

Management does not want:

Engineer watching dashboard 24 hours

Need:

Automatic alerts

Example Alert

If:

CPU > 80%
for 5 minutes

Send:

Email
Slack
Teams
PagerDuty
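The "for 5 minutes" part matters: a single spike should not wake anyone up. A rough sketch of that rule follows, assuming one CPU sample per minute. `shouldFire` is a made-up function, not a Grafana or Prometheus API.

```javascript
// samples: [{ time: seconds, value: cpuPercent }], oldest first.
// Fires only when the value has stayed above the threshold
// continuously for at least forSeconds, up to the latest sample.
function shouldFire(samples, threshold, forSeconds) {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1];
  let breachStart = null;
  // Walk backwards until the first sample that is not above the threshold.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    breachStart = samples[i].time;
  }
  return breachStart !== null && latest.time - breachStart >= forSeconds;
}

// Helper for the examples: one sample every 60 seconds.
function everyMinute(values) {
  return values.map((value, i) => ({ time: i * 60, value }));
}

const shortSpike = everyMinute([50, 85, 90, 88, 92, 95]); // above 80% for 4 minutes
const sustained = everyMinute([85, 85, 90, 88, 92, 95]);  // above 80% for 5 minutes

console.log(shouldFire(shortSpike, 80, 300)); // false
console.log(shouldFire(sustained, 80, 300));  // true
```

Only the sustained case would send the Email, Slack, Teams, or PagerDuty notification.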

Create Alert

Grafana Alert:

CPU > 80%

Trigger:

Email Notification

Enterprise Example

2 AM:

Application crashes

Alert fires.

Engineer wakes up.

Problem fixed.

Part 8 CloudWatch Alarms

Create:

ECS CPU Alarm

Condition:

CPU > 75%

Action:

SNS Notification

Why CloudWatch Alarms?

AWS infrastructure monitoring.

Examples:

ECS
ALB
RDS
Lambda

Part 9 Load Testing

Install:

ApacheBench

k6

Generate:

100 requests
1000 requests

Observe:

CPU
Memory
Response Time

inside Grafana.

What Students Learn

Applications behave differently under load.

Part 10 Troubleshooting Exercise

Instructor intentionally breaks:

Scenario 1

Backend stopped.

Students must identify:

CloudWatch Logs
Grafana
Prometheus

Scenario 2

Database unavailable.

Students identify:

Connection failures

inside logs.

Scenario 3

High CPU.

Students identify:

Metric spike

inside Grafana.

Root Cause Analysis

DevOps engineers do not say:

System down.

They explain:

Backend container restarted.

Database connections exhausted.

CPU reached 95%.

Response time increased.

Service unavailable.

Student Deliverables

Monitoring

Screenshot:

Prometheus Targets

Grafana

Dashboard showing:

CPU
Memory
Requests
Errors

CloudWatch

Log Group screenshot.

Alerting

Alert rule screenshot.

Troubleshooting

Student documents:

Problem
Root Cause
Fix

What Students Have Learned So Far

After Lab 7:

GitHub
Docker
ECR
ECS
ALB
Terraform
GitHub Actions
Frontend
Backend
RDS
CloudWatch
Prometheus
Grafana
Alerting

At this point they understand how a real production application is built and monitored.

LAB 8 Preview

Lab 8 is usually where I introduce:

Security
Secrets Manager
IAM
WAF
SSL/TLS
Vulnerability Scanning
Trivy
Least Privilege

because the next question management asks is:

"The application works. Is it secure?"

That is where students begin learning DevSecOps.

LAB 8 – DevSecOps: Securing the Production Environment

This lab changes the mindset of students.

Up to now they learned:

GitHub
Docker
ECR
ECS
Terraform
RDS
GitHub Actions
Prometheus
Grafana
CloudWatch

The application works.

Management asks:

"What happens if we get hacked tomorrow?"

Most junior engineers focus on deployment.

Senior DevOps engineers focus on:

Availability
Reliability
Security
Compliance

Business Scenario

Current architecture:

Internet
   |
ALB
   |
Frontend ECS
   |
Backend ECS
   |
RDS

Potential problems:

Hardcoded passwords
Open Security Groups
Exposed API Keys
Vulnerable Docker Images
No SSL
No WAF
Overprivileged IAM Roles

Goal

Secure the entire application.

Final architecture:

Internet
   |
AWS WAF
   |
HTTPS (SSL)
   |
ALB
   |
Frontend ECS
   |
Backend ECS
   |
Secrets Manager
   |
RDS

Learning Objectives

Students will learn:

IAM Least Privilege
Secrets Manager
SSL/TLS
AWS WAF
Container Security
Trivy Scanning
Image Hardening
Security Groups
Compliance Concepts
Security in CI/CD

Part 1 Principle of Least Privilege

Most beginners create:

AdministratorAccess

for everything.

Bad practice.

Example

Developer needs:

Push Docker Images

Permission:

ECR Access

Only.

Not:

AdministratorAccess

Exercise

Current:

GitHub Actions User

Permissions:

AdministratorAccess

Students replace with:

AmazonEC2ContainerRegistryFullAccess

AmazonECS_FullAccess

Or even more restrictive custom policies.

Enterprise Rule

Always ask:

What is minimum access required?

Part 2 Secrets Manager

Current backend:

DB_PASSWORD="mypassword"

inside source code.

Very bad.

Problem

Repository leaked.

Now attacker knows:

Database Password
API Keys
Tokens

Solution

Store secrets in:

AWS Secrets Manager

Create Secret

Store:

DB_USERNAME
DB_PASSWORD
OPENAI_API_KEY

Backend Reads Secret

Instead of:

password="mypassword"

Use:

process.env.DB_PASSWORD

Enterprise Concept

Never hardcode:

Passwords
Tokens
Keys
Secrets

inside:

GitHub
Docker Images
Terraform Files

Part 3 Security Groups Review

Current:

0.0.0.0/0

everywhere.

Exercise

Review:

ALB

Allow:

80
443

from internet.

ECS

Allow:

80
5000

only from ALB SG.

RDS

Allow:

5432

only from Backend SG.

Enterprise Concept

Never expose database directly.

Bad:

Internet
     |
PostgreSQL

Good:

Internet
     |
ALB
     |
Backend
     |
Database

Part 4 Enable HTTPS

Current:

http://

Not encrypted.

Problem

Attacker can capture:

Passwords
Session IDs
Personal Data

Create Certificate

Use:

AWS Certificate Manager

Request certificate:

studentdomain.com

Attach To ALB

ALB Listener:

443 HTTPS

Result

Traffic becomes:

Encrypted

Enterprise Concept

Never expose login pages over HTTP.

Part 5 AWS WAF

What if someone sends:

1 million requests

or SQL Injection attempts?

Add WAF

Create:

AWS WAF

Attach to ALB.

Enable managed rules:

SQL Injection

Cross Site Scripting

Known Bad Inputs

Bot Protection

Enterprise Concept

WAF protects before requests reach application.

Part 6 Container Vulnerability Scanning

Current image:

FROM node:20

May contain vulnerabilities.

Install Trivy

Students install:

macOS:

brew install trivy

Debian/Ubuntu (after adding the Trivy apt repository):

sudo apt install trivy

Scan Image

trivy image backend:v1

Output:

CRITICAL
HIGH
MEDIUM
LOW

Exercise

Students identify:

High Vulnerabilities

and document findings.

Enterprise Concept

Every production image should be scanned before deployment.

Part 7 Security in GitHub Actions

Current:

Build
Push
Deploy

Add Security Stage

Pipeline:

Checkout
     |
Build
     |
Trivy Scan
     |
Push
     |
Deploy

Rule

If vulnerabilities found:

Pipeline Fails

Why?

Stop insecure software before production.
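The gate itself is simple logic. The sketch below assumes a JSON report shaped like Trivy's JSON output (Results, Vulnerabilities, Severity); treat that shape as an assumption and check it against a real report. The function names are made up and the CVE IDs are placeholders. In a real pipeline, Trivy's own --exit-code and --severity flags can do this without extra code.

```javascript
// Count findings per severity across all scanned targets.
function countBySeverity(report) {
  const counts = {};
  for (const result of report.Results || []) {
    for (const vuln of result.Vulnerabilities || []) {
      counts[vuln.Severity] = (counts[vuln.Severity] || 0) + 1;
    }
  }
  return counts;
}

// Return 1 (fail the pipeline) if any blocking severity is present, else 0.
function exitCodeFor(report, blocking = ["CRITICAL", "HIGH"]) {
  const counts = countBySeverity(report);
  return blocking.some((severity) => (counts[severity] || 0) > 0) ? 1 : 0;
}

// Placeholder report; the IDs are not real CVEs.
const sampleReport = {
  Results: [
    {
      Target: "backend:v1",
      Vulnerabilities: [
        { VulnerabilityID: "CVE-0000-0001", Severity: "HIGH" },
        { VulnerabilityID: "CVE-0000-0002", Severity: "LOW" },
      ],
    },
  ],
};

console.log(countBySeverity(sampleReport));
console.log(exitCodeFor(sampleReport));
```

A non-zero exit code is all a CI system needs to stop the Push and Deploy stages.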

Part 8 Docker Image Hardening

Bad:

FROM ubuntu

Huge image.

Better:

FROM node:20-alpine

Benefits

Smaller
Faster
Less Attack Surface

Enterprise Rule

Use smallest possible image.

Part 9 IAM Roles for ECS

Current:

Access Keys

inside containers.

Bad.

Solution

Create:

Task Role

Attach permissions.

Example:

Read Secrets Manager

Benefits

No access keys stored.

Enterprise Concept

Containers should use IAM Roles, not credentials.

Part 10 Logging Security Events

CloudWatch Logs.

Log:

Failed Login

Unauthorized Access

API Errors

Suspicious Requests

Exercise

Students create:

Security Log Group

Part 11 Compliance Discussion

Introduce:

SOC2
HIPAA
PCI DSS
ISO 27001

Example

Healthcare Application:

Need:

Encryption
Audit Logs
Access Controls

Banking Application

Need:

Least Privilege
Monitoring
Incident Response

Part 12 Security Incident Simulation

Scenario:

GitHub repository leaked

Students answer:

What secrets exposed?

What rotates?

What logs reviewed?

What systems affected?

Scenario 2

Database exposed.

Students answer:

How discovered?

How isolated?

How fixed?

Scenario 3

Critical Trivy finding.

Students answer:

Can deployment continue?

Who approves exception?

What DevOps Engineers Must Check Daily

IAM

Unused Users
Unused Roles
Excessive Permissions

Containers

Vulnerabilities
Outdated Images

Infrastructure

Open Ports
Public Resources

Secrets

Rotation
Expiration
Access

Student Deliverables

Security

Screenshots:

Secrets Manager

IAM Role

Security Groups

HTTPS Listener

WAF Rules

Container Security

Trivy Scan Report

CI/CD

Pipeline with:

Security Scan Stage

Documentation

Student writes:

Top 5 security risks

How risks were mitigated

What Students Have Learned After Lab 8

GitHub
Docker
ECR
ECS
Terraform
GitHub Actions
Frontend
Backend
RDS
CloudWatch
Prometheus
Grafana
Alerting
IAM
Secrets Manager
WAF
SSL/TLS
Trivy
DevSecOps

At this point students can build, deploy, monitor, and secure a production application.

LAB 9 Preview

Lab 9 is usually the capstone project:

High Availability
Auto Scaling
Blue/Green Deployment
Disaster Recovery
Multi-AZ
Backup Strategy
Route53
Production Readiness Review

This is where students learn how large companies keep applications available even when servers, containers, databases, or entire availability zones fail.

LAB 9 – Production Readiness, High Availability, Auto Scaling & Disaster Recovery

This is the capstone project.

Up to now students built:

Frontend
Backend
RDS
Terraform
GitHub Actions
Docker
ECS
Monitoring
Security

The application works.

The application is secure.

The application is monitored.

Management now asks:

"What happens if an Availability Zone fails?"

"What happens if traffic increases 100x?"

"What happens if a deployment breaks production?"

"What happens if somebody deletes the database?"

This is where real DevOps engineering begins.

Goal

Students will transform:

Good Application

into

Production-Ready Application

Final Architecture

Users
   |
Route53
   |
ALB
   |
-------------------
AZ-A      AZ-B
-------------------
Frontend  Frontend
Backend   Backend
-------------------
       |
RDS Multi-AZ
       |
Backups
       |
Monitoring

Learning Objectives

Students will learn:

High Availability
Multi-AZ Design
Auto Scaling
Blue/Green Deployment
Disaster Recovery
Backup Strategy
Route53
Production Readiness Reviews
SLA / SLO / Error Budgets

Part 1 High Availability

Current:

ALB
 |
Frontend
 |
Backend

Problem:

One container dies

Application unavailable.

Solution

Run multiple tasks.

Frontend:

Desired Tasks = 2

Backend:

Desired Tasks = 2

Architecture:

ALB
 |
 |------ Frontend 1
 |
 |------ Frontend 2

 |
 |------ Backend 1
 |
 |------ Backend 2

Enterprise Concept

Never deploy:

1 container

for production.

Minimum:

2 containers

across multiple AZs.

Exercise

Students stop:

Frontend Task 1

Verify:

Website still works

Part 2 Multi-AZ Deployment

Current:

One Availability Zone

Problem:

AWS AZ outage.

Entire application unavailable.

Solution

Deploy ECS into:

us-east-1a

us-east-1b

ALB:

AZ-A

AZ-B

Frontend:

AZ-A

AZ-B

Backend:

AZ-A

AZ-B

Enterprise Rule

Production workloads should span:

Multiple Availability Zones

Exercise

Students diagram:

AZ-A Failure

and explain:

Why application survives

Part 3 Auto Scaling

Business Problem

Traffic:

100 users

becomes:

10,000 users

One backend container cannot handle load.

ECS Auto Scaling

Create policy:

CPU > 70%

Scale:

2 Tasks -> 4 Tasks

Example

Current:

Backend x2

Traffic spike:

Backend x4

Automatically.
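As a rough illustration of a threshold policy, consider the toy function below. It is not how ECS Service Auto Scaling is implemented. The scale-in threshold, step size, and min/max limits are assumptions added to make the example complete; only the 70% scale-out threshold comes from the policy above.

```javascript
// Toy step policy: add tasks when CPU is high, remove one when it is low.
function nextTaskCount(currentTasks, cpuPercent, policy) {
  const { scaleOutAbove, scaleInBelow, step, min, max } = policy;
  if (cpuPercent > scaleOutAbove) {
    return Math.min(max, currentTasks + step);
  }
  if (cpuPercent < scaleInBelow) {
    return Math.max(min, currentTasks - 1);
  }
  return currentTasks; // inside the comfortable range: do nothing
}

// Assumed policy values, for illustration only.
const scalingPolicy = { scaleOutAbove: 70, scaleInBelow: 30, step: 2, min: 2, max: 10 };

console.log(nextTaskCount(2, 85, scalingPolicy)); // 4
console.log(nextTaskCount(4, 50, scalingPolicy)); // 4
console.log(nextTaskCount(4, 10, scalingPolicy)); // 3
```

The min and max limits are as important as the threshold: min protects availability, max protects the AWS bill.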

Enterprise Concept

Scaling should happen:

Automatically

not:

Engineer manually clicking buttons

Exercise

Generate traffic using:

k6

ApacheBench

Observe:

Scaling Event

inside ECS.

Part 4 Route53

Current:

alb-123.us-east-1.elb.amazonaws.com

Not professional.

Create Domain

Example:

studentproject.com

Route53:

A Record (Alias)

pointing to:

ALB

Result

Users visit:

https://studentproject.com

instead of:

https://alb-123.us-east-1.elb.amazonaws.com

Enterprise Concept

Customers never see AWS resource names.

Part 5 Blue/Green Deployment

Current Deployment

Version 1

Replace with:

Version 2

Risk:

Deployment breaks

Production down.

Blue Environment

Current Production

Green Environment

New Version

Architecture:

ALB
 |
Blue
 |
Version 1

ALB
 |
Green
 |
Version 2

Test Green

Verify:

Frontend
Backend
Database

working.

Switch Traffic

100%

moves to Green.

Rollback

If broken:

Traffic back to Blue

within seconds.
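The switch and the rollback can be modelled in a few lines. This is a toy model: `TrafficSwitch` and the health check callback are made up, and in AWS the switch happens at the load balancer or through the deployment service.

```javascript
// Toy model of blue/green: all traffic goes to exactly one environment.
class TrafficSwitch {
  constructor() {
    this.live = "blue";   // current production
    this.previous = null; // where to go back to
  }

  // Move 100% of traffic, but only if the new environment is healthy.
  switchTo(environment, isHealthy) {
    if (!isHealthy(environment)) return false;
    this.previous = this.live;
    this.live = environment;
    return true;
  }

  // Send traffic back to the environment that was live before.
  rollback() {
    if (this.previous === null) return false;
    [this.live, this.previous] = [this.previous, this.live];
    return true;
  }
}

const traffic = new TrafficSwitch();
const switched = traffic.switchTo("green", () => true); // Green passed its tests
const liveAfterSwitch = traffic.live;
traffic.rollback();                                     // something broke
const liveAfterRollback = traffic.live;

console.log(switched, liveAfterSwitch, liveAfterRollback); // true green blue
```

Rollback is fast because Blue was never touched. It is still running Version 1 the whole time.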

Enterprise Concept

Many large companies deploy this way.

Part 6 Database Backups

Question:

Database deleted

Now what?

RDS Automated Backups

Enable:

7 Day Retention

30 Day Retention

Create Snapshot

Manual Snapshot:

Pre-Release Backup

before deployment.

Exercise

Student documents:

Restore Procedure

Enterprise Rule

Every deployment should have:

Rollback Plan

Part 7 Disaster Recovery

Scenario:

Entire Region Fails

Example:

us-east-1 unavailable

Discussion

Recovery Options

Backup Restore

Hours

Pilot Light

Minimal Environment

running elsewhere.

Warm Standby

Reduced Environment

already running.

Multi-Region Active

Full Environment

in two regions.

Enterprise Concept

Recovery costs money.

Management chooses:

Cost
vs
Recovery Time

Part 8 SLA / SLO / Error Budget

Students learn:

SLA

Customer contract.

Example:

99.9%

availability.

SLO

Internal goal.

Example:

99.95%

availability.

Error Budget

Allowed downtime.

Example:

43 minutes/month

for 99.9%.

Exercise

Calculate:

99.9%
99.95%
99.99%

allowed downtime.
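The calculation is short enough to script. The sketch assumes a 30-day month, which is where the 43 minutes figure comes from; `allowedDowntimeMinutes` is a made-up helper name.

```javascript
// Allowed downtime = total minutes in the period x (1 - availability).
function allowedDowntimeMinutes(availabilityPercent, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in 30 days
  return totalMinutes * (1 - availabilityPercent / 100);
}

for (const target of [99.9, 99.95, 99.99]) {
  const minutes = allowedDowntimeMinutes(target).toFixed(1);
  console.log(`${target}% -> ${minutes} minutes per 30 days`);
}
```

Each extra "nine" cuts the budget by a factor of ten: roughly 43.2 minutes at 99.9%, 21.6 minutes at 99.95%, and 4.3 minutes at 99.99%.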

Part 9 Production Readiness Review

Before deployment students answer:

Architecture

Multi-AZ?

Monitoring

Prometheus?
Grafana?
Alerts?

Security

IAM?
Secrets Manager?
HTTPS?
WAF?

Backups

Snapshots?
Retention?

Scaling

Auto Scaling?

Disaster Recovery

Recovery Plan?

Part 10 Failure Simulation Day

Instructor intentionally breaks:

Scenario 1

Stop:

Backend Task

Students verify:

Application survives

Scenario 2

Deploy broken version.

Students:

Rollback

using Blue/Green.

Scenario 3

Database issue.

Students:

Restore Snapshot

Scenario 4

Traffic spike.

Students verify:

Auto Scaling

triggered.

What DevOps Engineers Must Think About

Junior Engineer:

Can I deploy?

Senior Engineer:

Can I recover?

Student Deliverables

Architecture Diagram

Route53
   |
ALB
   |
Frontend x2
   |
Backend x2
   |
RDS Multi-AZ

Auto Scaling

Screenshot:

Scaling Policy

Route53

Domain working.

Blue/Green

Deployment demo.

Backup

Snapshot screenshot.

Disaster Recovery

Written recovery plan.

Production Readiness Report

Students submit:

Architecture
Security
Monitoring
Scaling
Backups
Recovery
Risks

Final Result After Labs 1–9

Students have built:

GitHub
     |
GitHub Actions
     |
Terraform
     |
Docker
     |
ECR
     |
ECS Fargate
     |
ALB
     |
Route53
     |
Frontend
     |
Backend
     |
RDS
     |
CloudWatch
Prometheus
Grafana
     |
Secrets Manager
IAM
WAF
TLS
     |
Auto Scaling
Blue/Green
Backups
Disaster Recovery

This sequence gives students something most bootcamps miss:

They don't just learn Docker, Terraform, AWS, and GitHub separately—they see exactly how a developer's code becomes a production application and how DevOps engineers keep it running, secure, scalable, and recoverable.