<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oluwademilade Oyekanmi</title>
    <description>The latest articles on DEV Community by Oluwademilade Oyekanmi (@msoluwademilade).</description>
    <link>https://dev.to/msoluwademilade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F886672%2Fee04c93f-b8cf-40d2-93c0-66d01a236abe.png</url>
      <title>DEV Community: Oluwademilade Oyekanmi</title>
      <link>https://dev.to/msoluwademilade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/msoluwademilade"/>
    <language>en</language>
    <item>
      <title>Build Your First Serverless App on AWS — Step by Step</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Wed, 03 Dec 2025 08:03:17 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/-4jf9</link>
      <guid>https://dev.to/msoluwademilade/-4jf9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; You've written some code before and you're curious about AWS, but you've never touched serverless. By the end of this guide you'll have a real, deployed API with zero servers to manage.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is "serverless" really?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Serverless" doesn't mean there are no servers. It means &lt;em&gt;you&lt;/em&gt; don't have to think about them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, if you wanted your code to run on the internet, you had to rent a server (an EC2 instance on AWS, for example), keep it running 24/7, patch it, monitor it, and pay for it whether anyone was using your app or not. That's a lot of overhead.&lt;/p&gt;

&lt;p&gt;With serverless, you just upload your code. AWS handles provisioning, scaling, and maintenance. Your code runs only when it's triggered: when someone makes an API call, uploads a file, or inserts a record. When there's nothing to do, nothing runs and nothing is billed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional server&lt;/th&gt;
&lt;th&gt;Serverless&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Always-on, always billed&lt;/td&gt;
&lt;td&gt;✅ Pay only when code runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You manage OS patches&lt;/td&gt;
&lt;td&gt;✅ AWS manages all of that&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual scaling&lt;/td&gt;
&lt;td&gt;✅ Auto-scales to thousands of requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex setup&lt;/td&gt;
&lt;td&gt;✅ Deploy in minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;⚠️ Cold starts (small latency on first run)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;The AWS Free Tier is generous.&lt;/strong&gt; Lambda gives you 1 million free requests per month and 400,000 GB-seconds of compute, forever, not just for 12 months. The project we're building today will cost you essentially nothing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Our project: a URL shortener
&lt;/h2&gt;

&lt;p&gt;We'll build a minimal URL shortener. You send it a long URL, it gives you back a short code. You visit that code, it redirects you to the original URL.&lt;/p&gt;

&lt;p&gt;Simple enough to build in one sitting, but real enough to teach you how the pieces actually connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services we'll use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;API Gateway&lt;/strong&gt; — receives HTTP requests from the internet&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;AWS Lambda&lt;/strong&gt; — runs your business logic&lt;/li&gt;
&lt;li&gt;🗄 &lt;strong&gt;DynamoDB&lt;/strong&gt; — stores the short code → URL mapping&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Here's how the three services talk to each other. Read top to bottom; that's the path a request takes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User's browser or curl
    ↓  POST /shorten  { url: "https://example.com/very/long/path" }

[ API Gateway ]
  → validates the request, forwards to Lambda
  → you never expose Lambda directly to the internet

    ↓ triggers

[ Lambda function — Node.js 20 ]
  → generates a short code (e.g. "abc123")
  → writes { code → url } to DynamoDB
  → returns the short URL to the user

    ↓ reads / writes

[ DynamoDB table — "urls" ]
  partition key: shortCode (String)
  attribute:     originalUrl (String)
  TTL:           expiresAt (auto-deletes old records = saves money)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Why API Gateway in front of Lambda?&lt;/strong&gt; It lets you add rate limiting, authentication, and HTTPS without touching your function code. Lambda stays focused on business logic; the gateway handles the door.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step-by-step build
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Create the DynamoDB table
&lt;/h3&gt;

&lt;p&gt;Go to the AWS Console → &lt;strong&gt;DynamoDB&lt;/strong&gt; → &lt;strong&gt;Create table&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table name:&lt;/strong&gt; &lt;code&gt;urls&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition key:&lt;/strong&gt; &lt;code&gt;shortCode&lt;/code&gt; (String)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity mode:&lt;/strong&gt; On-demand (you pay per read and write rather than per provisioned hour; perfect for low or unpredictable traffic)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Leave everything else as default and hit Create.&lt;/p&gt;
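
&lt;p&gt;If you'd rather script the setup, the console choices above map one-to-one onto the parameters of &lt;code&gt;CreateTableCommand&lt;/code&gt; in the AWS SDK for JavaScript v3. A sketch of just the parameter object (the client call itself is omitted):&lt;/p&gt;

```javascript
// Table definition equivalent to the console steps above, in the shape
// expected by CreateTableCommand from @aws-sdk/client-dynamodb.
const createTableParams = {
  TableName: "urls",
  AttributeDefinitions: [
    { AttributeName: "shortCode", AttributeType: "S" } // S = String
  ],
  KeySchema: [
    { AttributeName: "shortCode", KeyType: "HASH" }    // partition key
  ],
  BillingMode: "PAY_PER_REQUEST" // on-demand: billed per request, never per hour
};

console.log(createTableParams.TableName); // "urls"
```

&lt;p&gt;Everything not listed here keeps its default, exactly like clicking Create in the console.&lt;/p&gt;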




&lt;h3&gt;
  
  
  Step 2 — Create an IAM role for Lambda
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;IAM → Roles → Create role&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trusted entity:&lt;/strong&gt; Lambda&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attach managed policy:&lt;/strong&gt; &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt; (gives Lambda permission to write logs to CloudWatch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add an inline policy&lt;/strong&gt; for DynamoDB (see below, don't skip this step)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The IAM inline policy (least privilege):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only give Lambda permission to do &lt;em&gt;exactly&lt;/em&gt; what it needs on &lt;em&gt;exactly&lt;/em&gt; your table, and nothing else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"dynamodb:PutItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"dynamodb:GetItem"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/urls"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ Replace &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt; with your actual 12-digit AWS account ID, and &lt;code&gt;us-east-1&lt;/code&gt; with whichever region you created your table in. These must match exactly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 3 — Create the Lambda function
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;Lambda → Create function&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; Node.js 20.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution role:&lt;/strong&gt; Use the IAM role you created in Step 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; 128 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout:&lt;/strong&gt; 5 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paste the following code into the inline editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PutItemCommand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;GetItemCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@aws-sdk/client-dynamodb&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Reuse the client across invocations — keeps cold starts short&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DynamoDBClient&lt;/span&gt;&lt;span class="p"&gt;({});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;urls&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;routeKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pathParameters&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// POST /shorten — create a new short URL&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;routeKey&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST /shorten&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// TTL: auto-delete this record after 90 days (cost saving)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PutItemCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;shortCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;originalUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;short&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// GET /{code} — resolve a short URL&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;routeKey&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET /{code}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GetItemCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;shortCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pathParameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Not found&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;301&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Location&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 4 — Create the API Gateway endpoint
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;API Gateway → Create API → HTTP API&lt;/strong&gt; (simpler and ~70% cheaper than REST API).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add two routes:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;POST /shorten&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GET /{code}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Integration:&lt;/strong&gt; Point both routes to your Lambda function&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Deploy&lt;/strong&gt; to a stage called &lt;code&gt;prod&lt;/code&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Copy the URL it gives you. That's your live API endpoint.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 5 — Test it
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Shorten a URL&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/shorten &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"url": "https://example.com/a/very/long/path"}'&lt;/span&gt;

&lt;span class="c"&gt;# Response: {"short": "/abc123"}&lt;/span&gt;

&lt;span class="c"&gt;# Resolve it (your browser will follow the redirect automatically)&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see a redirect to your original URL, you just shipped a serverless app. No server. No Docker. No SSH session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security best practices
&lt;/h2&gt;

&lt;p&gt;Security in serverless is different from traditional server security: there's no OS to harden and no SSH port to close. The risks shift, so the defences shift too.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Principle of least privilege (IAM)
&lt;/h3&gt;

&lt;p&gt;Your Lambda function's IAM role is its identity. Only give it permission to do exactly what it needs. In the policy above, we granted only &lt;code&gt;PutItem&lt;/code&gt; and &lt;code&gt;GetItem&lt;/code&gt; on one specific table, not &lt;code&gt;DeleteItem&lt;/code&gt;, not &lt;code&gt;Scan&lt;/code&gt;, not access to any other table or service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; if your function doesn't need a permission, don't give it that permission. An attacker who compromises your function can only do what the function is allowed to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Never hardcode secrets in your code
&lt;/h3&gt;

&lt;p&gt;If your function needs an API key or a password, don't paste it into the source code. Use &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; or &lt;strong&gt;Parameter Store&lt;/strong&gt;. Your function fetches the secret at runtime using its IAM role: no secrets in code, and no secrets in environment variables that show up in console screenshots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SSMClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;GetParameterCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@aws-sdk/client-ssm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ssm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SSMClient&lt;/span&gt;&lt;span class="p"&gt;({});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;param&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ssm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GetParameterCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/myapp/api-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;WithDecryption&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;// decrypts KMS-encrypted values automatically&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;param&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
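
&lt;p&gt;One refinement: module-scope variables survive between warm invocations (just like the DynamoDB client earlier), so you can cache the secret and only pay for the Parameter Store round trip on cold starts. A minimal sketch, where &lt;code&gt;loadSecret&lt;/code&gt; is a stand-in for the &lt;code&gt;GetParameterCommand&lt;/code&gt; call above:&lt;/p&gt;

```javascript
// Module-scope cache: populated on the first (cold-start) invocation,
// reused by every warm invocation afterwards.
let cachedApiKey;

async function getApiKey(loadSecret) {
  if (cachedApiKey === undefined) {
    cachedApiKey = await loadSecret(); // e.g. the SSM call shown above
  }
  return cachedApiKey;
}
```

&lt;p&gt;Passing the loader in as an argument keeps the sketch testable; in a real function you'd call SSM directly, and perhaps re-fetch after a TTL so rotated secrets get picked up.&lt;/p&gt;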



&lt;h3&gt;
  
  
  3. Validate all input
&lt;/h3&gt;

&lt;p&gt;Your Lambda is reachable from the internet via API Gateway, so anyone can send it anything. Always validate the shape and content of incoming data before using it: check that the URL is actually a URL, that required fields exist, and that strings aren't suspiciously long.&lt;/p&gt;
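
&lt;p&gt;For the shortener, that check fits in one small pure function that runs before anything touches DynamoDB. A sketch (the field name matches our API; the 2048-character limit is an arbitrary illustrative choice):&lt;/p&gt;

```javascript
// Returns a normalised URL string, or null if the request body
// should be rejected with a 400.
function validateShortenBody(rawBody) {
  let parsed;
  try {
    parsed = JSON.parse(rawBody); // throws on malformed JSON
  } catch {
    return null;
  }
  if (parsed === null) return null;
  if (typeof parsed.url !== "string") return null; // required field missing
  if (parsed.url.length > 2048) return null;       // suspiciously long
  try {
    const u = new URL(parsed.url); // throws if not actually a URL
    if (u.protocol !== "http:") {
      if (u.protocol !== "https:") return null;    // blocks javascript:, file:, ...
    }
    return u.href;
  } catch {
    return null;
  }
}

console.log(validateShortenBody('{"url":"https://example.com/a/b"}')); // "https://example.com/a/b"
console.log(validateShortenBody('{"url":"javascript:alert(1)"}'));     // null
```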

&lt;h3&gt;
  
  
  4. Enable CloudWatch logs and set a retention period
&lt;/h3&gt;

&lt;p&gt;Lambda logs every invocation to CloudWatch automatically. But by default, logs are kept forever (and charged per GB stored). Set a retention period: 7 or 30 days is usually enough.&lt;/p&gt;

&lt;p&gt;In the AWS Console: &lt;strong&gt;CloudWatch → Log groups → your function's log group → Actions → Edit retention setting&lt;/strong&gt;.&lt;/p&gt;
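
&lt;p&gt;If you prefer to script it, the same setting maps to the CloudWatch Logs &lt;code&gt;PutRetentionPolicy&lt;/code&gt; API. The parameter shape, with an illustrative function name:&lt;/p&gt;

```javascript
// Parameters for PutRetentionPolicyCommand (@aws-sdk/client-cloudwatch-logs).
// Lambda log groups follow the /aws/lambda/FUNCTION_NAME convention.
const retentionParams = {
  logGroupName: "/aws/lambda/url-shortener", // illustrative function name
  retentionInDays: 30                        // 7 or 30 days is usually enough
};

console.log(retentionParams.retentionInDays); // 30
```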




&lt;h2&gt;
  
  
  Keeping costs near zero
&lt;/h2&gt;

&lt;p&gt;Here's what this project actually costs at light traffic (~10,000 requests/month):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;What you're billed for&lt;/th&gt;
&lt;th&gt;Estimated cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda&lt;/td&gt;
&lt;td&gt;10k requests × 128 MB × ~100ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.00&lt;/strong&gt; (free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway (HTTP API)&lt;/td&gt;
&lt;td&gt;$1.00 per million requests&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.01&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (on-demand)&lt;/td&gt;
&lt;td&gt;~$1.25 per million writes, ~$0.25 per million reads&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.01&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.02 / month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
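
&lt;p&gt;The Lambda row is easy to sanity-check yourself. Compute is billed in GB-seconds: memory in GB multiplied by run time in seconds, summed over all invocations:&lt;/p&gt;

```javascript
// 10k requests a month at 128 MB and roughly 100 ms each:
const requests = 10000;
const memoryGb = 128 / 1024;  // 0.125 GB
const seconds = 0.1;          // ~100 ms per invocation
const gbSeconds = requests * memoryGb * seconds;

console.log(gbSeconds);           // 125 GB-seconds
console.log(400000 - gbSeconds);  // 399875 free-tier GB-seconds left over
```

&lt;p&gt;125 GB-seconds is about 0.03% of the always-free 400,000, which is why the table shows $0.00.&lt;/p&gt;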

&lt;h3&gt;
  
  
  Cost-saving choices we made and why
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;HTTP API over REST API in API Gateway&lt;/strong&gt;&lt;br&gt;
HTTP API costs 70% less than the older REST API and is more than sufficient for most use cases. REST API has additional features (request transformation, WAF integration) but you probably don't need them yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB on-demand capacity&lt;/strong&gt;&lt;br&gt;
With provisioned capacity, you pay for throughput 24/7 even when idle. On-demand billing means you pay per actual read and write, perfect for apps with unpredictable or low traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB TTL for automatic cleanup&lt;/strong&gt;&lt;br&gt;
We set an &lt;code&gt;expiresAt&lt;/code&gt; timestamp in the Lambda function. DynamoDB reads this attribute and automatically deletes old records for free, no scheduled jobs, no manual cleanup, no growing table size eating into your storage bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;128 MB Lambda memory&lt;/strong&gt;&lt;br&gt;
Lambda charges for memory × duration. 128 MB is the minimum, and for a simple DynamoDB read/write, it's more than enough. Only increase memory if your function is slow; more memory also means more CPU allocated to your function.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to build next
&lt;/h2&gt;

&lt;p&gt;You now have a working, secured, cost-optimised serverless app. Here are natural next steps; each one introduces a new piece of the AWS serverless ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a frontend with S3 + CloudFront&lt;/strong&gt;&lt;br&gt;
Upload a simple HTML form to an S3 bucket and serve it through CloudFront (AWS's CDN). Now your users have a UI instead of a curl command. S3 static hosting costs cents per GB stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add authentication with Cognito&lt;/strong&gt;&lt;br&gt;
Require users to sign in before shortening URLs. API Gateway can validate Cognito JWTs automatically, so your Lambda function never sees an unauthenticated request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move to infrastructure-as-code with AWS CDK or SAM&lt;/strong&gt;&lt;br&gt;
Right now you've clicked through the console. The next level is writing your infrastructure as code so you can deploy it repeatably, version it in git, and tear everything down cleanly when you're done experimenting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Here's what we built, and why each decision was made:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt; keeps Lambda off the public internet and gives you rate limiting and HTTPS for free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; runs only when triggered, no idle billing, no patching, no servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt; on-demand scales with your traffic and charges nothing when nobody's using your app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least-privilege IAM&lt;/strong&gt; means a compromised function can only do what the function is supposed to do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL on DynamoDB records&lt;/strong&gt; means old data cleans itself up automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the serverless model in its purest form. No servers. No patching. No idle billing. One Lambda function, two supporting AWS services, and a handful of IAM permissions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Follow along. The next post covers adding a frontend with S3 and CloudFront.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>beginners</category>
    </item>
    <item>
      <title>From Flask App to Production on AWS EKS: A Complete CI/CD Walkthrough</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Mon, 28 Apr 2025 11:23:00 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/-1n6</link>
      <guid>https://dev.to/msoluwademilade/-1n6</guid>
      <description>&lt;p&gt;I recently deployed a Flask REST API to Amazon EKS with a full CI/CD pipeline, PostgreSQL on Kubernetes, and Prometheus + Grafana monitoring — all wired together with GitHub Actions. This post is a step-by-step walkthrough of exactly how I did it.&lt;/p&gt;

&lt;p&gt;The app itself is a Titanic passenger API (a classic dataset) but the architecture is production-grade: containerized app, ECR image registry, EKS cluster, persistent storage, and alerting to Slack. Let me walk you through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;Here's the full stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App:&lt;/strong&gt; Python Flask REST API with PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerization:&lt;/strong&gt; Docker + Amazon ECR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; Amazon EKS (Kubernetes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD:&lt;/strong&gt; GitHub Actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring:&lt;/strong&gt; Prometheus + Grafana via Helm (kube-prometheus-stack)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting:&lt;/strong&gt; Slack via Alertmanager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The source code is at: &lt;a href="https://github.com/MsOluwademilade/titanic-test-app" rel="noopener noreferrer"&gt;github.com/MsOluwademilade/titanic-test-app&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Application
&lt;/h2&gt;

&lt;p&gt;The app is a CRUD REST API built with Flask and SQLAlchemy. It manages a &lt;code&gt;people&lt;/code&gt; table seeded with Titanic passenger data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /people&lt;/code&gt; — list all passengers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /people/&amp;lt;uuid&amp;gt;&lt;/code&gt; — get one passenger&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /people&lt;/code&gt; — add a passenger&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PUT /people/&amp;lt;uuid&amp;gt;&lt;/code&gt; — update a passenger&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DELETE /people/&amp;lt;uuid&amp;gt;&lt;/code&gt; — remove a passenger&lt;/li&gt;
&lt;/ul&gt;
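
&lt;p&gt;For example, a &lt;code&gt;POST /people&lt;/code&gt; request body mirrors the model's columns (the values here are illustrative, and the actual handler may treat some fields as optional):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "survived": 1,
  "passengerClass": 3,
  "name": "Margaret Brown",
  "sex": "female",
  "age": 44,
  "siblingsOrSpousesAboard": 0,
  "parentsOrChildrenAboard": 1,
  "fare": 27.72
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;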

&lt;p&gt;The data model (&lt;code&gt;src/models/person.py&lt;/code&gt;) maps to a PostgreSQL table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;__tablename__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;people&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;as_uuid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;primary_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;survived&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;passengerClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;siblingsOrSpousesAboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parentsOrChildrenAboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The app factory pattern in &lt;code&gt;src/app.py&lt;/code&gt; supports both &lt;code&gt;development&lt;/code&gt; and &lt;code&gt;production&lt;/code&gt; configs, pulling &lt;code&gt;DATABASE_URL&lt;/code&gt; from environment variables — which makes it Kubernetes-friendly from day one.&lt;/p&gt;
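
&lt;p&gt;As a minimal sketch of that environment-first lookup (the function name and default URL here are illustrative, not the repo's exact code):&lt;/p&gt;

```python
import os

# Hypothetical sketch of the config lookup inside the app factory; the real
# create_app() in src/app.py wires this into Flask, but the env handling is the idea.
DEFAULT_URLS = {
    "development": "postgresql+psycopg2://user:password@localhost:5432/postgres",
}

def resolve_database_url(config_name):
    # The environment wins, which is what makes the image Kubernetes-friendly:
    # the same container runs anywhere you can set DATABASE_URL.
    url = os.environ.get("DATABASE_URL") or DEFAULT_URLS.get(config_name)
    if url is None:
        raise RuntimeError("DATABASE_URL must be set for config %r" % config_name)
    return url
```

&lt;p&gt;In the cluster, the deployment manifest supplies &lt;code&gt;DATABASE_URL&lt;/code&gt;, so the development default never fires in production.&lt;/p&gt;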




&lt;h2&gt;
  
  
  Step 1: Dockerizing the App
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Dockerfile&lt;/code&gt; uses a &lt;code&gt;python:3.12-slim&lt;/code&gt; base image to keep things lean.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    build-essential &lt;span class="se"&gt;\
&lt;/span&gt;    libpq-dev &lt;span class="se"&gt;\
&lt;/span&gt;    python3-dev &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nv"&gt;Flask&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.2.5 &lt;span class="nv"&gt;SQLAlchemy&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.4.49 &lt;span class="nv"&gt;setuptools&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;70.0.0 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    pip uninstall &lt;span class="nt"&gt;-y&lt;/span&gt; psycopg2 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; psycopg2-binary&lt;span class="o"&gt;==&lt;/span&gt;2.9.9

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 5000&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; DATABASE_URL=postgresql+psycopg2://user:password@db:5432/postgres&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "run.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth noting here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We explicitly uninstall &lt;code&gt;psycopg2&lt;/code&gt; and reinstall &lt;code&gt;psycopg2-binary&lt;/code&gt; — this avoids compilation errors in slim images that don't have full build toolchains.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;DATABASE_URL&lt;/code&gt; env var set in the Dockerfile is just a default. In Kubernetes, we'll override it per-deployment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Testing locally with Docker Compose:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before pushing to EKS, test the full stack locally with &lt;code&gt;compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;postgres"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5000:5000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://user:password@db:5432/postgres&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;condition: service_healthy&lt;/code&gt; on the db dependency is important — it ensures Postgres is actually ready to accept connections before Flask tries to connect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit &lt;code&gt;http://localhost:5000/people&lt;/code&gt; to confirm it's working.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Push to Amazon ECR
&lt;/h2&gt;

&lt;p&gt;Before setting up the full pipeline, create your ECR repository manually (you only do this once):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr create-repository &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-name&lt;/span&gt; titanic-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;your-region&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the repository URI; you'll need it later as a GitHub secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Provision the EKS Cluster
&lt;/h2&gt;

&lt;p&gt;Create your cluster with &lt;code&gt;eksctl&lt;/code&gt; (the easiest way to get started):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl create cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; titanic-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;your-region&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodegroup-name&lt;/span&gt; titanic-nodes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-type&lt;/span&gt; t3.medium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provisions the control plane, worker nodes, and configures your local &lt;code&gt;kubeconfig&lt;/code&gt; automatically. The CI/CD pipeline will later call &lt;code&gt;aws eks update-kubeconfig&lt;/code&gt; to do the same in GitHub Actions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why t3.medium?&lt;/strong&gt; The kube-prometheus-stack (Prometheus + Grafana + Alertmanager) is resource-hungry. t3.small nodes will struggle. Budget for at least t3.medium if you're running monitoring.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 4: The Kubernetes Manifests
&lt;/h2&gt;

&lt;p&gt;All Kubernetes configs live in the &lt;code&gt;k8s/&lt;/code&gt; folder. Here's what each file does:&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;k8s/postgres-configmap.yaml&lt;/code&gt;&lt;/strong&gt; — Holds the SQL init script that creates the &lt;code&gt;people&lt;/code&gt; table and seeds initial data. Kubernetes mounts this as a volume into the Postgres container at &lt;code&gt;/docker-entrypoint-initdb.d/&lt;/code&gt;, which Postgres automatically runs on first start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;titanic-sql&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;init.sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS people (&lt;/span&gt;
      &lt;span class="s"&gt;uuid VARCHAR(255) PRIMARY KEY DEFAULT gen_random_uuid()::text,&lt;/span&gt;
      &lt;span class="s"&gt;survived INTEGER NOT NULL,&lt;/span&gt;
      &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="s"&gt;);&lt;/span&gt;
    &lt;span class="s"&gt;INSERT INTO people (...) VALUES (...);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;k8s/postgres-deployment.yaml&lt;/code&gt;&lt;/strong&gt; — Deploys Postgres with a PersistentVolumeClaim for durable storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;storageClassName: gp2&lt;/code&gt; tells EKS to provision an AWS EBS volume. This is what survives pod restarts and gives you actual persistence — without it, every time Postgres restarts you'd lose your data.&lt;/p&gt;

&lt;p&gt;The deployment mounts both the PVC (for data) and the ConfigMap (for init scripts):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-data&lt;/span&gt;
    &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/pgdata&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;init-script&lt;/span&gt;
    &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/docker-entrypoint-initdb.d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
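
&lt;p&gt;One detail worth double-checking: the official Postgres image writes to &lt;code&gt;/var/lib/postgresql/data&lt;/code&gt; by default, so a volume mounted at &lt;code&gt;/pgdata&lt;/code&gt; only buys you persistence if &lt;code&gt;PGDATA&lt;/code&gt; points at it. A sketch of the env entry (assuming the deployment doesn't already set it some other way):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  - name: PGDATA
    value: /pgdata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;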



&lt;p&gt;&lt;strong&gt;&lt;code&gt;k8s/postgres-service.yaml&lt;/code&gt;&lt;/strong&gt; — Exposes Postgres internally as a &lt;code&gt;ClusterIP&lt;/code&gt; service on port 5432. The Flask app reaches it via the service name &lt;code&gt;postgres&lt;/code&gt;.&lt;/p&gt;
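
&lt;p&gt;That service is only a few lines; roughly (the &lt;code&gt;selector&lt;/code&gt; label here is an assumption, and it must match whatever label the Postgres deployment actually uses):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  type: ClusterIP
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;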

&lt;h3&gt;
  
  
  Flask App
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;k8s/app-deployment.yaml&lt;/code&gt;&lt;/strong&gt; — Deploys the Flask app. Notice the &lt;code&gt;IMAGE_URI_PLACEHOLDER&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;titanic-app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IMAGE_URI_PLACEHOLDER&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+psycopg2://user:password@postgres:5432/postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI/CD pipeline replaces &lt;code&gt;IMAGE_URI_PLACEHOLDER&lt;/code&gt; with the actual ECR image URI using &lt;code&gt;sed&lt;/code&gt; before applying the manifest. This keeps the manifest clean in version control while allowing dynamic image URIs in the pipeline.&lt;/p&gt;
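
&lt;p&gt;The substitution step can be sketched like this (the file path and image URI are placeholders for the sketch; in the real workflow the target is &lt;code&gt;k8s/app-deployment.yaml&lt;/code&gt; and &lt;code&gt;IMAGE_URI&lt;/code&gt; comes from the earlier build step):&lt;/p&gt;

```shell
# Stand-in manifest line for the sketch; the pipeline runs sed on the real file.
echo 'image: IMAGE_URI_PLACEHOLDER' | tee /tmp/app-deployment.yaml
IMAGE_URI="123456789012.dkr.ecr.us-east-1.amazonaws.com/titanic-app:latest"
# Using | as the sed delimiter so the slashes in the URI need no escaping.
sed -i "s|IMAGE_URI_PLACEHOLDER|${IMAGE_URI}|g" /tmp/app-deployment.yaml
cat /tmp/app-deployment.yaml
```

&lt;p&gt;After the substitution, the pipeline applies the manifest with &lt;code&gt;kubectl apply&lt;/code&gt; as usual.&lt;/p&gt;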

&lt;p&gt;&lt;strong&gt;&lt;code&gt;k8s/app-service.yaml&lt;/code&gt;&lt;/strong&gt; — Exposes the Flask app as a &lt;code&gt;LoadBalancer&lt;/code&gt; service. EKS provisions an AWS ELB automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic hits port 80 on the load balancer and gets forwarded to port 5000 on the Flask container.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Monitoring with Prometheus &amp;amp; Grafana
&lt;/h2&gt;

&lt;p&gt;The monitoring stack lives in &lt;code&gt;monitoring/&lt;/code&gt;. Rather than writing raw Kubernetes manifests for Prometheus, we use the &lt;code&gt;kube-prometheus-stack&lt;/code&gt; Helm chart — it bundles Prometheus, Grafana, and Alertmanager and pre-configures them to scrape Kubernetes metrics out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;monitoring/prometheus-pvcs.yaml&lt;/code&gt;&lt;/strong&gt; — Creates PVCs for Prometheus and Alertmanager data persistence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus: 8Gi on gp2 EBS&lt;/span&gt;
&lt;span class="c1"&gt;# Alertmanager: 2Gi on gp2 EBS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;monitoring/prometheus-values.yaml&lt;/code&gt;&lt;/strong&gt; — This is where the interesting configuration lives.&lt;/p&gt;

&lt;p&gt;Grafana gets a LoadBalancer service so it's accessible externally, and persistence so dashboards survive pod restarts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;adminPassword&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-password"&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alertmanager is configured to route all alerts to Slack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;alertmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slack-notifications'&lt;/span&gt;
    &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slack-notifications'&lt;/span&gt;
        &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;api_url_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/alertmanager/secrets/alertmanager-slack-webhook/url&lt;/span&gt;
            &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#all-titanic-alert'&lt;/span&gt;
            &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;🚨&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Kubernetes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Alert&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.CommonLabels.alertname&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
            &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;{{ range .Alerts }}&lt;/span&gt;
              &lt;span class="s"&gt;*Alert:* {{ .Labels.alertname }}&lt;/span&gt;
              &lt;span class="s"&gt;*Severity:* {{ .Labels.severity }}&lt;/span&gt;
              &lt;span class="s"&gt;*Summary:* {{ .Annotations.summary }}&lt;/span&gt;
              &lt;span class="s"&gt;{{ end }}&lt;/span&gt;
            &lt;span class="na"&gt;send_resolved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Slack webhook URL is mounted from a Kubernetes secret (not hardcoded) — the &lt;code&gt;api_url_file&lt;/code&gt; path points to a mounted secret volume.&lt;/p&gt;
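
&lt;p&gt;For that &lt;code&gt;api_url_file&lt;/code&gt; path to exist, the secret has to be listed under &lt;code&gt;alertmanagerSpec.secrets&lt;/code&gt; in the chart values; kube-prometheus-stack mounts each listed secret at &lt;code&gt;/etc/alertmanager/secrets/&amp;lt;name&amp;gt;&lt;/code&gt;. The secret name below is inferred from the path above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;alertmanager:
  alertmanagerSpec:
    secrets:
      - alertmanager-slack-webhook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;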

&lt;p&gt;The &lt;code&gt;defaultRules&lt;/code&gt; section enables pre-built alert rules covering etcd, Kubernetes API server, node resources, pod health, storage, and more — all without writing a single PromQL rule yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;This is where everything ties together. The GitHub Actions workflow in &lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt; triggers on every push to &lt;code&gt;main&lt;/code&gt; and handles the full deploy sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AWS Authentication
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;aws-access-key-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
    &lt;span class="na"&gt;aws-secret-access-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
    &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_REGION }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS credentials are stored as GitHub repository secrets — never hardcoded.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Build &amp;amp; Push to ECR
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login-ecr&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build, tag, and push image to ECR&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ECR_REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.login-ecr.outputs.registry }}&lt;/span&gt;
    &lt;span class="na"&gt;ECR_REPOSITORY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ECR_REPOSITORY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;IMAGE_URI=$ECR_REGISTRY/$ECR_REPOSITORY:latest&lt;/span&gt;
    &lt;span class="s"&gt;docker build -t $IMAGE_URI .&lt;/span&gt;
    &lt;span class="s"&gt;docker push $IMAGE_URI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;login-ecr&lt;/code&gt; step outputs the registry URL, which we capture and use to construct the full image URI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Configure kubectl for EKS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update kubeconfig for EKS&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;aws eks update-kubeconfig \&lt;/span&gt;
      &lt;span class="s"&gt;--name ${{ secrets.EKS_CLUSTER_NAME }} \&lt;/span&gt;
      &lt;span class="s"&gt;--region ${{ secrets.AWS_REGION }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This authenticates the GitHub Actions runner to your EKS cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Deploy the Monitoring Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create monitoring namespace&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;kubectl get namespace monitoring &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 || \&lt;/span&gt;
    &lt;span class="s"&gt;kubectl create namespace monitoring&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apply Prometheus PVCs&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubectl apply -f monitoring/prometheus-pvcs.yaml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Prometheus Stack with Helm&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;helm repo add prometheus-community \&lt;/span&gt;
      &lt;span class="s"&gt;https://prometheus-community.github.io/helm-charts&lt;/span&gt;
    &lt;span class="s"&gt;helm repo update&lt;/span&gt;
    &lt;span class="s"&gt;helm upgrade --install prometheus \&lt;/span&gt;
      &lt;span class="s"&gt;prometheus-community/kube-prometheus-stack \&lt;/span&gt;
      &lt;span class="s"&gt;--namespace monitoring \&lt;/span&gt;
      &lt;span class="s"&gt;--values monitoring/prometheus-values.yaml \&lt;/span&gt;
      &lt;span class="s"&gt;--wait \&lt;/span&gt;
      &lt;span class="s"&gt;--timeout 10m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;helm upgrade --install&lt;/code&gt; is idempotent — it installs on first run and upgrades on subsequent runs. The &lt;code&gt;--wait --timeout 10m&lt;/code&gt; flags make the pipeline block until the Helm release is healthy before proceeding.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Grafana with IP Restriction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Grafana LoadBalancer Service with source restriction&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f -&lt;/span&gt;
    &lt;span class="s"&gt;apiVersion: v1&lt;/span&gt;
    &lt;span class="s"&gt;kind: Service&lt;/span&gt;
    &lt;span class="s"&gt;metadata:&lt;/span&gt;
      &lt;span class="s"&gt;name: prometheus-grafana&lt;/span&gt;
      &lt;span class="s"&gt;namespace: monitoring&lt;/span&gt;
      &lt;span class="s"&gt;annotations:&lt;/span&gt;
        &lt;span class="s"&gt;service.beta.kubernetes.io/aws-load-balancer-source-ranges: "${{ secrets.SOURCE_IP }}/32"&lt;/span&gt;
    &lt;span class="s"&gt;spec:&lt;/span&gt;
      &lt;span class="s"&gt;type: LoadBalancer&lt;/span&gt;
      &lt;span class="s"&gt;selector:&lt;/span&gt;
        &lt;span class="s"&gt;app.kubernetes.io/name: grafana&lt;/span&gt;
    &lt;span class="s"&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;aws-load-balancer-source-ranges&lt;/code&gt; annotation tells AWS to restrict inbound traffic to a specific IP — so Grafana isn't publicly accessible to the entire internet. The allowed IP is stored as a GitHub secret.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Deploy the App
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Replace image placeholder with actual ECR URI&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;IMAGE_URI="$ECR_REGISTRY/$ECR_REPOSITORY:latest"&lt;/span&gt;
    &lt;span class="s"&gt;sed -i "s|IMAGE_URI_PLACEHOLDER|$IMAGE_URI|g" k8s/app-deployment.yaml&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete existing postgres deployment&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;kubectl delete deployment postgres --ignore-not-found=true&lt;/span&gt;
    &lt;span class="s"&gt;kubectl wait --for=delete pod -l app=postgres --timeout=60s || true&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy application to EKS&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubectl apply -f k8s/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; command swaps the placeholder with the real ECR image URI before applying. The Postgres deletion step forces a clean redeployment so the init script runs fresh on every deploy.&lt;/p&gt;
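&lt;p&gt;For context, the placeholder pattern looks roughly like this in the manifest (a hypothetical excerpt; the actual file is &lt;code&gt;k8s/app-deployment.yaml&lt;/code&gt; in the repo):&lt;/p&gt;

```yaml
# Hypothetical excerpt of k8s/app-deployment.yaml. The pipeline's sed step
# rewrites IMAGE_URI_PLACEHOLDER before kubectl apply runs.
spec:
  template:
    spec:
      containers:
        - name: titanic-app
          image: IMAGE_URI_PLACEHOLDER
          ports:
            - containerPort: 5000
```

&lt;p&gt;Because the substitution happens on the runner's checkout, nothing in version control ever contains the account-specific registry URL.&lt;/p&gt;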




&lt;h2&gt;
  
  
  Step 7: Required GitHub Secrets
&lt;/h2&gt;

&lt;p&gt;Before the pipeline runs, configure these secrets in your GitHub repo settings under &lt;strong&gt;Settings → Secrets and variables → Actions&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM user access key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM user secret key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_REGION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;e.g. &lt;code&gt;us-east-1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ECR_REPOSITORY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ECR repo name (e.g. &lt;code&gt;titanic-app&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EKS_CLUSTER_NAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your EKS cluster name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SOURCE_IP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your IP for Grafana access restriction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The IAM user needs permissions for ECR (push images) and EKS (describe cluster + apply manifests).&lt;/p&gt;
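&lt;p&gt;A minimal policy sketch for that IAM user might look like the following (the action list here is an assumption; scope &lt;code&gt;Resource&lt;/code&gt; down to your own repository and cluster ARNs in practice):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PushToECR",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DescribeEKS",
      "Effect": "Allow",
      "Action": "eks:DescribeCluster",
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;Note that applying manifests with &lt;code&gt;kubectl&lt;/code&gt; also requires Kubernetes-side authorization for the IAM user, for example via the cluster's &lt;code&gt;aws-auth&lt;/code&gt; ConfigMap or an EKS access entry.&lt;/p&gt;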




&lt;h2&gt;
  
  
  Step 8: Verify the Deployment
&lt;/h2&gt;

&lt;p&gt;After the pipeline completes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check pods are running:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Get the app's load balancer URL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc titanic-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit the &lt;code&gt;EXTERNAL-IP&lt;/code&gt; in your browser — you should see the welcome message. Then test the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List all passengers&lt;/span&gt;
curl http://&amp;lt;EXTERNAL-IP&amp;gt;/people

&lt;span class="c"&gt;# Add a passenger&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://&amp;lt;EXTERNAL-IP&amp;gt;/people &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "survived": 1,
    "passengerClass": 1,
    "name": "Miss. Test User",
    "sex": "female",
    "age": 28.0,
    "siblingsOrSpousesAboard": 0,
    "parentsOrChildrenAboard": 0,
    "fare": 71.28
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Access Grafana:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc prometheus-grafana &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to &lt;code&gt;http://&amp;lt;EXTERNAL-IP&amp;gt;&lt;/code&gt; — Grafana will be pre-loaded with Kubernetes dashboards showing cluster health, pod resource usage, and any firing alerts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub Push → GitHub Actions
                  ├── Build Docker image
                  ├── Push to Amazon ECR
                  ├── Deploy monitoring stack (Helm → EKS)
                  │       ├── Prometheus (8Gi EBS)
                  │       ├── Grafana (5Gi EBS, LoadBalancer, IP-restricted)
                  │       └── Alertmanager (2Gi EBS) → Slack
                  └── Deploy app manifests → EKS
                            ├── Flask app (LoadBalancer, port 80 → 5000)
                            └── PostgreSQL (ClusterIP, 1Gi EBS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;gp2 storage class matters.&lt;/strong&gt; Every PVC in this setup uses &lt;code&gt;storageClassName: gp2&lt;/code&gt; — the default EKS storage class backed by AWS EBS. Without it, PVCs stay in &lt;code&gt;Pending&lt;/code&gt; state indefinitely and your pods never start.&lt;/p&gt;
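&lt;p&gt;A PVC with the storage class pinned looks roughly like this (a hypothetical sketch; sizes and names follow the architecture diagram above):&lt;/p&gt;

```yaml
# Hypothetical PVC sketch. On a cluster with no default StorageClass,
# omitting storageClassName leaves the claim Pending and the pod unscheduled.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  storageClassName: gp2        # EKS default, backed by AWS EBS
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 8Gi
```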

&lt;p&gt;&lt;strong&gt;Health checks prevent race conditions.&lt;/strong&gt; Without &lt;code&gt;condition: service_healthy&lt;/code&gt; in Docker Compose (and equivalent readiness probes in Kubernetes), your app will try to connect before Postgres is ready and crash on startup.&lt;/p&gt;
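&lt;p&gt;On the Kubernetes side, the equivalent is a readiness probe on the app container; a minimal sketch, assuming the app exposes its &lt;code&gt;/people&lt;/code&gt; endpoint on port 5000:&lt;/p&gt;

```yaml
# Hypothetical readiness probe for the Flask container. Kubernetes only
# routes Service traffic to the pod once this check succeeds.
readinessProbe:
  httpGet:
    path: /people
    port: 5000
  initialDelaySeconds: 5
  periodSeconds: 10
```

&lt;p&gt;If Postgres isn't up yet, the probe fails, the pod stays out of the Service endpoints, and Kubernetes keeps retrying instead of sending traffic to a crashing app.&lt;/p&gt;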

&lt;p&gt;&lt;strong&gt;&lt;code&gt;helm upgrade --install&lt;/code&gt; is pipeline-friendly.&lt;/strong&gt; It handles first-time install and subsequent upgrades in one idempotent command — no need to check whether a release already exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IP-restrict your admin interfaces.&lt;/strong&gt; Using &lt;code&gt;aws-load-balancer-source-ranges&lt;/code&gt; to lock down Grafana is a simple but effective security measure. Never expose monitoring dashboards to the public internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep secrets out of manifests.&lt;/strong&gt; The &lt;code&gt;IMAGE_URI_PLACEHOLDER&lt;/code&gt; pattern keeps image URIs out of version control. Combined with GitHub secrets for AWS credentials, nothing sensitive lives in the repo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon EKS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" rel="noopener noreferrer"&gt;kube-prometheus-stack Helm Chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws-actions/configure-aws-credentials" rel="noopener noreferrer"&gt;aws-actions/configure-aws-credentials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/MsOluwademilade/titanic-test-app" rel="noopener noreferrer"&gt;Project Source Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If this walkthrough helped you, drop a reaction or a comment, I'd love to hear how you're approaching Kubernetes deployments on AWS. And if you spot something that could be improved in the architecture, let's discuss it below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>🚀 Deploying a Flask API on AWS EC2 with Nginx &amp; Gunicorn: My Journey from Zero to Production</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Mon, 14 Apr 2025 12:56:48 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/deploying-a-flask-api-on-aws-ec2-with-nginx-gunicorn-my-journey-from-zero-to-production-3g37</link>
      <guid>https://dev.to/msoluwademilade/deploying-a-flask-api-on-aws-ec2-with-nginx-gunicorn-my-journey-from-zero-to-production-3g37</guid>
      <description>&lt;h2&gt;
  
  
  🚀 The Beginning: A Daunting Challenge
&lt;/h2&gt;

&lt;p&gt;If you had told me a few months ago that I’d be deploying a Flask API on an AWS EC2 instance, setting up a reverse proxy with Nginx, and configuring Gunicorn like a pro, I would have laughed and probably asked, &lt;em&gt;"Me? Deploying a full-fledged API? Abeg, be serious."&lt;/em&gt; 😅  &lt;/p&gt;

&lt;p&gt;But here we are. I did it. And I’m still in awe.  &lt;/p&gt;

&lt;p&gt;I recently started learning Python, and honestly, taking on this project felt like staring at a mountain with no climbing gear. It was intimidating. The thought of wiring everything together—Flask, EC2, Systemd, Gunicorn, Nginx—felt like an impossible task. But if there's anything I've learned, it's that when something looks impossible, you just have to dive in, make mistakes, consult AI (a lot! 😂), and smash bugs until it works.  &lt;/p&gt;

&lt;p&gt;This is the story of how I locked in, barely remembered to eat, and refused to sleep until I got my &lt;strong&gt;FunNumberAPI&lt;/strong&gt; up and running!  &lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Full source code available on GitHub:&lt;/strong&gt; &lt;a href="https://github.com/MsOluwademilade/FunNumberAPI" rel="noopener noreferrer"&gt;FunNumberAPI&lt;/a&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 The Mission: Build &amp;amp; Deploy a Number Classification API
&lt;/h2&gt;

&lt;p&gt;The goal was simple—at least, in theory.  &lt;/p&gt;

&lt;p&gt;I wanted to build an API that could take a number and tell you fun stuff about it:&lt;br&gt;&lt;br&gt;
✅ Is it &lt;strong&gt;Prime&lt;/strong&gt;?&lt;br&gt;&lt;br&gt;
✅ Is it &lt;strong&gt;Perfect&lt;/strong&gt;?&lt;br&gt;&lt;br&gt;
✅ Is it an &lt;strong&gt;Armstrong number&lt;/strong&gt;? (Yes, I had to Google what that was. 😅)&lt;br&gt;&lt;br&gt;
✅ Is it &lt;strong&gt;Odd or Even&lt;/strong&gt;?&lt;br&gt;&lt;br&gt;
✅ What's its &lt;strong&gt;digit sum&lt;/strong&gt;?&lt;br&gt;&lt;br&gt;
✅ Can we fetch a &lt;strong&gt;random fun fact&lt;/strong&gt; about it?  &lt;/p&gt;

&lt;p&gt;Sounds fun, right? Well, it was—until I had to &lt;strong&gt;deploy it&lt;/strong&gt;. That’s where the real battle began.  &lt;/p&gt;
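&lt;p&gt;As a rough sketch of what those checks involve (hypothetical helper names; the real implementation lives in &lt;code&gt;app.py&lt;/code&gt; in the repo):&lt;/p&gt;

```python
def is_prime(n: int) -> bool:
    """A number > 1 with no divisors other than 1 and itself."""
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def is_perfect(n: int) -> bool:
    """Equal to the sum of its proper divisors, e.g. 28 = 1+2+4+7+14."""
    return n > 1 and sum(i for i in range(1, n) if n % i == 0) == n

def is_armstrong(n: int) -> bool:
    """Equal to the sum of its digits, each raised to the digit count."""
    digits = [int(d) for d in str(abs(n))]
    return sum(d ** len(digits) for d in digits) == abs(n)

def digit_sum(n: int) -> int:
    return sum(int(d) for d in str(abs(n)))

def classify(n: int) -> dict:
    """Bundle all the fun properties into one response-shaped dict."""
    return {
        "number": n,
        "is_prime": is_prime(n),
        "is_perfect": is_perfect(n),
        "is_armstrong": is_armstrong(n),
        "parity": "even" if n % 2 == 0 else "odd",
        "digit_sum": digit_sum(n),
    }

print(classify(371))  # 371 is an Armstrong number: 3³ + 7³ + 1³ = 371
```

&lt;p&gt;The fun-fact lookup is the one piece this sketch leaves out, since that comes from an external API rather than arithmetic.&lt;/p&gt;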


&lt;h2&gt;
  
  
  🔥 The Struggle: A Battle with Errors
&lt;/h2&gt;

&lt;p&gt;Everything that could go wrong, &lt;strong&gt;went wrong.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;First, Flask was fine on my local machine, but when I tried to deploy it… boom. &lt;strong&gt;Errors left, right, and centre.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Then Gunicorn started misbehaving.&lt;br&gt;&lt;br&gt;
Then systemd refused to start my service.&lt;br&gt;&lt;br&gt;
Then NGINX acted like it had never met me before.  &lt;/p&gt;

&lt;p&gt;At some point, I asked myself, &lt;em&gt;"Why am I doing this again?"&lt;/em&gt; But the stubborn part of me refused to give up. So I dug deep, Googled endlessly, consulted AI (shoutout to my robotic mentor, ChatGPT 😆), and slowly started making progress.  &lt;/p&gt;


&lt;h2&gt;
  
  
  🔧 The Breakthrough: Step-by-Step Deployment
&lt;/h2&gt;

&lt;p&gt;After hours of debugging, frustration, and moments of victory, I finally got my Flask app &lt;strong&gt;LIVE&lt;/strong&gt; on an AWS EC2 instance! 🎉  &lt;/p&gt;

&lt;p&gt;Here’s how I did it:  &lt;/p&gt;
&lt;h3&gt;
  
  
  🛠 &lt;strong&gt;Step 1: Building the API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I built the backend using Flask, ensuring that it could classify numbers correctly and fetch fun facts. The full implementation, including all helper functions, can be found in my GitHub repository:  &lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;API Implementation:&lt;/strong&gt; &lt;a href="https://github.com/MsOluwademilade/FunNumberAPI/blob/main/app.py" rel="noopener noreferrer"&gt;FunNumberAPI/app.py&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;At this point, the API worked locally. But making it accessible to the world? That’s where the real battle began.&lt;/p&gt;
&lt;h3&gt;
  
  
  ☁ Step 2: Deploying to AWS EC2
&lt;/h3&gt;

&lt;p&gt;I launched an &lt;strong&gt;Ubuntu&lt;/strong&gt; EC2 instance, opened inbound rules for &lt;strong&gt;22 (SSH), 80 (HTTP), and 5000 (Custom TCP)&lt;/strong&gt;, and connected to the instance via SSH.  &lt;/p&gt;

&lt;p&gt;Next, I set up the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y # Update system
sudo apt install python3-pip nginx -y # Install Python &amp;amp; Pip
python3 -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
pip install flask gunicorn flask_cors requests # Install required libraries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I cloned my repo and gave the app a quick test run with the Flask development server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/MsOluwademilade/FunNumberAPI.git
cd FunNumberAPI
python3 app.py 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I served it with Gunicorn, bound to all interfaces on port 5000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gunicorn --bind 0.0.0.0:5000 app:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tested it: &lt;code&gt;http://&amp;lt;public-ip-of-ec2&amp;gt;:5000/api/classify-number?number=371&lt;/code&gt; ✅  &lt;/p&gt;

&lt;h3&gt;
  
  
  ⚙ Step 3: Setting Up systemd for Service Management
&lt;/h3&gt;

&lt;p&gt;To keep the API running in the background and restart it on failures, I created a systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo nano /etc/systemd/system/flask-app.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit] Description=Gunicorn instance to serve Flask app After=network.target

[Service] 
User=ubuntu 
WorkingDirectory=/home/ubuntu/FunNumberAPI ExecStart=/home/ubuntu/.venv/bin/gunicorn --workers 3 --bind 127.0.0.1:5000 app:app 
Restart=always

[Install] 
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, enabled and started the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl start flask-app
sudo systemctl enable flask-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, even if the server rebooted, my API would start automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Step 4: Configuring NGINX as a Reverse Proxy
&lt;/h3&gt;

&lt;p&gt;To expose my API on port 80 and avoid manually specifying port 5000, I configured NGINX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo nano /etc/nginx/sites-available/default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replaced the contents with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server {
    listen 80;
    server_name &amp;lt;public-ip-of-ec2&amp;gt;;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I validated the config and restarted NGINX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo nginx -t
sudo systemctl restart nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, my API was live on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://&amp;lt;public-ip-of-ec2&amp;gt;/api/classify-number?number=371
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🏆 Acknowledgements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Huge thanks to the HNG12 DevOps mentors for introducing this daunting yet rewarding challenge—it pushed me beyond my limits and taught me so much.&lt;/li&gt;
&lt;li&gt;Flask for making Python APIs a breeze 🍃&lt;/li&gt;
&lt;li&gt;Numbers API for the fun facts 🔢&lt;/li&gt;
&lt;li&gt;You, for checking out this project! 🎉&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🗨️ Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Deploying this API wasn’t just a technical challenge—it was a mental endurance test. There were moments when I felt stuck, times when I considered rewriting the entire thing, but pushing through taught me invaluable lessons about Flask, deployment, debugging, and perseverance.&lt;/p&gt;

&lt;p&gt;🚀 If you’re new to backend development and cloud deployment, take on projects that scare you. You’ll learn more than you ever thought possible.&lt;/p&gt;

&lt;p&gt;If you found this helpful, let’s connect! Drop a comment, share your own deployment struggles, or hit me up on &lt;a href="https://github.com/MsOluwademilade/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Let’s keep building! 🔥&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building an Isolated Application Environment on Linux (Without Docker)</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Sun, 09 Feb 2025 18:46:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/deploying-a-flask-api-on-aws-ec2-with-nginx-gunicorn-my-journey-from-zero-to-production-4de9</link>
      <guid>https://dev.to/aws-builders/deploying-a-flask-api-on-aws-ec2-with-nginx-gunicorn-my-journey-from-zero-to-production-4de9</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this guide, we will walk through the process of manually creating an isolated application environment on a Linux server without using Docker. By leveraging Linux namespaces, cgroups, chroot, and other system-level isolation techniques, we can simulate the core functionality of containers. These isolation mechanisms are essential for understanding how containers work and provide us with the ability to isolate applications from the host system.&lt;/p&gt;

&lt;p&gt;By the end of this, you will have created a lightweight, container-like environment on your Linux system, suitable for running applications securely without affecting the host system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a custom isolated environment for running applications.&lt;/li&gt;
&lt;li&gt;Use Linux namespaces to isolate processes, networking, and file systems.&lt;/li&gt;
&lt;li&gt;Use cgroups to limit CPU, memory, and disk usage.&lt;/li&gt;
&lt;li&gt;Use chroot or pivot_root to create a separate filesystem for applications.&lt;/li&gt;
&lt;li&gt;Ensure networking isolation so applications do not interfere with the host.&lt;/li&gt;
&lt;li&gt;Validate each step with relevant commands and screenshots.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An AWS Ubuntu EC2 instance &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make sure you have two terminals open:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One for running commands on the host.&lt;/li&gt;
&lt;li&gt;One for running commands inside the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the required tools inside the container before isolating the network:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3 seccomp iproute2 iputils-ping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Process Isolation with Namespaces
&lt;/h2&gt;

&lt;p&gt;The first step is to isolate processes within a custom namespace, making sure they don’t interfere with processes running on the host system.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Creating an Isolated Namespace
&lt;/h3&gt;

&lt;p&gt;To create an isolated environment for the processes, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;unshare &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="nt"&gt;--fork&lt;/span&gt; &lt;span class="nt"&gt;--mount-proc&lt;/span&gt; &lt;span class="nt"&gt;--mount&lt;/span&gt; /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does the above command do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;unshare&lt;/code&gt; command in Linux isolates a process from the host's namespaces, effectively creating a separate environment for the process to run in.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--pid&lt;/code&gt; flag creates a new PID namespace, meaning the process gets its own process ID tree, isolated from the host's process IDs.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--fork&lt;/code&gt; flag ensures that a new shell is forked within this isolated namespace, allowing you to interact with it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--mount-proc&lt;/code&gt; flag mounts a fresh &lt;code&gt;/proc&lt;/code&gt; filesystem that reflects process information within the new namespace rather than the host, ensuring complete process isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why is this useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It ensures that processes inside this new environment will not affect or be affected by processes outside of it on the host system. This is a key concept in containerisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To verify that the new namespace is active, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lsns | &lt;span class="nb"&gt;grep &lt;/span&gt;pid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a new PID namespace with a different process ID. This confirms that your new isolated process environment is working as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Filesystem Isolation with Chroot
&lt;/h2&gt;

&lt;p&gt;Next, we’ll set up a minimal filesystem within the isolated environment. This filesystem will allow us to run applications as if they are in a completely separate environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Setting Up the Root Filesystem
&lt;/h3&gt;

&lt;p&gt;Create a directory to serve as the root filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/my_container/rootfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we use the debootstrap tool to create a minimal Debian-based system inside this directory. This tool allows us to install the most basic set of packages needed to run a system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;debootstrap &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;debootstrap &lt;span class="nt"&gt;--variant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minbase stable ~/my_container/rootfs http://deb.debian.org/debian
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What do the above commands do?&lt;/strong&gt;&lt;br&gt;
The debootstrap tool sets up a minimal Debian-based system within a specified directory, providing a lightweight and isolated environment. By using the minbase variant, debootstrap includes only the most essential packages, significantly reducing the size of the setup while maintaining a functional minimal system. This approach is useful when creating isolated environments or containers where minimalism is crucial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check the contents of the root filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/my_container/rootfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see essential directories like &lt;code&gt;bin&lt;/code&gt;, &lt;code&gt;lib&lt;/code&gt;, &lt;code&gt;etc&lt;/code&gt;, and &lt;code&gt;usr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeutlhlht0wip4ymb6du.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeutlhlht0wip4ymb6du.png" alt="Image description" width="800" height="29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Mounting Required System Directories
&lt;/h3&gt;

&lt;p&gt;To make the system fully functional, we need to mount certain directories like &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;/sys&lt;/code&gt;, and &lt;code&gt;/dev&lt;/code&gt; inside the container's root filesystem. These directories are required for processes to function correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/my_container/rootfs/proc
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/my_container/rootfs/sys
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/my_container/rootfs/dev
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/my_container/rootfs/dev/pts

&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; proc proc ~/my_container/rootfs/proc
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;--rbind&lt;/span&gt; /sys ~/my_container/rootfs/sys
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;--rbind&lt;/span&gt; /dev ~/my_container/rootfs/dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;--rbind&lt;/span&gt; /dev/pts ~/my_container/rootfs/dev/pts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mkdir&lt;/code&gt; commands create the mount points, and the &lt;code&gt;mount&lt;/code&gt; commands make the host's &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;/sys&lt;/code&gt;, &lt;code&gt;/dev&lt;/code&gt;, and &lt;code&gt;/dev/pts&lt;/code&gt; available inside the container's root filesystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;br&gt;
Check active mounts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount | &lt;span class="nb"&gt;grep&lt;/span&gt; ~/my_container/rootfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command lists the active mounts. You should see the &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;/sys&lt;/code&gt;, &lt;code&gt;/dev&lt;/code&gt;, and &lt;code&gt;/dev/pts&lt;/code&gt; directories mounted correctly inside the container's root filesystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rwlostytt1rvukc7jvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rwlostytt1rvukc7jvf.png" alt="Image description" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Entering the Chroot Environment
&lt;/h3&gt;

&lt;p&gt;Now, we enter the chroot environment, which allows us to interact with the container as though it were a separate system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chroot&lt;/span&gt; ~/my_container/rootfs /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the following command to confirm that you're inside the isolated environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qijna58dsz0wyroeuph.png" alt="Image description" width="800" height="94"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Resource Management with Cgroups
&lt;/h2&gt;

&lt;p&gt;Cgroups allow us to limit the resources (like CPU, memory, and disk usage) that processes inside the container can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Limiting CPU Usage
&lt;/h3&gt;

&lt;p&gt;To limit the CPU usage of our isolated container, run the following commands as root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /sys/fs/cgroup
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /sys/fs/cgroup/my_container
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"50000 100000"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; /sys/fs/cgroup/my_container/cpu.max
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; /sys/fs/cgroup/my_container/cgroup.procs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What do these commands do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mkdir -p /sys/fs/cgroup/my_container&lt;/code&gt;: Creates a cgroup named &lt;code&gt;my_container&lt;/code&gt; (the parent &lt;code&gt;/sys/fs/cgroup&lt;/code&gt; normally already exists).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;echo "50000 100000" | tee /sys/fs/cgroup/my_container/cpu.max&lt;/code&gt;: Sets a CPU quota of 50,000 µs per 100,000 µs period, i.e. 50% of one CPU core.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;echo $$ | tee /sys/fs/cgroup/my_container/cgroup.procs&lt;/code&gt;: Places the current process (the shell) into the cgroup, so it and its children inherit the limit.&lt;/li&gt;
&lt;/ul&gt;
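&lt;p&gt;The two numbers in &lt;code&gt;cpu.max&lt;/code&gt; are a quota and a period in microseconds, so the resulting CPU share is simply their ratio. A quick sanity check (a sketch using the same values written above):&lt;/p&gt;

```shell
# cpu.max holds a quota and a period, both in microseconds: the cgroup may
# run for at most the quota within each period, here 50% of one core.
quota=50000
period=100000
echo "$((100 * quota / period))% of one CPU core"
```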

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;br&gt;
To check the CPU limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/my_container/cpu.max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the output &lt;code&gt;50000 100000&lt;/code&gt;, which confirms that the CPU limit has been applied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yiniqbleg237agktiyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yiniqbleg237agktiyv.png" alt="Image description" width="783" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Setting Memory Limits
&lt;/h3&gt;

&lt;p&gt;Now, let's limit the amount of memory the container can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /sys/fs/cgroup/my_container
&lt;span class="nb"&gt;echo &lt;/span&gt;268435456 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/my_container/memory.max
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/my_container/cgroup.procs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
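&lt;p&gt;The value &lt;code&gt;268435456&lt;/code&gt; is just 256 MiB expressed in bytes; computing it instead of hard-coding it makes the intent clear:&lt;/p&gt;

```shell
# 256 MiB in bytes: the value written to memory.max above.
echo $((256 * 1024 * 1024))   # prints 268435456
```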



&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;br&gt;
To check the memory limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/my_container/memory.max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;268435456&lt;/code&gt; (256 MiB) as the output, confirming the memory limit is set correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l69mzl8eng3l33p8ew5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l69mzl8eng3l33p8ew5.png" alt="Image description" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Restricting Disk I/O
&lt;/h3&gt;

&lt;p&gt;To restrict disk I/O for a specific process, use the following command. Run it on the host, not inside the container; type &lt;code&gt;exit&lt;/code&gt; to leave the chroot first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ionice -c 2 -n 7 -p &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lowers the disk I/O priority of the process identified by &lt;code&gt;&amp;lt;PID&amp;gt;&lt;/code&gt;, placing it in the best-effort class (&lt;code&gt;-c 2&lt;/code&gt;) at the lowest priority level (&lt;code&gt;-n 7&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to find the PID:&lt;/strong&gt;&lt;br&gt;
To find the process ID (PID) of the container, open a new terminal, SSH into your instance, run &lt;code&gt;sudo su&lt;/code&gt;, and use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ps aux | grep &amp;lt;container-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;br&gt;
To verify the disk I/O restrictions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ionice -p &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output indicating the disk I/O priority for the specified process: &lt;code&gt;best-effort: prio 7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hygbs490f745x32ko10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hygbs490f745x32ko10.png" alt="Image description" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;
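&lt;p&gt;For reference, &lt;code&gt;ionice&lt;/code&gt; has three scheduling classes: 1 = realtime (root only), 2 = best-effort (priorities 0, highest, through 7, lowest), and 3 = idle. Lowering your own priority needs no special privileges, so you can try the effect on a throwaway process (a small sketch, assuming util-linux &lt;code&gt;ionice&lt;/code&gt; is installed):&lt;/p&gt;

```shell
# Start a child shell in the best-effort class at the lowest priority (7)
# and have it report its own I/O scheduling class and priority.
ionice -c 2 -n 7 sh -c 'ionice -p $$'
```

If the kernel supports I/O priorities, this should print &lt;code&gt;best-effort: prio 7&lt;/code&gt;.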




&lt;h2&gt;
  
  
  Step 4: Security Hardening
&lt;/h2&gt;

&lt;p&gt;In this step, we will enhance the security of our isolated environment by restricting certain system calls using seccomp. Seccomp (short for Secure Computing Mode) is a Linux kernel feature that allows us to filter and block specific system calls, thereby minimizing the attack surface of the container.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Restricting System Calls (Seccomp)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Install Seccomp&lt;/strong&gt;&lt;br&gt;
First, we need to install the seccomp package, which provides the tools to create and apply system call filters. The Python script below also needs the libseccomp Python bindings, packaged as &lt;code&gt;python3-seccomp&lt;/code&gt; on Debian/Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt install seccomp python3-seccomp -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command ensures that the seccomp tools are available for use in our container, allowing us to restrict system calls and improve security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create the Seccomp Profile Script&lt;/strong&gt;&lt;br&gt;
Next, we will create a Python script that will apply the seccomp profile to our container environment. The script will load the seccomp profile from a file and enforce the rules we define, such as blocking certain system calls like &lt;code&gt;ptrace&lt;/code&gt;.&lt;br&gt;
Run the following command to create and edit the script &lt;code&gt;apply_seccomp.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt; apply_seccomp.py &amp;lt;&amp;lt; EOF
#!/usr/bin/python3
import seccomp
import sys
import json

# Load the profile
with open('seccomp_profile.json', 'r') as f:
    profile = json.load(f)

# Create a seccomp filter
f = seccomp.SyscallFilter(seccomp.ALLOW)

# Add the rules from our profile
for syscall in profile.get('syscalls', []):
    if syscall['action'] == 'SCMP_ACT_KILL':
        f.add_rule(seccomp.KILL, syscall['name'])

# Apply the filter
f.load()

# Execute the command provided as arguments
if len(sys.argv) &amp;gt; 1:
    import os
    os.execvp(sys.argv[1], sys.argv[1:])
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does this script do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script loads a seccomp profile from the seccomp_profile.json file, which contains the system call restrictions.&lt;/li&gt;
&lt;li&gt;It creates a seccomp filter and sets the default action to allow all system calls.&lt;/li&gt;
&lt;li&gt;The script then adds the specific rules from the profile. In this case, any system call marked with SCMP_ACT_KILL will be blocked and cause the process to terminate.&lt;/li&gt;
&lt;li&gt;The script then applies the filter to restrict the system calls.&lt;/li&gt;
&lt;li&gt;Finally, it executes the command provided as arguments to the script (if any), with the system call restrictions applied.&lt;/li&gt;
&lt;/ul&gt;
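&lt;p&gt;The script reads its rules from &lt;code&gt;seccomp_profile.json&lt;/code&gt;, which is not shown above. A minimal profile that blocks &lt;code&gt;ptrace&lt;/code&gt;, matching the two fields the script reads (&lt;code&gt;name&lt;/code&gt; and &lt;code&gt;action&lt;/code&gt;), might look like this (an assumed example shaped to fit this script, not a standard format):&lt;/p&gt;

```shell
# Hypothetical minimal profile: each entry names a syscall and an action;
# the script above kills the process for entries marked SCMP_ACT_KILL.
cat > seccomp_profile.json << 'EOF'
{
  "syscalls": [
    { "name": "ptrace", "action": "SCMP_ACT_KILL" }
  ]
}
EOF
```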

&lt;p&gt;&lt;strong&gt;3. Make the script executable&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod +x apply_seccomp.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command changes the file permissions of the script to allow it to be executed.&lt;br&gt;
&lt;strong&gt;4. Run the Seccomp Script&lt;/strong&gt;&lt;br&gt;
The filter only lives as long as the process that loads it, so give the script a command to run under the restrictions; here we start a new shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./apply_seccomp.py /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens here?&lt;/strong&gt;&lt;br&gt;
The script applies the seccomp profile to itself and then replaces itself with the command you passed, so that command (and anything it starts) inherits the restrictions defined in the &lt;code&gt;seccomp_profile.json&lt;/code&gt; file. If the profile blocks a system call such as &lt;code&gt;ptrace&lt;/code&gt;, any attempt to use it will kill the offending process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;br&gt;
To test whether &lt;code&gt;ptrace&lt;/code&gt; is blocked, run the following command in a shell started under the filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strace -e ptrace ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see an error message like: &lt;code&gt;Bad system call (core dumped)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv13eisckcbu8q2vm3zwf.png" alt="Image description" width="793" height="49"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Networking Isolation
&lt;/h2&gt;

&lt;p&gt;Networking isolation is an essential part of setting up a secure, isolated application environment. In this section, we will walk through how to create a network namespace, establish virtual network interfaces, and isolate networking between two different environments using Linux networking tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB&lt;/strong&gt;: Until stated otherwise, the commands in this section should be run on the host machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a Network Namespace
&lt;/h3&gt;

&lt;p&gt;In this first step, we create a network namespace named &lt;code&gt;my_net&lt;/code&gt;. A network namespace is a separate network environment with its own interfaces, IP addresses, and routing tables, independent of the host system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns add my_net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following command lists all the network namespaces on the system. After creating &lt;code&gt;my_net&lt;/code&gt;, it should appear in the list, verifying that the namespace was created successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip netns list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsafkwvx0l8159u5ax1mj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsafkwvx0l8159u5ax1mj.png" alt="Image description" width="726" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create Virtual Ethernet Interfaces
&lt;/h3&gt;

&lt;p&gt;We now create a pair of virtual Ethernet interfaces, &lt;code&gt;veth0&lt;/code&gt; and &lt;code&gt;veth1&lt;/code&gt;. A veth pair works like a virtual cable: whatever goes in one end comes out the other, so it can connect the host system to the network namespace we just created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip &lt;span class="nb"&gt;link &lt;/span&gt;add veth0 &lt;span class="nb"&gt;type &lt;/span&gt;veth peer name veth1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These commands display the details of the virtual interfaces &lt;code&gt;veth0&lt;/code&gt; and &lt;code&gt;veth1&lt;/code&gt;, confirming that they were created and are available for use. You should see both interfaces and their current states (e.g., UP or DOWN).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip link show veth0
ip link show veth1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Move veth0 into my_net
&lt;/h3&gt;

&lt;p&gt;Next, we move one end of the pair into the namespace, so that traffic entering &lt;code&gt;veth1&lt;/code&gt; on the host emerges at &lt;code&gt;veth0&lt;/code&gt; inside &lt;code&gt;my_net&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth0 netns my_net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Assign IP Addresses
&lt;/h3&gt;

&lt;p&gt;In this step, we assign IP addresses to the network interfaces, enabling them to communicate within the isolated network. These interfaces will allow the isolated network namespace to interact with other networks or devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assigning an IP Address to &lt;code&gt;veth0&lt;/code&gt; Inside the Network Namespace&lt;/strong&gt;&lt;br&gt;
First, we bring the &lt;code&gt;veth0&lt;/code&gt; interface up inside the &lt;code&gt;my_net&lt;/code&gt; network namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;my_net ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth0 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sudo ip netns exec my_net&lt;/code&gt;: Executes the command within the my_net network namespace.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip link set veth0 up&lt;/code&gt;: This command activates the veth0 interface, allowing it to participate in network communication. Until the interface is brought up, it cannot send or receive packets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This command assigns an IP address (&lt;code&gt;192.168.1.1/24&lt;/code&gt;) to the interface &lt;code&gt;veth0&lt;/code&gt; within the &lt;code&gt;my_net&lt;/code&gt; network namespace. By using &lt;code&gt;sudo ip netns exec my_net&lt;/code&gt;, we execute the command inside the &lt;code&gt;my_net&lt;/code&gt; namespace. This ensures that &lt;code&gt;veth0&lt;/code&gt; gets the specified IP address within the isolated environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ip netns exec my_net ip addr add 192.168.1.1/24 dev veth0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
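&lt;p&gt;The &lt;code&gt;/24&lt;/code&gt; suffix means the first 24 bits of the address identify the network, so &lt;code&gt;192.168.1.1&lt;/code&gt; and &lt;code&gt;192.168.1.2&lt;/code&gt; sit on the same &lt;code&gt;192.168.1.0/24&lt;/code&gt; subnet and can reach each other directly. The corresponding netmask can be derived with shell arithmetic:&lt;/p&gt;

```shell
# Convert a CIDR prefix length into a dotted-quad netmask.
prefix=24
mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
printf '%d.%d.%d.%d\n' \
  $(( (mask >> 24) & 255 )) $(( (mask >> 16) & 255 )) \
  $(( (mask >> 8) & 255 ))  $(( mask & 255 ))   # prints 255.255.255.0
```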






&lt;p&gt;&lt;strong&gt;Assigning an IP Address to &lt;code&gt;veth1&lt;/code&gt; on the Host System&lt;/strong&gt;&lt;br&gt;
On the host side, we assign the IP address &lt;code&gt;192.168.1.2/24&lt;/code&gt; to the peer interface &lt;code&gt;veth1&lt;/code&gt;. This allows &lt;code&gt;veth1&lt;/code&gt; to communicate with &lt;code&gt;veth0&lt;/code&gt; (and thus with the &lt;code&gt;my_net&lt;/code&gt; namespace) over the veth pair.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ip addr add 192.168.1.2/24 dev veth1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We activate veth1 by bringing it up. This allows the host system to use the interface to communicate with the isolated network namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ip link set veth1 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Test the Network Connectivity
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip netns &lt;span class="nb"&gt;exec &lt;/span&gt;my_net ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 192.168.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the interfaces are set up, we test the connectivity between the &lt;code&gt;my_net&lt;/code&gt; namespace and the host system by using the ping command. This sends three ping requests (&lt;code&gt;-c 3&lt;/code&gt;) from the &lt;code&gt;my_net&lt;/code&gt; namespace to the IP address &lt;code&gt;192.168.1.2&lt;/code&gt; (which is assigned to &lt;code&gt;veth1&lt;/code&gt;). If the setup is correct, you should see successful ping responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1v93bp2mi8hcgbz472h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1v93bp2mi8hcgbz472h.png" alt="Image description" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Verify Network Interfaces Inside the Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip netns exec my_net ip a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command displays the IP address and other network details for all interfaces inside the my_net network namespace. It should show veth0 with the IP address &lt;code&gt;192.168.1.1/24&lt;/code&gt; as expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Linking to a Container Network Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo ln -s /proc/&amp;lt;PID&amp;gt;/ns/net /var/run/netns/my_container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this step, we expose the network namespace of the running container process (use the PID found earlier) to the &lt;code&gt;ip&lt;/code&gt; tool. The &lt;code&gt;ln -s&lt;/code&gt; command creates a symbolic link to the process's network namespace under &lt;code&gt;/var/run/netns/&lt;/code&gt;, making it accessible for further network configuration (create &lt;code&gt;/var/run/netns&lt;/code&gt; first if it does not exist).&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Enter the Container with Network Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo nsenter --net=/var/run/netns/my_net -- chroot /root/my_container/rootfs /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we use nsenter to enter the network namespace of the container. This command allows us to run a shell (&lt;code&gt;/bin/bash&lt;/code&gt;) within the isolated network environment of the container. The &lt;code&gt;--net&lt;/code&gt; option tells &lt;code&gt;nsenter&lt;/code&gt; to use the network namespace we linked earlier, and the &lt;code&gt;chroot&lt;/code&gt; command changes the root directory to the container's root filesystem (&lt;code&gt;/root/my_container/rootfs&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Verify Network Configuration Inside the Container
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the container, running &lt;code&gt;ip a&lt;/code&gt; should show the network interfaces of the container, including the virtual interface (&lt;code&gt;veth0&lt;/code&gt;) that connects it to the network namespace. The IP address assigned to &lt;code&gt;veth0&lt;/code&gt; (&lt;code&gt;192.168.1.1/24&lt;/code&gt;) should also be visible.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jm6g3somycvopgdfw0x.png" alt="Image description" width="800" height="141"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Deploying an Application
&lt;/h2&gt;

&lt;p&gt;In this step, we'll deploy a simple application within our isolated environment and test its accessibility both from the host and within the isolated network. We will set up a basic Python web server to demonstrate this process.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Installing Dependencies
&lt;/h3&gt;

&lt;p&gt;Before running the application, recall that at the beginning of our setup we installed Python and its required dependencies inside the container. This ensures that Python is available to run our application in the isolated environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Running the Web Server
&lt;/h3&gt;

&lt;p&gt;To start the web server inside the isolated environment, run the following command within the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; http.server 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does this command do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;python3 -m http.server&lt;/code&gt;: Starts a simple HTTP server using Python's built-in library. It listens for incoming HTTP requests and serves files from the current directory.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;8080&lt;/code&gt;: Specifies that the web server should listen on port 8080.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Python web server will now run inside the isolated container, serving files on port 8080.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj258op3ktx6kp3tjnmmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj258op3ktx6kp3tjnmmi.png" alt="Image description" width="800" height="54"&gt;&lt;/a&gt;&lt;/p&gt;
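&lt;p&gt;If you want to script this check instead of keeping a terminal occupied, you can run the server in the background and probe it with &lt;code&gt;curl&lt;/code&gt; (a sketch run from the directory you want to serve; it assumes port 8080 is free):&lt;/p&gt;

```shell
# Serve the current directory in the background, check it answers, then stop it.
python3 -m http.server 8080 > /dev/null 2>&1 &
server_pid=$!
sleep 1                                    # give the server a moment to bind
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/
kill "$server_pid"
```

A &lt;code&gt;200&lt;/code&gt; status code confirms the directory listing is being served.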

&lt;p&gt;&lt;strong&gt;Why this is important:&lt;/strong&gt;&lt;br&gt;
By running the server in this isolated environment, we can observe how the container handles networking and whether the isolation works as expected.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Verifying the Web Server from the Host
&lt;/h3&gt;

&lt;p&gt;To verify that the web server is accessible from the host system (i.e., the machine running the container), run the following command on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl 192.168.1.1:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;curl&lt;/code&gt;: This command is used to transfer data from or to a server using various protocols, in this case, HTTP.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;192.168.1.1:8080&lt;/code&gt;: This is the IP address of &lt;code&gt;veth0&lt;/code&gt; inside the my_net network namespace, and &lt;code&gt;8080&lt;/code&gt; is the port where our web server is running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgko5z4903b0fejxm0p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgko5z4903b0fejxm0p8.png" alt="Image description" width="531" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The command should display the HTML served by the Python web server (a directory listing by default), confirming that the server inside the isolated container is reachable from the host system.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Verifying Network Isolation
&lt;/h3&gt;

&lt;p&gt;Next, to test the network isolation, run the following command on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does this command do?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;curl localhost:8080&lt;/code&gt;: This command tries to access the web server by targeting the &lt;code&gt;localhost&lt;/code&gt; address (i.e., the host system itself) on port 8080.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The command should fail to reach the web server. The server is running inside the isolated network namespace (&lt;code&gt;my_net&lt;/code&gt;), so it is not listening on the host's loopback interface; &lt;code&gt;localhost&lt;/code&gt; on the host therefore cannot reach it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoohap6a1erp8zuk4a5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoohap6a1erp8zuk4a5u.png" alt="Image description" width="800" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This confirms that our network isolation is working properly and the application is isolated within the container environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This guide demonstrated how to manually create an isolated application environment on Linux using namespaces, cgroups, chroot, and network isolation. By understanding these underlying concepts, you gain valuable insights into how containers like Docker work under the hood. &lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You for Reading!
&lt;/h2&gt;

&lt;p&gt;If you found this guide helpful, don’t forget to &lt;strong&gt;like&lt;/strong&gt;, &lt;strong&gt;comment&lt;/strong&gt;, and &lt;strong&gt;share&lt;/strong&gt;! Let me know if you have any questions or need further assistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy isolating!&lt;/strong&gt; 🚀&lt;/p&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>A Comprehensive Guide to AWS Monitoring with Prometheus and Grafana</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Sun, 09 Feb 2025 18:46:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/-386o</link>
      <guid>https://dev.to/aws-builders/-386o</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In the fast-paced world of DevOps, monitoring and observability are essential for ensuring system reliability and cost efficiency. This guide will walk you through the process of configuring Prometheus and Grafana for AWS monitoring, setting up AWS-specific alerts and integrating AWS Cost Exporter. Whether you’re a beginner or an intermediate DevOps engineer, this guide is designed to be easy to follow and implement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who is this Guide for?
&lt;/h2&gt;

&lt;p&gt;This guide is tailored for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beginners: Those new to DevOps and monitoring tools.&lt;/li&gt;
&lt;li&gt;Intermediate Engineers: Those looking to deepen their understanding of Prometheus, Grafana, and DORA metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you’ll be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy Prometheus and Grafana on a cloud server.&lt;/li&gt;
&lt;li&gt;Set up Node Exporter and Blackbox Exporter for system and uptime monitoring.&lt;/li&gt;
&lt;li&gt;Configure DORA metrics tracking for CI/CD pipelines.&lt;/li&gt;
&lt;li&gt;Set up an alerting system with Slack notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Set Up Prometheus and Monitoring Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Setting Up Prometheus
&lt;/h3&gt;

&lt;p&gt;Prometheus is an open-source monitoring system that collects and stores metrics as time series data. It’s like having a system constantly monitor your infrastructure, collecting performance data and alerting you before problems occur. Prometheus uses a pull-based model, scraping metrics from configured targets via HTTP endpoints.&lt;/p&gt;

&lt;p&gt;In this section, we’ll install Prometheus, configure it to collect data from different sources, and ensure it’s properly storing and retrieving metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create User&lt;/strong&gt;&lt;br&gt;
We need to create a dedicated user for Prometheus. This enhances security by limiting the permissions and access of the Prometheus service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo useradd --no-create-home --shell /bin/false prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create Directories&lt;/strong&gt;&lt;br&gt;
These directories will store Prometheus configuration files and data. Organizing them separately helps in managing and backing up data efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mkdir -p /etc/prometheus /var/lib/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Download Prometheus&lt;/strong&gt;&lt;br&gt;
This step involves downloading and extracting the Prometheus binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Copy Binaries&lt;/strong&gt;&lt;br&gt;
Copy the binaries to /usr/local/bin, making them easily accessible and executable from anywhere in the system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Copy the Configuration Files&lt;/strong&gt;&lt;br&gt;
These directories hold the web console templates and libraries that the Prometheus web UI serves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp -r prometheus-2.45.0.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Configure Prometheus&lt;/strong&gt;&lt;br&gt;
Edit &lt;code&gt;/etc/prometheus/prometheus.yml&lt;/code&gt; and define scrape jobs for monitoring. This configuration defines how often Prometheus scrapes data and what endpoints it monitors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']

  # Blackbox Exporter for HTTP endpoint checks
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - "https://example1.com"  # &amp;lt;your-url-here&amp;gt;
          - "https://example2.com"  # &amp;lt;your-url-here&amp;gt;
          - "https://example3.com"  # &amp;lt;your-url-here&amp;gt;
          - "https://example4.com"  # &amp;lt;your-url-here&amp;gt;
    relabel_configs:
      # Pass the original target URL as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the original URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the Blackbox Exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'github-exporter'
    static_configs:
      - targets: ['github-exporter:9118']
    labels:
      environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert_rules.yml"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Add the Alert Rules&lt;/strong&gt;&lt;br&gt;
Add the following rules to the &lt;code&gt;/etc/prometheus/alert_rules.yml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: blackbox_exporter_alerts
    rules:
      # Alert when an endpoint is down
      - alert: EndpointDown
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "A monitored URL is down"
          description: "{{ $labels.instance }} is unreachable (job: {{ $labels.job }})."

      # Alert when an endpoint is back up
      #- alert: EndpointUp
      #  expr: probe_success == 1
      #  for: 1m
      #  labels:
      #    severity: info
      #    target: "{{ $labels.instance }}"
      #  annotations:
      #    summary: "Endpoint is Back Online"
      #    description: "{{ $labels.instance }} is now reachable (job: {{ $labels.job }})."

      # High latency alert
      - alert: HighLatency
        expr: probe_duration_seconds &amp;gt; 1
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Latency"
          description: "{{ $labels.instance }} has high latency: {{ $value }}s."

      # SSL Certificate Expiry Alert (less than 7 days)
      - alert: SSLExpiry
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() &amp;lt; 86400 * 7
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "SSL Certificate Expiry Warning"
          description: "SSL certificate for {{ $labels.instance }} expires in less than 7 days."

  - name: node_exporter_alerts
    rules:
      # High CPU Usage Alert
      - alert: HighCPUUsage
        expr: avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 &amp;gt; 80
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High CPU Usage on {{ $labels.instance }}"
          description: "CPU usage has exceeded 80% for over 1 minute."

      # High Memory Usage Alert
      - alert: HighMemoryUsage
        expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 &amp;gt; 80
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Memory Usage on {{ $labels.instance }}"
          description: "Memory usage ({{ $value | humanize }}%) has exceeded 80% for over 1 minute."

      # High Disk Usage Alert
      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 &amp;lt; 20
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Disk Usage on {{ $labels.instance }}"
          description: "Disk space usage is critically high, less than 20% available."

      # High System Load Alert
      - alert: HighSystemLoad
        expr: node_load1 &amp;gt; (count(node_cpu_seconds_total{mode="user"}) * 1.5)
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High System Load on {{ $labels.instance }}"
          description: "System load ({{ $value }}) is too high compared to available CPU cores."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8. Set Permissions for Prometheus User&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This step ensures that the Prometheus user has the necessary access to its configuration and data files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;9. Create Systemd Service&lt;/strong&gt;&lt;br&gt;
Create a systemd service file at &lt;code&gt;/etc/systemd/system/prometheus.service&lt;/code&gt;. This service file ensures Prometheus runs as a background process and starts automatically on boot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;10. Enable and Start Prometheus&lt;/strong&gt;&lt;br&gt;
These commands reload the systemd manager configuration, enable Prometheus to start on boot, and start the Prometheus service immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;11. Check Prometheus Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wcwc7obd8vs56k7rlqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wcwc7obd8vs56k7rlqd.png" alt="Image description" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Setting Up Node Exporter
&lt;/h3&gt;

&lt;p&gt;Your computer or server is always running various processes — handling CPU load, managing memory, reading and writing to disks. But how do you keep track of these activities? Node Exporter acts as a sensor, continuously collecting system health data and making it available for Prometheus to analyze. It is a Prometheus exporter that provides detailed system metrics, including CPU, memory, disk I/O, and network statistics.&lt;/p&gt;

&lt;p&gt;We’ll install Node Exporter, connect it to Prometheus, and visualize key metrics like CPU usage, memory consumption, and disk space. This will help in spotting performance issues before they impact your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a Node Exporter User&lt;/strong&gt;&lt;br&gt;
Similar to what we did with Prometheus, creating a dedicated user for Node Exporter enhances security.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo useradd --no-create-home --shell /bin/false node_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Download Node Exporter&lt;/strong&gt;&lt;br&gt;
Download Node Exporter v1.6.1 and extract the files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Copy Binary&lt;/strong&gt;&lt;br&gt;
This step moves the executable to a system-wide directory and sets appropriate ownership.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create Systemd Service&lt;/strong&gt;&lt;br&gt;
Create /etc/systemd/system/node_exporter.service. This ensures Node Exporter runs as a background service and starts on boot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Enable and Start Node Exporter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Check Node Exporter Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status node_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplls86jzgdlmkxpv00j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplls86jzgdlmkxpv00j0.png" alt="Image description" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;
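&lt;p&gt;With Node Exporter running and scraped by Prometheus, a few PromQL expressions (built from the standard node_exporter metric names) surface the key signals at a glance. As a quick sketch you can paste into the Prometheus UI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU busy time (%) averaged over the last 2 minutes
100 * avg(rate(node_cpu_seconds_total{mode!="idle"}[2m]))

# Memory in use (%)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk space still available (%), ignoring tmpfs
100 * node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;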

&lt;h3&gt;
  
  
  3. Setting Up Blackbox Exporter
&lt;/h3&gt;

&lt;p&gt;Monitoring the health of your internal system is excellent, but what about services that users interact with, like websites and APIs? Blackbox Exporter is a tool that helps test whether these external services are reachable and responding correctly. It does this by simulating user interactions, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking if a website is online and loading correctly&lt;/li&gt;
&lt;li&gt;Measuring how long it takes for a webpage to respond&lt;/li&gt;
&lt;li&gt;Verifying whether a database or application can be reached over the network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll set up Blackbox Exporter to monitor critical services and ensure they stay accessible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a System User for Blackbox&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo useradd --no-create-home --shell /bin/false blackbox_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Download Blackbox Exporter&lt;/strong&gt;&lt;br&gt;
Blackbox Exporter is used to monitor the availability and response time of network services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar -xvf blackbox_exporter-0.24.0.linux-amd64.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Copy Binary and Set Permissions&lt;/strong&gt;&lt;br&gt;
This step moves the executable to a system-wide directory and sets appropriate ownership.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/
sudo chown blackbox_exporter:blackbox_exporter /usr/local/bin/blackbox_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create Blackbox Config Directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mkdir -p /etc/blackbox_exporter
sudo chown blackbox_exporter:blackbox_exporter /etc/blackbox_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Create Configuration&lt;/strong&gt;&lt;br&gt;
This configuration file defines how Blackbox Exporter should probe different types of services. Add the below config to &lt;code&gt;/etc/blackbox_exporter/blackbox.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}' 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Create Systemd Service&lt;/strong&gt;&lt;br&gt;
Create &lt;code&gt;/etc/systemd/system/blackbox_exporter.service&lt;/code&gt;. This ensures Blackbox Exporter runs as a background service and starts on boot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=blackbox_exporter
Group=blackbox_exporter
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox_exporter/blackbox.yml

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Enable and Start Blackbox Exporter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable blackbox_exporter
sudo systemctl start blackbox_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8. Check Blackbox Exporter Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status blackbox_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0fvuzvrxnwl5w1a3vha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0fvuzvrxnwl5w1a3vha.png" alt="Image description" width="800" height="116"&gt;&lt;/a&gt;&lt;/p&gt;
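&lt;p&gt;With probes running, the same blackbox_exporter metrics used by the alert rules earlier can be explored directly in the Prometheus UI, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1 if the last probe succeeded, 0 if the endpoint is down
probe_success

# End-to-end probe duration in seconds
probe_duration_seconds

# Days until the TLS certificate expires
(probe_ssl_earliest_cert_expiry - time()) / 86400
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;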

&lt;h3&gt;
  
  
  4. Setting Up Grafana
&lt;/h3&gt;

&lt;p&gt;Staring at rows of numbers can be overwhelming — Grafana turns those numbers into beautiful, easy-to-read dashboards. It connects to Prometheus and helps visualize performance trends, making it easier to understand what’s happening in your system at a glance.&lt;/p&gt;

&lt;p&gt;In this section, we’ll install Grafana, configure it to pull data from Prometheus and create dashboards that display critical system and application performance metrics. By the end, you’ll have real-time, interactive charts showing exactly how your infrastructure is performing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Import Grafana GPG Key&lt;/strong&gt;&lt;br&gt;
Importing the GPG key ensures the authenticity of the Grafana packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg &amp;gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Add Repository&lt;/strong&gt;&lt;br&gt;
Adding the Grafana repository allows you to install Grafana with apt-get.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Update and Install Grafana&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt-get install grafana-enterprise
sudo apt-get install grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Enable and Start Grafana&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Check Grafana Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcd88zbngr4u4amy82b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcd88zbngr4u4amy82b4.png" alt="Image description" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Setting Up GitHub Exporter for DORA Metrics
&lt;/h3&gt;

&lt;p&gt;If you’re building software, you want to know how efficiently your team delivers updates. That’s where DORA (DevOps Research and Assessment) metrics come in. These four key metrics help measure software delivery performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deployment Frequency (DF) — How often new code is deployed&lt;/li&gt;
&lt;li&gt;Lead Time for Changes (LTC) — How long it takes for a code change to go live&lt;/li&gt;
&lt;li&gt;Change Failure Rate (CFR) — How often deployments cause problems&lt;/li&gt;
&lt;li&gt;Mean Time to Recovery (MTTR) — How quickly issues are fixed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;GitHub doesn’t provide these insights directly, so we use GitHub Exporter, which collects data from GitHub repositories and makes it available to Prometheus. We’ll set up GitHub Exporter, connect it to Prometheus, and visualize DORA metrics in Grafana to track and improve software delivery speed and reliability.&lt;/p&gt;
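&lt;p&gt;The arithmetic behind these metrics is simple. As a minimal sketch with hypothetical, hardcoded run data (the exporter script below pulls real runs from the GitHub API instead), change failure rate and lead time can be computed like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

# Hypothetical workflow runs; the real exporter fetches these from GitHub
runs = [
    {"conclusion": "success", "commit": "2025-02-01T10:00:00Z", "deployed": "2025-02-01T10:30:00Z"},
    {"conclusion": "failure", "commit": "2025-02-02T09:00:00Z", "deployed": "2025-02-02T09:20:00Z"},
    {"conclusion": "success", "commit": "2025-02-03T08:00:00Z", "deployed": "2025-02-03T09:00:00Z"},
]
fmt = "%Y-%m-%dT%H:%M:%SZ"

# Change Failure Rate: failed deployments as a share of all deployments
cfr = 100 * sum(r["conclusion"] == "failure" for r in runs) / len(runs)

# Lead Time for Changes: average commit-to-deployment time of successful runs
lead_times = [(datetime.strptime(r["deployed"], fmt) - datetime.strptime(r["commit"], fmt)).total_seconds()
              for r in runs if r["conclusion"] == "success"]
ltc = sum(lead_times) / len(lead_times)

print(f"CFR: {cfr:.1f}%")   # CFR: 33.3%
print(f"LTC: {ltc:.0f}s")   # LTC: 2700s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;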

&lt;p&gt;&lt;strong&gt;1. Install dependencies&lt;/strong&gt;&lt;br&gt;
These dependencies are necessary for running the GitHub Exporter script&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install -y python3.12 python3-pip
sudo mkdir -p /opt/github_exporter
cd /opt/github_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create GitHub Exporter Script&lt;/strong&gt;&lt;br&gt;
Create &lt;code&gt;/opt/github_exporter/github_exporter.py&lt;/code&gt;. This script fetches deployment data from GitHub and exposes it as Prometheus metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import requests
from datetime import datetime, timedelta
from prometheus_client import start_http_server, Gauge, Counter
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger('github-metrics-exporter')

# GitHub API Configuration with hardcoded token
GITHUB_TOKEN = "your_github_token_here"  # Replace with your actual GitHub token; keep real tokens out of version control

# GitHub repositories to monitor
REPOS = [
    {"owner": "&amp;lt;repo-owner&amp;gt;", "repo": "&amp;lt;repo-name&amp;gt;"},
    {"owner": "&amp;lt;repo-owner&amp;gt;", "repo": "&amp;lt;repo-name&amp;gt;"}
]

# API headers
HEADERS = {
    'Authorization': f'token {GITHUB_TOKEN}',
    'Accept': 'application/vnd.github.v3+json'
}

# Prometheus metrics
deployment_frequency = Counter('github_deployment_frequency_total', 
                             'Total number of deployments', 
                             ['repository'])

lead_time_for_changes = Gauge('github_lead_time_for_changes_seconds', 
                             'Time from commit to production in seconds', 
                             ['repository'])

change_failure_rate = Gauge('github_change_failure_rate_percent', 
                           'Percentage of deployments that failed', 
                           ['repository'])

mean_time_to_restore = Gauge('github_mean_time_to_restore_seconds', 
                            'Mean time to recover from failures in seconds', 
                            ['repository'])

class GitHubMetricsCollector:
    def __init__(self, repos, headers):
        self.repos = repos
        self.headers = headers

    def get_workflows(self, owner, repo):
        """Get all workflows for a repository"""
        url = f"https://api.github.com/repos/{owner}/{repo}/actions/workflows"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get workflows: {response.status_code}, {response.text}")
            return []
        return response.json().get('workflows', [])

    def get_workflow_runs(self, owner, repo, workflow_id, time_period_days=30):
        """Get workflow runs for a specific workflow"""
        since_date = (datetime.now() - timedelta(days=time_period_days)).isoformat()
        url = f"https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs?created=&amp;gt;{since_date}&amp;amp;per_page=100"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get workflow runs: {response.status_code}, {response.text}")
            return []
        return response.json().get('workflow_runs', [])

    def get_commit_data(self, owner, repo, sha):
        """Get data for a specific commit"""
        url = f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get commit data: {response.status_code}, {response.text}")
            return None
        return response.json()

    def calculate_deployment_frequency(self, owner, repo):
        """Calculate deployment frequency"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]

        total_deployments = 0
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            successful_deployments = [r for r in runs if r['conclusion'] == 'success']
            total_deployments += len(successful_deployments)

        deployment_frequency.labels(repository=f"{owner}/{repo}").inc(total_deployments)
        logger.info(f"[{owner}/{repo}] Deployment Frequency: {total_deployments} deployments")
        return total_deployments

    def calculate_lead_time_for_changes(self, owner, repo):
        """Calculate lead time for changes"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]

        lead_times = []
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            successful_deployments = [r for r in runs if r['conclusion'] == 'success']

            for run in successful_deployments:
                commit_sha = run.get('head_sha')
                if not commit_sha:
                    continue

                commit_data = self.get_commit_data(owner, repo, commit_sha)
                if not commit_data:
                    continue

                commit_time = datetime.strptime(commit_data['commit']['author']['date'], 
                                               "%Y-%m-%dT%H:%M:%SZ")
                deployment_time = datetime.strptime(run['updated_at'], 
                                                  "%Y-%m-%dT%H:%M:%SZ")

                lead_time = (deployment_time - commit_time).total_seconds()
                lead_times.append(lead_time)

        if lead_times:
            avg_lead_time = sum(lead_times) / len(lead_times)
            lead_time_for_changes.labels(repository=f"{owner}/{repo}").set(avg_lead_time)
            logger.info(f"[{owner}/{repo}] Lead Time for Changes: {avg_lead_time:.2f} seconds")
            return avg_lead_time
        return 0

    def calculate_change_failure_rate(self, owner, repo):
        """Calculate change failure rate"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]

        total_deployments = 0
        failed_deployments = 0

        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            total_deployments += len(runs)
            failed_deployments += len([r for r in runs if r['conclusion'] == 'failure'])

        if total_deployments &amp;gt; 0:
            failure_rate = (failed_deployments / total_deployments) * 100
            change_failure_rate.labels(repository=f"{owner}/{repo}").set(failure_rate)
            logger.info(f"[{owner}/{repo}] Change Failure Rate: {failure_rate:.2f}%")
            return failure_rate
        return 0

    def calculate_mttr(self, owner, repo):
        """Calculate Mean Time to Restore"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]

        recovery_times = []

        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            runs.sort(key=lambda x: datetime.strptime(x['created_at'], "%Y-%m-%dT%H:%M:%SZ"))

            # Find failure-success sequences
            for i in range(1, len(runs)):
                if runs[i-1]['conclusion'] == 'failure' and runs[i]['conclusion'] == 'success':
                    failure_time = datetime.strptime(runs[i-1]['updated_at'], "%Y-%m-%dT%H:%M:%SZ")
                    recovery_time = datetime.strptime(runs[i]['updated_at'], "%Y-%m-%dT%H:%M:%SZ")

                    time_to_restore = (recovery_time - failure_time).total_seconds()
                    recovery_times.append(time_to_restore)

        if recovery_times:
            mttr = sum(recovery_times) / len(recovery_times)
            mean_time_to_restore.labels(repository=f"{owner}/{repo}").set(mttr)
            logger.info(f"[{owner}/{repo}] Mean Time to Restore: {mttr:.2f} seconds")
            return mttr
        return 0

    def collect_metrics(self):
        """Collect all metrics for all repositories"""
        for repo_info in self.repos:
            owner = repo_info['owner']
            repo = repo_info['repo']

            logger.info(f"Collecting metrics for {owner}/{repo}")

            try:
                self.calculate_deployment_frequency(owner, repo)
                self.calculate_lead_time_for_changes(owner, repo)
                self.calculate_change_failure_rate(owner, repo)
                self.calculate_mttr(owner, repo)
            except Exception as e:
                logger.error(f"Error collecting metrics for {owner}/{repo}: {str(e)}")

def main():
    # Start Prometheus HTTP server
    port = 9118
    start_http_server(port)
    logger.info(f"Server started on port {port}")

    collector = GitHubMetricsCollector(REPOS, HEADERS)

    # Collect metrics every 15 minutes
    collection_interval = 15 * 60  # 15 minutes in seconds

    while True:
        collector.collect_metrics()
        logger.info(f"Metrics collection completed. Next collection in {collection_interval} seconds")
        time.sleep(collection_interval)

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
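&lt;p&gt;The MTTR logic above pairs each failed run with the next successful one and averages the gaps. As a quick standalone illustration of that pairing (with made-up run data, not part of the exporter itself):&lt;br&gt;
&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical workflow runs, oldest first, using the same fields the
# exporter reads from the GitHub Actions API.
runs = [
    {"conclusion": "failure", "updated_at": "2025-01-01T10:00:00Z"},
    {"conclusion": "success", "updated_at": "2025-01-01T10:45:00Z"},
    {"conclusion": "success", "updated_at": "2025-01-01T12:00:00Z"},
]

FMT = "%Y-%m-%dT%H:%M:%SZ"
recovery_times = []

# Same failure-then-success pairing as calculate_mttr
for i in range(1, len(runs)):
    if runs[i - 1]["conclusion"] == "failure" and runs[i]["conclusion"] == "success":
        failed = datetime.strptime(runs[i - 1]["updated_at"], FMT)
        restored = datetime.strptime(runs[i]["updated_at"], FMT)
        recovery_times.append((restored - failed).total_seconds())

mttr = sum(recovery_times) / len(recovery_times)
print(mttr)  # 2700.0 (45 minutes)
```

Note that only the first success after a failure counts as a recovery; the later success at 12:00 follows another success and is ignored.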



&lt;p&gt;&lt;strong&gt;3. Install Required Python Packages&lt;/strong&gt;&lt;br&gt;
These packages are necessary for the GitHub Exporter script to function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo pip3 install requests prometheus_client pytz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, install the packages from the Ubuntu repositories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install python3-requests python3-prometheus-client python3-tz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create Systemd Service&lt;/strong&gt;&lt;br&gt;
Create &lt;code&gt;/etc/systemd/system/github_exporter.service&lt;/code&gt; with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=GitHub Metrics Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/github_exporter/github_exporter.py
Restart=always

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Enable and Start GitHub Exporter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable github_exporter
sudo systemctl start github_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Check GitHub Exporter Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status github_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facszpgyenzwkjpctuf45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facszpgyenzwkjpctuf45.png" alt="Image description" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Setting Up AlertManager
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Download Binaries&lt;/strong&gt;&lt;br&gt;
Download the AlertManager release archive from GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create User&lt;/strong&gt;&lt;br&gt;
A dedicated user enhances security by limiting permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo groupadd -f alertmanager
sudo useradd -g alertmanager --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager/templates
sudo mkdir /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Unpack Prometheus AlertManager Binary&lt;/strong&gt;&lt;br&gt;
Extract the downloaded archive and rename the directory for convenience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar -xvf alertmanager-0.21.0.linux-amd64.tar.gz
mv alertmanager-0.21.0.linux-amd64 alertmanager-files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Install Prometheus AlertManager&lt;/strong&gt;&lt;br&gt;
Copying the alertmanager and amtool binaries to /usr/bin makes them globally accessible on your system. Changing their ownership to the alertmanager user ensures that AlertManager runs with the appropriate permissions, enhancing security.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp alertmanager-files/alertmanager /usr/bin/
sudo cp alertmanager-files/amtool /usr/bin/
sudo chown alertmanager:alertmanager /usr/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/bin/amtool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Install Prometheus AlertManager Configuration File&lt;/strong&gt;&lt;br&gt;
Copy the &lt;code&gt;alertmanager.yml&lt;/code&gt; file from alertmanager-files to the /etc/alertmanager directory and change its ownership to the alertmanager user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo cp alertmanager-files/alertmanager.yml /etc/alertmanager/alertmanager.yml
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Setup Prometheus AlertManager Service&lt;/strong&gt;&lt;br&gt;
Create the alertmanager service file at &lt;code&gt;/usr/lib/systemd/system/alertmanager.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo vi /usr/lib/systemd/system/alertmanager.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=AlertManager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/bin/alertmanager \
    --config.file /etc/alertmanager/alertmanager.yml \
    --storage.path /var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Set File Permissions&lt;/strong&gt;&lt;br&gt;
Setting the correct permissions ensures that systemd can read the service file while preventing modification by unauthorised users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo chmod 664 /usr/lib/systemd/system/alertmanager.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8. Create Configuration File&lt;/strong&gt;&lt;br&gt;
Edit configuration file &lt;code&gt;/etc/alertmanager/alertmanager.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
resolve_timeout: 1m

route:
receiver: 'slack-notifications'
group_by: ['alertname', 'job']
repeat_interval: 1h

receivers:
- name: 'slack-notifications'
slack_configs:
- channel: '#&amp;lt;channel-name-here&amp;gt;'
  send_resolved: true
  icon_url: https://avatars3.githubusercontent.com/u/3380462
  api_url: 'https://hooks.slack.com/services/&amp;lt;api-url-here&amp;gt;'
  title: |-
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
    {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
      (
      {{- with .CommonLabels.Remove .GroupLabels.Names }}
        {{- range $index, $label := .SortedPairs -}}
          {{ if $index }}, {{ end }}
          {{- $label.Name }}="{{ $label.Value -}}"
        {{- end }}
      {{- end }}
      )
    {{- end }}
  text: &amp;gt;-
    {{ range .Alerts -}}
    *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
    *Description:* {{ .Annotations.description }}
    *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
    {{ end }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;9. Reload Systemd and Start AlertManager&lt;/strong&gt;&lt;br&gt;
Reloading systemd ensures that it recognizes the new AlertManager service file. Starting the service ensures AlertManager is running and ready to handle alerts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;10. Check AlertManager Service Status&lt;/strong&gt;&lt;br&gt;
Checking the status ensures that AlertManager is running without errors. If there are issues, the status output will provide clues for troubleshooting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl status alertmanager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbd570wi2i92paurmky2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbd570wi2i92paurmky2.png" alt="Image description" width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By following these steps, you’ve successfully set up Prometheus AlertManager and configured it to send alerts to Slack. You’ve also prepared Grafana for visualising metrics and monitoring your system. This setup ensures that your team is notified of critical issues in real time, improving system reliability and efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Configure Grafana Dashboards
&lt;/h2&gt;

&lt;p&gt;Once you’ve set up Prometheus, Node Exporter, Blackbox Exporter, and AlertManager, the next step is to visualise your metrics with Grafana, a powerful tool for building dashboards that help you monitor your system and CI/CD pipeline performance. By connecting it to Prometheus, you can track system performance and DORA metrics, and set up alerts for critical issues. Here’s how to configure dashboards for Node Exporter, Blackbox Exporter, and DORA metrics.&lt;/p&gt;

&lt;p&gt;Once you have all the components installed and running, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access Grafana at &lt;code&gt;http://your-server-ip:3000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Log in with the default credentials (username: admin, password: admin).&lt;/li&gt;
&lt;li&gt;Set up dashboards to visualise metrics from Prometheus, Node Exporter, and Blackbox Exporter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh39yj4yuf7c87l3l1u5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh39yj4yuf7c87l3l1u5t.png" alt="Image description" width="800" height="738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Configuring Node Exporter Dashboard&lt;/strong&gt;&lt;br&gt;
The Node Exporter dashboard provides insights into system metrics like CPU usage, memory usage, disk usage, and more. Here’s how to set it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Create a New Dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on the “Create” button (plus icon) in the left sidebar.&lt;/li&gt;
&lt;li&gt;Select “Import” from the dropdown menu.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Import Node Exporter Dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the “Import via grafana.com” textbox, enter the Node Exporter dashboard UID: 1860.&lt;/li&gt;
&lt;li&gt;Click “Load”&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu5hytlac0jzsikpt3lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyu5hytlac0jzsikpt3lj.png" alt="Image description" width="800" height="840"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Select Data Source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose Prometheus as your data source&lt;/li&gt;
&lt;li&gt;Click "Import"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View Your Dashboard&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You'll now see a fully configured Node Exporter dashboard with panels for CPU, memory, disk, and other system metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1gkdafycmt5xee5jaf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1gkdafycmt5xee5jaf2.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Configuring Blackbox Exporter Dashboard&lt;/strong&gt;&lt;br&gt;
The Blackbox Exporter dashboard helps you monitor uptime, HTTP response times, and SSL certificate expiration. Here's how to set it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a New Dashboard:

&lt;ul&gt;
&lt;li&gt;Click on the "Create" button (plus icon) in the left sidebar.&lt;/li&gt;
&lt;li&gt;Select "Import" from the dropdown menu.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Import Blackbox Exporter Dashboard:

&lt;ul&gt;
&lt;li&gt;As you did for Node Exporter, enter the Blackbox Exporter dashboard UID 7587 in the "Import via grafana.com" textbox.&lt;/li&gt;
&lt;li&gt;Click "Load"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Select Data Source:

&lt;ul&gt;
&lt;li&gt;Choose Prometheus as your data source&lt;/li&gt;
&lt;li&gt;Click "Import"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;View Your Dashboard

&lt;ul&gt;
&lt;li&gt;You'll now see a fully configured Blackbox Exporter dashboard with panels for probe success, HTTP response times, status codes, and SSL certificate expiry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn231lmkylr8mhvb1y9m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn231lmkylr8mhvb1y9m2.png" alt="Image description" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Configuring DORA Metrics Dashboard&lt;/strong&gt;&lt;br&gt;
The DORA metrics dashboard tracks key CI/CD performance indicators like Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore. Here's how to set it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a New Dashboard:

&lt;ul&gt;
&lt;li&gt;Click on the "Create" button (plus icon) in the left sidebar.&lt;/li&gt;
&lt;li&gt;Select "New Dashboar" from the dropdown menu.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add Panels for DORA Metrics:

&lt;ul&gt;
&lt;li&gt;Click on the "Add" dropdown and select "Visualization".&lt;/li&gt;
&lt;li&gt;In the Queries panel, add the following metrics:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;github_deployment_frequency_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;github_change_failure_rate_percent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;github_mean_time_to_restore_seconds&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;github_lead_time_for_changes_seconds&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save Your Dashboard:

&lt;ul&gt;
&lt;li&gt;Click "Save" to save your dashboard.&lt;/li&gt;
&lt;li&gt;Give it a meaningful name, like "DORA Metrics Dashboard".&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;View and Customize:

&lt;ul&gt;
&lt;li&gt;You can now view your DORA metrics in real time.&lt;/li&gt;
&lt;li&gt;Feel free to edit, move panels around, or change visualization types (e.g., graphs, gauges, tables).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
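&lt;p&gt;Since the lead time and restore metrics are exposed in seconds, a panel query can rescale them into something easier to read. For example, a possible query for a lead-time panel (assuming the metric names exposed by the GitHub exporter above) converts seconds to hours:&lt;br&gt;
&lt;/p&gt;

```promql
# Lead time for changes, in hours, per repository
github_lead_time_for_changes_seconds / 3600
```

You can set the panel unit to "hours" in Grafana instead, but keeping the conversion in the query makes the intent explicit.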

&lt;p&gt;&lt;strong&gt;4. Customizing Your Dashboards&lt;/strong&gt;&lt;br&gt;
Grafana is highly customizable, so don't be afraid to get creative! Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edit Panels: Click on a panel title and select "Edit" to change the visualization type or query.&lt;/li&gt;
&lt;li&gt;Move Panels: Drag and drop panels to rearrange them.&lt;/li&gt;
&lt;li&gt;Add Alerts: Set up alerts directly from Grafana panels to notify your team of critical issues.&lt;/li&gt;
&lt;li&gt;Save Changes: Always save your dashboard after making changes.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Part 3: Implementing AWS Cost Exporter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Create Cost Exporter Directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo mkdir -p /opt/cost_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create the Python Cost Exporter Script&lt;/strong&gt;&lt;br&gt;
Create the file at &lt;code&gt;/opt/cost_exporter/cost_exporter.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import boto3
from datetime import datetime, timedelta
from prometheus_client import start_http_server, Gauge
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger('aws-cost-exporter')

# Create metrics
aws_service_cost = Gauge('aws_service_cost_dollars', 'Cost in dollars by AWS service', ['service'])
aws_total_cost = Gauge('aws_total_cost_dollars', 'Total AWS cost in dollars', [])
aws_budget_usage = Gauge('aws_budget_usage_percent', 'Budget usage percentage', ['budget_name'])

def collect_cost_metrics():
    """Collect AWS cost metrics and update Prometheus gauges"""
    ce_client = boto3.client('ce')
    budgets_client = boto3.client('budgets')

    # Get current date and start of month
    end_date = datetime.utcnow().strftime('%Y-%m-%d')
    start_date = datetime(datetime.utcnow().year, datetime.utcnow().month, 1).strftime('%Y-%m-%d')

    try:
        # Get cost by service
        response = ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start_date,
                'End': end_date
            },
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {
                    'Type': 'DIMENSION',
                    'Key': 'SERVICE'
                }
            ]
        )

        total_cost = 0

        # Process service costs
        for group in response['ResultsByTime'][0]['Groups']:
            service_name = group['Keys'][0]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            total_cost += cost
            aws_service_cost.labels(service=service_name).set(cost)
            logger.info(f"Service: {service_name}, Cost: ${cost:.2f}")

        # Set total cost
        aws_total_cost.set(total_cost)
        logger.info(f"Total Cost: ${total_cost:.2f}")

        # Get budgets and their usage
        try:
            budgets_response = budgets_client.describe_budgets(
                AccountId=boto3.client('sts').get_caller_identity().get('Account')
            )

            for budget in budgets_response.get('Budgets', []):
                budget_name = budget['BudgetName']
                calculated_spend = float(budget.get('CalculatedSpend', {}).get('ActualSpend', {}).get('Amount', 0))
                budget_limit = float(budget.get('BudgetLimit', {}).get('Amount', 0))

                if budget_limit &amp;gt; 0:
                    usage_percent = (calculated_spend / budget_limit) * 100
                    aws_budget_usage.labels(budget_name=budget_name).set(usage_percent)
                    logger.info(f"Budget: {budget_name}, Usage: {usage_percent:.2f}%")
        except Exception as e:
            logger.error(f"Error getting budget information: {str(e)}")

    except Exception as e:
        logger.error(f"Error collecting cost metrics: {str(e)}")

def main():
    # Start up the server to expose the metrics.
    port = 9108
    start_http_server(port)
    logger.info(f"AWS Cost Exporter started on port {port}")

    # Update metrics every hour
    while True:
        collect_cost_metrics()
        time.sleep(3600)  # 1 hour

if __name__ == '__main__':
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
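&lt;p&gt;One subtlety in the script's date handling: Cost Explorer treats &lt;code&gt;End&lt;/code&gt; as exclusive and rejects a window where Start equals End, which is exactly what the month-to-date calculation produces on the first day of a month. A small standalone sketch of the window calculation with a possible fallback (the fallback is a suggestion, not part of the script above):&lt;br&gt;
&lt;/p&gt;

```python
from datetime import datetime, timedelta

def month_to_date_window(now):
    """Return (start, end) date strings for a month-to-date Cost Explorer query."""
    start = datetime(now.year, now.month, 1)
    # On the 1st, start == end, which the API rejects; fall back to a
    # one-day window ending today.
    if start.date() == now.date():
        start = now - timedelta(days=1)
    return start.strftime("%Y-%m-%d"), now.strftime("%Y-%m-%d")

print(month_to_date_window(datetime(2025, 3, 15)))  # ('2025-03-01', '2025-03-15')
print(month_to_date_window(datetime(2025, 3, 1)))   # ('2025-02-28', '2025-03-01')
```

Without a fallback like this, the exporter logs an error on the first day of every month and simply skips that collection cycle.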



&lt;p&gt;&lt;strong&gt;3. Install Required Python Packages&lt;/strong&gt;&lt;br&gt;
The script needs boto3, the AWS SDK for Python (&lt;code&gt;prometheus_client&lt;/code&gt; was already installed for the GitHub Exporter).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install python3-boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create a Systemd Service for the Cost Exporter&lt;/strong&gt;&lt;br&gt;
Create the file &lt;code&gt;/etc/systemd/system/cost_exporter.service&lt;/code&gt; and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=AWS Cost Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/cost_exporter/cost_exporter.py
Restart=always

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Start the Cost Exporter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl daemon-reload
sudo systemctl enable cost_exporter
sudo systemctl start cost_exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Update Prometheus Configuration&lt;/strong&gt;&lt;br&gt;
Add the following to &lt;code&gt;/etc/prometheus/prometheus.yml&lt;/code&gt; under the scrape_configs section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- job_name: 'aws_cost'
  static_configs:
    - targets: ['localhost:9108']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Restart Prometheus&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl restart prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 4: Setting Up AWS-Specific Alerts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Add AWS Alert Rules to Prometheus&lt;/strong&gt;&lt;br&gt;
Add the following to &lt;code&gt;/etc/prometheus/alert_rules.yml&lt;/code&gt; under a new group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: aws_alerts
  rules:
    - alert: HighEC2CPUUsage
      expr: avg(aws_ec2_cpuutilization_average{instance=~".*"}) by (instance) &amp;gt; 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU Usage on EC2 Instance {{ $labels.instance }}"
        description: "EC2 Instance {{ $labels.instance }} has high CPU usage ({{ $value }}%) for 5 minutes."

    - alert: RDSHighCPUUsage
      expr: aws_rds_cpuutilization_average &amp;gt; 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU Usage on RDS Instance {{ $labels.dbinstance_identifier }}"
        description: "RDS Instance {{ $labels.dbinstance_identifier }} has high CPU usage ({{ $value }}%) for 5 minutes."

    - alert: LambdaErrors
      expr: increase(aws_lambda_errors_sum[1h]) &amp;gt; 10
      labels:
        severity: warning
      annotations:
        summary: "High Error Rate on Lambda Function {{ $labels.function_name }}"
        description: "Lambda Function {{ $labels.function_name }} has more than 10 errors in the past hour."

    - alert: BudgetNearLimit
      expr: aws_budget_usage_percent &amp;gt; 90
      labels:
        severity: warning
      annotations:
        summary: "Budget Usage Approaching Limit"
        description: "Budget {{ $labels.budget_name }} is at {{ $value | printf \"%.2f\" }}% of its limit."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Restart Prometheus&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl restart prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwry06am9euc0x76703y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwry06am9euc0x76703y.png" alt="Image description" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By following these steps, you've configured Grafana dashboards for Node Exporter, Blackbox Exporter, and DORA metrics, set up AWS-specific alerts, and integrated the AWS Cost Exporter into Prometheus and Grafana. These dashboards give you a clear view of your system's performance and CI/CD pipeline efficiency, helping you make data-driven decisions.&lt;br&gt;
Don't forget to explore Grafana's extensive library of pre-built dashboards and plugins to further enhance your monitoring setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You for Reading!
&lt;/h2&gt;

&lt;p&gt;If you found this guide helpful, don't forget to &lt;strong&gt;like, comment, and share&lt;/strong&gt;! Let me know if you have any questions or need further assistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy monitoring!&lt;/strong&gt; 🚀&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://grafana.com/grafana/dashboards/1860-node-exporter-full/" rel="noopener noreferrer"&gt;https://grafana.com/grafana/dashboards/1860-node-exporter-full/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/grafana/dashboards/7587-prometheus-blackbox-exporter/" rel="noopener noreferrer"&gt;https://grafana.com/grafana/dashboards/7587-prometheus-blackbox-exporter/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shishirkh.medium.com/monitoring-stack-setup-part-4-blackbox-exporter-f7ba54d0ed99" rel="noopener noreferrer"&gt;https://shishirkh.medium.com/monitoring-stack-setup-part-4-blackbox-exporter-f7ba54d0ed99&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.couchbase.com/tutorial-configure-alertmanager" rel="noopener noreferrer"&gt;https://developer.couchbase.com/tutorial-configure-alertmanager&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>monitoring</category>
      <category>aws</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>NGINX Configuration: My HNG DevOps Stage 0 Experience</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Wed, 29 Jan 2025 02:36:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/nginx-configuration-my-hng-devops-stage-0-experience-31je</link>
      <guid>https://dev.to/aws-builders/nginx-configuration-my-hng-devops-stage-0-experience-31je</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As part of the HNG DevOps Stage 0 task, I was required to set up and configure NGINX on a fresh Ubuntu server using any cloud platform of my choice, ensuring it serves a custom HTML page. This exercise aimed to demonstrate my ability to configure a web server and deploy a simple static site.&lt;/p&gt;

&lt;p&gt;At first glance, this may seem like a simple task—after all, installing a web server and displaying a basic page shouldn’t be that difficult, right? But as with most things in DevOps, the real value isn’t in blindly following steps; it’s in understanding why each step matters, how to troubleshoot issues efficiently, and how to apply these concepts in real-world environments.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through my approach to completing this task, the challenges I encountered, how I resolved them, and the broader lessons this exercise reinforced about working as a DevOps engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps Taken
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Provisioning an AWS EC2 Instance&lt;/strong&gt;&lt;br&gt;
A good DevOps engineer must be comfortable working with cloud infrastructure. I chose AWS as my cloud provider. I launched an EC2 instance using the Ubuntu AMI, carefully configuring security groups to allow HTTP traffic (port 80) and SSH access (port 22).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Downloading the Key and Logging In&lt;/strong&gt;&lt;br&gt;
After the instance was provisioned, I downloaded the private key (.pem file) and logged in using SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -i my-key.pem ubuntu@&amp;lt;server-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, an error message could have slowed me down. I initially forgot where my key file was saved, but a quick check of my download directory saved me from unnecessary frustration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Updating the System Packages&lt;/strong&gt;&lt;br&gt;
Before installing NGINX, I made sure my package list was up to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step is essential because outdated package lists can cause unexpected installation errors—a common oversight that can lead to unnecessary debugging later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Installing NGINX&lt;/strong&gt;&lt;br&gt;
With the system updated, I installed NGINX by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install nginx -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Verifying NGINX Installation&lt;/strong&gt;&lt;br&gt;
To confirm that NGINX was installed and running correctly, I checked its status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Had it not been running, I would have started it manually with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Locating and Modifying the Default HTML File&lt;/strong&gt;&lt;br&gt;
Here’s where I encountered one of my biggest challenges.&lt;/p&gt;

&lt;p&gt;I wasn’t immediately sure where the default NGINX HTML file was. Was it &lt;code&gt;/usr/share/nginx/html/index.html&lt;/code&gt; or &lt;code&gt;/var/www/html/index.html&lt;/code&gt;? I debated between both paths before finally checking the NGINX configuration in &lt;code&gt;/etc/nginx/sites-enabled/default&lt;/code&gt;. A simple look there earlier would have saved me time!&lt;/p&gt;
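&lt;p&gt;In hindsight, the document root can be read straight out of the site configuration from the shell. A small sketch (the helper function is my own, assuming the stock Ubuntu layout):&lt;/p&gt;

```shell
# Print the document root configured in a given NGINX site file,
# e.g. nginx_docroot /etc/nginx/sites-enabled/default
nginx_docroot() {
  grep -m1 -E '^[[:space:]]*root[[:space:]]' "$1" | awk '{gsub(/;/, "", $2); print $2}'
}
```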

&lt;p&gt;Once I confirmed the correct location (&lt;code&gt;/var/www/html/index.html&lt;/code&gt;), I edited the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /var/www/html
sudo nano index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I replaced the existing content with the required message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;Welcome&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
    &amp;lt;h1&amp;gt;Welcome to DevOps Stage 0 - [Your name here]/[Your username here]&amp;lt;/h1&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7. Restarting NGINX&lt;/strong&gt;&lt;br&gt;
After saving the file, I restarted NGINX to apply the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl restart nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8. Testing the Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final step was to confirm everything worked as expected. I opened a browser and navigated to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://&amp;lt;your-server-ip&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seeing my custom message confirmed that everything was set up correctly.&lt;/p&gt;
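&lt;p&gt;The same check can be scripted instead of eyeballed. A minimal sketch (the function name is my own; pipe the fetched page into it):&lt;/p&gt;

```shell
# Succeeds (exit 0) only if the page piped on stdin contains the heading.
# Usage: curl -s http://your-server-ip/ | check_page
check_page() {
  grep -q "DevOps Stage 0"
}
```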

&lt;h2&gt;
  
  
  Challenges and How I Overcame Them
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Misplacing the Private Key File&lt;/strong&gt;&lt;br&gt;
At first, I couldn’t log in via SSH because my terminal couldn’t locate the .pem file. The solution? Checking the correct directory and rerunning the command. This was a reminder that even small errors can waste time if not caught quickly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uncertainty About the NGINX Default Directory&lt;/strong&gt;&lt;br&gt;
I spent unnecessary time debating between different file paths. A quick check of &lt;code&gt;/etc/nginx/sites-enabled/default&lt;/code&gt; would have clarified it instantly. The lesson here? Read instructions carefully before jumping into troubleshooting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permission Errors While Editing HTML&lt;/strong&gt;&lt;br&gt;
While modifying &lt;code&gt;index.html&lt;/code&gt;, I ran into permission errors. Using &lt;code&gt;sudo&lt;/code&gt; solved the problem, but it reinforced the importance of understanding file permissions and privilege management.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Personal Growth
&lt;/h2&gt;

&lt;p&gt;This task may have seemed straightforward, but the growth it sparked was immense. Every step reinforced the importance of resilience, adaptability, and problem-solving—qualities that go beyond just configuring a web server.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Patience and Attention to Detail&lt;/strong&gt; – I learnt the hard way that missing a simple detail, like the correct HTML file path, can lead to unnecessary troubleshooting. Slowing down and reading configurations properly saved me time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confidence in Debugging&lt;/strong&gt; – Every roadblock was an opportunity to sharpen my troubleshooting mindset. Instead of panicking, I approached problems systematically and solved them efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building Resilience&lt;/strong&gt; – I faced errors, hit dead ends, and had to restart processes, but that’s exactly what makes DevOps exciting. Overcoming challenges is part of the journey, and this experience solidified my ability to push through obstacles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embracing a Growth Mindset&lt;/strong&gt; – DevOps isn’t about memorising commands; it’s about constantly learning, adapting, and improving. This task was just one step, but it reminded me that every challenge is an opportunity to grow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ultimately, this wasn’t just about setting up NGINX, it was about becoming a better problem solver, a more confident engineer, and a more resilient individual.&lt;/p&gt;

&lt;p&gt;At this point, it’s clear that &lt;strong&gt;HNG doesn’t just teach you technical skills—it prepares you to think like a DevOps engineer&lt;/strong&gt;. But this task? It’s just the beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  HNG: A Proven Hub for Elite DevOps Talent
&lt;/h2&gt;

&lt;p&gt;This project, while valuable, barely scratches the surface of what HNG engineers are capable of.&lt;/p&gt;

&lt;p&gt;HNG isn’t just another training program—it’s an intensive, real-world simulation that pushes engineers beyond basic configurations. It tests problem-solving skills, adaptability, and the ability to function in high-stakes DevOps environments.&lt;/p&gt;

&lt;p&gt;HNG has produced top-tier engineers who have mastered automation, cloud-native technologies, infrastructure management, security, and scalability—all essential for today’s evolving tech landscape. These graduates emerge with battle-tested expertise, making them an asset to any organisation looking for skilled professionals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking to Hire the Best? Start with HNG!
&lt;/h2&gt;

&lt;p&gt;If you're a company seeking top DevOps, Cloud, and Infrastructure Engineering talent, hiring from HNG should be your &lt;strong&gt;first priority&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;HNG alumni have demonstrated their ability to thrive in high-pressure environments, adapt to industry trends, and deliver production-ready solutions.&lt;/p&gt;

&lt;p&gt;Explore elite HNG talents available for hire:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hng.tech/hire/devops-engineers" rel="noopener noreferrer"&gt;DevOps Engineers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hng.tech/hire/site-reliability-engineers" rel="noopener noreferrer"&gt;Site Reliability Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a track record of producing world-class engineers, HNG remains the premier destination for companies seeking elite tech talent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, if you're looking for talents who can truly deliver, start your search here.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>nginx</category>
    </item>
    <item>
      <title>Deploying the Spring PetClinic Sample Application to an EKS Cluster with ECR</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Wed, 20 Mar 2024 23:34:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/deploying-the-spring-petclinic-sample-application-to-an-eks-cluster-with-ecr-3n6p</link>
      <guid>https://dev.to/aws-builders/deploying-the-spring-petclinic-sample-application-to-an-eks-cluster-with-ecr-3n6p</guid>
      <description>&lt;p&gt;As indicated by the title, our objective is to set up the well-known Spring PetClinic Sample Application on EKS starting from the very beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Java 17&lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;li&gt;Basic knowledge of Kubernetes&lt;/li&gt;
&lt;li&gt;kubectl&lt;/li&gt;
&lt;li&gt;eksctl&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;AWS CLI&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Cloning and running the Pet Clinic Application.&lt;/li&gt;
&lt;li&gt;Creating a container image from the application.&lt;/li&gt;
&lt;li&gt;Upload a custom image to the Elastic Container Registry (ECR).&lt;/li&gt;
&lt;li&gt;Create an Elastic Kubernetes Service (EKS) cluster.&lt;/li&gt;
&lt;li&gt;Deploy the created ECR image to the EKS cluster.&lt;/li&gt;
&lt;li&gt;Delete all resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Clone the application
&lt;/h2&gt;

&lt;p&gt;For the first step, head over to GitHub and clone the application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/spring-projects/spring-petclinic.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, run the application&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd spring-petclinic
./mvnw package
java -jar target/*.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a container image from the application.
&lt;/h2&gt;

&lt;p&gt;First, create a file; we'll name it Dockerfile. Note that it should be created in the root of the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM eclipse-temurin:17-jdk-jammy
WORKDIR /app
COPY .mvn/ .mvn 
COPY mvnw pom.xml ./
RUN ./mvnw dependency:resolve
COPY src ./src
CMD ["./mvnw", "spring-boot:run"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
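&lt;p&gt;Since we already built the jar with &lt;code&gt;./mvnw package&lt;/code&gt;, a variant worth knowing copies that jar straight into the image instead of re-running Maven inside the container, which yields a smaller, faster-starting image. This is only a sketch of that alternative, not the Dockerfile used in the rest of this post:&lt;/p&gt;

```dockerfile
# Assumes ./mvnw package has already produced target/*.jar on the host.
FROM eclipse-temurin:17-jre-jammy
WORKDIR /app
COPY target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```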



&lt;p&gt;Now, run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t petclinic .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command builds a Docker image from the Dockerfile we created and tags it &lt;strong&gt;petclinic&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjfrpyx2pdgmu3r6gc7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjfrpyx2pdgmu3r6gc7t.png" alt="Image description" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 8080:8080 petclinic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command starts a new container from the &lt;strong&gt;petclinic&lt;/strong&gt; image and maps port 8080 in the container to port 8080 on your host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjleogqo6fu1npbso05eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjleogqo6fu1npbso05eb.png" alt="Image description" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this step, you should see the app running in your browser at &lt;strong&gt;localhost:8080&lt;/strong&gt;. If port 8080 is already in use, map a different host port instead (for example, &lt;code&gt;-p 9090:8080&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Upload a custom image to the Elastic Container Registry (ECR)
&lt;/h2&gt;

&lt;p&gt;To begin, you'll want to set up an ECR registry. Start by logging into the AWS console and locating ECR using the search bar. Once you've found it, click on the "Create repository" button, which will switch the interface to a form where you'll need to provide more details.&lt;/p&gt;

&lt;p&gt;Now, assign a name to your repository. I recommend setting it to private for added security. Finally, proceed by clicking on "Create repository" using the default settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcgir1pa96grn2oig3rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcgir1pa96grn2oig3rk.png" alt="Image description" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the next step, we login to the newly created private repository using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecr get-login-password --region _your-chosen-region_ | docker login --username AWS --password-stdin _your-account-id_.dkr.ecr.us-east-1.amazonaws.com/_your-repository-name_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now use the following command to modify the tag of our local image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker tag 87fe7e888a17 _your-account-id_.dkr.ecr.us-east-1.amazonaws.com/_your-repository-name:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;87fe7e888a17&lt;/strong&gt; represents the local image tag; you can find it using the &lt;code&gt;docker images&lt;/code&gt; command.&lt;/p&gt;
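&lt;p&gt;The full ECR reference follows one fixed pattern, so it helps to see it assembled from its parts. A sketch with placeholder values (the account id, region, and repository name below are illustrative):&lt;/p&gt;

```shell
# ECR image URIs follow: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/REPO:TAG
ACCOUNT_ID=123456789012    # placeholder, not a real account
REGION=us-east-1
REPO=petclinic
TAG=v1
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}"
echo "$IMAGE_URI"   # prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/petclinic:v1
```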

&lt;p&gt;Finally, we can push this image using the following command. Remember, you have the flexibility to utilize any tag in place of v1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker push _your-account-id_.dkr.ecr.us-east-1.amazonaws.com/_your-repository-name:v1_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgu51eulshkhc2gzd2mk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgu51eulshkhc2gzd2mk.png" alt="Image description" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an Elastic Kubernetes Service (EKS) cluster
&lt;/h2&gt;

&lt;p&gt;With the image pushed to ECR, it's time to create an EKS cluster and deploy that image to it.&lt;/p&gt;

&lt;p&gt;To create a cluster, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster --name _your-cluster-name_ --region _your-chosen-region_ --node-type t2.medium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvlxw7d4jdgxu938gmzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvlxw7d4jdgxu938gmzo.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, create a namespace. (A Kubernetes namespace is a logical abstraction used to organise and partition resources within a Kubernetes cluster.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace petclinic-one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy the created ECR image to the EKS cluster
&lt;/h2&gt;

&lt;p&gt;To deploy the application, we will use two Kubernetes manifests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Deployment that points to the image we pushed to ECR.&lt;/li&gt;
&lt;li&gt;A Service of type LoadBalancer, which makes AWS automatically provision a network load balancer for us; it forwards traffic to port 8080, where our application listens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;petclinic-deployment.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: petclinic
  namespace: petclinic-one
  labels:
    name: petclinic
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: petclinic
  template:
    metadata:
      labels:
        app: petclinic
    spec:
      containers:
        - name: petclinic
          image: _your-account-id_.dkr.ecr._your-chosen-region_.amazonaws.com/_your-repository-name_:v1
          ports:
            - containerPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;petclinic-service.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: petclinic
  namespace: petclinic-one
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "false"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  ports:
    - name: web
      port: 80
      targetPort: 8080
  selector:
    app: petclinic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, apply the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f petclinic-deployment.yaml
kubectl apply -f petclinic-service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, check the status of the pods and the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n eks-demo-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe20tw1jtkgl4sd3dk27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe20tw1jtkgl4sd3dk27.png" alt="Image description" width="613" height="62"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get svc -n eks-demo-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrhugr76dsq16vgh4g9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrhugr76dsq16vgh4g9g.png" alt="Image description" width="800" height="34"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open the service's external address (the EXTERNAL-IP column) in your browser to reach the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete all resources
&lt;/h2&gt;

&lt;p&gt;To avoid a surprise bill, delete the EKS cluster when you're done (and remove the ECR repository from the console as well):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl delete cluster --name petclinic --region _your-chosen-region_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>spring</category>
    </item>
    <item>
      <title>Automating the Provisioning of AWS EKS Cluster Using Terraform and CircleCI</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Wed, 29 Mar 2023 21:22:20 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/automating-the-provisioning-of-aws-eks-cluster-using-terraform-and-circleci-361p</link>
      <guid>https://dev.to/msoluwademilade/automating-the-provisioning-of-aws-eks-cluster-using-terraform-and-circleci-361p</guid>
      <description>&lt;h2&gt;
  
  
  What is CircleCI?
&lt;/h2&gt;

&lt;p&gt;CircleCI is, for those who are unfamiliar, one of the many tools used for continuous integration and continuous deployment (CI/CD). CI/CD is a set of ideas, methods, and tools that lets teams deliver software updates of any kind to users quickly, effectively, repeatably, and securely.&lt;/p&gt;

&lt;p&gt;In the article below, I go into more detail about CI/CD: what it is, why it matters, and the various CI/CD tools available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/msoluwademilade/jenkins-vs-circleci-1md"&gt;Jenkins vs CircleCI&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Terraform?
&lt;/h2&gt;

&lt;p&gt;Terraform is a provisioning management tool that enables infrastructure automation for provisioning, compliance, and management of any cloud, datacenter, and service. &lt;/p&gt;

&lt;p&gt;Cloud provisioning is the allocation of resources from a cloud provider to a customer. It defines how a client acquires resources and services from a supplier, making it a crucial part of cloud computing, and it typically involves automating the installation, configuration, and management of those services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Needed
&lt;/h2&gt;

&lt;p&gt;For this tutorial, you'll need the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;An access key and secret access key file&lt;/li&gt;
&lt;li&gt;An Integrated Development Environment (IDE) of your choice.&lt;/li&gt;
&lt;li&gt;A GitHub account&lt;/li&gt;
&lt;li&gt;A CircleCI account. You can sign up for a CircleCI account using your GitHub account if you don't already have one.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Clone this repository
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;git clone https://github.com/MsOluwademilade/learn-terraform-eks-circleci&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Review the CircleCI configuration file
&lt;/h3&gt;

&lt;p&gt;The CircleCI configuration (which was gotten from &lt;a href="https://github.com/hashicorp/learn-terraform-circleci" rel="noopener noreferrer"&gt;here&lt;/a&gt;) will complete four jobs during the process of automating the Terraform workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The “plan-apply” Job&lt;/strong&gt;: &lt;br&gt;
This job uses the &lt;code&gt;hashicorp/terraform:light&lt;/code&gt; image, which, according to HashiCorp, includes a Terraform binary. It performs the checkout steps and runs &lt;code&gt;terraform init&lt;/code&gt;, then executes &lt;code&gt;terraform plan -out&lt;/code&gt; to create a plan file named "tfapply".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnreezb1egskj0ppbyl25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnreezb1egskj0ppbyl25.png" alt="Image description" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The “apply” Job&lt;/strong&gt;: &lt;br&gt;
According to HashiCorp, the "attach_workspace" step in this job loads the workspace persisted by the previous job. The job then runs &lt;code&gt;terraform apply&lt;/code&gt; against the "tfapply" plan file created in the previous step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t9dmyvfjilbk96ypa14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t9dmyvfjilbk96ypa14.png" alt="Image description" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The “plan-destroy” Job&lt;/strong&gt;: &lt;br&gt;
This job generates a plan to destroy the deployed infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgubojkt6hf43bjn9txn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgubojkt6hf43bjn9txn.png" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The “destroy” Job&lt;/strong&gt;: &lt;br&gt;
This job executes that destroy plan, tearing the infrastructure down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB&lt;/strong&gt;: HashiCorp advises users to monitor the "plan-destroy" and "destroy" jobs carefully while they run, and to keep tabs on the status of each job, since interruptions are possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final element of this CircleCI setup is the &lt;strong&gt;workflow&lt;/strong&gt;, which coordinates the sequencing of, and requirements for, each job in the pipeline. According to HashiCorp, the "plan_approve_apply" workflow runs sequentially, creating approvals for each stage along the way; each job therefore runs only if the previous one completed successfully.&lt;/p&gt;
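&lt;p&gt;For illustration, a sequential workflow of this shape is typically expressed with &lt;code&gt;requires&lt;/code&gt; and an approval gate. This is only a sketch of the idea, with illustrative job names, not the exact configuration from the repository:&lt;/p&gt;

```yaml
workflows:
  plan_approve_apply:
    jobs:
      - plan-apply
      - hold-apply:            # pauses until a human approves in the UI
          type: approval
          requires:
            - plan-apply
      - apply:
          requires:
            - hold-apply       # runs only after approval succeeds
```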

&lt;h3&gt;
  
  
  Step 3: Set up the project in CircleCI
&lt;/h3&gt;

&lt;p&gt;Since our GitHub and CircleCI accounts are connected, the forked repository already appears under "Projects". Click the blue "Set Up Project" button to move on to the next page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62xh99fsy3i8caxecyfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62xh99fsy3i8caxecyfq.png" alt="Image description" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select "Fastest" on the subsequent screen, then press the "Start Building" button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pcy4ru6yl2tbsrwwz15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pcy4ru6yl2tbsrwwz15.png" alt="Image description" width="516" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CircleCI will immediately try to execute the jobs in the pipeline, but they will fail. This happens because two essential values are missing. To continue with our project, we must enter our AWS Access Key ID and AWS Secret Access Key on the "Environment Variables" screen and save them.&lt;/p&gt;

&lt;p&gt;Use AWS_ACCESS_KEY_ID as the name and enter the value for the Access Key ID. Use AWS_SECRET_ACCESS_KEY as the name and enter the value for the Secret Access Key.&lt;/p&gt;
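&lt;p&gt;These are the exact variable names Terraform's AWS provider reads from the environment, so the same pair also works for local runs. A sketch with placeholder values (never commit real keys to source control):&lt;/p&gt;

```shell
# Placeholder credentials for illustration only -- substitute your own.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEYID"
export AWS_SECRET_ACCESS_KEY="exampleSecretAccessKey"
```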

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ve8h8zoxc8xfapkvyv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ve8h8zoxc8xfapkvyv.jpeg" alt="Image description" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CircleCI's project screen shows the project currently in focus, the pipeline's job statuses, the workflow, the repo branch/commit, and settings. Start the project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Destroy your resources
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1ltyh0gy5im237ae29d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1ltyh0gy5im237ae29d.png" alt="Image description" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a later article, I'll describe how to access the deployed resources.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Jenkins vs CircleCI</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Mon, 30 Jan 2023 12:49:25 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/jenkins-vs-circleci-1md</link>
      <guid>https://dev.to/msoluwademilade/jenkins-vs-circleci-1md</guid>
      <description>&lt;h2&gt;
  
  
  What are They?
&lt;/h2&gt;

&lt;p&gt;For those who don't already know, Jenkins and CircleCI are two of the many continuous integration and continuous delivery (CI/CD) tools. Continuous Integration (CI) is the practice of automatically merging code changes from many developers into a single software project. Continuous Delivery (CD) is a software development practice where code changes are automatically prepared for release to production. CI/CD, then, is a combination of ideas, methods, and tools that enables software updates of all kinds to reach users in a timely, effective, repeatable, and secure manner, by automating the various stages of app development and delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of CI/CD
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Detect Security Vulnerabilities: Security is a key concern in every industry. CI/CD identifies vulnerabilities in code before it is deployed to production, and even gives them priority, thereby saving money by preventing embarrassing and/or expensive security breaches.&lt;/li&gt;
&lt;li&gt;Automated Rollback Triggered by Job Failure: There are occasions when code turns out to contain problems after it has been deployed (put into use). Identifying the cause of an error can take minutes, hours, days, or even weeks. With CI/CD, however, there is a rapid way to return to the code's prior working state, protecting revenue and saving time.&lt;/li&gt;
&lt;li&gt;Automate Infrastructure Creation: CI/CD automates the delivery of software or infrastructure-as-code from source code to production. This leads to less human error and faster deployments, which in turn reduces cost.&lt;/li&gt;
&lt;li&gt;Faster and More Frequent Production Deployments: A CI/CD procedure that runs without hiccups can enable numerous daily releases. Without much manual labour, teams can automatically create, test, and deliver features, so new value-generating features get released more quickly and the team's efficiency increases.&lt;/li&gt;
&lt;li&gt;Code Gets Deployed to Production Faster: Automating the integration and deployment of code makes it possible to deploy to production considerably more quickly than with manual checks. As a result, the development and operations teams spend less time reviewing and rechecking the code's quality, which boosts productivity, saves time, and thereby boosts revenue.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  CI/CD Tools
&lt;/h2&gt;

&lt;p&gt;CI/CD tools include, but are not limited to, the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jenkins&lt;/li&gt;
&lt;li&gt;CircleCI&lt;/li&gt;
&lt;li&gt;GitLab&lt;/li&gt;
&lt;li&gt;TravisCI&lt;/li&gt;
&lt;li&gt;TeamCity&lt;/li&gt;
&lt;li&gt;Bamboo&lt;/li&gt;
&lt;li&gt;Buddy&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Jenkins
&lt;/h2&gt;

&lt;p&gt;Jenkins provides a straightforward way to set up a continuous integration or continuous delivery (CI/CD) system for virtually any combination of languages and source code repositories, and to automate other routine development tasks. Although it doesn't completely eliminate the need to write scripts for individual processes, Jenkins gives you a faster and more reliable way to integrate your full cycle of build, test, and deployment tools than you could easily develop yourself.&lt;/p&gt;
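&lt;p&gt;As a sketch of what driving Jenkins with a pipeline script looks like, here is a minimal declarative Jenkinsfile (the stage names and shell scripts are illustrative, not part of any particular project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline {
    agent any                      // run on any available Jenkins agent
    stages {
        stage('Build') {
            steps {
                sh './build.sh'        // illustrative build script
            }
        }
        stage('Test') {
            steps {
                sh './run-tests.sh'    // illustrative test script
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Checked into the root of a repository, a file like this lets Jenkins rebuild and retest the project on every change.&lt;/p&gt;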

&lt;h3&gt;
  
  
  Advantages of Jenkins
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Easy Installation: Nobody enjoys going through a difficult installation. Jenkins is a standalone Java-based programme that comes with packages for Windows, Linux, macOS, and other Unix-like operating systems and may be used right away after installation.&lt;/li&gt;
&lt;li&gt;Easy Configuration: Jenkins' web interface enables straightforward setup and configuration thanks to its built-in tutorials and real-time error checking.&lt;/li&gt;
&lt;li&gt;Extensible: With a vast supply of community-contributed plugins, Jenkins is expandable and offers practically limitless potential uses.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  CircleCI
&lt;/h2&gt;

&lt;p&gt;CircleCI offers simple installation and upkeep without any problems. Because it is a cloud-based system, even for business customers, there is no requirement for a dedicated server and no need for server maintenance or administration with a free plan. CircleCI is compatible with GitHub, Amazon EC2, Appfog, dotCloud, and other platforms. &lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of CircleCI
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Allows developers to debug in the build using SSH: It can be difficult and time-consuming to debug code in environments other than a developer's usual development environment. CircleCI lets you rerun a job with SSH enabled so you can log into the build machine and inspect it directly.&lt;/li&gt;
&lt;li&gt;Parallel builds are possible for rapid implementation of multiple tasks: Most build processes, including tests, run sequentially, meaning each step starts only after the one before it finishes. If a lengthy single build process is divided into several parts that execute simultaneously, those steps run concurrently, or in parallel, which shortens the overall build.&lt;/li&gt;
&lt;li&gt;Allows Slack integration: CircleCI's chat alerts keep everyone on your team informed about the status of your most recent builds. When a build succeeds or fails, you can see which commit caused it and who pushed the code to GitHub. It's a great way to keep track of what your team members are working on, and to identify and swiftly fix problematic builds.&lt;/li&gt;
&lt;/ol&gt;
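&lt;p&gt;To make the parallel-builds idea concrete, here is a minimal &lt;code&gt;.circleci/config.yml&lt;/code&gt; sketch (the Docker image and test command are illustrative placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 2.1
jobs:
  test:
    docker:
      - image: cimg/node:18.0   # illustrative build image
    parallelism: 2              # split this job across two containers
    steps:
      - checkout
      - run: npm test           # illustrative test command
workflows:
  build-and-test:
    jobs:
      - test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;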

&lt;h3&gt;
  
  
  Jenkins vs CircleCI
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open-Source vs Commercial: Jenkins is an open-source tool, while CircleCI is a commercial product with a free tier and premium plans.&lt;/li&gt;
&lt;li&gt;Complexity: Because of its notoriously intricate configuration, Jenkins is best suited to seasoned developers. CircleCI, on the other hand, is simpler to use and more user-friendly for beginners.&lt;/li&gt;
&lt;li&gt;Scalability: Jenkins can handle large-scale projects, while CircleCI is geared towards smaller ones.&lt;/li&gt;
&lt;li&gt;Community and Support: Jenkins has a sizeable and vibrant community, with a wealth of plugins and resources available. CircleCI, being a commercial solution, offers official support but has a smaller user base.&lt;/li&gt;
&lt;li&gt;Integrations: Both tools integrate with a variety of other software and infrastructure, including GitHub, Bitbucket, and AWS, although CircleCI offers a smaller selection of integrations than Jenkins.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;So, between CircleCI and Jenkins, which should you pick? That depends on what you expect from a continuous integration tool. We've seen the advantages and disadvantages of CircleCI and Jenkins; you can select a tool based on your project's needs, including its budget, timeline, and other factors. When selecting a CI/CD tool for your organisation, community support and resource availability are also important considerations. Jenkins has a sizeable community. CircleCI, on the other hand, does its best by offering comprehensive and useful content, as well as events, to address the majority of issues.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploying a Static Website on AWS</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Sun, 10 Jul 2022 12:56:53 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/deploying-a-static-website-on-aws-4kj1</link>
      <guid>https://dev.to/msoluwademilade/deploying-a-static-website-on-aws-4kj1</guid>
      <description>&lt;p&gt;Static websites have predetermined content and can be created without the use of programming languages. It is built with HTML, CSS, and JavaScript and is the simplest type of website to develop. It consists of a number of HTML files, each of which represents a certain internet page physically.&lt;/p&gt;

&lt;p&gt;I've outlined below a step-by-step approach for deploying a static website using AWS, leveraging S3 Bucket and CloudFront.&lt;/p&gt;

&lt;h2&gt;
  
  
  I'll be breaking this into six steps.
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Creating the S3 bucket.&lt;/li&gt;
&lt;li&gt;Uploading the files to the created S3 bucket.&lt;/li&gt;
&lt;li&gt;Securing the bucket using IAM (Identity and Access Management).&lt;/li&gt;
&lt;li&gt;Configuring the S3 bucket.&lt;/li&gt;
&lt;li&gt;Distributing the website using AWS CloudFront.&lt;/li&gt;
&lt;li&gt;Accessing the website in a web browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Creating the S3 Bucket
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; In the "Find Services" box, enter "S3," click it, and then select "Create Bucket."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmyxtluihgloc8rp3rh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmyxtluihgloc8rp3rh4.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Your bucket name must be globally unique. One suggestion is to incorporate your 12-digit AWS account ID into the name: if your AWS account ID is 123456789012, your bucket could be named my-123456789012-bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl77h1lm014zsjyyrqi4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl77h1lm014zsjyyrqi4e.png" alt="Image description" width="799" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; In the "Block Public Access" section of the bucket options, uncheck "Block all public access." This allows the public to access the bucket objects via the S3 object URL, which, in my opinion, is the main goal of hosting a website in the first place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkolnhm6bdmogvsml9qwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkolnhm6bdmogvsml9qwv.png" alt="Image description" width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; To create your S3 bucket, click "Next" and then "Create Bucket."&lt;/p&gt;
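&lt;p&gt;If you prefer the command line, the same bucket can be created with the AWS CLI; this is a sketch using the example bucket name from Step 2:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the bucket (the name must be globally unique)
aws s3 mb s3://my-123456789012-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;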

&lt;h2&gt;
  
  
  Uploading the Files to the Newly Created S3 Bucket.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Open the bucket and click on the 'Upload' button&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhk5fpj1vzflcrba4e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhk5fpj1vzflcrba4e6.png" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; To upload the files and folders of the website you wish to host from your local computer to the S3 bucket, click the "Add files" and "Add folder" buttons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknw4vfirpk24izxts9c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknw4vfirpk24izxts9c3.png" alt="Image description" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Click "Upload"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ikjamwn7omw6r51yv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94ikjamwn7omw6r51yv2.png" alt="Image description" width="800" height="907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should see this page after that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ak1s7vp9g9tc8oa2s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ak1s7vp9g9tc8oa2s5.png" alt="Image description" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrlgevu74qg8nddy998r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrlgevu74qg8nddy998r.png" alt="Image description" width="800" height="725"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tb24x3fw7ekk5dc5bt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tb24x3fw7ekk5dc5bt5.png" alt="Image description" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;
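&lt;p&gt;For reference, the console upload above is equivalent to syncing your site folder with the AWS CLI (the local folder name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Copy the website files from a local folder into the bucket
aws s3 sync ./my-website s3://my-123456789012-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;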

&lt;h2&gt;
  
  
  Securing the bucket using IAM (Identity and Access Management).
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Open your S3 bucket and select the "Permissions" option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj8mk48211hq200p81td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj8mk48211hq200p81td.png" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Click "Edit" after scrolling down to "Bucket policy."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqe5k00f1fe9p2ymgwk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqe5k00f1fe9p2ymgwk9.png" alt="Image description" width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Enter the bucket policy shown in the code block below, replacing "my-123456789012-bucket" with the name of your own bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"Version":"2012-10-17",
"Statement":[
 {
   "Sid":"AddPerm",
   "Effect":"Allow",
   "Principal": "*",
   "Action":["s3:GetObject"],
   "Resource":["arn:aws:s3:::my-123456789012-bucket/*"]
 }
]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Click "Save changes"&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring the S3 Bucket.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; On the "Properties" tab, scroll down to the "Static website hosting" section; it was the last item on the list at the time of writing. Then select "Edit."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmnpr1i6e7afwdn6xmid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmnpr1i6e7afwdn6xmid.png" alt="Image description" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Enable hosting for static websites, then provide the name of your index and error document. Next, select "Save changes."&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqqqawypqzbg5glb5oki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqqqawypqzbg5glb5oki.png" alt="Image description" width="785" height="1151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Copy the "Bucket website endpoint" for later use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25ntbmx5bv1zjfrep4fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25ntbmx5bv1zjfrep4fd.png" alt="Image description" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;
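&lt;p&gt;The static website hosting configuration from Step 2 can also be applied from the AWS CLI, as a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable static website hosting with index and error documents
aws s3 website s3://my-123456789012-bucket/ \
  --index-document index.html \
  --error-document error.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;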

&lt;h2&gt;
  
  
  Distributing the Website Using AWS CloudFront.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; In the text field labelled "Search for service, features, blogs, docs, and more", type "CloudFront."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa1cmhxl7aletkj0kykr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa1cmhxl7aletkj0kykr.png" alt="Image description" width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Click "Create a CloudFront Distribution"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfcajz8lckf4b31oua9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfcajz8lckf4b31oua9b.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; You now need to set the "Origin domain." NOTE: Don't make a selection from the drop-down menu. Instead, paste the static website hosting endpoint you copied earlier, in the format "[bucket-name].s3-website-region.amazonaws.com."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdwqf9fur38fdrcsf6zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdwqf9fur38fdrcsf6zp.png" alt="Image description" width="758" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Click "Create Distribution" after changing the "Viewer protocol policy" to "Redirect HTTP to HTTPS."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlqvyrlje43zat2q6yp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlqvyrlje43zat2q6yp5.png" alt="Image description" width="653" height="956"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing Website in Web Browser.
&lt;/h2&gt;

&lt;p&gt;Your CloudFront domain name, S3 object URL, and bucket website endpoint should all display the same index.html content, confirming that you followed the procedure correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. CloudFront domain name&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvt7md8mmyjt5k42fidm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvt7md8mmyjt5k42fidm.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. S3 object URL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2cf3a3a5hix57zrucs3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2cf3a3a5hix57zrucs3.png" alt="Image description" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Bucket website-endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqzy7oa1rvncr235r5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqzy7oa1rvncr235r5q.png" alt="Image description" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Containerisation</title>
      <dc:creator>Oluwademilade Oyekanmi</dc:creator>
      <pubDate>Sun, 03 Jul 2022 22:57:15 +0000</pubDate>
      <link>https://dev.to/msoluwademilade/containerisation-1d67</link>
      <guid>https://dev.to/msoluwademilade/containerisation-1d67</guid>
      <description>&lt;p&gt;The process of containerisation comprises packing a software component, together with all of its dependencies, configuration, and environment, into a standalone container. This enables the uniform deployment of an application across all computing environments, including on-premises and cloud-based ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a Container?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A container is a standardised software component that wraps up code and all of its dependencies to ensure that an application runs swiftly and consistently across different computing environments.&lt;br&gt;
Anything from a small microservice or software process to a huge application can run inside a single container. All required executables, binary code, libraries, and configuration files are contained inside a container. Unlike server or machine virtualisation, however, containers do not include operating system images.&lt;/p&gt;
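&lt;p&gt;As a concrete sketch, the image for a small Node.js service could be described with a Dockerfile like the following (the base image, port, and entry point are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:18-alpine          # base image bundling the runtime
WORKDIR /app
COPY package*.json ./
RUN npm install              # bake dependencies into the image
COPY . .
EXPOSE 3000                  # port the service listens on
CMD ["node", "server.js"]    # illustrative entry point
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything the application needs, runtime, libraries, and code, ends up inside the one image, which is what makes the container portable.&lt;/p&gt;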

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits of Containerisation&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Portability
There is a popular saying when it comes to containerisation: "write once, run anywhere." Because a container bundles &lt;strong&gt;all&lt;/strong&gt; of its dependencies, you can take your application almost anywhere without recompiling it for a different environment.&lt;/li&gt;
&lt;li&gt;Efficiency
Containerisation is one of the most effective virtualisation techniques available to developers. It delivers greater utilisation of computing resources while using significantly fewer resources than VMs, and it allows applications to be scaled, patched, or deployed more quickly.&lt;/li&gt;
&lt;li&gt;Agility
Being agile means having the capacity to move swiftly. Containers can be quickly created, deployed to any environment, and used to address a wide range of DevOps concerns. The universality and usability of the development tools further encourage the rapid creation, packaging, and deployment of containers across all operating systems.&lt;/li&gt;
&lt;li&gt;Greater speed
Because containers share the machine's operating system, they aren't weighed down by extra overhead. This lightweight construction improves server efficiency and start-up speed, and the improved efficiency and performance in turn reduce server and licensing expenses.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Virtualisation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The process of creating a virtual version of something, such as an operating system, a server, a storage device, or network resources, as opposed to an actual one, is known as virtualisation. It most frequently refers to using many operating systems at once on a computer system. The operating system, libraries, and other programmes that make up the guest virtualisation system are distinct from the host operating system that runs below it, giving the impression to applications running on top of the virtualised machine that they are on their own dedicated computer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Containerisation vs Virtualisation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHKQeoDl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ckuu8cwamrtyzea2m59j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHKQeoDl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ckuu8cwamrtyzea2m59j.png" alt="Image description" width="880" height="471"&gt;&lt;/a&gt;&lt;br&gt;
While virtualisation provides total separation from the host operating system and other VMs, containerisation typically offers only limited isolation from the host and other containers, and so lacks the level of security a virtual machine provides.&lt;/p&gt;

&lt;p&gt;With virtualisation, virtually any operating system can be run inside the virtual machine. Containerisation, however, shares the host's operating system kernel.&lt;/p&gt;

&lt;p&gt;Virtualisation imitates your real hardware, such as CPU cores, memory, and discs, and presents it as a separate machine. Containerisation, by contrast, is OS-level virtualisation: it only partially simulates the underlying machine.&lt;/p&gt;

&lt;p&gt;Virtualisation is heavyweight, while containerisation is lightweight.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
