
Project idea – “Real-Time Orders Platform”

What This Image Represents

This is your entire project at a bird’s-eye view:

Multiple microservices in ECS

All produce or consume events

Events go to Confluent Cloud Kafka topics

Postgres (historical data) → Kafka via JDBC source

Kafka (real-time analytics) → Couchbase via sink connector

ksqlDB enriches and joins streams

ALB exposes web-backend + frontend UI

Event Flow (Producer → Kafka → Consumers → Couchbase)

What This Image Describes

This represents EXACTLY how your order platform works:

order-producer generates events

Events go into Kafka topics

Consumer microservices in ECS Fargate read the events:

payment-service

fraud-service

analytics-service

analytics-service writes results to order-analytics topic

Kafka Connect sink pushes analytics to Couchbase

Your UI reads from Couchbase to show real-time results

This image is PERFECT for understanding “how data moves.”

Kafka Topics & Microservices Interaction

What This Image Represents

This diagram explains:

| Topic | Who writes? | Who reads? |
| --- | --- | --- |
| orders | order-producer | payment, fraud, analytics |
| payments | payment-service | analytics |
| fraud-alerts | fraud-service | analytics or UI |
| order-analytics | analytics-service | Couchbase sink connector |

Database & Connector Architecture (Postgres → Kafka → Couchbase)

What This Represents

This is the heart of your project’s data integration:

JDBC Source Connector reads historical customer data from Postgres

Streams it into orders topic

Real-time events enrich that data

Sink Connector pushes analytics into Couchbase

Couchbase becomes your real-time NoSQL operational store.

ECS + ALB + Terraform + CI/CD Architecture

What This Represents

This image shows your cloud deployment layer:

GitHub Actions CI/CD

Terraform provisioning

ECS Fargate cluster

IAM roles

ALB + target groups

VPC + subnets

Security groups

Secrets from GitHub → ECS

This is what recruiters want to see in interviews.

Reference Architecture: Event-Driven Microservices with Apache Kafka | Heroku Dev Center
Building a Microservices Ecosystem with Kafka Streams and KSQL


Image 1 — High-Level Architecture (Entire System Overview)

Kubernetes, Kafka Event Sourcing Architecture Patterns and Use Case Examples | HPE Developer Portal
Reference Architecture: Event-Driven Microservices with Apache Kafka | Heroku Dev Center
Kafka Architecture

What this diagram represents (your project version):

  • User accesses your Frontend (React).
  • Frontend calls your Backend API (FastAPI).
  • Backend fetches analytics / orders from Couchbase + Postgres.
  • Producer generates orders → Kafka.
  • Fraud / Payment / Analytics services consume events.
  • Kafka Connect pulls legacy data from Postgres → Kafka.
  • Couchbase sink stores analytics from Kafka → bucket.

Image 2 — Kafka Event Streaming Flow (Core Streaming Logic)

Shows ONLY Kafka + Producers + Consumers + Connect + Couchbase.

This matches your exact design:

  1. Order Producer sends events into topic orders.
  2. Fraud Service consumes → flags fraud.
  3. Payment Service consumes → marks PAID.
  4. Analytics Service consumes → stores analytics in Couchbase.
  5. Kafka Connect JDBC Source reads Postgres → topic legacy_orders.
  6. Kafka Connect Couchbase Sink writes topic order-analytics → Couchbase bucket.

Image 3 — AWS Production Deployment Diagram

Shows how Phase-2 & Phase-3 deployment works:

This is what your final production environment looks like:

  • ECS Fargate runs:

    • order-producer
    • fraud-service
    • payment-service
    • analytics-service
    • backend
    • frontend
  • AWS ALB exposes only:

    • Frontend
    • Backend /api
  • RDS Postgres replaces local Postgres.

  • Confluent Cloud replaces local Kafka (optional).

  • Couchbase Capella replaces local Couchbase (optional).


Image 4 — Kubernetes + ArgoCD + Helm Deployment Flow

Shows your GitHub → CI/CD → ECR/ GHCR → ArgoCD → EKS → Services.

This aligns with your repo structure:

  • GitHub Actions builds 6 images
  • Pushes to GHCR
  • ArgoCD reads Git repo
  • Syncs Helm charts
  • Deploys frontend + backend to EKS
  • Ingress exposes public domain
  • Microservices talk to Confluent Cloud + RDS

🎯 Overview of Your System

Your docker-compose.yml has 15 components:

Kafka ecosystem (Confluent Platform)

  1. Zookeeper
  2. Kafka Broker
  3. Schema Registry
  4. Kafka Connect
  5. ksqlDB Server
  6. ksqlDB CLI
  7. Confluent Control Center
  8. Kafdrop

Your microservices

  1. order-producer
  2. payment-service
  3. fraud-service
  4. analytics-service

Databases

  1. Postgres
  2. Couchbase

System

  1. Buildkit (Docker internal)

Now we will go one-by-one.


🟦 1. Zookeeper

What it is

Zookeeper was originally required by Kafka for storing metadata:

  • cluster ID
  • broker metadata
  • topic configurations
  • ACLs
  • consumer offsets (old)

Why you need it

Kafka will not start without Zookeeper (unless using KRaft mode).
It acts like a coordination service.

What DevOps should know

  • In production, Zookeeper is deployed as 3 or 5 nodes.
  • If Zookeeper dies → Kafka cannot elect a leader → cluster stops working.

🟦 2. Kafka Broker

What it is

Kafka is a distributed message streaming platform.

Responsibilities

  • Stores messages (orders, payments, fraud alerts)
  • Replicates data across brokers
  • Handles producers and consumers
  • Maintains partitions

Why you need it

Your entire project is built on event-driven microservices:

  • order-producer → writes events
  • consumers → read events
  • fraud-service → listens to order events
  • analytics → listens to everything

Kafka decouples your microservices.

DevOps responsibility

  • Tune brokers
  • Manage retention
  • Create topics
  • Observe consumer lag
  • Monitor partitions

🟦 3. Schema Registry

What it is

A service that stores schemas for Kafka messages.
Schemas describe:

  • fields
  • data types
  • structure

Why you need it

Without Schema Registry, microservices might send incompatible JSON or Avro.

Example

Producer sends:

{ "order_id": 1, "amount": 100 }

If consumer expects:

{ "order_id": 1, "amount": 100, "currency": "USD" }

Schema Registry prevents breaking changes.

DevOps role

  • Manage schema evolution
  • Validate compatibility modes

🟦 4. Kafka Connect

What it is

Framework for moving data between Kafka and external systems.

Examples:

  • Postgres → Kafka
  • Kafka → Elasticsearch
  • Kafka → S3
  • Kafka → Couchbase

Why you need it

Connect lets you build pipelines without writing code.

In your system

At this stage you don’t run any connectors yet (they are added later in this project), but Connect is a standard component of Confluent Platform.

DevOps role

  • Install connectors
  • Manage connector JSON configs
  • Monitor connector errors
  • Scale task workers

🟦 5. ksqlDB Server

What it is

A SQL engine for Kafka.

You write SQL-like queries on topics:

SELECT * FROM ORDERS EMIT CHANGES;

What you can do

  • Streaming joins
  • Aggregations
  • Filters
  • Enrichments
  • Generate new topics

Why you need it

It allows your students to process streams without writing Java or Python.

DevOps responsibility

  • Deploy ksqlDB clusters
  • Manage memory
  • Monitor state store
  • Debug persistent queries

🟦 6. ksqlDB CLI

What it is

A command-line tool to connect to ksqlDB.

Why you need it

You can run:

  • SHOW STREAMS
  • PRINT 'orders'
  • CREATE STREAM fraud AS SELECT ...

Useful for students to practice SQL.
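
For example, with the docker-compose stack described later in this post, you can open an interactive session straight from the CLI container:

docker exec -it ksqldb-cli ksql http://ksqldb-server:8088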


🟦 7. Confluent Control Center (C3)

What it is

The GUI for Confluent Platform.

What you can see

  • Kafka cluster status
  • Producers/consumers
  • Lag
  • Schemas
  • Connect connectors
  • ksqlDB queries
  • Throughput graphs

Why you need it

C3 gives you full visual control over Kafka.

DevOps role

  • Monitor lag
  • Check failed consumers
  • Check unhealthy connectors
  • View cluster performance

🟦 8. Kafdrop

What it is

A lightweight web UI for browsing Kafka topics.

Why you need it

It is a simple way to:

  • browse topics
  • view messages
  • inspect partitions

Difference from Control Center

Control Center = full Confluent enterprise GUI
Kafdrop = quick, lightweight debugger


🟦 9. order-producer (Python)

What it does

This microservice generates order events:

{
  "order_id": 123,
  "amount": 500,
  "country": "FR"
}

Why you need it

It simulates a real e-commerce platform generating events.

DevOps role

  • Deploy it in containers
  • Monitor logs
  • Debug producer failure

🟦 10. payment-service (Python consumer)

What it does

Consumes messages from orders topic.

Processes payment logic:

  • simulated approval
  • simulated failure
  • emits new events

Why you need it

To show microservice chaining.


🟦 11. fraud-service (Python consumer)

What it does

Listens to orders and identifies fraud.

Rules example:

  • amount > 400
  • risky countries

Writes fraud alerts to fraud-alerts topic.

Why you need it

To show real-time fraud detection on Kafka.


🟦 12. analytics-service

What it does

Consumes:

  • orders
  • payments
  • fraud alerts

Stores aggregated results into:

  • Postgres
  • Couchbase bucket

Why you need it

To show multi-system analytics.


🟦 13. Postgres

What it is

A relational SQL database.

Why you use it

Analytics-service writes:

  • order history
  • payment results

It simulates corporate reporting databases.


🟦 14. Couchbase

What it is

A NoSQL JSON database.

Why you use it

It stores:

  • real-time analytics
  • dashboard snapshots

Simulates high-speed NoSQL storage.


🟦 15. BuildKit / buildx

What it is

An internal Docker service for building multi-arch images.

Why it appears

Your system created it automatically because of build commands.

You can ignore it.


🎯 SUMMARY FOR YOU AND YOUR STUDENTS

This system is a full enterprise-grade data streaming platform:

  • Kafka = message backbone
  • Schema Registry = validates message formats
  • Connect = data pipeline engine
  • ksqlDB = real-time SQL on streams
  • Control Center = enterprise monitoring
  • Kafdrop = simple Kafka UI
  • Zookeeper = coordination
  • Python microservices = real-time business logic
  • Postgres = traditional reporting database
  • Couchbase = NoSQL analytics store

This is the same style of architecture used at companies like Uber, Netflix, LinkedIn, Goldman Sachs, Capital One, Chase, and American Airlines.

Why we used Docker Compose (NOT separate docker run commands)

Because your project is not one container.

Your Kafka enterprise project has 15+ containers:

  • Zookeeper
  • Kafka broker
  • Schema Registry
  • Kafka Connect
  • Control Center
  • ksqlDB Server
  • ksqlDB CLI
  • Kafdrop
  • Postgres
  • Couchbase
  • Order Producer
  • Payment Service
  • Fraud Service
  • Analytics Service
  • plus Connect plugin folders and volumes (supporting resources, not separate containers)

All of them must start together and talk to each other.

If you try to run them manually with docker run, each container must be:

  • configured manually
  • linked to the others
  • placed on the same network
  • started in the correct order
  • given the correct environment variables
  • restarted if any dependency fails

This becomes impossible to manage.


🚨 Important: Regular docker run CANNOT handle dependencies

Example:

Kafka cannot start until Zookeeper is ready.

Control Center cannot start until Kafka, Schema Registry, Connect, and ksqlDB are ALL running.

You would need 15 separate terminals with commands like:

docker run -d --name zookeeper ...
docker run -d --name kafka --link zookeeper ...
docker run -d --name schema-registry --link kafka ...
docker run -d --name connect --link kafka --link schema-registry ...
...

And you must remember:

  • all ports
  • all environment variables
  • all dependencies
  • all long settings

This is almost impossible for a real system.


Docker Compose is MADE for multi-container systems like Kafka

With Compose:

1️⃣ All containers start with ONE command:

docker-compose up -d

2️⃣ All containers stop with ONE command:

docker-compose down

3️⃣ All networking is automatic

Every service can reach each other by name:

  • kafka:9092
  • zookeeper:2181
  • schema-registry:8081
  • postgres:5432
  • couchbase:8091

You do NOT need:

  • --network
  • --link
  • --add-host
  • --publish

Compose handles it.
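
A quick way to see this for yourself (a sketch, assuming the stack from this post is running): service names resolve as hostnames inside every container on the Compose network.

docker compose exec order-producer getent hosts kafka postgres couchbase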

4️⃣ All env variables stored in ONE file

Instead of typing hundreds of environment variables by hand with docker run, you keep them all in one docker-compose.yml (plus an optional .env file).

5️⃣ Correct startup order (depends_on)

Kafka waits for Zookeeper
Schema Registry waits for Kafka
Control Center waits for Kafka, Connect, Schema Registry, ksqlDB

You cannot do this cleanly with docker run.

6️⃣ Automatically rebuild your microservices

Compose builds your Python containers:

build: ./producer

docker run cannot do this.

7️⃣ You can restart only one service

Example:

docker-compose restart kafka
docker-compose restart control-center

Tools you added inside your Docker Compose

Here is EXACTLY the ecosystem you created using Compose:

| Component | Purpose |
| --- | --- |
| Zookeeper | Manages Kafka metadata |
| Kafka Broker | Message broker (core Kafka) |
| Schema Registry | Manages Avro / JSON schemas |
| Kafka Connect | Connectors to DB/S3/etc |
| Control Center | Confluent GUI (monitoring) |
| ksqlDB Server | SQL engine for streaming |
| ksqlDB CLI | Terminal for ksqlDB |
| Kafdrop | Simple Kafka topic browser |
| Postgres | Stores orders analytics data |
| Couchbase | Stores real-time aggregated data |
| order-producer | Python microservice generating messages |
| payment-service | Python consumer |
| fraud-service | Python consumer |
| analytics-service | Python consumer + DB writer |

This is a full event-driven microservices platform.

No one launches this with 15 docker run commands.


💡 Summary: Why Docker Compose?

Because:

  • You have many services
  • They depend on each other
  • They need a shared network
  • They need correct startup order
  • They need hundreds of environment variables
  • They must be managed together
  • You are teaching Kafka for DevOps (real-world practice)

Compose = Infrastructure as Code for Docker.
docker run = manual work that does not scale.

1. What Docker can do

Docker is designed to run one container at a time.

Example:

docker run postgres
docker run kafka
docker run my-app

Docker simply takes:

  • image
  • command
  • environment variables
  • ports
  • volumes

…and runs the container.

✔ Docker is good at:

  • Running a single service
  • Packaging applications
  • Isolating environments

📌 BUT Docker does NOT know how containers depend on each other.


2. What Docker CANNOT do (and why you struggled with Kafka)

Docker CANNOT do:

✘ Start Zookeeper first

✘ Wait for Zookeeper to become healthy

✘ Start Kafka only after Zookeeper is ready

✘ Wait for Kafka metadata to initialize

✘ Start Schema Registry after Kafka

✘ Start Kafka Connect after Schema Registry

✘ Start Control Center after ALL of them

✘ Start your Python services after Kafka

Docker run does NOT understand relationships between containers.


🧠 3. Example: Why Kafka needs dependencies (real-world example)

Kafka cannot start without Zookeeper.

If you run:

docker run kafka

It fails with:

ERROR: zookeeper not available

Docker does NOT automatically:

  • retry
  • wait
  • sequence
  • orchestrate

This is why your Control Center was stuck at Loading… — because Kafka was not fully ready when Control Center started.

Docker cannot guarantee startup order.


🔥 4. What Docker Compose DOES (that docker run CANNOT)

Docker Compose introduces a critical concept:

depends_on

Example:

control-center:
  depends_on:
    - kafka
    - schema-registry
    - connect
    - ksqldb-server

This tells Compose:

  • start Kafka first
  • start Schema Registry next
  • start Connect next
  • start ksqlDB next
  • THEN start Control Center

Docker alone cannot do this.


🟦 5. But isn’t Docker “image + dependency”?

No.

Docker = runs containers
Docker Compose = orchestrates containers

Docker itself does not understand:

  • dependency graph
  • service order
  • network resolution
  • health checks
  • shared environments
  • multi-container architecture

The idea that Docker by itself handles dependencies is a misunderstanding.

Docker handles 1 container.

Compose handles multiple containers that need each other.


6. REAL EXAMPLE: Your Project

Your system has:

  • 1 Zookeeper
  • 1 Kafka
  • 1 Schema Registry
  • 1 Kafka Connect
  • 1 Control Center
  • 1 ksqlDB
  • 1 Kafdrop
  • 1 Postgres
  • 1 Couchbase
  • 4 microservices

This is 13 services.

Docker alone CANNOT:

  • create bridge network
  • wire internal hostnames (kafka:9092)
  • ensure startup order
  • build microservices
  • mount volumes
  • share configs
  • restart dependencies

Compose does all of this with ONE FILE.


🟢 7. Visual Summary (very simple)

Docker (single container)

docker run postgres

You control everything manually.


Docker Compose (full application)

One command:

docker-compose up

Everything starts in the right order:

zookeeper → kafka → schema-registry → connect → ksql → control center → your services

Docker runs one container.
Docker Compose runs a system of containers.

Docker does not know dependencies.
Docker Compose understands the dependency graph.

Docker is an engine.
Compose is the orchestrator.

Business story

You’re building a real-time order processing platform for an e-commerce company:

  • Existing Oracle database with orders & customers (in the lab we’ll use PostgreSQL to simulate Oracle).
  • New Couchbase (NoSQL) for fast customer session & cart data (you can simulate with Couchbase or Mongo if easier).
  • Need Confluent Kafka in the middle to stream events:

    • order-service writes new orders to Kafka.
    • payment-service, fraud-service, analytics-service consume.
    • Kafka Connect syncs data from Oracle → Kafka and Kafka → Couchbase.
    • ksqlDB / Kafka Streams does real-time aggregations (e.g., sales per minute, fraud rules).

What this lets you talk about:

  • Kafka architecture & streaming design
  • Confluent components: Brokers, Schema Registry, Connect, ksqlDB, Control Center
  • SQL + NoSQL integration (Oracle/Postgres + Couchbase)
  • Topics, partitions, replication, consumer groups, offsets
  • Reliability, scale, monitoring, security
  • How you “owned” the Confluent environment and became the escalation point

2. High-level architecture

Describe this in interviews like this:

  1. Producers
  • order-service (REST API) → publishes orders events to Kafka.

  2. Kafka / Confluent cluster
  • 3 Kafka brokers (or 1 for lab).
  • Topics:

    • orders (3 partitions, RF=2)
    • payments
    • fraud-alerts
    • order-analytics
  • Schema Registry for Avro/JSON schemas.

  3. Stream processing
  • ksqlDB or Kafka Streams app:

    • joins orders with payments
    • flags potential fraud
    • writes fraud-alerts & order-analytics.

  4. Connectors
  • JDBC Source Connector: Oracle/Postgres → topic legacy_orders.
  • Sink Connector: order-analytics → Couchbase collection.

  5. Consumers
  • payment-service → consumes orders, writes to DB and publishes to payments.
  • fraud-service → consumes orders+payments, publishes to fraud-alerts.
  • analytics-service → consumes orders & writes summary to NoSQL / analytics DB.

  6. Ops
  • Confluent Control Center or CLI tools for monitoring.
  • Basic ACLs / SSL (at least conceptually).


3. Tech stack for the lab

  • Platform: Docker Compose on your Mac
  • Core: Confluent Platform images:

    • zookeeper (or KRaft mode if you want modern setup)
    • kafka brokers
    • schema-registry
    • ksqldb-server + ksqldb-cli
    • connect
    • control-center
  • Databases

    • postgres (simulating Oracle)
    • couchbase (or Mongo if Couchbase is too heavy)
  • Microservices

    • Language you like (Python or Node.js) for producer/consumer services.
  • UI

    • kafdrop or Confluent Control Center to browse topics.

4. Step-by-step project plan

Step 1 – Bring up the Confluent stack with Docker Compose

Goal: show you can set up a Confluent environment from scratch.

  • Create docker-compose.yml with:

    • Zookeeper (optional if not using KRaft)
    • 1–3 Kafka brokers
    • Schema Registry
    • Connect
    • ksqlDB
    • Control Center
    • Postgres
    • Couchbase
  • Verify with:

    • docker ps
    • kafka-topics CLI listing
    • Control Center UI opens in browser.

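Two concrete checks you could run at this point (a sketch, assuming the container names used in the compose file later in this post):

docker ps --format 'table {{.Names}}\t{{.Status}}'
docker exec kafka kafka-topics --bootstrap-server kafka:9092 --list
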
Interview mapping:
“Tell me about a Confluent environment you set up.”
“How many brokers, what replication factor, how did you run it locally for POCs?”


Step 2 – Design & create Kafka topics

Goal: talk like an architect about topics, partitions & replication.

  • Design topics:

    • orders – 3 partitions, RF=1 or 2 (lab).
    • payments – 3 partitions.
    • fraud-alerts – 1 or 2 partitions.
    • order-analytics – 3 partitions.
  • Use kafka-topics CLI or Control Center to create them.

  • Decide partition key (e.g., order_id or customer_id).

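To make the design concrete, the lab topics could be created from inside the broker container like this (a sketch; replication factor 1 because the lab runs a single broker, and names follow this post):

docker exec kafka kafka-topics --bootstrap-server kafka:9092 --create \
  --topic orders --partitions 3 --replication-factor 1
docker exec kafka kafka-topics --bootstrap-server kafka:9092 --create \
  --topic payments --partitions 3 --replication-factor 1
docker exec kafka kafka-topics --bootstrap-server kafka:9092 --create \
  --topic fraud-alerts --partitions 1 --replication-factor 1
docker exec kafka kafka-topics --bootstrap-server kafka:9092 --create \
  --topic order-analytics --partitions 3 --replication-factor 1
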
Interview mapping:
“How do you decide number of partitions?”
“How do you handle ordering?”
“What replication factor do you choose and why?”


Step 3 – Implement an order producer service

Goal: show hands-on Kafka client experience.

  • Build order-service:

    • Simple REST endpoint /orders that accepts an order JSON.
    • Validates & publishes to orders topic using Kafka client library.
    • Adds headers (source, correlation-id) to show best practices.
  • Demonstrate:

    • Fire a few orders.
    • Watch them appear in orders topic (Kafdrop or kafka-console-consumer).

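To watch the events arrive without any UI, the standard console consumer works (a sketch, assuming the broker container from this post):

docker exec kafka kafka-console-consumer --bootstrap-server kafka:9092 \
  --topic orders --from-beginning --max-messages 5
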
Interview mapping:
“Walk me through a producer you wrote.”
“How do you handle retries, acks, idempotence?”


Step 4 – Implement consumer microservices

Goal: talk about consumer groups, scaling, offset management.

  1. payment-service
  • Consumes from orders (group payments-group).
  • “Processes payment” (simulated) and publishes event to payments topic.
  2. fraud-service
  • Consumes from orders & payments (either directly or via order-payments stream later).
  • Simple rule: if amount > X and country is Y → publish alert to fraud-alerts.
  3. analytics-service
  • Consumes from orders & writes to order_analytics table in Postgres (or pushes to order-analytics topic for Connect).

Show:

  • Scaling a consumer group: run 2 instances of payment-service and watch partition assignment change.
  • Show offset lag using kafka-consumer-groups --describe.

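Two commands that make this demo concrete (a sketch; the group name matches payments-group above, and scaling assumes you remove container_name from the payment-service definition so Compose can name the replicas):

# Inspect partition assignment and lag for the payment consumers
docker exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group payments-group

# Run a second payment-service instance and watch the rebalance
docker compose up -d --scale payment-service=2
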
Interview mapping:
“How do consumer groups work?”
“What happens when you add/remove consumers?”
“How do you handle reprocessing / replay?”


Step 5 – Integrate with Oracle (Postgres) using Kafka Connect

Goal: show Connect and JDBC connectors.

  • In Postgres create tables:

    • legacy_orders
    • Insert some sample historical orders.
  • Configure JDBC Source Connector:

    • Source: Postgres legacy_orders.
    • Sink: legacy_orders topic.
  • Verify:

    • Rows from DB appear as messages in legacy_orders.

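Once Connect is up, the connector can be registered and checked over its REST API (a sketch; the file path and connector name match the configs in the build section later in this post):

curl -s -X POST -H "Content-Type: application/json" \
  --data @connect/connectors/jdbc-source.json \
  http://localhost:8083/connectors

curl -s http://localhost:8083/connectors/jdbc-source-legacy-orders/status
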
Interview mapping:
“Have you used Kafka Connect?”
“Explain how you brought data from Oracle into Kafka.”
“How do you handle schema changes?”


Step 6 – Sink analytics to Couchbase via Connect

Goal: show Kafka → NoSQL integration.

  • Create Couchbase bucket/collection order_analytics.
  • Configure Sink Connector:

    • Source topic: order-analytics.
    • Target: Couchbase.
  • Verify:

    • Aggregated analytics events appear as documents in Couchbase.

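The sink follows the same REST workflow (a sketch; names match section 6.2 later in this post):

curl -s -X POST -H "Content-Type: application/json" \
  --data @connect/connectors/couchbase-sink.json \
  http://localhost:8083/connectors

curl -s http://localhost:8083/connectors/couchbase-sink-order-analytics/status
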
Interview mapping:
“Tell us about integrating Kafka with NoSQL / Couchbase.”
“How did you configure your sink connectors?”
“How do you handle retries / DLQs?”


Step 7 – Stream processing with ksqlDB or Kafka Streams

Goal: cover Kafka Streams / kSQL & streaming architecture.

Using ksqlDB:

  • Define streams:

    • ORDERS_STREAM on orders.
    • PAYMENTS_STREAM on payments.
  • Build a joined stream:

    • ORDERS_WITH_PAYMENTS joining by order_id.
  • Create aggregations:

    • Total sales per country per minute.
    • Count of “suspicious orders” flagged by simple rule.
  • Output results to order-analytics topic (used by sink connector above).

Interview mapping:
“What’s your experience with Kafka Streams / kSQL?”
“How do you build a real-time pipeline end-to-end?”
“How do you design stateful vs stateless processing?”


Step 8 – Schema Registry & message evolution

Goal: talk about schemas, compatibility & governance.

  • Define an Avro or JSON schema for Order and register it in Schema Registry.
  • Configure producer & consumers to use that schema.
  • Demonstrate:

    • Add a new optional field (e.g., promo_code) → show compatibility (backward/forward).
    • Talk about what happens if you make a breaking change.

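The Schema Registry REST API makes the compatibility demo concrete (a sketch, assuming the registry from the compose file on localhost:8081; the orders-value subject name follows the default TopicNameStrategy):

# Register the initial Order schema
curl -s -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema":"{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"int\"},{\"name\":\"amount\",\"type\":\"double\"}]}"}' \
  http://localhost:8081/subjects/orders-value/versions

# Test whether adding an optional promo_code field stays compatible
curl -s -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema":"{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"int\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"promo_code\",\"type\":[\"null\",\"string\"],\"default\":null}]}"}' \
  http://localhost:8081/compatibility/subjects/orders-value/versions/latest
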
Interview mapping:
“Have you used Schema Registry?”
“How do you manage schema evolution?”
“How do you avoid breaking consumers?”


Step 9 – Reliability, monitoring & troubleshooting

Goal: show you can be the escalation point for Kafka.

Do some experiments:

  • Kill one consumer instance and watch rebalance.
  • Stop a connector → see lag build up, restart and recover.
  • Configure producer with acks=all, retries, and discuss durability.
  • Use:

    • Control Center dashboards or CLI to check:
    • Consumer lag
    • Broker health
    • Topic throughput

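A few CLI checks that pair well with these experiments (a sketch against the lab stack from this post):

# Lag and member assignment across all consumer groups
docker exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --all-groups

# Partition leaders and in-sync replicas for a topic
docker exec kafka kafka-topics --bootstrap-server kafka:9092 --describe --topic orders

# Connector health after stopping/restarting it
curl -s http://localhost:8083/connectors/jdbc-source-legacy-orders/status
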
Prepare talking points:

  • How you would monitor in prod (Prometheus/Grafana, alerts on lag, disk, ISR count).
  • Backup & disaster recovery strategy (snapshots, multi-AZ, mirror topics across clusters).

Interview mapping:
“How do you monitor Kafka?”
“What are common failure scenarios?”
“If a consumer is lagging badly, how do you troubleshoot?”


Step 10 – Security & access control (conceptual + minimal lab)

Goal: speak about security architecture even if lab is simple.

In lab (optional but nice):

  • Enable SASL/PLAINTEXT or at least explain how you’d:

    • Use SASL/SCRAM or mTLS for auth.
    • Use ACLs to restrict which services can read/write which topics.
    • Use encryption in transit (TLS) and at rest (disks).

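Even though the lab brokers run PLAINTEXT without an authorizer, the ACL model can be sketched with the standard CLI (this only takes effect on a cluster that has authentication and an authorizer enabled):

# Allow the payment-service principal to read the orders topic in its group
kafka-acls --bootstrap-server kafka:9092 --add \
  --allow-principal User:payment-service \
  --operation Read --topic orders \
  --group payments-group
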
Interview mapping:
“How do you secure a Kafka / Confluent environment?”
“How do you isolate teams & applications?”


Step 11 – Your 2-minute “project story” for interviews

Practice saying something like:

“I recently built a real-time orders platform using Confluent Kafka. The company had an existing Oracle database and was adding Couchbase as a NoSQL store.
I designed the Kafka architecture with multiple topics (orders, payments, fraud-alerts, order-analytics) and set up a Confluent stack with Schema Registry, Connect, and ksqlDB using Docker.
An order-service publishes orders to Kafka; payment-service, fraud-service, and analytics-service consume them in different consumer groups.
I used a JDBC Source Connector to stream historical data from Oracle (simulated with Postgres) into Kafka, and a sink connector to push real-time analytics into Couchbase.
On top of that I used ksqlDB to join orders and payments, detect potential fraud, and compute per-minute sales metrics.
I monitored consumer lag and broker health through Control Center, experimented with failures, and documented how to scale consumers and handle schema evolution using Schema Registry.
This project gave me end-to-end experience as the person owning the Confluent platform and integrating it with both SQL and NoSQL systems.”

1. Project structure

Create a folder, for example:

mkdir kafka-enterprise-orders
cd kafka-enterprise-orders

Inside it, create this structure:

kafka-enterprise-orders/
├── docker-compose.yml
├── .env
├── db/
│   └── init.sql
├── connect/
│   ├── Dockerfile
│   └── connectors/
│       ├── jdbc-source.json
│       └── couchbase-sink.json
├── ksql/
│   └── streams.sql
├── producer/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── order_producer.py
└── consumers/
    ├── fraud-service/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── fraud_consumer.py
    ├── payment-service/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── payment_consumer.py
    └── analytics-service/
        ├── Dockerfile
        ├── requirements.txt
        └── analytics_consumer.py

Now fill each file as below.


2. .env

# Kafka
KAFKA_BROKER=kafka:9092

# Topics
ORDERS_TOPIC=orders
PAYMENTS_TOPIC=payments
FRAUD_ALERTS_TOPIC=fraud-alerts
ORDER_ANALYTICS_TOPIC=order-analytics

# Producer config
ORDER_PRODUCER_INTERVAL_SECONDS=3

# Fraud rules
FRAUD_AMOUNT_THRESHOLD=400
FRAUD_RISKY_COUNTRIES=RU,FR,BR

# Analytics settings
ANALYTICS_PRINT_EVERY=10

# Postgres (simulating Oracle)
POSTGRES_DB=ordersdb
POSTGRES_USER=orders_user
POSTGRES_PASSWORD=orders_pass

# Couchbase (you will set these in UI)
COUCHBASE_HOST=couchbase
COUCHBASE_BUCKET=order_analytics
COUCHBASE_USERNAME=Administrator
COUCHBASE_PASSWORD=password

3. docker-compose.yml

services:
  # ---------------------------
  # Zookeeper & Kafka broker
  # ---------------------------
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.1
    container_name: zookeeper
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.6.1
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT"
      # Two listeners cannot share one port: services on the Compose network use kafka:9092,
      # clients on your host machine use localhost:29092.
      KAFKA_LISTENERS: "PLAINTEXT://0.0.0.0:9092,PLAINTEXT_HOST://0.0.0.0:29092"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:29092"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"

  # ---------------------------
  # Schema Registry
  # ---------------------------
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.1
    container_name: schema-registry
    depends_on:
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "PLAINTEXT://kafka:9092"
      SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081"

  # ---------------------------
  # ksqlDB
  # ---------------------------
  ksqldb-server:
    image: confluentinc/cp-ksqldb-server:7.6.1
    container_name: ksqldb-server
    depends_on:
      - kafka
      - schema-registry
    ports:
      - "8088:8088"
    environment:
      KSQL_CONFIG_DIR: "/etc/ksqldb"
      KSQL_BOOTSTRAP_SERVERS: "kafka:9092"
      KSQL_HOST_NAME: ksqldb-server
      KSQL_LISTENERS: "http://0.0.0.0:8088"
      KSQL_CACHE_MAX_BYTES_BUFFERING: 0
      KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      KSQL_KSQL_SERVICE_ID: "ksql-service"

  ksqldb-cli:
    image: confluentinc/cp-ksqldb-cli:7.6.1
    container_name: ksqldb-cli
    depends_on:
      - ksqldb-server
    entrypoint: /bin/sh
    tty: true

  # ---------------------------
  # Kafka Connect (custom image)
  # ---------------------------
  connect:
    build: ./connect
    container_name: connect
    depends_on:
      - kafka
      - schema-registry
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: "kafka:9092"
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: "connect-cluster"
      CONNECT_CONFIG_STORAGE_TOPIC: "_connect-configs"
      CONNECT_OFFSET_STORAGE_TOPIC: "_connect-offsets"
      CONNECT_STATUS_STORAGE_TOPIC: "_connect-status"
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_REST_ADVERTISED_HOST_NAME: "connect"
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components"
      CONNECT_LOG4J_ROOT_LOGLEVEL: "INFO"

  # ---------------------------
  # Confluent Control Center
  # ---------------------------
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.6.1
    container_name: control-center
    depends_on:
      - kafka
      - schema-registry
      - connect
      - ksqldb-server
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: "kafka:9092"
      CONTROL_CENTER_ZOOKEEPER_CONNECT: "zookeeper:2181"
      CONTROL_CENTER_CONNECT_CLUSTER: "connect:8083"
      CONTROL_CENTER_KSQL_KSQLDB1_URL: "http://ksqldb-server:8088"
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021

  # ---------------------------
  # Kafdrop
  # ---------------------------
  kafdrop:
    image: obsidiandynamics/kafdrop:latest
    container_name: kafdrop
    depends_on:
      - kafka
    ports:
      - "9000:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka:9092"
      SERVER_SERVLET_CONTEXTPATH: "/"

  # ---------------------------
  # PostgreSQL (simulating Oracle)
  # ---------------------------
  postgres:
    image: postgres:15
    container_name: postgres
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql

  # ---------------------------
  # Couchbase (single node)
  # ---------------------------
  couchbase:
    image: couchbase:community-7.2.0
    container_name: couchbase
    ports:
      - "8091-8094:8091-8094"
      - "11210:11210"
    environment:
      COUCHBASE_ADMINISTRATOR_USERNAME: ${COUCHBASE_USERNAME}
      COUCHBASE_ADMINISTRATOR_PASSWORD: ${COUCHBASE_PASSWORD}

  # ---------------------------
  # Order producer
  # ---------------------------
  order-producer:
    build: ./producer
    container_name: order-producer
    depends_on:
      - kafka
    environment:
      KAFKA_BROKER: ${KAFKA_BROKER}
      ORDERS_TOPIC: ${ORDERS_TOPIC}
      ORDER_PRODUCER_INTERVAL_SECONDS: ${ORDER_PRODUCER_INTERVAL_SECONDS}

  # ---------------------------
  # Fraud service
  # ---------------------------
  fraud-service:
    build: ./consumers/fraud-service
    container_name: fraud-service
    depends_on:
      - kafka
    environment:
      KAFKA_BROKER: ${KAFKA_BROKER}
      ORDERS_TOPIC: ${ORDERS_TOPIC}
      FRAUD_AMOUNT_THRESHOLD: ${FRAUD_AMOUNT_THRESHOLD}
      FRAUD_RISKY_COUNTRIES: ${FRAUD_RISKY_COUNTRIES}

  # ---------------------------
  # Payment service
  # ---------------------------
  payment-service:
    build: ./consumers/payment-service
    container_name: payment-service
    depends_on:
      - kafka
    environment:
      KAFKA_BROKER: ${KAFKA_BROKER}
      ORDERS_TOPIC: ${ORDERS_TOPIC}
      PAYMENTS_TOPIC: ${PAYMENTS_TOPIC}

  # ---------------------------
  # Analytics service
  # ---------------------------
  analytics-service:
    build: ./consumers/analytics-service
    container_name: analytics-service
    depends_on:
      - kafka
      - couchbase
    environment:
      KAFKA_BROKER: ${KAFKA_BROKER}
      ORDERS_TOPIC: ${ORDERS_TOPIC}
      ORDER_ANALYTICS_TOPIC: ${ORDER_ANALYTICS_TOPIC}
      ANALYTICS_PRINT_EVERY: ${ANALYTICS_PRINT_EVERY}
      COUCHBASE_HOST: ${COUCHBASE_HOST}
      COUCHBASE_BUCKET: ${COUCHBASE_BUCKET}
      COUCHBASE_USERNAME: ${COUCHBASE_USERNAME}
      COUCHBASE_PASSWORD: ${COUCHBASE_PASSWORD}

4. PostgreSQL init script: db/init.sql

CREATE TABLE IF NOT EXISTS legacy_orders (
    id SERIAL PRIMARY KEY,
    order_id INT NOT NULL,
    customer_id INT NOT NULL,
    amount NUMERIC(10,2) NOT NULL,
    currency VARCHAR(10) NOT NULL,
    country VARCHAR(10) NOT NULL,
    created_at TIMESTAMP NOT NULL
);

INSERT INTO legacy_orders (order_id, customer_id, amount, currency, country, created_at) VALUES
(1001, 2001, 123.45, 'USD', 'US', NOW() - INTERVAL '1 day'),
(1002, 2002, 555.00, 'USD', 'FR', NOW() - INTERVAL '2 days'),
(1003, 2003, 75.99, 'USD', 'DE', NOW() - INTERVAL '3 days');

5. Kafka Connect image: connect/Dockerfile

This installs JDBC and Couchbase connectors via Confluent Hub.

FROM confluentinc/cp-kafka-connect:7.6.1

# Install JDBC connector
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest

# Install Couchbase Kafka connector
RUN confluent-hub install --no-prompt couchbase/kafka-connect-couchbase:latest

# Keep the base image's default command (/etc/confluent/docker/run),
# which starts Connect in distributed mode from the CONNECT_* environment
# variables defined in docker-compose.yml.

6. Kafka Connect connector configs

6.1 connect/connectors/jdbc-source.json

This pulls rows from the Postgres legacy_orders table into Kafka. With topic.prefix set to legacy_, the resulting topic name is legacy_legacy_orders (prefix + table name); adjust topic.prefix if you want the topic to be named exactly legacy_orders.

{
  "name": "jdbc-source-legacy-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://postgres:5432/ordersdb",
    "connection.user": "orders_user",
    "connection.password": "orders_pass",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "legacy_orders",
    "topic.prefix": "legacy_",
    "poll.interval.ms": "10000"
  }
}

6.2 connect/connectors/couchbase-sink.json

This writes from topic order-analytics → Couchbase bucket order_analytics.

{
  "name": "couchbase-sink-order-analytics",
  "config": {
    "connector.class": "com.couchbase.connect.kafka.CouchbaseSinkConnector",
    "tasks.max": "1",
    "topics": "order-analytics",
    "couchbase.bootstrap.servers": "couchbase",
    "couchbase.bucket": "order_analytics",
    "couchbase.username": "Administrator",
    "couchbase.password": "password",
    "couchbase.enable.tls": "false",
    "couchbase.document.id": "${topic}-${partition}-${offset}",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}

You’ll POST these JSONs to http://localhost:8083/connectors after everything is up.


7. ksqlDB: ksql/streams.sql

Minimal example: create stream on orders, aggregate sales per country.

-- Run this inside ksqldb-cli (docker exec -it ksqldb-cli /bin/sh then ksql http://ksqldb-server:8088)

CREATE STREAM ORDERS_STREAM (
  order_id INT,
  customer_id INT,
  amount DOUBLE,
  currency VARCHAR,
  country VARCHAR,
  status VARCHAR,
  created_at VARCHAR,
  source VARCHAR
) WITH (
  KAFKA_TOPIC = 'orders',
  VALUE_FORMAT = 'JSON'
);

CREATE TABLE SALES_PER_COUNTRY AS
  SELECT country,
         COUNT(*) AS order_count,
         SUM(amount) AS total_amount
  FROM ORDERS_STREAM
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY country
  EMIT CHANGES;

-- An aggregating query produces a TABLE, not a STREAM
CREATE TABLE ORDER_ANALYTICS
  WITH (KAFKA_TOPIC = 'order-analytics', VALUE_FORMAT = 'JSON') AS
  SELECT country,
         COUNT(*) AS order_count,
         SUM(amount) AS total_amount
  FROM ORDERS_STREAM
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY country
  EMIT CHANGES;

8. Producer service

8.1 producer/requirements.txt

kafka-python==2.0.2

8.2 producer/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY order_producer.py .

CMD ["python", "order_producer.py"]

8.3 producer/order_producer.py

import json
import os
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer


def get_env(name: str, default: str) -> str:
    return os.environ.get(name, default)


KAFKA_BROKER = get_env("KAFKA_BROKER", "kafka:9092")
ORDERS_TOPIC = get_env("ORDERS_TOPIC", "orders")
INTERVAL_SECONDS = float(get_env("ORDER_PRODUCER_INTERVAL_SECONDS", "3"))


def create_producer() -> KafkaProducer:
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        key_serializer=lambda k: str(k).encode("utf-8"),
    )
    print(f"[order-producer] Connected to Kafka at {KAFKA_BROKER}, topic '{ORDERS_TOPIC}'")
    return producer


def random_country() -> str:
    countries = ["US", "CA", "GB", "DE", "FR", "RU", "BR", "AU", "IN"]
    return random.choice(countries)


def random_currency() -> str:
    return "USD"


def random_customer_id() -> int:
    return random.randint(1000, 9999)


def random_amount() -> float:
    return round(random.uniform(10, 600), 2)


def random_source() -> str:
    return random.choice(["web", "mobile", "api"])


def generate_order(order_id: int) -> dict:
    now = datetime.now(timezone.utc).isoformat()
    order = {
        "order_id": order_id,
        "customer_id": random_customer_id(),
        "amount": random_amount(),
        "currency": random_currency(),
        "country": random_country(),
        "status": "NEW",
        "created_at": now,
        "source": random_source(),
    }
    return order


def main():
    producer = create_producer()
    order_id = 1

    while True:
        order = generate_order(order_id)
        producer.send(ORDERS_TOPIC, key=order["order_id"], value=order)
        producer.flush()

        print(f"[order-producer] Sent order: {order}")
        order_id += 1
        time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    main()

9. Fraud service

9.1 consumers/fraud-service/requirements.txt

kafka-python==2.0.2

9.2 consumers/fraud-service/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY fraud_consumer.py .

CMD ["python", "fraud_consumer.py"]

9.3 consumers/fraud-service/fraud_consumer.py

import json
import os
from typing import List, Tuple

from kafka import KafkaConsumer


def get_env(name: str, default: str) -> str:
    return os.environ.get(name, default)


KAFKA_BROKER = get_env("KAFKA_BROKER", "kafka:9092")
ORDERS_TOPIC = get_env("ORDERS_TOPIC", "orders")

AMOUNT_THRESHOLD = float(get_env("FRAUD_AMOUNT_THRESHOLD", "400"))
RISKY_COUNTRIES_RAW = get_env("FRAUD_RISKY_COUNTRIES", "RU,FR,BR")
RISKY_COUNTRIES: List[str] = [c.strip().upper() for c in RISKY_COUNTRIES_RAW.split(",") if c.strip()]


def create_consumer() -> KafkaConsumer:
    consumer = KafkaConsumer(
        ORDERS_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        group_id="fraud-service-group",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        key_deserializer=lambda k: int(k.decode("utf-8")) if k else None,
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )
    print(f"[fraud-service] Connected to Kafka at {KAFKA_BROKER}, topic '{ORDERS_TOPIC}'")
    print(f"[fraud-service] Amount threshold: {AMOUNT_THRESHOLD}, risky countries: {RISKY_COUNTRIES}")
    return consumer


def is_fraud(order: dict) -> Tuple[bool, str]:
    amount = order.get("amount", 0)
    country = order.get("country", "").upper()

    if amount >= AMOUNT_THRESHOLD and country in RISKY_COUNTRIES:
        return True, "HIGH_AMOUNT_RISKY_COUNTRY"
    if amount >= AMOUNT_THRESHOLD:
        return True, "HIGH_AMOUNT"
    if country in RISKY_COUNTRIES:
        return True, "RISKY_COUNTRY"

    return False, ""


def main():
    consumer = create_consumer()

    for message in consumer:
        order = message.value
        key = message.key

        print(f"[fraud-service] Received order: key={key}, value={order}")

        flagged, reason = is_fraud(order)
        if flagged:
            alert = {
                "order_id": order.get("order_id"),
                "reason": reason,
                "amount": order.get("amount"),
                "country": order.get("country"),
                "status": "CANCELLED",
            }
            print(f"[fraud-service] FRAUD ALERT: {alert}")
        else:
            print(f"[fraud-service] Order is clean, id={order.get('order_id')}")


if __name__ == "__main__":
    main()

10. Payment service

10.1 consumers/payment-service/requirements.txt

kafka-python==2.0.2

10.2 consumers/payment-service/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY payment_consumer.py .

CMD ["python", "payment_consumer.py"]

10.3 consumers/payment-service/payment_consumer.py

import json
import os

from kafka import KafkaConsumer, KafkaProducer


def get_env(name: str, default: str) -> str:
    return os.environ.get(name, default)


KAFKA_BROKER = get_env("KAFKA_BROKER", "kafka:9092")
ORDERS_TOPIC = get_env("ORDERS_TOPIC", "orders")
PAYMENTS_TOPIC = get_env("PAYMENTS_TOPIC", "payments")


def create_consumer() -> KafkaConsumer:
    consumer = KafkaConsumer(
        ORDERS_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        group_id="payment-service-group",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        key_deserializer=lambda k: int(k.decode("utf-8")) if k else None,
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )
    print(f"[payment-service] Connected consumer to Kafka at {KAFKA_BROKER}, topic '{ORDERS_TOPIC}'")
    return consumer


def create_producer() -> KafkaProducer:
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        key_serializer=lambda k: str(k).encode("utf-8"),
    )
    print(f"[payment-service] Connected producer to Kafka at {KAFKA_BROKER}, topic '{PAYMENTS_TOPIC}'")
    return producer


def main():
    consumer = create_consumer()
    producer = create_producer()

    for message in consumer:
        order = message.value
        key = message.key

        print(f"[payment-service] Received order: key={key}, value={order}")

        if order.get("status") == "NEW":
            print(f"[payment-service] Processing payment for order {order.get('order_id')} "
                  f"amount={order.get('amount')} {order.get('currency')}")
            payment_event = {
                "order_id": order.get("order_id"),
                "customer_id": order.get("customer_id"),
                "amount": order.get("amount"),
                "currency": order.get("currency"),
                "status": "PAID",
            }
            producer.send(PAYMENTS_TOPIC, key=payment_event["order_id"], value=payment_event)
            producer.flush()
            print(f"[payment-service] Payment successful, event sent: {payment_event}")
        else:
            print(f"[payment-service] Skipping order {order.get('order_id')} with status={order.get('status')}")


if __name__ == "__main__":
    main()

11. Analytics service

This one both prints stats and (optionally) writes to Couchbase directly, so even if Connect is not configured yet you still have a Kafka → Couchbase path.

11.1 consumers/analytics-service/requirements.txt

kafka-python==2.0.2
couchbase==4.3.0

11.2 consumers/analytics-service/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY analytics_consumer.py .

CMD ["python", "analytics_consumer.py"]

11.3 consumers/analytics-service/analytics_consumer.py

import json
import os
from collections import Counter
from datetime import timedelta

from kafka import KafkaConsumer
from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions, ClusterTimeoutOptions


def get_env(name: str, default: str) -> str:
    return os.environ.get(name, default)


KAFKA_BROKER = get_env("KAFKA_BROKER", "kafka:9092")
ORDERS_TOPIC = get_env("ORDERS_TOPIC", "orders")
PRINT_EVERY = int(get_env("ANALYTICS_PRINT_EVERY", "10"))

COUCHBASE_HOST = get_env("COUCHBASE_HOST", "couchbase")
COUCHBASE_BUCKET = get_env("COUCHBASE_BUCKET", "order_analytics")
COUCHBASE_USERNAME = get_env("COUCHBASE_USERNAME", "Administrator")
COUCHBASE_PASSWORD = get_env("COUCHBASE_PASSWORD", "password")


def create_consumer() -> KafkaConsumer:
    consumer = KafkaConsumer(
        ORDERS_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        group_id="analytics-service-group",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        key_deserializer=lambda k: int(k.decode("utf-8")) if k else None,
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )
    print(f"[analytics-service] Connected to Kafka at {KAFKA_BROKER}, topic '{ORDERS_TOPIC}'")
    return consumer


def create_couchbase_bucket():
    try:
        cluster = Cluster(
            f"couchbase://{COUCHBASE_HOST}",
            ClusterOptions(
                PasswordAuthenticator(COUCHBASE_USERNAME, COUCHBASE_PASSWORD),
                # SDK 4.x expects timedelta timeouts, not plain numbers
                timeout_options=ClusterTimeoutOptions(kv_timeout=timedelta(seconds=10)),
            ),
        )
        bucket = cluster.bucket(COUCHBASE_BUCKET)
        collection = bucket.default_collection()
        print(f"[analytics-service] Connected to Couchbase bucket '{COUCHBASE_BUCKET}' at {COUCHBASE_HOST}")
        return collection
    except Exception as e:
        print(f"[analytics-service] WARNING: Could not connect to Couchbase: {e}")
        return None


def main():
    consumer = create_consumer()
    collection = create_couchbase_bucket()

    total_orders = 0
    total_amount = 0.0
    orders_by_country = Counter()

    for message in consumer:
        order = message.value
        key = message.key

        total_orders += 1
        total_amount += float(order.get("amount", 0))
        country = order.get("country", "UNKNOWN")
        orders_by_country[country] += 1

        print(f"[analytics-service] Received order: key={key}, value={order}")

        # Optionally store each order document in Couchbase
        if collection is not None:
            doc_id = f"order::{order.get('order_id')}"
            try:
                collection.upsert(doc_id, order)
                print(f"[analytics-service] Stored order in Couchbase with id={doc_id}")
            except Exception as e:
                print(f"[analytics-service] ERROR storing to Couchbase: {e}")

        if total_orders % PRINT_EVERY == 0:
            print("\n[analytics-service] ===== STATS =====")
            print(f"Total orders: {total_orders}")
            avg_amount = total_amount / total_orders if total_orders else 0
            print(f"Total amount: {total_amount:.2f}")
            print(f"Average amount: {avg_amount:.2f}")
            print("Orders by country:")
            for c, count in orders_by_country.items():
                print(f"  {c}: {count}")
            print("[analytics-service] ====================\n")


if __name__ == "__main__":
    main()

12. How to run and test

From the root folder kafka-enterprise-orders:

  1. Build and start everything
   docker-compose up -d --build
  2. Initialize Couchbase once
  • Go to http://localhost:8091 in browser.
  • Follow setup wizard:

    • Username: Administrator
    • Password: password
  • Create a bucket named: order_analytics.

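If you prefer the terminal to the setup wizard, the same initialization can be scripted with couchbase-cli (a sketch; the flags assume the single-node community container and the Administrator/password credentials from this post):

docker exec couchbase couchbase-cli cluster-init \
  --cluster localhost --cluster-username Administrator --cluster-password password \
  --services data,index,query --cluster-ramsize 1024

docker exec couchbase couchbase-cli bucket-create \
  --cluster localhost --username Administrator --password password \
  --bucket order_analytics --bucket-type couchbase --bucket-ramsize 256
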
  3. Check UIs
  • Kafdrop: http://localhost:9000
  • Control Center: http://localhost:9021/clusters
  • Schema Registry: http://localhost:8081/subjects
  • ksqlDB server info: http://localhost:8088/info
  • Couchbase UI: http://localhost:8091
  4. Create connectors

In a terminal:

   # JDBC source
   curl -X POST -H "Content-Type: application/json" \
     --data @connect/connectors/jdbc-source.json \
     http://localhost:8083/connectors

   # Couchbase sink
   curl -X POST -H "Content-Type: application/json" \
     --data @connect/connectors/couchbase-sink.json \
     http://localhost:8083/connectors
  5. Check logs
   docker logs -f order-producer
   docker logs -f fraud-service
   docker logs -f payment-service
   docker logs -f analytics-service

You now have:

  • Orders streaming into Kafka.
  • Fraud, payment, analytics services consuming.
  • Postgres simulating Oracle, streamed into Kafka by the JDBC Source connector (incrementing-column polling rather than true log-based CDC).
  • Couchbase as NoSQL store for analytics and raw orders.
  • Confluent UIs available on the ports you listed.

1. What you actually need to go from local → internet

Right now you have:

  • A docker-compose stack running on your Mac
  • Internal UIs: Control Center, Kafdrop, Couchbase UI, etc.
  • Microservices + Postgres + Couchbase

To make this “real world / internet-ready”, you need to think in these layers:

A. Where will it run? (Compute / Hosting)

Simplest realistic path for you:

  • AWS EC2 (one or two instances)

    • Run your docker-compose stack there
    • Use security groups + Nginx to expose only what you want public

More advanced options for “Part 3 later”:

  • AWS ECS Fargate or EKS
  • Managed Kafka (MSK or Confluent Cloud)
  • RDS PostgreSQL, Couchbase Capella SaaS

For Part 2, I’d keep Kafka self-hosted and just move exactly what you have to one EC2 VM.


B. What should be public vs private

This is important:

NEVER expose these directly to the internet in real life:

  • Kafka port 9092
  • Zookeeper 2181
  • Postgres 5432
  • Couchbase 8091–8094, 11210
  • Schema Registry 8081
  • Connect 8083
  • Control Center 9021 (admin access only, keep it protected)
  • Kafdrop 9000 (admin/debug only)

✅ Public should be:

  • A single HTTPS endpoint:

    • Either REST API service (e.g., /api/orders, /api/analytics)
    • Or a simple web UI that shows analytics / dashboards

So “internet ready” = one gateway in front, everything else inside a private network.


C. Non-functional things you need

To call it “real DevOps”:

  1. Domain + DNS
  • Buy or use a domain (e.g. jumptotech.dev)
  • Configure api.jumptotech.dev → EC2 public IP via Route 53
  2. HTTPS (TLS)
  • Use Nginx + Let’s Encrypt (certbot) on EC2
  • Or AWS ALB + ACM (if you go more advanced)
  3. Secrets management
  • Don’t hardcode DB passwords in docker-compose.yml
  • Use .env or AWS Systems Manager Parameter Store / Secrets Manager
  4. Monitoring & logs
  • At least: CloudWatch logs or a Loki/Promtail stack
  • Health endpoints for services
  5. CI/CD
  • GitHub Actions: on push → build images → push to ECR → SSH or SSM into EC2 → docker-compose pull && docker-compose up -d

2. “Second Part of Project” – Concrete Plan

Let’s define Part 2 clearly so you can teach it as “Phase 2: Production Deployment”.

📁 New repo structure

Extend your project like this:

kafka-enterprise-orders/
  producer/
  consumers/
  db/
  docker-compose.yml

  infra/
    aws-ec2/
      terraform/           # (optional) to create EC2 + security groups
      user-data.sh         # bootstrap script
  deploy/
    nginx/
      nginx.conf
    docker-compose.prod.yml
  ci-cd/
    github-actions/
      deploy.yml
  docs/
    PART1-local-dev.md
    PART2-cloud-deploy.md

Phase 1 – Prepare production docker-compose

Create deploy/docker-compose.prod.yml with only what you need on EC2:

  • Keep everything as now, BUT:

    • Don’t bind every port to 0.0.0.0
    • You can keep admin UIs temporarily for demo but in real life they’re private.

Example idea:

  • Keep Kafka, Zookeeper, Postgres, Couchbase internal (ports omitted entirely, or bound only to 127.0.0.1; see the sketch after this list).
  • Expose only:

    • Nginx 80/443
    • maybe Kafdrop/Control Center behind basic auth just for your demos.
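
As a rough sketch of loopback-only bindings in deploy/docker-compose.prod.yml (images and ports are placeholders; keep whatever you already run):

services:
  kafdrop:
    image: obsidiandynamics/kafdrop
    ports:
      - "127.0.0.1:9000:9000"   # reachable from the EC2 host (e.g. via an SSH tunnel), not from the internet

  postgres:
    image: postgres:16
    # no ports: section at all; only containers on the same compose network can connect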

Phase 2 – Create EC2 and install Docker

You can do this manually first (then later with Terraform):

  1. Launch an Ubuntu EC2 instance in AWS.
  • Public subnet
  • Security Group: allow

    • 22 from your IP
    • 80, 443 from anywhere (for HTTP/HTTPS)
  2. SSH in and install Docker + the Compose plugin:
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release

# Docker repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

sudo usermod -aG docker ubuntu

Re-login so you can use docker without sudo.

  3. Clone your repo on EC2:
git clone https://github.com/YourUser/kafka-enterprise-orders.git
cd kafka-enterprise-orders/deploy
docker compose -f docker-compose.prod.yml up -d

Now your entire stack runs on EC2.


Phase 3 – Nginx reverse proxy + HTTPS

Add a small Nginx container in docker-compose.prod.yml:

  nginx:
    image: nginx:alpine
    container_name: nginx
    depends_on:
      - analytics-service   # or any web/API service you expose
    ports:
      - "80:80"
      # later "443:443" after TLS
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro

Example deploy/nginx/nginx.conf:

events {}

http {
  server {
    listen 80;
    server_name _;

    location /api/analytics/ {
      proxy_pass http://analytics-service:8000/;  # if your analytics-service exposes 8000
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
    }

    # (Optional) Proxy Kafdrop for demo:
    location /kafdrop/ {
      proxy_pass http://kafdrop:9000/;
    }

    # (Optional) Proxy Control Center:
    location /control-center/ {
      proxy_pass http://control-center:9021/;
    }
  }
}

Now:

  • http://EC2_PUBLIC_IP/api/analytics/ → hits your analytics-service via Nginx
  • You can later attach a domain & TLS.
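
A quick smoke test from your laptop (EC2_PUBLIC_IP is a placeholder):

curl -i http://EC2_PUBLIC_IP/api/analytics/
# expect HTTP 200 with JSON from analytics-service; a 502 usually means
# the upstream name/port in nginx.conf doesn't match your service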

Phase 4 – Domain + HTTPS

  1. In Route 53 (or your DNS provider):
  • Create A record: demo.jumptotech.dev → EC2 public IP
  2. For TLS:
  • Option A (fastest): run certbot on EC2 + mount certificates into Nginx container
  • Option B (more AWS-y): put an Application Load Balancer (ALB) in front + use ACM cert

For teaching and simplicity, you can say:

“We’ll keep HTTP for now in Part 2 and do full HTTPS + ALB in Part 3.”
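
When you do get to Option A, the certbot flow is roughly this (assuming the A record already points at the instance and nothing else is listening on port 80):

sudo apt-get install -y certbot
sudo certbot certonly --standalone -d demo.jumptotech.dev
# certificates land under /etc/letsencrypt/live/demo.jumptotech.dev/
# mount that directory into the Nginx container and add a 443 server block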


Phase 5 – CI/CD with GitHub Actions (basic)

In ci-cd/github-actions/deploy.yml:

  • On push to main:

    • Build images
    • Push to ECR (or Docker Hub)
    • SSH or SSM into EC2 and run docker compose pull && docker compose up -d

Outline:

name: Deploy to EC2

on:
  push:
    branches: [ "main" ]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # build & push images to ECR/DockerHub (you already know this part)

      - name: SSH into EC2 and deploy
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ubuntu
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            cd kafka-enterprise-orders
            git pull
            cd deploy
            docker compose -f docker-compose.prod.yml pull
            docker compose -f docker-compose.prod.yml up -d

Now your students (and you) see full flow:

Code → GitHub → CI/CD → EC2 → Kafka platform online.


3. Short “Part 2 Project Definition” (how you can describe it)

Part 1 – Local development: build a Kafka-based event-driven system with producers, consumers, Postgres, Couchbase, Confluent tools, all via docker-compose on a laptop.

Part 2 – Cloud deployment: take the same system, package it for production using:

  • AWS EC2 as compute
  • Docker Compose for orchestration
  • Nginx as public entry point
  • Domain + DNS
  • Optional CI/CD via GitHub Actions

Goal: make the platform reachable from the internet as a demo system, while keeping internal components (Kafka, DBs) protected.

1. New high-level architecture (Phase 2)

Your current repo:

kafka-enterprise-orders/
├── docker-compose.yml
├── .env
├── db/
├── connect/
├── ksql/
├── producer/
└── consumers/

Phase 2 idea:

  • Confluent Cloud: managed Kafka cluster (no zookeeper, no local kafka, no schema-registry in docker)
  • AWS:

    • ECS Fargate tasks for producer, fraud-service, payment-service, analytics-service
    • RDS Postgres instead of local Postgres container
    • (Optional) Couchbase Capella / another managed NoSQL if you want to mirror Couchbase
  • GitHub:

    • Code in main branch
    • Docker images pushed to GitHub Container Registry (ghcr.io/…)
  • GitHub Actions:

    • Job 1: build & push images
    • Job 2: run Terraform to deploy/update infra

2. Extended repo structure

Let’s extend your repo structure like this:

kafka-enterprise-orders/
├── docker-compose.yml          # local dev (unchanged)
├── .env                        # local dev config
├── db/
│   └── init.sql
├── connect/
│   ├── Dockerfile
│   └── connectors/
│       ├── jdbc-source.json
│       └── couchbase-sink.json
├── ksql/
│   └── streams.sql
├── producer/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── order_producer.py
├── consumers/
    ├── fraud-service/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── fraud_consumer.py
    ├── payment-service/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── payment_consumer.py
    └── analytics-service/
        ├── Dockerfile
        ├── requirements.txt
        └── analytics_consumer.py

# NEW:
├── infra/
│   └── terraform/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── ecs.tf          # ECS tasks + services
│       ├── rds.tf          # Postgres RDS
│       └── vpc.tf          # basic VPC + subnets + security groups
└── .github/
    └── workflows/
        └── ci-cd.yml

3. Terraform basics (providers + backend)

This is a skeleton you can copy-paste and then fill secrets.

infra/terraform/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # For now: local state. Later you can move to S3 + DynamoDB.
}

provider "aws" {
  region = var.aws_region
}
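When you outgrow local state, add a backend block inside the terraform {} block above. A sketch (the S3 bucket and DynamoDB table are placeholders you create beforehand):

  backend "s3" {
    bucket         = "kafka-enterprise-orders-tfstate"
    key            = "prod/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-locks"   # state locking
    encrypt        = true
  }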

infra/terraform/variables.tf

variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-2"
}

variable "project_name" {
  description = "Base name for resources"
  type        = string
  default     = "kafka-enterprise-orders"
}

variable "container_image_producer" {
  description = "Docker image for order producer"
  type        = string
}

variable "container_image_fraud" {
  description = "Docker image for fraud service"
  type        = string
}

variable "container_image_payment" {
  description = "Docker image for payment service"
  type        = string
}

variable "container_image_analytics" {
  description = "Docker image for analytics service"
  type        = string
}

variable "confluent_bootstrap_servers" {
  description = "Confluent Cloud bootstrap servers"
  type        = string
}

variable "confluent_api_key" {
  description = "Confluent Cloud API key"
  type        = string
  sensitive   = true
}

variable "confluent_api_secret" {
  description = "Confluent Cloud API secret"
  type        = string
  sensitive   = true
}

variable "rds_username" {
  description = "RDS master username"
  type        = string
  default     = "orders_user"
}

variable "rds_password" {
  description = "RDS master password"
  type        = string
  sensitive   = true
}

infra/terraform/outputs.tf

output "ecs_cluster_name" {
  value = aws_ecs_cluster.app_cluster.name
}

output "rds_endpoint" {
  value = aws_db_instance.orders_db.address
}

4. VPC + networking (simplified)

infra/terraform/vpc.tf (simple but fine for teaching)

resource "aws_vpc" "main" {
  cidr_block           = "10.20.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-vpc"
  }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}-igw"
  }
}

resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.20.1.0/24"
  availability_zone       = "${var.aws_region}a"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-a"
  }
}

resource "aws_subnet" "public_b" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.20.2.0/24"
  availability_zone       = "${var.aws_region}b"
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-b"
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = {
    Name = "${var.project_name}-public-rt"
  }
}

resource "aws_route_table_association" "public_a" {
  subnet_id      = aws_subnet.public_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public_b" {
  subnet_id      = aws_subnet.public_b.id
  route_table_id = aws_route_table.public.id
}

resource "aws_security_group" "ecs_tasks" {
  name        = "${var.project_name}-ecs-sg"
  description = "Allow HTTP egress for ECS tasks"
  vpc_id      = aws_vpc.main.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.project_name}-ecs-sg"
  }
}

(You can later add private subnets + NAT for hardening.)
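
For reference, the hardened layout is mostly a private subnet, a NAT gateway, and a private route table per AZ. A single-AZ sketch (CIDR is an example):

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.20.11.0/24"
  availability_zone = "${var.aws_region}a"
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_a" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_a.id   # the NAT gateway itself lives in a public subnet
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_a.id
  }
}

resource "aws_route_table_association" "private_a" {
  subnet_id      = aws_subnet.private_a.id
  route_table_id = aws_route_table.private.id
}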


5. RDS Postgres (replacing local container)

infra/terraform/rds.tf

resource "aws_db_subnet_group" "orders" {
  name       = "${var.project_name}-db-subnets"
  subnet_ids = [aws_subnet.public_a.id, aws_subnet.public_b.id]

  tags = {
    Name = "${var.project_name}-db-subnets"
  }
}

resource "aws_security_group" "rds" {
  name        = "${var.project_name}-rds-sg"
  description = "Allow Postgres from ECS tasks"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.project_name}-rds-sg"
  }
}

resource "aws_db_instance" "orders_db" {
  identifier              = "${var.project_name}-db"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.t3.micro"
  allocated_storage       = 20
  username                = var.rds_username
  password                = var.rds_password
  db_subnet_group_name    = aws_db_subnet_group.orders.name
  vpc_security_group_ids  = [aws_security_group.rds.id]
  publicly_accessible     = false
  skip_final_snapshot     = true

  tags = {
    Name = "${var.project_name}-db"
  }
}

6. ECS cluster + one example service (pattern)

To keep it readable, I’ll show one ECS task/service (producer). You can copy it for fraud/payment/analytics by changing names & images.

infra/terraform/ecs.tf

resource "aws_ecs_cluster" "app_cluster" {
  name = "${var.project_name}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Name = "${var.project_name}-cluster"
  }
}

resource "aws_iam_role" "ecs_task_execution" {
  name = "${var.project_name}-ecs-task-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution_policy" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role (for future: SSM, Secrets Manager, etc.)
resource "aws_iam_role" "ecs_task_role" {
  name = "${var.project_name}-ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Example: ORDER PRODUCER task definition (Fargate)
resource "aws_ecs_task_definition" "order_producer" {
  family                   = "${var.project_name}-producer"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn

  container_definitions = jsonencode([
    {
      name      = "order-producer"
      image     = var.container_image_producer
      essential = true
      environment = [
        {
          name  = "KAFKA_BOOTSTRAP"
          value = var.confluent_bootstrap_servers
        },
        {
          name  = "CONFLUENT_API_KEY"
          value = var.confluent_api_key
        },
        {
          name  = "CONFLUENT_API_SECRET"
          value = var.confluent_api_secret
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = "/ecs/${var.project_name}-producer"
          awslogs-region        = var.aws_region
          awslogs-stream-prefix = "ecs"
        }
      }
    }
  ])

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "X86_64"
  }
}

resource "aws_cloudwatch_log_group" "producer" {
  name              = "/ecs/${var.project_name}-producer"
  retention_in_days = 7
}

resource "aws_ecs_service" "order_producer" {
  name            = "${var.project_name}-producer-svc"
  cluster         = aws_ecs_cluster.app_cluster.id
  task_definition = aws_ecs_task_definition.order_producer.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = [aws_subnet.public_a.id, aws_subnet.public_b.id]
    security_groups = [aws_security_group.ecs_tasks.id]
    assign_public_ip = true
  }

  lifecycle {
    ignore_changes = [task_definition]
  }
}

For fraud-service, payment-service, analytics-service, create similar aws_ecs_task_definition + aws_ecs_service blocks using var.container_image_fraud, etc., and pass RDS endpoint as env var for analytics-service.
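
For example, the analytics task's environment list could gain entries like these (the variable names are assumptions; use whatever your consumer actually reads):

        {
          name  = "POSTGRES_HOST"
          value = aws_db_instance.orders_db.address
        },
        {
          name  = "POSTGRES_USER"
          value = var.rds_username
        }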


7. GitHub Actions CI/CD: build images + terraform

Now the fun DevOps part 🚀

.github/workflows/ci-cd.yml

name: CI-CD

on:
  push:
    branches: [ "main" ]
  pull_request:

env:
  AWS_REGION: us-east-2
  TF_WORKING_DIR: infra/terraform
  GHCR_REGISTRY: ghcr.io
  PROJECT_NAME: kafka-enterprise-orders

jobs:
  build-and-push:
    runs-on: ubuntu-latest

    # expose the tags set in the "Export image tags" step to the terraform job
    outputs:
      producer: ${{ steps.images.outputs.producer }}
      fraud: ${{ steps.images.outputs.fraud }}
      payment: ${{ steps.images.outputs.payment }}
      analytics: ${{ steps.images.outputs.analytics }}

    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.GHCR_REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build & push order-producer image
        uses: docker/build-push-action@v6
        with:
          context: ./producer
          push: true
          tags: |
            ${{ env.GHCR_REGISTRY }}/${{ github.repository }}/order-producer:latest

      - name: Build & push fraud-service image
        uses: docker/build-push-action@v6
        with:
          context: ./consumers/fraud-service
          push: true
          tags: |
            ${{ env.GHCR_REGISTRY }}/${{ github.repository }}/fraud-service:latest

      - name: Build & push payment-service image
        uses: docker/build-push-action@v6
        with:
          context: ./consumers/payment-service
          push: true
          tags: |
            ${{ env.GHCR_REGISTRY }}/${{ github.repository }}/payment-service:latest

      - name: Build & push analytics-service image
        uses: docker/build-push-action@v6
        with:
          context: ./consumers/analytics-service
          push: true
          tags: |
            ${{ env.GHCR_REGISTRY }}/${{ github.repository }}/analytics-service:latest

      - name: Export image tags
        id: images
        run: |
          echo "producer=${{ env.GHCR_REGISTRY }}/${{ github.repository }}/order-producer:latest" >> $GITHUB_OUTPUT
          echo "fraud=${{ env.GHCR_REGISTRY }}/${{ github.repository }}/fraud-service:latest" >> $GITHUB_OUTPUT
          echo "payment=${{ env.GHCR_REGISTRY }}/${{ github.repository }}/payment-service:latest" >> $GITHUB_OUTPUT
          echo "analytics=${{ env.GHCR_REGISTRY }}/${{ github.repository }}/analytics-service:latest" >> $GITHUB_OUTPUT

  terraform:
    needs: build-and-push
    runs-on: ubuntu-latest

    permissions:
      id-token: write
      contents: read

    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.5

      - name: Terraform Init
        working-directory: ${{ env.TF_WORKING_DIR }}
        run: terraform init

      - name: Terraform Plan
        working-directory: ${{ env.TF_WORKING_DIR }}
        env:
          TF_VAR_aws_region: ${{ env.AWS_REGION }}
          TF_VAR_container_image_producer: ${{ needs.build-and-push.outputs.producer }}
          TF_VAR_container_image_fraud: ${{ needs.build-and-push.outputs.fraud }}
          TF_VAR_container_image_payment: ${{ needs.build-and-push.outputs.payment }}
          TF_VAR_container_image_analytics: ${{ needs.build-and-push.outputs.analytics }}
          TF_VAR_confluent_bootstrap_servers: ${{ secrets.CONFLUENT_BOOTSTRAP_SERVERS }}
          TF_VAR_confluent_api_key: ${{ secrets.CONFLUENT_API_KEY }}
          TF_VAR_confluent_api_secret: ${{ secrets.CONFLUENT_API_SECRET }}
          TF_VAR_rds_password: ${{ secrets.RDS_PASSWORD }}
        run: terraform plan -out=tfplan

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        working-directory: ${{ env.TF_WORKING_DIR }}
        env:
          TF_VAR_aws_region: ${{ env.AWS_REGION }}
          TF_VAR_container_image_producer: ${{ needs.build-and-push.outputs.producer }}
          TF_VAR_container_image_fraud: ${{ needs.build-and-push.outputs.fraud }}
          TF_VAR_container_image_payment: ${{ needs.build-and-push.outputs.payment }}
          TF_VAR_container_image_analytics: ${{ needs.build-and-push.outputs.analytics }}
          TF_VAR_confluent_bootstrap_servers: ${{ secrets.CONFLUENT_BOOTSTRAP_SERVERS }}
          TF_VAR_confluent_api_key: ${{ secrets.CONFLUENT_API_KEY }}
          TF_VAR_confluent_api_secret: ${{ secrets.CONFLUENT_API_SECRET }}
          TF_VAR_rds_password: ${{ secrets.RDS_PASSWORD }}
        run: terraform apply -auto-approve tfplan

In GitHub repo settings, you’ll need secrets:

  • AWS_ROLE_ARN (IAM role for GitHub OIDC)
  • CONFLUENT_BOOTSTRAP_SERVERS
  • CONFLUENT_API_KEY
  • CONFLUENT_API_SECRET
  • RDS_PASSWORD
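
You can add them in Settings → Secrets and variables → Actions, or with the GitHub CLI (values below are placeholders):

gh secret set CONFLUENT_BOOTSTRAP_SERVERS --body "pkc-xxxxx.us-east-2.aws.confluent.cloud:9092"
gh secret set CONFLUENT_API_KEY --body "xxxx"
gh secret set RDS_PASSWORD --body "xxxx"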

8. “What if we harden it – is it too hard?”

It’s not “too hard”, it’s just many small best practices. You can do them gradually:

Short hardening checklist:

  1. Networking
  • Move ECS tasks to private subnets
  • Use NAT gateway for outbound internet
  • Keep RDS in private subnets (already not public)

  2. Secrets
  • Use AWS Secrets Manager for Confluent API key/secret & DB password
  • Inject via ECS task’s secrets instead of plain env vars

  3. IAM
  • GitHub Actions → OIDC → IAM role (already in the workflow)
  • Least privilege policies for ECS task role

  4. Observability
  • CloudWatch logs already configured for ECS tasks
  • Add metrics & alarms (CPU, memory, RDS connections)

  5. Confluent side
  • Use SASL/SSL (Confluent Cloud already uses TLS)
  • Restrict network access using IP allowlists or VPC peering if needed

So: no, it’s not too hard – it’s just the next steps after you’re comfortable with the basic deployment.

Build Public Website (FastAPI backend + React frontend)

1. Create a new folder in your repo

kafka-enterprise-orders/
└── web/
    ├── backend/
    │   ├── Dockerfile
    │   ├── app.py
    │   └── requirements.txt
    └── frontend/
        ├── Dockerfile
        ├── nginx.conf
        └── build/ (generated)

1️⃣ BACKEND (FastAPI)

web/backend/app.py

from fastapi import FastAPI
import os
from datetime import timedelta

from couchbase.cluster import Cluster
from couchbase.auth import PasswordAuthenticator
from couchbase.options import ClusterOptions, ClusterTimeoutOptions

app = FastAPI()

COUCHBASE_HOST = os.getenv("COUCHBASE_HOST")
COUCHBASE_BUCKET = os.getenv("COUCHBASE_BUCKET")
COUCHBASE_USER = os.getenv("COUCHBASE_USERNAME")
COUCHBASE_PASS = os.getenv("COUCHBASE_PASSWORD")

@app.get("/api/analytics")
def get_analytics():
    try:
        cluster = Cluster(
            f"couchbase://{COUCHBASE_HOST}",
            ClusterOptions(
                PasswordAuthenticator(COUCHBASE_USER, COUCHBASE_PASS),
                timeout_options=ClusterTimeoutOptions(kv_timeout=timedelta(seconds=5))
            )
        )
        bucket = cluster.bucket(COUCHBASE_BUCKET)
        collection = bucket.default_collection()

        # N1QL query: fetch 10 documents (add an ORDER BY on a timestamp for a true "last 10")
        result = cluster.query(
            f"SELECT * FROM `{COUCHBASE_BUCKET}` LIMIT 10;"
        )

        orders = [row for row in result]
        return {"status":"ok", "orders":orders}

    except Exception as e:
        return {"error": str(e)}

web/backend/requirements.txt

fastapi
uvicorn
couchbase==4.3.0
psycopg2-binary

web/backend/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
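A quick local sanity check of the backend (the env values are placeholders; they must match your Couchbase setup):

cd web/backend
pip install -r requirements.txt
export COUCHBASE_HOST=localhost COUCHBASE_BUCKET=order_analytics
export COUCHBASE_USERNAME=Administrator COUCHBASE_PASSWORD=password
uvicorn app:app --port 8000
# in another terminal:
curl http://localhost:8000/api/analytics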

2️⃣ FRONTEND (React)

You can scaffold the frontend with Create React App:

npx create-react-app web-frontend

Replace src/App.js:

import React, { useEffect, useState } from "react";

function App() {
  const [orders, setOrders] = useState([]);

  useEffect(() => {
    fetch("/api/analytics")
      .then(res => res.json())
      .then(data => {
        setOrders(data.orders || []);
      })
  }, []);

  return (
    <div style={{padding: "20px"}}>
      <h1>Kafka Enterprise Orders Dashboard</h1>
      <h3>Last 10 orders</h3>
      <table border="1">
        <thead>
          <tr>
            <th>Order ID</th>
            <th>Amount</th>
            <th>Country</th>
          </tr>
        </thead>
        <tbody>
          {orders.map((o, idx) => (
            <tr key={idx}>
              <td>{o.order.order_id}</td>
              <td>{o.order.amount}</td>
              <td>{o.order.country}</td>
            </tr>
          ))}
        </tbody>
      </table>
    </div>
  );
}

export default App;

3️⃣ Frontend Dockerfile (Production)

web/frontend/Dockerfile

FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

FROM nginx:latest
COPY nginx.conf /etc/nginx/nginx.conf
COPY --from=build /app/build /usr/share/nginx/html

web/frontend/nginx.conf

events {}

http {
  server {
    listen 80;

    location / {
      root /usr/share/nginx/html;
      try_files $uri /index.html;
    }

    location /api/ {
      proxy_pass http://backend:8000/;
    }
  }
}

Now your UI calls your FastAPI backend.
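
If you want to test both containers before Kubernetes, a minimal fragment under services: in docker-compose.prod.yml could look like this (paths and Couchbase values are assumptions; adjust to your layout):

  backend:
    build: ../web/backend          # path relative to the compose file
    environment:
      COUCHBASE_HOST: couchbase    # the Couchbase container/service name
      COUCHBASE_BUCKET: order_analytics
      COUCHBASE_USERNAME: Administrator
      COUCHBASE_PASSWORD: password

  frontend:
    build: ../web/frontend
    ports:
      - "8080:80"                  # browse http://localhost:8080

The service name backend must match the proxy_pass host in web/frontend/nginx.conf.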


✔ PART B — Deploy Website on Kubernetes using Helm + ArgoCD

New folder:

k8s/
└── charts/
    └── webapp/
         ├── Chart.yaml
         ├── values.yaml
         ├── templates/
             ├── deployment.yaml
             ├── service.yaml
             ├── ingress.yaml

1️⃣ Chart.yaml

apiVersion: v2
name: webapp
version: 0.1.0

2️⃣ values.yaml

frontendImage: "ghcr.io/aisalkyn85/kafka-enterprise-orders/web-frontend:latest"
backendImage: "ghcr.io/aisalkyn85/kafka-enterprise-orders/web-backend:latest"

domain: "orders.jumptotech.dev"

couchbase:
  host: "couchbase.your-domain"
  bucket: "order_analytics"
  username: "Administrator"
  password: "password"

3️⃣ deployment.yaml

This creates a Deployment with 2 replicas; each pod runs two containers: frontend + backend.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: backend
          image: {{ .Values.backendImage }}
          env:
            - name: COUCHBASE_HOST
              value: {{ .Values.couchbase.host }}
            - name: COUCHBASE_BUCKET
              value: {{ .Values.couchbase.bucket }}
            - name: COUCHBASE_USERNAME
              value: {{ .Values.couchbase.username }}
            - name: COUCHBASE_PASSWORD
              value: {{ .Values.couchbase.password }}
          ports:
            - containerPort: 8000

        - name: frontend
          image: {{ .Values.frontendImage }}
          ports:
            - containerPort: 80

4️⃣ service.yaml

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: ClusterIP
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 80
      name: http

5️⃣ ingress.yaml

This exposes:
https://orders.jumptotech.dev

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
spec:
  rules:
    - host: {{ .Values.domain }}
      http:
        paths:
          - backend:
              service:
                name: webapp
                port:
                  number: 80
            path: /
            pathType: Prefix
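ArgoCD will deploy this chart in the next part, but you can check that it renders correctly first:

helm template webapp k8s/charts/webapp                       # render manifests locally
helm upgrade --install webapp k8s/charts/webapp \
  -n webapp --create-namespace                               # optional manual install for testing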

✔ PART C — ArgoCD config

Create folder:

argocd/
└── webapp.yaml

argocd/webapp.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp
  namespace: argocd
spec:
  project: default

  source:
    repoURL: "https://github.com/Aisalkyn85/kafka-enterprise-orders"
    path: k8s/charts/webapp
    targetRevision: main

  destination:
    server: https://kubernetes.default.svc
    namespace: webapp

  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Apply:

kubectl apply -f argocd/webapp.yaml

✔ FINAL RESULT

When everything is deployed:

The client opens browser:

https://orders.jumptotech.dev

They see:

✔ Live analytics
✔ Last orders
✔ Fraud detection counts
✔ Dashboard updated in real time

All powered by:

  • Confluent Cloud
  • ECS Fargate microservices
  • Couchbase
  • EKS frontend/backend
  • ArgoCD continuous delivery
  • TLS HTTPS with AWS ACM
  • ALB Ingress

This is real production.


✅ PART 4 — PRODUCTION RELEASE ARCHITECTURE

kafka-enterprise-orders/
  infra/terraform/
  web/
     backend/
     frontend/
  k8s/charts/webapp/
  argocd/
  .github/workflows/
  ...

Add the missing production components in the following layers:


🔵 LAYER 1 — Production Secrets (Secrets Manager)

❗Why?

Right now your Confluent API key/secret + RDS password are coming from GitHub Secrets → Terraform → ECS env vars.

This is NOT production-safe.
Secrets must be stored in AWS Secrets Manager, encrypted, rotated, and accessed only by ECS tasks at runtime.

📍Where (file location)?

Create a new file:

infra/terraform/secrets.tf

📄 What to put into secrets.tf:

resource "aws_secretsmanager_secret" "confluent_api" {
  name = "${var.project_name}-confluent-api"
}

resource "aws_secretsmanager_secret_version" "confluent_api_v1" {
  secret_id = aws_secretsmanager_secret.confluent_api.id
  secret_string = jsonencode({
    api_key    = var.confluent_api_key
    api_secret = var.confluent_api_secret
  })
}

resource "aws_secretsmanager_secret" "rds_credentials" {
  name = "${var.project_name}-rds-credentials"
}

resource "aws_secretsmanager_secret_version" "rds_credentials_v1" {
  secret_id = aws_secretsmanager_secret.rds_credentials.id
  secret_string = jsonencode({
    username = var.rds_username
    password = var.rds_password
  })
}

🎯 Result

  • Sensitive values no longer live in plain ECS environment variables.
  • ECS tasks retrieve the secrets securely at runtime once the task definition references them (see the sketch below).
  • You become compliant with production security standards.
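
To actually wire them in, swap the plain environment entries in the container definition for a secrets block along these lines (a sketch; the task execution role also needs secretsmanager:GetSecretValue on these ARNs):

      secrets = [
        {
          name      = "CONFLUENT_API_KEY"
          valueFrom = "${aws_secretsmanager_secret.confluent_api.arn}:api_key::"
        },
        {
          name      = "CONFLUENT_API_SECRET"
          valueFrom = "${aws_secretsmanager_secret.confluent_api.arn}:api_secret::"
        }
      ]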

🔵 LAYER 2 — ECS Autoscaling (Load-based scaling)

❗Why?

Right now your ECS services run fixed number of tasks (1).
Production needs auto-scaling based on CPU/memory to handle high load.

📍Where?

Add file:

infra/terraform/autoscaling_ecs.tf

📄 Content:

resource "aws_appautoscaling_target" "ecs_service" {
  max_capacity       = 5
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.app_cluster.name}/${aws_ecs_service.order_producer.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_service_cpu" {
  name               = "${var.project_name}-ecs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60
  }
}

🎯 Result

  • ECS tasks scale from 1 → 5 automatically.
  • No downtime during traffic spikes.
  • Real production reliability.

🔵 LAYER 3 — WAF + ALB Security

❗Why?

You run a public website using:

k8s/charts/webapp/ingress.yaml

This creates an internet-facing ALB.

In production you must protect the ALB with:

  • SQL injection protection
  • XSS protection
  • Bot control
  • Rate limiting

📍Where?

Create:

infra/terraform/waf.tf

📄 Content:

resource "aws_wafv2_web_acl" "webapp" {
  name  = "${var.project_name}-waf"
  scope = "REGIONAL"

  default_action { allow {} }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "waf"
    sampled_requests_enabled   = true
  }

  rule {
    name     = "AWS-Common"
    priority = 1
    override_action { none {} }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "common"
      sampled_requests_enabled   = true
    }
  }
}
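Creating the web ACL alone doesn't protect anything; it still has to be attached to the ALB. Since the ALB here is created by the AWS Load Balancer Controller, the usual way is an extra annotation on the Ingress (the ARN below is a placeholder for your Terraform output):

    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-2:111122223333:regional/webacl/kafka-enterprise-orders-waf/xxxx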

🎯 Result

  • Once the ACL is attached to the ALB (see the annotation above), your public website is protected from common attacks.
  • You look like a real cloud engineer in interviews.
  • Production-grade security.

🔵 LAYER 4 — Observability (CloudWatch Alarms)

❗Why?

Your project processes:

  • Kafka events
  • Fraud events
  • Payments
  • Analytics

In production you must receive alerts when something breaks.

📍Where?

Create file:

infra/terraform/monitoring.tf

📄 Content (example):

resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "${var.project_name}-rds-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  evaluation_periods  = 2
  period              = 60
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  statistic           = "Average"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.orders_db.id
  }
}
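Alarms are only useful if they notify someone. A minimal addition is an SNS topic wired into alarm_actions (the email address is a placeholder):

resource "aws_sns_topic" "alerts" {
  name = "${var.project_name}-alerts"
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "you@example.com"   # confirm the subscription from your inbox
}

# then, on each alarm:
#   alarm_actions = [aws_sns_topic.alerts.arn]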

🎯 Result

  • You see alerts for RDS CPU, ECS CPU, etc.
  • You behave like true SRE/DevOps.

🔵 LAYER 5 — Health Checks (Kubernetes Probes)

❗Why?

ALB + Kubernetes must know when a pod is unhealthy.

Your backend already has FastAPI.
Add a health endpoint.


📍Where?

In:

web/backend/app.py

✏ Add:

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

📍Then edit deployment:

k8s/charts/webapp/templates/deployment.yaml

Add for backend:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000

🎯 Result

  • Kubernetes restarts unhealthy pods.
  • ALB removes bad pods from rotation.

🔵 LAYER 6 — CI/CD Full Automation (already working)

Your pipeline already does:

✔ Build → Push → Terraform Apply
✔ Build Web → Push
✔ Auto-deploy website
✔ ArgoCD syncs Kubernetes chart

This is exactly production-level CI/CD.


🟢 FINAL RESULT — WHAT YOU HAVE NOW

After implementing the above:

✔ Your whole system becomes true enterprise-grade

Production-grade Security

  • WAF protects your public ALB
  • Secrets Manager protects passwords
  • IAM least privilege

Production-grade Reliability

  • ECS autoscaling based on CPU
  • Kubernetes health probes
  • RDS alarms
  • ECS alarms

Production-grade Delivery (CI/CD)

  • GitHub Actions builds + deploys everything
  • Terraform provisions infra
  • ArgoCD deploys K8s continuously

Production-grade Architecture

  • Kafka microservices
  • RDS/Postgres
  • Couchbase analytics
  • Web frontend + FastAPI backend
  • Public ALB with domain + HTTPS
  • Multi-layer cloud security

✅ PHASE 5 — Add Prometheus + Grafana to Existing ECS Project (Production)

✔ AWS Managed Prometheus (AMP)

✔ AWS Managed Grafana (AMG)

✔ AWS Distro for OpenTelemetry Collector (sidecar in ECS tasks)

This avoids self-managing Prometheus/Grafana containers, which you would otherwise have to patch, scale, and back up yourself.


🎯 WHAT YOU WILL ADD (3 folders + 3 Terraform files)

You only add these new folders to your repo:

infra/terraform/monitoring/
  ├── amp.tf
  ├── grafana.tf
  ├── otel.tf
ansible/
  ├── inventories/
  ├── roles/
  └── playbooks/
otel/
  └── otel-config.yaml

✔ 1. Create monitoring/amp.tf

This creates AWS Managed Prometheus workspace.

resource "aws_prometheus_workspace" "main" {
  alias = "${var.project_name}-amp"
}

✔ 2. Create monitoring/grafana.tf

This creates AWS Managed Grafana with IAM role that can query AMP + CloudWatch.

data "aws_iam_policy_document" "grafana_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["grafana.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "grafana_role" {
  name               = "${var.project_name}-grafana-role"
  assume_role_policy = data.aws_iam_policy_document.grafana_assume.json
}

data "aws_iam_policy_document" "grafana_policy" {
  statement {
    effect = "Allow"
    actions = [
      "aps:QueryMetrics",
      "aps:GetSeries",
      "aps:GetLabels",
      "aps:GetMetricMetadata"
    ]
    resources = [
      aws_prometheus_workspace.main.arn,
      "${aws_prometheus_workspace.main.arn}/*"
    ]
  }

  statement {
    effect = "Allow"
    actions = [
      "cloudwatch:GetMetricData",
      "cloudwatch:ListMetrics",
      "logs:GetLogEvents",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams"
    ]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "grafana_policy" {
  name   = "${var.project_name}-grafana-policy"
  policy = data.aws_iam_policy_document.grafana_policy.json
}

resource "aws_iam_role_policy_attachment" "grafana_attach" {
  role       = aws_iam_role.grafana_role.name
  policy_arn = aws_iam_policy.grafana_policy.arn
}

resource "aws_grafana_workspace" "main" {
  name                     = "${var.project_name}-grafana"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"

  role_arn = aws_iam_role.grafana_role.arn

  data_sources = [
    "CLOUDWATCH",
    "PROMETHEUS"
  ]
}

✔ 3. Create monitoring/otel.tf

This adds an OpenTelemetry Collector sidecar to your ECS microservices.

Update your existing ECS task definitions to include the sidecar; the order producer is shown here (repeat the pattern for the other services):

resource "aws_ecs_task_definition" "order_producer" {
  family                   = "order-producer"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  task_role_arn      = aws_iam_role.ecs_task_role.arn
  execution_role_arn = aws_iam_role.ecs_task_execution.arn   # matches the role defined in ecs.tf

  container_definitions = jsonencode([
    {
      name      = "order-producer"
      image     = var.container_image_producer
      essential = true
      portMappings = [{ containerPort = 8000 }]
      environment = [
        { name = "METRICS_PORT", value = "8000" }
      ]
    },
    {
      name      = "otel-collector"
      image     = "public.ecr.aws/aws-observability/aws-otel-collector:latest"
      essential = true
      command   = ["--config=/etc/ecs/otel-config.yaml"]
      environment = [
        { name = "AWS_REGION", value = var.aws_region },
        { name = "AMP_ENDPOINT", value = aws_prometheus_workspace.main.prometheus_endpoint }
      ]
      mountPoints = [{
        sourceVolume  = "otel-config"
        containerPath = "/etc/ecs"
      }]
    }
  ])

  # NOTE: docker_volume_configuration is not supported on Fargate; in practice the
  # collector config is usually supplied via an SSM parameter or the AOT_CONFIG_CONTENT
  # environment variable instead of a task volume.
  volume {
    name = "otel-config"
    docker_volume_configuration {
      autoprovision = true
      scope         = "task"
    }
  }
}

✔ 4. Create folder otel/otel-config.yaml

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ecs-metrics"
          static_configs:
            - targets: ["localhost:8000"]

exporters:
  awsprometheusremotewrite:
    # remote-write URL = AMP workspace endpoint + api/v1/remote_write
    endpoint: "${AMP_ENDPOINT}api/v1/remote_write"
    aws_auth:
      region: "${AWS_REGION}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [awsprometheusremotewrite]

✔ 5. Add variables in variables.tf

variable "project_name" {
  type    = string
  default = "kafka-enterprise-orders"
}

variable "aws_region" {
  type    = string
  default = "us-east-2"
}

✔ 6. Add output to outputs.tf

output "prometheus_workspace" {
  value = aws_prometheus_workspace.main.arn
}

output "grafana_url" {
  value = aws_grafana_workspace.main.endpoint
}

#️⃣ PART 2 — Add ANSIBLE to Your Project (Production Ready)

Create folder:

ansible/
  ├── inventories/prod
  ├── roles/
  │   └── bastion/
  │       └── tasks/main.yml
  └── playbooks/setup-bastion.yml

ansible/inventories/prod

[bastion]
your-ec2-bastion-ip

roles/bastion/tasks/main.yml

- name: Install tools for debugging Kafka ecosystem
  apt:
    name:
      - awscli
      - python3-pip
      - jq
      - htop
      - kafkacat        # packaged as 'kcat' on newer Ubuntu releases
    state: present
    update_cache: yes

playbooks/setup-bastion.yml

- hosts: bastion
  become: yes
  roles:
    - bastion

Run it with:

ansible-playbook -i inventories/prod playbooks/setup-bastion.yml

🎉 YOU NOW HAVE FULL PRODUCTION MONITORING

After applying:

✔ Metrics automatically flow from your ECS microservices → AMP

✔ Grafana automatically reads AMP via AWS IAM

✔ ECS tasks automatically export /metrics

✔ Ansible manages EC2s (bastion, tools, monitoring agents)

Top comments (1)

Sawyer Wolfe

This is a fantastic end‑to‑end lab idea. I'm curious: what are your next steps—will you add more realistic failure scenarios, security hardening, or multi-cluster replication?