In today's cloud-native world, observability is no longer a luxury; it's a necessity. Whether you're a startup deploying fast or an enterprise managing hundreds of services, monitoring and logging infrastructure forms the backbone of system reliability and performance.
This article documents how I designed and automated a real-world cloud monitoring and logging infrastructure using Terraform, Ansible, Prometheus, Grafana, Loki, Fluent Bit, and Alertmanager, all deployed on AWS EC2 with logs stored remotely in S3.
Why I Built This Project
I've worked on cloud deployments before, but I wanted something beyond the usual hello-world stack. I wanted a setup that would:
- Mirror what's used in real production environments
- Be modular and repeatable (IaC-first mindset)
- Handle metrics, logs, and alerts in a centralized and automated way
- Include secure public access to dashboards and endpoints via HTTPS
This project was not only an exercise in learning and problem-solving, but also a step toward building production-ready DevOps workflows.
Real-World Use Case
Imagine you're managing infrastructure for a SaaS product with multiple microservices. Here's how this stack supports you:
| Concern | Solution Provided |
| --- | --- |
| Detect server failure | Prometheus + Alertmanager alerts to Slack |
| Monitor system metrics | Node Exporter + Grafana dashboards |
| Track app errors in logs | Fluent Bit → Loki → Grafana |
| Long-term log retention | Loki stores chunks in S3 |
| Secure access to tools | Nginx reverse proxy + SSL |
| Infrastructure scalability | Terraform + modular Ansible roles |
Architecture Overview
The project provisions:
- Two EC2 instances:
  - Monitoring Server (Prometheus, Grafana, Loki, Alertmanager, Nginx)
  - Web Server (Fluent Bit, Node Exporter, and static web content)
- S3 bucket for storing Loki logs
- Slack integration via Alertmanager
All services are containerless and run as systemd services for simplicity and transparency (an example unit file is sketched after the diagram below).
Web Server (Node Exporter) ──metrics──▶ Monitoring Server (Prometheus)
Web Server (Fluent Bit) ──logs──▶ Monitoring Server (Loki ─▶ S3 storage backend)
Monitoring Server (Grafana dashboards + Alertmanager) ──alerts──▶ Slack notifications
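To show what the containerless setup looks like in practice, here is a minimal systemd unit of the kind each service runs under. The user, paths, and flags are illustrative assumptions, not the exact units from the repo.

```ini
# Illustrative unit for Prometheus; install paths and flags are assumptions.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```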
Automation Strategy
From the beginning, the goal was to automate everything. I used:
- Terraform: To provision EC2, security groups, IAM roles, and the S3 bucket (a minimal sketch follows this list)
- Ansible: To install, configure, and start all services
- A shell script: To orchestrate the full deployment and handle DNS updates via Namecheap's dynamic DNS API
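The Terraform side boils down to resources along these lines. This is a minimal sketch with placeholder values (AMI, bucket name) rather than the repo's actual modules, and it omits the security-group and IAM wiring.

```hcl
# Minimal sketch of the provisioning layer; values are placeholders.
resource "aws_instance" "monitoring" {
  ami           = "ami-xxxxxxxx" # placeholder AMI
  instance_type = "t3.medium"
  tags          = { Name = "monitoring-server" }
}

resource "aws_s3_bucket" "loki_logs" {
  bucket = "example-loki-logs" # placeholder bucket name
}

output "monitoring_public_ip" {
  value = aws_instance.monitoring.public_ip
}
```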
Everything, from spinning up servers to pushing logs into Grafana, happens with a single command:
./deploy.sh
This deploys Terraform infra, fetches the EC2 IP, updates DNS, waits for propagation, generates Ansible variables, and runs the full Ansible playbook.
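The orchestration looks roughly like the outline below. It is a sketch, assuming hypothetical paths, output names, and Namecheap dynamic DNS parameters, not the script from the repo.

```bash
#!/usr/bin/env bash
# Rough outline of deploy.sh; names, paths, and DDNS parameters are assumptions.
set -euo pipefail

# 1. Provision the infrastructure
terraform -chdir=terraform apply -auto-approve

# 2. Fetch the monitoring server's public IP from Terraform outputs
MONITORING_IP=$(terraform -chdir=terraform output -raw monitoring_public_ip)

# 3. Update the Namecheap dynamic DNS A record (host/domain/password are placeholders)
curl -s "https://dynamicdns.park-your-domain.com/update?host=monitoring&domain=example.com&password=${DDNS_PASSWORD}&ip=${MONITORING_IP}"

# 4. Wait until the record resolves to the new IP before Certbot runs
until dig +short monitoring.example.com | grep -q "${MONITORING_IP}"; do
  sleep 15
done

# 5. Generate Ansible variables from Terraform outputs, then configure everything
python3 scripts/generate_tf_vars.py
ansible-playbook -i inventory.ini site.yml
```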
Challenges Faced
1. DNS & SSL Certbot Errors
Let's Encrypt (via Certbot) expects Nginx to be running before it can verify domain ownership, but Nginx can't start unless the SSL certs exist. Classic chicken-and-egg.
Solution:
I implemented a temporary HTTP-only Nginx config just to obtain the certs, then replaced it with the full reverse proxy config.
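In shell terms, the bootstrap sequence looks something like this. The config file names and domain are assumptions used for illustration, not the project's exact files.

```bash
# 1. Serve a minimal HTTP-only vhost so Nginx can start without certs
cp templates/nginx-http-only.conf /etc/nginx/sites-enabled/monitoring.conf
systemctl reload nginx

# 2. Obtain the certificate while only port 80 is answering
certbot certonly --nginx -d monitoring.example.com \
  --non-interactive --agree-tos -m admin@example.com

# 3. Swap in the full SSL reverse-proxy config that references the new certs
cp templates/nginx-ssl.conf /etc/nginx/sites-enabled/monitoring.conf
systemctl reload nginx
```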
2. Terraform IP Sync with Ansible
After `terraform apply`, the private/public IPs of the EC2 instances change, but Ansible needs those IPs to connect and configure the hosts properly.
Solution:
I wrote a Python script to auto-generate a YAML file (`terraform_outputs.yml`) from `terraform output -json`, which Ansible uses directly.
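The idea is simple enough to sketch; the file paths and working directory below are assumptions, not the exact script.

```python
# Sketch: turn `terraform output -json` into a YAML vars file for Ansible.
import json
import subprocess

import yaml  # PyYAML

raw = subprocess.check_output(["terraform", "output", "-json"], cwd="terraform")
outputs = json.loads(raw)

# Terraform wraps each output as {"value": ..., "type": ..., "sensitive": ...}
flat = {name: data["value"] for name, data in outputs.items()}

with open("ansible/terraform_outputs.yml", "w") as f:
    yaml.safe_dump(flat, f, default_flow_style=False)
```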
3. Loki Failing to Flush Logs to S3
I saw this error:
`PutObjectInput.Bucket: minimum field size of 1`
Turns out, even though the AWS credentials were set, the bucket name was missing in the config due to a YAML templating issue.
Lesson: Always double-check templated variables and file paths in production systems.
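The bucket name lives in Loki's templated storage section, so the problem bites in a fragment along these lines; the variable names and key layout here are illustrative assumptions, not the exact playbook template.

```yaml
# Illustrative fragment of the templated Loki config (rendered by Ansible/Jinja2)
storage_config:
  aws:
    bucketnames: "{{ loki_s3_bucket }}"   # an empty render here produces the error above
    region: "{{ aws_region }}"
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    shared_store: s3
```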
4. Slack Webhook Not Working
Alertmanager showed the alert as firing, but no message came to Slack.
Cause: The webhook URL wasn't rendering correctly in the Alertmanager config due to a misconfigured variable.
Fix: Injected the correct URL into the Ansible vars and validated it with `cat /etc/alertmanager/alertmanager.yml | grep api_url`.
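For reference, the relevant receiver block looks roughly like this once templated; the variable and channel names are assumptions.

```yaml
# Illustrative Alertmanager receiver; values are placeholders templated by Ansible.
route:
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: "{{ slack_webhook_url }}"
        channel: "#alerts"
        send_resolved: true
```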
Dashboards in Action
Once everything was set up, Grafana became the single-pane-of-glass for:
- CPU, memory, disk metrics from Prometheus + Node Exporter
- Log streams from Fluent Bit + Loki, searchable by label
- Alert status overview with firing/resolved alerts
- S3 bucket verification for long-term log persistence
Access it securely at:
https://monitoring.yourdomain.com/grafana/
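Serving Grafana under a sub-path like /grafana/ behind Nginx generally takes two pieces: a proxy location block and Grafana's sub-path settings. A minimal sketch, assuming Grafana listens on localhost:3000:

```nginx
# Illustrative location block; domain and upstream port are assumptions.
location /grafana/ {
    proxy_pass http://localhost:3000/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
}
```

On the Grafana side, setting `root_url` to end in `/grafana/` and enabling `serve_from_sub_path = true` in grafana.ini keeps redirects and asset URLs working behind the proxy.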
Slack Alerts in Action
Prometheus + Alertmanager monitor system health and notify me on Slack (a sample alerting rule is sketched below) when:
- A server goes down (NodeDown)
- Fluent Bit or Loki stop sending logs
- Memory or CPU usage exceeds thresholds
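A NodeDown-style rule, for instance, boils down to something like the following; the job label and timing are assumptions, not the exact rules file.

```yaml
# Illustrative Prometheus alerting rule; job name and thresholds are assumptions.
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for more than 2 minutes"
```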
Key Lessons Learned
- Observability must be baked in from the start, not retrofitted.
- SSL and reverse proxy automation can be tricky; understand the Certbot flow.
- Always validate your YAML configs on target machines.
- The power of infrastructure-as-code is not just provisioning, but repeatability.
What's Next
- [ ] Add autoscaling for web nodes and dynamic target discovery in Prometheus
- [ ] Automate dashboard import in Grafana via JSON API
- [ ] Add multi-user RBAC auth for Grafana
- [ ] Store Prometheus TSDB snapshots in S3
GitHub Repository
Full codebase is available here:
GitHub - Cloud Monitoring & Logging Infrastructure
Final Thoughts
This project solidified my understanding of real-world observability. It goes beyond tutorials into the kind of setup a DevOps engineer would actually implement in a production environment.
Everything, from provisioning and configuration to metrics, logs, dashboards, alerts, HTTPS, and log storage, happens automatically with one script.
If you're building or learning DevOps, I strongly recommend doing a project like this. It forces you to troubleshoot across multiple layers: infrastructure, OS, services, and automation.
Let me know what you think, or if you'd like a walkthrough video or workshop based on this project!