David Oyewole
πŸš€ Building a Fully Automated Cloud Monitoring & Logging Infrastructure on AWS

In today’s cloud-native world, observability is more than a luxury; it's a necessity. Whether you’re a startup deploying fast or an enterprise managing hundreds of services, monitoring and logging infrastructure forms the backbone of system reliability and performance.

This article documents how I designed and automated a real-world cloud monitoring and logging infrastructure using Terraform, Ansible, Prometheus, Grafana, Loki, Fluent Bit, and Alertmanager, all deployed on AWS EC2 with logs stored remotely in S3.


🎯 Why I Built This Project

I’ve worked on cloud deployments before, but I wanted something beyond the usual hello-world stack. I wanted a setup that would:

  • Mirror what’s used in real production environments
  • Be modular and repeatable (IaC-first mindset)
  • Handle metrics, logs, and alerts in a centralized and automated way
  • Include secure public access to dashboards and endpoints via HTTPS

This project was not only an exercise in learning and problem-solving, but also a step toward building production-ready DevOps workflows.


🌍 Real-World Use Case

Imagine you're managing infrastructure for a SaaS product with multiple microservices. Here’s how this stack supports you:

| Concern                     | Solution Provided                         |
|-----------------------------|-------------------------------------------|
| Detect server failure       | Prometheus + Alertmanager alerts to Slack |
| Monitor system metrics      | Node Exporter + Grafana dashboards        |
| Track app errors in logs    | Fluent Bit → Loki → Grafana               |
| Long-term log retention     | Loki stores chunks in S3                  |
| Secure access to tools      | Nginx reverse proxy + SSL                 |
| Infrastructure scalability  | Terraform + modular Ansible roles         |

βš™οΈ Architecture Overview

The project provisions:

  • Two EC2 instances:
    • Monitoring Server (Prometheus, Grafana, Loki, Alertmanager, Nginx)
    • Web Server (Fluent Bit, Node Exporter, and static web content)
  • S3 Bucket for storing Loki logs
  • Slack Integration via Alertmanager

All services are containerless and run as systemd services for simplicity and transparency (a sample unit file is sketched after the diagram below).

```
[Web Server]                          [Monitoring Server]
┌────────────────┐                    ┌────────────────────────────┐
│ Node Exporter  │── metrics ────────▶│ Prometheus                 │
│ Fluent Bit     │── logs ───────────▶│ Loki                       │
└────────────────┘                    │   ↳ S3 storage backend     │
                                      │ Grafana + Alertmanager     │
                                      └────────────────────────────┘
                                                    │
                                                    ▼
                                           Slack Notifications
```
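
Since everything runs directly under systemd rather than in containers, each component gets its own unit file. Below is a minimal sketch of what a Prometheus unit might look like; the user, binary path, and data directory are assumptions for illustration, not the exact files from this project.

```ini
# /etc/systemd/system/prometheus.service -- illustrative sketch; user and paths are assumptions
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

In a setup like this, Ansible would typically template one unit per service, run systemctl daemon-reload, and then enable and start each unit.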

🧠 Automation Strategy

From the beginning, the goal was to automate everything. I used:

  • Terraform: To provision EC2 instances, security groups, IAM roles, and the S3 bucket (a minimal sketch follows this list)
  • Ansible: To install, configure, and start all services
  • A shell script: To orchestrate the full deployment and handle DNS updates via Namecheap's dynamic DNS API
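
To make the Terraform side concrete, here is a heavily trimmed sketch of the kinds of resources involved. Resource names, the AMI variable, instance sizes, and the bucket name are placeholders I've assumed for illustration, not the actual module code.

```hcl
# Illustrative sketch only -- names, AMI variable, sizes, and bucket are placeholders.
variable "ami_id" {
  type = string
}

resource "aws_instance" "monitoring" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  tags          = { Name = "monitoring-server" }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  tags          = { Name = "web-server" }
}

resource "aws_s3_bucket" "loki_logs" {
  bucket = "example-loki-chunks" # placeholder bucket name
}

# Consumed later when generating the Ansible variables file
output "monitoring_public_ip" {
  value = aws_instance.monitoring.public_ip
}

output "web_public_ip" {
  value = aws_instance.web.public_ip
}
```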

Everything, from spinning up servers to pushing logs into Grafana, happens with a single command:

```bash
./deploy.sh
```

This deploys Terraform infra, fetches the EC2 IP, updates DNS, waits for propagation, generates Ansible variables, and runs the full Ansible playbook.
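
The script itself is mostly glue around those steps. A simplified sketch of that orchestration flow might look like the following; the file names, the vars-generation script, and the Namecheap dynamic DNS parameters are assumptions for illustration, so check your own layout and the registrar's API docs for the exact details.

```bash
#!/usr/bin/env bash
# Simplified sketch of the orchestration flow -- file names and DDNS parameters are assumptions.
set -euo pipefail

DOMAIN="yourdomain.com"     # placeholder
DDNS_PASSWORD="changeme"    # placeholder: Namecheap dynamic DNS password

# 1. Provision the infrastructure
terraform -chdir=terraform init -input=false
terraform -chdir=terraform apply -auto-approve

# 2. Read the monitoring server's public IP from Terraform outputs
MONITORING_IP=$(terraform -chdir=terraform output -raw monitoring_public_ip)

# 3. Update the DNS record via Namecheap's dynamic DNS endpoint
curl -fsS "https://dynamicdns.park-your-domain.com/update?host=monitoring&domain=${DOMAIN}&password=${DDNS_PASSWORD}&ip=${MONITORING_IP}"

# 4. Wait for DNS propagation before Certbot tries to validate the domain
sleep 60

# 5. Regenerate the Ansible variables from Terraform outputs
python3 scripts/generate_terraform_vars.py

# 6. Configure both servers end to end
ansible-playbook -i inventory/hosts.ini site.yml
```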


πŸ’₯ Challenges Faced

1. DNS & SSL Certbot Errors

Let’s Encrypt (via Certbot) expects Nginx to be running before it can verify domain ownership, but Nginx can’t start unless the SSL certs exist. Classic chicken-and-egg.

Solution:
I implemented a temporary HTTP-only Nginx config just to obtain the certs, then replaced it with the full reverse proxy config.
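
In practice, that bootstrap config only needs to answer the ACME HTTP-01 challenge on port 80. A minimal sketch (the webroot path is an assumption, and the domain is the one used throughout this post) could look like this:

```nginx
# Temporary HTTP-only config used just to obtain the first certificate.
# Webroot path is an assumed placeholder.
server {
    listen 80;
    server_name monitoring.yourdomain.com;

    # Certbot's webroot plugin drops challenge files here
    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
    }

    location / {
        return 200 'bootstrap';
    }
}
```

Once something like `certbot certonly --webroot -w /var/www/certbot -d monitoring.yourdomain.com` succeeds, the full HTTPS reverse proxy config can reference the issued certificates and replace this file.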


2. Terraform IP Sync with Ansible

After each terraform apply, the public and private IPs of the EC2 instances can change, but Ansible needs those IPs to connect and configure the hosts.

Solution:
I wrote a Python script to auto-generate a YAML file (terraform_outputs.yml) from terraform output -json, which Ansible uses directly.
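
The script essentially boils down to "parse terraform output -json, keep the values, dump YAML". A minimal sketch of that idea is below; the output names and destination path are assumptions, and it relies on PyYAML being installed.

```python
#!/usr/bin/env python3
"""Sketch: turn `terraform output -json` into an Ansible vars file.
Output names and the destination path are illustrative assumptions."""
import json
import subprocess

import yaml  # PyYAML

# Ask Terraform for all outputs as JSON
raw = subprocess.run(
    ["terraform", "output", "-json"],
    cwd="terraform",
    check=True,
    capture_output=True,
    text=True,
).stdout

# Terraform wraps each output as {"value": ..., "type": ..., "sensitive": ...}
outputs = {name: data["value"] for name, data in json.loads(raw).items()}

with open("ansible/vars/terraform_outputs.yml", "w") as fh:
    yaml.safe_dump(outputs, fh, default_flow_style=False)

print(f"Wrote {len(outputs)} Terraform outputs to ansible/vars/terraform_outputs.yml")
```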


3. Loki Failing to Flush Logs to S3

I saw this error:

```
PutObjectInput.Bucket: minimum field size of 1
```

Turns out, even though the AWS credentials were set, the bucket name was missing in the config due to a YAML templating issue.

Lesson: Always double-check templated variables and file paths in production systems.
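
For context, the relevant piece is Loki's S3 storage block. The sketch below is illustrative (exact keys vary between Loki versions, and the variable names are assumptions); the point is that an undefined Jinja2 variable silently renders as an empty string, which produces exactly the error above. Ansible's mandatory filter turns that into a loud failure at template time instead.

```yaml
# loki-config.yml.j2 -- illustrative sketch; keys depend on your Loki version,
# and loki_s3_bucket / aws_region are assumed variable names.
storage_config:
  aws:
    region: "{{ aws_region }}"
    # Without `mandatory`, an undefined variable renders as "" and Loki later
    # fails with "PutObjectInput.Bucket: minimum field size of 1".
    bucketnames: "{{ loki_s3_bucket | mandatory }}"
```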


4. Slack Webhook Not Working

Alertmanager showed the alert as firing, but no message came to Slack.

Cause: The webhook URL wasn’t rendering correctly in the Alertmanager config due to a misconfigured variable.

Fix: I injected the correct URL into the Ansible vars and verified it on the server with grep api_url /etc/alertmanager/alertmanager.yml.
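
For reference, a minimal Slack receiver in Alertmanager looks roughly like this; the channel name, route layout, and variable name are assumptions, and the api_url is the value that was failing to render.

```yaml
# alertmanager.yml -- minimal Slack receiver sketch; channel and route names are assumed.
route:
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: "{{ slack_webhook_url }}"  # must render to the real hooks.slack.com URL
        channel: "#alerts"
        send_resolved: true
```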


πŸ“Š Dashboards in Action

Once everything was set up, Grafana became the single-pane-of-glass for:

  • CPU, memory, disk metrics from Prometheus + Node Exporter
  • Log streams from Fluent Bit + Loki, searchable by label
  • Alert status overview with firing/resolved alerts
  • S3 bucket verification for long-term log persistence

Access it securely at:

```
https://monitoring.yourdomain.com/grafana/
```
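
Serving Grafana under the /grafana/ sub-path means Nginx and Grafana have to agree on it. A rough sketch of the proxy side is below; the certificate paths and upstream port are assumptions, and Grafana's own root_url / serve_from_sub_path settings must match the sub-path.

```nginx
# HTTPS reverse proxy sketch -- cert paths and upstream port are assumptions.
server {
    listen 443 ssl;
    server_name monitoring.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    # Grafana listens on 3000 by default; it should also be configured with
    # root_url = https://monitoring.yourdomain.com/grafana/ and serve_from_sub_path = true
    location /grafana/ {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://127.0.0.1:3000;
    }
}
```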

πŸ“£ Slack Alerts in Action

Prometheus and Alertmanager monitor system health and notify me on Slack (an example alert rule is sketched after this list) when:

  • A server goes down (NodeDown)
  • Fluent Bit or Loki stops sending logs
  • Memory or CPU usage exceeds thresholds
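
As an illustration, a NodeDown-style rule can be as simple as checking the up metric for the Node Exporter job. The job label, thresholds, and timings below are assumptions, not the exact rules shipped in the repo:

```yaml
# alert-rules.yml -- illustrative sketch; job label, thresholds, and timings are assumptions.
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has been unreachable for 2 minutes"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
```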

πŸ’‘ Key Lessons Learned

  • Observability must be baked in from the start, not retrofitted.
  • SSL and reverse proxy automation can be tricky β€” understand the Certbot flow.
  • Always validate your YAML configs on target machines.
  • The power of infrastructure-as-code lies not just in provisioning, but in repeatability.

βœ… What’s Next

  • [ ] Add autoscaling for web nodes and dynamic target discovery in Prometheus
  • [ ] Automate dashboard import in Grafana via JSON API
  • [ ] Add multi-user RBAC auth for Grafana
  • [ ] Store Prometheus TSDB snapshots in S3

πŸ“ GitHub Repository

The full codebase is available here:
πŸ”— GitHub - Cloud Monitoring & Logging Infrastructure


πŸ™Œ Final Thoughts

This project solidified my understanding of real-world observability. It goes beyond tutorials and into what a DevOps engineer would actually implement in a production environment.

Everything (provisioning, configuration, metrics, logs, dashboards, alerts, HTTPS, and log storage) is handled automatically by a single script.

If you're building or learning DevOps, I strongly recommend doing a project like this. It forces you to troubleshoot across multiple layers: infrastructure, OS, services, and automation.

Let me know what you think, or if you’d like a walkthrough video or workshop based on this project!
