In today's cloud-native world, observability is no longer a luxury; it's a necessity. Whether you're a startup deploying fast or an enterprise managing hundreds of services, monitoring and logging infrastructure forms the backbone of system reliability and performance.
This article documents how I designed and automated a real-world cloud monitoring and logging infrastructure using Terraform, Ansible, Prometheus, Grafana, Loki, Fluent Bit, and Alertmanager, all deployed on AWS EC2 with logs stored remotely in S3.
Why I Built This Project
I've worked on cloud deployments before, but I wanted something beyond the usual hello-world stack. I wanted a setup that would:
- Mirror what's used in real production environments
- Be modular and repeatable (IaC-first mindset)
- Handle metrics, logs, and alerts in a centralized and automated way
- Include secure public access to dashboards and endpoints via HTTPS
This project was not only an exercise in learning and problem-solving, but also a step toward building production-ready DevOps workflows.
Real-World Use Case
Imagine you're managing infrastructure for a SaaS product with multiple microservices. Here's how this stack supports you:
| Concern | Solution Provided |
| --- | --- |
| Detect server failure | Prometheus + Alertmanager alerts to Slack |
| Monitor system metrics | Node Exporter + Grafana dashboards |
| Track app errors in logs | Fluent Bit → Loki → Grafana |
| Long-term log retention | Loki stores chunks in S3 |
| Secure access to tools | Nginx reverse proxy + SSL |
| Infrastructure scalability | Terraform + modular Ansible roles |
Architecture Overview
The project provisions:
- Two EC2 instances:
  - Monitoring Server (Prometheus, Grafana, Loki, Alertmanager, Nginx)
  - Web Server (Fluent Bit, Node Exporter, and static web content)
- S3 bucket for storing Loki logs
- Slack integration via Alertmanager
All services are containerless and run as systemd services for simplicity and transparency (an example unit file is sketched after the diagram below).
Web Server (Node Exporter) ──metrics──▶ Monitoring Server (Prometheus)
Web Server (Fluent Bit) ──logs──▶ Monitoring Server (Loki ─▶ S3 storage backend)
Monitoring Server (Grafana dashboards + Alertmanager) ──alerts──▶ Slack notifications
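To show what the containerless setup looks like in practice, here is a minimal systemd unit of the kind each service runs under. The user, paths, and flags are illustrative assumptions, not the exact units from the repo.

```ini
# Illustrative unit for Prometheus; install paths and flags are assumptions.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```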
Automation Strategy
From the beginning, the goal was to automate everything. I used:
- Terraform: To provision EC2, security groups, IAM roles, and the S3 bucket (a minimal sketch follows this list)
- Ansible: To install, configure, and start all services
- A shell script: To orchestrate the full deployment and handle DNS updates via Namecheap's dynamic DNS API
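The Terraform side boils down to resources along these lines. This is a minimal sketch with placeholder values (AMI, bucket name) rather than the repo's actual modules, and it omits the security-group and IAM wiring.

```hcl
# Minimal sketch of the provisioning layer; values are placeholders.
resource "aws_instance" "monitoring" {
  ami           = "ami-xxxxxxxx" # placeholder AMI
  instance_type = "t3.medium"
  tags          = { Name = "monitoring-server" }
}

resource "aws_s3_bucket" "loki_logs" {
  bucket = "example-loki-logs" # placeholder bucket name
}

output "monitoring_public_ip" {
  value = aws_instance.monitoring.public_ip
}
```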
Everything, from spinning up servers to pushing logs into Grafana, happens with a single command:
./deploy.sh
This deploys Terraform infra, fetches the EC2 IP, updates DNS, waits for propagation, generates Ansible variables, and runs the full Ansible playbook.
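The orchestration looks roughly like the outline below. It is a sketch, assuming hypothetical paths, output names, and Namecheap dynamic DNS parameters, not the script from the repo.

```bash
#!/usr/bin/env bash
# Rough outline of deploy.sh; names, paths, and DDNS parameters are assumptions.
set -euo pipefail

# 1. Provision the infrastructure
terraform -chdir=terraform apply -auto-approve

# 2. Fetch the monitoring server's public IP from Terraform outputs
MONITORING_IP=$(terraform -chdir=terraform output -raw monitoring_public_ip)

# 3. Update the Namecheap dynamic DNS A record (host/domain/password are placeholders)
curl -s "https://dynamicdns.park-your-domain.com/update?host=monitoring&domain=example.com&password=${DDNS_PASSWORD}&ip=${MONITORING_IP}"

# 4. Wait until the record resolves to the new IP before Certbot runs
until dig +short monitoring.example.com | grep -q "${MONITORING_IP}"; do
  sleep 15
done

# 5. Generate Ansible variables from Terraform outputs, then configure everything
python3 scripts/generate_tf_vars.py
ansible-playbook -i inventory.ini site.yml
```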
Challenges Faced
1. DNS & SSL Certbot Errors
Let's Encrypt (via Certbot) expects Nginx to be running before it can verify domain ownership, but Nginx can't start unless the SSL certs exist. Classic chicken-and-egg.
Solution:
I implemented a temporary HTTP-only Nginx config just to obtain the certs, then replaced it with the full reverse proxy config.
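In shell terms, the bootstrap sequence looks something like this. The config file names and domain are assumptions used for illustration, not the project's exact files.

```bash
# 1. Serve a minimal HTTP-only vhost so Nginx can start without certs
cp templates/nginx-http-only.conf /etc/nginx/sites-enabled/monitoring.conf
systemctl reload nginx

# 2. Obtain the certificate while only port 80 is answering
certbot certonly --nginx -d monitoring.example.com \
  --non-interactive --agree-tos -m admin@example.com

# 3. Swap in the full SSL reverse-proxy config that references the new certs
cp templates/nginx-ssl.conf /etc/nginx/sites-enabled/monitoring.conf
systemctl reload nginx
```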
2. Terraform IP Sync with Ansible
After `terraform apply`, the private/public IPs of the EC2 instances change, but Ansible needs those IPs to connect and configure the hosts properly.
Solution:
I wrote a Python script to auto-generate a YAML file (`terraform_outputs.yml`) from `terraform output -json`, which Ansible uses directly.
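The idea is simple enough to sketch; the file paths and working directory below are assumptions, not the exact script.

```python
# Sketch: turn `terraform output -json` into a YAML vars file for Ansible.
import json
import subprocess

import yaml  # PyYAML

raw = subprocess.check_output(["terraform", "output", "-json"], cwd="terraform")
outputs = json.loads(raw)

# Terraform wraps each output as {"value": ..., "type": ..., "sensitive": ...}
flat = {name: data["value"] for name, data in outputs.items()}

with open("ansible/terraform_outputs.yml", "w") as f:
    yaml.safe_dump(flat, f, default_flow_style=False)
```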
3. Loki Failing to Flush Logs to S3
I saw this error:
`PutObjectInput.Bucket: minimum field size of 1`
Turns out, even though the AWS credentials were set, the bucket name was missing in the config due to a YAML templating issue.
Lesson: Always double-check templated variables and file paths in production systems.
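The bucket name lives in Loki's templated storage section, so the problem bites in a fragment along these lines; the variable names and key layout here are illustrative assumptions, not the exact playbook template.

```yaml
# Illustrative fragment of the templated Loki config (rendered by Ansible/Jinja2)
storage_config:
  aws:
    bucketnames: "{{ loki_s3_bucket }}"   # an empty render here produces the error above
    region: "{{ aws_region }}"
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    shared_store: s3
```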
4. Slack Webhook Not Working
Alertmanager showed the alert as firing, but no message came to Slack.
Cause: The webhook URL wasn't rendering correctly in the Alertmanager config due to a misconfigured variable.
Fix: Injected the correct URL into the Ansible vars and validated it with `cat /etc/alertmanager/alertmanager.yml | grep api_url`.
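For reference, the relevant receiver block looks roughly like this once templated; the variable and channel names are assumptions.

```yaml
# Illustrative Alertmanager receiver; values are placeholders templated by Ansible.
route:
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: "{{ slack_webhook_url }}"
        channel: "#alerts"
        send_resolved: true
```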
Dashboards in Action
Once everything was set up, Grafana became the single-pane-of-glass for:
- CPU, memory, disk metrics from Prometheus + Node Exporter
- Log streams from Fluent Bit + Loki, searchable by label
- Alert status overview with firing/resolved alerts
- S3 bucket verification for long-term log persistence
Access it securely at:
https://monitoring.yourdomain.com/grafana/
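Serving Grafana under a sub-path like /grafana/ behind Nginx generally takes two pieces: a proxy location block and Grafana's sub-path settings. A minimal sketch, assuming Grafana listens on localhost:3000:

```nginx
# Illustrative location block; domain and upstream port are assumptions.
location /grafana/ {
    proxy_pass http://localhost:3000/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
}
```

On the Grafana side, setting `root_url` to end in `/grafana/` and enabling `serve_from_sub_path = true` in grafana.ini keeps redirects and asset URLs working behind the proxy.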
Slack Alerts in Action
Prometheus + Alertmanager monitor system health and notify me on Slack (a sample alerting rule is sketched below) when:
- A server goes down (NodeDown)
- Fluent Bit or Loki stop sending logs
- Memory or CPU usage exceeds thresholds
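A NodeDown-style rule, for instance, boils down to something like the following; the job label and timing are assumptions, not the exact rules file.

```yaml
# Illustrative Prometheus alerting rule; job name and thresholds are assumptions.
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for more than 2 minutes"
```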
Key Lessons Learned
- Observability must be baked in from the start, not retrofitted.
- SSL and reverse proxy automation can be tricky; understand the Certbot flow.
- Always validate your YAML configs on target machines.
- The power of infrastructure-as-code is not just provisioning, but repeatability.
What's Next
- [ ] Add autoscaling for web nodes and dynamic target discovery in Prometheus
- [ ] Automate dashboard import in Grafana via JSON API
- [ ] Add multi-user RBAC auth for Grafana
- [ ] Store Prometheus TSDB snapshots in S3
GitHub Repository
Full codebase is available here:
GitHub - Cloud Monitoring & Logging Infrastructure
Final Thoughts
This project solidified my understanding of real-world observability. It goes beyond tutorials into the kind of setup a DevOps engineer would actually implement in a production environment.
Everything, from provisioning and configuration to metrics, logs, dashboards, alerts, HTTPS, and log storage, happens automatically with one script.
If you're building or learning DevOps, I strongly recommend doing a project like this. It forces you to troubleshoot across multiple layers: infrastructure, OS, services, and automation.
Let me know what you think, or if you'd like a walkthrough video or workshop based on this project!