Engineering Design Document
MeetMind Observability Platform — V1 Critique and V2 Design
Author: The Duke Airfluke
Audience: Principal Engineer review
Date: June 2026
Executive Summary
The V1 MeetMind observability platform successfully demonstrated a working LGTM-style stack (Loki, Grafana, Tempo, and Prometheus in place of Mimir) deployed as native systemd services on a single EC2 instance, with a reverse SSH tunnel bridging a cross-account monitoring gap. It collected metrics, fired alerts to Slack, and proved the observability pipeline worked end to end.
It is not production-ready.
This document is a brutal critique of V1's architectural decisions and a rigorous design for V2 — a highly available, secure, and scalable observability platform capable of surviving real traffic, real incidents, and real adversaries.
Section 1: V1 Architecture Critique
Brief Overview of V1
V1 deployed nine services — Prometheus, Loki, Tempo, Grafana, Alertmanager, Node Exporter, Blackbox Exporter, Pushgateway, and OpenTelemetry Collector — as systemd units on a single t3.small EC2 instance (2 vCPU, 2GB RAM). A Terraform configuration provisions this instance from scratch. Configs are version-controlled in GitHub. All alerts route to a Slack channel via Alertmanager's webhook integration. Cross-account monitoring is achieved through a reverse SSH tunnel from the application server to the monitoring server, forwarding Node Exporter on port 9100 to port 9101 on the monitoring server.
Exact Breaking Points
1. Single Point of Failure — Everything
The entire observability stack runs on one EC2 instance. When that instance goes down — hardware failure, AWS availability zone outage, accidental terraform destroy, or an OOM kill — you lose:
- All metrics collection
- All alerting
- All dashboards
- All log storage
- All trace storage
Your monitoring stack cannot monitor itself going down. If the monitoring server (3.93.140.221) dies, you find out when a human notices the application is broken, not when an alert fires. This is the most fundamental contradiction in the design.
2. The Reverse SSH Tunnel is a Fragile Bridge
The tunnel between the application server and monitoring server is managed by a single autossh process. During the project, the tunnel dropped multiple times — causing HostDown alerts, SLOAvailabilityFastBurn alerts, and Slack spam. The fix was to increase the for: duration on alerts to tolerate reconnection time.
This means we deliberately made our alerting less sensitive to mask infrastructure fragility. That is the wrong trade-off. The tunnel is also a persistent, pre-authenticated SSH connection between two servers. If either server is compromised, the attacker inherits a ready-made path to the other.
3. Prometheus Has No Storage Redundancy
Prometheus stores metrics on the local disk at /var/lib/prometheus with 30 days of retention. When the instance is terminated by terraform destroy, all historical metrics are lost. There is no remote write, no object storage backup, and no secondary Prometheus. A new monitoring server starts with zero historical data: no SLO compliance history, no DORA trend lines, nothing.
4. No Authentication on Internal Services
Every service runs with no authentication:
- Prometheus UI at :9090 is open to anyone with network access
- Alertmanager UI at :9093 lets anyone create or expire silences, or inject fake alerts
- Pushgateway at :9091 lets anyone push arbitrary metrics and corrupt DORA data
- Grafana defaults to admin/admin, hardcoded in the install script
The security group restricts ports, but security-group-only auth is not authentication. It is network filtering. Any instance in the same VPC, any compromised dependency, or any misconfigured security group rule exposes these services entirely.
5. The Slack Webhook is a Secret Living in a Config File
The Slack webhook URL is stored in /etc/alertmanager/alertmanager.yml on the monitoring server. It was accidentally committed to GitHub twice during the project — once in a commit that had to be rewritten with git push --force. The webhook was never properly rotated after those exposures. This means any attacker with the webhook URL can post arbitrary messages to #all-hng-alerts, injecting fake alerts or silencing real ones by creating confusion.
6. t3.small Will OOM Under Real Load
Loki, Tempo, Prometheus, and Grafana are all memory-intensive. On a t3.small with 2GB RAM:
- Prometheus with 30 days of metrics and multiple scrape targets consumes approximately 800MB–1.2GB
- Loki's ingestion buffer consumes 200–400MB under moderate load
- Grafana with multiple concurrent dashboard users consumes 300–500MB
Total: 1.3GB–2.1GB. The instance has 2GB. Under any sustained load, the OOM killer terminates a service. During the project we saw the stress-ng game day scenario push the instance to the edge. In production, real traffic would do the same.
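The arithmetic behind that total can be checked with a short sketch. The per-service ranges are the estimates listed above; treating 2GB as 2048 MB is an assumption, and the sum leaves out Tempo, Alertmanager, the exporters, the OTel Collector, and the operating system, so real headroom is smaller than shown.

```python
# Per-service memory estimates in MB (low, high), taken from the list above.
ESTIMATES_MB = {
    "prometheus": (800, 1200),
    "loki": (200, 400),
    "grafana": (300, 500),
}
INSTANCE_RAM_MB = 2048  # assumption: t3.small's 2GB treated as 2048 MB


def memory_budget(estimates, ram_mb):
    """Return (low_total, high_total, worst_case_headroom), all in MB."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high, ram_mb - high


low, high, headroom = memory_budget(ESTIMATES_MB, INSTANCE_RAM_MB)
print(f"estimated usage: {low}-{high} MB, worst-case headroom: {headroom} MB")
```

A negative worst-case headroom means the three listed services alone can exceed physical memory before anything else on the host is counted.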
7. No Retention Policy Enforcement
30-day retention is configured but not verified. Loki's compactor runs but is not monitored. If the disk fills before the compactor can reclaim space — a real risk on a 30GB root volume — Loki stops accepting logs, Tempo stops accepting traces, and the entire stack degrades silently.
Security Blind Spots
- No mTLS between any services — all inter-service communication is plaintext on localhost
- No rate limiting on the Pushgateway endpoint — trivially DoS'd or data-poisoned
- SSH private key for the tunnel stored at /var/lib/monitoring/id_rsa with no rotation schedule
- No audit logging: no record of who silenced alerts, who changed Grafana dashboards, or when
- Grafana allowUiUpdates: false prevents dashboard modification but does not prevent API access with the admin password
Section 2: V2 New Features — Fully Designed
Feature 1: Long-Term Metric Storage in S3 (Prometheus → Thanos)
What it does and why it is needed
V1 loses all historical metrics on terraform destroy. V2 introduces a Thanos Sidecar running alongside each Prometheus instance. The Sidecar does not use remote write; it watches the local TSDB directory and uploads each completed 2-hour block to S3. Thanos Query provides a unified query layer across multiple Prometheus instances and S3, enabling dashboards to query metrics spanning months without keeping them in local TSDB.
Architectural Integration
Prometheus (primary) ──TSDB blocks──→ Thanos Sidecar
                                            │
                                            ↓ upload every 2h
                                 S3 Bucket (meetmind-metrics)
                                            │
                                            ↓
                                      Thanos Store
                                            │
                              Thanos Query (unified query API)
                                            │
                                   Grafana datasource
Thanos Ruler takes over alert rule evaluation from Prometheus, ensuring alert evaluation continues even if one Prometheus instance goes down. Alertmanager receives alerts from Thanos Ruler, not directly from Prometheus.
Data Model Changes
No schema changes. Thanos uses the existing Prometheus TSDB format. S3 object structure:
s3://meetmind-metrics/
  prometheus/
    01HXYZ.../    (2-hour compacted block)
      chunks/
      index
      meta.json
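Retention is enforced by the Thanos Compactor's per-resolution retention settings rather than by an S3 lifecycle policy, because a lifecycle rule cannot tell raw blocks from downsampled ones: raw blocks are retained for 1 year, downsampled 5-minute resolution blocks for 3 years.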
Trade-offs
Block upload introduces a lag of up to 2 hours before metrics are queryable from object storage; until then, recent metrics are served from the local Prometheus TSDB through the Sidecar. Thanos Store queries are slower than local Prometheus queries, which is acceptable for historical dashboards but not for alerting, so alert rules must evaluate against the recent data held in Prometheus rather than against S3. Operational complexity increases: Thanos introduces several new components (Sidecar, Store, Query, and Ruler, plus a Compactor if downsampling is used) that must themselves be monitored.
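The split between recent and historical data can be pictured with a small sketch. It is a simplification under the 2-hour assumption stated above: the function name and the "s3" and "prometheus" labels are hypothetical, and the real Thanos Query fans out to every store and relies on the time ranges each one advertises instead of a fixed cut-off.

```python
from datetime import datetime, timedelta, timezone

UPLOAD_LAG = timedelta(hours=2)  # block upload interval stated in this section


def sources_for_range(start, end, now):
    """Return which backends a [start, end] query touches under the 2-hour assumption."""
    if start > end:
        raise ValueError("start must not be after end")
    boundary = now - UPLOAD_LAG
    sources = []
    if start < boundary:
        sources.append("s3")          # old enough to have been uploaded as a block
    if end >= boundary:
        sources.append("prometheus")  # recent data still held only in local TSDB
    return sources


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
print(sources_for_range(now - timedelta(hours=6), now, now))
```

A dashboard covering the last six hours touches both backends, while an alert rule looking back a few minutes stays entirely on local Prometheus data.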
Feature 2: High Availability Alertmanager Cluster
What it does and why it is needed
V1 runs a single Alertmanager. If it crashes between a Prometheus firing event and the Slack webhook delivery, the alert is silently lost. V2 runs three Alertmanager instances in a gossip cluster, each started with --cluster.peer flags pointing at the others. All three share silences and notification state, so notifications are deduplicated across the cluster on a best-effort basis.
Architectural Integration
Prometheus ──→ Alertmanager-1 (port 9093) ┐
           ──→ Alertmanager-2 (port 9094) ├── gossip cluster (port 9094/mesh)
           ──→ Alertmanager-3 (port 9095) ┘
                                          │
                          (deduplicated) one notification
                                          │
                                    Slack webhook
Prometheus sends alerts to all three Alertmanager instances simultaneously. The gossip protocol ensures only one instance sends the Slack notification.
Data Model Changes
No new persistent data. Alertmanager state (silences and the notification log) is replicated across the cluster via gossip. Each instance also persists this state to disk for recovery after restart.
Trade-offs
Three Alertmanager instances on separate hosts cost more. Gossip protocol adds ~5ms of coordination overhead per alert group, which is invisible in practice. Prometheus must be configured with all three instances listed as targets; it must not reach them through a load balancer or DNS round-robin, because deduplication relies on every instance receiving every alert. If the network partitions, two instances may send the same notification once before reconciling.
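As a rough sketch of how the cluster could be wired, the helper below builds the gossip flags for each instance and the target list Prometheus would be given. The hostnames are hypothetical, and the layout assumes one instance per host with the API on 9093 and gossip on 9094; --cluster.listen-address and --cluster.peer are existing Alertmanager flags.

```python
def alertmanager_args(hosts, index, cluster_port=9094):
    """Build gossip flags for the instance at hosts[index], peering with all the others."""
    args = [f"--cluster.listen-address=0.0.0.0:{cluster_port}"]
    for i, host in enumerate(hosts):
        if i != index:
            args.append(f"--cluster.peer={host}:{cluster_port}")
    return args


def prometheus_alertmanager_targets(hosts, api_port=9093):
    """Every instance is listed explicitly so each one receives every alert."""
    return [f"{host}:{api_port}" for host in hosts]


# Hypothetical hostnames, for illustration only.
HOSTS = ["am-1.internal", "am-2.internal", "am-3.internal"]

for idx, name in enumerate(HOSTS):
    print(name, " ".join(alertmanager_args(HOSTS, idx)))
print("prometheus targets:", prometheus_alertmanager_targets(HOSTS))
```

Listing every peer on every instance keeps the cluster forming even when one node is down at startup.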
Feature 3: Secrets Management via AWS Secrets Manager
What it does and why it is needed
V1 stores the Slack webhook URL in a plaintext config file that was committed to GitHub twice. V2 stores all secrets (Slack webhook, Grafana admin password, SSH keys, database credentials) in AWS Secrets Manager. The install script fetches secrets at deploy time using the instance's IAM role. No secret is ever stored in a version-controlled config file or a Git repository.
Architectural Integration
EC2 Instance (monitoring server)
  └── IAM Role: meetmind-monitoring-role
        └── Policy: allow secretsmanager:GetSecretValue
              on arn:aws:secretsmanager:*:*:secret:meetmind/*
install.sh:
  # SecretString holds JSON such as {"url": "..."}, so extract the field with jq
  # (assumes jq is installed on the instance)
  SLACK_WEBHOOK=$(aws secretsmanager get-secret-value \
    --secret-id meetmind/slack-webhook \
    --query SecretString --output text | jq -r '.url')
Secrets are rendered into config files at install time and unset from shell variables immediately afterward. The only on-disk copy is the rendered config file, which is restricted with chmod 640 and owned by the service user.
Data Model Changes
Secrets Manager stores:
meetmind/slack-webhook → {"url": "https://hooks.slack.com/..."}
meetmind/grafana-admin → {"password": "..."}
meetmind/tunnel-ssh-key → {"private_key": "-----BEGIN..."}
meetmind/prometheus-auth → {"username": "prometheus", "password": "..."}
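Because each secret is stored as a small JSON object, the deploy step has to pull out one field and render it into a permission-restricted file. The sketch below shows that step on its own, assuming the SecretString has already been fetched; the helper names and the example URL are hypothetical, while slack_api_url is the existing Alertmanager global setting.

```python
import json
import os


def extract_field(secret_string, field):
    """Parse a JSON SecretString and return a single field from it."""
    payload = json.loads(secret_string)
    if field not in payload:
        raise KeyError(f"secret has no field {field!r}")
    return payload[field]


def render_alertmanager_global(webhook_url):
    """Render only the global block that carries the Slack webhook."""
    return f"global:\n  slack_api_url: '{webhook_url}'\n"


def write_private_file(path, content, mode=0o640):
    """Create the file with restrictive permissions from the start, then write it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(path, mode)  # enforce the mode even if the file existed or umask altered it
```

Opening the file with the final mode avoids a window in which the secret sits on disk with default permissions.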
Trade-offs
AWS Secrets Manager costs $0.40/secret/month plus $0.05 per 10,000 API calls — negligible. The IAM role must be tightly scoped: if the monitoring server is compromised, the attacker can only read meetmind/* secrets, not all secrets in the account. Secrets rotation requires a Lambda function or manual rotation — operationally more complex than editing a config file.
Section 3: Production Readiness
Security
Authentication and Authorization
Every service in V2 requires authentication:
- Prometheus: HTTP basic auth via web.yml config (bcrypt-hashed password). Grafana connects with a service account, not the admin user.
- Alertmanager: HTTP basic auth on the API. Prometheus uses credentials when sending alerts.
- Grafana: SAML or OAuth2 SSO via the organization's identity provider. Role-based access: Viewer (read dashboards), Editor (modify dashboards), Admin (manage datasources and users). Service accounts for Prometheus datasource — not the admin password.
- Pushgateway: Basic auth required. GitHub Actions authenticates with a token stored in GitHub Secrets, fetched at workflow runtime. Unauthenticated pushes are rejected.
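- Loki: Multi-tenancy enabled (auth_enabled: true), with each service that pushes logs assigned a unique tenant ID sent in the X-Scope-OrgID header. Loki does not authenticate requests itself, so a reverse proxy in front of it validates a per-service bearer token before forwarding.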
Secrets Management
All secrets in AWS Secrets Manager as described in Feature 3. No secrets in environment variables at rest. No secrets in config files in Git. Rotation schedule: SSH keys rotated every 90 days via a Lambda function. Slack webhook rotated every 180 days.
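A rotation schedule is only useful if something checks it. A minimal sketch of such a check follows, using the 90-day and 180-day intervals above and the secret names from Feature 3; where the last-rotated date comes from is left open and would be an assumption in any real implementation.

```python
from datetime import date, timedelta

# Rotation intervals from this section, keyed by the secret names used in Feature 3.
ROTATION_DAYS = {
    "meetmind/tunnel-ssh-key": 90,
    "meetmind/slack-webhook": 180,
}


def rotation_due(secret_id, last_rotated, today):
    """Return True once the secret has reached its maximum allowed age."""
    max_age = timedelta(days=ROTATION_DAYS[secret_id])
    return today - last_rotated >= max_age


print(rotation_due("meetmind/tunnel-ssh-key", date(2026, 1, 1), date(2026, 4, 1)))
```

The result of a check like this could itself be exported as a metric, so an overdue rotation raises an alert through the same pipeline.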
Network Attack Surface Reduction
Public internet:
→ Port 3000 (Grafana only) — authenticated via SSO
Internal VPC only:
→ Port 9090 (Prometheus) — basic auth
→ Port 9093 (Alertmanager) — basic auth
→ Port 9091 (Pushgateway) — basic auth + IP allowlist (GitHub Actions IPs)
→ Port 3100 (Loki) — token auth
→ Port 3200 (Tempo) — token auth
No public access:
→ Port 9100 (Node Exporter) — localhost only
→ Port 9115 (Blackbox Exporter) — localhost only
→ Port 4319 (OTel Collector) — VPC only
The reverse SSH tunnel is replaced in V2. Instead, Node Exporter on the application server is scraped over AWS PrivateLink or VPC peering, eliminating both the fragile autossh dependency and the persistent, pre-authenticated SSH connection.
Input Validation
Pushgateway metrics are validated against an allowlist of metric names. Recording rules cannot drop series, so the filter is a metric_relabel_configs rule with action: keep on the Pushgateway scrape job, retaining only metric names that match ^(deployment_|meetmind_).*. Unknown metrics pushed by attackers are dropped at scrape time, before storage.
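The allowlist pattern can be exercised in isolation before it goes into a scrape config. The sketch below applies it the way Prometheus relabeling does, where a regex must match the whole metric name; the sample metric names are made up for illustration.

```python
import re

# Pattern from this section. Prometheus relabel regexes are fully anchored,
# so fullmatch() is the closest Python equivalent.
ALLOWED = re.compile(r"^(deployment_|meetmind_).*")


def filter_metrics(names):
    """Keep only metric names that the allowlist accepts."""
    return [name for name in names if ALLOWED.fullmatch(name)]


# Hypothetical metric names, for illustration only.
pushed = [
    "deployment_total",
    "meetmind_requests_total",
    "evil_metric",
    "xdeployment_total",
]
print(filter_metrics(pushed))
```

A prefix allowlist limits which names an attacker can write, but it does not stop poisoned values under an allowed name, so authentication on the Pushgateway is still required.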
Scalability
Horizontal Scaling Boundaries
| Component | V1 | V2 Scaling Strategy |
|---|---|---|
| Prometheus | 1 instance | 2 instances with functional sharding — one per application domain |
| Loki | 1 instance | Loki in microservices mode: separate Distributor, Ingester, Querier, Compactor |
| Grafana | 1 instance | Stateless — scale behind ALB, shared PostgreSQL backend for config |
| Alertmanager | 1 instance | 3-instance gossip cluster |
| Tempo | 1 instance | Tempo in distributed mode with separate components |
Prometheus does not shard automatically. In V2, recording rules pre-aggregate high-cardinality metrics (per-pod CPU, per-endpoint latency) into lower-cardinality summaries. The expected effect is a 60–80% reduction in Prometheus query load for dashboard queries; this is a design estimate to be validated under load.
Caching Strategy
Grafana query cache (built-in; note that query caching is a Grafana Enterprise and Grafana Cloud feature):
- Cache: in-memory per Grafana instance
- TTL: 60 seconds for dashboard panels
- Eviction: LRU, max 100MB per instance
- Purpose: Prevent N dashboard users from firing N identical Prometheus queries
Thanos Query cache (Redis):
- Cache: ElastiCache Redis, r6g.large
- TTL: 5 minutes for range queries, 30 seconds for instant queries
- Eviction: allkeys-lru
- Purpose: Historical metric queries against S3 are slow (200–500ms).
Redis caches the results so repeated dashboard loads hit cache, not S3.
Why Redis specifically:
Sub-millisecond read latency for cache hits. Native TTL support per key
eliminates the need for a separate eviction job. The sorted set data
structure supports time-windowed cache invalidation when new metric blocks
are uploaded to S3.
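The TTL-plus-LRU behaviour described for both caches can be illustrated with a small in-memory sketch. It is not how Grafana or Redis implement caching: it bounds the cache by entry count where the text above specifies 100MB, and the class name and injectable clock are assumptions made to keep the example testable.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with a per-entry TTL and least-recently-used eviction."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]          # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
```

With a 60-second TTL, N users loading the same panel within a minute produce one backend query and N-1 cache hits.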
Handling Traffic Spikes
Loki Distributor is stateless and horizontally scalable behind an ALB. During an application incident that generates a 10x log spike, auto-scaling adds Distributor instances within 90 seconds, triggered by the rate of loki_distributor_lines_received_total published to CloudWatch as a custom metric. Ingesters are the bottleneck because they hold recent log data in memory. V2 pre-provisions 3 Ingesters with capacity for 5x normal ingestion rate before auto-scaling triggers.
Observability — Observing the Observer
Structured Logging Strategy
All V2 services emit JSON logs with mandatory fields:
{
  "timestamp": "2026-05-18T14:23:01.234Z",
  "level": "error",
  "service": "prometheus",
  "trace_id": "abc123def456",
  "message": "failed to scrape target",
  "target": "http://localhost:9101/metrics",
  "duration_ms": 5023,
  "error": "context deadline exceeded"
}
trace_id is mandatory in every log line. This enables the Loki → Tempo drill-down even for the monitoring stack's own internal errors.
Core Metrics Tracked
| Metric | Alert Threshold | Reason |
|---|---|---|
| prometheus_tsdb_head_samples_appended_total rate | < 1000/s sustained | Prometheus ingestion stalled |
| loki_distributor_lines_received_total rate | < 100/s sustained | Loki ingestion stalled |
| alertmanager_notifications_failed_total rate | > 0 for 5m | Slack webhook broken |
| prometheus_rule_evaluation_duration_seconds p99 | > 1s | Alert evaluation too slow; alerts may lag |
| thanos_objstore_operation_failures_total | > 0 for 10m | S3 upload failing; metrics will be lost on restart |
| grafana_api_response_status_total{code="5xx"} rate | > 1% | Grafana serving errors |
Distributed Error Tracking
V2 instruments the monitoring stack itself with OpenTelemetry. Every Prometheus scrape failure, every Alertmanager notification failure, and every Loki ingestion rejection creates a span. These traces go to Tempo. When Slack stops receiving alerts, the on-call engineer opens Tempo, finds the alertmanager.notify spans with status=error, and sees exactly which webhook call failed and why — without SSHing into a server.
Section 4: Tech Stack Decisions
Infrastructure
AWS EC2 — t3.medium per service group
V1 used t3.small (2GB RAM) for everything. V2 separates the stack across three t3.medium instances (4GB RAM each): one for Prometheus + Thanos, one for Loki + Tempo + OTel Collector, one for Grafana + Alertmanager. The separation ensures an OOM kill in one group does not take down the entire observability stack. t3.medium uses burstable CPU credits — appropriate for observability workloads which are spiky (dashboard loads, incident spikes) rather than sustained.
AWS S3 — Long-term metric storage
Chosen over EBS snapshots because S3 provides 11 nines of durability, cross-region replication, and native lifecycle policies for tiered retention. EBS snapshots require manual scheduling and restore procedures. S3 with Thanos is the industry standard for Prometheus long-term storage.
AWS ElastiCache Redis — Query cache
Chosen specifically for: sub-millisecond read latency (P99 < 1ms for cache hits vs 200–500ms for Thanos S3 queries), native TTL per key (no background eviction job needed), and the sorted set data structure for time-windowed invalidation. Memcached was rejected: it also supports per-item expiry, but it offers no sorted sets and no persistence. A restart empties either cache, but Redis can be configured with RDB snapshots to warm the cache on restart.
AWS Secrets Manager — Secrets storage
Chosen over HashiCorp Vault because: it integrates natively with EC2 IAM roles (no Vault agent sidecar), costs are negligible at this scale, and it satisfies SOC2 audit requirements without additional configuration. Vault was considered and rejected — operational overhead of running a highly available Vault cluster exceeds the benefit for a team of this size.
Observability Stack
Prometheus 2.51 — Metrics
Chosen over Datadog, New Relic, or Grafana Cloud for: zero per-metric pricing at scale (Datadog charges $0.05/custom metric/month — at 10,000 metrics that is $500/month), data sovereignty (no customer metrics leave our AWS account), and the open standards ecosystem (PromQL, OpenMetrics, recording rules). The pull-based scraping model suits our architecture — targets expose metrics, Prometheus collects them on its schedule.
Loki 2.9 — Logs
Chosen over Elasticsearch because: Loki indexes only log labels, not log content — storage cost is 10–30x lower at equivalent log volume. Elasticsearch indexes every word in every log line, making it expensive and operationally complex. Loki's query language (LogQL) is sufficient for our use case (error filtering, pattern matching, metric extraction from logs). Elasticsearch's full-text search is unnecessary — we are not building a search engine.
Tempo 2.4 — Traces
Chosen over Jaeger and Zipkin because: Tempo uses object storage (S3) as its backend, giving the same durability and cost profile as Loki. Jaeger's Cassandra or Elasticsearch backend is operationally expensive. Tempo is natively integrated with Loki derived fields — the trace ID click-through from Loki to Tempo is a first-class feature, not a workaround.
Grafana 10.4 — Visualization
Grafana is the only viable choice for unified visualization across Prometheus, Loki, and Tempo. It is the primary UI for all three Grafana Labs backends. The dashboard-as-JSON provisioning model, the derived fields configuration, and the trace waterfall viewer are all first-class features with no equivalent in competing tools. Kibana was rejected — it only works with Elasticsearch.
OpenTelemetry Collector contrib 0.98 — Telemetry pipeline
Chosen because OpenTelemetry is the CNCF standard for telemetry data — vendor-neutral, supported by every major observability platform. The contrib distribution includes exporters for Loki (otlphttp/loki), Tempo (otlp), and Prometheus (prometheus). Running the collector decouples application instrumentation from backend decisions — switching from Loki to Elasticsearch requires only a config change in the collector, not changes to every service.
Thanos — Long-term Prometheus storage
Chosen over Cortex and VictoriaMetrics because: Thanos is a mature, widely deployed solution in this space (a CNCF incubating project), uses S3 natively without additional storage backends, and the Sidecar pattern integrates non-disruptively with existing Prometheus deployments. VictoriaMetrics has better write performance but weaker ecosystem maturity and less operator familiarity. Cortex requires operating a larger set of microservices (and, in its legacy chunk storage, an index store such as Cassandra), which is operational overhead not justified at this scale.
Alertmanager 0.27 — Alert routing
There is no viable alternative for Prometheus alert routing. Grafana's built-in alerting was evaluated and rejected — it does not support multi-window burn rate alerting, inhibition rules, or the structured Go template system needed for our Slack payload format.
Language and Tooling
Terraform 1.8 — Infrastructure provisioning
Chosen over AWS CDK and Pulumi because: the HCL syntax is readable by engineers who are not TypeScript or Python developers, the provider ecosystem is the largest and most mature, and state management via S3 + DynamoDB is a well-understood pattern. AWS CDK was rejected — it requires familiarity with TypeScript/Python and generates CloudFormation, adding an abstraction layer that makes debugging harder.
Bash — Install scripts
The install scripts are Bash because: every Ubuntu server has Bash without additional installation, no external dependencies are introduced, and the scripts are straightforward enough that Python or Go would add complexity without benefit. Scripts over 500 lines would warrant a proper configuration management tool (Ansible), but at 278 lines Bash is the right tool.
V2 Architecture Diagram
                    INTERNET
                        │
                ┌───────┴────────┐
                │    AWS ALB     │
                │ (port 443 TLS) │
                └───────┬────────┘
                        │ (Grafana only, authenticated via SSO)
                ┌───────┴────────┐
                │  Grafana ASG   │
                │ (2+ instances) │
                │   PostgreSQL   │
                │ shared backend │
                └───────┬────────┘
                        │ queries
      ┌─────────────────┼──────────────────┐
      │                 │                  │
┌─────┴──────┐   ┌──────┴──────┐    ┌──────┴──────┐
│ Thanos     │   │ Loki        │    │ Tempo       │
│ Query      │   │ Querier     │    │ Query       │
└─────┬──────┘   └──────┬──────┘    └──────┬──────┘
      │                 │                  │
┌─────┴──────┐   ┌──────┴──────┐    ┌──────┴──────┐
│ Prometheus │   │ Loki        │    │ Tempo       │
│ x2         │   │ Ingester    │    │ Ingester    │
│ + Thanos   │   │ x3          │    │ x2          │
│ Sidecar    │   └─────────────┘    └─────────────┘
└─────┬──────┘          │                  │
      │                 │                  │
      ↓                 ↓                  ↓
S3: metrics      S3: logs chunks    S3: trace blocks
(Thanos blocks)  (Loki chunks)      (Tempo blocks)

┌─────────────────────────────────────────────────┐
│ Alertmanager gossip cluster (3 instances)       │
│ ← Thanos Ruler evaluates alert rules            │
│ → Slack webhook (URL from Secrets Manager)      │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│ Application Server (VPC peering or PrivateLink) │
│ Node Exporter :9100 ← scraped by Prometheus     │
│ App (OTel SDK) → OTel Collector → Loki + Tempo  │
└─────────────────────────────────────────────────┘

ElastiCache Redis   ← Thanos Query cache
AWS Secrets Manager ← all secrets
S3                  ← Terraform state (DynamoDB lock)
Summary of V1 → V2 Migrations
| Problem in V1 | V2 Solution |
|---|---|
| Single server, total failure | Multi-instance, service-grouped across 3 hosts |
| Metrics lost on terraform destroy | Thanos + S3: metrics survive instance termination |
| Fragile reverse SSH tunnel | VPC peering or PrivateLink, no SSH dependency |
| No authentication on any service | Basic auth everywhere, SSO on Grafana |
| Slack webhook in plaintext config | AWS Secrets Manager: never committed to Git, rendered at deploy time with restricted permissions |
| Single Alertmanager, alerts lost on crash | 3-instance gossip cluster |
| No audit trail | CloudTrail + Loki structured logs with mandatory trace IDs |
| OOM on t3.small under load | Separated service groups on t3.medium, Redis cache reduces query load |
| Alert spam from grouping by instance | group_by: [alertname, severity], one message per incident |
This document represents a production-ready evolution. V1 proved the concept. V2 makes it survivable.
