
Abraham Acha


Engineering Design Document: MeetMind Observability Platform V2

Engineering Design Document

MeetMind Observability Platform — V1 Critique and V2 Design

Author: The Duke Airfluke
Audience: Principal Engineer review
Date: June 2026


Executive Summary

The V1 MeetMind observability platform successfully demonstrated a working LGTM-style stack (Loki, Grafana, Tempo, with Prometheus standing in for Mimir) deployed as native systemd services on a single EC2 instance, with a reverse SSH tunnel bridging a cross-account monitoring gap. It collected metrics, fired alerts to Slack, and proved the observability pipeline worked end to end.

It is not production-ready.

This document is a brutal critique of V1's architectural decisions and a rigorous design for V2 — a highly available, secure, and scalable observability platform capable of surviving real traffic, real incidents, and real adversaries.


Section 1: V1 Architecture Critique

Brief Overview of V1

V1 deployed nine services — Prometheus, Loki, Tempo, Grafana, Alertmanager, Node Exporter, Blackbox Exporter, Pushgateway, and OpenTelemetry Collector — as systemd units on a single t3.small EC2 instance (2 vCPU, 2GB RAM). A Terraform configuration provisions this instance from scratch. Configs are version-controlled in GitHub. All alerts route to a Slack channel via Alertmanager's webhook integration. Cross-account monitoring is achieved through a reverse SSH tunnel from the application server to the monitoring server, forwarding Node Exporter on port 9100 to port 9101 on the monitoring server.

Exact Breaking Points

1. Single Point of Failure — Everything

The entire observability stack runs on one EC2 instance. When that instance goes down — hardware failure, AWS availability zone outage, accidental terraform destroy, or an OOM kill — you lose:

  • All metrics collection
  • All alerting
  • All dashboards
  • All log storage
  • All trace storage

Your monitoring stack cannot monitor itself going down. If 3.93.140.221 dies, you find out when a human notices the application is broken, not when an alert fires. This is the most fundamental contradiction in the design.

2. The Reverse SSH Tunnel is a Fragile Bridge

The tunnel between the application server and monitoring server is managed by a single autossh process. During the project, the tunnel dropped multiple times — causing HostDown alerts, SLOAvailabilityFastBurn alerts, and Slack spam. The fix was to increase the for: duration on alerts to tolerate reconnection time.

This means we deliberately made our alerting less sensitive to mask infrastructure fragility. That is the wrong trade-off. The tunnel also represents an unlocked persistent SSH connection between two servers. If either server is compromised, the attacker inherits a pre-authenticated path to the other.

3. Prometheus Has No Storage Redundancy

Prometheus stores metrics on the local disk at /var/lib/prometheus with 30 days of retention. When the instance is terminated by terraform destroy, all historical metrics are lost. There is no remote write, no object storage backup, no secondary Prometheus. A new monitoring server starts with zero historical data: no SLO compliance history, no DORA trend lines, nothing.

4. No Authentication on Internal Services

Every service runs with no authentication:

  • Prometheus UI at :9090 — open to anyone with network access
  • Alertmanager UI at :9093 — anyone can create silences, delete routes, or inject fake alerts
  • Pushgateway at :9091 — anyone can push arbitrary metrics and corrupt DORA data
  • Grafana defaults to admin/admin — hardcoded in the install script

The security group restricts ports, but that is network filtering, not authentication. Any instance in the same VPC, any compromised dependency, or any misconfigured security group rule exposes these services entirely.

5. The Slack Webhook is a Secret Living in a Config File

The Slack webhook URL is stored in /etc/alertmanager/alertmanager.yml on the monitoring server. It was accidentally committed to GitHub twice during the project, once in a commit that had to be rewritten with git push --force. The webhook was never rotated after those exposures. Anyone who obtained the URL can still post arbitrary messages to #all-hng-alerts, injecting fake alerts or burying real ones in noise.

6. t3.small Will OOM Under Real Load

Loki, Tempo, Prometheus, and Grafana are all memory-intensive. On a t3.small with 2GB RAM:

  • Prometheus with 30 days of metrics and multiple scrape targets consumes approximately 800MB–1.2GB
  • Loki's ingestion buffer consumes 200–400MB under moderate load
  • Grafana with multiple concurrent dashboard users consumes 300–500MB

Total: 1.3GB–2.1GB. The instance has 2GB. Under any sustained load, the OOM killer terminates a service. During the project we saw the stress-ng game day scenario push the instance to the edge. In production, real traffic would do the same.
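The budget above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the per-service ranges quoted in the list; Tempo and the exporters are left out because no figures are given for them, and treating 2GB as 2048 MB is an assumption.

```python
# Per-service memory estimates in MB (low, high), taken from the list above.
ESTIMATES_MB = {
    "prometheus": (800, 1200),
    "loki": (200, 400),
    "grafana": (300, 500),
}
INSTANCE_RAM_MB = 2048  # assumption: the t3.small's 2GB counted as 2048 MB


def total_range(estimates):
    """Sum the low and high ends of every service's estimate."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


low, high = total_range(ESTIMATES_MB)
print(f"estimated usage: {low}-{high} MB of {INSTANCE_RAM_MB} MB")
print("headroom at peak:", INSTANCE_RAM_MB - high, "MB")
```

A negative headroom at the high end is what makes an OOM kill likely before Tempo or the operating system itself is even counted.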

7. No Retention Policy Enforcement

30-day retention is configured but not verified. Loki's compactor runs but is not monitored. If the disk fills before the compactor can reclaim space — a real risk on a 30GB root volume — Loki stops accepting logs, Tempo stops accepting traces, and the entire stack degrades silently.

Security Blind Spots

  • No mTLS between any services — all inter-service communication is plaintext on localhost
  • No rate limiting on the Pushgateway endpoint — trivially DoS'd or data-poisoned
  • SSH private key for the tunnel stored at /var/lib/monitoring/id_rsa with no rotation schedule
  • No audit logging — no record of who silenced alerts, who changed Grafana dashboards, or when
  • Grafana allowUiUpdates: false prevents dashboard modification but does not prevent API access with the admin password

Section 2: V2 New Features — Fully Designed

Feature 1: Long-Term Metric Storage in S3 (Prometheus → Thanos)

What it does and why it is needed

V1 loses all historical metrics on terraform destroy. V2 introduces a Thanos Sidecar running alongside each Prometheus instance. The Sidecar reads the 2-hour TSDB blocks that Prometheus writes to local disk and uploads each one to S3 once it is complete. It does not use Prometheus remote write; that path would require Thanos Receive instead. Thanos Query provides a unified query layer across the Prometheus instances and S3, so dashboards can query metrics spanning months without keeping them in local TSDB.

Architectural Integration

Prometheus (primary) ──TSDB blocks───→ Thanos Sidecar
                                              │
                                              ↓ upload every 2h
                                        S3 Bucket (meetmind-metrics)
                                              │
                               Thanos Store ←─┘
                                     │
                          Thanos Query (unified query API)
                                     │
                               Grafana datasource

Thanos Ruler takes over alert rule evaluation from Prometheus, ensuring alert evaluation continues even if one Prometheus instance goes down. Alertmanager receives alerts from Thanos Ruler, not directly from Prometheus.

Data Model Changes

No schema changes. Thanos uses the existing Prometheus TSDB format. S3 object structure:

s3://meetmind-metrics/
  prometheus/
    01HXYZ.../   (2-hour compacted block)
      chunks/
      index
      meta.json

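Retention on S3: raw blocks are retained for 1 year, downsampled 5-minute resolution blocks for 3 years. Because raw and downsampled blocks live under the same prefix, this per-resolution retention is enforced by the Thanos Compactor's retention settings rather than by a plain S3 lifecycle rule.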

Trade-offs

S3 upload introduces up to a 2-hour lag before metrics are queryable from object storage, so recent metrics (the last 2 hours) come from the local Prometheus TSDB via the Sidecar. Thanos Store queries are slower than local Prometheus queries. That is acceptable for historical dashboards but not for alerting, so alert rules should only look back over recent data that is still served from local Prometheus. Operational complexity increases: Thanos introduces three new query-path components (Sidecar, Store, Query) plus the Ruler and the Compactor that produces the downsampled blocks, all of which must be monitored.
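The split between local and object-storage data can be pictured as follows. This is only an illustrative sketch: the real Thanos Query fans out to every store and deduplicates the results, and the split_range helper and the fixed 2-hour cutoff are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

UPLOAD_LAG = timedelta(hours=2)  # assumed: blocks reach S3 about 2h after ingestion


def split_range(start, end, now):
    """Return (s3_part, local_part) for a query window; either may be None."""
    cutoff = now - UPLOAD_LAG
    s3_part = (start, min(end, cutoff)) if start < cutoff else None
    local_part = (max(start, cutoff), end) if end > cutoff else None
    return s3_part, local_part


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
s3_part, local_part = split_range(now - timedelta(hours=6), now, now)
print("from S3:   ", s3_part)
print("from local:", local_part)
```

A six-hour dashboard window ends up with its oldest four hours served from S3 and its newest two hours served from local TSDB, while a one-hour window never touches S3 at all.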


Feature 2: High Availability Alertmanager Cluster

What it does and why it is needed

V1 runs a single Alertmanager. If it crashes between a Prometheus firing event and the Slack webhook delivery, the alert is silently lost. V2 runs three Alertmanager instances in a gossip cluster using the --cluster.peer flag. All three share alert state, so notifications are deduplicated across the cluster as long as the peers can reach each other.

Architectural Integration

Prometheus ──→ Alertmanager-1 (port 9093)  ┐
           ──→ Alertmanager-2 (port 9094)  ├── gossip cluster (port 9094/mesh)
           ──→ Alertmanager-3 (port 9095)  ┘
                       │
              (deduplicated) one notification
                       │
                    Slack webhook

Prometheus sends alerts to all three Alertmanager instances simultaneously. The gossip protocol ensures only one instance sends the Slack notification.
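The deduplication behavior can be illustrated with a toy model. It is a deliberate simplification: real Alertmanager peers gossip a notification log and stagger their sends by position in the cluster, whereas this sketch uses one shared set and processes peers in order. The Cluster class and the alert key are hypothetical.

```python
class Cluster:
    """Toy model: a notification log that every peer can see via gossip."""

    def __init__(self, peers):
        self.peers = peers
        self.notification_log = set()  # alert-group keys already notified
        self.sent = []                 # (peer, group_key) pairs actually sent

    def receive(self, group_key):
        """Prometheus delivers the same alert group to every peer."""
        # Peers act in position order, standing in for the staggered wait
        # each instance applies before it notifies.
        for peer in self.peers:
            if group_key in self.notification_log:
                continue  # another peer already notified; stay silent
            self.sent.append((peer, group_key))
            self.notification_log.add(group_key)


cluster = Cluster(["alertmanager-1", "alertmanager-2", "alertmanager-3"])
cluster.receive("HostDown/critical")
print(cluster.sent)
```

Only the first peer sends. If the shared log were split into one log per peer, as happens during a network partition, each side would send its own copy.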

Data Model Changes

No new persistent data. Alertmanager state (silences and the notification log) is replicated in memory across the cluster via gossip. Inhibition rules are static configuration and must be identical on every instance. Silences are persisted to disk per instance for recovery after restart.

Trade-offs

Three Alertmanager instances on separate hosts cost more. Gossip protocol adds ~5ms of coordination overhead per alert group, which is invisible in practice. Prometheus must list all three instances explicitly in its alerting configuration; placing a load balancer or DNS round-robin in front of them would deliver each alert to only one instance and undermine deduplication. If the network partitions, two instances may send the same notification once before reconciling.


Feature 3: Secrets Management via AWS Secrets Manager

What it does and why it is needed

V1 stores the Slack webhook URL in a plaintext config file that was committed to GitHub twice. V2 stores all secrets (Slack webhook, Grafana admin password, SSH keys, database credentials) in AWS Secrets Manager. The install script fetches secrets at deploy time using the instance's IAM role. No secret is ever committed to a Git repository or stored in a version-controlled config file.

Architectural Integration

EC2 Instance (monitoring server)
  └── IAM Role: meetmind-monitoring-role
        └── Policy: allow secretsmanager:GetSecretValue
                    on arn:aws:secretsmanager:*:*:secret:meetmind/*

install.sh:
  # SecretString is JSON ({"url": "..."}), so extract the url field (requires jq)
  SLACK_WEBHOOK=$(aws secretsmanager get-secret-value \
    --secret-id meetmind/slack-webhook \
    --query SecretString --output text | jq -r '.url')

Secrets are injected into config files at install time, and the shell variables holding them are unset immediately afterward. The only on-disk copy is the rendered config file, which is restricted with file permissions (chmod 640, owned by the service user).
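The render step can be sketched as follows, in Python rather than the install script's Bash. It assumes the secret value is a JSON object with a url field; render_config, fake_fetch, the placeholder URL, and the config fragment are all hypothetical and exist only to make the sketch runnable offline.

```python
import json
import os
import tempfile


def render_config(fetch_secret, path):
    """Write an Alertmanager config fragment containing the webhook, mode 0640."""
    secret = json.loads(fetch_secret("meetmind/slack-webhook"))
    body = (
        "receivers:\n"
        "  - name: slack\n"
        "    slack_configs:\n"
        f"      - api_url: {secret['url']}\n"
    )
    # Create the file with restrictive permissions from the start, so the
    # secret is never readable by other users even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(path, 0o640)  # enforce the mode even if the file pre-existed


def fake_fetch(secret_id):
    """Stand-in for Secrets Manager so the sketch runs without AWS access."""
    return json.dumps({"url": "https://hooks.slack.example/placeholder"})


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "alertmanager.yml")
    render_config(fake_fetch, target)
    mode = os.stat(target).st_mode & 0o777
    with open(target) as handle:
        rendered = handle.read()
print(oct(mode))
```

In a real deployment the fetch function would wrap the boto3 Secrets Manager client's get_secret_value call and return its SecretString. Keeping the fetcher injectable lets the rendering logic be tested without credentials.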

Data Model Changes

Secrets Manager stores:

meetmind/slack-webhook       → {"url": "https://hooks.slack.com/..."}
meetmind/grafana-admin       → {"password": "..."}
meetmind/tunnel-ssh-key      → {"private_key": "-----BEGIN..."}
meetmind/prometheus-auth     → {"username": "prometheus", "password": "..."}

Trade-offs

AWS Secrets Manager costs $0.40/secret/month plus $0.05 per 10,000 API calls — negligible. The IAM role must be tightly scoped: if the monitoring server is compromised, the attacker can only read meetmind/* secrets, not all secrets in the account. Secrets rotation requires a Lambda function or manual rotation — operationally more complex than editing a config file.


Section 3: Production Readiness

Security

Authentication and Authorization

Every service in V2 requires authentication:

  • Prometheus: HTTP basic auth via web.yml config (bcrypt-hashed password). Grafana connects with a service account, not the admin user.
  • Alertmanager: HTTP basic auth on the API. Prometheus uses credentials when sending alerts.
  • Grafana: SAML or OAuth2 SSO via the organization's identity provider. Role-based access: Viewer (read dashboards), Editor (modify dashboards), Admin (manage datasources and users). Service accounts for Prometheus datasource — not the admin password.
  • Pushgateway: Basic auth required. GitHub Actions authenticates with a token stored in GitHub Secrets, fetched at workflow runtime. Unauthenticated pushes are rejected.
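  • Loki: Loki has no built-in authentication, so V2 enables its multi-tenancy mode (auth_enabled: true) and places an authenticating reverse proxy in front of it. Each service that pushes logs presents its own bearer token to the proxy and is mapped to a unique tenant ID, passed to Loki in the X-Scope-OrgID header.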

Secrets Management

All secrets in AWS Secrets Manager as described in Feature 3. No secrets in environment variables at rest. No secrets in config files in Git. Rotation schedule: SSH keys rotated every 90 days via a Lambda function. Slack webhook rotated every 180 days.
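The rotation schedule reduces to a simple age check. This is a sketch only: the secret names, the is_due helper, and the example dates are hypothetical, and the Lambda or cron job that would call such a check is not shown.

```python
from datetime import date, timedelta

# Rotation intervals in days, from the schedule above.
ROTATION_DAYS = {"ssh-key": 90, "slack-webhook": 180}


def is_due(secret, last_rotated, today):
    """True when the secret's age has reached its rotation interval."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[secret])


today = date(2026, 6, 1)
last_rotated = date(2026, 3, 1)  # 92 days before `today`
print(is_due("ssh-key", last_rotated, today))
print(is_due("slack-webhook", last_rotated, today))
```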

Network Attack Surface Reduction

Public internet:
  → Port 3000 (Grafana only) — authenticated via SSO

Internal VPC only:
  → Port 9090 (Prometheus) — basic auth
  → Port 9093 (Alertmanager) — basic auth
  → Port 9091 (Pushgateway) — basic auth + IP allowlist (GitHub Actions IPs)
  → Port 3100 (Loki) — token auth
  → Port 3200 (Tempo) — token auth

No public access:
  → Port 9100 (Node Exporter) — localhost only
  → Port 9115 (Blackbox Exporter) — localhost only
  → Port 4319 (OTel Collector) — VPC only

The reverse SSH tunnel is replaced in V2. Instead, Node Exporter on the application server is scraped via AWS PrivateLink or VPC peering — eliminating the fragile autossh dependency and the unlocked persistent SSH connection.

Input Validation

Pushgateway metrics are validated against an allowlist of metric names. Recording rules cannot drop series, so the filter is a metric_relabel_configs rule with action: keep on the Pushgateway scrape job, which discards any metric whose name does not match ^(deployment_|meetmind_).*. Unknown metrics pushed by attackers are dropped before storage.
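The keep semantics of the allowlist can be demonstrated in a few lines. In Prometheus itself this filtering is expressed in the scrape configuration, not in Python. The sketch below only shows which names survive, and the example metric names are hypothetical.

```python
import re

# Same pattern as in the text. Prometheus anchors relabel regexes at both
# ends, which re.fullmatch reproduces.
ALLOWED = re.compile(r"^(deployment_|meetmind_).*")


def keep_allowed(metric_names):
    """Return only the metric names that match the allowlist."""
    return [name for name in metric_names if ALLOWED.fullmatch(name)]


pushed = ["deployment_frequency_total", "meetmind_build_seconds", "evil_metric"]
print(keep_allowed(pushed))
```

Note that a name such as x_meetmind_fake is also rejected, because the pattern must match from the first character.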

Scalability

Horizontal Scaling Boundaries

| Component | V1 | V2 Scaling Strategy |
| --- | --- | --- |
| Prometheus | 1 instance | 2 instances with functional sharding, one per application domain |
| Loki | 1 instance | Loki in microservices mode: separate Distributor, Ingester, Querier, Compactor |
| Grafana | 1 instance | Stateless: scale behind ALB, shared PostgreSQL backend for config |
| Alertmanager | 1 instance | 3-instance gossip cluster |
| Tempo | 1 instance | Tempo in distributed mode with separate components |

Prometheus does not shard automatically. In V2, recording rules pre-aggregate high-cardinality metrics (per-pod CPU, per-endpoint latency) into lower-cardinality summaries. The expected effect is a 60–80% reduction in Prometheus query load for dashboard queries; this is a design estimate to be validated under load.

Caching Strategy

Grafana query cache (built into Grafana Enterprise and Grafana Cloud, not the OSS build):
  - Cache: in-memory per Grafana instance
  - TTL: 60 seconds for dashboard panels
  - Eviction: LRU, max 100MB per instance
  - Purpose: Prevent N dashboard users from firing N identical Prometheus queries

Thanos query cache (Redis, via Thanos Query Frontend):
  - Cache: ElastiCache Redis, r6g.large
  - TTL: 5 minutes for range queries, 30 seconds for instant queries
  - Eviction: allkeys-lru
  - Purpose: Historical metric queries against S3 are slow (200–500ms).
             Redis caches the results so repeated dashboard loads hit cache, not S3.

Why Redis specifically:
  Sub-millisecond read latency for cache hits. Native TTL support per key
  eliminates the need for a separate eviction job. Sorted sets leave room
  for time-windowed cache invalidation when new metric blocks are uploaded
  to S3, although that would need custom tooling; out of the box, entries
  simply expire by TTL.
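Both cache layers combine a TTL with LRU eviction. The sketch below shows that combination in miniature. It is not how Grafana or Redis implement it: the TTLCache class is hypothetical, and it bounds the cache by entry count rather than by the 100MB size limit quoted above.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries also expire after ttl seconds."""

    def __init__(self, max_entries, ttl, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict least recently used


now = [0.0]  # fake clock so the example does not need to sleep
cache = TTLCache(max_entries=2, ttl=60, clock=lambda: now[0])
cache.put("up{job='node'}", 1)
hit = cache.get("up{job='node'}")
now[0] = 61.0
miss = cache.get("up{job='node'}")
print(hit, miss)
```

With a 60-second TTL, N users loading the same panel inside one minute produce one backend query. The entry expires on the first read after the TTL has passed.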

Handling Traffic Spikes

Loki Distributor is stateless and horizontally scalable behind an ALB. During an application incident that generates a 10x log spike, auto-scaling adds Distributor instances within 90 seconds. The scaling signal is the rate of loki_distributor_lines_received_total, which is a Prometheus metric and must be published to CloudWatch as a custom metric before a scaling policy can use it. Ingesters are the bottleneck because they hold recent log data in memory. V2 pre-provisions 3 Ingesters with capacity for 5x normal ingestion rate before auto-scaling triggers.

Observability — Observing the Observer

Structured Logging Strategy

All V2 services emit JSON logs with mandatory fields:

{
  "timestamp": "2026-05-18T14:23:01.234Z",
  "level": "error",
  "service": "prometheus",
  "trace_id": "abc123def456",
  "message": "failed to scrape target",
  "target": "http://localhost:9101/metrics",
  "duration_ms": 5023,
  "error": "context deadline exceeded"
}

trace_id is mandatory in every log line. This enables the Loki → Tempo drill-down even for the monitoring stack's own internal errors.
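One way to enforce the mandatory fields is at the formatter level. The sketch below uses Python's standard logging module purely as an illustration. The stack's own services are not written in Python, and the JsonFormatter class is hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        trace_id = getattr(record, "trace_id", None)
        if not trace_id:
            raise ValueError("trace_id is mandatory on every log line")
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "trace_id": trace_id,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


formatter = JsonFormatter(service="prometheus")
record = logging.LogRecord(
    "demo", logging.ERROR, "demo.py", 0, "failed to scrape target", None, None
)
record.trace_id = "abc123def456"
line = formatter.format(record)
print(line)
```

Rejecting a record that lacks trace_id is a design choice made for this sketch. A production formatter might instead emit a placeholder ID so that no log line is ever dropped.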

Core Metrics Tracked

| Metric | Alert Threshold | Reason |
| --- | --- | --- |
| prometheus_tsdb_head_samples_appended_total | rate < 1000/s sustained | Prometheus ingestion stalled |
| loki_distributor_lines_received_total | rate < 100/s sustained | Loki ingestion stalled |
| alertmanager_notifications_failed_total | rate > 0 for 5m | Slack webhook broken |
| prometheus_rule_evaluation_duration_seconds | p99 > 1s | Alert evaluation too slow; alerts may lag |
| thanos_objstore_bucket_operation_failures_total | rate > 0 for 10m | S3 upload failing; metrics will be lost on restart |
| grafana_api_response_status_total{code=~"5.."} | rate > 1% of all responses | Grafana serving errors |
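Most thresholds above are expressed as rates over counters. The sketch below shows the core of that calculation for two samples. It is a simplification of what PromQL's rate() does, since the real function also handles multiple samples and extrapolates to the window edges. The per_second_rate helper and the sample values are hypothetical.

```python
def per_second_rate(first, last):
    """Rate between two (unix_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # A counter that went down has been reset; count from zero.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


# Hypothetical samples of alertmanager_notifications_failed_total, 5m apart.
rate = per_second_rate((0, 12), (300, 15))
print(rate, rate > 0)
```

Comparing the raw counter value to zero would keep an alert firing forever after a single failure, which is why the thresholds are written against the rate.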

Distributed Error Tracking

V2 instruments the monitoring stack itself with OpenTelemetry. Every Prometheus scrape failure, every Alertmanager notification failure, and every Loki ingestion rejection creates a span. These traces go to Tempo. When Slack stops receiving alerts, the on-call engineer opens Tempo, finds the alertmanager.notify spans with status=error, and sees exactly which webhook call failed and why — without SSHing into a server.


Section 4: Tech Stack Decisions

Infrastructure

AWS EC2 — t3.medium per service group
V1 used t3.small (2GB RAM) for everything. V2 separates the stack across three t3.medium instances (4GB RAM each): one for Prometheus + Thanos, one for Loki + Tempo + OTel Collector, one for Grafana + Alertmanager. The separation ensures an OOM kill in one group does not take down the entire observability stack. t3.medium uses burstable CPU credits — appropriate for observability workloads which are spiky (dashboard loads, incident spikes) rather than sustained.

AWS S3 — Long-term metric storage
Chosen over EBS snapshots because S3 provides 11 nines of durability, cross-region replication, and native lifecycle policies for tiered retention. EBS snapshots require manual scheduling and restore procedures. S3 with Thanos is the industry standard for Prometheus long-term storage.

AWS ElastiCache Redis — Query cache
Chosen specifically for: sub-millisecond read latency (P99 < 1ms for cache hits vs 200–500ms for Thanos S3 queries), native TTL per key (no background eviction job needed), and data structures such as sorted sets that leave room for time-windowed invalidation. Memcached also supports per-item expiry, but it was rejected because it has no persistence: a restart of either system empties the cache, but Redis can be configured with RDB snapshots to warm the cache on restart.

AWS Secrets Manager — Secrets storage
Chosen over HashiCorp Vault because: it integrates natively with EC2 IAM roles (no Vault agent sidecar), costs are negligible at this scale, and it satisfies SOC2 audit requirements without additional configuration. Vault was considered and rejected — operational overhead of running a highly available Vault cluster exceeds the benefit for a team of this size.

Observability Stack

Prometheus 2.51 — Metrics
Chosen over Datadog, New Relic, or Grafana Cloud for: zero per-metric pricing at scale (Datadog charges $0.05/custom metric/month — at 10,000 metrics that is $500/month), data sovereignty (no customer metrics leave our AWS account), and the open standards ecosystem (PromQL, OpenMetrics, recording rules). The pull-based scraping model suits our architecture — targets expose metrics, Prometheus collects them on its schedule.

Loki 2.9 — Logs
Chosen over Elasticsearch because: Loki indexes only log labels, not log content — storage cost is 10–30x lower at equivalent log volume. Elasticsearch indexes every word in every log line, making it expensive and operationally complex. Loki's query language (LogQL) is sufficient for our use case (error filtering, pattern matching, metric extraction from logs). Elasticsearch's full-text search is unnecessary — we are not building a search engine.

Tempo 2.4 — Traces
Chosen over Jaeger and Zipkin because: Tempo uses object storage (S3) as its backend, giving the same durability and cost profile as Loki. Jaeger's Cassandra or Elasticsearch backend is operationally expensive. Tempo is natively integrated with Loki derived fields — the trace ID click-through from Loki to Tempo is a first-class feature, not a workaround.

Grafana 10.4 — Visualization
Grafana is the only viable choice for unified visualization across Prometheus, Loki, and Tempo. It is the primary UI for all three Grafana Labs backends. The dashboard-as-JSON provisioning model, the derived fields configuration, and the trace waterfall viewer are all first-class features with no equivalent in competing tools. Kibana was rejected — it only works with Elasticsearch.

OpenTelemetry Collector contrib 0.98 — Telemetry pipeline
Chosen because OpenTelemetry is the CNCF standard for telemetry data: vendor-neutral and supported by every major observability platform. The contrib distribution includes exporters for Loki (loki; the otlphttp exporter only works against Loki 3.0 or later, which accepts OTLP natively), Tempo (otlp), and Prometheus (prometheus). Running the collector decouples application instrumentation from backend decisions: switching from Loki to Elasticsearch requires only a config change in the collector, not changes to every service.

Thanos — Long-term Prometheus storage
Chosen over Cortex and VictoriaMetrics because: Thanos is a mature, widely deployed solution in this space (a CNCF incubating project), uses S3 natively without additional storage backends, and the Sidecar pattern integrates non-disruptively with existing Prometheus deployments. VictoriaMetrics has better write performance but weaker ecosystem maturity and less operator familiarity. Cortex requires running a larger set of always-on microservices (distributors, ingesters, and a hash ring), and older releases also depended on Cassandra or a similar index store; that operational overhead is not justified at this scale.

Alertmanager 0.27 — Alert routing
There is no viable alternative for Prometheus alert routing. Grafana's built-in alerting was evaluated and rejected — it does not support multi-window burn rate alerting, inhibition rules, or the structured Go template system needed for our Slack payload format.

Language and Tooling

Terraform 1.8 — Infrastructure provisioning
Chosen over AWS CDK and Pulumi because: the HCL syntax is readable by engineers who are not TypeScript or Python developers, the provider ecosystem is the largest and most mature, and state management via S3 + DynamoDB is a well-understood pattern. AWS CDK was rejected — it requires familiarity with TypeScript/Python and generates CloudFormation, adding an abstraction layer that makes debugging harder.

Bash — Install scripts
The install scripts are Bash because: every Ubuntu server has Bash without additional installation, no external dependencies are introduced, and the scripts are straightforward enough that Python or Go would add complexity without benefit. Scripts over 500 lines would warrant a proper configuration management tool (Ansible), but at 278 lines Bash is the right tool.


V2 Architecture Diagram

                        INTERNET
                            │
                    ┌───────┴────────┐
                    │ AWS ALB        │
                    │ (port 443 TLS) │
                    └───────┬────────┘
                            │ (Grafana only — authenticated via SSO)
                    ┌───────┴────────┐
                    │ Grafana ASG    │
                    │ (2+ instances) │
                    │ PostgreSQL     │
                    │ shared backend │
                    └───────┬────────┘
                            │ queries
          ┌─────────────────┼──────────────────┐
          │                 │                  │
    ┌─────┴──────┐   ┌──────┴──────┐   ┌──────┴──────┐
    │ Thanos     │   │ Loki        │   │ Tempo       │
    │ Query      │   │ Querier     │   │ Query       │
    └─────┬──────┘   └──────┬──────┘   └──────┬──────┘
          │                 │                  │
    ┌─────┴──────┐   ┌──────┴──────┐   ┌──────┴──────┐
    │ Prometheus │   │ Loki        │   │ Tempo       │
    │ x2         │   │ Ingester    │   │ Ingester    │
    │ + Thanos   │   │ x3          │   │ x2          │
    │ Sidecar    │   └─────────────┘   └─────────────┘
    └─────┬──────┘          │                  │
          │                 │                  │
          ↓                 ↓                  ↓
    S3: metrics       S3: log chunks     S3: trace blocks
    (Thanos blocks)   (Loki chunks)      (Tempo blocks)

    ┌─────────────────────────────────────────────────┐
    │ Alertmanager gossip cluster (3 instances)       │
    │ ← Thanos Ruler evaluates alert rules            │
    │ → Slack webhook (URL from Secrets Manager)      │
    └─────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────┐
    │ Application Server (VPC peering or PrivateLink) │
    │ Node Exporter :9100  ← scraped by Prometheus    │
    │ App (OTel SDK) → OTel Collector → Loki + Tempo  │
    └─────────────────────────────────────────────────┘

    ElastiCache Redis ← Thanos Query cache
    AWS Secrets Manager ← all secrets
    S3 ← Terraform state (DynamoDB lock)

Summary of V1 → V2 Migrations

| Problem in V1 | V2 Solution |
| --- | --- |
| Single server, total failure | Multi-instance, service-grouped across 3 hosts |
| Metrics lost on terraform destroy | Thanos + S3: metrics survive instance termination |
| Fragile reverse SSH tunnel | VPC peering or PrivateLink, no SSH dependency |
| No authentication on any service | Basic auth everywhere, SSO on Grafana |
| Slack webhook in plaintext config | AWS Secrets Manager: never in Git, on disk only in a permission-restricted config |
| Single Alertmanager, alerts lost on crash | 3-instance gossip cluster |
| No audit trail | CloudTrail + Loki structured logs with mandatory trace IDs |
| OOM on t3.small under load | Separated service groups on t3.medium, Redis cache reduces query load |
| Alert spam from grouping by instance | group_by: [alertname, severity], one message per incident |
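The effect of the group_by change in the last row can be seen with a small grouping sketch. It mimics only the bucketing step of Alertmanager's routing. The group_alerts helper and the example alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HostDown", "severity": "critical", "instance": "app-1"},
    {"alertname": "HostDown", "severity": "critical", "instance": "app-2"},
    {"alertname": "HighLatency", "severity": "warning", "instance": "app-1"},
]
by_incident = group_alerts(alerts, ["alertname", "severity"])
by_instance = group_alerts(alerts, ["alertname", "severity", "instance"])
print(len(by_incident), len(by_instance))
```

Each group becomes one Slack message, so dropping instance from the grouping labels turns three notifications into two here. The reduction grows with the number of affected hosts.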

This document represents a production-ready evolution. V1 proved the concept. V2 makes it survivable.
