Adeolu

Posted on Jun 10

Engineering Design Document: Reusable Observability Platform V2

#devops #observability #architecture #sre

A production-focused redesign of a Stage 6 LGTM observability platform, moving from a single-service Anvila monitoring setup to a reusable, secure, highly available observability platform.

Executive Summary

This document proposes V2 of the observability platform I built during Stage 6. V1 was validated using the Anvila API as the first monitored service, so many names, dashboards, alerts, and targets were Anvila-specific. However, the real architectural idea was broader than Anvila: build a reusable internal monitoring and reliability platform that could collect metrics, logs, traces, service-level objectives, DORA metrics, and alerts for production applications.

In this document, I treat Anvila as the first customer of the platform, not as the only possible customer. V1 proved that the stack worked for one real API. V2 redesigns it into a highly available, secure, multi-environment observability platform that can onboard many services without rebuilding the whole system each time. "Highly available" means the platform should keep working even if one server or component fails. The main changes are: replacing the single monitoring server with a resilient telemetry architecture, hardening access to observability tools, improving OpenTelemetry collection, making alert routing ownership-aware, storing telemetry with explicit retention and durability policies, and turning SLO and DORA measurements into enforceable release decisions.

The target reader for this document is a Principal Engineer reviewing whether the proposed architecture can survive operational pressure, not whether the dashboards look impressive. For clarity, I still reference Anvila throughout the document because it was the real service used to test V1.

1. V1 Architecture Critique

V1 Overview

V1 was a single-service observability platform deployed on a dedicated AWS EC2 monitoring server. It monitored Anvila as the first real application, so the implementation was branded and configured around Anvila. The application server ran the Anvila API through Nginx and PM2, with staging and production processes exposed on separate local ports. PM2 is a process manager; in this setup, it kept the FastAPI backend running and allowed the team to restart staging or production processes. The monitoring server ran the LGTM stack as systemd services rather than Docker. LGTM normally stands for Loki, Grafana, Tempo, and Mimir; V1 used Prometheus in the metrics role instead of Mimir, alongside several supporting components:

Prometheus collected metrics, which are numeric measurements such as request count, latency, CPU usage, and error rate.
Loki stored logs, which are timestamped records of what the application and servers are doing.
Tempo stored traces, which show the path and timing of a request across services.
Grafana visualized dashboards by reading data from Prometheus, Loki, and Tempo.
Alertmanager routed alerts to #DevOps-Alerts.
Node Exporter collected host-level CPU, memory, disk, filesystem, and network metrics.
Blackbox Exporter probed the public staging and production API URLs.
OpenTelemetry Collector received application traces and shipped logs.
A custom DORA exporter scraped GitHub Actions workflow data into Prometheus.
A later Tempo recent-traces exporter exposed trace summaries to Grafana through Prometheus.

The backend also received application-level instrumentation. Instrumentation means adding code or configuration so the system can report what is happening inside it. The FastAPI app exposed /metrics using prometheus_client, with counters, histograms, and in-progress gauges:

http_requests_total
http_request_duration_seconds
http_requests_in_progress

The route-labeling logic normalized unmatched paths into a controlled label to avoid Prometheus label-cardinality explosion from random scanner URLs.
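
The normalization idea can be sketched in a few lines. This is not the V1 code; the route templates and the "unmatched" label value are assumptions chosen for illustration, and a real FastAPI app would usually read the matched route template from its router instead of keeping a separate list.

```python
import re

# Hypothetical route templates; these paths are illustrative, not Anvila's real routes.
KNOWN_ROUTES = [
    (re.compile(r"^/health$"), "/health"),
    (re.compile(r"^/api/personas$"), "/api/personas"),
    (re.compile(r"^/api/personas/[^/]+$"), "/api/personas/{persona_id}"),
]

# One bounded bucket for every path the app does not recognise (assumed name).
UNMATCHED_LABEL = "unmatched"


def route_label(path: str) -> str:
    """Map a raw request path to a bounded value for the Prometheus route label."""
    for pattern, template in KNOWN_ROUTES:
        if pattern.match(path):
            return template
    # Scanner URLs such as /wp-login.php all collapse into a single series.
    return UNMATCHED_LABEL
```

Because every unknown path maps to the same label value, a scanner hitting thousands of random URLs adds one time series instead of thousands.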

Dashboards were provisioned as JSON, and alert rules were committed as YAML. V1 included SLI/SLO definitions, an error budget policy, runbooks, a blameless post-incident review, and Game Day evidence for deployment failure, latency injection, and resource pressure.

Exact Breaking Points

The first major breaking point was the single monitoring EC2 instance. If that instance failed, the team lost Prometheus, Grafana, Loki, Tempo, Alertmanager, dashboards, and local alert routing at the same time. That is unacceptable for production because the observability platform becomes unavailable exactly when operators may need it most. This is the opposite of high availability: one machine became a single point of failure.

V2 fix: split the platform into multiple highly available components. Run at least two OpenTelemetry Collectors behind an internal load balancer, run Grafana in a highly available setup, and move telemetry storage to durable backends. One server failure should reduce capacity, not remove observability.

The second breaking point was local disk dependency. Prometheus, Loki, Tempo, and Grafana state lived on one machine. A disk fill, filesystem issue, or instance replacement could destroy recent telemetry unless backups were handled outside the documented workflow. V1 had retention periods, but retention is not durability.

V2 fix: move logs and traces to object-storage-backed Loki and Tempo, and move metrics to Prometheus remote write or a Prometheus-compatible long-term store such as Mimir. This means telemetry survives instance replacement and is not tied to one local disk.

The third breaking point was access control. Grafana, Prometheus, and Alertmanager were exposed on public ports and protected mainly by a security group allowlist. That is better than open internet access, but it is not strong enough for production. IP allowlists break when team members change networks, and they do not provide identity, auditability, role-based permissions, or revocation at the user level.

V2 fix: put observability tools behind an identity-aware access layer. Grafana should use SSO, MFA, and RBAC, while Prometheus, Loki, Tempo, and Alertmanager APIs should stay private. Access should be based on who the engineer is and what role they have, not only on their IP address.

The fourth breaking point was telemetry ingestion coupling. The application depended on PM2 startup commands and OpenTelemetry runtime instrumentation. If the process was restarted without the instrumentation wrapper, traces could silently disappear while the app continued serving traffic. That creates false confidence: dashboards still exist, but the signal is incomplete.

V2 fix: make telemetry configuration part of the deployment contract. Services should send telemetry to a stable internal OTLP endpoint, and instrumentation settings should be versioned with the service deployment. Health checks should also verify that metrics, logs, and traces are arriving, not only that the API returns HTTP 200.
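
A telemetry-aware health check could look roughly like the sketch below. It assumes some other component records the last time each signal arrived for a service; the function name, the signal names, and the `last_seen` mapping are all hypothetical.

```python
import time
from typing import Dict, List, Optional

# The three signals the deployment contract expects from every service.
REQUIRED_SIGNALS = ("metrics", "logs", "traces")


def stale_signals(
    last_seen: Dict[str, float],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the signals that have not arrived within max_age_seconds.

    last_seen maps a signal name to the Unix timestamp of its latest arrival.
    A signal that was never seen counts as stale.
    """
    now = time.time() if now is None else now
    stale = []
    for signal in REQUIRED_SIGNALS:
        seen = last_seen.get(signal)
        if seen is None or now - seen > max_age_seconds:
            stale.append(signal)
    return stale
```

A post-deploy check could fail the rollout if this list is non-empty, which would catch the V1 case where a restart without the instrumentation wrapper silently dropped traces.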

The fifth breaking point was incomplete DORA accuracy. Deployment frequency and change failure rate were reasonable approximations from GitHub Actions, but lead time and MTTR were not fully event-driven. Deployment confirmation was inferred from workflow success, and MTTR used a placeholder/manual incident metric. This is acceptable for a Stage 6 demo, but it weakens executive reporting because the data can overstate delivery health.

V2 fix: make DORA events explicit. The deployment pipeline should emit deployment started, deployment completed, rollback started, rollback completed, incident opened, and incident resolved events. The DORA exporter should report those real events instead of approximating deployment confirmation and MTTR.
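
To show why explicit events matter, here is a minimal sketch of how three DORA values could be computed once those events exist. The dataclasses and field names are assumptions for illustration, not the exporter's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    commit_time: datetime    # when the change was committed
    completed_at: datetime   # "deployment completed" event
    failed: bool             # True if it led to a rollback or incident


@dataclass
class Incident:
    opened_at: datetime      # "incident opened" event
    resolved_at: datetime    # "incident resolved" event


def change_failure_rate(deployments: List[Deployment]) -> float:
    """Fraction of deployments that failed; 0.0 when there were none."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def mean_lead_time(deployments: List[Deployment]) -> Optional[timedelta]:
    """Average time from commit to completed deployment."""
    if not deployments:
        return None
    total = sum((d.completed_at - d.commit_time for d in deployments), timedelta())
    return total / len(deployments)


def mean_time_to_restore(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from incident opened to incident resolved."""
    if not incidents:
        return None
    total = sum((i.resolved_at - i.opened_at for i in incidents), timedelta())
    return total / len(incidents)
```

Returning None when there is no data, rather than zero, keeps an empty window from looking like a perfect MTTR on an executive dashboard.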

The sixth breaking point was environment and service modeling. Staging and production were monitored, but the stack was not yet designed as a reusable multi-service platform. The labels were Anvila-specific, ownership routing was basic, and onboarding another service would require manual config changes and dashboard duplication. A stronger platform would let another application define its service name, owners, SLOs, alert routes, and dashboards through a repeatable template.

V2 fix: introduce a service registry such as observability_services.yml. Each service defines its name, environment, owners, SLO profile, dashboard folder, and alert route. New services are onboarded by adding a registry entry and using shared dashboard and alert templates.

The seventh breaking point was operational drift. Terraform created the server and uploaded config, but the installation relied heavily on remote shell scripts and mutable systemd services. After deployment, manual server changes could drift away from Git without immediate detection.

V2 fix: reduce mutable server configuration. Terraform should provision infrastructure, while service configuration should move toward immutable images, cloud-init, Ansible, or container orchestration. Config changes should be reviewed in Git and redeployed, not manually patched on the server.

Security Blind Spots

V1 left several security gaps because the immediate goal was proving observability capability under pressure.

Secrets management was too manual. Slack webhook URLs, GitHub tokens, Brevo keys, Terraform variables, and environment-specific credentials were handled through local files or server-side environment files. In a production workflow, files such as private keys, Terraform state, and terraform.tfvars must never live in a publishable repository. They should be stored outside Git, encrypted where possible, rotated if exposure is suspected, and replaced by references to a managed secrets system.

Internal services were not consistently authenticated. Loki had auth_enabled: false, and Prometheus, Alertmanager, and Tempo were designed around network trust rather than service identity. That is common for demos, but production needs stronger boundaries.

Grafana access did not have enterprise-grade identity controls. There was no documented SSO, MFA, per-team RBAC, or audit trail for dashboard and datasource access.

The attack surface was larger than needed. Public access to Grafana, Prometheus, and Alertmanager ports, even through allowlists, increases exposure. Observability systems often contain sensitive data: URLs, headers, traces, stack traces, user IDs, deployment metadata, internal hostnames, and incident timelines.

Log and trace data did not have a documented PII redaction policy. The backend did make good security choices around OAuth logs by hashing emails and avoiding raw token logging, but the wider telemetry platform lacked a formal rule for what must never enter logs, spans, or labels.

2. New Features Fully Designed

Feature 1: Highly Available Telemetry Control Plane

What it does and why it is needed:

V2 replaces the single monitoring server with a highly available observability control plane. A control plane is the part of the system that receives, processes, stores, and routes observability data. The minimum production version runs two OpenTelemetry Collectors behind an internal load balancer, a highly available Grafana instance, and durable backend storage for metrics, logs, and traces. The goal is to ensure that one node failure does not remove visibility during an incident.

Architectural integration:

Application services send OTLP traces, logs, and metrics to an internal telemetry endpoint. OTLP means OpenTelemetry Protocol; it is the standard protocol applications use to send telemetry to OpenTelemetry Collectors. The endpoint load-balances across OpenTelemetry Collectors. Collectors apply batching, memory limits, retries, redaction processors, and routing rules before exporting telemetry to the storage layer.

Prometheus either runs in high-availability pair mode with remote write, or V2 adopts a Prometheus-compatible long-term store such as Grafana Mimir. Remote write means Prometheus keeps scraping metrics but sends a copy to a more durable backend. Loki stores logs using object storage for durability. Tempo stores traces using object storage-backed blocks. Grafana reads from Prometheus/Mimir, Loki, and Tempo.

Data model changes:

Telemetry itself is not stored in the Anvila relational database, but V2 introduces an observability_services.yml registry:

services:
  - service_name: anvila-api
    owner: anvila-devops
    tier: user-facing
    environments: [staging, production]
    slo_profile: public-api-standard
    slack_channel: "#DevOps-Alerts"

This registry becomes the source of truth for labels, alert routes, dashboard folders, and service ownership. If another application joins later, it gets another entry in the same file instead of a separate hand-built monitoring stack.
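
As a rough sketch of how the registry could drive alert routing, the snippet below takes the registry as an already-parsed dictionary and produces one route per service and environment. The output is an intermediate structure that a template step could render into Alertmanager configuration; the receiver naming scheme and the "channel" key are assumptions, not Alertmanager's own schema.

```python
from typing import Dict, List

# The registry as it would look after parsing observability_services.yml.
REGISTRY = {
    "services": [
        {
            "service_name": "anvila-api",
            "owner": "anvila-devops",
            "tier": "user-facing",
            "environments": ["staging", "production"],
            "slo_profile": "public-api-standard",
            "slack_channel": "#DevOps-Alerts",
        }
    ]
}

REQUIRED_KEYS = {
    "service_name", "owner", "tier", "environments", "slo_profile", "slack_channel",
}


def build_alert_routes(registry: Dict) -> List[Dict]:
    """Validate each registry entry and emit one route per environment."""
    routes = []
    for entry in registry["services"]:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            name = entry.get("service_name", "<unnamed>")
            raise ValueError(f"{name} is missing {sorted(missing)}")
        for env in entry["environments"]:
            routes.append({
                "matchers": [
                    f'service="{entry["service_name"]}"',
                    f'environment="{env}"',
                ],
                "receiver": f'slack-{entry["owner"]}',  # hypothetical naming scheme
                "channel": entry["slack_channel"],
            })
    return routes
```

Failing fast on a missing key means a half-filled registry entry is rejected in CI instead of producing a service with no alert owner.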

Trade-offs:

This increases operational complexity and cost. V1 was cheap and simple because everything lived on one EC2 instance. V2 sacrifices simplicity for survivability. The cost is justified because observability must remain available during failures.

Feature 2: Identity-Aware Access Layer

What it does and why it is needed:

V2 puts Grafana, Prometheus, Alertmanager, and trace/log exploration behind an identity-aware access layer. Engineers authenticate through SSO with MFA. Access is granted by team and role, not by IP address.

Architectural integration:

Grafana is placed behind a private ALB or VPN-accessible endpoint. Authentication is delegated to an identity provider. Prometheus, Alertmanager, Loki, and Tempo APIs are not directly public. Engineers query them through Grafana or through short-lived authenticated access paths.

Data model changes:

The observability service registry gains ownership metadata:

owners:
  - team: anvila-devops
    grafana_role: editor
    alert_contact: "#DevOps-Alerts"
    escalation_policy: anvila-primary

Grafana teams map to this registry. Alertmanager routes use the same ownership source.

Trade-offs:

Identity-aware access adds setup time and may slow emergency access if misconfigured. The trade-off is acceptable because production telemetry can contain sensitive operational and user-adjacent data.

Feature 3: Release Health Gate Based on SLO Burn and DORA Signals

What it does and why it is needed:

V1 showed SLO burn and DORA metrics on dashboards, but V2 turns those signals into release policy. An SLO, or Service Level Objective, is a reliability target such as "99.5% of requests should succeed." An error budget is the amount of failure the service can tolerate before it breaks that target. DORA metrics measure software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to restore. In V2, deployments can be blocked, delayed, or escalated when the service is burning error budget too quickly or when change failure rate is above threshold.

The important change is that observability stops being only something engineers look at after a problem. It becomes part of the deployment decision. If the platform already knows the service is unhealthy, the release pipeline should not blindly push more change into production. This is the same idea as a safety gate: before the deployment continues, the system checks whether reliability conditions are acceptable.

SLO burn means the service is consuming its error budget. A slow burn means the service is getting worse gradually. A fast burn means the service is failing quickly enough that the team may break the SLO soon. For example, if the availability target is 99.5%, the service only has 0.5% failure allowance for the window. A fast burn alert means that allowance is being consumed too quickly.
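
The arithmetic behind burn rate is small enough to show directly. This is a sketch of the standard definition (observed error ratio divided by the error budget); the function names are mine, and the 30-day window used in the note after the code is an assumption, since the window length is configurable.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.005 for a 99.5% target."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    1.0 means the budget lasts exactly one SLO window; higher means it runs out sooner.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    return (failed / total) / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """Hours until a full window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate
```

With a 99.5% target, 72 failed requests out of 1,000 is a 7.2% error ratio, which is roughly 14.4 times the 0.5% budget. If that rate were sustained over an assumed 30-day (720-hour) window, the whole budget would be gone in about 50 hours.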

Architectural integration:

CI/CD queries a reliability policy endpoint before production deployment. CI/CD means Continuous Integration and Continuous Deployment: the automated path from code change to deployed service. The policy endpoint reads current SLO burn rate, active critical alerts, deployment failure history, and service tier. For Anvila API, production deployment is blocked if:

fast burn is active;
availability budget is fully consumed;
unresolved critical incident exists;
change failure rate exceeds 15% over the configured window.

The policy endpoint does not replace the deployment pipeline. It gives the pipeline a decision: allow, warn, or block. The deployment tool sends context such as service name, environment, commit SHA, actor, and target version. The policy service then checks Prometheus or Mimir for SLO burn, checks Alertmanager for active critical alerts, checks recent deployment history, and returns a decision with reasons.

Example response:

{
  "decision": "block",
  "service": "anvila-api",
  "environment": "production",
  "reasons": [
    "SLOFastBurn is active",
    "Change failure rate is 18%, above the 15% threshold"
  ]
}

This matters because it gives the operator a clear explanation. The system should not just say "deployment blocked." It should explain which reliability rule failed and what evidence caused the block.
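
The decision logic itself can stay simple. The sketch below assumes the policy service has already gathered the four signals; the `ReleaseSignals` type, the `enforce` flag, and the exact reason strings are illustrative choices, not a finished API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReleaseSignals:
    fast_burn_active: bool
    budget_remaining: float        # fraction of error budget left, 0.0 to 1.0
    open_critical_incidents: int
    change_failure_rate: float     # e.g. 0.18 for 18%


def evaluate_release(
    signals: ReleaseSignals,
    max_cfr: float = 0.15,
    enforce: bool = True,
) -> Dict:
    """Return allow, warn, or block together with the reasons behind it."""
    reasons: List[str] = []
    if signals.fast_burn_active:
        reasons.append("SLOFastBurn is active")
    if signals.budget_remaining <= 0:
        reasons.append("Availability error budget is fully consumed")
    if signals.open_critical_incidents > 0:
        reasons.append("Unresolved critical incident exists")
    if signals.change_failure_rate > max_cfr:
        reasons.append(
            f"Change failure rate is {signals.change_failure_rate:.0%}, "
            f"above the {max_cfr:.0%} threshold"
        )
    if not reasons:
        decision = "allow"
    elif enforce:
        decision = "block"   # e.g. production
    else:
        decision = "warn"    # e.g. staging: report the problem but do not stop
    return {"decision": decision, "reasons": reasons}
```

Collecting every failing rule before deciding, instead of returning on the first one, is what lets the response explain the full picture to the operator.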

Data model changes:

V2 introduces a small reliability metadata database:

CREATE TABLE service_release_policies (
  id UUID PRIMARY KEY,
  service_name TEXT NOT NULL,
  environment TEXT NOT NULL,
  max_change_failure_rate NUMERIC NOT NULL,
  block_on_fast_burn BOOLEAN NOT NULL DEFAULT TRUE,
  block_on_open_critical BOOLEAN NOT NULL DEFAULT TRUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (service_name, environment)
);

CREATE TABLE release_decisions (
  id UUID PRIMARY KEY,
  service_name TEXT NOT NULL,
  environment TEXT NOT NULL,
  commit_sha TEXT NOT NULL,
  decision TEXT NOT NULL CHECK (decision IN ('allow', 'warn', 'block')),
  reasons JSONB NOT NULL,
  requested_by TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

The service_release_policies table stores the rules for each service and environment. A user-facing production API can have stricter rules than an internal staging service. For example, production might block deployment during fast burn, while staging might only warn.

The release_decisions table stores every decision made by the policy service. This creates an audit trail. If someone asks why a deployment was blocked or allowed, the team can inspect the commit SHA, environment, decision, reasons, requester, and timestamp.

Trade-offs:

This can slow feature delivery during reliability incidents. That is intentional. V2 chooses controlled delivery over shipping into known instability.

The cost of this feature is extra operational complexity. The team now needs a policy service, a metadata database, and reliable integrations with CI/CD, Prometheus, and Alertmanager. There is also a risk of false positives: a bad alert rule could block a safe deployment. To manage that risk, the policy should support an emergency override path that requires approval, records the reason, and notifies the team.

The benefit is that release decisions become evidence-based. Instead of relying on a human to remember to check dashboards before deployment, the platform checks the most important reliability signals automatically.

Feature 4: Telemetry Data Hygiene and Redaction

What it does and why it is needed:

V2 standardizes what can enter logs, metrics, and traces. It prevents secrets, OAuth codes, access tokens, raw emails, payment IDs, and high-cardinality labels from polluting telemetry. High-cardinality labels are labels with too many possible values, such as raw URLs, user emails, random IDs, or search terms. They can make Prometheus expensive, noisy, and slow.

Architectural integration:

OpenTelemetry Collector processors redact sensitive fields before exporting. Application logging uses structured JSON with approved fields. Metrics labels are restricted to bounded values such as method, route, status, service, and environment.

Data model changes:

An observability_redaction_rules.yml file defines redaction policies:

redact_fields:
  - authorization
  - cookie
  - access_token
  - refresh_token
  - oauth_code
  - github_token
hash_fields:
  - user_email
  - user_id
drop_span_attributes:
  - http.request.header.authorization

Trade-offs:

Redaction can remove useful debugging context. V2 chooses privacy and security over convenience. Debug-only access to sensitive values should come from controlled application debugging, not broad telemetry storage.

3. Production Readiness

Security

Authentication and authorization are treated as separate concerns. Authentication, often shortened to AuthN, proves who the user or service is. Authorization, often shortened to AuthZ, decides what that identity is allowed to do.

Human authentication uses SSO with MFA for Grafana and for every other human entry point into the observability tools. SSO means Single Sign-On, where users log in through a central identity provider. MFA means Multi-Factor Authentication, where login requires an extra proof beyond a password. Authorization is RBAC-based, meaning Role-Based Access Control: viewers can inspect dashboards, on-call engineers can silence alerts, and platform admins can edit datasources and alert rules. Direct API access to Prometheus, Loki, Tempo, and Alertmanager is blocked from the public internet.

Service-to-service authentication uses private networking plus cloud identity where available. GitHub Actions uses OIDC, or OpenID Connect, to assume AWS roles instead of storing long-lived cloud keys. Secrets move into AWS Secrets Manager or SSM Parameter Store. Terraform receives only secret references, not plaintext secrets. Slack, GitHub, Brevo, Stripe, Gemini, and database credentials are rotated and never stored in Git.

Input validation is enforced at the application and telemetry layers. The FastAPI app already uses typed Pydantic settings and route-level models. V2 extends this by validating telemetry labels and dropping unknown high-cardinality labels at the collector.
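
The label validation step amounts to an allowlist. The sketch below shows the idea in Python, using the bounded label names listed under Feature 4; in V2 the equivalent logic would live in the collector pipeline, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Bounded label names taken from the Feature 4 description.
ALLOWED_LABELS = {"method", "route", "status", "service", "environment"}


def filter_labels(labels: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Keep only allowlisted labels and report which ones were dropped."""
    kept = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    dropped = sorted(k for k in labels if k not in ALLOWED_LABELS)
    return kept, dropped
```

Returning the dropped names makes it possible to count them in a metric, so a service that keeps sending unknown labels can be noticed and fixed at the source.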

The attack surface is minimized by private networking. Only Grafana is reachable through an authenticated access layer. OTLP ports are internal. Node Exporter is reachable only from the collector or Prometheus security group. Admin APIs are private.

Scalability

Horizontal scaling boundaries are explicit:

Anvila API scales horizontally behind a load balancer.
OpenTelemetry Collectors scale horizontally and are stateless.
Grafana scales horizontally behind a load balancer with external database/session storage.
Logs and traces scale through object-storage-backed Loki and Tempo.
Metrics scale through Prometheus HA with remote write or a Prometheus-compatible backend such as Mimir.
Celery workers scale independently for persona generation. Celery is a Python background job system. It lets slow work, such as LLM calls or GitHub publishing, run outside the user-facing HTTP request.

Caching should be precise, not generic. Redis is used for session-adjacent ephemeral state, Celery broker/result coordination, persona generation events, and short-lived rate limit counters. Redis is an in-memory data store, which makes it fast for temporary data. Cache keys must have TTLs, meaning they expire automatically after a defined time. OAuth state remains short-lived and signed. Token blacklists or revocation sets should use a TTL equal to the token's remaining lifetime, so an entry disappears once the token could no longer be used anyway.

Traffic spikes are handled through three layers: API autoscaling, queue-based background processing for expensive persona generation, and backpressure/rate limiting on high-cost endpoints.

API autoscaling means adding more API instances when request volume increases. Queue-based background processing means that expensive work, such as persona generation, is placed into a queue and handled by workers instead of forcing the user-facing API request to wait until all the work is finished. This matters because persona generation may call an LLM provider, write database records, match skills, build files, and publish events. Those operations are slower and more failure-prone than a normal API read request.

Backpressure means the system deliberately slows down or rejects new work when it is already overloaded. Rate limiting means setting a maximum number of requests a user or client can make in a period of time. For high-cost endpoints, such as generation, publishing, OAuth callbacks, or payment actions, V2 should respond with 202 Accepted when work has been queued for background processing and with 429 Too Many Requests when a client exceeds its limit or the system is saturated, instead of allowing unlimited requests to overload Redis, PostgreSQL, the worker pool, or the external LLM provider.

The system should reject excess work predictably instead of letting the database, LLM provider, or worker pool collapse.
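
A minimal sketch of that admission decision is shown below. It uses an in-memory fixed-window counter so the example stays self-contained; the design above would keep these counters in Redis with TTLs. The class and function names, and the choice of a fixed window, are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # In-memory only: old windows are never evicted here, which Redis
        # key expiry would handle in the real system.
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, client_id: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (client_id, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


def admit_generation(
    client_id: str,
    now: float,
    limiter: FixedWindowLimiter,
    queue_depth: int,
    max_queue_depth: int,
) -> int:
    """Return the HTTP status for a persona generation request."""
    if queue_depth >= max_queue_depth:
        return 429  # backpressure: workers are already behind
    if not limiter.allow(client_id, now):
        return 429  # rate limit: this client has used its allowance
    return 202      # accepted and queued for background processing
```

Checking queue depth before the per-client limit means an overloaded worker pool rejects work without consuming anyone's rate limit allowance.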

Observability

Structured logging uses JSON with required fields:

timestamp
service
environment
level
event
request_id
trace_id
route
status_code
duration_ms
error_class

Structured logging means every important log line follows the same shape. Instead of writing vague messages like "something failed", the service emits consistent fields that can be searched and grouped. For example, if an OAuth callback fails, the log should include the service name, environment, route, status code, error class, request ID, and trace ID. That makes it possible to connect a user-visible problem to the exact backend event without exposing secrets.
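
One way to get this shape with Python's standard logging module is a custom formatter, sketched below. Treating the log message as the event name and passing the other fields through `extra` is my assumption about how the service would call it; the field list matches the required fields above.

```python
import json
import logging
from datetime import datetime, timezone

# Fields the caller supplies through `extra`; absent ones are emitted as null.
CONTEXT_FIELDS = (
    "service", "environment", "request_id", "trace_id",
    "route", "status_code", "duration_ms", "error_class",
)


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),  # the message doubles as the event name
        }
        for field in CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.error("oauth_callback_failed", extra={"service": "anvila-api", "route": "/auth/callback", "status_code": 502})` would then produce a single JSON line with every required key present, using null for anything the caller did not supply. The route in that call is a made-up example.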

Core metrics:

request rate by route and status;
p50/p95/p99 latency by route;
5xx rate;
auth failure rate;
OAuth provider failure rate;
persona generation queue depth;
persona generation success/failure count;
Celery task duration and retry count;
Redis connection errors;
database connection pool saturation;
SLO burn rate;
DORA deployment frequency, lead time, CFR, and MTTR.

These metrics are the operational health signals for the platform. Request rate shows demand. Latency shows whether users are waiting too long. Error rate shows whether the service is failing. Queue depth shows whether background workers are falling behind. Database pool saturation shows whether the API is running out of database connections. SLO burn rate shows whether the service is consuming its allowed failure budget too quickly. DORA metrics show whether the team is deploying safely and recovering quickly.

Alerting thresholds:

availability fast burn: 14.4x over 1 hour;
availability slow burn: 5x over 6 hours;
5xx rate above 1% for 5 minutes;
p95 API latency above 500ms for 10 minutes;
CPU above 80% warning and 90% critical;
memory above 80% warning and 90% critical;
disk above 75% warning and 90% critical;
queue depth above expected worker capacity for 10 minutes;
deployment CFR above 15%.

The alert thresholds are intentionally tied to user impact and operational risk. A single slow request should not wake someone up, but sustained high latency or a fast SLO burn should. CPU and memory alerts are useful, but they are secondary signals; the more important question is whether users are experiencing failures or slow responses. Queue depth matters because it warns that background work is piling up before users start complaining that generation is stuck.
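
The "for 5 minutes" part of these thresholds is what keeps a single bad sample from paging anyone. The sketch below shows that idea on a list of timestamped samples; it loosely mirrors the `for` clause of a Prometheus alerting rule and is not a replacement for it. The function name and sample format are assumptions.

```python
from typing import Optional, Sequence, Tuple


def sustained_breach(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> bool:
    """True if the value has stayed above threshold for at least for_seconds.

    samples are (timestamp, value) pairs sorted by ascending timestamp.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    breach_start: Optional[float] = None
    # Walk backwards from the newest sample until the breach is interrupted.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and newest_ts - breach_start >= for_seconds
```

For the 5xx rule above, the call would use a threshold of 0.01 and 300 seconds, so one recovered sample inside the window resets the clock.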

Distributed error tracking connects metrics, logs, and traces through trace_id. Grafana derived fields allow jumping from Loki logs into Tempo traces. Application errors include stable error classes but not raw secrets or full user data.

In practical terms, this means an engineer can start from a Grafana latency spike, open the related Loki logs for the same time window, then jump into the Tempo trace for the exact request. The trace shows where time was spent, while the logs explain what happened. This is the difference between knowing "the API is slow" and knowing "persona generation is slow because the LLM call timed out after 30 seconds."

4. Tech Stack Decisions

FastAPI remains appropriate because Anvila is an API-heavy Python backend, and FastAPI offers strong async support, typed request validation, and easy OpenTelemetry instrumentation. Async support matters because API servers often wait on databases, external APIs, and network calls; async handling helps the server use resources efficiently during that waiting time.

Nginx remains useful at the application edge because it is mature, battle-tested, and efficient at reverse proxying traffic to backend processes. In V2, it should either sit behind an AWS Application Load Balancer or be replaced by a managed ingress layer if the platform moves to containers.

PM2 worked for V1 because it kept the FastAPI staging and production processes running and allowed fast restarts during the task. For V2, PM2 is acceptable for a small VM-based deployment, but a production platform should consider systemd units, containers, or orchestration so process configuration is versioned and less dependent on manual runtime commands.

PostgreSQL remains the primary relational database because Anvila needs transactional integrity for users, personas, OAuth links, refresh tokens, payments, and publishing state. Unique indexes on email and provider subjects protect identity flows from races.

Redis is used for ephemeral coordination because persona generation, rate limiting, short-lived state, Celery queues, and event streams need low-latency TTL-backed storage. "Ephemeral" means temporary. Redis is not the source of truth; if data must survive permanently, it belongs in PostgreSQL or object storage.

Celery remains appropriate for persona generation because LLM calls and GitHub publishing can be slow, retryable, and unsuitable for synchronous request handling. Synchronous request handling would force the user to wait while the server does all the work before returning a response.

Prometheus remains the metrics interface because it is mature, pull-based, and has strong PromQL support for SLO burn-rate alerting. For V2 scale, Prometheus should remote-write to durable long-term storage.

Loki is retained for logs because it integrates tightly with Grafana and is cost-efficient when logs are indexed by labels rather than full text.

Tempo is retained for traces because it is designed for high-volume trace storage with object storage and works well with OpenTelemetry.

Grafana remains the visualization layer because it can unify Prometheus, Loki, and Tempo and supports provisioned dashboards.

OpenTelemetry Collector becomes more central in V2 because it gives a vendor-neutral telemetry pipeline with batching, redaction, retries, memory limits, and routing. Vendor-neutral means the application does not have to be rewritten if the team later changes storage or visualization tools.

Alertmanager remains useful for alert grouping, routing, inhibition, and resolved notifications. V2 strengthens it with ownership metadata and escalation policies.

Node Exporter remains useful for host-level metrics such as CPU, memory, disk, filesystem, and network usage. Blackbox Exporter remains useful for probing public endpoints from outside the application process, because an API can look healthy internally while the public route is broken.

The custom DORA exporter remains useful as a bridge between GitHub Actions and Prometheus. Its V2 responsibility should be narrowed and made more accurate: export deployment timestamps, workflow duration, deployment result, rollback markers, and incident restoration timestamps instead of relying on approximations.

GitHub Actions remains the CI/CD system because it is already the source of deployment workflow events for the Anvila backend. Keeping it reduces migration risk and allows deployment metadata to feed directly into the DORA exporter and release health gate.

Slack remains the primary alert destination because it is where the team already collaborates during incidents. Alertmanager should send structured Slack messages with severity, affected service, current value, dashboard link, runbook link, and resolved/firing status. Slack is not the system of record for incidents, but it is effective for fast human response.

Brevo or another transactional email provider remains useful for lower-urgency notifications and account-related emails. Email should not replace Slack for urgent incidents, but it is useful for user-facing flows, escalation summaries, and non-real-time operational notices.

Gemini remains the LLM provider for persona generation because the existing application already uses it. In V2, LLM calls should stay behind Celery workers and rate limits because they are slower, more expensive, and more failure-prone than normal API reads.

Stripe remains the payment provider where payment features are enabled because it is a mature managed payment platform. The system should not store raw card data. Stripe webhooks should be validated, logged with safe event IDs, and monitored as high-impact integration points.

Terraform remains the infrastructure provisioning tool because the system needs reproducible cloud infrastructure. However, V2 should reduce shell provisioner dependence and move toward immutable images, cloud-init, Ansible, or container orchestration.

AWS remains a reasonable cloud provider because the existing deployment is already on EC2, and AWS provides the missing production pieces: Secrets Manager, IAM, ALB, S3, RDS, ElastiCache, autoscaling, and private networking. ALB means Application Load Balancer, which distributes traffic across healthy backends. S3 provides durable object storage for logs and traces. RDS provides managed PostgreSQL. ElastiCache provides managed Redis.

Grafana Mimir is optional but valuable if metrics retention and scale outgrow a single Prometheus server. The trade-off is operational complexity; the benefit is long-term, horizontally scalable Prometheus-compatible metrics storage.

Proposed V2 Architecture Diagram

The diagram separates the system into six readable paths:

user traffic enters through the public Application Load Balancer and reaches Anvila API replicas;
slow persona generation and GitHub publishing run in Celery workers instead of blocking user requests;
application services send metrics, logs, and traces to an internal OTLP endpoint;
OpenTelemetry Collectors process, redact, batch, and route telemetry to durable storage;
engineers access dashboards through SSO/MFA/RBAC-protected Grafana rather than direct public APIs;
GitHub Actions, the DORA exporter, Alertmanager, and the release health policy close the loop between deployment and reliability.

Conclusion

V1 was successful as a learning-stage observability platform because it proved the full reliability loop: telemetry collection, dashboards, SLOs, alerts, runbooks, incident review, and Game Day validation. It used Anvila as the first real service, which made the work concrete and testable. It was not production-grade because it depended on a single monitoring host, weak identity boundaries, mutable server state, partial DORA accuracy, incomplete data hygiene, and service-specific configuration.

V2 keeps the strongest V1 decisions: LGTM, OpenTelemetry, provisioned dashboards, SLO burn-rate alerting, and DORA visibility. It replaces the fragile parts with highly available collectors, durable telemetry storage, identity-aware access, secrets management, release health gates, and structured ownership metadata. The result is an observability platform that can support Anvila as a real user-facing product and also onboard other applications through the same platform model.
