DEV Community

Alexander Schneider

How We Built a CTO-Grade Grafana Dashboard With Codex

A good dashboard is not a wall of charts. It is an answer to a question.

For us, the question was simple:

Is production healthy, and if not, where should we look first?

We already had the usual observability ingredients: application metrics, host metrics, container metrics, structured logs, traces, synthetic checks, and database telemetry. The hard part was not collecting more data. The hard part was turning that data into a dashboard that a technical leader could open during a normal day, a deploy, or an incident and understand the state of the system quickly.

Grafana gave us the observability platform. Codex helped us keep the dashboard honest, small, tested, and aligned with the codebase.

This is the playbook we ended up with.

Start With Decisions, Not Charts

The first version of almost every dashboard grows by accumulation:

  • CPU chart
  • memory chart
  • request rate chart
  • latency chart
  • error chart
  • database chart
  • worker chart
  • logs chart
  • another latency chart

That is useful for exploration, but it is not a good top-level operational view.

For our CTO dashboard, we started from decisions:

  • Is the public API reachable?
  • Are enough API nodes alive to serve traffic redundantly?
  • Are workers alive and producing useful data?
  • Is PostgreSQL healthy?
  • Are alerts firing?
  • Is observability itself working?
  • If something is slow, is it API, worker, database, logs, or tracing?

Only after writing down those questions did we choose panels.

That one change made the dashboard much smaller.

Separate Product Health From Observability Coverage

One of the most important design choices was splitting these two ideas:

  1. Product health: is the service working for users?
  2. Observability coverage: can we see the service clearly?

Those are related, but they are not the same.

If an observability agent stops reporting from one host, that is bad. But it should not automatically make the product look down. Likewise, an API outage should not be hidden because the metrics pipeline is still green.

So our top-level dashboard has separate areas for:

  • production health
  • observability coverage
  • API node readiness
  • worker health
  • database health
  • alerts and drill-down links

This makes incidents easier to reason about. A broken monitoring path is visible, but it does not masquerade as a product outage.
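
A minimal sketch of that separation, assuming product health is derived only from user-facing probes and coverage only from collectors reporting in. The `Overview` type, the `summarize` function, and the status strings are hypothetical names for illustration, not part of our dashboards:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overview:
    product_health: str          # "ok", "degraded", or "down"
    observability_coverage: str  # "ok" or "degraded"


def summarize(probe_results: list[bool], agents_reporting: int, agents_expected: int) -> Overview:
    """Derive the two top-level states from independent signals."""
    # Product health depends only on user-facing probes (hypothetical input).
    if probe_results and all(probe_results):
        product = "ok"
    elif any(probe_results):
        product = "degraded"
    else:
        product = "down"
    # Coverage depends only on collectors reporting in; it never changes product health.
    coverage = "ok" if agents_reporting >= agents_expected else "degraded"
    return Overview(product, coverage)
```

With this shape, a missing agent shows up as `observability_coverage="degraded"` while `product_health` stays whatever the probes say.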

Manage Grafana Assets Like Code

The biggest quality jump came when we stopped treating Grafana as a browser-only editing surface.

Our dashboards, alert rules, synthetic checks, and collector configs live in git:

ops/grafana/
  dashboards/
  alerts/
  synthetics/
  alloy/

This gives us normal engineering controls:

  • pull requests
  • code review
  • CI checks
  • repeatable sync
  • history
  • rollback

Dashboard JSON is not pretty, but it is still production behavior. If it decides what engineers see during incidents, it deserves the same review discipline as application code.
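
As a rough idea of what a CI check over that directory could look like, here is a small sketch. It assumes each dashboard is a single JSON file with top-level `uid` and `title` keys; the function names are hypothetical and the real checks may differ:

```python
import json
from pathlib import Path


def check_dashboard(raw: str) -> list[str]:
    """Return a list of problems found in one dashboard file's contents."""
    try:
        dashboard = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(dashboard, dict):
        return ["top-level value is not an object"]
    # A stable uid and a title are the minimum needed for repeatable sync.
    return [f"missing {key}" for key in ("uid", "title") if not dashboard.get(key)]


def check_tree(root: Path) -> dict[str, list[str]]:
    """Check every *.json file under root and keep only the files with problems."""
    failures: dict[str, list[str]] = {}
    for path in sorted(root.glob("*.json")):
        problems = check_dashboard(path.read_text(encoding="utf-8"))
        if problems:
            failures[str(path)] = problems
    return failures
```

A CI job could call `check_tree(Path("ops/grafana/dashboards"))` and fail the build when the result is non-empty.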

Grafana's MCP Is Actually Useful

One thing that surprised us in a good way: Grafana's MCP integration was not a toy.

It was practical enough to help with real dashboard work:

  • finding dashboards and panels
  • inspecting datasource-backed queries
  • checking alerting and dashboard structure
  • moving between Grafana context and repository context
  • turning operational questions into concrete dashboard changes

That matters because AI assistance gets much better when it can inspect the actual observability system instead of guessing from exported JSON alone. The Grafana MCP made Codex feel connected to the running operational surface, while the repository still stayed the source of truth.

We also tried similar MCP-style integrations from other observability vendors. In our evaluation, Grafana's was the most useful and reliable for day-to-day engineering work. Some alternatives, including New Relic's, did not feel mature enough for our workflow yet, so I would not recommend adopting them as the primary AI-observability interface today.

That may change, but if I were setting this up again now, I would start with Grafana.

Use Grafana Alloy as the Collection Layer

Grafana Alloy became the edge collector for our production telemetry.

In broad terms, the pipeline looks like this:

application metrics
host metrics
container metrics
database metrics
structured logs
OTLP traces
synthetic checks
        |
        v
Grafana Alloy
        |
        v
Grafana Cloud: Prometheus, Loki, Tempo, dashboards, alerts

Alloy lets us keep the collection profile explicit:

  • scrape host and container metrics
  • collect textfile metrics for custom worker state
  • forward structured logs
  • receive OTLP traces
  • remote-write metrics
  • apply relabeling and cardinality controls before data leaves the host

The important lesson: treat the collector config as part of the product. It defines what you can and cannot see during an incident.

Metrics For SLIs, Logs And Traces For Investigation

We try not to build core service-level indicators from log parsing.

For top-level API health, metrics are the right source:

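# Request throughput across all API nodes, in requests per second
sum(rate(api_requests_total[5m]))
# Latency quantile from histogram buckets (0.95 is an illustrative choice)
histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))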

Logs are still critical, but they are better for answering:

  • Which request failed?
  • Which worker cycle failed?
  • Which trace ID connects API, worker, and database behavior?
  • What did the application say at the time?

Traces are the next step after metrics and logs. They answer:

  • Which route was slow?
  • Where did time go?
  • Did the database dominate the request?
  • Was the problem isolated to one worker or one dependency?

The top-level dashboard should point to those tools. It should not try to replace them.

Add Domain Metrics, Not Just Infrastructure Metrics

Infrastructure metrics tell you whether machines are alive.

They do not always tell you whether the business process is useful.

For worker systems, we added domain-specific metrics such as:

  • worker alive
  • heartbeat age
  • last cycle status
  • last cycle duration
  • items produced in the latest cycle
  • items observed over recent windows
  • consecutive failures

This matters because a worker can be technically alive and still produce no useful output.

That distinction changed the quality of our alerts. We could alert on:

the worker is alive, but it has produced no useful data for a while

That is much better than only alerting when the process is dead.
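
One way to publish this kind of worker state is a small writer that emits the Prometheus text exposition format into a file for a textfile collector to pick up. The sketch below is only an illustration: the metric names, the `worker_type` label value, and both function names are assumptions, not our actual metric writer.

```python
import os
import tempfile


def render_worker_metrics(worker_type: str, heartbeat_ts: int, items_produced: int, consecutive_failures: int) -> str:
    """Render worker state in the Prometheus text exposition format."""
    labels = f'{{worker_type="{worker_type}"}}'
    lines = [
        "# TYPE worker_heartbeat_timestamp_seconds gauge",
        f"worker_heartbeat_timestamp_seconds{labels} {heartbeat_ts}",
        "# TYPE worker_last_cycle_items gauge",
        f"worker_last_cycle_items{labels} {items_produced}",
        "# TYPE worker_consecutive_failures gauge",
        f"worker_consecutive_failures{labels} {consecutive_failures}",
    ]
    return "\n".join(lines) + "\n"


def write_textfile(path: str, content: str) -> None:
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; relax it so a separate collector user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

With a heartbeat timestamp and an items gauge side by side, an alert can distinguish "process dead" (stale heartbeat) from "alive but idle" (fresh heartbeat, zero items).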

Be Ruthless About Metric Semantics

One small example: we had a non-scraper background worker showing up in a "mentions produced" panel with a value of zero.

Technically, the metric existed.

Operationally, it was noise.

The fix was not to hide the worker everywhere. Its status and duration still mattered. The correct fix was to exclude it only from panels where "mentions" was the business meaning.

That is the kind of dashboard maintenance Codex is very good at:

  1. inspect the dashboard JSON
  2. find the PromQL targets
  3. understand which panels use which metric
  4. make the smallest scoped change
  5. add a regression test

Small semantic fixes like this are what make dashboards feel trustworthy.
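
The first three steps can be approximated with a few lines of code. This sketch assumes the usual dashboard JSON layout, where each panel has a `title` and a `targets` list with an `expr` field, and collapsed rows nest their children under `panels`. The helper names and the sample metric are hypothetical:

```python
def iter_panels(panels):
    """Yield every panel, including those nested inside collapsed rows."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels") or [])


def panel_queries(dashboard: dict) -> dict[str, list[str]]:
    """Map panel title to the PromQL expressions of its targets."""
    result: dict[str, list[str]] = {}
    for panel in iter_panels(dashboard.get("panels") or []):
        exprs = [t["expr"] for t in panel.get("targets") or [] if t.get("expr")]
        if exprs:
            result[panel.get("title", "")] = exprs
    return result


def panels_using(dashboard: dict, metric: str) -> list[str]:
    """List the titles of panels whose expressions mention the metric name."""
    return sorted(
        title
        for title, exprs in panel_queries(dashboard).items()
        if any(metric in expr for expr in exprs)
    )
```

Knowing exactly which panels reference a metric is what makes the "smallest scoped change" possible: the exclusion goes into those panels and nowhere else.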

Test The Dashboard

Testing dashboards sounds strange until the first time a dashboard breaks during an incident.

We test things like:

  • dashboard JSON is valid
  • dashboard titles and UIDs stay stable
  • required panels exist
  • important PromQL expressions are still present
  • alert rule groups keep stable identifiers
  • datasource UIDs are correct
  • high-cardinality labels do not leak into metrics or logs
  • business panels exclude irrelevant worker types

The tests do not need to render the dashboard. They need to protect the contract.

Example contract:

assert "worker_type" in query
assert "analytics_rollup" not in mention_panel_series

In practice, the real tests are a little more robust than this, but the idea is simple: encode the intent.
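
For illustration, a slightly sturdier version of that contract might look like the sketch below. It is not our actual test suite. The panel title, the metric name in the usage note, and both helper names are assumptions, and the helper only recognizes negative matchers (`!=` and `!~`) on `worker_type`:

```python
import re

# Matches negative label matchers such as worker_type!="x" or worker_type!~"x|y".
_NEGATIVE_MATCHER = re.compile(r'worker_type\s*(!=|!~)\s*"([^"]*)"')


def excludes_worker_type(expr: str, worker_type: str) -> bool:
    """True when expr has a negative worker_type matcher covering worker_type."""
    for operator, value in _NEGATIVE_MATCHER.findall(expr):
        if operator == "!=" and value == worker_type:
            return True
        # Prometheus regex matchers are fully anchored, hence fullmatch.
        # Python's re is used here as an approximation of RE2 syntax.
        if operator == "!~" and re.fullmatch(value, worker_type):
            return True
    return False


def offending_targets(dashboard: dict, panel_title: str, worker_type: str) -> list[str]:
    """Return expressions in the named panel that do not exclude worker_type."""
    for panel in dashboard.get("panels", []):
        if panel.get("title") == panel_title:
            return [
                target.get("expr", "")
                for target in panel.get("targets", [])
                if not excludes_worker_type(target.get("expr", ""), worker_type)
            ]
    # A missing panel is itself a contract violation.
    raise KeyError(f"required panel missing: {panel_title}")
```

A test would then assert that `offending_targets(dashboard, "Mentions Produced", "analytics_rollup")` is empty, which fails both when the exclusion is dropped and when the panel disappears.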

Where Codex Helped Most

Codex was useful because it could work across the repo, not just inside one file.

A dashboard change often touches several places:

  • dashboard JSON
  • alert YAML
  • collector config
  • metric writer script
  • tests
  • docs
  • deployment sync logic

Codex could inspect those relationships and avoid local-only fixes that looked correct but broke the wider system.

The best workflow was:

  1. state the operational problem in plain language
  2. ask Codex to inspect the relevant repo paths
  3. require a small scoped diff
  4. require tests
  5. review the PromQL and dashboard semantics like production code

Codex is not a replacement for operational judgment. It is a very fast assistant for applying that judgment consistently.

The Dashboard Architecture We Recommend

For a production service, I would structure Grafana like this:

1. CTO Overview

The top-level dashboard should answer:

  • is production healthy?
  • are users affected?
  • are alerts firing?
  • are enough nodes serving traffic?
  • are workers producing useful output?
  • is the database healthy?
  • is observability coverage intact?

Keep this dashboard short.

2. Production Infrastructure

This is where you put:

  • host CPU and memory
  • container CPU and memory
  • disk usage
  • process freshness
  • container freshness
  • agent health

3. Database Performance

This is where you put:

  • connection usage
  • cache hit ratio
  • lock pressure
  • transaction rate
  • temp file churn
  • slow query groups
  • table pressure
  • top query fingerprints

4. Tracing

This is where you put:

  • accepted spans
  • failed exports
  • slow API traces
  • recent worker traces
  • database spans

5. Synthetic Monitoring

This is where you put:

  • public liveness
  • readiness
  • deeper health checks
  • multi-region latency

The overview links to the diagnostic dashboards. It does not try to become all of them.

Security And Cardinality Rules

A public article should say this plainly: observability can leak data if you are careless.

Our rules:

  • never put API keys in labels
  • never put customer identifiers in labels
  • never put raw URLs with query strings in labels
  • never put request IDs or trace IDs in Prometheus labels
  • keep secrets out of dashboard JSON
  • keep collector credentials outside git
  • use route templates instead of raw paths
  • prefer bounded enum labels
  • drop or avoid high-cardinality metrics before remote write

Logs and traces can contain richer context, but even there you should be deliberate. Query-time parsing is often safer than promoting every field to a label.
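
Two of these rules are easy to sketch in code: turning raw paths into route templates, and rejecting forbidden label names. The example below is a hedged illustration only. The forbidden-name set, the `:id` and `:uuid` placeholders, and the function names are assumptions, and where the web framework already exposes the matched route template, that is the better source:

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

# Hypothetical deny-list; adjust to the identifiers your system actually uses.
FORBIDDEN_LABELS = {"api_key", "customer_id", "request_id", "trace_id", "url"}


def route_template(raw_url: str) -> str:
    """Strip the query string and replace ID-like path segments with placeholders."""
    path = raw_url.split("?", 1)[0].split("#", 1)[0]
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


def unsafe_label_names(labels: dict) -> list[str]:
    """Return the label names that must never reach Prometheus."""
    return sorted(name for name in labels if name in FORBIDDEN_LABELS)
```

The point of the template is boundedness: `/users/:id/orders` is one series per route, while raw paths create one series per user.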

Practical Tips

Here are the rules I would reuse on any team.

Write The Dashboard Goal First

Before adding panels, write one sentence:

This dashboard exists to help us decide ...

If you cannot finish that sentence, you are probably building a chart collection, not a dashboard.

Use Boring Names

Panel names should be operationally obvious:

  • API 5xx Error Rate
  • API Nodes Ready
  • Workers Alive
  • PostgreSQL Connections
  • Worker Heartbeat Age

Avoid clever names. During incidents, nobody wants to decode poetry.

Prefer Fewer Panels

The best overview dashboard is the one people actually use.

If a panel does not change a decision, move it to a drill-down dashboard.

Add Links

Every top-level panel should have an obvious next place to go:

  • production overview
  • database dashboard
  • tracing dashboard
  • logs exploration
  • runbook
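
This rule can also be checked mechanically. The sketch below assumes panel links live in each panel's `links` list in the dashboard JSON. It does not look at data links configured under field options or at nested row panels, and the function name is hypothetical:

```python
def panels_without_links(dashboard: dict) -> list[str]:
    """Return titles of top-level panels that have no panel links."""
    missing = []
    for panel in dashboard.get("panels", []):
        # Rows are layout containers, so they are not expected to link anywhere.
        if panel.get("type") == "row":
            continue
        if not panel.get("links"):
            missing.append(panel.get("title", "<untitled>"))
    return missing
```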

Test The Intent

Do not only test JSON validity.

Test the meaning:

  • this panel uses the API metrics datasource
  • this alert has a stable UID
  • this worker is excluded from this business metric
  • this latency alert has a traffic floor
  • this dashboard does not depend on a deprecated metric

Keep Manual Edits Temporary

Manual Grafana edits are great for exploration. They are not a source of truth.

Once the panel matters, export it, commit it, review it, and sync it from code.

What "Perfect" Means

The perfect dashboard is not the biggest dashboard.

It is the dashboard where every panel earns its place.

For us, that meant:

  • metrics for health
  • logs for detail
  • traces for causality
  • alerts tied to action
  • dashboards stored in git
  • tests that protect dashboard meaning
  • Codex helping us keep changes small and consistent

Grafana made the system visible. Codex helped us make that visibility maintainable. That combination turned observability from a collection of charts into an operational product.
