Alexander Schneider

Posted on May 31

How We Built a CTO-Grade Grafana Dashboard With Codex

#grafana #observability #devops #ai

A good dashboard is not a wall of charts. It is an answer to a question.

For us, the question was simple:

Is production healthy, and if not, where should we look first?

We already had the usual observability ingredients: application metrics, host metrics, container metrics, structured logs, traces, synthetic checks, and database telemetry. The hard part was not collecting more data. The hard part was turning that data into a dashboard that a technical leader could open during a normal day, a deploy, or an incident and understand the state of the system quickly.

Grafana gave us the observability platform. Codex helped us keep the dashboard honest, small, tested, and aligned with the codebase.

This is the playbook we ended up with.

Start With Decisions, Not Charts

The first version of almost every dashboard grows by accumulation:

CPU chart
memory chart
request rate chart
latency chart
error chart
database chart
worker chart
logs chart
another latency chart

That is useful for exploration, but it is not a good top-level operational view.

For our CTO dashboard, we started from decisions:

Is the public API reachable?
Are enough API nodes alive to serve traffic redundantly?
Are workers alive and producing useful data?
Is PostgreSQL healthy?
Are alerts firing?
Is observability itself working?
If something is slow, is it API, worker, database, logs, or tracing?

Only after writing down those questions did we choose panels.

That one change made the dashboard much smaller.

Separate Product Health From Observability Coverage

One of the most important design choices was splitting these two ideas:

Product health: is the service working for users?
Observability coverage: can we see the service clearly?

Those are related, but they are not the same.

If an observability agent stops reporting from one host, that is bad. But it should not automatically make the product look down. Likewise, an API outage should not be hidden because the metrics pipeline is still green.

So our top-level dashboard has separate areas for:

production health
observability coverage
API node readiness
worker health
database health
alerts and drill-down links

This makes incidents easier to reason about. A broken monitoring path is visible, but it does not masquerade as a product outage.
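
A minimal sketch of that separation, assuming the two signals are computed independently. The names `Signals`, `product_ok`, `coverage_ok`, and `summarize` are hypothetical and only illustrate the idea; they are not taken from our dashboards:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # Hypothetical inputs: in practice these would come from separate queries.
    product_ok: bool   # e.g. synthetic checks pass and API errors are low
    coverage_ok: bool  # e.g. every expected agent reported recently


def summarize(signals: Signals) -> str:
    """Report the two dimensions side by side instead of folding them into one."""
    if signals.product_ok and signals.coverage_ok:
        return "healthy"
    if signals.product_ok:
        return "healthy, monitoring degraded"
    if signals.coverage_ok:
        return "product outage"
    return "product outage, monitoring degraded"


print(summarize(Signals(product_ok=True, coverage_ok=False)))
# healthy, monitoring degraded
```

The point of keeping two booleans is that neither one can overwrite the other in the summary.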

Manage Grafana Assets Like Code

The biggest quality jump came when we stopped treating Grafana as a browser-only editing surface.

Our dashboards, alert rules, synthetic checks, and collector configs live in git:

ops/grafana/
  dashboards/
  alerts/
  synthetics/
  alloy/

This gives us normal engineering controls:

pull requests
code review
CI checks
repeatable sync
history
rollback

Dashboard JSON is not pretty, but it is still production behavior. If it decides what engineers see during incidents, it deserves the same review discipline as application code.
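
As a rough illustration of what "repeatable sync" can look like, here is a sketch that turns every dashboard file into a request body. The body shape (`dashboard`, `overwrite`, `message`) is modeled on Grafana's dashboard HTTP API, but check it against your Grafana version. The function name, the commit argument, and the decision to reject dashboards without a `uid` are assumptions for this example, and the actual HTTP call is left out on purpose:

```python
import json
from pathlib import Path


def build_sync_payloads(dashboard_dir: Path, commit: str) -> list[dict]:
    """Turn every dashboard JSON file into a request body for a sync step."""
    payloads = []
    for path in sorted(dashboard_dir.glob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        # A missing uid or title would make the sync non-repeatable.
        if not dashboard.get("uid") or not dashboard.get("title"):
            raise ValueError(f"{path.name}: dashboard needs a stable uid and title")
        payloads.append({
            "dashboard": dashboard,
            "overwrite": True,  # the repository wins over manual edits
            "message": f"sync from {commit}",
        })
    return payloads
```

Sorting the paths keeps the sync order deterministic, which makes CI logs easier to compare between runs.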

Grafana's MCP Is Actually Useful

One thing that surprised us in a good way: Grafana's MCP integration was not a toy.

It was practical enough to help with real dashboard work:

finding dashboards and panels
inspecting datasource-backed queries
checking alerting and dashboard structure
moving between Grafana context and repository context
turning operational questions into concrete dashboard changes

That matters because AI assistance gets much better when it can inspect the actual observability system instead of guessing from exported JSON alone. The Grafana MCP made Codex feel connected to the running operational surface, while the repository still stayed the source of truth.

We also tried similar MCP-style integrations from other observability vendors. In our evaluation, Grafana's was the most useful and reliable for day-to-day engineering work. Some alternatives, including New Relic's, did not feel mature enough for our workflow yet, so I would not recommend adopting them as the primary AI-observability interface today.

That may change, but if I were setting this up again now, I would start with Grafana.

Use Grafana Alloy as the Collection Layer

Grafana Alloy became the edge collector for our production telemetry.

In broad terms, the pipeline looks like this:

application metrics
host metrics
container metrics
database metrics
structured logs
OTLP traces
synthetic checks
        |
        v
Grafana Alloy
        |
        v
Grafana Cloud: Prometheus, Loki, Tempo, dashboards, alerts

Alloy lets us keep the collection profile explicit:

scrape host and container metrics
collect textfile metrics for custom worker state
forward structured logs
receive OTLP traces
remote-write metrics
apply relabeling and cardinality controls before data leaves the host

The important lesson: treat the collector config as part of the product. It defines what you can and cannot see during an incident.

Metrics For SLIs, Logs And Traces For Investigation

We try not to build core service-level indicators from log parsing.

For top-level API health, metrics are the right source:

rate(api_requests_total[5m])
rate(api_request_duration_seconds_bucket[5m])
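
To show how per-bucket rates like the second expression become a latency number, here is a small sketch of quantile estimation from cumulative histogram buckets. It uses linear interpolation inside the matching bucket, which is the same general idea behind PromQL's `histogram_quantile()`; it is not a reimplementation of it, and the bucket values are invented for the example:

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the highest bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented per-second rates for buckets le=0.1, 0.5, 1.0 and +Inf.
example = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(estimate_quantile(0.95, example))  # about 0.75 seconds
```

The estimate is only as precise as the bucket boundaries, which is one reason to choose buckets around the latency targets you care about.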

Logs are still critical, but they are better for answering:

Which request failed?
Which worker cycle failed?
Which trace ID connects API, worker, and database behavior?
What did the application say at the time?

Traces are the next step after metrics and logs. They answer:

Which route was slow?
Where did time go?
Did the database dominate the request?
Was the problem isolated to one worker or one dependency?

The top-level dashboard should point to those tools. It should not try to replace them.

Add Domain Metrics, Not Just Infrastructure Metrics

Infrastructure metrics tell you whether machines are alive.

They do not always tell you whether the business process is doing useful work.

For worker systems, we added domain-specific metrics such as:

worker alive
heartbeat age
last cycle status
last cycle duration
items produced in the latest cycle
items observed over recent windows
consecutive failures

This matters because a worker can be technically alive and still produce no useful output.

That distinction changed the quality of our alerts. We could alert on:

the worker is alive, but it has produced no useful data for a while

That is much better than only alerting when the process is dead.
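
A sketch of how a worker could publish this kind of state as textfile metrics, written in the Prometheus text exposition format. The metric names, the `worker_type` label, and the 120-second heartbeat limit are assumptions made for this example, not our real names or thresholds:

```python
import os
import tempfile


def render_worker_metrics(worker_type: str, last_success_ts: float,
                          items_produced: int, consecutive_failures: int) -> str:
    """Render worker state in the Prometheus text exposition format."""
    labels = f'{{worker_type="{worker_type}"}}'
    lines = [
        "# TYPE worker_last_success_timestamp_seconds gauge",
        f"worker_last_success_timestamp_seconds{labels} {last_success_ts}",
        "# TYPE worker_last_cycle_items_produced gauge",
        f"worker_last_cycle_items_produced{labels} {items_produced}",
        "# TYPE worker_consecutive_failures gauge",
        f"worker_consecutive_failures{labels} {consecutive_failures}",
    ]
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file, then rename, so a scrape never reads a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp defaults to owner-only permissions
    os.replace(tmp_path, path)


def alive_but_idle(heartbeat_age_s: float, items_recent: int,
                   max_heartbeat_age_s: float = 120.0) -> bool:
    """True when the worker is heartbeating but has produced nothing recently."""
    return heartbeat_age_s <= max_heartbeat_age_s and items_recent == 0
```

In a real setup the `alive_but_idle` condition would live in an alert rule rather than in Python; it is spelled out here only to make the logic explicit.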

Be Ruthless About Metric Semantics

One small example: we had a non-scraper background worker showing up in a "mentions produced" panel with a value of zero.

Technically, the metric existed.

Operationally, it was noise.

The fix was not to hide the worker everywhere. Its status and duration still mattered. The correct fix was to exclude it only from panels where "mentions" was the business meaning.

That is the kind of dashboard maintenance Codex is very good at:

inspect the dashboard JSON
find the PromQL targets
understand which panels use which metric
make the smallest scoped change
add a regression test

Small semantic fixes like this are what make dashboards feel trustworthy.
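
The "understand which panels use which metric" step can be sketched in a few lines. This assumes the usual dashboard JSON layout, where each panel has a `targets` list with an `expr` field and row panels may nest further panels under `panels`. The metric names in the example are hypothetical:

```python
from typing import Iterator


def iter_panels(dashboard: dict) -> Iterator[dict]:
    """Yield every panel, including panels nested one level inside rows."""
    for panel in dashboard.get("panels", []):
        yield panel
        yield from panel.get("panels", [])


def panels_using(dashboard: dict, metric: str) -> list[str]:
    """Return titles of panels whose PromQL mentions the metric (substring match)."""
    return [
        panel.get("title", "")
        for panel in iter_panels(dashboard)
        if any(metric in target.get("expr", "") for target in panel.get("targets", []))
    ]


EXAMPLE = {
    "panels": [
        {"title": "Mentions Produced", "targets": [{"expr": "sum(worker_items_produced)"}]},
        {"title": "Workers", "type": "row", "panels": [
            {"title": "Cycle Duration", "targets": [{"expr": "worker_cycle_duration_seconds"}]},
        ]},
    ]
}
print(panels_using(EXAMPLE, "worker_items_produced"))  # ['Mentions Produced']
```

A substring match is deliberately naive: it would also match a longer metric name that starts with the same text, so treat the result as a list of candidates to review.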

Test The Dashboard

Testing dashboards sounds strange until the first time a dashboard breaks during an incident.

We test things like:

dashboard JSON is valid
dashboard titles and UIDs stay stable
required panels exist
important PromQL expressions are still present
alert rule groups keep stable identifiers
datasource UIDs are correct
high-cardinality labels do not leak into metrics or logs
business panels exclude irrelevant worker types

The tests do not need to render the dashboard. They need to protect the contract.

Example contract:

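# query: the PromQL expression behind the "mentions produced" panel
assert "worker_type" in query
# mention_panel_series: the worker types that panel ends up showing
assert "analytics_rollup" not in mention_panel_series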

In practice, the real tests are a little more robust than this, but the idea is simple: encode the intent.
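
For a slightly fuller picture, here is a self-contained sketch of such a contract test. The dashboard excerpt is inlined and hypothetical (a real test would load the file from the repository), and the metric names and the `cto-overview` UID are assumptions for the example:

```python
# Hypothetical dashboard excerpt; metric names and UID are invented.
DASHBOARD = {
    "uid": "cto-overview",
    "title": "CTO Overview",
    "panels": [
        {
            "title": "Mentions Produced",
            "targets": [{
                "expr": 'sum by (worker_type) '
                        '(worker_items_produced{worker_type!="analytics_rollup"})'
            }],
        },
        {
            "title": "Worker Cycle Duration",
            "targets": [{"expr": "max by (worker_type) (worker_cycle_duration_seconds)"}],
        },
    ],
}


def exprs_for(dashboard: dict, panel_title: str) -> list[str]:
    """Return the PromQL expressions of a required panel, or fail loudly."""
    for panel in dashboard["panels"]:
        if panel["title"] == panel_title:
            return [target["expr"] for target in panel["targets"]]
    raise AssertionError(f"required panel missing: {panel_title}")


def test_dashboard_uid_is_stable() -> None:
    assert DASHBOARD["uid"] == "cto-overview"


def test_mentions_panel_excludes_rollup_worker() -> None:
    for expr in exprs_for(DASHBOARD, "Mentions Produced"):
        assert 'worker_type!="analytics_rollup"' in expr


def test_duration_panel_does_not_filter_rollup_worker() -> None:
    # The exclusion is scoped: duration still matters for every worker type.
    for expr in exprs_for(DASHBOARD, "Worker Cycle Duration"):
        assert "analytics_rollup" not in expr
```

The third test is the important one: it stops a well-meaning cleanup from spreading the exclusion to panels where the worker still belongs.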

Where Codex Helped Most

Codex was useful because it could work across the repo, not just inside one file.

A dashboard change often touches several places:

dashboard JSON
alert YAML
collector config
metric writer script
tests
docs
deployment sync logic

Codex could inspect those relationships and avoid local-only fixes that looked correct but broke the wider system.

The best workflow was:

state the operational problem in plain language
ask Codex to inspect the relevant repo paths
require a small scoped diff
require tests
review the PromQL and dashboard semantics like production code

Codex is not a replacement for operational judgment. It is a very fast assistant for applying that judgment consistently.

The Dashboard Architecture We Recommend

For a production service, I would structure Grafana like this:

1. CTO Overview

The top-level dashboard should answer:

is production healthy?
are users affected?
are alerts firing?
are enough nodes serving traffic?
are workers producing useful output?
is the database healthy?
is observability coverage intact?

Keep this dashboard short.

2. Production Infrastructure

This is where you put:

host CPU and memory
container CPU and memory
disk usage
process freshness
container freshness
agent health

3. Database Performance

This is where you put:

connection usage
cache hit ratio
lock pressure
transaction rate
temp file churn
slow query groups
table pressure
top query fingerprints

4. Tracing

This is where you put:

accepted spans
failed exports
slow API traces
recent worker traces
database spans

5. Synthetic Monitoring

This is where you put:

public liveness
readiness
deeper health checks
multi-region latency

The overview links to the diagnostic dashboards. It does not try to become all of them.

Security And Cardinality Rules

A public article should say this plainly: observability can leak data if you are careless.

Our rules:

never put API keys in labels
never put customer identifiers in labels
never put raw URLs with query strings in labels
never put request IDs or trace IDs in Prometheus labels
keep secrets out of dashboard JSON
keep collector credentials outside git
use route templates instead of raw paths
prefer bounded enum labels
drop or avoid high-cardinality metrics before remote write

Logs and traces can contain richer context, but even there you should be deliberate. Query-time parsing is often safer than promoting every field to a label.
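
Some of these rules can be checked mechanically. The sketch below scans text in the Prometheus exposition format for label names on a denylist. The denylist itself is hypothetical, and the parsing is intentionally naive (it does not handle label values that contain `="` themselves), so treat it as a starting point rather than a complete linter:

```python
import re

# Hypothetical denylist; adjust it to the identifiers your services emit.
FORBIDDEN_LABELS = {"api_key", "customer_id", "request_id", "trace_id", "url"}

LABEL_NAME = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="')


def forbidden_labels(exposition_text: str) -> set[str]:
    """Return denylisted label names found in exposition-format sample lines."""
    found: set[str] = set()
    for line in exposition_text.splitlines():
        # Skip comments and samples that carry no label block.
        if line.startswith("#") or "{" not in line or "}" not in line:
            continue
        label_block = line[line.index("{") + 1:line.rindex("}")]
        found.update(
            name for name in LABEL_NAME.findall(label_block)
            if name in FORBIDDEN_LABELS
        )
    return found


sample = 'api_requests_total{route="/items/{id}",trace_id="abc123"} 7\n'
print(forbidden_labels(sample))  # {'trace_id'}
```

Running a check like this in CI against a captured metrics sample catches a leaked label before it reaches remote write.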

Practical Tips

Here are the rules I would reuse on any team.

Write The Dashboard Goal First

Before adding panels, write one sentence:

This dashboard exists to help us decide ...

If you cannot finish that sentence, you are probably building a chart collection, not a dashboard.

Use Boring Names

Panel names should be operationally obvious:

API 5xx Error Rate
API Nodes Ready
Workers Alive
PostgreSQL Connections
Worker Heartbeat Age

Avoid clever names. During incidents, nobody wants to decode poetry.

Prefer Fewer Panels

The best overview dashboard is the one people actually use.

If a panel does not change a decision, move it to a drill-down dashboard.

Add Links

Every top-level panel should have an obvious next place to go:

production overview
database dashboard
tracing dashboard
logs exploration
runbook
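
This rule is easy to enforce in a test. The sketch assumes panel links are stored in each panel's `links` list in the dashboard JSON; the panel titles reuse the names above, while the link target is a made-up URL:

```python
def panels_without_links(dashboard: dict) -> list[str]:
    """Return titles of non-row panels that have no panel links configured."""
    return [
        panel.get("title", "<untitled>")
        for panel in dashboard.get("panels", [])
        if panel.get("type") != "row" and not panel.get("links")
    ]


OVERVIEW = {
    "panels": [
        {"title": "PostgreSQL Connections", "type": "stat",
         "links": [{"title": "Database dashboard", "url": "/d/db-performance"}]},
        {"title": "Worker Heartbeat Age", "type": "stat"},
    ]
}
print(panels_without_links(OVERVIEW))  # ['Worker Heartbeat Age']
```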

Test The Intent

Do not only test JSON validity.

Test the meaning:

this panel uses the API metrics datasource
this alert has a stable UID
this worker is excluded from this business metric
this latency alert has a traffic floor
this dashboard does not depend on a deprecated metric
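
The traffic floor deserves a concrete illustration, because it is the least obvious item on the list. With very little traffic, a single slow request can dominate a latency percentile. A sketch of the intended logic, with invented thresholds and hypothetical parameter names:

```python
def latency_alert_should_fire(p95_seconds: float, requests_per_second: float,
                              latency_threshold: float = 1.0,
                              traffic_floor: float = 0.5) -> bool:
    """Fire only when latency is high and there is enough traffic to trust it."""
    has_enough_traffic = requests_per_second >= traffic_floor
    return has_enough_traffic and p95_seconds > latency_threshold


# One slow request during a quiet period should not page anyone.
print(latency_alert_should_fire(p95_seconds=4.0, requests_per_second=0.01))  # False
# The same latency under real load should.
print(latency_alert_should_fire(p95_seconds=4.0, requests_per_second=20.0))  # True
```

In an alert rule the same idea is usually expressed by combining the latency condition with a request-rate condition in the query itself.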

Keep Manual Edits Temporary

Manual Grafana edits are great for exploration. They are not a source of truth.

Once the panel matters, export it, commit it, review it, and sync it from code.

What "Perfect" Means

The perfect dashboard is not the biggest dashboard.

It is the dashboard where every panel earns its place.

For us, that meant:

metrics for health
logs for detail
traces for causality
alerts tied to action
dashboards stored in git
tests that protect dashboard meaning
Codex helping us keep changes small and consistent

Grafana made the system visible. Codex helped us make that visibility maintainable. That combination turned observability from a collection of charts into an operational product.
