manja316
I Shipped 12 SigNoz Dashboard PRs in 4 Days (864 Panels, 134K Lines)

I shipped 12 SigNoz dashboard PRs in 4 days — totaling 864 panels and 134K lines of JSON. Here's the technical architecture that made it possible.

Why SigNoz?

SigNoz is an open-source observability platform (OpenTelemetry-native) that's growing fast. Their dashboard template system uses a specific JSON schema that, once you understand it, becomes a factory for monitoring configs.

The key insight: SigNoz rewards multiple contributors per dashboard issue. Unlike typical bounties where first-to-merge wins, quality submissions all get paid. This changes the economics entirely.

The SigNoz Dashboard JSON Schema

Every SigNoz dashboard follows this structure:

{
  "title": "Kafka Server Monitoring",
  "description": "Comprehensive monitoring for Apache Kafka brokers",
  "tags": ["kafka", "messaging", "prometheus"],
  "layout": [
    {
      "id": "panel-uuid",
      "x": 0, "y": 0, "w": 6, "h": 2,
      "panelTypes": "graph"
    }
  ],
  "widgets": [
    {
      "id": "panel-uuid",
      "title": "Broker Active Controller Count",
      "description": "Should always be exactly 1 in a healthy cluster",
      "panelTypes": "value",
      "queryData": {
        "queryType": "builder",
        "promQL": "kafka_controller_active_controller_count"
      }
    }
  ]
}

The layout array controls grid positioning (12-column grid). The widgets array holds the actual panel definitions. Every widget needs a matching layout entry.
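That widget-to-layout pairing is the most common thing to get wrong by hand. A minimal sketch of a pre-submission check (the `validate_dashboard` helper is my own, not part of SigNoz; field names follow the example schema above):

```python
# Check that widgets and layout entries pair up by id, and that no panel
# overflows the 12-column grid. Field names match the example schema.
def validate_dashboard(dashboard: dict) -> list[str]:
    errors = []
    layout_ids = {entry["id"] for entry in dashboard.get("layout", [])}
    widget_ids = {w["id"] for w in dashboard.get("widgets", [])}

    for missing in sorted(widget_ids - layout_ids):
        errors.append(f"widget {missing} has no layout entry")
    for orphan in sorted(layout_ids - widget_ids):
        errors.append(f"layout entry {orphan} has no widget")

    for entry in dashboard.get("layout", []):
        if entry["x"] + entry["w"] > 12:  # 12-column grid
            errors.append(f"panel {entry['id']} overflows the grid")
    return errors
```

Run it over every dashboard file before opening the PR; an empty list means the ids and grid positions are consistent.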

My 12-Dashboard Sprint

| PR | Technology | Panels | Lines | Key Metrics |
| --- | --- | --- | --- | --- |
| #290 | ASP.NET Core (OTLP) | ~60 | 9K | Request duration, exception rate, GC pressure |
| #291 | Istio Service Mesh | ~70 | 11K | Proxy latency, circuit breaker trips, mTLS status |
| #295 | CloudNativePG | 87 | 13K | Replication lag, WAL size, connection pool |
| #296 | cert-manager | 90 | 13K | Certificate expiry, ACME failures, renewal rate |
| #298 | Kafka Server | 138 | 21K | Broker health, partition skew, consumer lag |
| #299 | Kong Gateway | 97 | 14K | Upstream latency, rate limiting hits, plugin errors |
| #300 | AWS MSK | 103 | 18K | Cluster throughput, disk usage, Zookeeper sync |
| #301 | OTel Collector | 98 | 15K | Pipeline errors, dropped spans, exporter queue |
| #302 | PostgreSQL (fix) | 2 | 200 | Idle + active connection split |
| #303 | Keycloak | 56 | 10K | Login failures, token issuance rate, realm sessions |
| #304 | Apache Spark | 63 | 10K | Executor memory, shuffle spill, stage duration |

Total: 864 panels, 134K lines of structured JSON.

The Architecture Pattern

Every dashboard follows the same hierarchy — mapped to how SREs actually triage incidents:

1. Overview Row (4-6 value panels)
   → Is the thing alive? Error rate? Throughput?

2. Resource Utilization (6-8 graph panels)
   → CPU, memory, disk, network, connections

3. Business/Domain Metrics (8-15 panels per section)
   → Technology-specific deep dives

4. Error Analysis (4-8 panels)
   → Error breakdown by type, rate over time

5. Performance Deep-Dive (6-10 panels)
   → Latency percentiles (p50/p95/p99), slow queries

When a page fires at 3 AM, an SRE looks at the overview first. Consistent layout across all dashboards means muscle memory transfers.
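Because the section order is fixed, grid positions can be generated instead of hand-placed. A sketch of the idea (the `pack_section` helper and the 6x2 default panel size are my own illustration, not a SigNoz requirement):

```python
# Stack panels into the 12-column grid, section by section, so every
# dashboard shares the same top-to-bottom triage order.
def pack_section(panel_ids, start_y, w=6, h=2, columns=12):
    """Place panels left-to-right, wrapping to a new row when full."""
    layout, x, y = [], 0, start_y
    for pid in panel_ids:
        if x + w > columns:        # row is full: wrap to the next one
            x, y = 0, y + h
        layout.append({"id": pid, "x": x, "y": y, "w": w, "h": h})
        x += w
    next_y = y + h                 # where the next section starts
    return layout, next_y
```

Feed each section's panel ids through in order (overview first, then resources, and so on), threading `next_y` from one call into the next, and the layout array writes itself.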

Building a 138-Panel Kafka Dashboard

The Kafka dashboard (PR #298) was the most complex:

Cluster Health Overview

kafka_controller_active_controller_count     → value (should be 1)
kafka_server_broker_count                    → value
kafka_server_under_replicated_partitions     → value (alert if > 0)
kafka_network_request_rate                   → graph

Broker Performance

kafka_server_bytes_in_per_sec                → per-broker graph
kafka_server_bytes_out_per_sec               → per-broker graph
kafka_network_request_latency_ms{quantile}   → p50/p95/p99 graph

Partition Health

kafka_cluster_partition_count                → per-topic table
kafka_cluster_partition_under_min_isr        → alert panel
kafka_log_log_size                           → per-partition heatmap
kafka_controller_leader_election_rate        → graph (spikes = instability)

5 PromQL Patterns That Cover 80% of Dashboards

After 12 dashboards, these patterns handle almost everything:

Rate of change:

rate(metric_total{label="value"}[5m])

Current gauge with filtering:

metric_gauge{namespace=~"$namespace", pod=~"$pod"}

Percentile from histogram:

histogram_quantile(0.99, rate(metric_bucket[5m]))

Top-K for tables:

topk(10, sum by (label) (rate(metric_total[5m])))

Success ratio:

sum(rate(success_total[5m])) / sum(rate(requests_total[5m])) * 100
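Since these five shapes recur across every dashboard, it helps to keep them as templates and fill in concrete names per panel. A sketch (the pattern names and `build_query` helper are mine, not SigNoz API):

```python
# The recurring PromQL shapes as fill-in templates.
PATTERNS = {
    "rate":     'rate({metric}_total[{window}])',
    "gauge":    '{metric}{{{selector}}}',
    "quantile": 'histogram_quantile({q}, rate({metric}_bucket[{window}]))',
    "topk":     'topk({k}, sum by ({label}) (rate({metric}_total[{window}])))',
    "success":  'sum(rate({ok}_total[{window}])) / sum(rate({all}_total[{window}])) * 100',
}

def build_query(pattern: str, **params) -> str:
    """Fill one of the templates with concrete metric/label names."""
    return PATTERNS[pattern].format(**params)
```

For example, `build_query("quantile", q=0.99, metric="kafka_network_request_latency_ms", window="5m")` produces the p99 latency query used in the Kafka dashboard's broker section.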

Tooling That Makes This Fast

The repetitive structure of dashboard JSON is where AI-assisted development shines:

  • Dashboard Builder Skill — Generates SigNoz-compatible JSON from a metrics specification. The 138-panel Kafka dashboard started as a metrics spec that the skill expanded into full JSON with proper panel types, grid positions, and PromQL queries.

  • API Connector Skill — Handles integration when dashboards ingest metrics from non-standard sources (like AWS MSK CloudWatch bridged to Prometheus).

The human part — deciding WHICH metrics matter — is where domain knowledge wins. The machine part — generating 21K lines of correctly-structured JSON — is where tooling pays for itself.
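The "spec in, JSON out" workflow can be sketched in a few lines. This is only an illustration of the idea, not the actual Dashboard Builder Skill's implementation; `expand_spec` and the tuple format are my own assumptions, with fields matching the example schema earlier:

```python
import uuid

def expand_spec(spec_rows):
    """Turn (title, panel_type, promql) rows into widget entries."""
    widgets = []
    for title, panel_type, promql in spec_rows:
        widgets.append({
            "id": str(uuid.uuid4()),   # fresh id, later paired with a layout entry
            "title": title,
            "panelTypes": panel_type,
            "queryData": {"queryType": "promql", "promQL": promql},
        })
    return widgets
```

A 138-panel dashboard then reduces, on the human side, to a 138-row spec of titles, panel types, and queries.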

Lessons from 12 Dashboards

1. Panel count signals quality. My dashboards have 56-138 panels vs competitors' 20-40. Maintainers notice.

2. Descriptions > titles. Every panel has a description explaining what the metric means and when to worry.

3. Zero-competition issues exist. Keycloak and Spark had zero competing PRs.

4. Fix PRs build trust. Small 2-panel fixes show maintainers you understand the codebase.

5. Consistent schema = fast reviews. Predictable structure helps maintainers review faster.

Getting Started

  1. Fork the SigNoz dashboards repo and study existing templates
  2. Pick one technology you know well
  3. Start with the overview row — 4-6 value panels answering "is it healthy?"
  4. Expand section by section
  5. Validate PromQL queries against actual metric names

The Dashboard Builder handles the JSON scaffolding so you focus on metric selection and layout.
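Step 5 is easy to automate: pull the base metric names out of your queries and diff them against the names your collector actually exposes (Prometheus lists them at `/api/v1/label/__name__/values`). A rough sketch, assuming queries that use a label selector or range vector; `unknown_metrics` is my own helper:

```python
import re

# An identifier immediately followed by "{" or "[" is, in a typical
# PromQL expression, the base metric name.
METRIC_RE = re.compile(r'\b([a-zA-Z_:][a-zA-Z0-9_:]*)\s*[{\[]')

def unknown_metrics(queries, scraped_names):
    """Return metric names used in queries but missing from the scrape."""
    found = {name for q in queries for name in METRIC_RE.findall(q)}
    return sorted(found - set(scraped_names))
```

Anything this returns is a query that will render an empty panel in production, which is exactly the kind of mistake reviewers catch.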

If you're into security tooling, check out our Security Scanner Skill — it scans codebases for vulnerabilities using the same systematic approach.
