manja316
I Shipped 12 SigNoz Dashboard PRs in 4 Days (864 Panels, 134K Lines)

I shipped 12 SigNoz dashboard PRs in 4 days — totaling 864 panels and 134K lines of JSON. Here's the technical architecture that made it possible.

Why SigNoz?

SigNoz is an open-source observability platform (OpenTelemetry-native) that's growing fast. Their dashboard template system uses a specific JSON schema that, once you understand it, becomes a factory for monitoring configs.

The key insight: SigNoz rewards multiple contributors per dashboard issue. Unlike typical bounties where first-to-merge wins, quality submissions all get paid. This changes the economics entirely.

The SigNoz Dashboard JSON Schema

Every SigNoz dashboard follows this structure:

{
  "title": "Kafka Server Monitoring",
  "description": "Comprehensive monitoring for Apache Kafka brokers",
  "tags": ["kafka", "messaging", "prometheus"],
  "layout": [
    {
      "id": "panel-uuid",
      "x": 0, "y": 0, "w": 6, "h": 2,
      "panelTypes": "graph"
    }
  ],
  "widgets": [
    {
      "id": "panel-uuid",
      "title": "Broker Active Controller Count",
      "description": "Should always be exactly 1 in a healthy cluster",
      "panelTypes": "value",
      "queryData": {
        "queryType": "builder",
        "promQL": "kafka_controller_active_controller_count"
      }
    }
  ]
}

The layout array controls grid positioning (12-column grid). The widgets array holds the actual panel definitions. Every widget needs a matching layout entry.
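That widget-to-layout pairing is the most common thing to get wrong by hand. A minimal sketch of a pre-submission check (the `validate_dashboard` helper is my own, not part of SigNoz; field names follow the example schema above):

```python
# Check that widgets and layout entries pair up by id, and that no panel
# overflows the 12-column grid. Field names match the example schema.
def validate_dashboard(dashboard: dict) -> list[str]:
    errors = []
    layout_ids = {entry["id"] for entry in dashboard.get("layout", [])}
    widget_ids = {w["id"] for w in dashboard.get("widgets", [])}

    for missing in sorted(widget_ids - layout_ids):
        errors.append(f"widget {missing} has no layout entry")
    for orphan in sorted(layout_ids - widget_ids):
        errors.append(f"layout entry {orphan} has no widget")

    for entry in dashboard.get("layout", []):
        if entry["x"] + entry["w"] > 12:  # 12-column grid
            errors.append(f"panel {entry['id']} overflows the grid")
    return errors
```

Run it over every dashboard file before opening the PR; an empty list means the ids and grid positions are consistent.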

My 12-Dashboard Sprint

| PR | Technology | Panels | Lines | Key Metrics |
| --- | --- | --- | --- | --- |
| #290 | ASP.NET Core (OTLP) | ~60 | 9K | Request duration, exception rate, GC pressure |
| #291 | Istio Service Mesh | ~70 | 11K | Proxy latency, circuit breaker trips, mTLS status |
| #295 | CloudNativePG | 87 | 13K | Replication lag, WAL size, connection pool |
| #296 | cert-manager | 90 | 13K | Certificate expiry, ACME failures, renewal rate |
| #298 | Kafka Server | 138 | 21K | Broker health, partition skew, consumer lag |
| #299 | Kong Gateway | 97 | 14K | Upstream latency, rate limiting hits, plugin errors |
| #300 | AWS MSK | 103 | 18K | Cluster throughput, disk usage, Zookeeper sync |
| #301 | OTel Collector | 98 | 15K | Pipeline errors, dropped spans, exporter queue |
| #302 | PostgreSQL (fix) | 2 | 200 | Idle + active connection split |
| #303 | Keycloak | 56 | 10K | Login failures, token issuance rate, realm sessions |
| #304 | Apache Spark | 63 | 10K | Executor memory, shuffle spill, stage duration |

Total: 864 panels, 134K lines of structured JSON.

The Architecture Pattern

Every dashboard follows the same hierarchy — mapped to how SREs actually triage incidents:

1. Overview Row (4-6 value panels)
   → Is the thing alive? Error rate? Throughput?

2. Resource Utilization (6-8 graph panels)
   → CPU, memory, disk, network, connections

3. Business/Domain Metrics (8-15 panels per section)
   → Technology-specific deep dives

4. Error Analysis (4-8 panels)
   → Error breakdown by type, rate over time

5. Performance Deep-Dive (6-10 panels)
   → Latency percentiles (p50/p95/p99), slow queries

When a page fires at 3 AM, an SRE looks at the overview first. Consistent layout across all dashboards means muscle memory transfers.
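Because the section order is fixed, grid positions can be generated instead of hand-placed. A sketch of the idea (the `pack_section` helper and the 6x2 default panel size are my own illustration, not a SigNoz requirement):

```python
# Stack panels into the 12-column grid, section by section, so every
# dashboard shares the same top-to-bottom triage order.
def pack_section(panel_ids, start_y, w=6, h=2, columns=12):
    """Place panels left-to-right, wrapping to a new row when full."""
    layout, x, y = [], 0, start_y
    for pid in panel_ids:
        if x + w > columns:        # row is full: wrap to the next one
            x, y = 0, y + h
        layout.append({"id": pid, "x": x, "y": y, "w": w, "h": h})
        x += w
    next_y = y + h                 # where the next section starts
    return layout, next_y
```

Feed each section's panel ids through in order (overview first, then resources, and so on), threading `next_y` from one call into the next, and the layout array writes itself.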

Building a 138-Panel Kafka Dashboard

The Kafka dashboard (PR #298) was the most complex:

Cluster Health Overview

kafka_controller_active_controller_count     → value (should be 1)
kafka_server_broker_count                    → value
kafka_server_under_replicated_partitions     → value (alert if > 0)
kafka_network_request_rate                   → graph

Broker Performance

kafka_server_bytes_in_per_sec                → per-broker graph
kafka_server_bytes_out_per_sec               → per-broker graph
kafka_network_request_latency_ms{quantile}   → p50/p95/p99 graph

Partition Health

kafka_cluster_partition_count                → per-topic table
kafka_cluster_partition_under_min_isr        → alert panel
kafka_log_log_size                           → per-partition heatmap
kafka_controller_leader_election_rate        → graph (spikes = instability)

5 PromQL Patterns That Cover 80% of Dashboards

After 12 dashboards, these patterns handle almost everything:

Rate of change:

rate(metric_total{label="value"}[5m])

Current gauge with filtering:

metric_gauge{namespace=~"$namespace", pod=~"$pod"}

Percentile from histogram:

histogram_quantile(0.99, rate(metric_bucket[5m]))

Top-K for tables:

topk(10, sum by (label) (rate(metric_total[5m])))

Success ratio:

sum(rate(success_total[5m])) / sum(rate(requests_total[5m])) * 100
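Since these five shapes recur across every dashboard, it helps to keep them as templates and fill in concrete names per panel. A sketch (the pattern names and `build_query` helper are mine, not SigNoz API):

```python
# The recurring PromQL shapes as fill-in templates.
PATTERNS = {
    "rate":     'rate({metric}_total[{window}])',
    "gauge":    '{metric}{{{selector}}}',
    "quantile": 'histogram_quantile({q}, rate({metric}_bucket[{window}]))',
    "topk":     'topk({k}, sum by ({label}) (rate({metric}_total[{window}])))',
    "success":  'sum(rate({ok}_total[{window}])) / sum(rate({all}_total[{window}])) * 100',
}

def build_query(pattern: str, **params) -> str:
    """Fill one of the templates with concrete metric/label names."""
    return PATTERNS[pattern].format(**params)
```

For example, `build_query("quantile", q=0.99, metric="kafka_network_request_latency_ms", window="5m")` produces the p99 latency query used in the Kafka dashboard's broker section.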

Tooling That Makes This Fast

The repetitive structure of dashboard JSON is where AI-assisted development shines:

  • Dashboard Builder Skill — Generates SigNoz-compatible JSON from a metrics specification. The 138-panel Kafka dashboard started as a metrics spec that the skill expanded into full JSON with proper panel types, grid positions, and PromQL queries.

  • API Connector Skill — Handles integration when dashboards ingest metrics from non-standard sources (like AWS MSK CloudWatch bridged to Prometheus).

The human part — deciding WHICH metrics matter — is where domain knowledge wins. The machine part — generating 21K lines of correctly-structured JSON — is where tooling pays for itself.
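The "spec in, JSON out" workflow can be sketched in a few lines. This is only an illustration of the idea, not the actual Dashboard Builder Skill's implementation; `expand_spec` and the tuple format are my own assumptions, with fields matching the example schema earlier:

```python
import uuid

def expand_spec(spec_rows):
    """Turn (title, panel_type, promql) rows into widget entries."""
    widgets = []
    for title, panel_type, promql in spec_rows:
        widgets.append({
            "id": str(uuid.uuid4()),   # fresh id, later paired with a layout entry
            "title": title,
            "panelTypes": panel_type,
            "queryData": {"queryType": "promql", "promQL": promql},
        })
    return widgets
```

A 138-panel dashboard then reduces, on the human side, to a 138-row spec of titles, panel types, and queries.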

Lessons from 12 Dashboards

1. Panel count signals quality. My dashboards have 56-138 panels vs competitors' 20-40. Maintainers notice.

2. Descriptions > titles. Every panel has a description explaining what the metric means and when to worry.

3. Zero-competition issues exist. Keycloak and Spark had zero competing PRs.

4. Fix PRs build trust. Small 2-panel fixes show maintainers you understand the codebase.

5. Consistent schema = fast reviews. Predictable structure helps maintainers review faster.

Getting Started

  1. Fork the SigNoz dashboards repo and study existing templates
  2. Pick one technology you know well
  3. Start with the overview row — 4-6 value panels answering "is it healthy?"
  4. Expand section by section
  5. Validate PromQL queries against actual metric names

The Dashboard Builder handles the JSON scaffolding so you focus on metric selection and layout.
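Step 5 is easy to automate: pull the base metric names out of your queries and diff them against the names your collector actually exposes (Prometheus lists them at `/api/v1/label/__name__/values`). A rough sketch, assuming queries that use a label selector or range vector; `unknown_metrics` is my own helper:

```python
import re

# An identifier immediately followed by "{" or "[" is, in a typical
# PromQL expression, the base metric name.
METRIC_RE = re.compile(r'\b([a-zA-Z_:][a-zA-Z0-9_:]*)\s*[{\[]')

def unknown_metrics(queries, scraped_names):
    """Return metric names used in queries but missing from the scrape."""
    found = {name for q in queries for name in METRIC_RE.findall(q)}
    return sorted(found - set(scraped_names))
```

Anything this returns is a query that will render an empty panel in production, which is exactly the kind of mistake reviewers catch.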

If you're into security tooling, check out our Security Scanner Skill — it scans codebases for vulnerabilities using the same systematic approach.
