I shipped 12 SigNoz dashboard PRs in 4 days — totaling 864 panels and 134K lines of JSON. Here's the technical architecture that made it possible.
Why SigNoz?
SigNoz is an open-source observability platform (OpenTelemetry-native) that's growing fast. Their dashboard template system uses a specific JSON schema that, once you understand it, becomes a factory for monitoring configs.
The key insight: SigNoz rewards multiple contributors per dashboard issue. Unlike typical bounties where first-to-merge wins, quality submissions all get paid. This changes the economics entirely.
The SigNoz Dashboard JSON Schema
Every SigNoz dashboard follows this structure:
```json
{
  "title": "Kafka Server Monitoring",
  "description": "Comprehensive monitoring for Apache Kafka brokers",
  "tags": ["kafka", "messaging", "prometheus"],
  "layout": [
    {
      "id": "panel-uuid",
      "x": 0, "y": 0, "w": 6, "h": 2
    }
  ],
  "widgets": [
    {
      "id": "panel-uuid",
      "title": "Broker Active Controller Count",
      "description": "Should always be exactly 1 in a healthy cluster",
      "panelTypes": "value",
      "queryData": {
        "queryType": "promql",
        "promQL": "kafka_controller_active_controller_count"
      }
    }
  ]
}
```
The layout array controls grid positioning (12-column grid). The widgets array holds the actual panel definitions. Every widget needs a matching layout entry.
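Because every widget must pair with a layout entry by `id`, a mismatched or orphaned id silently produces a broken grid. A minimal pre-submission check might look like this (the function name and error messages are my own, not part of any SigNoz tooling):

```python
def validate_dashboard(dashboard: dict) -> list[str]:
    """Check that widget ids and layout ids match one-to-one."""
    layout_ids = {entry["id"] for entry in dashboard.get("layout", [])}
    widget_ids = {w["id"] for w in dashboard.get("widgets", [])}
    errors = []
    for wid in sorted(widget_ids - layout_ids):
        errors.append(f"widget {wid} has no layout entry")
    for lid in sorted(layout_ids - widget_ids):
        errors.append(f"layout entry {lid} has no widget")
    return errors

# A widget "b" with no matching layout entry gets flagged:
dashboard = {
    "layout": [{"id": "a", "x": 0, "y": 0, "w": 6, "h": 2}],
    "widgets": [{"id": "a"}, {"id": "b"}],
}
print(validate_dashboard(dashboard))
```

Running this over every file before opening a PR catches the most common hand-editing mistake at 134K lines of JSON.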
My 12-Dashboard Sprint
| PR | Technology | Panels | Lines | Key Metrics |
|---|---|---|---|---|
| #290 | ASP.NET Core (OTLP) | ~60 | 9K | Request duration, exception rate, GC pressure |
| #291 | Istio Service Mesh | ~70 | 11K | Proxy latency, circuit breaker trips, mTLS status |
| #295 | CloudNativePG | 87 | 13K | Replication lag, WAL size, connection pool |
| #296 | cert-manager | 90 | 13K | Certificate expiry, ACME failures, renewal rate |
| #298 | Kafka Server | 138 | 21K | Broker health, partition skew, consumer lag |
| #299 | Kong Gateway | 97 | 14K | Upstream latency, rate limiting hits, plugin errors |
| #300 | AWS MSK | 103 | 18K | Cluster throughput, disk usage, Zookeeper sync |
| #301 | OTel Collector | 98 | 15K | Pipeline errors, dropped spans, exporter queue |
| #302 | PostgreSQL (fix) | 2 | 200 | Idle + active connection split |
| #303 | Keycloak | 56 | 10K | Login failures, token issuance rate, realm sessions |
| #304 | Apache Spark | 63 | 10K | Executor memory, shuffle spill, stage duration |
Total: 864 panels, 134K lines of structured JSON.
The Architecture Pattern
Every dashboard follows the same hierarchy — mapped to how SREs actually triage incidents:
1. Overview Row (4-6 value panels)
→ Is the thing alive? Error rate? Throughput?
2. Resource Utilization (6-8 graph panels)
→ CPU, memory, disk, network, connections
3. Business/Domain Metrics (8-15 panels per section)
→ Technology-specific deep dives
4. Error Analysis (4-8 panels)
→ Error breakdown by type, rate over time
5. Performance Deep-Dive (6-10 panels)
→ Latency percentiles (p50/p95/p99), slow queries
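Since the grid is 12 columns wide and the hierarchy above repeats across every dashboard, panel placement can be computed rather than hand-positioned. A sketch of the idea (helper name and fixed panel height are my own assumptions):

```python
def place_panels(widths, grid_cols=12, height=2):
    """Assign (x, y, w, h) grid positions left-to-right, wrapping at 12 columns."""
    layout, x, y = [], 0, 0
    for i, w in enumerate(widths):
        if x + w > grid_cols:  # row is full, wrap to the next one
            x, y = 0, y + height
        layout.append({"id": f"panel-{i}", "x": x, "y": y, "w": w, "h": height})
        x += w
    return layout

# Overview row: four width-3 value panels fill one 12-column row,
# then two width-6 graphs wrap onto the next row.
print(place_panels([3, 3, 3, 3, 6, 6]))
```

Generating the `layout` array this way keeps every dashboard visually consistent, which is exactly what makes the muscle-memory argument above hold.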
When a page fires at 3 AM, an SRE looks at the overview first. Consistent layout across all dashboards means muscle memory transfers.
Building a 138-Panel Kafka Dashboard
The Kafka dashboard (PR #298) was the most complex:
Cluster Health Overview
- `kafka_controller_active_controller_count` → value (should be 1)
- `kafka_server_broker_count` → value
- `kafka_server_under_replicated_partitions` → value (alert if > 0)
- `kafka_network_request_rate` → graph
Broker Performance
- `kafka_server_bytes_in_per_sec` → per-broker graph
- `kafka_server_bytes_out_per_sec` → per-broker graph
- `kafka_network_request_latency_ms{quantile}` → p50/p95/p99 graph
Partition Health
- `kafka_cluster_partition_count` → per-topic table
- `kafka_cluster_partition_under_min_isr` → alert panel
- `kafka_log_log_size` → per-partition heatmap
- `kafka_controller_leader_election_rate` → graph (spikes = instability)
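A metric list like the one above is exactly the kind of spec that expands mechanically into widget JSON. A minimal sketch of that expansion, following the schema example earlier in the post (the spec tuples and helper name are my own):

```python
# Hypothetical metrics spec: (metric name, panel type, description)
KAFKA_SPEC = [
    ("kafka_controller_active_controller_count", "value", "Should always be exactly 1"),
    ("kafka_server_under_replicated_partitions", "value", "Alert if > 0"),
    ("kafka_network_request_rate", "graph", "Request throughput"),
]

def spec_to_widgets(spec):
    """Expand (metric, panel type, description) tuples into widget dicts."""
    widgets = []
    for i, (metric, panel_type, desc) in enumerate(spec):
        widgets.append({
            "id": f"panel-{i}",
            "title": metric.replace("_", " ").title(),
            "description": desc,
            "panelTypes": panel_type,
            "queryData": {"queryType": "promql", "promQL": metric},
        })
    return widgets

print(spec_to_widgets(KAFKA_SPEC)[0]["title"])
```

Scaling this loop from 3 spec entries to 138 is how a 21K-line dashboard stays tractable.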
5 PromQL Patterns That Cover 80% of Dashboards
After 12 dashboards, these patterns handle almost everything:
Rate of change:
```promql
rate(metric_total{label="value"}[5m])
```
Current gauge with filtering:
```promql
metric_gauge{namespace=~"prod-.*", pod=~"kafka-.*"}
```
Percentile from histogram:
```promql
histogram_quantile(0.99, rate(metric_bucket[5m]))
```
Top-K for tables:
```promql
topk(10, sum by (label) (rate(metric_total[5m])))
```
Success ratio:
```promql
sum(rate(success_total[5m])) / sum(rate(requests_total[5m])) * 100
```
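Because the same five shapes recur, they can live as templates and be stamped out per metric. A sketch of that idea (the template names and `render` helper are mine, not a SigNoz API):

```python
# The five recurring patterns as format-string templates
PATTERNS = {
    "rate": "rate({metric}[{window}])",
    "quantile": "histogram_quantile({q}, rate({metric}_bucket[{window}]))",
    "topk": "topk({k}, sum by ({label}) (rate({metric}[{window}])))",
    "ratio": "sum(rate({ok}[{window}])) / sum(rate({total}[{window}])) * 100",
}

def render(name, **kwargs):
    """Fill a pattern template with concrete metric names and parameters."""
    return PATTERNS[name].format(**kwargs)

print(render("quantile", q=0.99,
             metric="kafka_network_request_latency_ms", window="5m"))
```

With templates, changing a rate window across hundreds of panels is a one-line edit instead of a find-and-replace hunt.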
Tooling That Makes This Fast
The repetitive structure of dashboard JSON is where AI-assisted development shines:
Dashboard Builder Skill — Generates SigNoz-compatible JSON from a metrics specification. The 138-panel Kafka dashboard started as a metrics spec that the skill expanded into full JSON with proper panel types, grid positions, and PromQL queries.
API Connector Skill — Handles integration when dashboards ingest metrics from non-standard sources (like AWS MSK CloudWatch bridged to Prometheus).
The human part — deciding WHICH metrics matter — is where domain knowledge wins. The machine part — generating 21K lines of correctly-structured JSON — is where tooling pays for itself.
Lessons from 12 Dashboards
1. Panel count signals quality. My dashboards have 56-138 panels vs competitors' 20-40. Maintainers notice.
2. Descriptions > titles. Every panel has a description explaining what the metric means and when to worry.
3. Zero-competition issues exist. Keycloak and Spark had zero competing PRs.
4. Fix PRs build trust. Small 2-panel fixes show maintainers you understand the codebase.
5. Consistent schema = fast reviews. Predictable structure helps maintainers review faster.
Getting Started
- Fork the SigNoz dashboards repo and study existing templates
- Pick one technology you know well
- Start with the overview row — 4-6 value panels answering "is it healthy?"
- Expand section by section
- Validate PromQL queries against actual metric names
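The last step, validating queries against real metric names, can be partly automated by scraping an exporter's `/metrics` endpoint and diffing. A minimal sketch (function names and the exporter URL in the comment are my own assumptions; note it compares base metric names only, so histogram `_bucket` suffixes need care):

```python
import re

def parse_metric_names(exposition_text):
    """Extract metric names from Prometheus text exposition format."""
    names = set()
    for line in exposition_text.splitlines():
        if line and not line.startswith("#"):  # skip HELP/TYPE comments
            names.add(re.split(r"[{ ]", line, maxsplit=1)[0])
    return names

def missing_metrics(dashboard_metrics, available):
    """Return dashboard metrics absent from the scraped set."""
    return sorted(set(dashboard_metrics) - available)

# Usage against a live exporter (URL is an assumption):
# import urllib.request
# body = urllib.request.urlopen("http://localhost:9308/metrics").read().decode()
# print(missing_metrics(["kafka_server_broker_count"], parse_metric_names(body)))
```

A dashboard full of syntactically valid PromQL that references metrics the exporter never emits renders as empty panels, which is the fastest way to get a PR bounced.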
The Dashboard Builder handles the JSON scaffolding so you focus on metric selection and layout.
If you're into security tooling, check out our Security Scanner Skill — it scans codebases for vulnerabilities using the same systematic approach.