TL;DR: Slack's native analytics won't tell you which threads die, which requests get ignored, or how your team's collaboration patterns evolve. I built a Go webhook handler that captures every reaction, reply, and thread via Slack's Events API, exports it to Prometheus, and visualizes it in Grafana - giving me metrics Slack never will.
The Wake-Up Call
It started with a post-incident review. We'd just recovered from a 3-hour outage, and I was asked the usual questions: How fast did we respond? Which channels were involved? Were there warning signs in our internal Slack discussions before things broke?
I opened Slack's analytics dashboard. Member counts. Message volume. Channel growth. Vanity metrics that told me nothing about the incident or our response patterns.
I couldn't answer basic questions: How many infrastructure requests were sitting unresolved in our #platform-requests channel? Who was drowning in context-switching across too many active threads? Had we missed early warning signs buried in reaction patterns on code review requests?
Worse, I realized I was guessing about team health. I'd see a busy Slack day and assume we were productive. I'd see green checkmark reactions and assume requests were getting fulfilled. But I had no data - just gut feel and fragmented chat history.
Then came the compliance gap. A terraform apply failed in staging, and someone replied with "skipping, will fix later" - then applied the change manually to unblock themselves. It showed up as a casual thread reply with a 😱 reaction from a teammate, buried under 30 messages. No alert, no metric, no visibility - until I went looking for it during the post-mortem and found the octopus reaction someone had wisely added to flag the IaC violation.
That's when I realized: Slack is where my team actually works, but Slack's analytics treat it like a broadcast channel instead of a workflow engine. Every reaction, every thread, every reply is a data point. I just wasn't collecting them.
So I built something to fix that.
Why Slack Won't Tell You What You Need to Know
Slack gives you member counts, message volume, and channel growth. That's it. If you're running a platform engineering team, a DevOps org, or an internal support channel, those numbers are vanity metrics.
What you actually need to know:
- Which threads get resolved vs. abandoned?
- How fast does the team respond to infrastructure requests?
- Which code review requests get ignored (and why)?
- Are people actually engaging with your announcements?
Slack doesn't expose this. But Slack's event subscriptions do - if you know how to catch them.
Architecture: How I Capture What Slack Hides
The system has three layers: Slack events → Go webhook handler → Prometheus + Grafana.
The Webhook Handler
Built in Go 1.26 with Gin, the handler listens at POST /api/slack/events. Every event goes through this middleware stack:
| Middleware | Purpose | Notes |
|---|---|---|
| Recovery | Panic recovery | |
| RequestID | Unique ID per request | |
| GinLogger | Structured JSON logging | |
| RequestSizeLimit | Reject >1MB | |
| RateLimiter | 10 req/s per IP, burst 20 | Handles Slack's event batching |
| SlackVerification | HMAC-SHA256 signature check | |
The burst 20 setting is critical - Slack's Event API sometimes batches multiple events into a short window, so a single IP can legitimately spike above 10 req/s. The burst allows those spikes without dropping legitimate events.
The three event types we care about:
1. message events - catch new threads and replies:
```go
// internal/handler/slack.go (simplified)
if event.ThreadTs == "" || event.ThreadTs == event.Ts {
	// Top-level message → new thread
	metrics.RecordNewThread(ctx, channel, event.Ts)
} else {
	// Reply in thread
	metrics.RecordThreadReply(ctx, channel, event.ThreadTs, event.User)
}
```
2. reaction_added / reaction_removed - the signal layer:
```go
// Reactions are stored in Redis + recorded as metrics
metrics.RecordThreadReaction(ctx, channel, event.Item.Ts, event.User, event.Reaction)
metrics.RecordUserEngagement(ctx, channel, event.Item.Ts, event.User)
```
3. url_verification - Slack's challenge-response during app setup.
The Slack Retry Problem
Slack expects a 200 OK within 3 seconds. If your handler is busy writing to Redis or the Prometheus exporter is slow, Slack will retry the event - and you'll get duplicate counts in your metrics.
The fix: acknowledge first, process asynchronously. Return 200 OK immediately, then handle the event in a goroutine:
```go
// internal/handler/slack.go (simplified)
func (h *Handler) HandleEvent(c *gin.Context) {
	var event SlackEvent
	if err := c.ShouldBindJSON(&event); err != nil {
		c.JSON(400, gin.H{"error": err.Error()})
		return
	}

	// Acknowledge immediately - don't block on Redis/Prometheus
	c.JSON(200, gin.H{"ok": true})

	// Process in background
	go func() {
		ctx := context.Background()
		if event.ThreadTs == "" || event.ThreadTs == event.Ts {
			h.metrics.RecordNewThread(ctx, event.Channel, event.Ts)
		} else {
			h.metrics.RecordThreadReply(ctx, event.Channel, event.ThreadTs, event.User)
		}
	}()
}
```
This prevents Slack's retry mechanism from creating duplicate metric increments.
Why Reactions Are the Secret Signal
I don't just count messages. I track reactions as semantic signals:
| Emoji | Reaction Name | Meaning | Metric Label |
|---|---|---|---|
| ✅ | white_check_mark | Request completed | reaction="white_check_mark" |
| 😱 | scream | Multiple active threads per person (overload) | reaction="scream" |
| 🚫 | no_entry | Irrelevant code review request | reaction="no_entry" |
| 🔁 | repeat | Repeating / duplicate request | reaction="repeat" |
| ❌ | x | Irrelevant request | reaction="x" |
| ✔️ | done | Completed (no PR review needed) | reaction="done" |
| 🐙 | octopus | Infrastructure-as-Code violation | reaction="octopus" |
This turns Slack into a structured workflow tool - reactions become queryable metrics.
Getting the Team on Board (Social Engineering)
This only works if people actually use the reactions. I didn't mandate it - instead, I led by example: I started reacting to threads with octopus when I saw IaC violations, and with scream when I was drowning in threads. Within a week, the team adopted it naturally. The key is making it feel like a helpful shorthand, not a tracking mechanism. If the team doesn't use the emoji, your metrics are useless - so make it useful for them first.
Metric Definitions and Component Mapping
I define five OpenTelemetry counter metrics in internal/metrics/slack.go. Because the metric names already end in _total and the Prometheus exporter appends its own _total suffix to counters, you'll see a doubled _total_total in some PromQL queries below - you can normalize this in your OTel Collector/agent configuration if it bothers you.
Metric-to-Component Mapping
| Code Name | Labels | Triggered By |
|---|---|---|
| webhook_requests_total | channel, status | Every incoming webhook |
| threads_total | channel | New top-level message |
| thread_replies_total | channel, user_id | Reply in thread |
| thread_reactions_total | channel, user_id, reaction | Reaction added |
| user_engagement_total | channel, user_id | Reply OR reaction |
The Prometheus Cardinality Bomb
If you're tempted to add thread_id as a label - don't. In my original POC, I made this mistake. In Prometheus, every unique combination of label values creates a new time series. With thousands of Slack threads (each with a unique thread_id), you're creating thousands of time series. This is a cardinality bomb that will explode your Prometheus memory usage and can crash the instance.
The fix: Keep Prometheus for aggregate channel-level counters only. Thread-level detail belongs in Redis (which I'm already using) or a SQL database, not in Prometheus labels. If you want the "Longest Thread" feature, query Redis directly - don't pollute your metrics with high-cardinality labels. Prometheus is for metrics, not event logging.
Key Performance Indicators
The dashboard is organized into four insight layers, each answering a different question about how your Slack channels actually operate:
Volume & Throughput - Are we drowning or cruising?
Total Threads shows raw request load per day, making it obvious which days are release days, incident spikes, or quiet periods. Active Threads (threads with at least one reply) reveals engagement depth - a high thread count with low reply rate means people are posting but not discussing, a signal of announcements drowning out conversation.
Resolution Signals - Are requests actually getting done?
The green panels tell the completion story. white_check_mark reactions track formal completions - infrastructure requests fulfilled, reviews done. done reactions capture quick wins that never needed a PR. Meanwhile, x reactions surface canceled or irrelevant requests, helping you spot scope creep or misrouted asks. Together, they give you a resolution rate without ever asking anyone to fill out a form.
Quality & Friction - Where is the process breaking?
The red and yellow panels are your early warning system. A spike in scream reactions means people are drowning in threads - too many active requests per person. no_entry on code reviews exposes recurring low-quality submissions from specific contributors. repeat reactions highlight documentation gaps - the same question asked three times means the answer isn't findable. octopus flags manual infrastructure changes that bypass your IaC pipeline, a compliance risk you'd otherwise miss.
Time-Based Insights - When is the team really working?
The blue panels surface patterns hidden by aggregate daily stats. After-Hours Work quantifies overtime - if 30% of threads start after 17:00, you have an on-call problem, not a "dedicated team" problem. Longest Thread with a clickable Slack link lets you jump directly to the most contentious discussion - usually a design debate or a stuck incident. Busiest Day aggregates threads by date to reveal your actual rhythm: Tuesday releases? Friday incident spikes? The data tells the story.
Business-Hour Awareness
One panel tracks requests outside 10:00–17:00 (browser time):
```
sum_over_time(
  (
    sum(
      increase(threads_total{channel=~"$channel_filter"}[5m])
      and on()
      ((hour() < 10) or (hour() >= 17))
    )
  )[$__range:5m]
)
```
This tells us: is the team working after hours? If so, why?
Note: The hour() function in Prometheus uses UTC. If your team is in a different timezone (e.g., CET/Poland), adjust the offsets: 10:00 CET = 08:00 UTC, 17:00 CET = 15:00 UTC. Adjust the < 10 / >= 17 values based on where your Prometheus server is running.
Longest Thread Detection
Since I removed thread_id from Prometheus labels (see The Prometheus Cardinality Bomb), the "Longest Thread" feature works differently: I pull thread IDs directly from Redis, where high-cardinality data lives safely. The Grafana dashboard queries Redis for the top threads by reply count, then overlays Prometheus aggregate trends for context.
To implement this, use a Grafana Redis datasource plugin (the Infinity datasource also works if you put a small HTTP API in front of Redis). One gotcha: LRANGE takes an exact key and doesn't accept wildcards, so first scan for the reply lists, then measure each one:

```
SCAN 0 MATCH replies:{channel}:* COUNT 100
LLEN replies:{channel}:{thread_ts}
```

Then match the longest list to its channel and thread_ts. The clickable Slack link is built from these Redis keys:
https://yourworkspace.slack.com/archives/{channel}/p{thread_ts_without_dot}
Alternatively, if you keep thread_id in Prometheus for a small team (few threads), the PromQL would be:
```
topk(
  1,
  sum by (thread_id, channel) (
    max_over_time(
      thread_replies_total_total{
        channel=~"$channel_filter"
      }[$__range]
    )
  )
)
```
With a Data link override pointing to https://yourworkspace.slack.com/archives/${channel}/p${__cell}.
Busiest Day Detection
```
sum(
  increase(threads_total{channel=~"$channel_filter"}[1d])
)
```
Grafana transformations convert the timestamp to YYYY-MM-DD format and sort by value.
Dependencies: What We Chose and Why
```
go.mod (direct dependencies)
├── github.com/gin-gonic/gin v1.11.0                                  # HTTP routing
├── go.opentelemetry.io/otel v1.40.0                                  # Metrics SDK
├── go.opentelemetry.io/otel/exporters/prometheus v0.62.0             # Prometheus export
├── go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.40.0
├── github.com/prometheus/client_golang v1.23.2                       # Prometheus client
├── github.com/redis/go-redis/v9 v9.17.3                              # Redis client (cluster support)
├── golang.org/x/time v0.14.0                                         # Rate limiting
└── github.com/alicebob/miniredis/v2 v2.36.1                          # In-memory Redis for tests
```
Key Trade-offs
| Choice | Why | Trade-off |
|---|---|---|
| OpenTelemetry over raw Prometheus client | Future-proof, OTLP export option | More boilerplate setup |
| Redis for thread storage | Fast lookups, TTL support | Extra dependency, network hop |
| Gin over stdlib | Fast dev, built-in middleware | Framework lock-in |
| Counter metrics only | Simple, append-only | No histograms for latency (yet) |
| Fly.io deployment | Simple, cheap, EU region (waw) | Less control than K8s |
What I Skipped
- Histograms: I don't track webhook processing latency. The handler is fast enough (<10ms) that it hasn't mattered yet.
- Gauge metrics: No "current active threads" gauge. I rely on increase() queries instead.
- Thread cancellation: No separate metric needed - canceled threads are tracked via thread_reactions_total{reaction="x"} (the ❌ emoji). The Grafana dashboard queries reactions with reaction="x" to surface irrelevant/canceled requests.
Redis Data Structures
The handler stores thread context in Redis for enrichment (not just metrics):
| Key Pattern | Type | Content |
|---|---|---|
| reactions:{channel}:{ts} | Hash | Field: {reaction}:{user}, Value: JSON with timestamps |
| messages:{channel}:{ts} | String | JSON: user, text, ts, channel |
| replies:{channel}:{thread_ts} | List | JSON entries for each reply |
| processed_events:{event_id} | String | TTL 5min - idempotency key for Slack retries |
Event Deduplication
Slack retries events if it doesn't get a 200 OK fast enough. To prevent double-counting metrics, the handler uses the event_id as an idempotency key with SETNX:
```go
// Before processing, check if we already handled this event
key := fmt.Sprintf("processed_events:%s", slackEvent.EventID)
// SetNX returns true if the key was set (new event), false if it already existed
processed, err := rdb.SetNX(ctx, key, "1", 5*time.Minute).Result()
if err != nil {
	// On a Redis error, fall through and process anyway -
	// a rare duplicate count beats a silently dropped event
	processed = true
}
if !processed {
	// Event already processed - skip to prevent double-counting
	return
}
```
This ensures that even if Slack sends the same event twice during a network blip, your metrics are only incremented once.
Lessons Learned
1. Reactions Are Better Than Thread Replies for Signals
A reply requires typing. A reaction is one click. I get 3× more signal from reactions than replies because the friction is lower.
2. Grafana Transformations Are Powerful But Fragile
The "Longest Thread" panel uses five transformations: sortBy, limit, convertFieldType, formatTime, and renameByRegex. If one breaks, the panel dies. Keep transformations minimal where possible.
3. Rate Limiting Is Essential
Slack's event replay will flood you if your handler was down. Our rate limiter (10 req/s per IP) prevents thundering herd on restart.
Wrapping Up: What You Get Out of This
Building custom Slack metrics isn't about dashboards for the sake of dashboards. It's about making the invisible visible. Here's what this setup gives you after a few weeks of data:
You stop guessing about team health. A quick glance at the KPI row tells you: are we keeping up (green), drowning in context-switching (scream spikes), or fielding the same question repeatedly (repeat reactions)? These aren't vanity metrics - they're leading indicators of burnout, documentation gaps, and process breakdowns.
You catch compliance and quality issues early. That octopus reaction on a thread? It's someone manually changing infrastructure instead of using Terraform. The no_entry on a code review? A recurring contributor quality problem you can address directly. Without this signal, those issues stay hidden in chat history until they become outages.
You understand your real working patterns. The after-hours panel quantifies overtime without timesheets. The longest thread view surfaces your actual bottlenecks - the design debates and stuck incidents that deserve retrospectives. The busiest day chart reveals whether your release rhythm is working or just creating spikes.
You get all of this without asking anyone to change behavior. No new forms, no status updates, no "please label your threads." People just use Slack the way they always have - the reactions and replies they'd make anyway become the data. That's the real win: observability that emerges from existing workflows, not another process to maintain.
The stack is lightweight: a Go handler (~200 lines), Redis for context, Prometheus for storage, Grafana for visualization. But the output is a window into your team's actual collaboration patterns - the stuff Slack's analytics won't show you, and the stuff you need if you want to lead a team effectively.
One final note: the working POC took me a few hours to build - without AI assistance. Not because I'm unusually fast, but because this isn't actually hard when you understand how the pieces fit together. Slack gives you webhooks. Prometheus takes counters. Grafana draws panels. The real skill isn't writing the code - it's recognizing that the data you need is already flowing through your tools, you just haven't built the bridge between them yet. Sometimes the best observability projects are the ones where you stop waiting for a vendor feature and wire up the integration yourself.
Originally published at https://bard.sh/posts/slack-metrics/


