Bartłomiej Danek

Originally published at bard.sh

Beyond Slack Analytics: Building Custom Engagement Metrics with Webhooks, Prometheus, and Grafana

TL;DR: Slack's native analytics won't tell you which threads die, which requests get ignored, or how your team's collaboration patterns evolve. I built a Go webhook handler that captures every reaction, reply, and thread via Slack's Events API, exports it to Prometheus, and visualizes it in Grafana - giving me metrics Slack never will.

The Wake-Up Call

It started with a post-incident review. We'd just recovered from a 3-hour outage, and I was asked the usual questions: How fast did we respond? Which channels were involved? Were there warning signs in our internal Slack discussions before things broke?

I opened Slack's analytics dashboard. Member counts. Message volume. Channel growth. Vanity metrics that told me nothing about the incident or our response patterns.

I couldn't answer basic questions: How many infrastructure requests were sitting unresolved in our #platform-requests channel? Who was drowning in context-switching across too many active threads? Had we missed early warning signs buried in reaction patterns on code review requests?

Worse, I realized I was guessing about team health. I'd see a busy Slack day and assume we were productive. I'd see green checkmark reactions and assume requests were getting fulfilled. But I had no data - just gut feel and fragmented chat history.

Then came the compliance gap. A terraform apply failed in staging, and someone replied with "skipping, will fix later" - then applied the change manually to unblock themselves. It showed up as a casual thread reply with a 😱 reaction from a teammate, buried under 30 messages. No alert, no metric, no visibility - until I went looking for it during the post-mortem and found the octopus reaction someone had wisely added to flag the IaC violation.

That's when I realized: Slack is where my team actually works, but Slack's analytics treat it like a broadcast channel instead of a workflow engine. Every reaction, every thread, every reply is a data point. I just wasn't collecting them.

So I built something to fix that.

Why Slack Won't Tell You What You Need to Know

Slack gives you member counts, message volume, and channel growth. That's it. If you're running a platform engineering team, a DevOps org, or an internal support channel, those numbers are vanity metrics.

What you actually need to know:

  • Which threads get resolved vs. abandoned?
  • How fast does the team respond to infrastructure requests?
  • Which code review requests get ignored (and why)?
  • Are people actually engaging with your announcements?

Slack doesn't expose this. But Slack's event subscriptions do - if you know how to catch them.

Architecture: How I Capture What Slack Hides

The system has three layers: Slack events → Go webhook handler → Prometheus + Grafana.

(Figure: Architecture Overview)

The Webhook Handler

Built in Go 1.26 with Gin, the handler listens at POST /api/slack/events. Every event goes through this middleware stack:

| Middleware | Purpose | Notes |
|---|---|---|
| Recovery | Panic recovery | |
| RequestID | Unique ID per request | |
| GinLogger | Structured JSON logging | |
| RequestSizeLimit | Reject bodies >1MB | |
| RateLimiter | 10 req/s per IP, burst 20 | Handles Slack's event batching |
| SlackVerification | HMAC-SHA256 signature check | |

The burst 20 setting is critical - Slack's Events API sometimes batches multiple events into a short window, so a single IP can legitimately spike above 10 req/s. The burst lets those spikes through without dropping legitimate events.
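A minimal sketch of that limiter, assuming a per-IP map guarded by a mutex (the package layout and names are illustrative, not the exact production code):

// internal/middleware/ratelimit.go (illustrative sketch)
package middleware

import (
    "net/http"
    "sync"

    "github.com/gin-gonic/gin"
    "golang.org/x/time/rate"
)

// RateLimiter allows 10 req/s per client IP with a burst of 20, which
// absorbs Slack's batched event spikes without rejecting them.
func RateLimiter() gin.HandlerFunc {
    var mu sync.Mutex
    limiters := make(map[string]*rate.Limiter)

    return func(c *gin.Context) {
        ip := c.ClientIP()

        mu.Lock()
        lim, ok := limiters[ip]
        if !ok {
            lim = rate.NewLimiter(rate.Limit(10), 20)
            limiters[ip] = lim
        }
        mu.Unlock()

        if !lim.Allow() {
            c.AbortWithStatus(http.StatusTooManyRequests)
            return
        }
        c.Next()
    }
}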

The three event types we care about:

1. message events - catch new threads and replies:

// internal/handler/slack.go (simplified)
if event.ThreadTs == "" || event.ThreadTs == event.Ts {
    // Top-level message → new thread
    metrics.RecordNewThread(ctx, channel, event.Ts)
} else {
    // Reply in thread
    metrics.RecordThreadReply(ctx, channel, event.ThreadTs, event.User)
}

2. reaction_added / reaction_removed - the signal layer:

// Reactions are stored in Redis + recorded as metrics
metrics.RecordThreadReaction(ctx, channel, event.Item.Ts, event.User, event.Reaction)
metrics.RecordUserEngagement(ctx, channel, event.Item.Ts, event.User)

3. url_verification - Slack's challenge-response during app setup.
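Handling the challenge is straightforward - here's a sketch with a minimal inline struct (Slack sends {"type":"url_verification","challenge":"..."} once when you save the Request URL and expects the challenge echoed back; in the real handler you'd decode the body once into a struct that carries these fields alongside the event):

// Early in the handler, before event processing: echo Slack's challenge
// to prove ownership of the endpoint. The inline struct is illustrative.
var payload struct {
    Type      string `json:"type"`
    Challenge string `json:"challenge"`
}
if err := c.ShouldBindJSON(&payload); err == nil && payload.Type == "url_verification" {
    c.JSON(200, gin.H{"challenge": payload.Challenge})
    return
}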

The Slack Retry Problem

Slack expects a 200 OK within 3 seconds. If your handler is busy writing to Redis or the Prometheus exporter is slow, Slack will retry the event - and you'll get duplicate counts in your metrics.

The fix: acknowledge first, process asynchronously. Return 200 OK immediately, then handle the event in a goroutine:

// internal/handler/slack.go (simplified)
func (h *Handler) HandleEvent(c *gin.Context) {
    var event SlackEvent
    if err := c.ShouldBindJSON(&event); err != nil {
        c.JSON(400, gin.H{"error": err.Error()})
        return
    }

    // Acknowledge immediately - don't block on Redis/Prometheus
    c.JSON(200, gin.H{"ok": true})

    // Process in background
    go func() {
        ctx := context.Background()
        if event.ThreadTs == "" || event.ThreadTs == event.Ts {
            h.metrics.RecordNewThread(ctx, event.Channel, event.Ts)
        } else {
            h.metrics.RecordThreadReply(ctx, event.Channel, event.ThreadTs, event.User)
        }
    }()
}

This prevents Slack's retry mechanism from creating duplicate metric increments.
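Slack also marks redeliveries with X-Slack-Retry-Num and X-Slack-Retry-Reason headers. A retry can mean your ACK was slow - or that the first delivery never arrived - so skipping retries outright risks losing events. A safer sketch is to surface them and let the idempotency check (see Event Deduplication below) decide:

// Slack sets X-Slack-Retry-Num on redelivery attempts; log it so retry
// storms are visible, then let the Redis idempotency key decide whether
// the event was already processed.
if n := c.GetHeader("X-Slack-Retry-Num"); n != "" {
    log.Printf("slack retry #%s (reason: %s)", n, c.GetHeader("X-Slack-Retry-Reason"))
}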

Why Reactions Are the Secret Signal

I don't just count messages. I track reactions as semantic signals:

| Emoji | Reaction Name | Meaning | Metric Label |
|---|---|---|---|
| ✅ | white_check_mark | Request completed | reaction="white_check_mark" |
| 😱 | scream | Multiple active threads per person (overload) | reaction="scream" |
| 🚫 | no_entry | Irrelevant code review request | reaction="no_entry" |
| 🔁 | repeat | Repeating / duplicate request | reaction="repeat" |
| ❌ | x | Irrelevant request | reaction="x" |
| ✔️ | done | Completed (no PR review needed) | reaction="done" |
| 🐙 | octopus | Infrastructure-as-Code violation | reaction="octopus" |

This turns Slack into a structured workflow tool - reactions become queryable metrics.

Getting the Team on Board (Social Engineering)

This only works if people actually use the reactions. I didn't mandate it - instead, I led by example: I started reacting to threads with octopus when I saw IaC violations, and with scream when I was drowning in threads. Within a week, the team adopted it naturally. The key is making it feel like a helpful shorthand, not a tracking mechanism. If the team doesn't use the emoji, your metrics are useless - so make it useful for them first.

Metric Definitions and Component Mapping

I define five OpenTelemetry counter metrics in internal/metrics/slack.go. Because the metric names already end in _total, the Prometheus exporter appends a second _total, producing the double suffix you'll see in some PromQL queries below - you may want to fix that in your OTEL Agent.

Metric-to-Component Mapping

| Code Name | Labels | Triggered By |
|---|---|---|
| webhook_requests_total | channel, status | Every incoming webhook |
| threads_total | channel | New top-level message |
| thread_replies_total | channel, user_id | Reply in thread |
| thread_reactions_total | channel, user_id, reaction | Reaction added |
| user_engagement_total | channel, user_id | Reply OR reaction |
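Here's a sketch of how one of these counters is wired with the OpenTelemetry Go SDK - the file path matches internal/metrics/slack.go from the post, but the names and structure are illustrative, and it assumes a MeterProvider with the Prometheus exporter is already registered globally:

// internal/metrics/slack.go (illustrative sketch)
package metrics

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

type SlackMetrics struct {
    threads metric.Int64Counter
}

func NewSlackMetrics() (*SlackMetrics, error) {
    meter := otel.Meter("slack-webhook")

    // Named threads_total in code; the Prometheus exporter appends another
    // _total, which is where the double suffix in PromQL comes from.
    threads, err := meter.Int64Counter("threads_total",
        metric.WithDescription("New top-level Slack messages"))
    if err != nil {
        return nil, err
    }
    return &SlackMetrics{threads: threads}, nil
}

func (m *SlackMetrics) RecordNewThread(ctx context.Context, channel, ts string) {
    // Only the channel becomes a label; the thread timestamp deliberately
    // stays out of Prometheus (see the cardinality section below).
    m.threads.Add(ctx, 1, metric.WithAttributes(attribute.String("channel", channel)))
}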

The Prometheus Cardinality Bomb

If you're tempted to add thread_id as a label - don't. In my original POC, I made this mistake. In Prometheus, every unique combination of label values creates a new time series. With thousands of Slack threads (each with a unique thread_id), you're creating thousands of time series. This is a cardinality bomb that will explode your Prometheus memory usage and can crash the instance.

The fix: Keep Prometheus for aggregate channel-level counters only. Thread-level detail belongs in Redis (which I'm already using) or a SQL database, not in Prometheus labels. If you want the "Longest Thread" feature, query Redis directly - don't pollute your metrics with high-cardinality labels. Prometheus is for metrics, not event logging.

Key Performance Indicators

(Figure: Dashboard KPIs)

The dashboard is organized into four insight layers, each answering a different question about how your Slack channels actually operate:

Volume & Throughput - Are we drowning or cruising?
Total Threads shows raw request load per day, making it obvious which days are release days, incident spikes, or quiet periods. Active Threads (threads with at least one reply) reveals engagement depth - a high thread count with low reply rate means people are posting but not discussing, a signal of announcements drowning out conversation.

Resolution Signals - Are requests actually getting done?
The green panels tell the completion story. white_check_mark reactions track formal completions - infrastructure requests fulfilled, reviews done. done reactions capture quick wins that never needed a PR. Meanwhile, x reactions surface canceled or irrelevant requests, helping you spot scope creep or misrouted asks. Together, they give you a resolution rate without ever asking anyone to fill out a form.
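For example, a rough resolution-rate panel can divide completions by new threads over the dashboard range - a sketch using the metric names above (add the extra _total suffix if you haven't stripped it at the exporter):

sum(increase(thread_reactions_total{channel=~"$channel_filter", reaction="white_check_mark"}[$__range]))
/
sum(increase(threads_total{channel=~"$channel_filter"}[$__range]))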

Quality & Friction - Where is the process breaking?
The red and yellow panels are your early warning system. A spike in scream reactions means people are drowning in threads - too many active requests per person. no_entry on code reviews exposes recurring low-quality submissions from specific contributors. repeat reactions highlight documentation gaps - the same question asked three times means the answer isn't findable. octopus flags manual infrastructure changes that bypass your IaC pipeline, a compliance risk you'd otherwise miss.

Time-Based Insights - When is the team really working?
The blue panels surface patterns hidden by aggregate daily stats. After-Hours Work quantifies overtime - if 30% of threads start after 17:00, you have an on-call problem, not a "dedicated team" problem. Longest Thread with a clickable Slack link lets you jump directly to the most contentious discussion - usually a design debate or a stuck incident. Busiest Day aggregates threads by date to reveal your actual rhythm: Tuesday releases? Friday incident spikes? The data tells the story.

Business-Hour Awareness

One panel tracks requests outside 10:00–17:00 (browser time):

sum_over_time(
  (
    sum(
      increase(threads_total{channel=~"$channel_filter"}[5m])
      and on()
        ((hour() < 10) or (hour() >= 17))
    )
  )[$__range:5m]
)

This tells us: is the team working after hours? If so, why?

Note: The hour() function in Prometheus works in UTC regardless of where your Prometheus server runs. If your team is in a different timezone (e.g., Poland), shift the boundaries to match: 10:00 = 08:00 UTC and 17:00 = 15:00 UTC during CEST (one hour less under winter CET). Adjust the < 10 / >= 17 values to your team's UTC offset.

Longest Thread Detection

Since I removed thread_id from Prometheus labels (see The Prometheus Cardinality Bomb), the "Longest Thread" feature works differently: I pull thread IDs directly from Redis, where high-cardinality data lives safely. The Grafana dashboard queries Redis for the top threads by reply count, then overlays Prometheus aggregate trends for context.

To implement this, use a Grafana Redis datasource plugin (or an Infinity datasource backed by a small API). Note that LRANGE doesn't accept wildcards, so first enumerate the reply lists with SCAN, then measure each with LLEN:

SCAN 0 MATCH replies:{channel}:* COUNT 100
LLEN replies:{channel}:{thread_ts}

Then match the longest list to its channel and thread_ts. The clickable Slack link is built from these Redis keys:

https://yourworkspace.slack.com/archives/{channel}/p{thread_ts_without_dot}
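If you'd rather compute this in Go (for example behind a small endpoint the Infinity datasource can call), here's a sketch with go-redis - the key layout matches the Redis tables later in the post, but the helper itself is hypothetical:

import (
    "context"
    "strings"

    "github.com/redis/go-redis/v9"
)

// LongestThread scans replies:{channel}:* keys, finds the list with the
// most entries, and returns its thread_ts plus a Slack permalink.
func LongestThread(ctx context.Context, rdb *redis.Client, workspace, channel string) (string, string, error) {
    var cursor uint64
    var bestKey string
    var bestLen int64
    for {
        keys, next, err := rdb.Scan(ctx, cursor, "replies:"+channel+":*", 100).Result()
        if err != nil {
            return "", "", err
        }
        for _, key := range keys {
            n, err := rdb.LLen(ctx, key).Result()
            if err != nil {
                return "", "", err
            }
            if n > bestLen {
                bestLen, bestKey = n, key
            }
        }
        if next == 0 {
            break
        }
        cursor = next
    }
    // Key layout is replies:{channel}:{thread_ts}; the permalink drops the
    // dot from the timestamp (1712345678.123456 → p1712345678123456).
    threadTs := strings.TrimPrefix(bestKey, "replies:"+channel+":")
    link := "https://" + workspace + ".slack.com/archives/" + channel +
        "/p" + strings.ReplaceAll(threadTs, ".", "")
    return threadTs, link, nil
}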

Alternatively, if you keep thread_id in Prometheus for a small team (few threads), the PromQL would be:

topk(
  1,
  sum by (thread_id, channel) (
    max_over_time(
      thread_replies_total_total{
        channel=~"$channel_filter"
      }[$__range]
    )
  )
)

With a Data link override pointing to https://yourworkspace.slack.com/archives/${channel}/p${__cell}.

Busiest Day Detection

sum(
  increase(threads_total{channel=~"$channel_filter"}[1d])
)

Grafana transformations convert the timestamp to YYYY-MM-DD format and sort by value.

Dependencies: What We Chose and Why

go.mod (direct dependencies)
├── github.com/gin-gonic/gin v1.11.0         # HTTP routing
├── go.opentelemetry.io/otel v1.40.0         # Metrics SDK
├── go.opentelemetry.io/otel/exporters/prometheus v0.62.0  # Prometheus export
├── go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.40.0
├── github.com/prometheus/client_golang v1.23.2  # Prometheus client
├── github.com/redis/go-redis/v9 v9.17.3     # Redis client (cluster support)
├── golang.org/x/time v0.14.0                # Rate limiting
└── github.com/alicebob/miniredis/v2 v2.36.1 # In-memory Redis for tests

Key Trade-offs

| Choice | Why | Trade-off |
|---|---|---|
| OpenTelemetry over raw Prometheus client | Future-proof, OTLP export option | More boilerplate setup |
| Redis for thread storage | Fast lookups, TTL support | Extra dependency, network hop |
| Gin over stdlib | Fast dev, built-in middleware | Framework lock-in |
| Counter metrics only | Simple, append-only | No histograms for latency (yet) |
| Fly.io deployment | Simple, cheap, EU region (waw) | Less control than K8s |

What I Skipped

  • Histograms: I don't track webhook processing latency. The handler is fast enough (<10ms) that it hasn't mattered yet.
  • Gauge metrics: No "current active threads" gauge. I rely on increase() queries instead.
  • Thread cancellation: No separate metric needed - canceled threads are tracked via thread_reactions_total{reaction="x"} (the ❌ emoji). The Grafana dashboard queries reactions with reaction="x" to surface irrelevant/canceled requests.

Redis Data Structures

The handler stores thread context in Redis for enrichment (not just metrics):

| Key Pattern | Type | Content |
|---|---|---|
| reactions:{channel}:{ts} | Hash | Field: {reaction}:{user}, Value: JSON with timestamps |
| messages:{channel}:{ts} | String | JSON: user, text, ts, channel |
| replies:{channel}:{thread_ts} | List | JSON entries for each reply |
| processed_events:{event_id} | String | TTL 5 min - idempotency key for Slack retries |
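For context, here's a sketch of how a reply might land in these structures - key names follow the table above, but the production write path may differ:

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// RecordReplyContext appends a reply to its thread's Redis list, keeping
// high-cardinality thread detail out of Prometheus. SlackEvent is the
// handler's event struct from the earlier snippets.
func RecordReplyContext(ctx context.Context, rdb *redis.Client, ev SlackEvent) error {
    payload, err := json.Marshal(map[string]string{
        "user": ev.User, "text": ev.Text, "ts": ev.Ts, "channel": ev.Channel,
    })
    if err != nil {
        return err
    }
    key := fmt.Sprintf("replies:%s:%s", ev.Channel, ev.ThreadTs)
    return rdb.RPush(ctx, key, payload).Err()
}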

Event Deduplication

Slack retries events if it doesn't get a 200 OK fast enough. To prevent double-counting metrics, the handler uses the event_id as an idempotency key with SETNX:

// Before processing, check if we already handled this event.
key := fmt.Sprintf("processed_events:%s", slackEvent.EventID)
// SetNX returns true if the key was set (new event), false if it already existed.
processed, err := rdb.SetNX(ctx, key, "1", 5*time.Minute).Result()
if err != nil {
    // On a Redis error, fail open: a rare duplicate count beats a dropped event.
    processed = true
}
if !processed {
    // Event already processed - skip to prevent double-counting.
    return
}

This ensures that even if Slack sends the same event twice during a network blip, your metrics are only incremented once.

Lessons Learned

1. Reactions Are Better Than Thread Replies for Signals

A reply requires typing. A reaction is one click. I get 3× more signal from reactions than replies because the friction is lower.

2. Grafana Transformations Are Powerful But Fragile

The "Longest Thread" panel uses 4 transformations: sortBy, limit, convertFieldType, formatTime, and renameByRegex. One breaks, the panel dies. Keep transformations minimal where possible.

3. Rate Limiting Is Essential

Slack's event replay will flood you if your handler was down. Our rate limiter (10 req/s per IP) prevents a thundering herd on restart.

Wrapping Up: What You Get Out of This

Building custom Slack metrics isn't about dashboards for the sake of dashboards. It's about making the invisible visible. Here's what this setup gives you after a few weeks of data:

You stop guessing about team health. A quick glance at the KPI row tells you: are we keeping up (green), drowning in context-switching (scream spikes), or fielding the same question repeatedly (repeat reactions)? These aren't vanity metrics - they're leading indicators of burnout, documentation gaps, and process breakdowns.

You catch compliance and quality issues early. That octopus reaction on a thread? It's someone manually changing infrastructure instead of using Terraform. The no_entry on a code review? A recurring contributor quality problem you can address directly. Without this signal, those issues stay hidden in chat history until they become outages.

You understand your real working patterns. The after-hours panel quantifies overtime without timesheets. The longest thread view surfaces your actual bottlenecks - the design debates and stuck incidents that deserve retrospectives. The busiest day chart reveals whether your release rhythm is working or just creating spikes.

You get all of this without asking anyone to change behavior. No new forms, no status updates, no "please label your threads." People just use Slack the way they always have - the reactions and replies they'd make anyway become the data. That's the real win: observability that emerges from existing workflows, not another process to maintain.

The stack is lightweight: a Go handler (~200 lines), Redis for context, Prometheus for storage, Grafana for visualization. But the output is a window into your team's actual collaboration patterns - the stuff Slack's analytics won't show you, and the stuff you need if you want to lead a team effectively.


One final note: the working POC took me a few hours to build - without AI assistance. Not because I'm unusually fast, but because this isn't actually hard when you understand how the pieces fit together. Slack gives you webhooks. Prometheus takes counters. Grafana draws panels. The real skill isn't writing the code - it's recognizing that the data you need is already flowing through your tools, you just haven't built the bridge between them yet. Sometimes the best observability projects are the ones where you stop waiting for a vendor feature and wire up the integration yourself.


Originally published at https://bard.sh/posts/slack-metrics/
