Postmortem: Debugging a Go 1.23 Deadlock with pprof 1.3 and Grafana 10.4 (2026)
Published: October 14, 2026
Executive Summary
On October 12, 2026, our payment processing microservice (written in Go 1.23) suffered a total outage lasting 47 minutes due to an undetected deadlock in a new transaction reconciliation goroutine. The incident impacted 12% of daily payment volume before resolution. This postmortem details our debugging workflow using pprof 1.3 and Grafana 10.4, root cause analysis, and preventive measures.
Incident Timeline (UTC)
- 14:12: Deployment of v1.23.2 of the payment service, including new reconciliation logic for cross-border transactions.
- 14:18: First customer reports of stuck payment approvals; Grafana 10.4 dashboards show 0% success rate for reconciliation endpoints.
- 14:22: On-call engineers confirm service is unresponsive; all goroutines appear stalled in monitoring snapshots.
- 14:35: pprof 1.3 mutex profile identifies a contended lock in the new reconciliation package.
- 14:59: Deadlock root cause confirmed: circular wait between two mutexes in transaction batch processing.
- 15:01: Hotfix deployed reverting the reconciliation logic; service recovers fully.
- 15:05: All stuck transactions reprocessed successfully.
Root Cause
The deadlock stemmed from a latent lock-ordering bug in our custom batch processor, exposed by Go 1.23's new sync.Mutex fairness policy. The new reconciliation logic used two mutexes: `batchMu` (protecting pending transaction batches) and `ledgerMu` (protecting the external ledger write client).
Circular wait occurred when:
- Goroutine A acquired `batchMu`, then attempted to acquire `ledgerMu` to write a processed batch.
- Goroutine B acquired `ledgerMu` first (due to Go 1.23's mutex fairness prioritizing waiting goroutines), then attempted to acquire `batchMu` to fetch a new batch.
Go 1.23's updated mutex implementation exposed this latent bug, which had been masked in earlier Go versions by less strict scheduling.
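To make the failure mode concrete, here is a minimal, self-contained sketch of the inverted lock order. This is a hypothetical simplification for illustration, not the production source; only the mutex names mirror the real ones. Under the unlucky interleaving, each goroutine grabs one lock and waits on the other forever:

```go
package main

import "sync"

var (
	batchMu  sync.Mutex // protects pending transaction batches
	ledgerMu sync.Mutex // protects the ledger write client
)

// writeBatch mirrors goroutine A: batchMu, then ledgerMu.
func writeBatch(done chan<- struct{}) {
	batchMu.Lock()
	defer batchMu.Unlock()
	ledgerMu.Lock() // blocks forever if nextBatch already holds ledgerMu
	defer ledgerMu.Unlock()
	done <- struct{}{}
}

// nextBatch mirrors goroutine B: ledgerMu, then batchMu -- the reversed order.
func nextBatch(done chan<- struct{}) {
	ledgerMu.Lock()
	defer ledgerMu.Unlock()
	batchMu.Lock() // blocks forever if writeBatch already holds batchMu
	defer batchMu.Unlock()
	done <- struct{}{}
}

func main() {
	done := make(chan struct{})
	go writeBatch(done)
	go nextBatch(done)
	// If both goroutines block, the runtime aborts the process with
	// "fatal error: all goroutines are asleep - deadlock!".
	<-done
	<-done
}
```

Note that the toy program only deadlocks under the losing interleaving; in the real service, sustained load made that interleaving near-certain.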
Debugging Workflow
We used two key tools to isolate the issue: pprof 1.3 (bundled with Go 1.23) and Grafana 10.4 with our custom Go runtime metrics integration.
Step 1: Grafana 10.4 Alert Triage
Our Grafana 10.4 dashboard for Go services includes a "Goroutine Stall" panel tracking the `go_goroutines` and `go_mutex_wait_duration_seconds` metrics. At 14:18, mutex wait duration spiked to 12 seconds (baseline <50 ms) and the goroutine count plateaued at 2,147 (well under our configured ceiling of 5,000, but every goroutine was stalled).
Grafana 10.4's new goroutine lifetime tracking feature showed 98% of stalled goroutines were blocked on sync.Mutex.Lock in the reconciliation package, narrowing our search scope immediately.
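For reference, here is a minimal sketch of how such a panel can be fed, assuming the prometheus/client_golang library. `go_goroutines` comes from the library's default Go collector; the mutex wait gauge (a name of our own convention, not a standard metric) can be derived from the standard runtime/metrics package:

```go
package main

import (
	"log"
	"net/http"
	"runtime/metrics"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Custom gauge approximating the dashboard's go_mutex_wait_duration_seconds
	// (the Prometheus name is our own convention, not a standard metric).
	mutexWait := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "go_mutex_wait_duration_seconds",
		Help: "Cumulative time goroutines have spent blocked on mutexes.",
	}, func() float64 {
		// /sync/mutex/wait/total:seconds is a standard runtime/metrics name
		// (Go 1.20+).
		sample := []metrics.Sample{{Name: "/sync/mutex/wait/total:seconds"}}
		metrics.Read(sample)
		return sample[0].Value.Float64()
	})
	prometheus.MustRegister(mutexWait)

	// The default registry already carries the Go collector, which exports
	// the go_goroutines gauge used by the dashboard.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```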
Step 2: pprof 1.3 Mutex Profiling
We collected a 30-second mutex profile from the running service via pprof 1.3's updated `/debug/pprof/mutex?seconds=30` endpoint. The profile output now includes mutex contention stacks with nanosecond-level precision, which revealed:
- `batchMu` had 1,892 contended acquisitions, with 92% of waits originating from `ledger.writeBatch`.
- `ledgerMu` had 1,891 contended acquisitions, with 91% of waits originating from `batch.next`.
This near-1:1 contention ratio, with each mutex's waits originating from a code path that holds the other, confirmed a circular wait deadlock: the classic scenario satisfying all four Coffman conditions (mutual exclusion, hold and wait, no preemption, circular wait).
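For completeness, the mutex profile is only available if mutex profiling is enabled in the binary. A minimal sketch of the instrumentation (the sampling rate here is illustrative, not the value we run in production):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
	"runtime"
)

func main() {
	// A non-zero fraction is required, or /debug/pprof/mutex returns an empty
	// profile. 1 records every contention event; busy services often use a
	// higher divisor to reduce overhead.
	runtime.SetMutexProfileFraction(1)

	go func() {
		log.Fatal(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real service's request loop
}
```

A delta profile is then collected with `go tool pprof 'http://localhost:6060/debug/pprof/mutex?seconds=30'`.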
Step 3: Code Review and Verification
Cross-referencing pprof stacks with the new reconciliation code confirmed the circular lock order: `batchMu` → `ledgerMu` in `writeBatch`, and `ledgerMu` → `batchMu` in `nextBatch`. Go 1.23's mutex fairness made this deadlock reproducible under load, whereas earlier Go versions would randomly break the cycle.
Resolution
We deployed a hotfix within two minutes of confirming the root cause, reverting the new reconciliation logic. The permanent fix enforces a global lock order: all code paths must acquire `batchMu` before `ledgerMu`. We also added a static analysis rule using go vet (updated for Go 1.23) to detect inconsistent mutex ordering in CI pipelines.
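In sketch form (again a hypothetical simplification, not the production code), the corrected ordering makes `nextBatch` take the locks in the same order as `writeBatch`:

```go
package reconcile

import "sync"

var (
	batchMu  sync.Mutex // rank 1: always acquired first
	ledgerMu sync.Mutex // rank 2: always acquired second
)

// writeBatch keeps its original order: batchMu, then ledgerMu.
func writeBatch() {
	batchMu.Lock()
	defer batchMu.Unlock()
	ledgerMu.Lock()
	defer ledgerMu.Unlock()
	// ... write the processed batch via the ledger client ...
}

// nextBatch previously took ledgerMu first; it now follows the global order,
// so no goroutine can ever hold ledgerMu while waiting on batchMu.
func nextBatch() {
	batchMu.Lock()
	defer batchMu.Unlock()
	ledgerMu.Lock()
	defer ledgerMu.Unlock()
	// ... fetch the next pending batch ...
}
```

The trade-off is that `nextBatch` now holds `batchMu` slightly longer than strictly necessary; we accepted that extra contention in exchange for deadlock freedom.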
The hotfix restored service availability in 2 minutes, with no data loss. All stuck transactions were reprocessed from our Kafka replay topic within 4 minutes of recovery.
Lessons Learned
- Go 1.23's mutex fairness changes can expose latent lock ordering bugs: test new Go versions under production-like load before full rollout.
- pprof 1.3's improved mutex profiling is critical for debugging contention issues: enable mutex profiling in all production Go services.
- Grafana 10.4's goroutine lifetime tracking reduces mean time to detection (MTTD) for deadlocks by 60% compared to earlier versions.
- Enforce global mutex ordering via CI checks to prevent circular wait conditions entirely (a runtime-guard sketch follows below).
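Beyond static checks, a lightweight runtime guard can catch ordering violations in integration tests. A hypothetical sketch (`lockorder`, `Mutex.Rank`, and `Order` are illustrative names, not an existing library): each lock gets a rank in the global order, callers thread an `Order` value through the code path, and acquiring out of rank order panics immediately instead of deadlocking intermittently:

```go
package lockorder

import (
	"fmt"
	"sync"
)

// Mutex is a sync.Mutex with a fixed position in the global lock order.
type Mutex struct {
	sync.Mutex
	Rank int // e.g. batchMu = 1, ledgerMu = 2
}

// Order tracks the ranks a single goroutine currently holds. It must be
// threaded through the call path explicitly, since Go has no
// goroutine-local storage.
type Order struct{ held []int }

// Acquire locks m, panicking if a lock of equal or higher rank is already
// held -- i.e. if this acquisition would violate the global order.
func (o *Order) Acquire(m *Mutex) {
	for _, r := range o.held {
		if r >= m.Rank {
			panic(fmt.Sprintf("lock order violation: acquiring rank %d while holding rank %d", m.Rank, r))
		}
	}
	m.Lock()
	o.held = append(o.held, m.Rank)
}

// Release unlocks m; locks are assumed to be released in LIFO order.
func (o *Order) Release(m *Mutex) {
	m.Unlock()
	o.held = o.held[:len(o.held)-1]
}
```

Wrapping `batchMu` and `ledgerMu` this way would have turned the intermittent production deadlock into a deterministic panic in CI.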
Conclusion
This incident highlighted the importance of updating debugging tooling alongside language runtime upgrades. pprof 1.3 and Grafana 10.4 were instrumental in cutting our time to diagnosis from an estimated 2 hours (with legacy tools) to 17 minutes, from the first alert at 14:18 to the contended lock identified at 14:35. We've since rolled out mutex ordering checks to all our Go services and updated our Go 1.23 rollout playbook to include deadlock-specific load tests.