DEV Community

Cover image for Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second
TechLogStack
TechLogStack

Posted on • Originally published at techlogstack.com on

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

  • 19M MySQL QPS at BFCM 2023 peak — the scale where rare deadlock patterns become regular incidents
  • 58M requests/minute on application servers at peak — ~967,000 requests/second
  • 99.999%+ uptime maintained through BFCM
  • 3 deadlock patterns identified: transaction ordering, gap locking, and index selection
  • Pre-event deadlock audit — load test at BFCM scale, analyze logs, apply fixes, redeploy before the day
  • At 19M QPS, a one-in-a-million error happens 19 times per second

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.


The Story

MySQL deadlocks (a situation where two or more transactions are each waiting for locks held by the other, creating a circular dependency that MySQL resolves by automatically aborting one transaction — the "deadlock victim") are a fact of life in high-concurrency relational databases. At small scale, they're rare and easily retried. At Shopify's BFCM scale — 19 million MySQL queries per second, 58 million application requests per minute — even rare deadlock patterns become frequent enough to affect real user experiences. A deadlock rate that generates one deadlock per million transactions is invisible at 1,000 QPS. At 19 million QPS, it generates 19 deadlocks per second continuously.

MySQL's deadlock handling is automatic: when it detects a circular lock dependency, it selects one transaction as the deadlock victim, aborts it, and allows the other to proceed. The deadlock victim receives an error: Deadlock found when trying to get lock; try restarting transaction. Most application frameworks, including Rails, can be configured to automatically retry deadlock victims — but retries add latency, consume connection pool slots, and under sustained deadlock conditions can create retry storms that worsen the overload. The solution at scale is not to retry more aggressively but to eliminate the deadlock patterns entirely.


Why Deadlocks Get Worse at Scale

Deadlock frequency scales super-linearly with concurrency. Doubling transactions per second more than doubles deadlock frequency because the probability of two transactions conflicting increases with every additional concurrent transaction. The same query patterns that were safe at 10× lower traffic can become pathological at BFCM scale. Shopify specifically tests for deadlock conditions before BFCM to identify and eliminate patterns that would become problems under peak load — converting reactive incident response into proactive engineering.

Problem

BFCM Scale Turns Rare Deadlocks into Incidents

At 19M MySQL QPS, deadlock patterns that are statistically invisible at normal load become consistent sources of latency spikes and error rates. MySQL's automatic deadlock resolution aborts one transaction per deadlock, requiring application-level retry logic that adds latency and consumes connection pool capacity.


Cause

Three Structural Deadlock Patterns

Transaction ordering (acquiring the same locks in different orders), gap locking (InnoDB's phantom-read prevention causing unexpected range lock conflicts), and index selection (query planner choosing different indexes under load, causing lock order variation). Each pattern has a different root cause and requires a different fix strategy.


Solution

Schema Changes + Query Pattern Fixes + Index Hints

Transaction ordering: enforce consistent lock acquisition order in application code. Gap locking: switch to READ COMMITTED isolation for high-concurrency tables where phantom reads aren't a concern, or add precise indexes to avoid gap lock ranges. Index selection: use index hints to force consistent query plans under varying load conditions.


Result

BFCM 2023 Completed Without Deadlock Incidents

BFCM 2023 ran at 19M MySQL QPS without deadlock-related incidents. The pre-BFCM deadlock analysis and mitigation work became a repeatable playbook for identifying and eliminating deadlock patterns before major traffic events.


The Fix

Fix Patterns by Deadlock Category

Each of the three deadlock categories has a distinct fix strategy. Transaction ordering conflicts are fixed at the application level by enforcing a canonical lock acquisition order. Gap lock conflicts are fixed either at the database level (changing isolation level) or at the schema level (adding precise indexes). Index selection conflicts are fixed by adding index hints to force consistent query plans.

  • 19M QPS — MySQL queries per second at BFCM peak; the environment where deadlock patterns become per-second incidents
  • 3 patterns — transaction ordering, gap locking, index selection — each requiring a different fix strategy
  • Canonical lock order — application-level enforcement that makes circular lock dependencies impossible
  • BFCM playbook — structured pre-event engineering sprint: load test at scale, analyze deadlock logs, fix, redeploy
-- Pattern 1: Transaction ordering deadlock
-- BAD: Transaction A and B acquire locks in opposite order
--   Transaction A: UPDATE orders, then UPDATE inventory
--   Transaction B: UPDATE inventory, then UPDATE orders
-- Fix: enforce canonical lock order in application code
-- (always lock the lower resource ID first — enforced in application layer)

-- Pattern 2: Gap lock deadlock
-- At REPEATABLE READ, MySQL locks gaps between index values to prevent phantoms
-- Fix A: Use READ COMMITTED for high-concurrency tables
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Disables gap locking — only row-level locks on actual rows
-- Trade-off: phantom reads possible within transaction
-- Acceptable for most CRUD operations; not acceptable for analytics snapshots

-- Fix B: Add precise index to narrow the locked range
-- Before: idx_status (status) — gaps between all status values are locked
-- After: idx_status_created_at (status, created_at) — much narrower range
ALTER TABLE orders ADD INDEX idx_status_created_at (status, created_at);

-- Pattern 3: Index selection inconsistency under load
-- MySQL's query planner may choose different indexes at different concurrency levels
-- Different index = different lock acquisition order = deadlock
-- Fix: Force consistent index selection with hint
SELECT * FROM orders FORCE INDEX (idx_user_id_status)
WHERE user_id = ? AND status = 'pending';
-- Forces the same index plan regardless of load conditions
-- Use sparingly: index hints are maintenance debt
-- Only apply for queries where inconsistent selection is causing production deadlocks
Enter fullscreen mode Exit fullscreen mode

The Gap Lock Problem: The Invisible Conflict Source

MySQL's InnoDB uses REPEATABLE READ (the default MySQL isolation level that prevents other transactions from modifying rows you've already read — implemented via gap locks on index ranges) as its default isolation level. Gap locking prevents phantom reads (where a subsequent read within the same transaction sees newly inserted rows that didn't exist at the transaction's start). The side effect: transactions that touch overlapping index ranges can deadlock on gap locks even if they're accessing completely different rows. At high concurrency, gap lock deadlocks can be more common than row-level lock deadlocks. Switching high-concurrency tables to READ COMMITTED eliminates gap locking at the cost of allowing phantom reads — an acceptable tradeoff for most commerce workloads.

The BFCM preparation playbook
Shopify treats BFCM deadlock prevention as a structured engineering programme: (1) Load test at BFCM scale in staging to generate representative deadlock logs. (2) Analyse deadlock logs to identify patterns and their categories — use SHOW ENGINE INNODB STATUS and performance_schema.data_locks. (3) Apply fix strategy per category: ordering fixes, isolation level changes, or index hints. (4) Re-test to verify the fix eliminates the pattern. (5) Deploy and monitor during BFCM itself. This loop is part of the pre-BFCM engineering sprint every year — not a one-time exercise.

Canonical lock order: application-level enforcement
The canonical lock order fix requires reviewing every code path that modifies multiple related objects within a transaction. In a Rails application, this means identifying all ActiveRecord transactions that update, say, both an Order and an Inventory record, and ensuring they always acquire the lock on the object with the lower ID first (or some other consistent ordering). This is a refactoring task, not a database config change — and it requires careful code review to ensure no code path is missed. If every transaction that needs locks A and B always acquires them in the same order (A before B), circular dependency is impossible by construction.

Index hints: the nuclear option
Index hints override MySQL's query planner for a specific query. They are maintenance debt: if the data distribution changes and the forced index becomes suboptimal, the hint will cause performance regressions that are difficult to diagnose (the hint is in application code, not visible in explain plans without reading the source). Use index hints only for queries where inconsistent index selection is causing production deadlocks, document them extensively, and review them regularly as the database evolves.


Architecture

Shopify's MySQL architecture for BFCM reflects years of scaling decisions: MySQL instances federated across over 100 independent shards for the Shopify Core application, with Vitess managing the most rapidly growing workloads. The deadlock patterns addressed here affect all high-concurrency MySQL deployments — the scale just makes them visible faster and more painfully.

Transaction Ordering Deadlock: The Classic Pattern

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Gap Lock Deadlock: The Invisible Pattern

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).


InnoDB Lock Monitor: Your Diagnostic Tool

MySQL's InnoDB provides several diagnostic commands for deadlock investigation: SHOW ENGINE INNODB STATUS shows the latest deadlock with full transaction and lock details. performance_schema.data_locks shows currently held and waiting locks in real time. performance_schema.events_statements_history shows recent SQL statements per thread. Together, these tools let engineers reconstruct the exact sequence of events that led to a deadlock — which is the prerequisite for applying the correct fix strategy. Forensic analysis of the deadlock log reveals the exact lock ordering conflict, which determines which of the three fix strategies to apply.


Lessons

  1. Deadlock rates scale super-linearly with concurrency. Patterns that are statistically invisible at normal traffic become consistent incidents at peak scale. Test your database workloads at maximum expected concurrency before every major traffic event — not just for performance, but explicitly for deadlock patterns.

  2. Gap locks (InnoDB locks that cover the space between indexed values to prevent phantom reads in REPEATABLE READ isolation) are a frequent and underappreciated source of MySQL deadlocks at high concurrency. Consider using READ COMMITTED isolation for high-throughput tables where phantom read protection is not required — it eliminates an entire class of deadlock patterns at the cost of weaker isolation semantics.

  3. Enforce canonical lock acquisition order in your application layer. If every code path that acquires multiple locks always acquires them in the same order, circular dependencies become impossible. This is the fundamental fix for transaction ordering deadlocks, and it requires reviewing all code paths that modify the same set of resources in a single transaction.

  4. MySQL's query planner can select different indexes under varying concurrency conditions, leading to lock order variability that produces deadlocks at high load but not low load. Use EXPLAIN ANALYZE at peak load (not just nominal load) to understand actual query execution plans, and use index hints selectively for known-problematic queries.

  5. Monitor deadlock rates as a first-class operational metric with real-time visibility. Deadlocks discovered in postmortem analysis are incidents. Deadlocks discovered in real-time monitoring are operational data. Alerting on deadlock rate spikes during high-traffic events gives engineers the ability to act before they escalate to user-visible incidents.


Engineering Glossary

Canonical lock order — an application-level convention where all code paths that acquire multiple locks always do so in the same predetermined order. Prevents circular lock dependencies (transaction A locks X then Y; transaction B locks Y then X) by making it structurally impossible for two transactions to acquire the same locks in opposite sequences.

Deadlock victim — the transaction MySQL automatically selects to abort when it detects a circular lock dependency. Receives the error Deadlock found when trying to get lock; try restarting transaction. Application frameworks can be configured to retry deadlock victims, but retries are a palliative, not a fix.

Gap lock — an InnoDB lock that covers the gap between indexed values, used in REPEATABLE READ isolation to prevent phantom reads. Does not lock specific rows — locks the empty space before them. A common source of unexpected deadlocks at high concurrency, since transactions accessing different rows can deadlock on overlapping gap locks.

Index hint — a SQL clause (FORCE INDEX, USE INDEX) that overrides MySQL's query planner and forces use of a specific index. Used to prevent index selection variability under high load from causing inconsistent lock acquisition orders. Maintenance debt — document all hints and review them as data distributions change.

Phantom read — a scenario where a transaction reads a set of rows, another transaction inserts a new row matching the query, and the first transaction reads again and sees the new "phantom" row. Prevented by gap locks in REPEATABLE READ isolation. Allowed in READ COMMITTED isolation.

READ COMMITTED — a MySQL transaction isolation level weaker than REPEATABLE READ. Does not use gap locks — only row-level locks on actual rows. Eliminates an entire class of gap lock deadlocks at the cost of allowing phantom reads within a transaction. Appropriate for most CRUD operations; not appropriate for analytics or reporting requiring consistent snapshots.

REPEATABLE READ — MySQL's default transaction isolation level. Uses gap locks to prevent phantom reads within a transaction. The isolation level responsible for gap lock deadlocks at high concurrency.


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)


TechLogStack — built at scale, broken in public, rebuilt by engineers.

Top comments (0)