Shopify · Databases · 17 May 2026
During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.
- 19M MySQL QPS at BFCM peak
- 58M requests/min app servers
- 99.999%+ uptime maintained
- Deadlock pattern taxonomy built
- Schema + query-level fixes
- Aug 2024 blog published
The Story
- 19M QPS — MySQL queries per second during BFCM 2023 peak — the environment where deadlock patterns that are rare at normal traffic become regular production incidents
- 58M req/min — Application server request rate at peak BFCM — equivalent to roughly 967,000 requests per second across Shopify's core application servers
- 99.999%+ — Shopify's uptime target maintained through BFCM — the reliability bar that makes every database performance issue a potential SLA risk
- 3 patterns — Main deadlock pattern categories identified and addressed: transaction ordering, gap locking, and index selection — each requiring different fix strategies
MySQL deadlocks (a situation where two or more transactions are each waiting for locks held by the other, creating a circular dependency that MySQL resolves by automatically aborting one transaction (the 'deadlock victim')) are a fact of life in high-concurrency relational databases. At small scale, they're rare and easily retried. At Shopify's Black Friday and Cyber Monday scale — 19 million MySQL queries per second , 58 million application requests per minute — even rare deadlock patterns become frequent enough to affect real user experiences. A deadlock rate that generates one deadlock per million transactions is invisible at 1,000 QPS. At 19 million QPS, it generates 19 deadlocks per second continuously.
MySQL's deadlock handling is automatic: when it detects a circular lock dependency, it selects one transaction as the deadlock victim , aborts it, and allows the other transaction to proceed. The deadlock victim receives an error: Deadlock found when trying to get lock; try restarting transaction. Most application frameworks, including Rails, can be configured to automatically retry deadlock victims — but retries add latency, consume connection pool slots, and under sustained deadlock conditions can create retry storms that worsen the overload. The solution at scale is not to retry more aggressively but to eliminate the deadlock patterns entirely.
WHY DEADLOCKS GET WORSE AT SCALE
Deadlock frequency scales super-linearly with concurrency. Doubling transactions per second more than doubles deadlock frequency because the probability of two transactions conflicting increases with every additional concurrent transaction. The same query patterns that were safe at 10x lower traffic can become pathological at BFCM scale. Shopify specifically tests for deadlock conditions before BFCM to identify and eliminate patterns that would become problems under peak load.
The Three Deadlock Patterns
Shopify's engineering team identified three primary deadlock patterns in their high-concurrency MySQL environment. Pattern 1: Transaction ordering conflicts — two transactions each acquiring locks in opposite order (Transaction A locks row X then row Y; Transaction B locks row Y then row X). Pattern 2: Gap lock conflicts — MySQL's gap locking (a mechanism in InnoDB's REPEATABLE READ isolation level that locks the gap before a row to prevent phantom reads — this can cause unexpected conflicts between transactions that aren't modifying the same rows) mechanism, which prevents phantom reads by locking ranges rather than individual rows, creating unexpected conflicts between transactions accessing the same range. Pattern 3: Index selection conflicts — MySQL choosing suboptimal index plans under high concurrency, causing different transactions to acquire locks in different orders depending on the query execution plan.
Problem
BFCM Scale Turns Rare Deadlocks into Incidents
At 19M MySQL QPS, deadlock patterns that are statistically invisible at normal load become consistent sources of latency spikes and error rates. MySQL's automatic deadlock resolution aborts one transaction per deadlock, requiring application-level retry logic that adds latency and consumes connection pool capacity.
Cause
Three Structural Deadlock Patterns
Transaction ordering (acquiring the same locks in different orders), gap locking (InnoDB's phantom-read prevention causing unexpected range lock conflicts), and index selection (query planner choosing different indexes under load, causing lock order variation). Each pattern has a different root cause and requires a different fix strategy.
Solution
Schema Changes + Query Pattern Fixes + Index Hints
Transaction ordering: enforce consistent lock acquisition order in application code. Gap locking: switch to READ COMMITTED isolation for high-concurrency tables where phantom reads aren't a concern, or add precise indexes to avoid gap lock ranges. Index selection: use index hints to force consistent query plans under varying load conditions.
Result
BFCM 2023 Completed Without Deadlock Incidents
BFCM 2023 ran at 19M MySQL QPS without deadlock-related incidents. The pre-BFCM deadlock analysis and mitigation work became a repeatable playbook for identifying and eliminating deadlock patterns before major traffic events.
⚠️
Gap Locking: The Invisible Conflict Source
MySQL's InnoDB uses REPEATABLE READ (the default MySQL transaction isolation level that prevents other transactions from modifying rows you've already read within a transaction — implemented in part through gap locks on ranges of the index) as its default isolation level, which uses gap locking to prevent phantom reads (a scenario where a transaction reads a set of rows, another transaction inserts a new row in that set, and the first transaction reads again and sees the new 'phantom' row). The side effect: transactions that touch overlapping index ranges can deadlock on gap locks even if they're accessing completely different rows. At high concurrency, gap lock deadlocks can be more common than row-level lock deadlocks. Switching high-concurrency tables to READ COMMITTED isolation eliminates gap locking at the cost of allowing phantom reads — an acceptable tradeoff for many commerce workloads.
🔒
MySQL's deadlock detector runs after every failed lock acquisition attempt , searching for circular lock dependencies. At 19M QPS this detection runs millions of times per second — a reminder that MySQL's deadlock resolution machinery is doing meaningful work constantly, not just during obvious deadlock events.
🗂️
Reading the Deadlock Log: Forensic Analysis
MySQL records the most recent deadlock in
SHOW ENGINE INNODB STATUS. This output includes the transaction SQL, the locks held, the locks waited for, and which transaction was chosen as the victim. Forensic analysis of the deadlock log is the primary diagnosis tool — it reveals the exact lock ordering conflict that caused each deadlock pattern, which informs which fix strategy to apply.
Shopify's approach to BFCM preparation includes explicit deadlock testing at high concurrency levels in staging. Load tests simulate BFCM-scale traffic against representative database states, and the resulting deadlock logs are analyzed to identify patterns that would be problematic at production scale. This pre-event analysis allows the team to eliminate deadlock patterns before they become BFCM incidents — converting reactive incident response into proactive engineering.
THE RETRY LOOP DANGER
MySQL automatically selects a deadlock victim and aborts one transaction. Most frameworks, including Rails with ActiveRecord, can be configured to automatically retry deadlocked transactions. But retries are dangerous under sustained deadlock conditions: each retry re-acquires locks and adds to the contention, potentially creating a retry storm that makes the deadlock situation worse. The correct strategy is to eliminate deadlock patterns, not to retry aggressively — retries are acceptable for occasional, transient deadlocks but not for structural patterns.
The Fix
Fix Patterns by Deadlock Category
Each of the three deadlock categories has a distinct fix strategy. Transaction ordering conflicts are fixed at the application level by enforcing a canonical lock acquisition order — if every transaction that needs to lock resources A and B always acquires them in the same order (A before B), circular dependency is impossible. Gap lock conflicts are fixed either at the database level (changing isolation level for specific tables) or at the schema level (adding precise indexes that reduce the range that InnoDB locks). Index selection conflicts are fixed by adding index hints to force MySQL to use the same index plan consistently across varying load conditions.
-- Pattern 1: Transaction ordering deadlock
-- BAD: Transaction A and B acquire locks in opposite order
-- Transaction A: UPDATE orders, then UPDATE inventory
-- Transaction B: UPDATE inventory, then UPDATE orders
-- Fix: enforce canonical lock order in application code
-- Pattern 2: Gap lock deadlock
-- At REPEATABLE READ isolation, MySQL locks gaps to prevent phantoms
-- Fix option A: Use READ COMMITTED for high-concurrency tables
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- This disables gap locking — only row-level locks on actual rows
-- Trade-off: phantom reads possible within transaction
-- Fix option B: Add precise index to narrow the locked range
-- Before: idx_status (status) -- gaps between statuses are locked
-- After: idx_status_created_at (status, created_at) -- narrower range
ALTER TABLE orders ADD INDEX idx_status_created_at (status, created_at);
-- Pattern 3: Index selection inconsistency under load
-- Under high concurrency, MySQL's query planner may choose different indexes
-- Fix: Force consistent index selection with index hint
SELECT * FROM orders FORCE INDEX (idx_user_id_status)
WHERE user_id = ? AND status = 'pending';
-- Forces the same index plan regardless of load conditions
-- Prevents lock order variation from causing deadlocks
THE BFCM PREPARATION PLAYBOOK
Shopify treats BFCM deadlock prevention as a structured engineering program: (1) Load test at BFCM scale in staging to generate representative deadlock logs. (2) Analyze deadlock logs to identify patterns and their categories. (3) Apply fix strategy per category: ordering fixes, isolation level changes, or index hints. (4) Re-test to verify the fix eliminates the pattern. (5) Deploy and monitor during BFCM itself. This loop is not run once — it's part of the pre-BFCM engineering sprint every year.
ℹ️
READ COMMITTED vs REPEATABLE READ: The Tradeoff
Switching high-concurrency tables from REPEATABLE READ to READ COMMITTED eliminates gap locking and dramatically reduces deadlock frequency — but at the cost of allowing phantom reads within a transaction (a subsequent read in the same transaction might see rows that didn't exist at the transaction's start). For most commerce operations — inserting an order, updating inventory, processing a payment — phantom read isolation isn't required. For reporting or analytics queries that require consistent snapshots, REPEATABLE READ is still appropriate. Shopify evaluates each table's access patterns to determine the right isolation level.
✅
The Performance Monitoring Integration
Shopify monitors deadlock rates as a first-class operational metric. The deadlock rate graph is visible on engineering dashboards alongside query latency and error rates. When deadlock rates spike during BFCM, they're caught within seconds — not minutes — and the specific transaction patterns can be identified from the deadlock log immediately. Real-time deadlock visibility is as important as the fix strategies themselves.
Index hints are a last resort — they trade flexibility for predictability. MySQL's query planner is usually good at selecting optimal indexes, and overriding it can cause performance regressions when data distributions change. But at BFCM scale, unpredictable index selection under high concurrency is more dangerous than a slightly suboptimal but consistent plan. The engineers who added index hints to specific high-volume queries did so with full awareness of the tradeoff, and with monitoring in place to detect if the forced plan becomes problematic over time.
ℹ️
Canonical Lock Order: Application-Level Enforcement
The canonical lock order fix requires reviewing every code path that modifies multiple related objects within a transaction. In a Rails application, this means identifying all ActiveRecord transactions that update, say, both an Order and an Inventory record, and ensuring they always acquire the lock on the object with the lower ID first (or some other consistent ordering). This is a refactoring task, not a database config change — and it requires careful code review to ensure no path is missed.
BEFORE BFCM: THE DEADLOCK AUDIT
Shopify's engineering team conducts a pre-BFCM deadlock audit: run load tests at maximum expected concurrency, collect InnoDB status outputs repeatedly during the load test, parse the deadlock logs, categorize patterns, apply fixes, re-test. The audit is a structured engineering sprint, not an ad-hoc investigation. The earlier the deadlock patterns are found, the more time there is to implement and validate fixes before the day that those patterns become 19-per-second incidents.
Architecture
Shopify's MySQL architecture for the BFCM period reflects years of scaling decisions: a mix of MySQL 5.7 and MySQL 8 instances, federated across over 100 independent shards for the Shopify Core application, with Vitess managing the most rapidly growing workloads. The deadlock patterns that Shopify addressed affect all high-concurrency MySQL deployments — the scale just makes them visible faster and more painfully.
Transaction Ordering Deadlock: The Classic Pattern
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Gap Lock Deadlock: The Invisible Pattern
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
INNODB LOCK MONITOR: YOUR DIAGNOSTIC TOOL
MySQL's InnoDB storage engine provides several diagnostic commands for deadlock investigation: SHOW ENGINE INNODB STATUS shows the latest deadlock with full transaction and lock details. performance_schema.data_locks shows currently held and waiting locks in real time. performance_schema.events_statements_history shows recent SQL statements per thread. Together, these tools let engineers reconstruct the exact sequence of events that led to a deadlock — which is the prerequisite for applying the correct fix strategy.
⚠️
Index Hints: The Nuclear Option
Index hints override MySQL's query planner for a specific query. They are maintenance debt : if the data distribution changes and the forced index becomes suboptimal, the hint will cause performance regressions that are difficult to diagnose (the hint is in the application code, not visible in query explain plans without reading the source). Use index hints only for queries where inconsistent index selection is causing production deadlocks, document them extensively, and review them regularly as the database evolves.
✅
MySQL 8.0 Deadlock Improvements
MySQL 8.0 introduced several improvements relevant to deadlock diagnosis and prevention. NOWAIT and SKIP LOCKED allow queries to immediately return an error or skip locked rows rather than waiting — useful for queue-like patterns. Invisible indexes allow testing new indexes without them being used by the query planner. Improved performance_schema provides better lock visibility. GitHub's MySQL 8.0 upgrade (covered elsewhere in TechLogStack) and Shopify's own MySQL fleet management both reflect the industry's move to take advantage of these improvements at scale.
Lessons
Shopify's deadlock playbook is one of the most practical database engineering documents published by a major engineering blog. It translates academic database theory into actionable production fixes for the three patterns that account for the vast majority of MySQL deadlocks at high concurrency.
- 01. Deadlock rates scale super-linearly with concurrency. Patterns that are statistically invisible at normal traffic become consistent incidents at peak scale. Test your database workloads at maximum expected concurrency before every major traffic event — not just for performance, but explicitly for deadlock patterns.
- 02. Gap locks (InnoDB locks that cover the space between indexed values to prevent phantom reads in REPEATABLE READ isolation) are a frequent and underappreciated source of MySQL deadlocks at high concurrency. Consider using READ COMMITTED isolation for high-throughput tables where phantom read protection is not required — it eliminates an entire class of deadlock patterns at the cost of weaker isolation semantics.
- 03. Enforce canonical lock acquisition order in your application layer. If every code path that acquires multiple locks always acquires them in the same order, circular dependencies become impossible. This is the fundamental fix for transaction ordering deadlocks, and it requires reviewing all code paths that modify the same set of resources in a single transaction.
- 04. MySQL's query planner can select different indexes under varying concurrency conditions, leading to lock order variability that produces deadlocks at high load but not low load. Use EXPLAIN ANALYZE at peak load (not just nominal load) to understand actual query execution plans, and use index hints selectively to force consistent plans for known-problematic queries.
- 05. Monitor deadlock rates as a first-class operational metric with real-time visibility. Deadlocks discovered in postmortem analysis are incidents. Deadlocks discovered in real-time monitoring are operational data. Alerting on deadlock rate spikes during high-traffic events gives engineers the ability to act on deadlock patterns before they escalate to user-visible incidents.
⚠️
Deadlocks Are Not Random
Engineers sometimes treat MySQL deadlocks as random background noise — occasional events that retries handle automatically. At high concurrency, deadlocks are not random. They are deterministic consequences of specific query patterns executing at sufficient concurrency. Every deadlock pattern can be analyzed, categorized, and eliminated. Treating deadlocks as random events prevents you from doing that work.
THE ISOLATION LEVEL TRADEOFF IN PRACTICE
Changing table-level isolation from REPEATABLE READ to READ COMMITTED is a schema-level decision that requires careful per-table evaluation. Questions to answer: Do any queries on this table require phantom read protection? (Analytics queries often do; CRUD operations often don't.) What application logic depends on consistent snapshots within a transaction? What is the read-to-write ratio? READ COMMITTED reduces read-side lock contention, which benefits high-concurrency tables where writes dominate. The decision requires understanding the application's correctness requirements, not just the performance characteristics.
At 19 million queries per second, even a one-in-a-million database error happens 19 times a second — which is Shopify's way of saying there's no such thing as 'too rare to matter.'
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
Top comments (0)