TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

#database #shopify #engineer #performance

Shopify · Databases · 17 May 2026

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.

19M MySQL QPS at BFCM peak
58M requests/min app servers
99.999%+ uptime maintained
Deadlock pattern taxonomy built
Schema + query-level fixes
Aug 2024 blog published

The Story

19M QPS — MySQL queries per second during BFCM 2023 peak — the environment where deadlock patterns that are rare at normal traffic become regular production incidents
58M req/min — Application server request rate at peak BFCM — equivalent to roughly 967,000 requests per second across Shopify's core application servers
99.999%+ — Shopify's uptime target maintained through BFCM — the reliability bar that makes every database performance issue a potential SLA risk
3 patterns — Main deadlock pattern categories identified and addressed: transaction ordering, gap locking, and index selection — each requiring different fix strategies

MySQL deadlocks (a situation where two or more transactions are each waiting for locks held by the other, creating a circular dependency that MySQL resolves by automatically aborting one transaction (the 'deadlock victim')) are a fact of life in high-concurrency relational databases. At small scale, they're rare and easily retried. At Shopify's Black Friday and Cyber Monday scale — 19 million MySQL queries per second , 58 million application requests per minute — even rare deadlock patterns become frequent enough to affect real user experiences. A deadlock rate that generates one deadlock per million transactions is invisible at 1,000 QPS. At 19 million QPS, it generates 19 deadlocks per second continuously.

MySQL's deadlock handling is automatic: when it detects a circular lock dependency, it selects one transaction as the deadlock victim , aborts it, and allows the other transaction to proceed. The deadlock victim receives an error: Deadlock found when trying to get lock; try restarting transaction. Most application frameworks, including Rails, can be configured to automatically retry deadlock victims — but retries add latency, consume connection pool slots, and under sustained deadlock conditions can create retry storms that worsen the overload. The solution at scale is not to retry more aggressively but to eliminate the deadlock patterns entirely.

WHY DEADLOCKS GET WORSE AT SCALE

Deadlock frequency scales super-linearly with concurrency. Doubling transactions per second more than doubles deadlock frequency because the probability of two transactions conflicting increases with every additional concurrent transaction. The same query patterns that were safe at 10x lower traffic can become pathological at BFCM scale. Shopify specifically tests for deadlock conditions before BFCM to identify and eliminate patterns that would become problems under peak load.

The Three Deadlock Patterns

Shopify's engineering team identified three primary deadlock patterns in their high-concurrency MySQL environment. Pattern 1: Transaction ordering conflicts — two transactions each acquiring locks in opposite order (Transaction A locks row X then row Y; Transaction B locks row Y then row X). Pattern 2: Gap lock conflicts — MySQL's gap locking (a mechanism in InnoDB's REPEATABLE READ isolation level that locks the gap before a row to prevent phantom reads — this can cause unexpected conflicts between transactions that aren't modifying the same rows) mechanism, which prevents phantom reads by locking ranges rather than individual rows, creating unexpected conflicts between transactions accessing the same range. Pattern 3: Index selection conflicts — MySQL choosing suboptimal index plans under high concurrency, causing different transactions to acquire locks in different orders depending on the query execution plan.

Problem

BFCM Scale Turns Rare Deadlocks into Incidents

At 19M MySQL QPS, deadlock patterns that are statistically invisible at normal load become consistent sources of latency spikes and error rates. MySQL's automatic deadlock resolution aborts one transaction per deadlock, requiring application-level retry logic that adds latency and consumes connection pool capacity.

Cause

Three Structural Deadlock Patterns

Transaction ordering (acquiring the same locks in different orders), gap locking (InnoDB's phantom-read prevention causing unexpected range lock conflicts), and index selection (query planner choosing different indexes under load, causing lock order variation). Each pattern has a different root cause and requires a different fix strategy.

Solution

Schema Changes + Query Pattern Fixes + Index Hints

Transaction ordering: enforce consistent lock acquisition order in application code. Gap locking: switch to READ COMMITTED isolation for high-concurrency tables where phantom reads aren't a concern, or add precise indexes to avoid gap lock ranges. Index selection: use index hints to force consistent query plans under varying load conditions.

Result

BFCM 2023 Completed Without Deadlock Incidents

BFCM 2023 ran at 19M MySQL QPS without deadlock-related incidents. The pre-BFCM deadlock analysis and mitigation work became a repeatable playbook for identifying and eliminating deadlock patterns before major traffic events.

⚠️

Gap Locking: The Invisible Conflict Source

MySQL's InnoDB uses REPEATABLE READ (the default MySQL transaction isolation level that prevents other transactions from modifying rows you've already read within a transaction — implemented in part through gap locks on ranges of the index) as its default isolation level, which uses gap locking to prevent phantom reads (a scenario where a transaction reads a set of rows, another transaction inserts a new row in that set, and the first transaction reads again and sees the new 'phantom' row). The side effect: transactions that touch overlapping index ranges can deadlock on gap locks even if they're accessing completely different rows. At high concurrency, gap lock deadlocks can be more common than row-level lock deadlocks. Switching high-concurrency tables to READ COMMITTED isolation eliminates gap locking at the cost of allowing phantom reads — an acceptable tradeoff for many commerce workloads.

🔒

MySQL's deadlock detector runs after every failed lock acquisition attempt , searching for circular lock dependencies. At 19M QPS this detection runs millions of times per second — a reminder that MySQL's deadlock resolution machinery is doing meaningful work constantly, not just during obvious deadlock events.

🗂️

Reading the Deadlock Log: Forensic Analysis

MySQL records the most recent deadlock in SHOW ENGINE INNODB STATUS. This output includes the transaction SQL, the locks held, the locks waited for, and which transaction was chosen as the victim. Forensic analysis of the deadlock log is the primary diagnosis tool — it reveals the exact lock ordering conflict that caused each deadlock pattern, which informs which fix strategy to apply.

Shopify's approach to BFCM preparation includes explicit deadlock testing at high concurrency levels in staging. Load tests simulate BFCM-scale traffic against representative database states, and the resulting deadlock logs are analyzed to identify patterns that would be problematic at production scale. This pre-event analysis allows the team to eliminate deadlock patterns before they become BFCM incidents — converting reactive incident response into proactive engineering.

THE RETRY LOOP DANGER

MySQL automatically selects a deadlock victim and aborts one transaction. Most frameworks, including Rails with ActiveRecord, can be configured to automatically retry deadlocked transactions. But retries are dangerous under sustained deadlock conditions: each retry re-acquires locks and adds to the contention, potentially creating a retry storm that makes the deadlock situation worse. The correct strategy is to eliminate deadlock patterns, not to retry aggressively — retries are acceptable for occasional, transient deadlocks but not for structural patterns.

The Fix

Fix Patterns by Deadlock Category

Each of the three deadlock categories has a distinct fix strategy. Transaction ordering conflicts are fixed at the application level by enforcing a canonical lock acquisition order — if every transaction that needs to lock resources A and B always acquires them in the same order (A before B), circular dependency is impossible. Gap lock conflicts are fixed either at the database level (changing isolation level for specific tables) or at the schema level (adding precise indexes that reduce the range that InnoDB locks). Index selection conflicts are fixed by adding index hints to force MySQL to use the same index plan consistently across varying load conditions.

-- Pattern 1: Transaction ordering deadlock
-- BAD: Transaction A and B acquire locks in opposite order
-- Transaction A: UPDATE orders, then UPDATE inventory  
-- Transaction B: UPDATE inventory, then UPDATE orders
-- Fix: enforce canonical lock order in application code

-- Pattern 2: Gap lock deadlock
-- At REPEATABLE READ isolation, MySQL locks gaps to prevent phantoms
-- Fix option A: Use READ COMMITTED for high-concurrency tables
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- This disables gap locking — only row-level locks on actual rows
-- Trade-off: phantom reads possible within transaction

-- Fix option B: Add precise index to narrow the locked range
-- Before: idx_status (status) -- gaps between statuses are locked
-- After: idx_status_created_at (status, created_at) -- narrower range
ALTER TABLE orders ADD INDEX idx_status_created_at (status, created_at);

-- Pattern 3: Index selection inconsistency under load
-- Under high concurrency, MySQL's query planner may choose different indexes
-- Fix: Force consistent index selection with index hint
SELECT * FROM orders FORCE INDEX (idx_user_id_status)
WHERE user_id = ? AND status = 'pending';
-- Forces the same index plan regardless of load conditions
-- Prevents lock order variation from causing deadlocks

THE BFCM PREPARATION PLAYBOOK

Shopify treats BFCM deadlock prevention as a structured engineering program: (1) Load test at BFCM scale in staging to generate representative deadlock logs. (2) Analyze deadlock logs to identify patterns and their categories. (3) Apply fix strategy per category: ordering fixes, isolation level changes, or index hints. (4) Re-test to verify the fix eliminates the pattern. (5) Deploy and monitor during BFCM itself. This loop is not run once — it's part of the pre-BFCM engineering sprint every year.

ℹ️

READ COMMITTED vs REPEATABLE READ: The Tradeoff

Switching high-concurrency tables from REPEATABLE READ to READ COMMITTED eliminates gap locking and dramatically reduces deadlock frequency — but at the cost of allowing phantom reads within a transaction (a subsequent read in the same transaction might see rows that didn't exist at the transaction's start). For most commerce operations — inserting an order, updating inventory, processing a payment — phantom read isolation isn't required. For reporting or analytics queries that require consistent snapshots, REPEATABLE READ is still appropriate. Shopify evaluates each table's access patterns to determine the right isolation level.

✅

The Performance Monitoring Integration

Shopify monitors deadlock rates as a first-class operational metric. The deadlock rate graph is visible on engineering dashboards alongside query latency and error rates. When deadlock rates spike during BFCM, they're caught within seconds — not minutes — and the specific transaction patterns can be identified from the deadlock log immediately. Real-time deadlock visibility is as important as the fix strategies themselves.

Index hints are a last resort — they trade flexibility for predictability. MySQL's query planner is usually good at selecting optimal indexes, and overriding it can cause performance regressions when data distributions change. But at BFCM scale, unpredictable index selection under high concurrency is more dangerous than a slightly suboptimal but consistent plan. The engineers who added index hints to specific high-volume queries did so with full awareness of the tradeoff, and with monitoring in place to detect if the forced plan becomes problematic over time.

ℹ️

Canonical Lock Order: Application-Level Enforcement

The canonical lock order fix requires reviewing every code path that modifies multiple related objects within a transaction. In a Rails application, this means identifying all ActiveRecord transactions that update, say, both an Order and an Inventory record, and ensuring they always acquire the lock on the object with the lower ID first (or some other consistent ordering). This is a refactoring task, not a database config change — and it requires careful code review to ensure no path is missed.

BEFORE BFCM: THE DEADLOCK AUDIT

Shopify's engineering team conducts a pre-BFCM deadlock audit: run load tests at maximum expected concurrency, collect InnoDB status outputs repeatedly during the load test, parse the deadlock logs, categorize patterns, apply fixes, re-test. The audit is a structured engineering sprint, not an ad-hoc investigation. The earlier the deadlock patterns are found, the more time there is to implement and validate fixes before the day that those patterns become 19-per-second incidents.

Architecture

Shopify's MySQL architecture for the BFCM period reflects years of scaling decisions: a mix of MySQL 5.7 and MySQL 8 instances, federated across over 100 independent shards for the Shopify Core application, with Vitess managing the most rapidly growing workloads. The deadlock patterns that Shopify addressed affect all high-concurrency MySQL deployments — the scale just makes them visible faster and more painfully.

Transaction Ordering Deadlock: The Classic Pattern

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Gap Lock Deadlock: The Invisible Pattern