Shopify · Databases · 17 May 2026
The Shop app was growing fast, and its single MySQL database was approaching the limits of vertical scaling. Shopify needed horizontal sharding, but the backend was a Rails monolith that expected a single database, and a commerce platform used by millions of people daily could not afford downtime during the migration.
- KateSQL → Vitess migration
- user_id as sharding key
- VTGate transparent to app
- Dynamic connection switcher
- Zero downtime cutover
- Jan 2024 blog published
The Story
Shopify launched the Shop app in April 2020, giving consumers a personalized browsing and checkout experience across Shopify's merchant network. By 2023, the app had grown to the point where its backend database was approaching the scaling ceiling that every fast-growing application eventually hits. The database powering the Shop backend, running on Shopify's internal managed MySQL system called KateSQL, was a single MySQL instance. Single-instance databases have a hard vertical limit: no matter how much you upgrade the hardware, there is a maximum amount of data and queries per second one machine can handle. Horizontal sharding was the only path forward, and Shopify's team chose Vitess (an open-source MySQL scaling system developed at YouTube that adds horizontal sharding, connection pooling, and query routing on top of standard MySQL) to execute it.
Vitess has a deceptively clean architecture at the application level: applications connect to VTGate (Vitess's query routing proxy — a stateless service that accepts MySQL connections from applications, parses queries, and routes them to the correct shard based on the query's sharding key) as if it were a regular MySQL server. VTGate speaks the MySQL wire protocol, so applications need only update their database connection string. Queries are then routed by VTGate to the appropriate VTTablet (a Vitess process that runs alongside each MySQL instance and manages the connection pool, health checks, and query execution for that shard), which communicates directly with the underlying MySQL process. From the application's perspective, there is one database. From the infrastructure's perspective, there are many. This transparency is what makes Vitess viable for a Rails monolith like Shopify's — the application code doesn't change, only the database topology.
🔑
Shopify chose user_id as the sharding key for the Shop app's user-owned data. Almost all tables in the database are associated with a user, so user_id was a natural choice — it distributes data evenly, ensures all of a user's data lives on the same shard, and keeps user-scoped queries on a single shard without cross-shard joins.
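To make the colocation property concrete, here is a minimal Python sketch of key-based shard routing. It is an illustration under stated assumptions, not Vitess's implementation: `shard_for_user` is a hypothetical helper, SHA-256 stands in for whatever hash the real vindex applies, and the shard count of 4 is arbitrary.

```python
import hashlib


def shard_for_user(user_id: int, shard_count: int) -> int:
    """Map a user_id to a shard index using a stable hash.

    SHA-256 is a stand-in chosen for this sketch; it is not the hash
    function Vitess uses for its vindexes.
    """
    digest = hashlib.sha256(str(user_id).encode("ascii")).digest()
    # Treat the first 8 bytes as an unsigned 64-bit position in the key space.
    keyspace_id = int.from_bytes(digest[:8], "big")
    # Split the 64-bit space into equal contiguous ranges, one per shard.
    return (keyspace_id * shard_count) >> 64


# Rows from different tables that share a user_id resolve to the same shard,
# so a user-scoped query never needs to leave that shard.
order_row = {"table": "orders", "user_id": 42}
preference_row = {"table": "user_preferences", "user_id": 42}
same_shard = (
    shard_for_user(order_row["user_id"], 4)
    == shard_for_user(preference_row["user_id"], 4)
)
```

Because the shard is a pure function of user_id, every table keyed by user_id gets the same placement for a given user, which is what keeps user-scoped joins local to one shard.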
THE VITESSIFYING PHASE
Shopify coined the term 'Vitessifying' for the process of transforming an existing MySQL database into a Vitess keyspace without immediately sharding. In this first phase, a VTTablet is added alongside each MySQL process, and the application is reconfigured to connect through VTGate — but all data still lives on a single shard. This allows the team to validate Vitess integration, test VTGate routing, and gain operational familiarity with Vitess before making the more complex sharding changes.
Problem
Single-Instance Database Approaching Its Ceiling
The Shop app's backend was scaling rapidly but its database was a single MySQL instance. Vertical scaling had diminishing returns and a hard ceiling. The engineering team needed horizontal sharding to support continued growth without database-induced bottlenecks.
Cause
Rails Monolith Expected One Database
Shopify's Shop backend was a Rails application that, like most Rails apps, expected a single primary database connection. Introducing sharding without a transparent proxy would require extensive application-level changes to route queries to the correct shard — a significant refactoring risk. The alternative was a transparent proxy that handled sharding invisibly.
Solution
Vitess + Dynamic Connection Switcher
The migration proceeded in phases: first Vitessifying (adding VTTablet and VTGate without sharding), then adding application-layer VTGate connectivity, then splitting tables into the users and global keyspaces, then horizontally sharding the users keyspace by user_id. A dynamic connection switcher allowed gradual traffic migration from the old system to VTGate, with the percentage adjustable without a deploy.
Result
Horizontally Scalable, App Unchanged
The Shop app backend gained horizontal scalability via Vitess sharding without requiring the application to understand sharding. The connection string changed; the application code did not. Shopify can now add shards as the Shop app continues to grow without additional application-level changes.
⚠️
The Auto-Increment Problem in Sharded Systems
Rails applications default to auto-incrementing integer primary keys, a database feature that generates unique IDs by incrementing a counter. In a sharded system, shards generating auto-increment IDs independently would produce duplicate IDs across shards. Vitess solves this with a Sequences table in an unsharded keyspace: VTTablets cache blocks of IDs from the Sequences table and hand them out, ensuring globally unique IDs across all shards. A cache size of 1000 IDs per VTTablet cuts the write overhead on the Sequences table from one write per ID to one write per block of 1000, while maintaining uniqueness.
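The block-reservation idea can be sketched in a few lines of Python. This is a toy model under assumptions, not Vitess code: `SequenceTable` and `CachedIdAllocator` are hypothetical names, and each allocator simply stands for any process that holds a cached block of IDs.

```python
import threading


class SequenceTable:
    """Stand-in for the sequence row kept in the unsharded keyspace."""

    def __init__(self, start: int = 1) -> None:
        self._next_id = start
        self._lock = threading.Lock()
        self.reservations = 0  # how many times the shared counter was written

    def reserve_block(self, size: int) -> range:
        # One write to the shared counter hands out `size` IDs at once.
        with self._lock:
            first = self._next_id
            self._next_id += size
            self.reservations += 1
            return range(first, first + size)


class CachedIdAllocator:
    """Serves IDs from a locally cached block and refills when it runs dry."""

    def __init__(self, table: SequenceTable, cache_size: int = 1000) -> None:
        self._table = table
        self._cache_size = cache_size
        self._block = iter(())  # start with an empty cache

    def next_id(self) -> int:
        try:
            return next(self._block)
        except StopIteration:
            self._block = iter(self._table.reserve_block(self._cache_size))
            return next(self._block)


seq = SequenceTable()
allocator_a = CachedIdAllocator(seq)
allocator_b = CachedIdAllocator(seq)
ids_a = [allocator_a.next_id() for _ in range(1500)]
ids_b = [allocator_b.next_id() for _ in range(1500)]
```

Here 3000 IDs are issued with only four writes to the shared counter, and the two allocators can never overlap because each block is reserved exactly once.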
The schema migration challenge was particularly subtle. When running schema migrations on a sharded Vitess cluster (DDL, or Data Definition Language, operations: SQL statements like ALTER TABLE that change database structure rather than data), every shard must apply the migration and complete it before the Rails application can query the table schema. If the migration has completed on some shards but not others, a Rails query checking the schema might get an inconsistent view and trigger a dump of a potentially incorrect schema. Shopify's solution: track migrations across all shards, and trigger schema dumps only after all shards have confirmed completion. This required custom Rails tooling to coordinate with Vitess's sharding topology.
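The gating rule described above can be sketched as a simple all-shards check. Shopify's actual tooling is Rails-side and is not shown in the source; the Python below only illustrates the rule, and the names `MigrationState` and `ready_for_schema_dump` are assumptions made for this sketch.

```python
from enum import Enum


class MigrationState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


def ready_for_schema_dump(shard_states: dict[str, MigrationState]) -> bool:
    """Allow the schema dump only once every shard reports completion.

    An empty report counts as "not ready": having heard from no shard
    is not evidence that the migration finished anywhere.
    """
    return bool(shard_states) and all(
        state is MigrationState.COMPLETE for state in shard_states.values()
    )


# Two-shard example; "-80" and "80-" follow Vitess's key-range shard naming.
partial = {"-80": MigrationState.COMPLETE, "80-": MigrationState.RUNNING}
finished = {"-80": MigrationState.COMPLETE, "80-": MigrationState.COMPLETE}
```

With `partial`, the dump stays blocked; with `finished`, it may proceed. The point of the design is that the check is over the whole shard set, never over whichever shard happened to answer first.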
🧩
Two Keyspaces: Users and Global
Shopify split the Shop app data into two keyspaces (a Vitess concept for a logical database that can span one or more shards): a sharded 'users' keyspace containing all user-owned tables (sharded by user_id), and an unsharded 'global' keyspace for data that doesn't belong to individual users and must be accessed without a sharding key. This two-keyspace architecture is a common pattern for Vitess migrations: shard what scales with users, keep globally-accessed lookup data unsharded.
Vitessifying is our internal terminology for the process of transforming an existing MySQL into a keyspace in a Vitess cluster. This allows us to start using core Vitess functionality without explicitly moving data.
Source: Shopify Engineering, 'Horizontally scaling the Rails backend of Shop app with Vitess'
ℹ️
Dynamic Connection Switching: Gradual Traffic Migration
Rather than a hard cutover from KateSQL to VTGate, Shopify built a dynamic connection switcher that allowed them to gradually route increasing percentages of traffic through VTGate while monitoring for performance differences. Starting at a small percentage and slowly ramping to 100% gave the team confidence in VTGate's behavior under real production load before fully committing. The percentage was adjustable at runtime without a code deploy — giving operators immediate control during the migration window.
ℹ️
Shopify's First Vitess Production Deployment
The Shop app backend migration was Shopify's first deployment of Vitess in production. This wasn't just a database migration — it was building organizational competency with a new database infrastructure layer from scratch. The team had to learn Vitess's operational model, its failure modes, its monitoring requirements, and its configuration nuances simultaneously with executing a live migration. Phasing the migration was in part a strategy to build this knowledge incrementally.
⚠️
Cross-Shard Queries: The Scatter-Gather Problem
When a query cannot be routed to a single shard, because it lacks a sharding key or spans multiple shards, Vitess performs a scatter-gather operation: it sends the query to all shards and aggregates the results. Scatter-gather is more expensive than a single-shard query because every shard has to do work for it. Shopify's engineering team reviewed the Shop app's query patterns to identify scatter queries and either added sharding keys to make them single-shard or moved the data they accessed into the global keyspace. Unhandled scatter queries can become performance bottlenecks at scale.
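The cost difference is easy to see in a toy in-memory model. The Python below is a sketch under assumptions, not VTGate's planner: `run_query` and `shard_index` are hypothetical, and modulo placement is used only to keep the example readable.

```python
from typing import Callable, Optional

Row = dict


def shard_index(user_id: int, shard_count: int) -> int:
    # Toy placement rule (modulo); not the hash Vitess applies.
    return user_id % shard_count


def run_query(
    shards: list[list[Row]],
    predicate: Callable[[Row], bool],
    user_id: Optional[int] = None,
) -> tuple[list[Row], int]:
    """Return matching rows plus the number of shards that had to be queried."""
    if user_id is not None:
        targets = [shard_index(user_id, len(shards))]  # single-shard route
    else:
        targets = list(range(len(shards)))             # scatter to every shard
    rows = [row for i in targets for row in shards[i] if predicate(row)]
    return rows, len(targets)


shards: list[list[Row]] = [[] for _ in range(4)]
for order_id, user_id in enumerate([1, 2, 3, 4, 5, 5, 6, 7], start=1):
    shards[shard_index(user_id, 4)].append({"id": order_id, "user_id": user_id})

wants_user_5 = lambda row: row["user_id"] == 5
keyed_rows, keyed_cost = run_query(shards, wants_user_5, user_id=5)
scatter_rows, scatter_cost = run_query(shards, wants_user_5)
```

Both calls return the same two orders, but the keyed call touches one shard while the scatter call touches all four, and that gap widens with every shard added.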
The Fix
The Vitess Migration Playbook: Four Phases
Shopify's Vitess migration was carefully sequenced into phases that minimized risk at each step. Phase 1 (Vitessifying) validated the Vitess stack without sharding. Phase 2 (dual connectivity) validated that the application could talk to VTGate alongside the existing system. Phase 3 (keyspace splitting) separated tables into users and global keyspaces. Phase 4 (sharding) performed the actual horizontal split of the users keyspace by user_id. Each phase produced a stable, production-validated state before the next phase began — the classic incremental risk management strategy.
- 4 phases — Migration phases: Vitessify → dual connectivity → keyspace split → horizontal shard — each independently production-validated before proceeding
- user_id — Sharding key — ensures all data for a user lives on the same shard, making user-scoped queries single-shard with no cross-shard joins for most operations
- 0 app changes — Application code changes required to complete the sharding — VTGate's MySQL protocol compatibility meant only the connection string changed
- 1000 IDs — VTTablet sequence cache size — each shard pre-fetches 1000 globally-unique IDs from the Sequences table to avoid per-insert writes to the sequence source
-- Simplified Vitess VSchema for Shopify's two-keyspace architecture.
-- VSchema tells VTGate how to route queries to shards.
-- The "--" lines are annotations; strip them before loading the JSON.

-- USERS keyspace: sharded by user_id.
-- Every user-owned table uses user_id as its Primary VIndex (shard key),
-- hashed by the "hash" vindex. orders.id is filled from a Vitess Sequence
-- that lives in the unsharded global keyspace (GLOBAL_KEYSPACE here).
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "orders": {
      "columnVindexes": [
        { "column": "user_id", "name": "hash" }
      ],
      "autoIncrement": {
        "column": "id",
        "sequence": "GLOBAL_KEYSPACE.orders_seq"
      }
    },
    "user_preferences": {
      "columnVindexes": [
        { "column": "user_id", "name": "hash" }
      ]
    }
  }
}

-- GLOBAL keyspace: unsharded (no user_id).
-- Merchant data, category data, and other cross-user lookups are accessed
-- without a sharding key. The sequence table backing orders.id is declared here.
{
  "sharded": false,
  "tables": {
    "merchants": {},
    "categories": {},
    "orders_seq": { "type": "sequence" }
  }
}
ℹ️
Schema Migrations Across Multiple Shards
Running ALTER TABLE on a sharded Vitess cluster requires coordination: the DDL must be applied to all shards, and the application must not attempt to query the new schema until all shards have confirmed completion. Shopify built tooling to track migration status across all shards and only allow the Rails schema dump (used to verify the schema is as expected) after all shards reported completion. Without this coordination, a Rails schema check on a partially-migrated cluster could return an inconsistent view.
THE SHOPIFY FIRST: VITESS IN PRODUCTION
The Shop app backend was Shopify's first production deployment of Vitess. This meant the team was building operational knowledge from scratch — learning Vitess's failure modes, monitoring requirements, and operational procedures while also executing a live migration. The careful phasing of the migration (Vitessify first, shard second) was in part a strategy to build this operational experience incrementally rather than learning all of Vitess's complexity at once.
✅
VTGate: MySQL Protocol Transparency
VTGate's most valuable property for application developers is that it speaks the standard MySQL wire protocol. Any MySQL client — including ActiveRecord, the ORM that powers Rails — can connect to VTGate without modification. From the application's perspective, VTGate is just another MySQL server. The sharding logic, the shard topology, the cross-shard routing — all invisible to the application layer.
✅
ProxySQL to VTGate: The Connection String Change
The Shop app had previously been using ProxySQL as its database proxy — a standard approach for MySQL connection pooling and query routing. Replacing ProxySQL with VTGate was the connection-layer change that made Vitess integration possible. From the application's perspective, both ProxySQL and VTGate speak the MySQL wire protocol; the change was transparent to Rails. The dual connectivity phase let the team validate VTGate behavior alongside ProxySQL before fully committing.
VITESS RESOURCE ALLOCATION
One operational detail that surprised the Shopify team: VTTablet requires significant resource allocation. Vitess's own rule of thumb is allocating an equal number of CPUs to VTTablet as to the mysqld process it runs alongside. Memory consumption for VTTablet is generally low, but CPU requirements are substantial — VTTablet handles connection pooling, health checking, query execution, and replication management. Underprovisioning VTTablet creates a bottleneck in the query path that can limit the effective throughput of the underlying MySQL instance.
Architecture
Vitess's architecture introduces two new components between the application and MySQL: VTGate (the stateless query router, deployed as multiple replicas for high availability) and VTTablet (a sidecar process running alongside each MySQL instance). The application connects to VTGate using a standard MySQL connection. VTGate consults the VSchema (Vitess Schema — a configuration document that describes how keyspaces and shards are organized and which columns are used as sharding keys) to determine which shard a query should target, then forwards it to the appropriate VTTablet. The MySQL instances themselves are unchanged — they continue running as standard MySQL servers with replication configured for high availability.
[Diagram, interactive version on TechLogStack: Vitess Architecture: Rails App → VTGate → Sharded MySQL]
[Diagram, interactive version on TechLogStack: Migration Phases: From KateSQL to Sharded Vitess]
CONNECTION POOLING: AN OFTEN-OVERLOOKED VITESS BENEFIT
Beyond sharding, VTTablet provides connection pooling at the database level. A Rails application with 100 Puma worker threads might open 100 MySQL connections — and 100 application instances might open 10,000. VTTablet multiplexes these connections to a much smaller pool against the actual MySQL process. At Shopify's scale, this connection efficiency is a meaningful resource saving in addition to the sharding capability.
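As a rough illustration of the multiplexing idea, the Python sketch below pushes the 10,000 client-side connections from the example above through a small bounded pool. It is not VTTablet's pooling code: `BoundedPool` is hypothetical, the pool size of 50 is an arbitrary assumption, and requests run one after another to keep the example deterministic.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out a fixed set of backend connections and takes them back after use."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._idle: queue.Queue = queue.Queue()
        for n in range(size):
            self._idle.put(f"mysql-conn-{n}")  # placeholder for a real connection

    @contextmanager
    def connection(self, timeout: float = 1.0):
        conn = self._idle.get(timeout=timeout)  # wait if every connection is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


client_connections = 100 * 100  # 100 app instances x 100 threads, as in the text
pool = BoundedPool(size=50)
backend_connections_used = set()
for _ in range(client_connections):
    with pool.connection() as conn:
        backend_connections_used.add(conn)
```

However many clients arrive, MySQL itself never sees more than `pool.size` connections; the trade-off is that callers may wait when the pool is exhausted.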
ℹ️
Sequence Caching: Trading Latency for Throughput
VTTablet's sequence ID caching (set at 1000 in Shopify's production config) is a throughput-versus-latency tradeoff. Without caching, every INSERT requires a roundtrip to the Sequences table in the global keyspace to get the next ID, adding latency to every write. With a cache of 1000 IDs, 999 out of every 1000 INSERTs get their ID from the local cache, and only every 1000th INSERT requires a roundtrip. IDs can have gaps in the sequence after a server restart (cached-but-unused IDs are lost) but remain globally unique.
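The arithmetic behind this tradeoff can be written out directly. The helper names below are hypothetical, and the model is deliberately simplified: one roundtrip per cache refill, and a restart that discards whatever is left in the current block.

```python
import math


def sequence_roundtrips(inserts: int, cache_size: int) -> int:
    """Roundtrips to the Sequences table needed to serve `inserts` IDs."""
    return math.ceil(inserts / cache_size)


def ids_lost_on_restart(cache_size: int, used_from_block: int) -> int:
    """Size of the gap left when a restart discards the rest of a cached block."""
    return cache_size - used_from_block


uncached = sequence_roundtrips(1_000_000, cache_size=1)
cached = sequence_roundtrips(1_000_000, cache_size=1000)
gap = ids_lost_on_restart(cache_size=1000, used_from_block=250)
```

For a million inserts, the cache reduces a million roundtrips to a thousand. The price is the gap: a restart after 250 IDs of a block have been used skips the remaining 750, which is harmless for uniqueness but means IDs are not contiguous.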
ℹ️
VTOrc: Automated Topology Management
In a sharded Vitess cluster, managing primary/replica failover manually across dozens of shards would be operationally prohibitive. VTOrc (Vitess Orchestrator, an automated MySQL topology manager integrated into Vitess) handles this: when a shard's primary fails, VTOrc detects the failure, promotes the best available replica, and updates VTGate's routing table, keeping the cluster available without operator intervention.
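The shape of that failover step can be sketched as follows. This is not VTOrc's algorithm: the source does not say how "best available" is decided, so this sketch assumes it means the healthy replica that has applied the most replication events, and `Replica`, `pick_new_primary`, and `fail_over` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    applied_position: int  # assumed stand-in for replication progress


def pick_new_primary(replicas: list[Replica]) -> Optional[Replica]:
    """Choose the healthy replica that is furthest along in replication."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.applied_position)


def fail_over(routing: dict[str, str], shard: str, replicas: list[Replica]) -> dict[str, str]:
    """Return a new routing table with the shard pointed at the promoted replica."""
    promoted = pick_new_primary(replicas)
    if promoted is None:
        raise RuntimeError(f"no healthy replica available for shard {shard}")
    updated = dict(routing)
    updated[shard] = promoted.name
    return updated


routing = {"-80": "db-a-primary", "80-": "db-b-primary"}
replicas_for_80 = [
    Replica("db-b-replica-1", healthy=True, applied_position=980),
    Replica("db-b-replica-2", healthy=True, applied_position=1000),
    Replica("db-b-replica-3", healthy=False, applied_position=1005),
]
new_routing = fail_over(routing, "80-", replicas_for_80)
```

Only the failed shard's entry changes; the other shard keeps serving from its existing primary throughout.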
Lessons
Shopify's Vitess migration demonstrates that horizontal database sharding doesn't have to mean rewriting your application. With the right proxy architecture, the sharding is in the infrastructure — invisible to the application layer.
- 01. Vitessify before you shard. Adding Vitess to an existing MySQL database without sharding (Vitessifying) is a safe, low-risk first step that validates the Vitess stack and builds operational knowledge before attempting the more complex sharding migration. Shopify's phased approach reflects this: get comfortable with Vitess on one shard before splitting into many.
- 02. Choose your sharding key (the column value used to determine which shard a row belongs to — the most important architectural decision in horizontal sharding because it determines data locality and query routing) carefully and early. user_id was the right choice for Shopify's user-centric application: it distributes data evenly, keeps user data colocated on one shard, and makes user-scoped queries single-shard. A bad sharding key creates hot shards, cross-shard joins, and an architecture that fights itself.
- 03. Auto-increment IDs break in sharded systems. Every sharded application needs a strategy for globally unique IDs. Vitess Sequences, UUIDs, Snowflake IDs — the choice matters for performance, sortability, and debuggability. Don't discover this problem during your sharding migration; design for it before migration begins.
- 04. Schema migrations on sharded clusters require explicit cross-shard coordination. Any tooling that inspects or depends on schema state must be sharding-aware. Rails's schema dump, ActiveRecord migrations, and ORM schema introspection all need to understand that schema changes must be applied to all shards before the application can assume they've taken effect.
- 05. A dynamic connection switcher that allows gradual traffic migration is the safety mechanism that makes production sharding migrations recoverable. Being able to route 1% → 5% → 25% → 100% of traffic through the new system, with instant rollback by setting the percentage back to 0%, is the difference between a migration you can execute confidently and one that requires a maintenance window.
⚠️
VSchema Maintenance: An Ongoing Obligation
The VSchema must be updated every time the database schema changes. A new table needs a VSchema entry defining its sharding key. A new index needs evaluation for VIndex configuration. Vitess amplifies the schema change process: what was previously a single DDL operation now requires DDL plus a VSchema update, coordinated across all shards. Teams adopting Vitess need processes and tooling to ensure VSchema updates are not overlooked during schema migrations.
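One way to keep VSchema updates from being overlooked is a check that runs alongside migrations. The Python below is a minimal sketch of such a check, assuming the VSchema has been loaded as a dictionary shaped like the example earlier in this article; the function names and the `wishlists` table are hypothetical.

```python
def missing_vschema_tables(db_tables: set[str], vschema: dict) -> set[str]:
    """Tables that exist in the database but have no VSchema entry."""
    return set(db_tables) - set(vschema.get("tables", {}))


def sharded_tables_without_vindex(vschema: dict) -> set[str]:
    """In a sharded keyspace, tables whose entry defines no column vindex."""
    if not vschema.get("sharded"):
        return set()  # unsharded keyspaces need no sharding key
    return {
        name
        for name, spec in vschema.get("tables", {}).items()
        if not spec.get("columnVindexes")
    }


users_vschema = {
    "sharded": True,
    "tables": {
        "orders": {"columnVindexes": [{"column": "user_id", "name": "hash"}]},
        "user_preferences": {},  # entry added, sharding key forgotten
    },
}
database_tables = {"orders", "user_preferences", "wishlists"}

unregistered = missing_vschema_tables(database_tables, users_vschema)
unkeyed = sharded_tables_without_vindex(users_vschema)
```

A check like this could fail a build when `unregistered` or `unkeyed` is non-empty, turning a silent routing problem into a visible migration error.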
VITESS AS SHOPIFY STANDARD
Following the Shop app success, Shopify has been expanding Vitess adoption to other services. The first deployment built the organizational knowledge and tooling (custom Rails integration, dynamic connection switcher, cross-shard schema migration tooling) that makes subsequent deployments faster and safer. Infrastructure investments compound: the second Vitess deployment benefits from all the work done during the first.
Shopify added horizontal database sharding to a Rails app, and the app continued insisting there was only one database — which is either a beautiful abstraction or a comfortable lie, and honestly both.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.