- KateSQL → Vitess — single MySQL instance to horizontally sharded Vitess cluster
- user_id as sharding key — all of a user's data lives on the same shard; most queries never cross shards
- VTGate speaks standard MySQL wire protocol — Rails connection string changed; application code unchanged
- 4 phases — Vitessify → dual connectivity → keyspace split → horizontal shard; each production-validated before next
- Dynamic connection switcher — gradual traffic migration, adjustable at runtime without a deploy
- Shopify's first production Vitess deployment — built organisational competency from scratch
The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding — but they had a Rails monolith that expected a single database, and a system that couldn't have downtime during a commerce platform used by millions daily.
The Story
Shopify launched the Shop app in April 2020, giving consumers a personalised browsing and checkout experience across Shopify's merchant network. By 2023, the Shop app had achieved remarkable growth — and its backend database was approaching the scaling ceiling that every fast-growing application eventually hits. The database powering the Shop backend, running on Shopify's internal managed MySQL system called KateSQL, was a single MySQL instance. Single-instance databases have a hard vertical limit: no matter how much you upgrade the hardware, there's a maximum amount of data and queries per second one machine can handle. Horizontal sharding was the only path forward, and Shopify's team chose Vitess (an open-source MySQL scaling system developed at YouTube that adds horizontal sharding, connection pooling, and query routing on top of standard MySQL) to execute it.
Vitess has a deceptively clean architecture at the application level: applications connect to VTGate (Vitess's query routing proxy — a stateless service that accepts MySQL connections from applications, parses queries, and routes them to the correct shard based on the query's sharding key) as if it were a regular MySQL server. VTGate speaks the MySQL wire protocol, so applications need only update their database connection string. Queries are then routed by VTGate to the appropriate VTTablet (a Vitess process that runs alongside each MySQL instance and manages the connection pool, health checks, and query execution for that shard). From the application's perspective, there is one database. From the infrastructure's perspective, there are many. This transparency is what makes Vitess viable for a Rails monolith — the application code doesn't change, only the database topology.
Vitessifying is our internal terminology for the process of transforming an existing MySQL into a keyspace in a Vitess cluster. This allows us to start using core Vitess functionality without explicitly moving data.
— Shopify Engineering, via 'Horizontally scaling the Rails backend of Shop app with Vitess'
Problem
Single-Instance Database Approaching Its Ceiling
The Shop app's backend was scaling rapidly but its database was a single MySQL instance. Vertical scaling had diminishing returns and a hard ceiling. The engineering team needed horizontal sharding to support continued growth without database-induced bottlenecks.
Cause
Rails Monolith Expected One Database
Shopify's Shop backend was a Rails application that expected a single primary database connection. Introducing sharding without a transparent proxy would require extensive application-level changes to route queries to the correct shard — a significant refactoring risk. The alternative was a transparent proxy that handled sharding invisibly.
Solution
Vitess + Dynamic Connection Switcher
The migration proceeded in four phases: first Vitessifying (adding VTTablet and VTGate without sharding), then dual connectivity (validating VTGate alongside the existing system), then splitting tables into user and global keyspaces, then horizontally sharding the user keyspace by user_id. A dynamic connection switcher allowed gradual traffic migration, with the percentage adjustable without a deploy.
Result
Horizontally Scalable, App Unchanged
The Shop app backend gained horizontal scalability via Vitess sharding without requiring the application to understand sharding. The connection string changed; the application code did not. Shopify can now add shards as the Shop app continues to grow without additional application-level changes.
The Fix
The Vitess Migration Playbook: Four Phases
Shopify's Vitess migration was carefully sequenced into phases that minimised risk at each step. Each phase produced a stable, production-validated state before the next phase began.
- 4 phases — Vitessify → dual connectivity → keyspace split → horizontal shard; each independently production-validated
- user_id — sharding key; all data for a user lives on the same shard, making user-scoped queries single-shard
- 0 app changes — VTGate's MySQL protocol compatibility meant only the connection string changed
- 1000 IDs — VTTablet sequence cache size; pre-fetches globally-unique IDs to avoid per-insert writes to the sequence source
-- Simplified Vitess VSchema for Shopify's two-keyspace architecture
-- VSchema tells VTGate how to route queries to shards
-- USERS keyspace: sharded by user_id
-- All user-owned tables have user_id as the Primary VIndex (shard key)
-- {
-- "sharded": true,
-- "vindexes": { "hash": { "type": "hash" } },
-- "tables": {
-- "orders": {
-- "columnVindexes": [
-- { "column": "user_id", "name": "hash" } -- shard on user_id
-- ],
-- "autoIncrement": {
-- "column": "id",
-- "sequence": "GLOBAL_KEYSPACE.orders_seq" -- globally unique IDs
-- }
-- }
-- }
-- }
-- VTGate routing example: how user_id determines the shard
-- hash(user_id=12345) mod num_shards = shard index
-- All data for user 12345: always goes to the same shard
-- Cross-shard query example (scatter-gather — avoid where possible):
SELECT * FROM orders WHERE status = 'pending'; -- no user_id = scatter
-- Becomes 4 queries (one per shard) + merge — expensive
-- Single-shard query example (preferred):
SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';
-- VTGate routes to exactly one shard — no scatter
The schema migration challenge was particularly subtle. When running DDL (Data Definition Language — SQL statements like ALTER TABLE that change database structure) operations on a sharded Vitess cluster, all shards must apply the migration and complete before the Rails application can query the new schema. If the migration completes on some shards but not others, a Rails schema check might get an inconsistent view. Shopify built custom tooling to track migration status across all shards and only allow the Rails schema dump after all shards confirmed completion.
The auto-increment problem in sharded systems
Rails applications default to using auto-incrementing integer primary IDs — a database counter that increments on each insert. In a sharded system, multiple shards generating auto-increment IDs independently would produce duplicate IDs across shards. Vitess solves this with a Sequences table in the unsharded global keyspace: VTTablets cache blocks of IDs from the Sequences table and distribute them, ensuring globally unique IDs across all shards. The cache size of 1000 IDs per VTTablet reduces the per-ID write overhead while maintaining uniqueness.
Dynamic connection switching: gradual traffic migration
Rather than a hard cutover from KateSQL to VTGate, Shopify built a dynamic connection switcher that allowed them to gradually route increasing percentages of traffic through VTGate while monitoring for performance differences. Starting at a small percentage and slowly ramping to 100% gave the team confidence in VTGate's behaviour under real production load before fully committing. The percentage was adjustable at runtime without a code deploy — giving operators immediate control during the migration window.
Cross-shard queries: the scatter-gather problem
When a query cannot be routed to a single shard — because it lacks a sharding key or spans multiple shards — Vitess performs a scatter-gather operation: it sends the query to all shards and aggregates the results. Scatter-gather is more expensive than single-shard queries. Shopify reviewed the Shop app's query patterns to identify scatter queries and either added sharding keys to make them single-shard or moved the accessed data into the global keyspace. Unhandled scatter queries can become performance bottlenecks at scale.
Architecture
Vitess introduces two new components between the application and MySQL: VTGate (the stateless query router, deployed as multiple replicas for high availability) and VTTablet (a sidecar process running alongside each MySQL instance). The application connects to VTGate using a standard MySQL connection. VTGate consults the VSchema (Vitess Schema — a configuration document describing how keyspaces and shards are organised and which columns are used as sharding keys) to determine which shard a query should target, then forwards it to the appropriate VTTablet. The MySQL instances themselves are unchanged — they continue running as standard MySQL servers.
Vitess Architecture: Rails App → VTGate → Sharded MySQL
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Migration Phases: From KateSQL to Sharded Vitess
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Lessons
Vitessify before you shard. Adding Vitess to an existing MySQL database without sharding is a safe, low-risk first step that validates the Vitess stack and builds operational knowledge before attempting the more complex sharding migration. Get comfortable with Vitess on one shard before splitting into many.
Choose your sharding key (the column value used to determine which shard a row belongs to — the most important architectural decision in horizontal sharding) carefully and early. user_id was the right choice for Shopify's user-centric application: it distributes data evenly, keeps user data colocated on one shard, and makes user-scoped queries single-shard. A bad sharding key creates hot shards, cross-shard joins, and an architecture that fights itself.
Auto-increment IDs break in sharded systems. Every sharded application needs a strategy for globally unique IDs. Vitess Sequences, UUIDs, Snowflake IDs — the choice matters for performance, sortability, and debuggability. Don't discover this problem during your sharding migration; design for it before migration begins.
Schema migrations on sharded clusters require explicit cross-shard coordination. Any tooling that inspects or depends on schema state must be sharding-aware. Rails's schema dump, ActiveRecord migrations, and ORM schema introspection all need to understand that schema changes must be applied to all shards before the application can assume they've taken effect.
A dynamic connection switcher that allows gradual traffic migration is the safety mechanism that makes production sharding migrations recoverable. Being able to route 1% → 5% → 25% → 100% of traffic through the new system, with instant rollback by setting the percentage back to 0%, is the difference between a migration you can execute confidently and one that requires a maintenance window.
Engineering Glossary
KateSQL — Shopify's internal managed MySQL system. The single-instance database that powered the Shop app backend before the Vitess migration.
Keyspace — a Vitess concept for a logical database that can span one or more shards. Shopify uses two keyspaces: a sharded 'users' keyspace (all user-owned tables, sharded by user_id) and an unsharded 'global' keyspace (cross-user lookup data accessible without a sharding key).
Scatter-gather — a Vitess query execution pattern where a query lacking a sharding key is sent to all shards in parallel, with results aggregated before returning. Correct but expensive — multiplies database load by the number of shards. Should be minimised during migration by adding sharding keys or moving data to the global keyspace.
Sharding key — the column value used to determine which shard a row belongs to. The most important architectural decision in horizontal sharding. A good key (like user_id) distributes data evenly and keeps related data on the same shard. A bad key creates hot shards and cross-shard joins.
VSchema — Vitess Schema: a configuration document that describes how keyspaces and shards are organised, which columns are used as sharding keys, and how globally unique IDs are generated. Must be updated whenever the database schema changes.
VTGate — Vitess's stateless query routing proxy. Accepts standard MySQL connections from applications, parses queries, and routes them to the correct shard based on the VSchema. Speaks the MySQL wire protocol — applications connect to VTGate identically to connecting directly to MySQL.
VTOrc — Vitess Orchestrator: an automated MySQL topology manager integrated into Vitess that detects primary shard failures and promotes replicas automatically, maintaining high availability without manual operator intervention.
VTTablet — a Vitess sidecar process running alongside each MySQL instance. Manages the connection pool, health checks, query execution, and replication management for that shard. Also handles sequence ID caching for globally unique primary keys.
Vitessifying — Shopify's term for the process of transforming an existing MySQL database into a Vitess keyspace without immediately sharding. Adds VTTablet and VTGate, validates the integration, and builds operational familiarity before the more complex sharding changes.
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack →
(Interactive diagrams, source links, and the full reader experience)
TechLogStack — built at scale, broken in public, rebuilt by engineers.
Top comments (0)