TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened

#database #shopify #analytics #backend

Shopify · Databases · 17 May 2026

The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding, but it had a Rails monolith that expected a single database and a commerce platform, used by millions daily, that could not afford downtime.

KateSQL → Vitess migration
user_id as sharding key
VTGate transparent to app
Dynamic connection switcher
Zero downtime cutover
Jan 2024 blog published

The Story

Shopify launched the Shop app in April 2020, giving consumers a personalized browsing and checkout experience across Shopify's merchant network. By 2023, the Shop app had grown to the point where its backend database was approaching the scaling ceiling that every fast-growing application eventually hits. The database powering the Shop backend ran on KateSQL, Shopify's internal managed MySQL system, as a single MySQL instance. Single-instance databases have a hard vertical limit: no matter how much you upgrade the hardware, there is a maximum amount of data and queries per second one machine can handle. Horizontal sharding was the only path forward, and Shopify's team chose Vitess to execute it. Vitess is an open-source MySQL scaling system developed at YouTube that adds horizontal sharding, connection pooling, and query routing on top of standard MySQL.

Vitess has a deceptively clean architecture at the application level: applications connect to VTGate (Vitess's query routing proxy — a stateless service that accepts MySQL connections from applications, parses queries, and routes them to the correct shard based on the query's sharding key) as if it were a regular MySQL server. VTGate speaks the MySQL wire protocol, so applications need only update their database connection string. Queries are then routed by VTGate to the appropriate VTTablet (a Vitess process that runs alongside each MySQL instance and manages the connection pool, health checks, and query execution for that shard), which communicates directly with the underlying MySQL process. From the application's perspective, there is one database. From the infrastructure's perspective, there are many. This transparency is what makes Vitess viable for a Rails monolith like Shopify's — the application code doesn't change, only the database topology.
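
To make the routing idea concrete, here is a minimal Python sketch of how a proxy could map a sharding key to a shard. It is not Vitess's implementation: SHA-256 stands in for the hash function that the real hash vindex uses, and the four-shard layout and the `keyspace_id` and `route` helpers are hypothetical names chosen for illustration.

```python
import hashlib

# Hypothetical layout: four shards, each owning an equal slice of the
# 64-bit keyspace ID range. The names follow Vitess's range notation.
SHARDS = ["-40", "40-80", "80-c0", "c0-"]


def keyspace_id(user_id: int) -> int:
    """Map a user_id to a 64-bit keyspace ID.

    Assumption: SHA-256 is a stand-in; Vitess's hash vindex uses a
    different function.
    """
    digest = hashlib.sha256(user_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def route(user_id: int) -> str:
    """Return the shard that owns every row for this user."""
    # The top two bits of the keyspace ID select one of four equal ranges.
    return SHARDS[keyspace_id(user_id) >> 62]


if __name__ == "__main__":
    for uid in (1, 2, 42):
        print(uid, "->", route(uid))
```

Because the shard is a pure function of the key, every query that carries the same user_id lands on the same shard, which is what lets the application keep treating the cluster as one database.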

🔑

Shopify chose user_id as the sharding key for the Shop app's user-owned data. Almost all tables in the database are associated with a user, so user_id was a natural choice — it distributes data evenly, ensures all of a user's data lives on the same shard, and keeps user-scoped queries on a single shard without cross-shard joins.

THE VITESSIFYING PHASE

Shopify coined the term 'Vitessifying' for the process of transforming an existing MySQL database into a Vitess keyspace without immediately sharding. In this first phase, a VTTablet is added alongside each MySQL process, and the application is reconfigured to connect through VTGate — but all data still lives on a single shard. This allows the team to validate Vitess integration, test VTGate routing, and gain operational familiarity with Vitess before making the more complex sharding changes.

Problem

Single-Instance Database Approaching Its Ceiling

The Shop app's backend was scaling rapidly but its database was a single MySQL instance. Vertical scaling had diminishing returns and a hard ceiling. The engineering team needed horizontal sharding to support continued growth without database-induced bottlenecks.

Cause

Rails Monolith Expected One Database

Shopify's Shop backend was a Rails application that, like most Rails apps, expected a single primary database connection. Introducing sharding without a transparent proxy would require extensive application-level changes to route queries to the correct shard — a significant refactoring risk. The alternative was a transparent proxy that handled sharding invisibly.

Solution

Vitess + Dynamic Connection Switcher

The migration proceeded in phases: first Vitessifying (adding VTTablet and VTGate without sharding), then adding application-layer VTGate connectivity, then splitting tables into the users and global keyspaces, then horizontally sharding the users keyspace by user_id. A dynamic connection switcher allowed gradual traffic migration from the old system to VTGate, with the percentage adjustable without a deploy.

Result

Horizontally Scalable, App Unchanged

The Shop app backend gained horizontal scalability via Vitess sharding without requiring the application to understand sharding. The connection string changed; the application code did not. Shopify can now add shards as the Shop app continues to grow without additional application-level changes.

⚠️

The Auto-Increment Problem in Sharded Systems

Rails applications default to auto-incrementing integer primary keys, a database feature that generates unique IDs by incrementing a counter. In a sharded system, shards generating auto-increment IDs independently would produce duplicate IDs across shards. Vitess solves this with a Sequences table in an unsharded keyspace: VTTablets cache blocks of IDs from the Sequences table and hand them out, ensuring globally unique IDs across all shards. The cache size of 1000 IDs per VTTablet reduces the per-ID write overhead while maintaining uniqueness.
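
The block-caching idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vitess code: `SequenceTable` and `CachedAllocator` are hypothetical names, and an in-process lock stands in for the database transaction that would guard the real counter.

```python
import threading


class SequenceTable:
    """Stand-in for the sequence row in the unsharded keyspace."""

    def __init__(self, start: int = 1) -> None:
        self._next_id = start
        self._lock = threading.Lock()
        self.reservations = 0  # how often the shared counter was touched

    def reserve(self, count: int) -> range:
        """Atomically hand out a block of `count` consecutive IDs."""
        with self._lock:
            first = self._next_id
            self._next_id += count
            self.reservations += 1
            return range(first, first + count)


class CachedAllocator:
    """Stand-in for a component that caches a block of IDs locally."""

    def __init__(self, table: SequenceTable, cache_size: int = 1000) -> None:
        self._table = table
        self._cache_size = cache_size
        self._block = iter(())  # empty until the first request

    def next_id(self) -> int:
        try:
            return next(self._block)
        except StopIteration:
            # Local block exhausted: reserve a fresh one from the shared table.
            self._block = iter(self._table.reserve(self._cache_size))
            return next(self._block)


if __name__ == "__main__":
    table = SequenceTable()
    a, b = CachedAllocator(table), CachedAllocator(table)
    ids = [a.next_id() for _ in range(1500)] + [b.next_id() for _ in range(1500)]
    print("unique:", len(set(ids)) == len(ids), "reservations:", table.reservations)
```

The trade-off is visible in the sketch: the shared counter is touched once per block rather than once per insert, but IDs are only unique, not gap-free or strictly ordered across allocators.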

The schema migration challenge was particularly subtle. Schema migrations are DDL (Data Definition Language) operations: SQL statements like ALTER TABLE that change database structure rather than data. On a sharded Vitess cluster, every shard must apply the migration and complete it before the Rails application can query the table schema. If the migration completes on some shards but not others, a Rails query checking the schema might get an inconsistent view and trigger a dump of a potentially incorrect schema. Shopify's solution was to track migrations across all shards and trigger schema dumps only after all shards confirmed completion. This required custom Rails tooling to coordinate with Vitess's sharding topology.

🧩

Two Keyspaces: Users and Global

Shopify split the Shop app data into two keyspaces (a Vitess concept for a logical database that can span one or more shards): a sharded 'users' keyspace containing all user-owned tables (sharded by user_id), and an unsharded 'global' keyspace for data that doesn't belong to individual users and must be accessed without a sharding key. This two-keyspace architecture is the standard pattern for Vitess migrations: shard what scales with users, keep globally-accessed lookup data unsharded.

Vitessifying is our internal terminology for the process of transforming an existing MySQL into a keyspace in a Vitess cluster. This allows us to start using core Vitess functionality without explicitly moving data.

Shopify Engineering, via 'Horizontally scaling the Rails backend of Shop app with Vitess'

ℹ️

Dynamic Connection Switching: Gradual Traffic Migration

Rather than a hard cutover from KateSQL to VTGate, Shopify built a dynamic connection switcher that allowed them to gradually route increasing percentages of traffic through VTGate while monitoring for performance differences. Starting at a small percentage and slowly ramping to 100% gave the team confidence in VTGate's behavior under real production load before fully committing. The percentage was adjustable at runtime without a code deploy — giving operators immediate control during the migration window.
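
A percentage-based switcher of this kind might look like the following Python sketch. It is a guess at the general shape, not Shopify's implementation: the `ConnectionSwitcher` class, the `LEGACY` and `VTGATE` labels, and the `settings` dictionary standing in for a runtime config store are all hypothetical.

```python
import random
from typing import Callable, Optional

LEGACY = "legacy"  # hypothetical label for the existing connection path
VTGATE = "vtgate"  # hypothetical label for the new VTGate connection


class ConnectionSwitcher:
    """Send a configurable percentage of requests through VTGate."""

    def __init__(self, get_percentage: Callable[[], float],
                 rng: Optional[random.Random] = None) -> None:
        # get_percentage is called on every decision, so a change in the
        # backing store takes effect immediately, without a deploy.
        self._get_percentage = get_percentage
        self._rng = rng if rng is not None else random.Random()

    def choose(self) -> str:
        percentage = min(100.0, max(0.0, self._get_percentage()))
        # random() is in [0.0, 1.0): 0 never picks VTGate, 100 always does.
        return VTGATE if self._rng.random() * 100.0 < percentage else LEGACY


if __name__ == "__main__":
    settings = {"vtgate_percentage": 5.0}  # stands in for a config store
    switcher = ConnectionSwitcher(lambda: settings["vtgate_percentage"])
    for pct in (5.0, 50.0, 100.0):
        settings["vtgate_percentage"] = pct  # ramp up at runtime
        sample = [switcher.choose() for _ in range(10_000)]
        print(pct, sample.count(VTGATE) / len(sample))
```

Reading the percentage on every decision is what makes the ramp reversible: setting it back to 0 returns all traffic to the old path as quickly as the config store propagates.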

ℹ️

Shopify's First Vitess Production Deployment

The Shop app backend migration was Shopify's first deployment of Vitess in production. This wasn't just a database migration — it was building organizational competency with a new database infrastructure layer from scratch. The team had to learn Vitess's operational model, its failure modes, its monitoring requirements, and its configuration nuances simultaneously with executing a live migration. Phasing the migration was in part a strategy to build this knowledge incrementally.

⚠️

Cross-Shard Queries: The Scatter-Gather Problem

When a query cannot be routed to a single shard, because it lacks a sharding key or spans multiple shards, Vitess performs a scatter-gather operation: it sends the query to all shards and aggregates the results. Scatter-gather is more expensive than a single-shard query because every shard does work for it. Shopify's engineering team reviewed the Shop app's query patterns to identify scatter queries and either added sharding keys to make them single-shard or moved the data they accessed into the global keyspace. Unhandled scatter queries can become performance bottlenecks at scale.
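
The cost difference can be shown with a toy Python model. Everything here is an assumption made for illustration: the shards are in-memory lists, modulo stands in for hash-based routing, and `find_orders` is a hypothetical helper that reports how many shards it had to touch.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]
NUM_SHARDS = 4


def shard_for(user_id: int) -> int:
    # Assumption: modulo stands in for the real hash-based routing.
    return user_id % NUM_SHARDS


def build_shards(rows: List[Row]) -> List[List[Row]]:
    """Distribute rows across shards by their user_id."""
    shards: List[List[Row]] = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[shard_for(row["user_id"])].append(row)
    return shards


def find_orders(shards: List[List[Row]], user_id: Optional[int] = None,
                min_total: int = 0) -> Tuple[List[Row], int]:
    """Return matching rows and the number of shards that were queried."""
    if user_id is not None:
        targets = [shard_for(user_id)]     # sharding key present: one shard
    else:
        targets = list(range(NUM_SHARDS))  # no sharding key: scatter to all
    results: List[Row] = []
    for index in targets:
        for row in shards[index]:
            if user_id is not None and row["user_id"] != user_id:
                continue
            if row["total"] >= min_total:
                results.append(row)        # gather step: merge per-shard rows
    return results, len(targets)


if __name__ == "__main__":
    shards = build_shards([{"user_id": u, "total": u * 10} for u in range(1, 9)])
    print(find_orders(shards, user_id=5))
    print(find_orders(shards, min_total=60))
```

In this model, adding the user_id filter cuts the work from every shard to one, which is the effect of adding a sharding key to a query.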

The Fix

The Vitess Migration Playbook: Four Phases

Shopify's Vitess migration was carefully sequenced into phases that minimized risk at each step. Phase 1 (Vitessifying) validated the Vitess stack without sharding. Phase 2 (dual connectivity) validated that the application could talk to VTGate alongside the existing system. Phase 3 (keyspace splitting) separated tables into users and global keyspaces. Phase 4 (sharding) performed the actual horizontal split of the users keyspace by user_id. Each phase produced a stable, production-validated state before the next phase began — the classic incremental risk management strategy.

4 phases — Migration phases: Vitessify → dual connectivity → keyspace split → horizontal shard — each independently production-validated before proceeding
user_id — Sharding key — ensures all data for a user lives on the same shard, making user-scoped queries single-shard with no cross-shard joins for most operations
0 app changes — Application code changes required to complete the sharding — VTGate's MySQL protocol compatibility meant only the connection string changed
1000 IDs — VTTablet sequence cache size — each shard pre-fetches 1000 globally-unique IDs from the Sequences table to avoid per-insert writes to the sequence source

-- Simplified Vitess VSchema for Shopify's two-keyspace architecture
-- VSchema tells VTGate how to route queries to shards
-- (JSON has no comment syntax, so the notes sit outside each document)

-- USERS keyspace: sharded by user_id
-- All user-owned tables have user_id as the Primary Vindex (shard key),
-- mapped to a shard by the "hash" vindex. orders.id is filled from a
-- Vitess Sequence that lives in the unsharded global keyspace, so the
-- primary key is globally unique.
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        { "column": "user_id", "name": "hash" }
      ],
      "auto_increment": {
        "column": "id",
        "sequence": "GLOBAL_KEYSPACE.orders_seq"
      }
    },
    "user_preferences": {
      "column_vindexes": [
        { "column": "user_id", "name": "hash" }
      ]
    }
  }
}

-- GLOBAL keyspace: unsharded (no user_id)
-- Merchant data, category data, other cross-user lookups accessed without
-- a sharding key, plus the sequence table that backs orders.id
{
  "sharded": false,
  "tables": {
    "merchants": {},
    "categories": {},
    "orders_seq": { "type": "sequence" }
  }
}

ℹ️

Schema Migrations Across Multiple Shards

Running ALTER TABLE on a sharded Vitess cluster requires coordination: the DDL must be applied to all shards, and the application must not attempt to query the new schema until all shards have confirmed completion. Shopify built tooling to track migration status across all shards and only allow the Rails schema dump (used to verify the schema is as expected) after all shards reported completion. Without this coordination, a Rails schema check on a partially-migrated cluster could return an inconsistent view.
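
The gating logic can be sketched as a small tracker that refuses to run the schema dump until every shard has reported completion. This is a minimal Python illustration, not Shopify's Rails tooling: `MigrationTracker`, `MigrationState`, and the shard names are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class MigrationState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


class MigrationTracker:
    """Track one schema migration across every shard of a keyspace."""

    def __init__(self, shards: Iterable[str]) -> None:
        self._states: Dict[str, MigrationState] = {
            shard: MigrationState.PENDING for shard in shards
        }

    def report(self, shard: str, state: MigrationState) -> None:
        if shard not in self._states:
            raise KeyError(f"unknown shard: {shard}")
        self._states[shard] = state

    def all_complete(self) -> bool:
        return all(s is MigrationState.COMPLETE for s in self._states.values())

    def dump_schema_if_ready(self, dump: Callable[[], None]) -> bool:
        """Run the schema dump only once every shard has finished."""
        if not self.all_complete():
            return False  # a partially migrated cluster: do not dump yet
        dump()
        return True


if __name__ == "__main__":
    tracker = MigrationTracker(["-80", "80-"])
    tracker.report("-80", MigrationState.COMPLETE)
    print(tracker.dump_schema_if_ready(lambda: print("dumping schema")))
    tracker.report("80-", MigrationState.COMPLETE)
    print(tracker.dump_schema_if_ready(lambda: print("dumping schema")))
```

A failed or still-running shard keeps the gate closed, so the dump never reflects a schema that only some shards have.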

THE SHOPIFY FIRST: VITESS IN PRODUCTION

The Shop app backend was Shopify's first production deployment of Vitess. This meant the team was building operational knowledge from scratch — learning Vitess's failure modes, monitoring requirements, and operational procedures while also executing a live migration. The careful phasing of the migration (Vitessify first, shard second) was in part a strategy to build this operational experience incrementally rather than learning all of Vitess's complexity at once.

✅

VTGate: MySQL Protocol Transparency

VTGate's most valuable property for application developers is that it speaks the standard MySQL wire protocol. Any MySQL client — including ActiveRecord, the ORM that powers Rails — can connect to VTGate without modification. From the application's perspective, VTGate is just another MySQL server. The sharding logic, the shard topology, the cross-shard routing — all invisible to the application layer.

✅

ProxySQL to VTGate: The Connection String Change

The Shop app had previously been using ProxySQL as its database proxy — a standard approach for MySQL connection pooling and query routing. Replacing ProxySQL with VTGate was the connection-layer change that made Vitess integration possible. From the application's perspective, both ProxySQL and VTGate speak the MySQL wire protocol; the change was transparent to Rails. The dual connectivity phase let the team validate VTGate behavior alongside ProxySQL before fully committing.

VITESS RESOURCE ALLOCATION

One operational detail that surprised the Shopify team: VTTablet requires significant resource allocation. Vitess's own rule of thumb is to allocate as many CPUs to VTTablet as to the mysqld process it runs alongside. Memory consumption for VTTablet is generally low, but CPU requirements are substantial, because VTTablet handles connection pooling, health checking, query execution, and replication management. Underprovisioning VTTablet creates a bottleneck in the query path that can limit the effective throughput of the underlying MySQL instance.

Architecture

Vitess's architecture introduces two new components between the application and MySQL: VTGate (the stateless query router, deployed as multiple replicas for high availability) and VTTablet (a sidecar process running alongside each MySQL instance). The application connects to VTGate using a standard MySQL connection. VTGate consults the VSchema (Vitess Schema — a configuration document that describes how keyspaces and shards are organized and which columns are used as sharding keys) to determine which shard a query should target, then forwards it to the appropriate VTTablet. The MySQL instances themselves are unchanged — they continue running as standard MySQL servers with replication configured for high availability.

Vitess Architecture: Rails App → VTGate → Sharded MySQL

Migration Phases: From KateSQL to Sharded Vitess