loading...

Sharding a multi-tenant app with Postgres

craigkerstiens profile image Craig Kerstiens Originally published at citusdata.com ・5 min read

Whether you’re building marketing analytics, a portal for e-commerce sites, or an application to cater to schools, if you’re building an app and your customer is another business then a multi-tenant approach is the norm. The same code runs for all customers, but each customer sees their own private data set, except in the cases of holistic internal reporting.

Early in your application’s life, customer data has a simple structure which evolves organically. Typically all information relates to a central customer/user/tenant table. With a smaller amount of data (10’s of GB) it’s easy to scale the application by throwing more hardware at it, but what happens when you’ve had so much success that your data no longer fits in memory on a single box, or you need more concurrency? You scale out by re-architecting your application–and it’s often painful (and expensive.)

Options for scaling out your database

This scale-out model for databases has worked well for the likes of Google and Instagram, but doesn't have to be as complicated as you might think.

If you're able to model your multi-tenant data in the right way sharding can be much simpler–you do not need to re-architect your application to scale out, and youcan keep the power you need from a database including joins, indexing, and more. I work at Citus Data, where we’ve created a database that scales out Postgres (an extension to Postgres, actually): we’ve done the hard work of sharding so you don’t have to. While Citus lets you scale out your processing power and memory and storage, how you model your data will determine the ease and flexibility you get from the system. If you're building a multi-tenant SaaS application, hopefully the following example highlights how you can plan early for scaling without having to contort too much of your application.

Scaling out Postgres with ease by adopting a multi-tenant data model

At the core of most SaaS applications, tenancy is already built in, whether you realize it or not. By “tenancy”, we mean the notion that your SaaS application has multiple customers (“tenants”) who are all sharing the same application but whose data needs to be kep separate from each other. (The same way that multiple tenants can live in the same building, but each have their own separate apartment.)

Anyway, as we mentioned above, you may have a users table. Let's look at a very basic SaaS schema that highlights this:

CREATE TABLE stores (
  id UUID,
  owner_email VARCHAR(255),
  owner_password VARCHAR(255),
  name VARCHAR(255),
  url VARCHAR(255),
  last_login_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ
)

CREATE TABLE products (
  id UUID,
  name VARCHAR(255),
  description TEXT,
  price INTEGER,
  quantity INTEGER,
  store_id UUID,
  created_at TIMESTAMPTZ,
  updated_at TIMESTAMPTZ
)

CREATE TABLE purchases (
  id UUID,
  product_id UUID,
  customer_id UUID,
  store_id UUID,
  price INTEGER,
  purchased_at TIMESTAMPTZ,
)

The above schema highlights an overly simplified multi-tenant e-commerce site. Say for example, someone like an Etsy or Shopify. And of course there are a number of queries you would run against this:

List the products for a particular store:

SELECT id, 
       name,
       price
FROM products
WHERE store_id = foo;

Or let’s say you want to compute how many purchases exist weekly for a given store:

SELECT date_trunc('week', purchased_at),
       sum(price * quantity)
FROM purchases,
     stores
WHERE stores.id = products.stores_id
  AND store_id = ‘foo

From here you could envision how to give each store its own presence and analytics. Now if we fast-forward a bit and start to look at scaling this out then we have a choice to make on how we'll shard the data.

The easiest level to do this at is the tenant level or in this case on store_id. With the above data model the largest tables over time are likely to be products and purchases, we could shard on both of these. Though if we choose products or purchases, the difficulty lies in the fact that we may want to do queries that focus on some high level item such as store. If we choose store_id then all data for a particular store would exist on the same node, this would allow you push down all computations directly to the a single node.

Multi-tenancy and co-location, a perfect pair

Co-locating data within the same physical instance avoids sending data over the network during joins. This can result in much faster operations. With Citus, there are a number of ways to move your data around so you can join and query it in a flexible manner, but for this class of multi-tenant SaaS apps it’s simple if you can ensure data ends up on the shard. To do this though we need to push down our store_id to all of our tables.

The key that makes this all possible is including your store_id on all tables. By doing this you can easily shard out all your data so it’s located on the same shard. In the above data model we coincidentally had store_id on all of our tables, but if it weren’t there you could add it. This would put you in a good position to distribute all your data so it’s stored on the same nodes. So now let’s try sharding our tenants, in this case stores:

SELECT master_create_distributed_table('stores', 'id', 'hash');
SELECT master_create_distributed_table('products', 'store_id', 'hash');
SELECT master_create_distributed_table('purchases', 'store_id', 'hash');

SELECT master_create_worker_shards('stores', 16);
SELECT master_create_worker_shards('products', 16);
SELECT master_create_worker_shards('purchases', 16);

Now you’re all set. Again, you’ll notice that we shard everything by store_id–this allows all queries to be routed to a single Postgres instance. The same queries as before should work just fine for you as long as you have store_id on your query. An example layout of your data now may look something like:

Example of multi-tenant sharded tables

The alternative to colocation is to choose some lower-level shard key such as orders or products. This has a trade-off of making joins and querying more difficult because you have to send more data over the network and make sure things work in a distributed way. This lower-level key can be useful for consumer focused datasets, if your analytics are always against the entire data set as is often the case in metrics-focused use cases.

To scale out your Postgres database, you have more good choices than you think.

It’s important to note as we did in our sharding JSON in Postgres post that different distribution models can have different benefits and trade-offs. In some cases modeling on a lower-level entity id such as products or purchases can be the right choice– you gain more parallelism for analytics and trade off simplicity in querying a single store. Either choice of picking a multi-tenant data model or adopting a more distributed document model can be made to scale, but each comes with its own trade-offs. If you have the need today to scale out your multi-tenant app then give Citus Cloud a try or if you have any questions on which type of scale-out data model might work best for your application, please don’t hesitate to reach out to my team at Citus. We can help. (And did we mention that Citus is available as open source, as a database service in AWS, and on-prem?)

Originally posted on the Citus Data blog

Discussion

markdown guide
 

Does Citus have capabilities (or are there other utilities for) sharding at the schema level?

I made a trade-off of doing multi-tenancy at the schema level to make security easier. Tenants run under a role which only has access to its own schema. Versus having to make sure I add WHERE tenant_id = ... to everything. That and tenant data is easier to isolate for maintenance, backup/restore. I figured if we started outgrowing a single node, I could do some manual sharding at first. You know, maintaining a map of which customer schema is located on which node. I can use map-reduce to query across nodes. Then I can figure out an automated strategy from there. That is the plan anyway. Comments?

 

We don't have anything to help with that, primarily because schemas at a higher level of scale can introduce a number of problems. A high number of schemas can cause issues with backups, query time (as the planner has to prune off how to route the query), and then things like migrations which were once trivial can now take hours or longer even. There's actually a good article by the authors of a popular gem about some of the learnings from this approach - influitive.io/our-multi-tenancy-jo.... As a result we have created our own gem which is focused on helping with this use case without having to manage all the logic of ensuring which tenant you're querying (github.com/citusdata/activerecord-...).

 

Thanks for the article. I thought through most of these issues prior to choosing schemas-based. I have already designed in most of the mitigations mentioned in the article. I'm also using UUIDs to identify everything, so moving to column-based is on the table if scale gets large enough.

For me there is an additional wrinkle with column-based, however. Migrations may actually be worse. We store most of the data as deltas in one main table and other tables are built from those. Migrating is literally dropping table(s), recreating, and replaying deltas. We can do that in parallel per node. But as column-based, loading the deltas in order of occurrence will be essentially random-access across nodes. Seems like that will be far slower.

Edit Nvm, reviewing my notes, I had planned to move deltas to another type of DB at that point. It wouldn't be a concern for Postgres.