Strategies for Using PostgreSQL as a Database for Multi-Tenant Services

#cloudnative #opensource #multitenancy #postgres

It is the second time in the past couple of years that I am privileged to be a part of a team that is building a resilient multi-tenant SaaS infrastructure. The definition of multi-tenancy is an architecture where a single software instance (that may consist of multiple services / micro services) serves multiple tenants / entities, which can represent either consumers or corporate users of the service.

While building such an infrastructure in a way that it would both have a fast time-to-market and scale to support the business growth introduces a lot of challenges (and not only ones related to software engineering), today I'd like to focus on various strategies that we considered (and, ended up implementing) particularly around the SQL Database part of our service infrastructure.

I will refrain from advocating for use-cases where an SQL Database provides advantages over other database types and assume that the discussion on what database you might need in your service should take place in any case.

Requirements for multi-tenant service database

Most of the time, when dealing with building a multi-tenant infrastructure, my thoughts are revolving around building a system that would scale for thousands of tenants (hopefully), while providing a sufficient level of separation for operational, security and privacy reasons. One of the most important considerations, IMHO, is being able to contain the blast radius in case of either human error or breach. This means that, from the security perspective, I'd always try to restrict database access credentials to specific tenant(s) relevant for the session.
The obvious downside of the above approach is its potential impact on the ability to manage database connections pool and to multiplex sessions for multiple tenants on the same database connection.

Alternative A - Database per tenant

In this alternative, we maintain a separate database for each tenant. It allows defining access credentials that would provide a very hard boundary between data belonging to various tenants.
Whether dealing with separate database clusters (very costly and can complicate the infrastructure deployment) or logical databases within a single cluster (much more feasible to manage), this approach is well suited for environments where a strong separation of data is required.

One potential perk that can be gained if working this way, is the ability to maintain different schema versions for different tenants, implementing a gradual upgrade.

Downsides of this approach are mostly around its operational costs. Imagine, for example, that every schema update needs to be run on every database, as opposed to doing it only once. Also, consider the impact this approach would have on the database connections pool in an environment where your service (and the micro services it consists of) would need to serve a large amount of sessions for multiple tenants in parallel.

Additional challenge is the complexity to maintain cross-tenant data. It would require creating a separate database and then complicating the data access layer that would need to cross-reference data from per-tenant databases with items in a cross-tenant one.

Alternative B - PostgreSQL Schema per tenant

Specific to PostgreSQL, every named database contains another layer of containment for database objects, called schema. According to the documentation, the original goals for defining schemas are around organizing the data and allowing parallel operations without collisions.

PostgreSQL allows schemas to be owned by different database roles, therefore establishing a foundation for per-tenant data access control.

Additionally, it is possible to create cross-references between different schemas, as well as maintaining separate versions for separate tenants.

The operational complexity downside still remains. Introducing updates to database structure will require making changes to tables in all the schemas.

Depending on the nature of your multi-tenant service, there have been some reports about the scalability of this approach, indicating that have a PostgreSQL database cluster with a significantly large amount of database schemas can cause performance issues.

Alternative C1 - Shared tables with column for tenant identification

This approach is common in environments that do not require strong separation of tenant data. There is a single database and single schema containing the data of all tenants, however, every table that contains per-tenant data has a special column that indicates which tenant the row belongs to.

Major downside of this approach is that it requires a very close attention to multi-tenancy in the data access layer implementation of services / micro-services. An inexperienced developer can easily make a mistake of not adding the 'WHERE tenant_id=<...>' to the query and impacting data of multiple tenants. Same is true (and even more concerning) for considering someone hostile taking over your micro-services and obtaining database access credentials. Blast radius in such case is very wide and the impact of such a breach may be catastrophic.

On the positive side, it is very easy to establish such a database infrastructure and to scale it up to a large amount of service tenants.

While this is a popular choice, its downsides (mainly - the face that there is absolutely no server-side enforcement of access separation) have driven some software architects to considering something more advanced.

Alternative C2 - Shared tables with column for tenant identification and row level security policies

Row level security (RLS) in SQL databases is a mechanism that allows managing access (with high granularity) to rows in tables that match a certain expression.

In order to benefit from this capability, row level security must be enabled for the relevant tables ('ALTER TABLE ... ENABLE ROW LEVEL SECURITY'), as well as policy objects specifying the relevant permissions should be defined ('CREATE POLICY ON ').

Particularly in the case where a dedicated field (such as 'tenant_id' from the example above) contains the separating criteria, the using_expression and check_expression parts of the policy should contain the boolean SQL expression.

While permissions can be assigned to different roles, making the access control enforcement very tight, the overall requirements can be relaxed a bit, involving a use of a session variable with a unique value for a context separation. When assigning policies to roles, it is worth noting that the role defined as an owner of the table will not be affected by the policy and it is worth separating ownership from access control.

Summary

All of the alternatives presented above can be used (and, actually, are being used today) in multi-tenant software services. Choosing the relevant option depends on various requirements. Below is a short list of ones that we usually considered in our discussions:

What will be the amount of tenants in your system?
How frequently do we expect to update a database scheme?
How frequently do we expect to delete tenants (and all their data)?
Will our service need to be compliant with SOC 2 Privacy Chapter?
Will our service need to be compliant with ISO 27001?
Will we need to implement privacy operations (GDPR / California Privacy Law - driven)?

In the next posts of the series we will provide more technical details on the implementation.

Please feel free to comment on your thoughts and considerations about the subject.

Kudos to @kostyay and @eldadru for the research that lead to this article.