Adrien Olivon

Posted on Dec 23, 2022

Data Segregation with PostgreSQL

#webdev

How to split data by concern using SQL in a Monolithic application ?

TL;DR: There are many ways to do this, with pros & cons, and I share my thoughts on: Column creation, Database spliting, Row Level Policy, Partitions, Inheritance and Schemas.

Let's imagine our application allows users to search for doctors (like Doctolib). And doctors have one speciality, like General Practitioner, Dentist, etc.

How can we separate the data by country, so that French users search for French doctors, Germans for Germans, etc.?

The base schema is:

CREATE TABLE specialities (
    id    UUID NOT NULL,
   name  TEXT PRIMARY KEY
);

CREATE TABLE doctors (
   name           TEXT PRIMARY KEY,
   speciality_id  UUID NOT NULL,
   CONSTRAINT FOREIGN KEY spe_fk REFERENCES specialities.id
)

⚠️This is not a showcase on how my compagny deal with this kind of issue it's just an opinion in monolitic architecture context.

Why?

I am looking for a solution to separate demo and non-demo data in an *efficient way. One idea is: that we could use the stability of the production environment to show a demo of our product so we could keep as little environments as possible (maybe even only the prod).

*efficient : simple, non invasive for the product rules (⇒ no if demo then do in the code), that does not impact current performances, secure (⇒ no data-leak in demo or demo doctors searchable in prod)

But in general, it can be a way to improve scaling or to comply with business rules:

Horizontal Scaling

Let's say we have 2 kinds of data: Cold and Hot

Cold data is static and is more about business configuration. It only grows with business rules and it often does not need to scale a lot. (e.g. specialities)
Hot data is more user/realm/environment related data; this data grows with application use and the scaling can be an issue. (e.g. doctors)
In general Cold data is required by Hot data and not vice versa.

One solution to have a better scaling is to identify Hot data and group it by trait. e.g. doctors can be grouped by country so doctolib.com become ⇒ doctolib.fr + doctolib.de + doctolib.it because French user search only for French doctors.
When searching for doctors we would deal with 3 times less doctors.

Business rules

It's a common product requirement to be able to separate data:

for internal use: you may need an isolated dataset for testing/demo purpose, like UAT environment
for external use: your users should have access to subpart of the data. e.g. doctors can be grouped by country ⇒ french users must see only French doctors

Solutions

Create a column

For each table that we want to separate we add a column (here: country).
Then you just have to add a WHERE condition on each query: WHERE country = {current_country};

CREATE TABLE doctors (
   name           TEXT PRIMARY KEY,
   speciality_id  UUID NOT NULL,
+++ country        TEXT NOT NULL
);

SELECT * FROM doctors WHERE country = {current_country};

Pretty simple and obvious but it's great to remember simple ways too. However perf might go down and business code will always need to add WHERE country conditions on each query (maybe using default scope).

Split databases

We create a database for each group of data, here 3 databases: doctolib_fr, doctolib_de, doctolib_it.

Then depending on your context you can connect to the database and without any additional code each querie will return only data of one group.

CREATE DATABASE doctolib_fr;
CREATE DATABASE doctolib_de;
CREATE DATABASE doctolib_it;

CONNECT doctolib_{locale};

SELECT * FROM doctors;

This solution is heavly separated, every thing is duplicated and it could be good if you are looking for isolation.

It's has also huge downsides: Deployment, monitoring (and tooling in general), version synchronisation, cold data synchronisation.
By breaking the single source of truth principle the cold data is hard to maintain uniformly and can eventually develop a few differences.

Row Level Policy

The idea is to restrict the access on each group of data to one database user using Row Level Policy. In our example user doctolib_fr have access to French doctors only.

Our policies will do the filtering automatically 🎉, so each querie in the application does not need to know the context: no WHERE country = 'fr' everywhere.

CREATE USER doctolib_fr;
CREATE USER doctolib_de;
CREATE USER doctolib_it;

CREATE TABLE doctors (
   name           TEXT PRIMARY KEY,
   speciality_id  UUID NOT NULL,
+++ country        TEXT NOT NULL -- DEFAULT get_coutry_of_current_user
);

ALTER TABLE doctors ENABLE ROW LEVEL SECURITY;

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_fr_policy ON doctors TO doctolib_fr USING (country = 'fr');

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_de_policy ON doctors TO doctolib_de USING (country = 'de');

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_it_policy ON doctors TO doctolib_it USING (country = 'it');

CONNECT as doctolib_{locale};

SELECT * FROM doctors;  -- implicitly SELECT * FROM doctors WHERE country = '{locale}';

EXPLAIN ANALYZE                              QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on doctors  (cost=0.00..19.38 rows=4 width=80) (actual time=0.002..0.002 rows=0 loops=1)
   Filter: (country = 'fr'::text)
 Planning Time: 1.994 ms
 Execution Time: 0.024 ms

In terms of perf it has a lot in common with the first solution Create a column. We don't put explicitly the WHERE conditions in queries but behind the scene it should act the same.

What are Row Level Security ?

Doc from Postgres says that by default you can restrict access to table for a role (user or group) using GRANT e.g. GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;

Row level security can be enabled using ALTER TABLE doctors ENABLE ROW LEVEL SECURITY; then you can restrict access to any row.
Here CREATE POLICY doctors_it_policy ON doctors TO doctolib_it USING (country = 'it'); we give access to role doctolib_it to all rows with country with values equals to 'it'.

Row Level Policies & Partitions

We reuse the previous solution with our users restricted to a group of data and we structure a bit these groups.

We will transform our 'logical' groups into real Postgres partitions. So we have 3 partitions: doctors_fr, doctors_de and doctors_it.

CREATE USER doctolib_fr;
CREATE USER doctolib_de;
CREATE USER doctolib_it;

CREATE TABLE doctors (
   name           TEXT PRIMARY KEY,
   speciality_id  UUID NOT NULL,
+++ country        TEXT NOT NULL -- DEFAULT get_coutry_of_current_user
+++ ) PARTITION BY LIST (country);

ALTER TABLE doctors ENABLE ROW LEVEL SECURITY;

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_fr_policy ON doctors TO doctolib_fr USING (country = 'fr');

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_de_policy ON doctors TO doctolib_de USING (country = 'de');

GRANT SELECT, INSERT, UPDATE, DELETE ON doctors TO doctolib_fr;
CREATE POLICY doctors_it_policy ON doctors TO doctolib_it USING (country = 'it');

+++ CREATE TABLE doctors_fr PARTITION OF doctors FOR VALUES IN ('fr');
+++ CREATE TABLE doctors_de PARTITION OF doctors FOR VALUES IN ('de');
+++ CREATE TABLE doctors_it PARTITION OF doctors FOR VALUES IN ('it');

CONNECT as doctolib_{locale};

SELECT * FROM doctors; -- implicitly SELECT * FROM doctors_{locale};

EXPLAIN ANALYZE                              QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on doctors_fr doctors  (cost=9.78..18.24 rows=4 width=80) (actual time=4.078..4.078 rows=0 loops=1)
   Recheck Cond: (country = 'fr'::text)
   ->  Bitmap Index Scan on doctors_fr_pkey  (cost=0.00..9.78 rows=4 width=0) (actual time=4.076..4.076 rows=0 loops=1)
         Index Cond: (country = 'fr'::text)
 Planning Time: 0.125 ms
 Execution Time: 4.105 ms

Because the partition doctors_fr (which acts like a table) contains only French doctors, when we ask to Postgres to search for French doctors, it can understand that it has only to scan the partition doctors_fr (it's Partition Pruning see doc from Postgres) and that's a good thing for perfs!

What is Table Partitioning ?

Doc from Postgres says that Partitioning refers to splitting large table into smaller physical pieces, with benefits on performance and table managment (DROP TABLE ALTER TABLE DETACH PARTITION)

There are sevral partitioning types: Range, List, Hash ⇒ here we use partition by list : after table definition PARTITION BY LIST (country) and CREATE TABLE doctors_fr PARTITION OF doctors FOR VALUES IN ('fr');

Then we canSELECT FROM doctors_fr; or SELECT FROM doctors_de; to select data scoped by country. AndSELECT FROM doctors; will returns doctors from all country.

Unqualified name, schema and inheritance

For this last solution we will create a main table and for each country 1 table inheriting the main table.
We create our tables with the same names but each one in it's own schema. Schemas will be assigned to user so they can easly find the good table for their jobs.

The main table: SUPER_SCHEMA.doctors and children: FR_SCHEMA.doctors, DE_SCHEMA.doctors, IT_SCHEMA.doctors.
Then we will restrict the access to schema to only one user, so user doctolib_fr will be given the access to FR_SCHEMA.

The next step is to configure search_path (more info below) for each user so they can use the table without specifying the table.
On user doctolib_fr SET search_path TO FR_SCHEMA,public; so he has FR_SCHEMA by default accesible even without prefixing by the schema name.
So he can query SELECT * FROM doctors with same result as SELECT * FROM FR_SCHEMA.doctors.

CREATE USER doctolib_fr;
CREATE USER doctolib_de;
CREATE USER doctolib_it;

CREATE SCHEMA SUPER_SCHEMA;
CREATE SCHEMA FR_SCHEMA;
CREATE SCHEMA DE_SCHEMA;
CREATE SCHEMA IT_SCHEMA;

GRANT USAGE ON SCHEMA FR_SCHEMA TO doctolib_fr;
GRANT USAGE ON SCHEMA DE_SCHEMA TO doctolib_de;
GRANT USAGE ON SCHEMA IT_SCHEMA TO doctolib_it;

CREATE TABLE SUPER_SCHEMA.doctors (
   name           TEXT PRIMARY KEY,
   speciality_id  UUID NOT NULL
);

CREATE TABLE FR_SCHEMA.doctors () INHERITS (SUPER_SCHEMA.doctors);
CREATE TABLE DE_SCHEMA.doctors () INHERITS (SUPER_SCHEMA.doctors);
CREATE TABLE IT_SCHEMA.doctors () INHERITS (SUPER_SCHEMA.doctors);

GRANT ALL ON TABLE FR_SCHEMA.doctors TO doctolib_fr;
GRANT ALL ON TABLE FR_SCHEMA.doctors TO doctolib_de;
GRANT ALL ON TABLE FR_SCHEMA.doctors TO doctolib_it;

CONNECT as doctolib_{locale};
SET search_path TO {locale}_SCHEMA,public;

SELECT * FROM doctors; -- implicitly SELECT * FROM {locale}_SCHEMA.doctors;

EXPLAIN ANALYZE                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on doctors  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.002..0.002 rows=0 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 0.020 ms

In the end we have almost the same behavior as with the Row Level Policies and partitionning but we do not need a country column 🎉.

What is Table Inheritance ?

Doc from Postgres is pretty explicit, you can inherit table schema (columns, constraint, ect) from a table while creating a new one.

CREATE TABLE FR_SCHEMA.doctors() INHERITS (SUPER_SCHEMA.doctors); creates a new table model in FR_SCHEMA with same column & cie that you can query directly to get scoped FR doctors ⇒ SELECT * FROM FR_SCHEMA.doctors;
and if you want to get doctors from all countries SELECT * FROM SUPER_SCHEMA.doctors;

Usage of schemas is not necessary, we could have done : CREATE TABLE doctors_fr () INHERITS (doctors);CREATE TABLE doctors_fr () INHERITS (doctors);

Inheritance also allows to add column or constraint e.g. if FR doctors had a specific tag we could add it: CREATE TABLE doctors_fr (specific_tag TEXT) INHERITS (SUPER_SCHEMA.doctors);

What are Unqualified names and search_path ?

Docs from Postgres says that tables are often referred to by unqualified names because of the syntax database.schema.table when you target a table using only it's name table or here doctors you use unqualified name.

search_path is a client connection configuration variable (Doc from Postgres) you can find it using SHOW search_path; and modify it with SET search_path TO FR_SCHEMA,public; It contains ordered schema that Postgres will use to find unqualified names.

So by adding a schema to this list you make table usable without prefexing the name by the schema. This example is using search_path like an import in a classic programming language.

Here we use unqualified names and search_path to move country acknowledgement from the query to the connection configuration. 🎉

Conclusion

Of course merging solutions is an option, like splitting in 2 databases, Cold and Hot data in separated databases, + using Row Level Security on the Hot database.
Sharding can also be a solution but I'm clueless on this topic so maybe next time. 👋