Original Post: https://www.dylanpaulus.com/posts/easily-generate-mock-data-with-postgresql
When writing articles about PostgreSQL there are times I need to generate large swaths of data to test and demonstrate a topic (for example, these Timescale articles here and here).
It can be a pain to manually insert data or create a CSV and import it into PostgreSQL--which I always have to StackOverflow.
Lately, I ran into a much simpler solution to generating data: the generate_series function.
It's perfectly suited for integers and timestamp data types, streamlining various tasks such as populating test databases or setting up date ranges for reports.
Generating Mock Data
generate_series can be thought of as a for-loop or generator to produce a range of numbers or dates.
To generate a range of numbers, we call generate_series with a start and end value:
SELECT * FROM generate_series(1, 10);
We can specify a step value by providing generate_series as an optional third argument.
SELECT * FROM generate_series(1, 10, 2.5);
Inserting Mock Data
generate_series' power comes to fruition when we need to insert mock data into a table.
We can easily insert a range of numbers into a table by using generate_series in a SELECT statement and then inserting the results into a table.
For the rest of this article, we'll use users table that looks like this for our examples:
CREATE TABLE users (
id serial PRIMARY KEY,
name text NOT NULL,
created_on timestamptz
);
Then, if we wanted to insert five users into the table we could run:
INSERT INTO users (name)
SELECT
'Dylan'
FROM generate_series(1, 5);
This pattern can be used to insert any number of rows into a table.
If you're feeling brave, try inserting 1,000,000 users into the table by changing generate_series(1, 5) to generate_series(1, 1000000)!
Up to this point we are creating a lot of duplicate data.
What if we wanted to create a bunch of users with all distinct names?
We can access the current index of the loop (using as) to generate a unique name for each user.
That would look like this:
INSERT INTO users (name)
SELECT
'Dylan number ' || i
FROM generate_series(1, 5) as i;
Creating Time-Series Data
generate_series is incredibly effective at generating time series data.
This is where I get the most use out of it because it can become extremely tedious to create a lot of data around time ranges.
Similar to the integer example, generate a range of timestamps by providing a start and end value followed by an optional step or interval.
For example, to generate a user for every hour in a week, we could run the following SQL:
INSERT INTO users (name, created_on)
SELECT
'Dylan',
time_hour
FROM generate_series(
TIMESTAMPTZ '2023-11-01',
TIMESTAMPTZ '2023-11-07',
INTERVAL '1 hour'
) as time_hour;
This query creates 146 rows with a user named Dylan every hour between November 1st and November 7th, 2023.
You may be asking yourself, can we use multiple generate_series to populate multiple fields? Yes!
But, we need to use joins to combine the data from the two series.
To quickly create a grid of users and created_on dates we can take the cartesian product of two series, otherwise known as a cross join, by adding another generate_series to the FROM clause.
INSERT INTO users (name, created_on)
SELECT
'Dylan number' || i,
time_hour
FROM
generate_series(1, 5) as i,
generate_series(
TIMESTAMPTZ '2023-11-01',
TIMESTAMPTZ '2023-11-07',
INTERVAL '2 days'
) as time_hour;
In the INSERT statement above, we'll create four time_hour timestamps for every two days in the week for every iteration of i (there are five of them).
Resulting in 4 * 5 = 20 twenty rows created.
To better grasp what's happening here, it'll be easier to look at the SELECT output of the users table.
Using multiple generate_series functions combined with a cartesian product can be a powerful combination for creating unique, time-series data.
Conclusion
PostgreSQL's generate_series is a game-changer for developers like me, who frequently dive into the deep end of data generation for testing and optimizing queries. It's a powerhouse that can effortlessly create anything from a handful to millions of rows, transforming a potentially arduous task into a few simple lines of SQL.
So, when you need to mock up data, let generate_series be your go-to solution.
Happy data generating!






Top comments (0)