Ryan Booz for Tiger Data (Creators of TimescaleDB)

Posted on Dec 13, 2021 • Originally published at blog.timescale.com

How to create (lots!) of sample time-series data with PostgreSQL generate_series()

#postgres #database #analytics #timeseries

Intro to PostgreSQL generate_series()
PostgreSQL generate_series() with dates
How to generate time-series data
Performance considerations when generating sample data
Tips for quickly increasing sample dataset scale
Choosing a date range
Calculating total rows
Let´s review

As the makers of TimescaleDB, we often need to quickly create lots of sample time-series data to demonstrate a new database feature, run a benchmark, or talk about use cases internally. We also see users in our community Slack asking how they can test a feature or decide if their data is a good fit for TimescaleDB.

Although using real data from your current application would be great (and ideal), knowing how to quickly create a representative time-series dataset using varying cardinalities and different lengths of time, is a helpful - and advantageous - skill to have.

Fortunately PostgreSQL provides a built-in function to help us create this sample data using the SQL that we already know and love - no external tools required.

In this three-part blog series, we'll go through a few ways to use the generate_series() function to create large datasets in PostgreSQL, including:

What PostgreSQL generate_series() is and how to use it for basic data generation (similar to our tutorial to Simulate IoT Data)
How to create more realistic looking time-series data with custom PostgreSQL functions
Ways to create complex time-series data, using additional PostgreSQL math functions and JOINs.

By the end of this series, you'll be ready to test almost any PostgreSQL or TimescaleDB feature, create quick datasets for general testing, meetup presentations, demos, and more!

Intro to PostgreSQL generate_series()

generate_series() is a built-in PostgreSQL function that makes it easy to create ordered tables of numbers or dates. The PostgreSQL documentation calls it a Set Returning Function because it can return more than one row.

The function is simple and easy to get started with, taking at least two required arguments to specify the start and stop parameters for the generated data.

SELECT * FROM generate_series(1,5);

This will produce output that looks like this:

generate_series|
---------------|
              1|
              2|
              3|
              4|
              5|

From this first example, we can see that generate_series() returns sequential numbers between a start parameter and stop parameter. When used to generate numeric data, generate_series() will increment the values by 1. However, there is an optional third parameter that can be used to specify the increment length, known as the step parameter.

For example, if we wanted to generate rows that counted by two from 0 to 10, we could use this SQL instead:

SELECT * from generate_series(0,10,2);

 generate_series 
-----------------
               0
               2
               4
               6
               8
              10

PostgreSQL generate_series() with dates

Using generate_series() to produce a range of dates is equally straightforward. The only difference is that the third parameter, the step INTERVALused to increment the date, is required for date generation, as shown here:

SELECT * from generate_series(
    '2021-01-01',
    '2021-01-02', INTERVAL '1 hour'
  );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 2021-01-01 03:00:00+00
 2021-01-01 04:00:00+00
 2021-01-01 05:00:00+00
 2021-01-01 06:00:00+00
 2021-01-01 07:00:00+00
 2021-01-01 08:00:00+00
 2021-01-01 09:00:00+00
 2021-01-01 10:00:00+00
 2021-01-01 11:00:00+00
 2021-01-01 12:00:00+00
 2021-01-01 13:00:00+00
 2021-01-01 14:00:00+00
 2021-01-01 15:00:00+00
 2021-01-01 16:00:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:00:00+00
 2021-01-01 19:00:00+00
 2021-01-01 20:00:00+00
 2021-01-01 21:00:00+00
 2021-01-01 22:00:00+00
 2021-01-01 23:00:00+00
 2021-01-02 00:00:00+00
(25 rows)

Notice that the returned dates are inclusive of the start and stop values, just as we saw with the numeric example before. The reason we got 25 rows (representing 25 hours rather than 24 as you might expect) is that the stop value can be reached using the equal one-hour INTERVAL (the step parameter). As long as the INTERVAL can increment evenly up to the stop date, it will be included.

However, if the step interval resulted in the stop value being skipped over, it will not be included in your output. For example, if we modify the step INTERVAL above to '1 hour 25 minutes', the result only returns 17 rows, the last of which is before the stop value.

SELECT * from generate_series(
    '2021-01-01','2021-01-02', 
    INTERVAL '1 hour 25 minutes'
   );

    generate_series     
------------------------
 2021-01-01 00:00:00+00
 2021-01-01 01:25:00+00
 2021-01-01 02:50:00+00
 2021-01-01 04:15:00+00
 2021-01-01 05:40:00+00
 2021-01-01 07:05:00+00
 2021-01-01 08:30:00+00
 2021-01-01 09:55:00+00
 2021-01-01 11:20:00+00
 2021-01-01 12:45:00+00
 2021-01-01 14:10:00+00
 2021-01-01 15:35:00+00
 2021-01-01 17:00:00+00
 2021-01-01 18:25:00+00
 2021-01-01 19:50:00+00
 2021-01-01 21:15:00+00
 2021-01-01 22:40:00+00
(17 rows)

How to generate time-series data

Now that we understand how to use generate_series(), how do we create some time-series data to insert into TimescaleDB for testing and visualization?

To do this, we utilize a standard feature of relational databases and SQL. Recall that generate_series() is a Set Returning Function that returns a "table" of data (a set) just as if we had selected it from a table. Therefore, just like when we select data from a regular table with SQL, we can add more columns of data using other functions or static values.

You may have done this at some point with a string of text that needed to be repeated for each returned row.

SELECT 'Hello Timescale!' as myStr, * FROM generate_series(1,5);

     myStr     | generate_series 
------------------+-----------------
 Hello Timescale! |               1
 Hello Timescale! |               2
 Hello Timescale! |               3
 Hello Timescale! |               4
 Hello Timescale! |               5
(5 rows)

In this example, we simply added data to the rows being returned from generate_series(). For every row it returned, we added a column with some static text.

But, these added columns don't have to be static data! Using the built-in random() function (for example), we can generate data that starts to look a little more like data we'd see when monitoring computer CPU values (i.e., realistic data to use for our demos and tests!).

SELECT random()*100 as CPU, * FROM generate_series(1,5);

        cpu         | generate_series 
--------------------+-----------------
 48.905450626783775 |               1
  71.94031820213382 |               2
 25.210553719011486 |               3
  19.24163308357194 |               4
  8.434915599133674 |               5
(5 rows)

In this example, generate_series() produced 5 rows of data and for every row, PostgreSQL also executed the random() function to produce a value. With a little creativity, this can become a pretty efficient method for generating lots of data quickly.

Performance considerations when generating sample data

Functions

Before getting too deep into the many ways we can use various functions with generate_series() to create more realistic data, let's tackle one potential downside with functions before your mind starts thinking about a myriad of ways to produce interesting, sample data.

Functions are a powerful database tool, allowing complex code to be hidden behind a standard interface. In the example above, I don't have to know how random() produces a value. I just have to call it in my SQL statement (without arguments) and a random value is returned for every row.

Unfortunately, functions can slow down your query and use lots of resources if you're not careful. This is true any time you call functions in SQL, not just when creating sample data.

The reason for this inefficiency is that scalar functions are executed once for each column and row in which they are used. So, if you produce a set of 1,000 rows with generate_series() and then add on 5 additional columns with data generated by functions, PostgreSQL has to effectively process 5,001 function calls, once to generate the initial series of data which is returned as a set, and 5 times that for each row because of the additional 5 columns.

The main issue is that PostgreSQL has no idea how to determine how much "work" it will take to generate the result from each function in the query. Something like random() required very minimal effort, so calling it 5,000 or even 1 million times won't break the bank for most machines. However, if the function you're calling generates lots of temporary data internally before producing a result, be aware that it could perform more slowly and consume more resources on your database server.

The takeaway here is that nothing comes for free, especially when functions are called once for every column, for every row. Most of the time the data you generate will easily complete in a handful of seconds for tens of millions of rows. But, if you notice a query that generates data taking a long time, consider finding alternative ways to generate data for the columns that require more resources.

Transactions

There is a second caveat to be aware of when using a tool like generate_series(), at least as demonstrated throughout this series. All data is created and inserted as a single transaction. You can put in more time and effort to create functions and methods for breaking up the work, but understand if you create (SELECT) one large set of data with 100 million rows, it's likely to consume a lot of memory and CPU on the server while the process occurs.

This doesn't mean it's broken or any less valid of a method for generating data. However, as we demonstrate additional ways to create more realistic data in parts 2 and 3 of this series, recognize that the more data you create, the more resources you'll need from the server without managing batches and transactions with a more advanced setup.

Alright, let's get back to the fun!

Tips for quickly increasing sample dataset scale

So far we've looked at how generate_series() works and how you can use functions to add additional dynamic content to the output. But how do we quickly get to the scale of tens of millions of rows (or more)?

This is where the concept of a Cartesian product comes into play with databases. A Cartesian product (otherwise known as a CROSS JOIN) takes two or more result sets (rows from multiple tables or functions) and produces a new result set that contains rows equal to the count of the first set multiplied by the count of the second set.

In doing so, the database outputs every row from the first table with the value of the first row from the second table. It does this over and over until all rows in both tables have been iterated. (and yes, if you join more than two tables this way, the process just keeps multiplying, tables processed left to right)

Let's look at a small example using two generate_series() in the same select statement.

SELECT * from generate_series(1,10) a, generate_series(1,2) b;

a |b|
--+-+
 1|1|
 2|1|
 3|1|
 4|1|
 5|1|
 6|1|
 7|1|
 8|1|
 9|1|
10|1|
 1|2|
 2|2|
 3|2|
 4|2|
 5|2|
 6|2|
 7|2|
 8|2|
 9|2|
10|2|

As you can see, PostgreSQL generated 10 rows for the first series and 2 rows for the second series. After processing and iterating over each table consecutively, the query produced a total of 20 rows (10 x 2). For generating lots of data quickly, this is about as easy as it gets!

The same process applies when using generate_series() to return a set of dates. If you (effectively) CROSS JOIN multiple sets (numbers or dates), the total number of rows in the final set will be a product of all sets. Again, for generating sample time-series data, you'd be hard-pressed to find an easier method!

In this example, we generate 12 timestamps an hour apart, a random value representing CPU usage, and then a second series of four values that represent IDs for fake devices. This should produce 48 rows (eg. 12 timestamps x 4 device IDs = 48 rows).

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    '2021-01-01 00:00:00',
    '2021-01-01 11:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;


time               |device_id|cpu_usage          |
-------------------+---------+-------------------+
2021-01-01 00:00:00|        1|0.35415126479989567|
2021-01-01 01:00:00|        1| 14.013393572770028|
2021-01-01 02:00:00|        1|   88.5015939122006|
2021-01-01 03:00:00|        1|  97.49037810105996|
2021-01-01 04:00:00|        1|  50.22781125586846|
2021-01-01 05:00:00|        1|  77.93431470586931|
2021-01-01 06:00:00|        1|  45.73481750582076|
2021-01-01 07:00:00|        1|   70.7999843735724|
2021-01-01 08:00:00|        1|   4.72949831884506|
2021-01-01 09:00:00|        1|  85.29122113229981|
2021-01-01 10:00:00|        1| 14.539664281598874|
2021-01-01 11:00:00|        1|  45.95244258556228|
2021-01-01 00:00:00|        2|  46.41196423062297|
2021-01-01 01:00:00|        2|  74.39903569177027|
2021-01-01 02:00:00|        2|  85.44087332221935|
2021-01-01 03:00:00|        2|  4.329394730750735|
2021-01-01 04:00:00|        2| 54.645873866589056|
2021-01-01 05:00:00|        2|  6.544334492894777|
2021-01-01 06:00:00|        2|  39.05071228953645|
2021-01-01 07:00:00|        2|  71.07264365438404|
2021-01-01 08:00:00|        2|   72.4732704336219|
2021-01-01 09:00:00|        2| 34.533280927542975|
2021-01-01 10:00:00|        2| 26.764760864598003|
2021-01-01 11:00:00|        2|  62.32048879645227|
2021-01-01 00:00:00|        3|  63.01888063314749|
2021-01-01 01:00:00|        3|  21.70606884856987|
2021-01-01 02:00:00|        3|  32.47610779097485|
2021-01-01 03:00:00|        3| 47.565982341726354|
2021-01-01 04:00:00|        3|  64.34867263419619|
2021-01-01 05:00:00|        3|  57.74424991855476|
2021-01-01 06:00:00|        3| 55.593286571750156|
2021-01-01 07:00:00|        3|  36.92650110894995|
2021-01-01 08:00:00|        3| 53.166926049881624|
2021-01-01 09:00:00|        3| 10.009505806123897|
2021-01-01 10:00:00|        3| 58.067700285561585|
2021-01-01 11:00:00|        3|  81.58883725078034|
2021-01-01 00:00:00|        4|   78.1768041898232|
2021-01-01 01:00:00|        4|  84.51505102850199|
2021-01-01 02:00:00|        4| 24.029611792753514|
2021-01-01 03:00:00|        4|  17.08996115345549|
2021-01-01 04:00:00|        4| 29.642690955760997|
2021-01-01 05:00:00|        4|  90.83844806413275|
2021-01-01 06:00:00|        4| 6.5019080489854275|
2021-01-01 07:00:00|        4|    32.336484070672|
2021-01-01 08:00:00|        4|    55.595524107963|
2021-01-01 09:00:00|        4|   97.5442141375293|
2021-01-01 10:00:00|        4|   37.0741925805568|
2021-01-01 11:00:00|        4| 19.093927249791776|

Choosing a date range

Now it's time to put all of the features and concepts together. We've seen how to use generate_series() to create a sample table of data (both numbers and dates), add static and dynamic content to each row, and finally how to join multiple sets of data together to create a deterministic number of rows to create test data.

The final piece of the puzzle is to figure out how to create date ranges that are more dynamic using date math. Once again, PostgreSQL makes this straightforward because date math is handled automatically.

When generating sample data you usually have an idea of the duration of time you want to generate - 1 month, 6 months, or a year - for instance. But calculating the exact start and end timestamps for your use case can get pretty tedious.

Instead, it's often easier to use now() or a static ending timestamp (ie. '2021-06-01 00:00:00'), and then using date math to get the starting timestamp based on your chosen interval. This makes it very easy to generate data for different durations simply by changing the interval. And to be clear, you can go the other direction (picking a start timestamp and adding time), but that can often lead to data in the future.

Let's look at three examples to demonstrate ways of creating dynamic date ranges.

Create 6 months of data with one-hour intervals, ending now()

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    now() - INTERVAL '6 months',
    now(),
    INTERVAL '1 hour'
   ) as time, 
generate_series(1,4) device_id;

Notice that we use the PostgreSQL function now() to automatically choose the ending timestamp, and then use date math (- INTERVAL '6 months') to let PostgreSQL find the starting timestamp for us. With almost no effort we can easily generate weeks, months, or years of time-series data by changing the INTERVAL we subtract from now().

Create 1 year of data with one-hour intervals, ending on a timestamp

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    '2021-08-01 00:00:00' - INTERVAL '6 months',
    '2021-08-01 00:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

This example is the same as the first, but here we specify the timestamp that we want to end on. PostgreSQL can still do the date math for us (subtracting 6 months in this example), but we get control over the exact ending timestamp to use. This is particularly useful when you want the actual time portion of the timestamp to begin and end on even, rounded hours, minutes, or seconds. When we use now(), the timestamp can produce (what feels like) random timestamp causing all of the time-series data to have timestamps that are increments of now().

Create 1 year of data with one-hour intervals, beginning on a timestamp

For this last example, we're going to specify the start timestamp instead. This can be useful when you need to test a specific fiscal period or maybe some logic that deals with the turning of each year. In this case, maybe you just need a month of data but it needs to cross over from one year to the next. In that case, trying to figure out the math of how far back to start and how far to go forward might be more complicated than you want to worry about.

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
    '2020-12-15 00:00:00',
    '2020-12-15 00:00:00' + INTERVAL '2 months',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

Each of these examples uses date math to help you find the appropriate start and end timestamps so that you can easily adjust to create more or less data while still fitting the range profile that you need.

Speaking of calculating how many rows you want to generate… 😉

Calculating total rows

We now have all the tools we need to create datasets of almost any size. For time-series data specifically, it's a straightforward calculation to figure out how many rows your query will generate.

Total Rows = Readings per hour * Total hours * Number of "things" being tracked

Using this formula, we can quickly determine how many rows will be created by changing any combination of the total range (start/end timestamps), how many readings per hour for each item, or the total number of items. Let's look at a couple of examples to get an idea of how quickly you could create many rows of time-series data for your testing scenario.

Range of readings	Length of interval	Number of "devices"	Total rows
1 year	1hour	4	35,040
1 year	10 minutes	100	5,256,000
6 months	5 minutes	1,000	52,560,000

As you can see, the numbers start to add up very quickly.

Let's review

In this first post, we've demonstrated that it's pretty easy to generate lots of data using the PostgreSQL generate_series() function. We also learned that when you select multiple sets (using generate_series() or selecting from tables and functions), PostgreSQL will produce what's known as a Cartesian product, the selection of all rows from all tables - the product of all rows. With this knowledge, we can quickly create large and diverse datasets to test various features of PostgreSQL and TimescaleDB.

For more on how to create, find, and use demo data in tutorials, blog posts, and more, please watch my Creating sample data video:

This was part 1 of Generating sample time-series data three-part series.

In part 2 of this series, we demonstrate how to use custom user-defined functions to create more realistic-looking data to use for testing, including generated text, numbers constrained by a range, and even fake JSON data.

Part 2: Generating more realistic sample time-series data with PostgreSQL generate_series()

Part 3: Coming soon

If you have questions about using generate_series() or have any questions about TimescaleDB, please join our community Slack channel where you'll find an active community and a handful of the Timescale team most days.

If you want to try creating larger sets of sample time-series data using generate_series() and see how the exciting features of TimescaleDB work, sign up for a free 30-day trial or install and manage it on your instances. (You can also learn more by following one of our many tutorials.)

See what's possible when you join relational data in PostgreSQL with time-series superpowers!

DEV Community