In the lifecycle of any application, developers are often asked to create proof-of-concept features, test newly released functionality, and visualize data analysis. In many cases, the available test data is stale, not representative of normal usage, or simply doesn't exist for the feature being implemented. In situations like this, knowing how to quickly create sample time-series data with native PostgreSQL and SQL functions is a valuable skill to draw upon!
In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function, generate_series(), to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.
In part 1 and part 2 of the series, we reviewed how generate_series() works, how to join multiple series using a CROSS JOIN to create large datasets quickly, and finally how to create and use custom PostgreSQL functions as part of the query to generate more realistic values for your dataset. If you haven't used generate_series() much before, we recommend first reading the other two posts. The first one is an intro to the generate_series() function, and the second one shows how to generate more realistic data.
With those skills in hand, you can quickly and easily generate tens of millions of rows of realistic-looking data. Even still, there's one more problem that we hinted at in part 2: all of our data, regardless of how it's formatted or constrained, is still based on the random() function. This means that over thousands or millions of samples, every device we create data for will likely have the same MAX() and MIN() value, and the distribution of random values over millions of rows for each device generally means that all devices will have similar average values.
This third post demonstrates a few methods for creating data that mimics a desired shape or trend. Do you need to simulate time-series values that cycle over time? What about demonstrating a counter value that resets every so often to test the counter_agg hyperfunction? Are you trying to create new dashboards that display sales data over time, influenced by the months of the year when sales ebb and flow?
Below we'll cover all of these examples to provide you with the final building blocks to create awesome sample data for all of your testing and exploration needs. Remember, however, that these examples are just the beginning. Keep playing. Tweak the formulas or add different relational data to influence the values that get generated so that it meets your use case.
Data inception
Time-series data often has patterns. Weather temperatures and rainfall measurements change in a (mostly) predictable way throughout the year. Vibration measurements from an IoT device connected to an air conditioning system usually increase in the summer and decrease in the winter. Manufacturing data that measures the total units produced per hour (and the percentage of defective units) usually follows a pattern based on shift schedules and seasonal demand.
If you want to demonstrate this kind of data without having access to the production dataset, how would you go about it using generate_series()? SQL functions ended up being pretty handy when we discussed different methods for creating realistic-looking data in part 2. Do you think they might help here? 😉
Two options to easily return the row number
Remember, for our purposes we're specifically talking about creating sample time-series data. Every row increases along the time axis, and if we use the multiplication formula from part 1, we can determine how many rows our sample data query will generate. Using built-in SQL functions, we can quickly start manipulating data values that change with the cycle of time. 💥
There are many reasons why it can be helpful to know the ordinal position of each row in a query result. That's why standard SQL dialects have some variation of the row_number() over() window function. This simple, yet powerful, window function returns the row number of each row in a result set, and can utilize the ORDER BY and PARTITION BY keywords to further control how the numbers are assigned.
SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;
              ts               | rownum
-------------------------------+--------
 2022-01-01 00:00:00.000 -0500 |      1
 2022-01-02 00:00:00.000 -0500 |      2
 2022-01-03 00:00:00.000 -0500 |      3
 2022-01-04 00:00:00.000 -0500 |      4
 2022-01-05 00:00:00.000 -0500 |      5
In a normal query, this can be useful for tasks like paging data in a web API when there is a need to consistently return values based on a common partition.
There's one problem though. row_number() over() requires PostgreSQL (and any other SQL database) to process the query results a second time in order to assign the values correctly. Therefore, it's very useful, but also very expensive as datasets grow.
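If you want to see this extra work for yourself, EXPLAIN exposes it as a WindowAgg step in the query plan (the exact plan text varies by PostgreSQL version):

```sql
-- Show the plan for the window-function version of the query
EXPLAIN (COSTS OFF)
SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;

-- The plan includes WindowAgg (and Sort) nodes on top of the
-- Function Scan; a WITH ORDINALITY plan is just the Function Scan.
```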
Fortunately, PostgreSQL helps us once again for our specific use case of generating sample timeseries data.
Through this series of blog posts on generating sample time-series data, we've discussed that generate_series() is a Set Returning Function (SRF). Like the results from a table, set data can be JOINed and queried. Additionally, PostgreSQL provides the WITH ORDINALITY clause that can be applied to any SRF to generate an additional, incrementing BIGINT column. The best part? It doesn't require a second pass through the data in order to generate this value!
SELECT ts AS time, rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
             time              | rownum
-------------------------------+--------
 2022-01-01 00:00:00.000 -0500 |      1
 2022-01-02 00:00:00.000 -0500 |      2
 2022-01-03 00:00:00.000 -0500 |      3
 2022-01-04 00:00:00.000 -0500 |      4
 2022-01-05 00:00:00.000 -0500 |      5
Because it serves our purpose and is more efficient, the remainder of this post will use WITH ORDINALITY. However, remember that you can accomplish the same results using row_number() over() if that's more comfortable for you.
Harnessing the row value
With increasing timestamps and an increasing integer on every row, we can begin to use other functions to create interesting data.
Remember from the previous blog posts that calling a function as part of your query executes the function for each row and returns the value. Just like a regular column, however, we don't have to actually emit that column in the final query results. Instead, the function value for that row can be used in calculating values in other columns.
As an example, let's modify the previous query. Instead of displaying the row number, let's multiply the value by 2. That is, the function value is treated as an input to a multiplication formula.
SELECT ts AS time, 2 * rownum AS rownum_by_two
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
             time              | rownum_by_two
-------------------------------+---------------
 2022-01-01 00:00:00.000 -0500 |             2
 2022-01-02 00:00:00.000 -0500 |             4
 2022-01-03 00:00:00.000 -0500 |             6
 2022-01-04 00:00:00.000 -0500 |             8
 2022-01-05 00:00:00.000 -0500 |            10
Easy enough, right? What else can we do with the row number value?
Counters with reset
Many time-series datasets record values that reset over time, often referred to as counters. The odometer on a car is an example. If you drive far enough, it will "roll over" to zero again and start counting upward. The same is true for many utilities, like water and electric meters, that track consumption. Eventually, the total digits will increment to the point where the counter resets and starts from zero again.
To simulate this with time-series data, we can take the incrementing row number and, after some number of rows, reset the count and start over using the modulus operator (%).
-- This example resets the counter every 10 rows
WITH counter_rows AS (
  SELECT ts,
    CASE WHEN rownum % 10 = 0 THEN 10
         ELSE rownum % 10 END AS row_counter
  FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT ts, row_counter
FROM counter_rows;
              ts               | row_counter
-------------------------------+-------------
 2022-01-07 13:17:46.427 -0500 |           1
 2022-01-07 13:17:47.427 -0500 |           2
 2022-01-07 13:17:48.427 -0500 |           3
 2022-01-07 13:17:49.427 -0500 |           4
 2022-01-07 13:17:50.427 -0500 |           5
 2022-01-07 13:17:51.427 -0500 |           6
 2022-01-07 13:17:52.427 -0500 |           7
 2022-01-07 13:17:53.427 -0500 |           8
 2022-01-07 13:17:54.427 -0500 |           9
 2022-01-07 13:17:55.427 -0500 |          10
 2022-01-07 13:17:56.427 -0500 |           1
 …                             |           …
By putting the CASE statement inside of the CTE, the counter data can be selected more easily to test other functions. For instance, to see how the rate() and delta() hyperfunctions work, we can use time_bucket() to group our 1-second readings into 1-minute buckets.
WITH counter_rows AS (
  SELECT ts,
    CASE WHEN rownum % 10 = 0 THEN 10
         ELSE rownum % 10 END AS row_counter
  FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT time_bucket('1 minute', ts) bucket,
  delta(counter_agg(ts, row_counter)),
  rate(counter_agg(ts, row_counter))
FROM counter_rows
GROUP BY bucket
ORDER BY bucket;
            bucket             | delta | rate
-------------------------------+-------+------
 2022-01-07 13:25:00.000 -0500 |  33.0 |  1.0
 2022-01-07 13:26:00.000 -0500 |  59.0 |  1.0
 2022-01-07 13:27:00.000 -0500 |  59.0 |  1.0
 2022-01-07 13:28:00.000 -0500 |  59.0 |  1.0
 2022-01-07 13:29:00.000 -0500 |  59.0 |  1.0
 2022-01-07 13:30:00.000 -0500 |  26.0 |  1.0
time_bucket() outputs the starting time of each bucket. Based on our date math for generate_series(), this produces four complete buckets of 1-minute aggregated data and two partial buckets: one for the minute we are currently in, and a second for the partial minute that started 5 minutes ago. We can see that the delta correctly calculates the difference between the last and first readings of each bucket, and the rate of change (the increment between each reading) correctly displays a unit of one.
What are some other ways we can use these PostgreSQL functions to generate different shapes of data to help you explore other features of SQL and TimescaleDB quickly?
Increasing trend over time
With the knowledge of how to create an ordinal value for each row of data produced by generate_series(), we can explore other ways of generating useful time-series data. Because the row number value will always increase, we can easily produce a random dataset that always increases over time but has some variability to it. Consider this a very rough representation of daily website traffic over the span of two years.
SELECT ts, (10 + 10 * random()) * rownum AS value FROM generate_series
  ( '2020-01-01'::date
  , '2021-12-31'::date
  , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
In reality, this chart isn't very realistic or representative. Any website that gains and loses upwards of 50% of its viewers per day probably isn't going to have great long-term success. Don't worry, we can do better with this example after we learn about another method for creating shaped data using sine waves.
Simple cycles (sine wave)
Using the built-in sin() and cos() PostgreSQL functions, we can generate data useful for graphing and testing functions that need a predictable data trend. This is particularly useful for testing TimescaleDB downsampling hyperfunctions like lttb or asap. These functions can take tens of thousands (or millions) of data points and return a smaller, but still accurately representative, dataset for graphing.
We'll start with a basic example that produces one row per day, for 30 days. For each row number value, we'll get the sine value that can be used to graph a wave.
-- subtract 1 from the row number for the wave to start
-- at zero radians and produce a more representative chart
SELECT ts,
  cos(rownum - 1) AS value
FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
              ts               |        value
-------------------------------+----------------------
 2021-01-01 00:00:00.000 -0500 |  1.0
 2021-01-02 00:00:00.000 -0500 |  0.5403023058681398
 2021-01-03 00:00:00.000 -0500 | -0.4161468365471424
 2021-01-04 00:00:00.000 -0500 | -0.9899924966004454
 2021-01-05 00:00:00.000 -0500 | -0.6536436208636119
 2021-01-06 00:00:00.000 -0500 |  0.28366218546322625
 …                             |  …
Unfortunately, the graph of this sine wave doesn't look all that appealing. For one month of daily data points, we only have ~6 data points from peak to peak of each wave.
The reason our sine wave is so jagged is that sine and cosine values are measured in radians (based on 𝞹), not degrees. A complete cycle (peak-to-peak) of a sine wave happens from zero to 2*𝞹 (~6.28…). Therefore, every ~6 rows of data will produce a complete period in the wave, unless we find a way to modify that value.
To take control over the sine/cosine values, we need to think about how to modify the data based on the date range and interval (how many rows) and what we want the wave to look like.
This means we need to take a quick trip back to math class to talk about radians.
Math class flashback!
Step back with me for a minute to school and your favorite math subject: Algebra 2 (or Trigonometry, as the case may be). How many hours did you spend working with graph paper (or graphing calculators) determining the amplitude, period, and shift of a sine or cosine graph?
If you reach even further into your memory, you might remember this formula, which allows you to modify the various aspects of a wave:

y = A * sin(B * (X - C)) + D
There's a lot here, I know. Let's primarily focus on the two numbers that matter most for our current use case:
X = the "number of radians", which is the row number in our dataset
B = a value to multiply the row number by, to decrease the "radian" value for each row
(A, C, and D change the height and placement of the wave, but to start, we want to elongate each period and provide more "points" on the line to graph.)
Let's start with a small dataset example, generating cosine data for three months of daily timestamps with no modifications.
SELECT ts,
  cos(rownum) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
              ts               |        value
-------------------------------+----------------------
 2021-01-01 00:00:00.000 -0500 |  0.5403023058681398
 2021-01-02 00:00:00.000 -0500 | -0.4161468365471424
 2021-01-03 00:00:00.000 -0500 | -0.9899924966004454
 2021-01-04 00:00:00.000 -0500 | -0.6536436208636119
 2021-01-05 00:00:00.000 -0500 |  0.28366218546322625
 2021-01-06 00:00:00.000 -0500 |  0.960170286650366
 2021-01-07 00:00:00.000 -0500 |  0.7539022543433046
 2021-01-08 00:00:00.000 -0500 | -0.14550003380861354
 2021-01-09 00:00:00.000 -0500 | -0.9111302618846769
 …                             |  …
 2021-03-29 00:00:00.000 -0400 |  0.9993732836951247
 2021-03-30 00:00:00.000 -0400 |  0.5101770449416689
 2021-03-31 00:00:00.000 -0400 | -0.4480736161291701
In this example, we see ~14 peaks in our wave because there are 90 points of data and, without modification, the wave will have a period (peak-to-peak) every ~6.28 points. To lengthen the cycle, we need to perform some simple division.

[cycle modifying value] = 6.28 / [total intervals (rows) per cycle]
Using the same 3 months of generated daily values, let's see how to modify the data to lengthen the period of the wave.
One cycle per month (30 days)
If we want our daily data to cycle every 30 days, multiply our row number value by 6.28/30.
6.28/30 = .209 (the row number radians modifier)
SELECT ts, cos(rownum * 6.28/30) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
One cycle per quarter (90 days)
6.28/90 = .07 (this is our radians modifier)
SELECT ts, cos(rownum * 6.28/90) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
To modify the overall length of the period, you need to modify the row number value based on the total number of rows in the result and the granularity of the timestamp.
Here are some example values that you can use to modify the wave period based on the interval used with generate_series().
 generate_series() interval | Desired period length | Divide 6.28 by…
----------------------------+-----------------------+-----------------
 daily                      | 1 month               |              30
 daily                      | 3 months              |              90
 hourly                     | 1 day                 |              24
 hourly                     | 1 week                |             168
 hourly                     | 1 month               |             720
 minute                     | 1 hour                |              60
 minute                     | 1 day                 |            1440
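Every value in the last column is just an approximation of 2𝞹 divided by the number of rows per cycle. If you'd rather not hardcode 6.28, you can let PostgreSQL compute the exact modifier with its built-in pi() function. A quick sketch, producing one full cycle per day of hourly data:

```sql
-- 2 * pi() / 24 ≈ 0.2618: one complete cycle every 24 hourly rows
SELECT ts, cos(rownum * 2 * pi() / 24) AS value
FROM generate_series('2021-01-01','2021-01-07',INTERVAL '1 hour')
     WITH ORDINALITY AS t(ts, rownum);
```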
Modifying the wave amplitude and shift
Another tweak we can make to our wave data is to change the amplitude (difference between the min and max peaks) and, as necessary, shift the wave up or down on the Y-axis.
To do this, multiply the cosine value by the maximum value you want the wave to have. For example, we can multiply the monthly cycle data by 10, which changes the overall minimum and maximum values of the data.
SELECT ts, 10 * cos(rownum * 6.28/30) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
Notice that the min/max values now range from -10 to 10.
We can take it one step further by adding a value to the output, which will shift the final values up or down on the Y-axis. In this example, we modified the previous query by adding 10 to the value of each row, which results in values from 0 to 20.
SELECT ts, 10 + 10 * cos(rownum * 6.28/30) AS value
FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
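Shaped data like this is also exactly what the downsampling hyperfunctions mentioned earlier need. As a sketch of how you might feed a generated wave into lttb(), assuming the timescaledb_toolkit extension is installed (the lttb()-plus-unnest() call pattern may vary between toolkit versions):

```sql
-- Downsample ~3 days of 1-minute sine readings (~4,320 points)
-- to a visually representative set of roughly 100 points
WITH sine_data AS (
  SELECT ts, 10 + 10 * sin(rownum * 6.28/1440) AS val
  FROM generate_series(now() - INTERVAL '3 days', now(), INTERVAL '1 minute')
       WITH ORDINALITY AS t(ts, rownum)
)
SELECT time, value
FROM unnest((
  SELECT lttb(ts, val, 100)
  FROM sine_data
));
```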
Why spend so much time showing you how to generate and manipulate sine or cosine wave data, especially when we rarely see repeatable data this smooth in real life?
One of the main advantages of using consistent, predictable data like this in testing is that you can easily tell if your application, charting tools, and query are working as expected. Once you begin adding in unpredictable, reallife data, it can be difficult to determine if the data, query, or application are producing unexpected results. Quickly generating known data with a specific pattern can help rule out errors with the query, at least.
The second advantage of using a known dataset is that it can be used to shape and influence the results of other queries. Earlier in this post, we demonstrated a very simplistic example of increasing website traffic by multiplying the row number and a random value. Let's look at how we can join both datasets to create a better shape for the sample website traffic data.
Better website traffic samples
One of the key takeaways from this series of posts is that generate_series() returns a set of data that can be JOINed and manipulated like data from a regular table. Therefore, we can join together our rough "website traffic" data and our sine wave to produce a smoother, more realistic set of data to experiment with. SQL for the win!
Overall this is one of the more complex examples we've presented, utilizing multiple common table expressions (CTE) to break the various sets into separate tables that we can query and join. However, this also means that you can independently modify the time range and other values to change the data that is generated from this query for your own experimentation.
-- This is the generate_series() data
-- with a "short" date to join with later
WITH daily_series AS (
  SELECT ts, date(ts) AS day, rownum FROM generate_series
    ( '2020-01-01'
    , '2021-12-31'
    , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, "day", and a
-- random value that represents our daily website visits
daily_value AS (
  SELECT ts, day, rownum, random() AS val
  FROM daily_series
  ORDER BY day
),
-- This cosine wave dataset has the same "day" values, which allows
-- it to be joined to daily_value easily. The wave value is used to modify
-- the "website" value by some percentage to smooth it out
-- in the shape of the wave.
daily_wave AS (
  SELECT
    day,
    -- 6.28 radians divided by 180 days (rows) to get
    -- one peak every 6 months (twice a year)
    1 + .2 * cos(rownum * 6.28/180) AS p_mod
  FROM daily_series
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- rownum = a big incremental value for each row to quickly increase "visits" each day
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum AS value
FROM daily_value dv
INNER JOIN daily_wave dw ON dv.day = dw.day
ORDER BY ts;
Without much effort, we are able to generate a time-series dataset, use two different SQL functions, and join multiple sets together to create fun, graphical data. In this example, our traffic peaks twice a year (every ~180 days), during July and late December.
But we don't have to stop there. We can carry our website traffic example one step further by applying just a little more control over how much the data increases or decreases during certain periods.
Once again, relational data to the rescue!
Influence the pattern with relational data
As a final example, let's consider one other type of data that we can include in our queries to influence the final generated values: relational data. Although we've been using data created by generate_series() to produce some fun and interesting sample datasets, we can just as easily JOIN to other data in our database to further manipulate the final result.
There are many ways you could JOIN to and use additional data depending on your use case and the type of time-series data you're trying to mimic. For example:

- IoT data from weather sensors: store the typical weekly temperature highs/lows in a database table and use those values as input to the random_between() function we created in part 2
- Stock data analysis: store the dates for quarterly disclosures and a hypothetical factor that will influence the impact on stock price moving forward
- Sales or website traffic: store the monthly or weekly change observed in a typical sales cycle. Does traffic or sales increase at quarter-end? What about during the end-of-year holiday season?
To demonstrate this, we'll use the fictitious website traffic data from earlier in this post. Specifically, we've decided that we want to see a spike in traffic during June and December.
First, we create a regular PostgreSQL table to store the numerical month (1-12) and a float value which will be used to modify our generated data (up or down). This will allow us to tweak the overall shape for a given month.
CREATE TABLE overrides (
  m_val INT NOT NULL,
  p_inc FLOAT4 NOT NULL
);

INSERT INTO overrides(m_val, p_inc) VALUES
  (1,1.04),  -- 4% residual increase from December
  (2,1),
  (3,1),
  (4,1),
  (5,1),
  (6,1.10),  -- June increase of 10%
  (7,1),
  (8,1),
  (9,1),
  (10,1),
  (11,1.08), -- 8% early shoppers sales/traffic growth
  (12,1.18); -- 18% holiday increase
Using this simple dataset, let's first join it to the "simplistic" query that had randomly growing data over time.
WITH daily_series AS (
  -- a random value that increases over time based on the row number
  SELECT ts, date_part('month',ts) AS m_val, (10 + 10 * random()) * rownum AS value FROM generate_series
    ( '2020-01-01'::date
    , '2021-12-31'::date
    , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum)
)
-- join to the `overrides` table to get the 'p_inc' value
-- for the month of the current row
SELECT ts, value * p_inc AS value FROM daily_series ds
INNER JOIN overrides o ON ds.m_val = o.m_val
ORDER BY ts;
Joining to the overrides table based on the month of each data point, we are able to multiply the percentage increase (p_inc) value with the fake website traffic value to influence the trend of our data during specific time periods.
Combining everything we've learned and taking this example one step further, we can enhance the cosine data query with the same monthly override values to tweak our fake, cyclical time-series data that represents growing website traffic with a more realistic shape.
-- This is the generate_series() data
-- with a "short" date to join with later
WITH daily_series AS (
  SELECT ts, date(ts) AS day, rownum FROM generate_series
    ( '2020-01-01'
    , '2021-12-31'
    , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum)
),
-- This selects the time, "day", and a
-- random value that represents our daily website visits.
-- 'm_val' will be used to join with the 'overrides' table
daily_value AS (
  SELECT ts, day, date_part('month',ts) AS m_val, rownum, random() AS val
  FROM daily_series
  ORDER BY day
),
-- This cosine wave dataset has the same "day" values, which allows
-- it to be joined to daily_value easily. The wave value is used to modify
-- the "website" value by some percentage to smooth it out
-- in the shape of the wave.
daily_wave AS (
  SELECT
    day,
    -- 6.28 radians divided by 180 days (rows) to get
    -- one peak every 6 months (twice a year)
    1 + .2 * cos(rownum * 6.28/180) AS p_mod
  FROM daily_series
)
-- (500 + 20 * val) = 500-520 visits per day before modification
-- p_mod = an adjusted cosine value that raises or lowers our data each day
-- rownum = a big incremental value for each row to quickly increase "visits" each day
-- p_inc = a monthly adjustment value taken from the 'overrides' table
SELECT dv.ts, (500 + 20 * val) * p_mod * rownum * p_inc AS value
FROM daily_value dv
INNER JOIN daily_wave dw ON dv.day = dw.day
INNER JOIN overrides o ON dv.m_val = o.m_val
ORDER BY ts;
Wrapping it up
In this third and final blog post of our series about generating sample time-series datasets, we demonstrated how to add shape and trend to your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using built-in SQL functions and relational data. With a little bit of math mixed in, we learned how to manipulate the pattern of generated data, which is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.
To see some of these examples in action, watch my video on creating realistic sample data:
If you have questions about using generate_series() or have any questions about TimescaleDB, please join our community Slack channel, where you'll find an active community and a handful of the Timescale team most days.
If you want to try creating larger sets of sample time-series data using generate_series() and see how the exciting features of TimescaleDB work, sign up for a free 30-day trial or install and manage it on your own instances. (You can also learn more by following one of our many tutorials.)