Greg Schafer for Tangram Vision

Posted on Apr 30, 2021 • Originally published at tangramvision.com

Creating PostgreSQL Test Data with SQL, PL/pgSQL, and Python

#tutorial #postgres #sql #python

After exploring various ways to load test data into PostgreSQL for my last blog post, I wanted to dive into different approaches for generating test data for PostgreSQL. Generating test data, rather than using static manually-created data, can be valuable for a few reasons:

Writing the logic for generating test data forces you to take a second look at your data model and consider what values are allowed and which values are edge cases.
Tools for generating test data make it easier to set up data per test. I would argue this is better than the alternatives of (a) hand-creating data per test or (b) trying to maintain a single dataset that is used across the entire test suite. The first option is tedious, and the second option can be brittle. As an example, if you're testing an e-commerce website and your test suite uses hard-coded product details and deactivating the product in your test dataset causes many tests to unexpectedly fail, then those tests were reliant on a pre-condition that happened to be satisfied in your test dataset. Generating data per test can make such pre-conditions more explicit and clear, especially for colleagues who inherit your tests and test data in the future.
Unless you already have a large dataset from a production environment or a partner company that you can use (hopefully after anonymization!), generating test data is the only way to get large datasets for benchmarking and load testing.

Similar to the previous article, if you're using an Object-Relational Mapping (ORM) library, then you'll probably create and persist objects into the database using the ORM or use the ORM to dump and restore test data fixtures using JSON or CSV. If you're not using an ORM, the approaches in this article may provide some learning or inspiration for how you can best generate data for your particular testing situation.

Follow Along with Docker

Similar to the previous article, you can follow along using Docker and the scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.30_GeneratingTestDataInPostgreSQL

Unlike the previous article, I've provided a Dockerfile to add Python into the Postgres Docker image so we can run Python inside the PostgreSQL database. As described in the repo's README, you can build the docker image and run examples with:

docker build . --tag=postgres-test-data-blogpost

# The base postgres image requires a password to be set, but we'll just be
# testing locally, so no need to set a strong password.
docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
    --volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
    --volume=$(pwd):/repo \
    postgres-test-data-blogpost -c log_statement=all

The repo contains a variety of files that start with add-data- which demonstrate different ways of loading and generating test data. After the Postgres Docker container is running, you can run add-data- files in a new terminal window with a command like:

docker exec --workdir=/repo postgres \
    psql --host=localhost --username=postgres \
         --file=add-data-insert-random.sql

If you want to interactively poke around the database with psql, use:

docker exec --interactive --tty postgres \
    psql --host=localhost --username=postgres

Sample Schema

For example code and data, I'll use the following simple schema again:

Musical artists have a name
An artist can have many albums (one-to-many), which have a title and release date
Genres have a name
Albums can belong to many genres (many-to-many)

Sample schema relating musical artists, albums, and genres.

Generating Data

Using static datasets has advantages (you know exactly what data is in your database), but they can be tedious to maintain over time and impractical to create if you need a lot of data (e.g. for benchmarking or load testing). Generating data is an alternative approach which lets you define how data should look in one place and then generate and use as much data as you like.

There are a few different tools for generating test data that are worth exploring, from plain ol' SQL to higher-level programming languages like Python.

SQL

If you're like me, you may have started this article not expecting SQL to be capable of generating test data. With [generate_series](https://www.postgresql.org/docs/current/functions-srf.html) and [random](https://www.postgresql.org/docs/current/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE) and a little creativity, however, SQL is well-equipped to generate a variety of data.

To create 5 artists with 8 random hex characters for their names, you can do the following:

INSERT INTO artists (name)
SELECT substr(md5(random()::text), 1, 8) FROM generate_series(1, 5) as _g;

If you want to use random words instead of random hex characters, you can pick words from the system dictionary. I've copied Ubuntu's american-english word list to /usr/share/dict/words in the Docker image, so we just need to load it and pick a word randomly:

-- Temporary tables are only accessible to the current psql session and are
-- dropped at the end of the session.
CREATE TEMPORARY TABLE words (word TEXT);

-- The WHERE clauses excludes possessive words (almost 30k of them!)
COPY words (word) FROM '/usr/share/dict/words' WHERE word NOT LIKE '%''%';

-- Randomly order the table and pick the first result
SELECT * FROM words ORDER BY random() LIMIT 1;

No joke, the first word that the above query returned for me was "bravo". I don't know whether to be encouraged or creeped out.

On a separate note, the dictionary contains words that may be offensive and inappropriate in some settings. If you're pulling test data from the dictionary and don't want these words to pop up in your next demo to customers/bosses, make sure to take appropriate precautions!

Anyway, moving on... using these tools (and a few more), we can generate interesting test data for all of our tables. Comments in the code below explain extra functions and techniques being used.

-- Excerpt from add-data-insert-random.sql in the sample code repo

-- Use 8 random hex chars as the genre name.
INSERT INTO genres (name)
SELECT substr(md5(random()::text), 1, 8) FROM generate_series(1, 5) AS _g;

INSERT INTO artists (name)
SELECT
  -- Pick one random word as the artist name.
  (SELECT * FROM words ORDER BY random() LIMIT 1)
FROM generate_series(1, 4) AS _g;

INSERT INTO albums (artist_id, title, released)
SELECT
  -- Select a random artist from the artists table.
  -- NOTE: random() is only evaluated once in this subquery unless it depends on
  -- the outer query, hence the "_g*0" after random().
  (SELECT id FROM artists ORDER BY random()+_g*0 LIMIT 1),

  -- Select the first 1-3 rows after randomly sorting the word list, then join
  -- them with spaces between each word and capitalize the first letter of each
  -- word.
  initcap(array_to_string(array(
    SELECT * FROM words ORDER BY random()+_g*0 LIMIT ceil(random() * 3)
  ), ' ')),

  -- Subtract between 0-5 years from today as the album release date.
  (now() - '5 years'::interval * random())::date
FROM generate_series(1, 8) AS _g;

-- Assign a random album a random genre. Repeat 10 times.
INSERT INTO album_genres (album_id, genre_id)
SELECT
  (SELECT id FROM albums ORDER BY random()+_g*0 LIMIT 1),
  (SELECT id FROM genres ORDER BY random()+_g*0 LIMIT 1)
FROM generate_series(1, 10) AS _g
-- If we insert a row that already exists, do nothing (don't raise an error)
ON CONFLICT DO NOTHING;

But that's not all! We can define functions in SQL to reuse logic — if we want genres, artist names, and album titles to all be random words, then we can move random-word-picking into a function and use it in many places:

-- Excerpt from add-data-insert-random-function.sql in the sample code repo
CREATE OR REPLACE FUNCTION generate_random_title(num_words int default 1) RETURNS text AS $$
  SELECT initcap(array_to_string(array(
    SELECT * FROM words ORDER BY random() LIMIT num_words
  ), ' '))
$$ LANGUAGE sql;

INSERT INTO genres (name)
SELECT generate_random_title()
FROM generate_series(1, 5) AS _g;

INSERT INTO artists (name)
-- Generate 1-2 random words as the artist name.
SELECT generate_random_title(ceil(random() * 2 + _g * 0)::int)
FROM generate_series(1, 4) AS _g;

-- ...

PL/pgSQL

If the declarative style of SQL is awkward/difficult, we can turn to PL/pgSQL to generate test data in PostgreSQL using a more procedural/imperative programming style. PL/pgSQL provides familiar programming concepts like variables, conditionals, loops, return statements, and exception handling.

To demonstrate some of what PL/pgSQL can do, let's specify some more requirements for our generated data — roughly half of our artists should have names starting with "DJ" and all albums by DJ artists should belong to an "Electronic" genre. That implementation might look like:

-- Excerpt from add-data-plpgsql-insert.sql in the sample code repo
DO $$
DECLARE
  -- Declare (and optionally assign) variables used in the below code block.
  genre_options text[] := array['Hip Hop', 'Jazz', 'Rock', 'Electronic'];
  artist_name text;
  dj_album RECORD;
BEGIN
  -- Convert each array option into a row and insert them into genres table.
  INSERT INTO genres (name) SELECT unnest(genre_options);

  FOR i IN 1..8 LOOP
    SELECT generate_random_title(ceil(random() * 2)::int) INTO artist_name;
    -- About 50% of the time, add 'DJ ' to the front of the artist's name.
    IF random() > 0.5 THEN
      artist_name = 'DJ ' || artist_name;
    END IF;
    INSERT INTO artists (name)
    SELECT artist_name;
  END LOOP;

  -- ...

  -- Ensure all albums by a 'DJ' artist belong to the Electronic genre.
  FOR dj_album IN
    SELECT albums.* FROM albums
    INNER JOIN artists ON albums.artist_id = artists.id
    WHERE artists.name LIKE 'DJ %'
  LOOP
    RAISE NOTICE 'Ensuring DJ album % belongs to Electronic genre!', quote_literal(dj_album.title);
    INSERT INTO album_genres (album_id, genre_id)
    SELECT dj_album.id, (SELECT id FROM genres WHERE name = 'Electronic')
    -- If we insert a row that already exists, do nothing (don't raise an error)
    ON CONFLICT DO NOTHING;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

As you can see in the above code snippet, PL/pgSQL lets us:

Test conditions with IF statements (which can have ELSIF and ELSE blocks or alternately be represented with CASE statements),
Loop over a range of integers with FOR i IN 1..8 LOOP (which can loop in reverse or with a step),
Loop over rows from a query, as in the FOR dj_album IN ... example above,
Print helpful log statements with RAISE,
and do all the above in a performant way, because the client can send the whole code block to the server to execute, rather than serializing and sending each statement to the server one at a time as it would with raw SQL.

There's much more to learn about PL/pgSQL than I can cover here in a reasonable amount of space, but hopefully the above provides some insight into its capabilities to help you decide what tool makes sense for you!

Using Python

PL/pgSQL isn't the only procedural language available with PostgreSQL, it also supports Python! The Python procedural language, plpython3u for Python 3, is "untrusted" (hence the u at the end of the name), meaning you must be a superuser to create functions, and Python code can access and do anything that a superuser could. Luckily, we're generating test data in non-production environments, so Python is an acceptable option despite these security concerns.

To use plpython3u, we need to install python3 and postgresql-plpython3-$PG_MAJOR system packages and create the extension in the SQL script with the command below. I've already taken these steps for the Docker image and plpython script in the sample code repo.

CREATE EXTENSION IF NOT EXISTS plpython3u;

The main difference to be aware of when using Python in PostgreSQL is that all database access happens via the plpy module that is automatically imported in plpython3u blocks. The following example should help clarify some basics of using plpython3u and the plpy module:

-- Excerpt from add-data-plpython-intro.sql in the sample code repo
DO $$
    print("Print statements don't appear anywhere!")

    # Manually convert value to string, quote it, and interpolate
    artist_name = plpy.quote_nullable("DJ Okawari")
    returned = plpy.execute(f"INSERT INTO artists (name) VALUES ({artist_name})")
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=7 nrows=1 rows=[]>

    # Let PostgreSQL parameterize the query
    artist_name = "Ella Fitzgerald"
    plan = plpy.prepare("INSERT INTO artists (name) VALUES ($1) RETURNING *", ["text"])
    returned = plan.execute(plan, [artist_name])
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=11 nrows=1 rows=[{'artist_id': 2, 'name': 'Ella Fitzgerald'}]>

    returned = plpy.execute("SELECT * FROM artists")
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=5 nrows=2 rows=[{'artist_id': 1, 'name': 'DJ Okawari'}, {'artist_id': 2, 'name': 'Ella Fitzgerald'}]>
$$ LANGUAGE plpython3u;

Here are the most important insights from the above code:

You can't print out debugging information with the Python print statement, you need to use logging methods available in the plpy module (such as info, warning, error).
The [plpy.execute function](https://www.postgresql.org/docs/12/plpython-database.html) can execute a simple string as a query. If you're interpolating variables into the query, you are responsible for converting the variable value into a string and properly quoting it.
Alternately, use plan = plpy.prepare then plan.execute to prepare and execute a query, which allows you to leave data conversion and quoting up to PostgreSQL. As a bonus, you can save plans so the database only has to parse the query string and formulate an execution plan once.
The return value of plpy.execute can tell you the status of the query, how many rows were inserted or returned, and the rows themselves.

Now that we have an understanding of how to use Python in PostgreSQL, let's apply it to generating test data for our sample schema. While we could translate the previous section's PL/pgSQL code to Python with very few changes, doing so wouldn't capitalize on the biggest advantage of using Python — the plethora of standard and third-party libraries available.

The Faker Package

Faker is a Python package that provides many helpers for generating fake data. You can generate realistic-looking first and last names, addresses, emails, URLs, job titles, company names, and much more. Faker also supports generating random words and sentences, and generating random data across many different data types (numbers, strings, dates, JSON, and more). Using Faker is straightforward:

-- Excerpt from add-data-plpython-faker.sql in the sample code repo
DO $$
    from random import randint, choice
    from faker import Faker

    fake = Faker()

    for _ in range(6):
        plan = plpy.prepare("INSERT INTO artists (name) VALUES ($1)", ["text"])
        plan.execute([fake.name()])

    # Alternately, we could add "RETURNING artist_id" to the above query and
    # save those values to avoid making this extra query for all artist_ids
    artist_ids = [row["artist_id"] for row in plpy.execute("SELECT artist_id FROM artists")]
    for _ in range(10):
        title = " ".join(word.title() for word in fake.words(nb=randint(1, 3)))
        plan = plpy.prepare(
            "INSERT INTO albums (artist_id, title, released) VALUES ($1, $2, $3)",
            ["int", "text", "date"],
        )
        plan.execute([choice(artist_ids), title, fake.date()])

    # ...
$$ LANGUAGE plpython3u;

The dataclasses Module

If you prefer to create Python objects to represent rows from your different tables, you could use a variety of different packages, such as attrs, factory_boy, or the built-in module dataclasses. These packages allow you to declare a field per table column and associate data types and factories for generating test data.

Please note that if you go very far down this path of representing rows as Python objects, you will find yourself re-creating a lot of ORM functionality. In that case, you should probably just use an ORM!

Here's an example of how you could use the dataclasses module to generate test data for our sample schema:

-- Excerpt from add-data-plpython-dataclasses.sql in the sample code repo
DO $$
    from dataclasses import dataclass, field
    import datetime
    from random import randint, choice
    from typing import List, Any, Type, TypeVar

    from faker import Faker

    T = TypeVar("T", bound="DataGeneratorBase")
    fake = Faker()

    # This is a useful base class for tracking instances so we can use them in
    # relationships (picking a random artist or genre to foreign key to).
    class DataGeneratorBase:
        def __new__(cls: Type[T], *args: Any, **kwargs: Any) -> T:
            "Track class instances in a list on the class"
            instance = super().__new__(cls, *args, **kwargs)  # type: ignore
            if "instances" not in cls.__dict__:
                cls.instances = []
            cls.instances.append(instance)
            return instance

    @dataclass
    class Genre(DataGeneratorBase):
        genre_id: int = field(init=False)
        name: str = field(default_factory=fake.street_name)

    @dataclass
    class Artist(DataGeneratorBase):
        artist_id: int = field(init=False)
        name: str = field(default_factory=fake.name)

    @dataclass
    class Album(DataGeneratorBase):
        album_id: int = field(init=False)
        artist: Artist = field(default_factory=lambda: choice(Artist.instances))
        title: str = field(
            default_factory=lambda: " ".join(
                word.title() for word in fake.words(nb=randint(1, 3))
            )
        )
        released: datetime.date = field(default_factory=fake.date)
        genres: List[Genre] = field(
            # Use Faker to pick a list of genres to avoid duplicates
            default_factory=lambda: fake.random_elements(Genre.instances, length=randint(0, 3), unique=True)
        )

    for _ in range(6):
        g = Genre()
        # "RETURNING id" lets us get the database-generated and store it on the
        # Python object for later reference without needing to issue additional
        # queries.
        plan = plpy.prepare(
            "INSERT INTO genres (name) VALUES ($1) RETURNING genre_id", ["text"]
        )
        g.genre_id = plan.execute([g.name])[0]["genre_id"]
    for _ in range(6):
        artist = Artist()
        plan = plpy.prepare(
            "INSERT INTO artists (name) VALUES ($1) RETURNING artist_id", ["text"]
        )
        artist.artist_id = plan.execute([artist.name])[0]["artist_id"]
    for _ in range(8):
        album = Album()
        plan = plpy.prepare(
            "INSERT INTO albums (artist_id, title, released) VALUES ($1, $2, $3) RETURNING album_id",
            ["int", "text", "date"],
        )
        album.album_id = plan.execute(
            [album.artist.artist_id, album.title, album.released]
        )[0]["album_id"]

        # Insert album_genres rows
        for g in album.genres:
            plan = plpy.prepare(
                "INSERT INTO album_genres (album_id, genre_id) VALUES ($1, $2)",
                ["int", "int"],
            )
            plan.execute([album.album_id, g.genre_id])
$$ LANGUAGE plpython3u;

The above snippet defines classes for each main table in our example schema: Genre, Artist, and Album. Then, it defines fields for each column along with a default_factory function that tells Python (or the Faker package, in many cases) how to generate suitable test data. I made the Album class the "owner" of the many-to-many relationship with Genres, so when an Album is created, it automatically picks 0-3 existing Genres to associate itself with during initialization.

The second half of the code passes the Python objects into SQL INSERT queries, returning the primary key IDs (which weren't generated during object creation, due to the init=False field argument) so they can be saved on the objects and used later when setting foreign keys. This highlights a difficulty with doing this sort of object-relational mapping yourself — you have to figure out dependencies between your types of data and enforce an ordering (in Python and SQL) so that you have database-created IDs at the right times. This can be a bit tedious and messy, especially if you have circular dependencies or self-referencing relationships in your tables.

Importing External .py Files

If your data model or data-generation code start to get complex, it can be annoying to have a lot of Python code in SQL files — your IDE won't want to lint, type-check, and auto-format your Python code! Luckily, you can keep your Python code in external .py files that you import and execute from inside a plpython3u block, using the technique shown below:


-- Excerpt from add-data-plpython-external-pyfile.sql in the sample code repo
DO $$
    import importlib.util

    # The second argument is the filepath on the server (inside the container)
    spec = importlib.util.spec_from_file_location("add_test_data", "/repo/add_test_data.py")
    add_test_data = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(add_test_data)
    add_test_data.main(plpy)
$$ LANGUAGE plpython3u;

The add_test_data.py file can look the exact same as the body of the plpython3u block from the previous example, but you'll need to wrap the bottom half (which uses plpy to run queries) in a function that accepts plpy as an argument, so it looks like:

# Excerpt from add_test_data.py in the sample code repo

# ...
def main(plpy: Any) -> None:
    for _ in range(6):
        g = Genre()
    # ...

Other (Trusted) Ways to Use Python

I want to briefly touch on two ways of using Python outside of PostgreSQL — running Python externally may be preferable if you want or need to avoid the untrusted nature of plpython3u. These approaches let you maintain your Python code completely independent of the database, which may be beneficial for reusability and maintainability.

You could use Python scripts to generate test data into CSV files and then load those into PostgreSQL with the COPY command. With this approach, however, you will likely end up with a multi-step process to generate and load test data. If you invoke a Python script (which outputs CSV) within the SQL COPY command, then you can't populate multiple tables with a single command. If you use multiple SQL COPY commands, it becomes convoluted to reference IDs across tables (foreign keys) across multiple Python script executions. The remaining reasonable approach is a multi-step one: run a Python script that saves multiple CSV files to disk (one per database table) and then run an SQL COPY command per CSV file to load the data.
You could run Python scripts that connect to PostgreSQL via a client library such as psycopg2. The psycopg2 package is used by many ORMs, such as the Django ORM and SQLAlchemy, but it doesn't impose any restrictions on how you handle your data — it just provides a Python interface for connecting to PostgreSQL, sending SQL commands, and receiving results.

Thank you for joining me on this exploration of loading test data (in the previous blog post) and generating test data for PostgreSQL! We tried out a variety of approaches and got some hands-on experience with code — I hope this helps you understand how to use these different approaches, weigh their tradeoffs, and choose which approach makes the most sense for your team and project.

If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

DEV Community

Creating PostgreSQL Test Data with SQL, PL/pgSQL, and Python

Follow Along with Docker

Sample Schema

Generating Data

SQL

PL/pgSQL

Using Python

The Faker Package

The dataclasses Module

Importing External .py Files

Other (Trusted) Ways to Use Python

Top comments (0)

Read next

7 Must-Try Open-Source Tools for Python and JavaScript Developers 🚀

Why Is Spark Slow??

How to chat with Local LLM in Obsidian

The Justin Beiber database problem!