Every day, thousands of engineers do their best to protect data quality and integrity: making sure every transaction stays intact. But sometimes you need the exact opposite — to deliberately break data or even invent it from scratch. And that’s not so simple either.
In this article, I’ll explain what fake (synthetic, test — call them what you like) data is, why you might need it, how to generate it, and how to use it. The article is based on my talk at PGConf.Russia.
A real-life case
Imagine you’re building a new information system. To test basic functionality — like registration and login — you need test data: usernames, full names, emails, phone numbers, and so on. The system will check them for validity and uniqueness: is the username free, is the email correct, does the address match the format.
Using real data for this is a bad idea for several reasons:
- Legality. In many countries it’s prohibited or strictly limited by data protection laws (GDPR in Europe, 152-FZ in Russia).
- Security. Test environments are usually less protected than production. Using real data risks leaks — sensitive information could end up in insecure environments or in the wrong hands.
- Scalability. You might simply not have enough real data. What if you need tens of thousands of accounts with different roles and privileges to test complex scenarios?
The simplest approach is to use placeholders like “Ivan Ivanov”, “Vasya Pupkin”, “John Doe”. But that list and your patience will run out fast, leaving you typing random nonsense. And if you need to test load with thousands of users, your testbed turns into a junkyard of meaningless, invalid records. Many of us have seen this sad picture:
Not only does it look ugly, it creates real problems as data grows:
- Validation. Apps and databases check login lengths, email and phone formats. Random junk usually fails those checks.
- Uniqueness. Logins, emails, internal IDs must be unique within a table or group (department, company).
- Business logic. Data should reflect real business scenarios. An HR employee must have an “HR” role, an order can’t be delivered before it’s created.
Luckily, there are plenty of ways to generate high-quality test data that not only looks real but also respects all your system’s rules. Ideally, your data should look something like this:
Then you get the same fields (name, position, address) — but looking far more solid and solving the issues above:
- Safe. Fully synthetic: names, addresses, positions are made up, so no risk even if leaked.
- Plausible. Looks real, which is critical for testing UI/UX and logic.
- Valid. Ready for checks from simple email format to complex business rules.
Let’s go through how to achieve this and what pitfalls to avoid.
Why “fake” data matters
Synthetic data has far more uses than you might think. Key scenarios:
- Functional testing. Generate endless cases — valid and edge — to see how the system reacts. Will all triggers fire? Will it crash on invalid formats? Fake data is essential here.
- Load testing. What happens when user numbers jump 100x? To find out, you need data that simulates that load. Generation lets you create datasets bigger than production and find bottlenecks before real users do.
- Demos and pilots. Showing a prototype with a client’s real data is a big no-no (and often a breach of NDA). With synthetic data — go ahead: looks good, safe, and won’t expose anyone’s account.
You can hand such data to contractors, use it in conference talks, or share it with partners — no damage if it leaks. It’s also perfect for screenshots in docs and articles, without masking tricks.
Four ways to get “fake” data
Depending on your goals and existing data, you can use one of four methods:
- Masking. Ideal for creating a testbed from a real production DB. Personal data is anonymized — values replaced or encrypted. Key advantage: structure, volume, statistical properties, and relational integrity are preserved. You get a full prod copy, risk-free.
- Generation from scratch. Best for a clean setup — just the schema, no data. Data is generated artificially, based on formats, validation rules, and size requirements. Great for predictable datasets tailored to specific tests or initializing new projects.
- Extrapolation. Useful when you have historical data and want to forecast future loads. You can “grow” existing data to simulate scenarios like doubling user numbers or a year of system growth, which helps reveal future bottlenecks before they hit (see the sketch after this list).
- Combined approach. The most practical real-world scenario. Typically you’ll combine methods: masking for user data, extrapolation for transactions to simulate load, generation for new features not yet in prod.
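To make the extrapolation idea concrete, here’s a minimal sketch in plain SQL. It assumes a hypothetical orders table with customer_id, amount, and created_at columns, and simply doubles the data volume by cloning every order one year into the future, with a little noise so the copies aren’t exact duplicates:

-- Hypothetical orders(customer_id, amount, created_at) table; the primary key comes from a sequence.
-- Clone every order one year forward to simulate a year of growth and double the volume.
insert into orders (customer_id, amount, created_at)
select customer_id,
       amount * (0.9 + random() * 0.2),   -- ±10% noise so the clones aren't exact copies
       created_at + interval '1 year'
from orders;

Run it twice and you quadruple the data, which is often all you need to see how indexes and queries behave under a heavier load.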
Tools: from scripts to data factories
To tackle data masking and generation, there’s a wide range of tools out there. Broadly speaking, they fall into three categories, depending on how complex and comprehensive they are.
- DIY scripts. You can always roll your own with SQL or any general-purpose language, using libraries like the classic Python Faker or its PostgreSQL-specific wrappers such as pg_faker. The upside: you have full control and it’s easy to get started. The downside: maintaining these scripts quickly becomes a pain if your schema changes often. That’s why this approach is best for one-off jobs or when you’re working with static schemas that rarely change.
- Open-source tools. Projects like PostgreSQL Anonymizer or GreenMask can scan the database and apply declarative configs (YAML, JSON) that define masking and generation rules across the entire DB. Their main strength is automation and built-in strategies that preserve data integrity. The tradeoff is that you’ll need to invest some time learning and setting them up. These are a great fit when you need regular test copies of production databases that keep relational consistency intact.
- Test Data Management (TDM) platforms. Think Tonic, Gretel, and similar products. These are more than just utilities — they’re full-fledged data factories. They can work not only with databases but also with other data sources, spinning up entire test environments in the cloud on demand. Their strengths: maximum automation, scalability, and support for highly complex environments. Their weaknesses: high cost and steep implementation effort. These platforms are usually found in large enterprises where managing the full lifecycle of test data is a must.
Despite the variety of options, most of these tools are powered by the same set of fundamental ideas — the so-called fake data patterns. Let’s take a closer look at them.
Masking patterns
Static masking is one of the simplest and most reliable approaches. The idea is to replace sensitive data with fake values that partially or fully preserve the format of the original data. For instance, you could mask an email by keeping its structure (the part before @ and the domain) while altering the actual characters.
The main advantage of static masking is how easy it is to implement. Even standard SQL functions are enough to whip up something usable on the spot. But the method has important limitations: such data can’t be used as primary or foreign keys, since that would break referential integrity. On top of that, masked values might fail business validation if the application checks them against real-world patterns.
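As a minimal sketch (assuming a hypothetical users table with id and email columns), a single UPDATE with standard string functions keeps the email’s structure while discarding the original local part:

-- Keep the domain, replace everything before the @ with a neutral placeholder.
update users
set email = 'user' || id || '@' || split_part(email, '@', 2);

The result still looks and validates like an email, but it no longer identifies anyone.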
Hashing is another widely used method that anonymizes data while preserving uniqueness. This is done with hash functions — both standard and custom. Some tools even support advanced options like format-preserving hashing, which allows you to mask an email so that it remains structurally valid without exposing the original. The key advantage here is determinism: identical input values always produce the same hash. This makes it possible to maintain relationships across tables, which is particularly useful when testing complex systems with interconnected data.
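For instance, a deterministic md5-based replacement gives the same output for the same input everywhere, so a join between two hypothetical tables that both store the email still matches after masking:

-- Identical inputs produce identical hashes, so the users/orders relationship survives masking.
update users  set email = md5(email) || '@example.com';
update orders set customer_email = md5(customer_email) || '@example.com';

This is only a sketch: md5 here is a masking convenience, not a cryptographic guarantee, and real tools add salts or format-preserving schemes on top.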
There are also several other important data masking patterns:
- Noising. Adding random noise to numeric values. This keeps statistical characteristics intact but makes individual records unrecognizable. For example, you could add a random deviation of ±10% to employee salaries.
- Generalization. Replacing exact values with broader ranges or categories. Instead of a specific age (say, 28), you store an interval like 20–40. Instead of precise coordinates, you keep only the city or district. This reduces detail while preserving analytical value.
- Aggregation. Replacing source data with aggregate statistics. Instead of showing each employee’s salary, you display the department average or company median. This is common in reporting and analytics.
- Rule-based masking. Applying business logic to masking. For instance, hiding movie titles with an 18+ rating, masking credit card numbers according to a pattern, or applying different masking rules depending on the user’s role.
- Shuffling. Randomly swapping values within a column. This keeps the dataset composition but breaks record-level consistency.
- Structured data masking. Special methods for handling complex formats. For JSON or XML, this may mean masking values by keys, working with nested structures, or even altering the document structure. For binary data, specific algorithms ensure file formats remain valid.
- Randomized masking. Replacing values with randomly generated ones that match the required format. For example, generating a random but valid-looking email, phone number, or identifier.
Each of these approaches has its strengths and limitations, and in practice the best results usually come from combining several methods depending on the type of data and the use case.
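To make a few of these patterns concrete, here’s a small PostgreSQL sketch over a hypothetical employees(id, department, age, city, salary) table (age assumed to be an integer) that combines noising, generalization, aggregation, and shuffling in one query. Treat it as an illustration, not a production recipe:

-- Noising, generalization, aggregation, and shuffling applied to one hypothetical table.
with shuffled as (
    select id, row_number() over (order by random()) as rn
    from employees
), cities as (
    select city, row_number() over (order by random()) as rn
    from employees
)
select e.id,
       round(e.salary * (0.9 + random() * 0.2)) as salary_noised,          -- noising: ±10% deviation
       (e.age / 10) * 10 as age_bucket,                                    -- generalization: 10-year bucket
       avg(e.salary) over (partition by e.department) as dept_avg_salary,  -- aggregation: department average
       c.city as city_shuffled                                             -- shuffling: same cities, new owners
from employees e
join shuffled s on s.id = e.id
join cities c on c.rn = s.rn;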
Generation patterns
Generating rows
The very first step in testing or bootstrapping an application is filling empty tables with data — whether it’s just a few rows or millions of records. Specialized tools make this trivial: you simply specify the desired number of rows (absolute or relative) in a GUI or config file, and the data magically appears.
If you stick to standard tools, our good old SQL comes to the rescue. Since SQL:1999, the standard has supported generating sequences of rows using recursive queries. It’s a powerful, though not always intuitive, method:
with recursive tmp (r) as (
    select 0
    union all
    select r + 1
    from tmp
    where r < 100500)
select r from tmp;
PostgreSQL makes this even easier with the built-in generate_series function, which can create numeric, time-based, and even text sequences in just a couple of lines. It’s a perfect entry point for quickly generating test data right at the database level:
select generate_series
from generate_series(1, 100500);
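The same function also accepts timestamps with an interval step, so producing, say, one row per day for a whole year is just as short:

select generate_series(
    timestamp '2024-01-01',
    timestamp '2024-12-31',
    interval '1 day') as day;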
If you’re curious, I highly recommend Timescale’s fantastic series of articles on generate_series.
Generating values
Once the “skeleton” of rows is in place, it’s time to fill them with actual data. For small projects and quick checks, the DBMS itself has enough built-in capabilities. This approach works great for rapid prototyping, validating basic functionality, or building reports where realism isn’t critical. Here’s an example in PostgreSQL using generate_series() along with random value generators:
select
    md5(random()::text) as name,
    floor(random() * 99)::int as years_of_experience,
    md5(random()::text) || '@gmail.com' as email,
    now() - (random() * (interval '90 days')) as order_date,
    case when random() > 0.5 then true else false end as is_student
from generate_series(1, 1000);
But real-world projects need much more realistic data. Fortunately, most modern data generation tools can produce lifelike primitives: names, addresses, emails, document numbers, phone numbers — all customizable and localized. For example, many tools support:
- Demographic data (full names) adapted for specific regions (Cyrillic for Russia, Latin for Europe, etc.).
- Addresses that match real postal formats for different countries.
- Valid document numbers (passports, driver’s licenses) with proper checksums, plus phone numbers formatted by operator masks.
- Emails bound to actual-looking domains.
This makes the data not only look convincing but also pass formal validation checks, which is critical for testing business logic.
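If pulling in a dedicated tool is overkill, you can approximate this directly in SQL. Here’s a minimal sketch; the name lists and the corp.example domain are invented for illustration:

-- Plausible full names and derived emails built from small lookup lists.
with first_names(fn) as (values ('Anna'), ('Boris'), ('Clara'), ('Dmitry')),
     last_names(ln)  as (values ('Ivanova'), ('Petrov'), ('Smirnova'), ('Volkov'))
select fn || ' ' || ln as full_name,
       lower(fn) || '.' || lower(ln) || '@corp.example' as email
from first_names
cross join last_names
order by random()
limit 10;

Notice how the email is derived from the name, which is exactly the kind of cross-field consistency discussed in the next section.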
Other key data generation patterns
- Generating related values — producing logically connected data across fields and tables. For instance, generating an email based on a user’s first and last name, or maintaining consistency of region codes across different tables. This ensures both realism and data integrity.
- Generation from parent tables — respecting foreign keys and relationships. Data in child tables should correspond to records in parent tables.
- Recursive relationships — handling self-referential or cyclic structures (like an employees table where each record can point to a manager in the same table). This requires analyzing the business logic behind the recursion, defining a safe nesting depth, preventing circular references, and using iterative or recursive generation algorithms.
Each of these patterns demands a deep understanding of the database schema and the application’s business logic. In practice, you’ll often mix specialized tools with custom SQL queries or scripts to get the job done.
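For example, here’s a hedged sketch of the “generation from parent tables” pattern: the customers and orders tables are hypothetical, and every generated order points at a customer that actually exists, so the foreign key holds by construction:

-- Five synthetic orders per existing customer; customer_id is always a valid parent key.
insert into orders (customer_id, amount, created_at)
select c.id,
       round((random() * 1000)::numeric, 2) as amount,
       now() - (random() * interval '90 days') as created_at
from customers c
cross join generate_series(1, 5) as n;

For recursive structures like an employees/manager hierarchy, the same idea is applied level by level: insert the top-level records first, then generate each next level referencing only the ids that already exist.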
Where to start
- Study your DB schema in detail — you won’t get good results without it. Don’t forget triggers and stored procedures: they can silently change data during generation or masking.
- Choose the right tool for your tasks and tech stack. Be ready to find workarounds. Make sure the tool or script can selectively process specific tables or parts of tables.
- Build a testbed. Generate data, apply basic masking, and review the results together with your colleagues.

