Every developer needs test data. Filling a database with "test test test" and "asdf@asdf.com" creates data that is nothing like production, which means your tests are nothing like production. Realistic dummy data catches bugs that toy data misses.
I have found bugs in production that would have been caught in development if we had tested with realistic data: names with apostrophes breaking SQL, addresses too long for the column, phone numbers in unexpected formats, and email addresses with plus signs causing validation failures.
What realistic test data looks like
Realistic test data has the same statistical properties, and the same messy variety, as real data:
Names: Varying lengths, multiple cultural origins, special characters (O'Brien, Martinez-Lopez, Björk), prefixes and suffixes (Dr., Jr., III).
Emails: Different providers, varying username formats, subaddressing (user+tag@gmail.com), long domain names, international domains.
Addresses: Real street patterns, varying line lengths, apartment numbers, international formats. US addresses look nothing like UK addresses, which in turn look nothing like Japanese addresses.
Phone numbers: Country codes, area codes, varying formats (parentheses, dashes, dots, spaces, none). International numbers starting with + versus domestic formatting.
Dates: Distributed across a realistic range. Not all January 1st. Not all from this year. Includes edge cases like February 29th.
Financial data: Amounts that follow realistic distributions (most transactions are small, a few are large). Account numbers with proper check digits. Currency-appropriate formatting.
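Two of the properties above are easy to get wrong by hand, so here is a minimal sketch of both. It assumes the Luhn algorithm as the check-digit scheme (the one used on payment card numbers; your account format may use a different scheme) and uses a simple power curve as a rough stand-in for a real transaction distribution. The function names are illustrative and not part of any library.

```javascript
// Luhn check digit for a string of digits (the number without its final digit).
function luhnCheckDigit(payload) {
  let sum = 0;
  for (let i = 0; i < payload.length; i++) {
    // Walk right to left, doubling every second digit starting with the rightmost.
    let digit = Number(payload[payload.length - 1 - i]);
    if (i % 2 === 0) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return (10 - (sum % 10)) % 10;
}

// True when the last digit of `number` is the Luhn check digit of the rest.
function isLuhnValid(number) {
  return luhnCheckDigit(number.slice(0, -1)) === Number(number.slice(-1));
}

// Random digits plus a valid check digit. `rng` returns a float in [0, 1).
function makeAccountNumber(length = 12, rng = Math.random) {
  let payload = '';
  for (let i = 0; i < length - 1; i++) {
    payload += Math.floor(rng() * 10);
  }
  return payload + luhnCheckDigit(payload);
}

// Cubing a uniform value pushes most samples toward 0,
// so most amounts land near `min` and only a few land near `max`.
function skewedAmount(rng = Math.random, min = 1, max = 5000) {
  const skewed = rng() ** 3;
  return Math.round((min + skewed * (max - min)) * 100) / 100;
}

const account = makeAccountNumber();
const amounts = Array.from({ length: 5 }, () => skewedAmount());
console.log(account, isLuhnValid(account), amounts);
```

With the cubic curve, about half of the generated amounts fall in the lowest eighth of the range. If you know the real distribution of your transactions, fit the generator to that instead.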
Data generation approaches
Faker libraries: Available in most major languages (Faker.js, Faker for Python, Bogus for .NET). These generate realistic data from locale-specific datasets.
// Module names below (person, location) follow @faker-js/faker v8 and later.
import { faker } from '@faker-js/faker';

const user = {
  name: faker.person.fullName(),
  email: faker.internet.email(),
  phone: faker.phone.number(),
  address: {
    street: faker.location.streetAddress(),
    city: faker.location.city(),
    state: faker.location.state(),
    zip: faker.location.zipCode(),
  },
  // A random date within the last two years.
  createdAt: faker.date.past({ years: 2 }),
};
Template-based generation: Define a schema and generate data that conforms to it. This is better for maintaining referential integrity (orders reference existing customers, line items reference existing products).
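As a rough illustration of the template idea, the sketch below uses a made-up schema format: each table declares a row count and a set of field functions, and each field function receives the tables generated so far. The schema shape, `generateFromSchema`, and `pick` are all hypothetical names for this example, not an existing tool's API.

```javascript
// Pick a random element from an array.
const pick = (items, rng = Math.random) => items[Math.floor(rng() * items.length)];

// Hypothetical schema: tables are listed in dependency order (parents first).
const schema = {
  customers: {
    count: 3,
    fields: {
      id: (_tables, i) => i + 1,
      name: (_tables, i) => ["Ana O'Brien", 'Li Wei', 'Sam Martinez-Lopez'][i % 3],
    },
  },
  orders: {
    count: 10,
    fields: {
      id: (_tables, i) => i + 1,
      // Foreign key: always drawn from customers that already exist.
      customer_id: (tables) => pick(tables.customers).id,
      quantity: (_tables, i) => (i % 5) + 1,
    },
  },
};

function generateFromSchema(schemaDefinition) {
  const tables = {};
  // String keys keep insertion order, so tables are built in the order declared.
  for (const [tableName, { count, fields }] of Object.entries(schemaDefinition)) {
    tables[tableName] = Array.from({ length: count }, (_, i) => {
      const row = {};
      for (const [fieldName, generate] of Object.entries(fields)) {
        row[fieldName] = generate(tables, i);
      }
      return row;
    });
  }
  return tables;
}

const data = generateFromSchema(schema);
console.log(data.orders[0]);
```

Because the schema is plain data plus functions, adding a table such as line items means adding one more entry after the tables it depends on.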
Production data masking: Copy production data and replace sensitive fields with fake equivalents. This preserves the statistical distribution and edge cases of real data while removing PII. This is the gold standard for realistic test data but requires careful implementation to avoid leaking any real data.
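One way to sketch the masking step is a consistent replacement table: every distinct real value maps to one stable fake value, so duplicates and joins survive the masking. The example below masks only emails, treats addresses as case-insensitive (an assumption), and uses invented rows in place of a real export. `createEmailMasker` is a hypothetical helper, not a library function.

```javascript
// Returns a function that replaces each distinct real email with a stable fake one.
function createEmailMasker() {
  const replacements = new Map(); // normalized real email -> fake email
  return function maskEmail(realEmail) {
    const key = realEmail.trim().toLowerCase();
    if (!replacements.has(key)) {
      const localPart = key.split('@')[0];
      // Preserve the edge case (subaddressing) without preserving the real value.
      const tag = localPart.includes('+') ? '+tag' : '';
      replacements.set(key, `user${replacements.size + 1}${tag}@example.test`);
    }
    return replacements.get(key);
  };
}

// Invented rows standing in for a production export.
const exportedRows = [
  { id: 1, email: 'jane.doe+news@gmail.com', total: 19.99 },
  { id: 2, email: 'sam@corp.example', total: 250.0 },
  { id: 3, email: 'Jane.Doe+news@gmail.com', total: 5.49 },
];

const maskEmail = createEmailMasker();
const maskedRows = exportedRows.map((row) => ({ ...row, email: maskEmail(row.email) }));
console.log(maskedRows);
```

Rows 1 and 3 belong to the same person and end up with the same fake address, while the plus sign that might trip a validator is still present. Note that the replacement table links real values to fake ones, so it is as sensitive as the source data and should never be copied into the test environment.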
Referential integrity
Generated data is useless if it does not maintain relationships. An order that references customer_id 42 is meaningless if customer 42 does not exist in your test data.
The approach: generate in dependency order. Create customers first, then create orders that reference existing customer IDs. Create products first, then create line items that reference existing product IDs.
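// Uses the `faker` import from the earlier example.
// Parents first: customers must exist before orders can reference them.
const customers = Array.from({ length: 100 }, () => ({
  id: faker.string.uuid(),
  name: faker.person.fullName(),
}));

// Children second: every order takes its customer_id from the generated customers.
const orders = Array.from({ length: 500 }, () => ({
  id: faker.string.uuid(),
  customer_id: faker.helpers.arrayElement(customers).id,
  // commerce.price() returns a string, so convert it for numeric columns.
  total: Number(faker.commerce.price({ min: 5, max: 500 })),
  date: faker.date.past({ years: 1 }),
}));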
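It can also help to verify the relationships after generation, especially once several scripts contribute to the same dataset. The sketch below is a small, library-free check with a hypothetical `findOrphans` helper and hand-written sample tables.

```javascript
// Returns the child rows whose foreign key matches no parent row.
function findOrphans(children, foreignKey, parents, parentKey = 'id') {
  const parentIds = new Set(parents.map((parent) => parent[parentKey]));
  return children.filter((child) => !parentIds.has(child[foreignKey]));
}

// Small hand-written tables so the check is easy to follow.
const sampleCustomers = [{ id: 'c1' }, { id: 'c2' }];
const sampleOrders = [
  { id: 'o1', customer_id: 'c1' },
  { id: 'o2', customer_id: 'c2' },
  { id: 'o3', customer_id: 'c9' }, // deliberately broken reference
];

const orphans = findOrphans(sampleOrders, 'customer_id', sampleCustomers);
console.log(orphans.map((order) => order.id)); // ['o3']
```

Running a check like this before loading the data turns a confusing foreign key error in the database into a clear list of bad rows.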
Volume matters
Testing with 10 records is not the same as testing with 100,000. Performance issues, pagination bugs, and UI layout problems often appear only at scale. Generate enough data to stress your system the way real users will.
A good rule of thumb: test with at least 10x your expected first-year data volume. If you expect 10,000 users, test with 100,000 dummy records.
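At that volume it is usually easier to produce rows lazily and load them in batches than to build one giant array. The sketch below applies the 10x rule to the 10,000-user example using generator functions. The batch size of 1,000 and the row shape are arbitrary assumptions, and the loop body is where a real bulk insert would go.

```javascript
// Lazily yields one user row at a time.
function* generateUsers(total) {
  for (let i = 1; i <= total; i++) {
    yield { id: i, email: `user${i}@example.test` };
  }
}

// Groups any iterable into arrays of at most `batchSize` items.
function* inBatches(iterable, batchSize) {
  let batch = [];
  for (const item of iterable) {
    batch.push(item);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}

const expectedUsers = 10000;
const targetRows = expectedUsers * 10; // the 10x rule of thumb

let batchCount = 0;
let totalRows = 0;
for (const batch of inBatches(generateUsers(targetRows), 1000)) {
  // A real script would bulk-insert `batch` into the database here.
  batchCount += 1;
  totalRows += batch.length;
}
console.log(`${totalRows} rows in ${batchCount} batches`);
```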
The generator
For quick generation of realistic dummy data in various formats (JSON, CSV, SQL INSERT statements), I built a dummy data generator that supports common data types with configurable schemas and output sizes.
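To show what those output formats involve, here is a minimal sketch of CSV and SQL INSERT serialization. It is not the code behind the generator mentioned above; `toCsv` and `toSqlInserts` are hypothetical helpers, and they assume every row has the same keys and that table and column names are trusted. The SQL version doubles single quotes, which is the standard way to escape them inside a SQL string literal.

```javascript
// Quote a CSV field only when it contains a comma, quote, or line break.
function csvField(value) {
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows) {
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) => headers.map((header) => csvField(row[header])).join(','));
  return [headers.join(','), ...lines].join('\n');
}

// Numbers are emitted as-is; everything else becomes a quoted string literal.
function sqlLiteral(value) {
  if (typeof value === 'number') return String(value);
  return `'${String(value).replace(/'/g, "''")}'`;
}

function toSqlInserts(table, rows) {
  return rows
    .map((row) => {
      const columns = Object.keys(row).join(', ');
      const values = Object.values(row).map(sqlLiteral).join(', ');
      return `INSERT INTO ${table} (${columns}) VALUES (${values});`;
    })
    .join('\n');
}

const sampleRows = [
  { id: 1, name: "Mary O'Brien" },
  { id: 2, name: 'Smith, John' },
];

console.log(toCsv(sampleRows));
console.log(toSqlInserts('customers', sampleRows));
```

The apostrophe in O'Brien is exactly the kind of value that breaks hand-built SQL, which is why seed scripts need the escaping step. Application code should still use parameterized queries rather than string building.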
I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.