Rishi Gaurav


Building an API Test Data Factory Without Faker (and Why You Might Want To)

Faker is great until your test fails on a Tuesday because someone, somewhere, generated a name with an apostrophe and your SQL escape was off.

If you've worked in API testing for any length of time, you've probably used Faker or a similar library.

It solves an obvious problem: generating realistic-looking names, addresses, emails, phone numbers, and company details without manually maintaining large datasets.

For demos and prototypes, it's fantastic.

For automated testing at scale, however, randomness can become the enemy.

One failing test that can't be reproduced because the random generator created a slightly different payload is enough to consume hours of debugging time. Worse, intermittent failures erode confidence in your test suite, making engineers question whether failures indicate real defects or just unlucky data.

Over the past few years, I've gradually moved away from relying on Faker for the majority of my API tests. Instead, I use deterministic test data factories that generate the same data every time given the same input.

The result isn't just more stable tests—it's a system that's easier to debug, easier to parallelize, and far easier to maintain.

Here's why.


The Case Against Random Data in Tests

Random data sounds like a great idea.

Every execution uses fresh values.

No duplicate emails.

No conflicting usernames.

No hardcoded fixtures.

Until something fails.

Consider this test:

const customer = faker.person.fullName();

Yesterday it generated:

John Smith

Today it generated:

D'Arcy O'Connor

Tomorrow it generates:

José Hernández

Every one of those names is perfectly valid.

Yet they exercise completely different parts of your application.

Suddenly you're debugging:

  • Unicode handling
  • SQL escaping
  • JSON serialization
  • CSV exports
  • Search indexing
  • Email validation

None of which your test intended to verify.

The problem isn't Faker.

The problem is unpredictability.

A test should fail because the application changed—not because the generated input happened to be different this morning.


Random Data Makes Failures Harder to Reproduce

Imagine a CI pipeline reports:

Customer creation failed.

The logs don't include the generated payload.

You rerun the pipeline.

The random generator produces different values.

The failure disappears.

Congratulations.

You've just created a "works on my machine" bug.

Deterministic test data eliminates this source of flakiness: a data-driven failure either happens on every run or not at all.

Every execution starts from the same inputs.

Every failure becomes reproducible.
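
Even before switching to deterministic data, the scenario above gets easier to diagnose if the payload travels with the failure. Here is a minimal sketch of that idea. The names `attachPayload`, `createCustomerOrExplain`, and the `api` client object are hypothetical and only serve the illustration.

```javascript
// Hypothetical helper: builds a new error whose message carries the payload
// that triggered the failure, so the CI log is enough to replay the request.
function attachPayload(err, payload) {
  const wrapped = new Error(
    `${err.message} | payload: ${JSON.stringify(payload)}`
  );
  wrapped.cause = err; // keep the original error and its stack
  return wrapped;
}

// Usage inside a test. `api` stands in for whatever HTTP client the suite
// uses; `api.createCustomer` is an assumed method, not a real library call.
async function createCustomerOrExplain(api, payload) {
  try {
    return await api.createCustomer(payload);
  } catch (err) {
    throw attachPayload(err, payload);
  }
}
```

With this in place, "Customer creation failed." shows up in the log next to the exact name that caused it.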


A Deterministic Factory: Seed → Entity

Instead of generating completely random objects, deterministic factories use a predictable input—usually called a seed.

Think of it like a mathematical function.

Seed 101
        ↓
Customer Object

Every time the factory receives:

101

it returns:

{
  "id": 101,
  "firstName": "Alice",
  "lastName": "Johnson",
  "email": "customer101@example.com"
}

Tomorrow?

Exactly the same.

Next month?

Exactly the same.

Another developer's machine?

Exactly the same.

That's the beauty of deterministic generation.


A Simple Factory Pattern

Instead of writing:

faker.person.fullName();

you build a reusable factory:

CustomerFactory.create(101)

The factory owns:

  • Names
  • Emails
  • Addresses
  • Phone numbers
  • Relationships

Every entity is generated from a predictable algorithm rather than random selection.

Changing the seed changes the entity.

Using the same seed recreates it perfectly.
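
A minimal sketch of what such a factory might look like. The lookup tables and their sizes are assumptions, chosen so that seed 101 reproduces the Alice Johnson example from earlier. A real factory would carry more fields and longer lists.

```javascript
// Fixed lookup tables (illustrative values). The seed picks an index,
// so nothing here is random.
const FIRST_NAMES = ["Maria", "Alice", "Chen", "Omar", "Priya"];
const LAST_NAMES = ["Smith", "Garcia", "Nguyen", "Johnson", "Patel", "Kim", "Okafor"];

const CustomerFactory = {
  // Pure function of the seed: same seed in, same customer out.
  create(seed) {
    if (!Number.isInteger(seed) || seed < 0) {
      throw new RangeError(`seed must be a non-negative integer, got ${seed}`);
    }
    return {
      id: seed,
      firstName: FIRST_NAMES[seed % FIRST_NAMES.length],
      lastName: LAST_NAMES[seed % LAST_NAMES.length],
      // The seed is embedded in the email, so emails stay unique
      // even when the name combination repeats.
      email: `customer${seed}@example.com`,
    };
  },
};
```

With these two tables the name pair repeats every 35 seeds, which is fine here because uniqueness comes from the id and email rather than from the names.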


Why This Matters

Suppose Test A creates:

Customer #101

Later, another test fails.

The logs mention:

customer101@example.com

You immediately know:

  • Which factory generated it
  • Which seed produced it
  • Which scenario created it

Debugging becomes dramatically faster.
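
Going from a log line back to a seed can be a one-line lookup. This is a small sketch that assumes the `customer<seed>@example.com` format shown above. `seedFromEmail` is a hypothetical helper name.

```javascript
// Matches addresses of the form customer<digits>@example.com.
const CUSTOMER_EMAIL = /^customer(\d+)@example\.com$/;

// Returns the seed embedded in a factory-generated email,
// or null when the address does not follow the factory's format.
function seedFromEmail(email) {
  const match = CUSTOMER_EMAIL.exec(email);
  return match ? Number(match[1]) : null;
}
```

A `null` result is useful too: it tells you the record was not created by the factory at all.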


Per-Test Isolation Without Truncating the Database

One of the biggest challenges in API testing is keeping tests isolated from each other.

The traditional solution looks like this:

Run Test
   ↓
Insert Data
   ↓
Delete Everything
   ↓
Run Next Test

Large integration suites spend a surprising amount of time cleaning databases.

Sometimes the cleanup takes longer than the tests themselves.


A Better Approach

Instead of deleting data after every test, assign each test its own namespace.

For example:

Test 15

Seed = 15000

Every object generated by that test takes its ID from the range that starts at that seed.

Customer:

15001

Order:

15002

Invoice:

15003

Another test uses:

Seed = 42000

As long as each test stays inside its own range, the datasets never collide.

No truncation required.

Tests remain isolated.

Parallel execution becomes much easier.
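
The numbers in the example suggest a simple rule: test number times 1000 gives the base seed. The sketch below works under that assumption. The block size of 1000 and the name `createNamespace` are mine, not a fixed convention, so pick a size that fits your largest test.

```javascript
// Assumed block size: each test may create up to 999 entities.
const BLOCK_SIZE = 1000;

// Returns an ID allocator for one test. Test 15 owns 15001..15999.
function createNamespace(testNumber) {
  const base = testNumber * BLOCK_SIZE;
  let offset = 0;
  return {
    base,
    nextId() {
      offset += 1;
      // Fail loudly instead of silently leaking into the next test's block.
      if (offset >= BLOCK_SIZE) {
        throw new RangeError(`test ${testNumber} exhausted its ID block`);
      }
      return base + offset;
    },
  };
}
```

Test 15 would call `nextId()` three times to get 15001, 15002, and 15003 for its customer, order, and invoice, while test 42 starts at 42001.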


Faster CI Pipelines

This approach offers another benefit.

Because data never overlaps:

  • Parallel jobs become safer
  • Cleanup becomes optional
  • Database resets become less frequent

For large enterprise suites, that can translate directly into shorter pipeline execution times.


Edge-Case Banks: The 30 Strings That Break Everything

Random generators are surprisingly bad at consistently exercising edge cases.

They occasionally produce unusual values—but not reliably enough.

Instead, maintain an edge-case bank.

Think of it as a curated library of problematic inputs.

Examples include:

Special Characters

O'Connor
Smith-Jones
Anne & Bob

Unicode

José
李小龙
Ångström

Emoji

🚀 Launch
😀 Test

Whitespace

" Leading"
"Trailing "
"Multiple    Spaces"

SQL-Like Inputs

Robert'); DROP TABLE Customers;

Not because you expect SQL injection to succeed.

Because you expect your API to handle unusual strings safely.


Long Values

Generate:

  • 256 characters
  • 512 characters
  • 2048 characters

Length boundaries often expose validation issues.
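
These strings are easier to generate than to store. A small sketch follows. Including one character on either side of each length is my assumption, since off-by-one checks tend to hide right at the limit. `boundaryStrings` is a hypothetical helper.

```javascript
// Builds strings one below, exactly at, and one above a length limit.
// Assumes limit >= 1 and a single-character fill.
function boundaryStrings(limit, fill = "a") {
  return [limit - 1, limit, limit + 1].map((length) => fill.repeat(length));
}

// Nine strings covering the three lengths listed above and their neighbours.
const LONG_VALUES = [256, 512, 2048].flatMap((limit) => boundaryStrings(limit));
```

If your API documents a specific maximum, such as a 100-character name field, pass that limit instead of the generic ones.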


Empty Variations

Don't stop at:

""

Include:

  • Null
  • Spaces
  • Tabs
  • Newlines

Applications frequently treat these differently.


Why Banks Beat Randomness

Instead of hoping random generation eventually creates interesting inputs, you intentionally cover known problem categories.

Coverage becomes measurable.

Maintenance becomes predictable.

Regression testing becomes far stronger.
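
One way to keep such a bank is a plain object keyed by category. The sketch below uses the examples from this section. The key names and the `allEdgeCases` helper are assumptions for illustration.

```javascript
// A curated bank keyed by category, using the examples listed above.
const EDGE_CASE_BANK = Object.freeze({
  specialCharacters: ["O'Connor", "Smith-Jones", "Anne & Bob"],
  unicode: ["José", "李小龙", "Ångström"],
  emoji: ["🚀 Launch", "😀 Test"],
  whitespace: [" Leading", "Trailing ", "Multiple    Spaces"],
  sqlLike: ["Robert'); DROP TABLE Customers;"],
  empty: ["", null, " ", "\t", "\n"],
});

// Flattens the bank into { category, index, value } cases so a failure
// report can name the exact entry that triggered it.
function allEdgeCases(bank = EDGE_CASE_BANK) {
  return Object.entries(bank).flatMap(([category, values]) =>
    values.map((value, index) => ({ category, index, value }))
  );
}
```

Feeding `allEdgeCases()` into a parameterized test gives every entry its own named case, and counting entries per category is a rough but concrete coverage number.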


When Random Is Still the Right Call

None of this means randomness should disappear completely.

In fact, random generation excels in several testing strategies.

Fuzz Testing

Fuzz testing intentionally feeds unexpected inputs into APIs.

Examples include:

  • Random strings
  • Invalid encodings
  • Oversized payloads
  • Corrupted JSON

The objective is discovering crashes—not deterministic validation.

Randomness is valuable here.
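
Random does not have to mean unreplayable. The sketch below drives a fuzz generator from a seeded PRNG, so a crash can be reproduced by rerunning with the logged seed. It uses mulberry32, a small public-domain generator. `fuzzString` and `fuzzInputs` are hypothetical helpers, and the 64-unit length cap is an arbitrary assumption.

```javascript
// mulberry32: a small seeded PRNG. Same seed, same sequence of numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Builds a string of 0..maxLength random UTF-16 code units. Lone surrogates
// are possible on purpose, which covers the "invalid encodings" case.
function fuzzString(rand, maxLength = 64) {
  const length = Math.floor(rand() * (maxLength + 1));
  let out = "";
  for (let i = 0; i < length; i += 1) {
    out += String.fromCharCode(Math.floor(rand() * 0x10000));
  }
  return out;
}

// Generates `count` fuzz inputs that can be replayed from the seed alone.
function fuzzInputs(seed, count) {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, () => fuzzString(rand));
}
```

Print the seed at the start of each fuzz run. When something crashes, passing the same seed to `fuzzInputs` regenerates the exact inputs.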


Property-Based Testing

Property-based testing generates hundreds or thousands of inputs automatically.

Instead of checking:

Customer Name = John

you define rules like:

Every generated customer should produce a valid response.

The framework explores countless combinations searching for failures.

This is exactly where randomness shines.


Load Testing

Large performance tests often require:

  • Thousands of users
  • Millions of requests
  • Large datasets

Random variation helps avoid unrealistic caching effects.

Again, deterministic factories aren't always ideal.


The Right Balance

A mature testing strategy usually looks something like this:

| Test Type | Data Strategy |
| --- | --- |
| Unit Tests | Deterministic |
| Contract Tests | Deterministic |
| API Functional Tests | Deterministic |
| Integration Tests | Mostly Deterministic |
| Regression Tests | Deterministic |
| Fuzz Tests | Random |
| Property Tests | Random |
| Performance Tests | Mixed |

The goal isn't eliminating randomness.

It's using it intentionally.


Building Your Own Test Data Factory

Creating a deterministic factory doesn't require a massive framework.

Start small.

Create factories for your most common entities:

  • Customers
  • Orders
  • Products
  • Users
  • Accounts

Accept a numeric seed.

Generate consistent values.

Store complex edge cases separately.

Over time, you'll build a reusable library that every test can rely on.

The factory becomes a single source of truth for API test data, reducing duplication and making tests easier to read.

Instead of embedding payloads throughout your codebase, developers can express intent clearly:

CustomerFactory.create(101);
OrderFactory.create(205);
ProductFactory.create(12);

The implementation evolves.

The tests remain stable.
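
Factories also compose. The sketch below shows an order factory that derives its related customer and product from the same seed and accepts overrides for the one field a test cares about. The field names, the catalogue size, and the override parameter are all assumptions. The trimmed-down `CustomerFactory` is included only so the block runs on its own.

```javascript
// Trimmed-down customer factory so this example is self-contained.
const CustomerFactory = {
  create(seed) {
    return { id: seed, email: `customer${seed}@example.com` };
  },
};

const PRODUCT_COUNT = 50; // assumed size of the product catalogue

const OrderFactory = {
  // Everything is derived from the order seed, so the whole object graph
  // is reproducible from one number. Overrides replace individual fields.
  create(seed, overrides = {}) {
    const customer = CustomerFactory.create(seed);
    const defaults = {
      id: seed,
      customerId: customer.id,
      productId: (seed % PRODUCT_COUNT) + 1,
      quantity: (seed % 5) + 1,
    };
    return { ...defaults, ...overrides };
  },
};
```

A test about quantity limits can then say `OrderFactory.create(205, { quantity: 1000 })` and leave every other field to the factory.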


Final Thoughts

Random data generators like Faker remain excellent tools. They're quick to adopt, easy to use, and invaluable for prototypes, demonstrations, and exploratory testing.

But when you're building large, reliable API automation suites, predictability often matters more than realism.

Deterministic factories make failures reproducible.

Seed-based entities simplify debugging.

Per-test isolation improves parallel execution.

Edge-case banks provide deliberate coverage instead of accidental coverage.

And when randomness is genuinely needed—such as fuzz testing or property-based testing—you can still introduce it deliberately rather than allowing it to influence every test.

In other words, randomness should be a testing strategy, not a default.

If you're looking to improve the reliability of your automated API tests, learning how to generate API test data the deterministic way is an excellent place to start:

https://totalshiftleft.ai/blog/how-to-generate-test-data-api-testing

The less time your team spends chasing unpredictable test failures, the more time they can spend finding real defects that matter.
