SEN LLC

Posted on Apr 16

Seed Data Without a Dependency Tree: A Tiny PHP CLI Instead of Faker

#php #cli #database #tutorial

Seed Data Without a Dependency Tree: A Tiny PHP CLI Instead of Faker

I needed fifty users and two hundred posts in a fresh Postgres schema so a dashboard demo wouldn't look empty. The Laravel seeder doesn't work outside Laravel. Faker works everywhere but pulls in a tree. PHP's json_decode is already there. So I wrote the smallest possible thing that could walk a JSON config and emit INSERT statements — with foreign keys, deterministic seeds, and dialect-aware escaping — and it turned out to be about 600 lines.

📦 GitHub: https://github.com/sen-ltd/db-seed-faker

Every web framework has an opinion about seed data. Laravel has Seeder classes plus Factory definitions. Rails has fixtures (YAML) and factory_bot. Django has dumpdata / loaddata. Spring has data.sql. They all work well inside their ecosystem and are all useless the minute you're outside it — which is the situation you find yourself in when the job is "I have a fresh Postgres schema for a side project, I want to psql -f seed.sql and have it look populated so I can screenshot the admin UI."

Faker libraries solve half of that, but they're big. fakerphp/faker is 1.2 MB installed, most of which is locale data you don't need. More importantly, they're a Composer dependency: now your tiny schema-populating script requires composer install and a lockfile in your side project. That's a lot of ceremony for a demo.

I wanted something I could run as docker run --rm -v $(pwd):/work db-seed-faker /work/seed.json > seed.sql and forget. No install. No framework. A config file that reads top-to-bottom like a TypeScript interface. Deterministic output when I pass --seed 42, because diffing generated SQL is the whole point of using a config file in the first place.

This is a walkthrough of the build. Two previous SEN tools appear in it — csv-to-sql (#135) for the SQL escape rule, and the id-generator article (#004) for the "just write the algorithm yourself, it's 20 lines" attitude. Neither is required reading; this post is self-contained.

The config shape

Here's a config that generates 50 users and 200 posts belonging to them:

{
  "tables": [
    {
      "name": "users",
      "count": 50,
      "columns": [
        {"name": "id", "type": "int", "generator": "increment", "start": 1},
        {"name": "name", "type": "string", "generator": "name"},
        {"name": "email", "type": "string", "generator": "email"},
        {"name": "role", "type": "string", "generator": "enum",
         "values": ["admin", "user", "guest"]},
        {"name": "age", "type": "int", "generator": "number", "min": 18, "max": 99}
      ]
    },
    {
      "name": "posts",
      "count": 200,
      "columns": [
        {"name": "id", "type": "int", "generator": "increment", "start": 1},
        {"name": "user_id", "type": "int", "generator": "foreign_key",
         "table": "users", "column": "id"},
        {"name": "title", "type": "string", "generator": "sentence", "words": 5},
        {"name": "body", "type": "text", "generator": "paragraph", "sentences": 3}
      ]
    }
  ]
}

Two design choices are visible here that drove the rest of the implementation.

First: JSON, not YAML. The original spec for this was YAML because everyone loves YAML for configs. But PHP 8 doesn't ship a YAML parser in its standard library — yaml_parse() requires the PECL yaml extension — and every time I've written a 100-line YAML parser by hand I've regretted it about six hours later when someone puts a colon in a value. JSON is built into json_decode, errors are mechanical, and the config is the same size. JSON it is.

Second: table order is load order. Notice that posts comes after users, and the foreign_key column on posts references users. The config loader enforces that any foreign key must reference a table declared earlier in the file. I deliberately did not add a topological sort. Two reasons:

The source file should show dependency order to a reader skimming top-to-bottom. Implicit sort makes that harder.
Circular foreign keys can't be expressed this way, which is a feature. Seed data shouldn't have circular dependencies.

The second point deserves a beat: real schemas sometimes have circular FKs (employee → manager → employee), but seed data for them is almost always generated by inserting rows with NULL parents first, then updating. If you need that, use a real migration tool. This isn't it.

Generator interface

Every generator is an object that knows how to produce one cell value per call:

<?php
namespace SenLtd\DbSeedFaker\Generators;

use SenLtd\DbSeedFaker\Rng;

interface GeneratorInterface
{
    /**
     * @param array<string, list<array<string, string|int|null>>> $context
     */
    public function generate(int $rowIndex, array $context, Rng $rng): string|int|null;
}

Three parameters: the 0-based row index inside the current table, the already-generated row data for previous tables (keyed by table name), and a seeded RNG. The return type is narrow — string|int|null — because every SQL target I care about round-trips bools as integers, and floats are never used for seed data in practice (what would a random float seed even mean for a demo?).

Each generator reads its config once at construction time via a GeneratorFactory:

return match ($kind) {
    'increment' => new IncrementGenerator($this->intOrDefault($column, 'start', 1)),
    'number' => new NumberGenerator(
        $this->requiredInt($column, 'min'),
        $this->requiredInt($column, 'max'),
    ),
    'name' => new NameGenerator(),
    'email' => new EmailGenerator(),
    'sentence' => new SentenceGenerator($this->intOrDefault($column, 'words', 6)),
    'paragraph' => new ParagraphGenerator(
        $this->intOrDefault($column, 'sentences', 3),
        $this->intOrDefault($column, 'words', 8),
    ),
    'datetime' => new DateTimeGenerator(
        $this->requiredString($column, 'from'),
        $this->requiredString($column, 'to'),
    ),
    'uuid' => new UuidGenerator(),
    'foreign_key' => new ForeignKeyGenerator(
        $this->requiredString($column, 'table'),
        $this->requiredString($column, 'column'),
    ),
    'enum' => new EnumGenerator($this->requiredList($column, 'values')),
    default => throw new \InvalidArgumentException("unknown generator: {$kind}"),
};

The match expression is the entire registry. Adding a new generator means adding one line here and one file under src/Generators/. I considered making the registry user-extensible via a plugin system — the kind of thing where you write new Faker\Provider\Internet(...) and it discovers itself. I backed off. The set of ten built-in generators covers every seed file I've ever actually written; any extension past that is better expressed by editing the source.

The foreign-key generator

This is the generator where all the interesting behaviour lives, because it's the one that forced the whole "keep row context in memory" decision. Here's the code in full:

<?php
namespace SenLtd\DbSeedFaker\Generators;

use SenLtd\DbSeedFaker\Rng;

final class ForeignKeyGenerator implements GeneratorInterface
{
    public function __construct(
        private readonly string $table,
        private readonly string $column,
    ) {
    }

    public function generate(int $rowIndex, array $context, Rng $rng): string|int|null
    {
        if (!isset($context[$this->table])) {
            throw new \RuntimeException(
                "foreign_key: no generated rows for table '{$this->table}'"
            );
        }
        $rows = $context[$this->table];
        if ($rows === []) {
            throw new \RuntimeException(
                "foreign_key: table '{$this->table}' has zero rows"
            );
        }
        $pick = $rng->intRange(0, count($rows) - 1);
        $row = $rows[$pick];
        if (!array_key_exists($this->column, $row)) {
            throw new \RuntimeException(
                "foreign_key: column '{$this->column}' not found in '{$this->table}'"
            );
        }
        return $row[$this->column];
    }
}

The emitter feeds context as it walks tables. For every row it generates, it stores the full row keyed by column name in $context[$tableName][$rowIndex]. When ForeignKeyGenerator::generate() is called during a later table's pass, it picks a uniform random row from the target and returns the requested column's value.

This in-memory cache is the reason the tool doesn't stream. For seed data generation that's fine — even 100k parent rows × an integer id per row is a few MB. The tool will happily run on a laptop for that size. I put a note in the README: "not a load generator." If you need millions of rows, use Postgres's generate_series directly.

One thing the FK generator deliberately does not do: clumping. Every parent row has a uniform chance of being picked, so for a 200-posts / 50-users seed you get ~4 posts per user on average with natural variance. For a demo that looks more realistic than forcing exactly 4-per-user, and it's also one fewer knob.

Escaping: the one place dialects disagree

The escaper is a port of the same rule used in csv-to-sql (#135). The reason that rule survived into this project unchanged is that it's the only safe choice across every SQL dialect I care about:

// String path. Reject embedded NUL; double every single quote.
if (strpos($value, "\0") !== false) {
    throw new \InvalidArgumentException('NUL byte in value cannot be escaped');
}
return "'" . str_replace("'", "''", $value) . "'";

That's the string-literal side. MySQL's documentation cheerfully tells you \' is valid inside a string literal. It is — except when sql_mode contains NO_BACKSLASH_ESCAPES, in which case \' is a literal backslash followed by a closing quote, and your insert statement is now a SQL injection hole that opens up under a particular server configuration. The doubled-quote form ('') is the SQL standard, works on MySQL with any sql_mode, works on Postgres, works on SQLite, and has zero configuration surface.

Where dialects actually disagree is identifier quoting and bool literals, so those get a branch:

public function ident(string $name): string
{
    if ($name === '' || strpos($name, "\0") !== false) {
        throw new \InvalidArgumentException('invalid identifier');
    }
    if ($this->dialect === self::DIALECT_MYSQL) {
        return '`' . str_replace('`', '``', $name) . '`';
    }
    // postgres + sqlite both use standard double-quoted identifiers.
    return '"' . str_replace('"', '""', $name) . '"';
}

Postgres bool literals are TRUE/FALSE; MySQL and SQLite both accept 1/0 (and MySQL accepts TRUE as a synonym too, but I'd rather pick the form that's portable). That's the entire dialect surface. No SQL generation code downstream needs to know which dialect it's in — it asks the escaper and moves on.

Determinism: a seeded RNG, not `random_bytes`

Every generator flows randomness through a single Rng instance:

final class Rng
{
    private int $state;

    public function __construct(int $seed)
    {
        $this->state = $seed & 0x7fffffff;
        if ($this->state === 0) {
            $this->state = 1;
        }
    }

    /**
     * Park-Miller minimal LCG. Deliberately simple — we're generating fake
     * names, not cryptographic material. Guaranteed identical output across
     * all PHP versions >= 8.2 because there's zero implementation leeway.
     */
    public function next(): int
    {
        $this->state = ($this->state * 48271) % 0x7fffffff;
        return $this->state;
    }

    public function intRange(int $min, int $max): int
    {
        if ($min > $max) throw new \InvalidArgumentException("min > max");
        if ($min === $max) return $min;
        $span = $max - $min + 1;
        return $min + ($this->next() % $span);
    }
}

Park-Miller is 50 years old and has well-known limitations for simulation work. It is not cryptographic. For seed data that's perfect, because the whole point is that --seed 42 produces bit-for-bit identical SQL today, tomorrow, and on a fresh container image six months from now.

I started with mt_rand seeded by mt_srand(), but that has two problems. First, mt_rand's output has changed across PHP point releases historically, and I didn't want to audit which. Second, mt_srand touches a global — two tests in the same PHPUnit process can clobber each other's seeds. An instance-scoped Park-Miller sidesteps both problems in about twelve lines.

The UUID generator is the only place where you might think twice about this trade-off. I documented it explicitly in the code:

// This makes the values _not_ cryptographically random — they must not
// be used as security tokens — but for seed data that's the entire point:
// you want diffable SQL output.

If you need real UUIDs in your seed, use random_bytes(16) and forfeit determinism. For fake demo data, I'll take reproducibility.

Trade-offs I made on purpose

A few things you'll notice are missing from this tool, in decreasing order of "someone will ask why":

No locale support. Names come from two CC0 lists of 50 English first and last names, hand-curated. Japanese names, Arabic names, anything non-ASCII — not supported. If you need those, fork and replace the data/*.txt files. A real faker ships hundreds of thousands of lines of locale data; I ship 100.
No relational constraints beyond FK count. You can't say "every user has between 2 and 10 posts." The FK generator picks uniformly. This is a deliberate non-feature: adding it multiplies the config surface and almost nobody actually cares for demo data.
No streaming. Row context is held in memory because FK generation needs it. For a million-row seed, go use pg_bench or write raw SQL with generate_series. This tool is for 10–100k rows.
No schema generation. No CREATE TABLE. Use a migration tool. The rows-vs-schema separation is the one thing every real project gets right and every demo tool gets wrong by merging them.
No unique-constraint awareness. The email generator has a two-digit disambiguator tail to reduce collisions, but it can't guarantee 50 unique emails. If your schema has UNIQUE(email) and you generate 1000 users, you'll hit collisions eventually. Fix: generate more rows or use increment in the email.
No Faker compatibility layer. Tempted, skipped. Faker's provider interface is complex, and porting it would have doubled the code.

Try it in 30 seconds

git clone https://github.com/sen-ltd/db-seed-faker
cd db-seed-faker
docker build -t db-seed-faker .

cat > seed.json << 'EOT'
{
  "tables": [{
    "name": "users",
    "count": 5,
    "columns": [
      {"name": "id", "type": "int", "generator": "increment", "start": 1},
      {"name": "name", "type": "string", "generator": "name"},
      {"name": "email", "type": "string", "generator": "email"}
    ]
  }]
}
EOT

docker run --rm -v $(pwd):/work db-seed-faker /work/seed.json \
  --dialect postgres --seed 42

And you get five INSERT INTO "users" rows printed to stdout that you can pipe into psql. Same seed, same output, every time.

What I got out of writing it

Three things beyond the tool itself.

First: the configurability spectrum has a sweet spot at "JSON file + registry of ten generators." It's small enough that a new contributor can read the whole thing in an afternoon, large enough that it actually covers real demo needs. Every time I was tempted to add a new axis of configuration (locales, providers, clumping, schema generation) I checked whether it was really needed, and the answer was usually no.

Second: the SQL escape rule from csv-to-sql (#135) keeps paying dividends. Any time I need to emit SQL from PHP, I now reach for the same '' doubling and know it's safe. That's the kind of pattern reuse that makes the 100-projects portfolio worth doing — you build one small tool that gets one thing right, and six projects later the same module drops in cleanly.

Third: determinism is cheap and worth it. Twelve lines of Park-Miller, a single seed flag, and now your seed data diffs cleanly in git. I can't think of a reason to ever generate non-deterministic fake data for demos. The default (seed=1) is fine; the --seed flag exists mostly so you can generate a second dataset that differs from the first if you want to show before/after screenshots.

Lines of code

About 600 lines total (including tests), zero runtime dependencies. The Docker image is 51 MB — php:8.2-cli-alpine is the floor and there's almost nothing on top. 49 PHPUnit tests covering escapers per dialect, each generator happy/sad paths, FK resolution order, deterministic seed reproduction, CLI exit codes, file output. You could port the entire thing to Go or Python in an afternoon; the point is that PHP's stdlib was already good enough.

Sometimes the answer to "what dependency should I use for this" is "none, I'll write it."

DEV Community

Seed Data Without a Dependency Tree: A Tiny PHP CLI Instead of Faker

Seed Data Without a Dependency Tree: A Tiny PHP CLI Instead of Faker

The config shape

Generator interface

The foreign-key generator

Escaping: the one place dialects disagree

Determinism: a seeded RNG, not `random_bytes`

Trade-offs I made on purpose

Try it in 30 seconds

What I got out of writing it

Lines of code

Top comments (0)

Seed Data Without a Dependency Tree: A Tiny PHP CLI Instead of Faker

The config shape

Generator interface

The foreign-key generator

Escaping: the one place dialects disagree

Determinism: a seeded RNG, not random_bytes

Trade-offs I made on purpose

Try it in 30 seconds

What I got out of writing it

Lines of code

Determinism: a seeded RNG, not `random_bytes`