DEV Community

Sander Muller
Building and battle-testing a Laravel package with AI peers

I built laravel-fluent-validation, a fluent rule builder for Laravel. Magic strings like 'required|string|max:255' have always bothered me. I tried PRing expansions to Laravel's fluent API, but even small additions got closed with the usual answer: release it as a package instead. So I did.
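To make the contrast concrete, here's a sketch of both styles inside a FormRequest. The string form is plain Laravel; the fluent calls illustrate the idea rather than the package's exact API, so check its README for the real method names.

```php
// String rules: terse, but typo-prone and opaque to static analysis.
public function rules(): array
{
    return [
        'name'  => 'required|string|max:255',
        'email' => 'required|email|unique:users,email',
    ];
}

// Fluent rules: each constraint is a discoverable, analyzable method call.
// (Method names below are illustrative, not necessarily the package's API.)
public function rules(): array
{
    return [
        'name'  => FluentRule::string()->required()->max(255),
        'email' => FluentRule::email()->required()->unique('users', 'email'),
    ];
}
```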

Along the way I also fixed a performance problem with wildcard validation and built a Rector companion for automated migration.

The interesting part wasn't the package itself. It was the workflow that built and hardened it.

I used four Claude Code sessions. One owned the package, three owned real Laravel codebases that were adopting it. They reviewed each other's work through claude-peers, a peer messaging MCP server. The codebase peers would test, hit edge cases, report back. The package peer would fix, tag a release, and the codebase peers would re-verify. This compressed release-and-feedback loops from days to minutes.

The Rector companion went through eight functional releases in about 24 hours this way. 108 files converted on one codebase, net -1,426 lines of code, 566 tests green after migration with no behavioral regressions observed. But the Rector cycle is just the most compressed example. The same method shaped the performance benchmarks, the Livewire integration, the error messages, the documentation.

The examples below are Laravel-specific, but the method isn't. Isolated AI agents become far more useful when they review changes against multiple real environments with automated verification.

The workflow

claude-peers is an MCP server for Claude Code. Each instance running on your machine can discover other instances, see what they're working on, and send messages. They don't share context. Each has its own conversation with full codebase access.

In practice it works like this: the package peer tags a new release. It sends a message to the three codebase peers saying "0.4.5 tagged, fixes the parallel-worker race, please re-verify." Each codebase peer receives the message, pulls the new version, runs the migration, runs their tests, and sends back results. If something breaks, the response includes the exact error, the file, and usually a theory about why. The package peer reads that, asks follow-up questions if needed, fixes the issue, and the loop continues.

One thing I didn't expect was how quickly the peers developed their own review dynamic. They would challenge each other's assumptions, ask for evidence, and sometimes reach consensus before coming back with a recommendation.

I had four terminals open:

  • The package repo, building features, writing tests, shipping releases
  • Three production codebases, each a real Laravel app with its own validation patterns, framework integrations, and test suites

Everything runs locally. Claude Code works on local clones of each codebase, with the same filesystem access you'd have in your terminal. No production servers, no remote environments, no secrets exposed to AI.

Why real codebases beat synthetic fixtures

Running against multiple codebases isn't about redundancy. Each one stresses a different part of the code.

The first app has 108 FormRequests and uses rules() as a naming convention on Actions and Collections, not just validation. The Rector's skip log grew to 2,988 entries and 777KB. The package author expected a near-empty log; at 108 files, it was unusable. On a smaller codebase, you'd never notice.

The same app also runs Filament alongside Livewire, and five of its components use Filament's InteractsWithForms trait, which defines its own validate() method. Inserting the package's trait would have created a fatal method collision on first form render. The right fix was to bail and flag those classes for manual review, since the Rector can't know whether the developer intends fluent validation or Filament's form validation.
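The collision itself is plain PHP trait semantics: two traits contributing the same method name is a fatal error when the class is loaded, which for an autoloaded Livewire component means first render. A minimal reproduction (the package trait name is invented; Filament's real trait is Filament\Forms\Concerns\InteractsWithForms):

```php
// Stand-in for the package's trait (name is hypothetical).
trait ValidatesFluently
{
    public function validate(?array $rules = null): array { /* package logic */ return []; }
}

// Stand-in for Filament's InteractsWithForms, which declares validate() too.
trait InteractsWithForms
{
    public function validate(?array $rules = null): array { /* form logic */ return []; }
}

class CheckoutPage
{
    use InteractsWithForms;
    use ValidatesFluently;
    // PHP Fatal error: trait method validate() has not been applied,
    // because there are collisions with other trait methods.
}
```

PHP offers `insteadof` to resolve such collisions, but a Rector can't pick a winner on the developer's behalf, which is why bailing and flagging was the right call.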

The second app runs 15 parallel Rector workers. The skip log's "truncate on first write" flag was per-process, so every worker thought it was first and wiped the others' entries. Synthetic test fixtures run single-process. This bug doesn't exist there.
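A per-process "have I truncated yet?" boolean can't coordinate 15 workers. One standard fix is to move the first-write decision into the filesystem itself; here's a sketch with file names and structure that are illustrative, not the package's actual code:

```php
// Sketch: make a shared log safe under parallel workers by truncating at
// most once per run (guarded by an exclusive lock on a sentinel file) and
// otherwise only appending.
final class SkipLog
{
    public function __construct(private string $path) {}

    public function write(string $entry): void
    {
        $sentinel = $this->path . '.truncated';

        // flock() on a sentinel guarantees only the first worker truncates;
        // a per-process boolean flag resets in every worker and wipes the log.
        $handle = fopen($sentinel, 'c');
        if (flock($handle, LOCK_EX)) {
            if (fstat($handle)['size'] === 0) {
                file_put_contents($this->path, ''); // first writer: truncate
                fwrite($handle, '1');               // mark truncation as done
            }
            flock($handle, LOCK_UN);
        }
        fclose($handle);

        // All workers append under LOCK_EX so entries never interleave.
        file_put_contents($this->path, $entry . PHP_EOL, FILE_APPEND | LOCK_EX);
    }
}
```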

The third app was already on fluent validation with only 7 files left to convert. They tracked Pint code-style fixer counts across releases as an acceptance metric, and found that 5 of their 7 Livewire files had #[Validate] attributes coexisting with explicit validate([...]) calls. Dead-code attributes the package author hadn't anticipated. That drove a whole new hybrid-detection path.
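The hybrid pattern looks roughly like this. Livewire's #[Validate] attribute is real; the component and rules are invented for illustration:

```php
use Livewire\Attributes\Validate;
use Livewire\Component;

class ProfileForm extends Component
{
    // Attribute-based rule...
    #[Validate('required|email')]
    public string $email = '';

    public function save(): void
    {
        // ...shadowed by an explicit call that supplies its own rules.
        // The attribute above is effectively dead code, and a naive Rector
        // that only converts one of the two would change behavior.
        $this->validate([
            'email' => 'required|email|unique:users,email',
        ]);
    }
}
```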

None of these were likely to surface in a fixture-based test suite.

What automated tests still missed

The first app tracked firing counts across every release: how many times each Rector rule fired on its 108-file corpus. On one release, the trait-insertion rectors fired zero times. Rector still reported "108 files changed" because the converter rules worked fine. A tester checking that output would have shipped it. The peer tracking counts caught that "108 to 0 on trait rectors" was a regression. The fix landed the same day, and expected counts became a permanent test.

One peer asked a question during a retrospective: "You've tested that the Rector output parses. Have you tested that the runtime semantics match?" Nobody had asked this in nine releases. It led to 16 parameterized test cases asserting that FluentRule and string-form rules produce identical error messages. All 16 passed. But those tests only exist because a peer who didn't write the code asked "prove it."
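That question translates into tests shaped roughly like this sketch, which lives in a Laravel TestCase and uses PHPUnit data providers. The data-provider pairs and the FluentRule call are illustrative, not the package's actual test code:

```php
use Illuminate\Support\Facades\Validator;
use PHPUnit\Framework\Attributes\DataProvider;

// Runtime-parity check: the same invalid input must produce identical
// error messages whether the rule is a string or a fluent builder.
#[DataProvider('rulePairs')]
public function test_fluent_and_string_rules_yield_identical_messages(
    string $stringRule,
    mixed $fluentRule,
): void {
    $input = ['field' => str_repeat('x', 300)]; // fails e.g. max:255

    $stringErrors = Validator::make($input, ['field' => $stringRule])
        ->errors()->all();
    $fluentErrors = Validator::make($input, ['field' => $fluentRule])
        ->errors()->all();

    $this->assertSame($stringErrors, $fluentErrors);
}
```

Parsing parity says the converted code compiles; this asserts the stronger property that validation behaves the same at runtime.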

What peers changed at the design level

Before one release, the package peer was weighing whether to expand detection to handle new Password() constructor calls inside rule arrays. It sounded reasonable, more complete conversion, 30-60 minutes of work. A codebase peer killed it with one observation: the converter is context-free. It runs inside rules() methods and inside attribute arguments. Any expansion would fire in both contexts, silently rewriting code where the developer chose the constructor form intentionally. No test was failing. The feature would have worked in the narrow case it was designed for. The peer prevented it by naming a failure mode the author hadn't considered.

All three codebases reported near-zero ternary rules ($condition ? 'required' : 'nullable'), which was enough to shelve the feature on demand alone. But one peer added a reframe: developers who reach for ternaries in rule arrays are optimizing for terseness, and the closure-form fluent version loses on that axis by construction. Even with demand, the feature might make its target audience's code worse. That moved it from "deferred" to "won't fix."

In both cases, the peer contributed framing, not just evidence.

What made this work

Each Claude instance has full codebase access and its own conversation history. The package peer knows the internals. The codebase peers know their app's patterns, test suites, and integrations. Nobody has to context-switch.

The codebases were real, not demo fixtures. Every bug described above required production-level complexity that doesn't exist in test scenarios.

Automated verification made the loop objective. The package runs PHPStan at level max, Rector, and Pint on every change, backed by 616 tests and 1,235 assertions. Each codebase peer runs the same stack. When a peer reports "PHPStan clean, 566 tests green, Pint fixer count down from 3 to 2," you can trust the result.

The back-and-forth was fast because it stayed in the same session. Tag a release, three codebases verify, issues come back with exact errors and hypotheses, fix ships, re-verify. The whole cycle in 15-30 minutes. GitHub issues lose context between messages. These peers kept corpus knowledge across every release.

And the peers could challenge scope, not just report failures. The new Password() conversation and the ternary-rule reframe both came from peers who could say "I don't think you should build this" with technical reasoning.

What this workflow costs

Running four Claude Code sessions in parallel means watching your weekly usage limits and session caps burn in front of your eyes. It's worth it for a focused release cycle, but you feel the cost. For a solo contributor, the same process works across sequential sessions. You'd lose the synchronous loop but keep the corpus context.

The workflow also has a blind spot: if all test codebases share the same architectural assumptions, peers can miss the same category of bug together. The three-codebase model worked here because each app had genuinely different patterns: scale, parallel execution, hybrid Livewire attributes. If all three had been small Livewire apps, the skip-log volume and parallel-worker bugs would have shipped uncaught.

When I would and wouldn't use this

I'd use this workflow for packages or tools that modify other people's code: Rector rules, code generators, migration tools, linters. The cost of a silent-rewrite bug is high, and running against codebases you didn't write is the most reliable way to catch them before release.

I'd also use it for packages with integration surface across frameworks. Livewire, Filament, and Inertia all have their own quirks. A peer running on a codebase that actually uses Filament + Livewire together will find trait conflicts and method collisions that your test suite won't.

For a simpler utility package with a narrow API surface, I'd scale it down. One project peer instead of three. You still get the "does this actually work in someone else's codebase" signal without the overhead of a full multi-peer setup.

The surprising part was that multiple isolated peers, each grounded in a different real codebase, acted more like an internal design-and-QA loop than an autocomplete tool. That changed what got built, what got cut, and what got tested.


The package: laravel-fluent-validation -- fluent validation rule builders with up to 160x wildcard performance gains, full Laravel parity, Livewire and Filament support.

The Rector companion: laravel-fluent-validation-rector -- automated migration from string rules. 108 files converted on one production codebase, -1,426 LOC, 566 tests green.

The peer messaging: claude-peers

AI skills for Laravel packages: package-boost -- ships migration guides, optimization hints, and framework-specific gotchas alongside your package so each peer has context without manual setup.
