When Your App and the Gateway Disagree: Orphan Cleanup and State Reconciliation

#laravel #devops #php #architecture

Anyone who manages an external system from a Laravel app eventually hits the same uncomfortable truth: your database thinks one thing, the external system thinks another, and nobody told either of them they'd drifted apart. Today I spent most of my time on the unglamorous-but-essential side of running an API gateway from Laravel — making the app's view of the world and the gateway's actual state agree, and giving myself safe tools to fix it when they don't.

Two themes came out of it that are worth writing down: orphan cleanup (objects that exist on the gateway but nothing in my app tracks anymore) and key-material sync (keeping JWKS in step with the credentials the app issued). Both are really the same problem wearing different hats — reconciliation across a boundary you don't fully control.

Why drift happens at all

When you provision a service, route, or consumer on a gateway, you're writing to two systems: your own database row, and the gateway's config via its admin API. The moment those two writes aren't atomic (and across a network they never are), you have a window where they can disagree.

A few normal, blameless ways it happens:

A create succeeds on the gateway but the follow-up DB write fails (or vice versa).
Someone fixes something by hand directly on the gateway during an incident, and the app never finds out.
A delete removes the local row but the gateway call times out and silently leaves the object behind.
A half-finished migration from an older gateway leaves objects nobody is tracking.

None of these are bugs exactly; they're the cost of coordinating two systems. The mistake is pretending it won't happen. So I don't assume the gateway mirrors my database: I treat its actual state as something I reconcile against.
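To make that window concrete, here is a minimal in-memory sketch of the third case above, a delete whose gateway call times out. FakeGateway, the $rows array, and deprovision() are hypothetical stand-ins I made up for illustration, not the real client or an Eloquent model.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the gateway admin API; not a real client.
final class FakeGateway
{
    /** @var array<string, true> ids that currently exist on the gateway */
    public array $objects = [];
    public bool $timeoutOnDelete = false;

    public function delete(string $id): void
    {
        if ($this->timeoutOnDelete) {
            // The request never lands, so the object stays on the gateway.
            throw new RuntimeException("Timed out deleting {$id}");
        }
        unset($this->objects[$id]);
    }
}

/**
 * Naive deprovision: drop the local row first, then call the gateway.
 * $rows stands in for the app's database table (gateway id => true).
 */
function deprovision(string $id, array &$rows, FakeGateway $gateway): void
{
    unset($rows[$id]);              // write 1: local state
    try {
        $gateway->delete($id);      // write 2: remote state, may fail
    } catch (RuntimeException $e) {
        // Swallowing the error is what makes the drift silent.
    }
}

$rows = ['svc_123' => true];
$gateway = new FakeGateway();
$gateway->objects = ['svc_123' => true];
$gateway->timeoutOnDelete = true;

deprovision('svc_123', $rows, $gateway);

// The app has forgotten svc_123, the gateway has not: that is an orphan.
$orphans = array_keys(array_diff_key($gateway->objects, $rows));
print_r($orphans);
```

Nothing in this sketch is exotic: two writes, one of which failed quietly. The only way to find the leftover is to compare both sides afterwards.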

Orphans: things on the gateway my app forgot

An orphan is any object living on the gateway that has no corresponding record on my side. They're not harmless — an orphaned route can still serve traffic, an orphaned consumer can still hold valid credentials. That's a security and billing problem, not just untidiness.

The detection is conceptually simple: list what the gateway has, list what I have, and diff. The discipline is in how you act on the diff. I expose this as a tool, and the key design decision is that listing and deleting are two separate steps with a human (or an explicit confirmation) in between.

final class GatewayOrphanCleanup
{
    public function __construct(
        private GatewayClient $gateway,
        private ServiceRepository $services,
    ) {}

    /** @return array<int, Orphan> objects on the gateway we don't track. */
    public function detect(): array
    {
        $tracked = $this->services->allGatewayIds();      // ids we know about
        $remote  = $this->gateway->listServices();        // what actually exists

        return collect($remote)
            ->reject(fn (array $obj) => in_array($obj['id'], $tracked, true))
            ->map(fn (array $obj) => new Orphan(
                id: $obj['id'],
                name: $obj['name'] ?? null,
                createdAt: $obj['created_at'] ?? null,
            ))
            ->values()
            ->all();
    }
}

Notice what detect() does not do: it doesn't delete anything. It returns a description of the drift. The deletion is a deliberate second call that takes a specific id you got from the detect step:

public function delete(string $gatewayId): Result
{
    // Refuse to delete anything we actually track — that path goes
    // through the normal action + approval flow, never orphan cleanup.
    if ($this->services->existsByGatewayId($gatewayId)) {
        return Result::refused('Not an orphan — use the managed delete flow.');
    }

    $this->gateway->deleteService($gatewayId);

    return Result::ok("Removed orphaned object {$gatewayId}.");
}

That guard clause is the whole point. The most dangerous bug in a cleanup tool is one that "cleans up" something that wasn't actually an orphan. So the delete path re-checks that the target really is untracked before it touches the gateway. Detect-then-confirm beats a single "sync everything" button you can't take back — especially when, increasingly, the thing calling these tools is an AI agent rather than a careful human reading the list twice.
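Here is a small sketch of why the re-check matters even when a human did read the list: the list can go stale between the two steps. It works over plain arrays; InMemoryCleanup and its $tracked and $remote properties are hypothetical stand-ins for the database and the gateway, not the classes shown above.

```php
<?php
declare(strict_types=1);

/**
 * Two-step cleanup over plain arrays. $tracked and $remote are hypothetical
 * stand-ins for the app's database and the gateway's object list.
 */
final class InMemoryCleanup
{
    /**
     * @param list<string> $tracked ids the app knows about
     * @param list<string> $remote  ids that exist on the gateway
     */
    public function __construct(public array $tracked, public array $remote) {}

    /** Step 1, read-only. @return list<string> */
    public function detect(): array
    {
        return array_values(array_diff($this->remote, $this->tracked));
    }

    /** Step 2, destructive: re-checks its precondition before acting. */
    public function delete(string $id): bool
    {
        if (in_array($id, $this->tracked, true)) {
            return false; // tracked by now, so no longer an orphan
        }
        $this->remote = array_values(array_diff($this->remote, [$id]));

        return true;
    }
}

$cleanup = new InMemoryCleanup(tracked: ['svc_123'], remote: ['svc_123', 'svc_999']);

$candidates = $cleanup->detect();       // ['svc_999']

// Between listing and confirming, the app starts tracking svc_999
// (for example, a delayed provisioning job finally writes its row).
$cleanup->tracked[] = 'svc_999';

$deleted = $cleanup->delete($candidates[0]);
var_dump($deleted);                     // bool(false)
```

The id came from a legitimate detect() call, and the delete is still refused, because the answer that counts is the one at the moment of deletion.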

Keeping JWKS honest

The second slice was credential visibility and key sync. When a gateway validates JWTs, it needs the public keys — usually exposed as a JWKS (JSON Web Key Set). If your app rotates or issues signing keys, the gateway's copy has to follow, or you get the worst kind of outage: tokens that are correctly signed but rejected because the verifier is looking at stale keys.

So I gave myself two complementary tools: one to see what credentials a consumer currently has, and one to sync the JWKS so the gateway's view matches what the app issued. Visibility first, because you can't safely sync what you can't inspect.

final class SyncConsumerJwks
{
    public function __construct(private GatewayClient $gateway) {}

    public function handle(string $consumerId, array $jwks): Result
    {
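        // Both maps must be keyed by key id (kid) for array_diff_key() to line up;
        // this assumes listJwtCredentials() returns kid => credential.
        $existing = $this->gateway->listJwtCredentials($consumerId);
        $desired  = $this->normalise($jwks);   // key-id => public key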

        // Add anything the gateway is missing.
        foreach (array_diff_key($desired, $existing) as $kid => $key) {
            $this->gateway->addJwtCredential($consumerId, $kid, $key);
        }

        // Remove keys the app no longer vouches for (revoked / rotated out).
        foreach (array_diff_key($existing, $desired) as $kid => $_) {
            $this->gateway->deleteJwtCredential($consumerId, $kid);
        }

        return Result::ok('JWKS reconciled.');
    }
}

Same shape as the orphan logic, right? List remote, compute the desired set, apply the difference in both directions. Sync isn't "push my state over theirs" — it's "make theirs match mine, adding and removing". Forgetting the removal half is how revoked keys keep working long after you thought you'd killed them.
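The normalise() step is not shown above, so here is one way it might look, assuming the input is a decoded JWKS document in the RFC 7517 shape (a top-level "keys" array of JWK objects). normaliseJwks() is a hypothetical helper, and the key values are elided placeholders.

```php
<?php
declare(strict_types=1);

/**
 * Turn a decoded JWKS document into a map of kid => JWK.
 * Hypothetical helper; the real normalise() may differ.
 *
 * @param  array{keys?: list<array<string, mixed>>} $jwks
 * @return array<string, array<string, mixed>>
 */
function normaliseJwks(array $jwks): array
{
    $byKid = [];
    foreach ($jwks['keys'] ?? [] as $jwk) {
        // "kid" is optional in RFC 7517, but a keyed diff cannot work without it.
        if (!is_array($jwk) || !isset($jwk['kid']) || !is_string($jwk['kid'])) {
            throw new InvalidArgumentException('Every key needs a string "kid".');
        }
        $byKid[$jwk['kid']] = $jwk;
    }

    return $byKid;
}

$desired = normaliseJwks(['keys' => [
    ['kty' => 'RSA', 'kid' => 'key-b', 'n' => '...', 'e' => 'AQAB'], // modulus elided
    ['kty' => 'RSA', 'kid' => 'key-c', 'n' => '...', 'e' => 'AQAB'],
]]);
$existing = ['key-a' => [], 'key-b' => []];   // what the gateway holds, by kid

$toAdd    = array_keys(array_diff_key($desired, $existing));   // ['key-c']
$toRemove = array_keys(array_diff_key($existing, $desired));   // ['key-a']
```

Rejecting keys without a kid is a deliberate choice in this sketch: silently skipping them would make the gateway's set look more complete than it is.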

A subtle but important call: I only remove a key when the app is sure it has rotated out, not merely because it is missing from the current request. The removal loop above is the strict form and is only safe when $jwks is the complete, authoritative set for that consumer. An accidental over-prune here locks out live clients. When in doubt, the sync is additive and flags the extras for review rather than deleting them.
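A rough sketch of that conservative variant, written as a pure planning function so it can be tested without a gateway. planJwksSync() and the idea of an explicit $revoked list are assumptions of mine for illustration; the app's real record of rotated-out keys may look different.

```php
<?php
declare(strict_types=1);

/**
 * Plan a conservative sync. All names here are hypothetical.
 *
 * @param list<string> $desired  kids the app currently vouches for
 * @param list<string> $existing kids present on the gateway
 * @param list<string> $revoked  kids the app has positively recorded as rotated out
 * @return array{add: list<string>, remove: list<string>, review: list<string>}
 */
function planJwksSync(array $desired, array $existing, array $revoked): array
{
    // Keys on the gateway that the current request does not mention.
    $extras = array_diff($existing, $desired);

    return [
        'add'    => array_values(array_diff($desired, $existing)),
        // Delete only what is both extra and known to be revoked.
        'remove' => array_values(array_intersect($extras, $revoked)),
        // Everything else stays in place and is surfaced to a human.
        'review' => array_values(array_diff($extras, $revoked)),
    ];
}

$plan = planJwksSync(
    desired:  ['key-b', 'key-c'],
    existing: ['key-a', 'key-b', 'key-x'],
    revoked:  ['key-a'],
);
print_r($plan);
```

Here key-a is removed because the app has recorded it as revoked, while key-x is merely unexplained, so it lands in the review bucket instead of being deleted.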

Why this is worth a Pest test

Reconciliation logic is exactly the kind of code that looks obviously correct and then quietly does the wrong thing six months later. The assertions worth writing are the boundaries: an object I track must never be classed as an orphan, and the sync must remove revoked keys, not just add new ones.

it('never flags a tracked object as an orphan', function () {
    Service::factory()->create(['gateway_id' => 'svc_123']);

    // fakeGatewayWith() is assumed to swap a fake GatewayClient into the container.
    fakeGatewayWith(['svc_123', 'svc_999']); // 999 is the real orphan

    $orphans = app(GatewayOrphanCleanup::class)->detect();

    expect($orphans)->toHaveCount(1)
        ->and($orphans[0]->id)->toBe('svc_999')
        ->and(collect($orphans)->pluck('id'))->not->toContain('svc_123');
});

it('refuses to delete a tracked object via the orphan path', function () {
    Service::factory()->create(['gateway_id' => 'svc_123']);

    $result = app(GatewayOrphanCleanup::class)->delete('svc_123');

    expect($result->refused())->toBeTrue();
});

The second test is the one I care about most. It encodes a rule that has nothing to do with the happy path and everything to do with not deleting the wrong thing: the orphan-cleanup door must stay shut for anything the app actually manages. If a future refactor weakens that guard, this test goes red before it ever ships.

The pattern underneath

Strip away the gateway specifics and there's one reusable idea here: when you mirror state into a system you don't fully own, build the reconciler before you need it, and make every destructive step re-verify its own precondition. Detect and act are separate. Sync goes both directions. Deletes prove the target is really safe to delete at the moment of deletion, not just when it was listed.

It's the same instinct as a good database constraint — you don't trust that callers will always be careful, you make the dangerous thing structurally hard to do by accident. Doubly so when the "caller" might be an AI agent reading your tool descriptions and deciding what to invoke.

Next on my list: writing a developer-facing guide for the cutover when you migrate from one gateway to another — the part where DNS, tokens, and consumer credentials all have to move without dropping live traffic. That's a reconciliation problem too, just on a much scarier clock.