DEV Community: Kingsley Onoh

Why I Froze Simulation Inputs Before the Solver Ran

Kingsley Onoh — Thu, 28 May 2026 13:24:24 +0000

A planner can approve the right transfer for the wrong reason.

That was the failure mode I kept coming back to while building the Inventory Allocation Simulator. The solver could be mathematically correct, the recommendation could have positive net value, and the UI could show a clean explanation. But if the explanation reread today's warehouse and SKU tables after yesterday's simulation finished, the audit trail would be fiction.

Inventory data changes constantly. A lane gets disabled. A SKU margin changes. Inbound units arrive. Demand history is corrected because a stockout period was recorded as zero sales. If a completed simulation depends on the current state of those tables, its story changes every time the business updates its planning data.

I was wrong to treat that as a reporting problem at first. It was a data contract problem.

The failure hiding in a normal design

The first design looked harmless: store a simulation run, run the worker, persist recommendations, and render the detail page by joining back to warehouses, SKUs, inventory, lanes, and policies. Most CRUD systems are written that way because joins are cheap and normalized tables keep data clean.

That design breaks the moment a simulation becomes evidence.

The recommendation is not just a row saying transfer 30 units. It needs to explain the constraint that bound the decision, the demand scenario that created the shortage, the service-level tail left unmet, and the tradeoffs accepted. In this project those fields live in explanation: binding_constraints, scenario_sensitivity, accepted_tradeoffs, net_value, and solver diagnostics.

If the explanation uses live tables, a planner can open the same completed run on Tuesday and see different supporting facts from Monday. That is worse than no explanation. It creates confidence in a record that no longer matches the decision.

The constraint I chose

I made simulation creation the boundary.

create_simulation_run! authorizes the planner, validates the scenario count, captures every planning surface needed by the solver, and stores it inside simulation_runs.input_snapshot. The worker consumes that snapshot. The detail page reads that snapshot. Demand scenarios are stored with the run. Completed runs do not ask the mutable catalog what the world looks like now.

The core function is small, which is the point:

function capture_simulation_input_snapshot(
    store::AbstractTenantAdminStore,
    ctx::TenantContext,
    policy_id,
)::NamedTuple
    authorize!(ctx, "run_cancel", "simulation")
    parsed_policy_id = _uuid_value(policy_id)
    policy = _snapshot_policy(store, ctx, parsed_policy_id)
    return (
        tenant_id = string(ctx.tenant_id),
        policy = policy,
        warehouses = [_warehouse_response(row) for row in fetch_warehouses(store, ctx.tenant_id, _snapshot_page())],
        skus = [_sku_response(row) for row in fetch_skus(store, ctx.tenant_id, _snapshot_page())],
        inventory_positions = [_inventory_response(row) for row in fetch_inventory_positions(store, ctx.tenant_id, _snapshot_page())],
        demand_history = [_demand_response(row) for row in fetch_demand_history(store, ctx.tenant_id, _snapshot_page())],
        transfer_lanes = [_lane_response(row) for row in fetch_transfer_lanes(store, ctx.tenant_id, _snapshot_page())],
    )
end

That snapshot is not elegant. It is intentionally blunt. The run carries the policy, warehouses, SKUs, inventory positions, demand history, and transfer lanes it used. SNAPSHOT_MAX_ROWS is set to 1_000_000, which tells you the tradeoff plainly: this is a batch planning system, not a real-time transfer executor.

What surprised me

The snapshot decision also fixed a forecasting bug before it could become a solver bug.

Stockout periods are dangerous because they make demand look low. In clean_demand_history, the system stores both observed units and adjusted units. If demand was zero and lost sales were 82, the cleaned demand is 82. That value feeds the scenario generator. The stockout row also inflates uncertainty, because a period with unavailable inventory is less trustworthy than normal sales.

The Batch 019 test mutates live inventory and demand after a run is created. Then it runs the worker and checks that the scenario baseline still comes from the frozen snapshot, not the updated live row. That was the test that made the architecture feel real. It did not just prove a function. It proved the system can remember what it believed when the recommendation was created.

The alternative was versioning every table. That would give better diff history, but it would also make the MVP harder to operate. I chose the snapshot because the project needed completed-run honesty more than a general temporal database. A future version could move to event-sourced planning records. This one needed a concrete audit boundary.

Where the solver fits

The solver reads the frozen snapshot and stored scenarios, then builds a JuMP model over lanes, SKUs, inventory, service level, warehouse capacity, transfer cost, and safety stock. The model is allowed to fail. In fact, readable failure is part of the contract.

A region rule can block every feasible transfer. A max transfer cost can make the plan infeasible. A timeout can return no acceptable incumbent. Those cases return diagnostics instead of pretending every scenario has a transfer.

That mattered for the Journal topic because failure diagnostics have to describe the same world the solver saw. The _constraint_report function can name max_transfer_cost_cents, region blocking, sender safety stock, and receiver service-level constraints. If those facts came from live tables, the failure message could drift too.

The result

The deterministic solver fixture produces a 30-unit transfer from WH-SURPLUS to WH-NEED, with lane_capacity and receiver_service_level as binding constraints and a 30,900-cent net value. The large benchmark ran 50 warehouses, two thousand SKUs, and 100 scenarios in 17,928.4753 ms and generated two thousand recommendations.

Those numbers matter less than the contract behind them. A completed run is not a report over current data. It is a preserved decision record. Once I made that boundary explicit, the rest of the system had somewhere honest to stand.

Why I Moved Redis Acknowledgement Outside the Database Transaction

Kingsley Onoh — Fri, 22 May 2026 10:22:55 +0000

One good parcel event, one bad parcel event, one batch. That was enough to lose the good one.

The consumer read both from Redis Streams. The first event was valid, so the app wrote a shipment event and a claim case, then acknowledged the Redis message. The second event had no tenant id, threw an exception, and rolled back the PostgreSQL transaction. Redis had already been told the first message was done.

The database said nothing happened. Redis said the message was gone.

That is the kind of bug that looks like an operations mystery later. A carrier says it sent the event. The stream no longer shows it as pending. The claim queue has no case. Everyone starts looking at logs, but the data has already contradicted itself.

The part I got wrong

I treated Redis acknowledgement as part of processing. That was too early.

The first version of DeliveryEventConsumerJob.pollOnce() processed each stream record inside a TransactionTemplate. It wrote through ShipmentEventIngestionService.upsertFromEvent(...), then acknowledged the message in the same loop. It felt clean because both actions sat inside the same method and both happened after the event handler returned.

But Redis is not inside the PostgreSQL transaction. XACK does not care whether the database later commits. Once Redis removes the message from the pending list, the recovery path changes. If the transaction rolls back after that ack, PostgreSQL loses the row and Redis loses the retry handle.

That is not a duplicate problem. It is a vanished work problem.

The codebase already had idempotency in the right place for duplicates. ShipmentEventIngestionService builds a tenant-scoped dedup key from the event source and takes a PostgreSQL advisory transaction lock before inserting into shipment_events. The table also has a (tenant_id, dedup_key) uniqueness constraint. A replay can create at most one shipment event row and one claim case.

I was wrong to optimize for avoiding replay. Replay was safe. Premature ack was not.

The shape of the fix

The fix was small, but the boundary it moved mattered.

public int pollOnce() {
    if (!Boolean.TRUE.equals(properties.enabled())) {
        return 0;
    }
    List<String> processedMessageIds = transactionTemplate.execute(status -> {
        if (!lockManager.tryAcquireTransactionLock(LOCK_NAME)) {
            return List.of();
        }
        streamClient.ensureConsumerGroup();
        String consumer = consumerName();
        List<DeliveryGatewayTrackingEvent> events = streamClient.readPendingEvents(
                consumer, BATCH_SIZE, Duration.ZERO);
        if (events.isEmpty()) {
            events = streamClient.readNewEvents(consumer, BATCH_SIZE, Duration.ofMillis(250));
        }
        List<String> messageIds = new ArrayList<>();
        Map<UUID, TenantContext> tenantsById = new HashMap<>();
        for (DeliveryGatewayTrackingEvent event : events) {
            process(event, tenantsById);
            messageIds.add(event.messageId());
        }
        return messageIds;
    });
    if (processedMessageIds == null || processedMessageIds.isEmpty()) {
        return 0;
    }
    streamClient.acknowledgeAll(processedMessageIds);
    return processedMessageIds.size();
}

The transaction now returns the message ids only after the database work succeeds. Only then does the job call acknowledgeAll.

That creates two possible failure states, and both are tolerable.

If the database fails, no ack happens. Redis still has the pending messages. The next poll retries them.

If the database commits and the ack fails, Redis may replay messages whose database rows already exist. That is fine. The dedup key, advisory lock, and unique index collapse the replay back to the existing shipment event. The system may do extra work, but it will not create a second claim.

This is the tradeoff I should have chosen from the start: duplicate work over lost work.

Why the batch cache came later

The ack fix uncovered a different problem. The 50 events/sec acceptance test passed in isolation, then failed during full regression. The batch took 1,805 ms against a 1,000 ms target.

At first glance, that looked like Redis overhead. It wasn't. The consumer was reading 50 messages and doing one batched ack, but process(...) still resolved the same tenant and integration setting once per event. One tenant, 50 events, 50 database lookups before the hot path even reached shipment ingestion.

The fix was not a global cache. A global tenant cache would create stale security behavior and make feature flags harder to trust. The right cache lived inside one poll batch:

Map<UUID, TenantContext> tenantsById = new HashMap<>();
for (DeliveryGatewayTrackingEvent event : events) {
    process(event, tenantsById);
    messageIds.add(event.messageId());
}

process(...) now uses computeIfAbsent to authorize each tenant once per poll. Every poll still checks whether the tenant has delivery-gateway enabled. Nothing survives across polls. The latency win comes from removing repeated reads, not weakening the gate.

What surprised me was how easy this bug was to miss. The system could pass duplicate-message tests, tenant-scope tests, and manual exception flow tests while still failing a real throughput target. It needed the acceptance test with 50 actual Redis messages and PostgreSQL writes to reveal the hidden cost.

The database boundary still does the hard work

The consumer job is only the outer shell. The correctness boundary lives in ShipmentEventIngestionService.upsertFromEvent(...).

That service takes the event, normalizes carrier, tracking number, status, timestamps, snapshots, and metadata. It builds an idempotency key under the tenant id. It locks that key with pg_advisory_xact_lock. It upserts the shipment, inserts the event with on conflict do nothing, updates the visible shipment status only when the event timestamp is not older, writes audit records, then asks ClaimAutoCaseService whether the status deserves a case.

Only three statuses create default cases: failed_attempt, returned, and exception. A delivered scan does not open a claim. An unknown scan does not open a claim. The tracking timeline records context, but operations work begins only when the event status represents an operational exception.

That separation matters because not every carrier signal should become work. A tracking system records facts. A claims system creates ownership.

The result

The regression test that mattered is blunt: publish a valid event and then an invalid event in the same Redis batch. Run pollOnce(). Assert that no shipment event and no claim case exist, and that both Redis messages remain pending.

That test now passes.

The throughput proof also passes: 50 delivered events process in under 1 second locally, and because they are delivered events, they create zero claim cases. The duplicate stream test publishes two messages with the same source event and gets one shipment event, one claim case, and zero pending Redis messages after successful processing.

The lesson here is not "ack after commit" as a slogan. The real rule is narrower: if one system owns retry visibility and another system owns business durability, the retry signal must not be cleared until the business fact is committed.

The Document Number Is Reserved Before the PDF Exists

Kingsley Onoh — Mon, 11 May 2026 08:21:48 +0000

The hard part of document numbering is not incrementing an integer.

It is deciding what happens when the integer is reserved, rendering starts, and the render fails.

PostgreSQL sequences are built for speed. They are not built for legal numbering. A sequence advances even if the surrounding transaction rolls back. For most applications that is fine. For invoices and board records, a gap is not invisible. If number 41 exists and number 43 exists, someone will ask what happened to 42.

I needed a numbering system that could do three things at once: serialize per entity, type, and year; let different sequences run in parallel; and preserve a reason when a number is skipped.

That became the two-phase allocator in src/documents/numbering.ts.

Phase one reserves the next number. The allocator locks the tuple (entity_id, document_type, year) with pg_advisory_xact_lock, checks for the oldest unclaimed gap, and only then advances the high-water mark. It writes the reservation to pending_allocations with a TTL.

Phase two either finalizes the reservation against a document row or releases it into sequence_gaps.

The core reservation shape is small:

await acquireAllocatorLock(c, entityId, documentType, year);

let reservedNumber = await tryReclaimOldestGap(c, entityId, documentType, year);
if (reservedNumber === null) {
  reservedNumber = await computeNextHighWater(c, entityId, documentType, year);
}

await c.query(
  `INSERT INTO pending_allocations
     (id, entity_id, document_type, year, reserved_number,
      reserved_by_api_key_id, reserved_at, expires_at, metadata)
   VALUES ($1, $2, $3, $4, $5, $6, now(),
           now() + ($7::text)::interval,
           $8::jsonb)`,
  [allocationId, entityId, documentType, year, reservedNumber, apiKeyId, ttl, metadata],
);

The advisory lock boundary matters. FZE invoices for 2026 serialize against other FZE invoices for 2026. LLC board resolutions do not wait on them. A document type has its own counter space because letterhead #1 and compliance letter #1 can both be real first documents in their own family.

The part I got wrong early was the SQL shape for reclaiming a gap.

The first version used an update with a subquery and a limit against the same table. PostgreSQL flattened the plan in a way that ignored the limit and updated every eligible row. That is the kind of bug that looks impossible until you inspect the affected rows. The fixed version uses a CTE target so the candidate set is materialized before the update runs.

const res = await c.query(
  `WITH target AS (
     SELECT id FROM sequence_gaps
      WHERE entity_id = $2 AND document_type = $3 AND year = $4
        AND reclaimed_at IS NULL
      ORDER BY reaped_at ASC
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
   UPDATE sequence_gaps
      SET reclaim_allocation_id = $1, reclaimed_at = now()
    WHERE id IN (SELECT id FROM target)
    RETURNING id, gap_number`,
  [null, entityId, documentType, year],
);

That one CTE is the difference between reclaiming one documented gap and mutating the whole backlog.

The reaper is the other half of the design. A reservation can expire before it is attached to a document. Maybe Puppeteer failed. Maybe the sidecar was down. Maybe storage rejected the upload. The service cannot just delete the reservation and pretend the number never happened. It moves the number to sequence_gaps with reason = 'reaper_swept'.

There are also explicit reasons: abandoned reservation, admin documented gap, manual void. That vocabulary is important because auditors do not need a philosophical explanation. They need a row that says why the number did not produce a document.

What surprised me was how much of the design was about not being too clever. I could have hidden this behind one nextDocumentNumber() helper and let failures be retried. That would make the happy path smaller and the audit story weaker. The split between pending allocation and sequence gap is more verbose, but the data tells the truth.

The same pattern shows up in the tests. The reaper-race gate is not a decorative concurrency test. It exists because the allocator is only correct under pressure: concurrent reservations, expired rows, reclaimed gaps, and failed finalization must all preserve one property. No two documents get the same number in the same sequence, and no missing number lacks a reason.

The document number is reserved before the PDF exists because rendering is fallible. The audit trail has to survive that fallibility.

The Audit Trail Is a Data Structure, Not a Log Message

Kingsley Onoh — Mon, 11 May 2026 08:21:05 +0000

Logs can explain what a service thought happened.

They do not prove what happened.

Klevar Docs needed an audit trail for rendered documents, invoice events, credit note applications, signatures, voids, and attachments. The usual answer is an events table. Insert a row whenever something happens. Add timestamps. Keep it forever. That is useful, but it is still just a table unless the table can detect tampering.

The hash chain is the difference.

Each entity gets its own ordered chain. A row stores the event payload, a SHA-256 hash of the canonical payload, the previous row hash, the chain index, and a link hash that binds those values together. If someone changes a payload, deletes a prior row, swaps entity rows, or reorders entries, verification fails.

The append path is transactional with the document change. That detail matters more than the hashing. If the document row commits and the chain row rolls back, the proof is incomplete. If the chain row commits and the document row rolls back, the proof references a thing that does not exist. insertChainEntry() is designed to run inside the caller's transaction.

The core logic is direct:

const allocRes = await tx.execute(
  sql`SELECT fn_allocate_chain_index(${entityId}::uuid)::text AS idx`,
);
const chainIndex = BigInt(allocRes.rows[0]!.idx);

let previousHash: string | null = null;
if (chainIndex > 1n) {
  const priorRes = await tx.execute(
    sql`SELECT payload_hash FROM document_hash_chain
         WHERE entity_id = ${entityId}::uuid
           AND chain_index = ${(chainIndex - 1n).toString()}::bigint`,
  );
  previousHash = priorRes.rows[0]!.payload_hash;
}

const chainLinkHash = computeChainLinkHash({
  content_hash: contentHash,
  previous_hash: previousHash,
  chain_index: chainIndex,
  entity_id: entityId,
});

There are two design choices hidden in that snippet.

The index comes from fn_allocate_chain_index(entity_id), not a PostgreSQL sequence. The same rollback problem that makes sequences wrong for legal document numbers also applies to chain indices. A verifier expects the chain to be contiguous. If index 19 is missing because a transaction rolled back after a sequence increment, the verifier cannot know whether that is harmless or tampering.

The link hash includes entity_id. That prevents a row from one entity being copied into another entity's chain without detection. Klevar has one group boundary, but the legal proof is per entity. FZE, LLC, and Ltd cannot share a chain just because the service is single-tenant.

The verifier is a walker, not a database query. It receives rows sorted by chain_index, checks ordering, checks the genesis row, checks each previous_hash, recomputes each link hash, and returns the first mismatch. It also reports intentionally broken indices. That last category is important because some retention or force-purge action may be documented rather than hidden. A broken chain can be honest if the break is recorded and visible.

What surprised me was the dependency on canonical JSON. Hashing JavaScript objects directly is a trap because key order and serialization details can drift. The service pins canonicalize@2.0.0 and runs an RFC 8785 boot assertion before Fastify binds a port. If canonicalization changes, the server refuses to start. That is not paranoia. It is the cost of using hashes as legal proof.

The hash chain also changed how I think about events. events_emitted is the integration outbox for Hub and Webhook Engine fanout. It is operational. document_hash_chain is proof. Those two surfaces overlap, but they are not the same thing. A notification can be retried, delayed, or dropped without changing the legal document. A chain append cannot be treated that way.

The biggest tradeoff is operational weight. A chain gives you another invariant to maintain, another verifier to run, another repair story to document, and another failure mode to alert on. The alternative is worse: a document archive that can produce files but cannot prove nobody altered the history.

For this system, the archive without the chain would be storage. The chain turns it into evidence.

Why a Rendered Invoice Can Still Fail Send

Kingsley Onoh — Mon, 11 May 2026 08:20:52 +0000

An invoice PDF can exist and still not be an invoice package.

That sentence shaped the send path. The easy implementation would render the invoice, store the PDF, try to build the XML, and log a warning if the compliance layer failed. The client still gets a document. The API still returns success. The business can "fix it later."

That is exactly the failure I did not want.

Klevar Docs treats e-invoicing as part of finalization, not as decoration after rendering. For an invoice that routes to Factur-X, the send path has to build CII XML, validate it, embed it into the PDF, convert the container to PDF/A-3b, validate that with veraPDF, upload the XML, and replace the PDF with the conformant output. If any hard requirement fails, the send fails.

The orchestrator is intentionally thin. buildComplianceArtifacts() does profile resolution and dispatches to a branch. The Factur-X branch owns the hard path. The XRechnung branch owns the public-sector German XML path.

The Factur-X branch is where the philosophy is visible:

const valid = validationRaw.valid === true;
if (!valid) {
  await markDocumentFacturXFailed(db, document.id);
  throw new FacturXValidationRejectError({
    reason: 'schematron_failed',
    errors: Array.isArray(validationRaw.errors) ? validationRaw.errors : [],
    invoice_id: invoice.id,
    factur_x_validation_status: 'fail',
  });
}

const fallbackResult = await embedAndValidateWithFallback(
  plainPdfBytes,
  build.xml,
  fallbackContext,
  fallbackDeps,
);

if (!fallbackResult.conformant) {
  throw new PdfARequiredRejectError({
    reason: 'pdf_a_non_conformant',
    stage_failed: fallbackResult.stage_failed,
    invoice_id: invoice.id,
  });
}

There are two details in that snippet that matter.

First, validation failure is a typed 422, not an internal server error. The invoice shape is wrong. The operator needs to fix the invoice, not restart the service. Missing BT fields, sidecar 4xx failures, and schematron failures all become the same class of business-visible rejection.

Second, the document row is marked as failed before throwing. That was a later correction. Without it, a failed send could leave the row looking cleaner than reality because the broader transaction never committed. The code now best-effort updates documents.factur_x_validation_status = 'fail' so a polling operator sees the truth after a rejection.

What surprised me was how much logic had to live before the sidecar call. I expected mustangproject to be the hard part. The harder part was building the DTO without lying. Seller name, seller VAT ID, registration ID, address, client identity, tax category, reverse charge handling, unit code, document type, original invoice reference, payment terms, issue date, due date. Every one of those fields has a business rule.

The builder is pure TypeScript for that reason. It reads frozen invoice, client, and entity snapshot data and returns a transport DTO for the Java sidecar. It does not query the database. It does not default from live entity state. It does not mutate the invoice.

That purity paid off when credit notes entered the path. A credit note is not just a negative invoice. It carries a different document type code and can need a preceding invoice reference. The builder got an explicit discriminator for invoice versus credit_note, and the sidecar maps that into the XML. The alternative was to infer from amounts, which would have been clever and wrong.

XRechnung is intentionally different. German public-sector invoices route to standalone XML. The human-readable PDF remains a companion, not the legal container. That means the branch skips PDF/A-3b embedding work and persists XML as the authoritative artifact. Today the validator status can be pass, fail, or skipped because the KoSIT validation lane had an acknowledged gap. I kept that explicit instead of pretending both branches had identical maturity.

The system now has an uncomfortable but correct behavior: it can render a beautiful PDF, then refuse to send it.

That is the point. A valid-looking artifact is not enough. The service needs to know whether the document is acceptable for the legal path it is taking. For a plain letterhead, PDF bytes may be enough. For an invoice that routes to Factur-X, the XML and PDF/A container are part of the document. If they fail, the document failed.

I would rather make the operator fix a rejection than let a client receive a file that only looks complete.

The PDF Looked Correct Because the Template Was Wrong

Kingsley Onoh — Mon, 11 May 2026 08:20:09 +0000

The first FZE letterhead looked fine.

That was the problem.

The rendered PDF had the right legal name, the right registration label, the right address, the right contact line, and the right visual structure. It passed the visual check because every value in the template matched the entity I was testing with. Then I looked at what would happen if the same bundle rendered an LLC document.

It would still say FZE.

That failure is more dangerous than a crash. A crash stops the send path. A wrong legal identity in a PDF can leave the system looking healthy while the document is unusable. The template authoring lane had hardcoded FZE identity strings because the snapshot did not expose the fields the template needed. The implementation had chosen the fastest path to a green render instead of stopping at the missing data contract.

I was wrong to treat the template as the place where legal identity could be finished. The template is presentation. The entity row is state. The document snapshot is the contract between them.

The fix started by widening the entity surface. captureEntitySnapshot() now freezes address, registration, contact, legal names, country code, VAT identifier, officer data, brand, banking config, retention posture, and document policy into the document row. It also redacts fields that should not land on PDFs, like private tax identifiers and operational banking rollout notes.

The load-bearing part is that the snapshot always returns stable shapes. Handlebars runs in strict mode. If a template asks for entity.address.single_line, the address path must exist even when the row is old. Empty object is better than undefined because an old document can still render through the same code path.

The function is blunt about that boundary:

export function captureEntitySnapshot(row: EntityRowForSnapshot): EntitySnapshot {
  if (typeof row.id !== 'string' || row.id.length === 0) {
    throw new ValidationError('entity_snapshot_invalid', { reason: 'id missing' });
  }

  return {
    id: row.id,
    entity_id: row.entity_id,
    name: row.name,
    type: row.type,
    brand: cloneJsonLike(row.brand),
    banking_config: redactBankingConfigForDocuments(row.banking_config),
    address: cloneJsonObject(row.address),
    registration: redactRegistrationForDocuments(row.registration),
    contact: cloneJsonObject(row.contact),
    legal_name: toNullableString(row.legal_name),
    country_code: toNullableString(row.country_code),
    vat_identifier: toNullableString(row.vat_identifier),
    officers: cloneJsonObjectArray(pickOfficers(row)),
  };
}

That code does not look dramatic. It is the reason a document can be re-rendered later without asking the live entities row what the company looks like today.

The surprise was CSS. The HTML literals were obvious once I started searching. The comments in the stylesheet were not. The composer injects CSS directly into a <style> block and hashes the full rendered HTML. A comment like Klevar FZE primary is still rendered output. It becomes part of the document hash. It can leak into the PDF byte stream. It can fail an entity-neutral regression even if the visible page looks correct.

That changed the template rule: anything inside the bundle is document output. Body HTML, shared partials, stylesheet comments, labels, helper inputs. None of it gets to carry entity identity unless it comes from the snapshot or the body payload.

The renderer also became stricter in a different direction. composeHtml() uses a private Handlebars instance for each render, not the global singleton. Helpers like formatCurrency, formatDate, markdown, eq, and officerRoleLabel live inside that private instance. That prevents a test, module, or future template family from registering a helper globally and changing another document type by accident.

The snapshot fix did not end at source code. The regression had to prove the failure could not return. The gate renders every authored bundle against multiple seeded entities and searches for identity strings from the wrong entity. If an LLC render contains an FZE registration value, the test fails. If a stylesheet comment leaks a company name, the test fails. If someone adds a new bundle and hardcodes a legal name because it is faster, the gate catches it.

The deeper lesson is that PDF generation is not the domain. Legal attribution is the domain.

A renderer that accepts a body and returns bytes is easy to build. A renderer that can explain where every legal identity field came from is the system. Once I saw that, the architecture became clearer: templates do not own identity, live rows do not own old documents, and snapshots do not carry private operational fields just because the database has them.

That distinction now runs through the rest of Klevar Docs. Factur-X builder data comes from the snapshot. Board resolution officer data can default from the snapshot. Payment details freeze at issue time. Hash-chain verification depends on content hashes that include the rendered output. If a lower layer cheats, every upper layer can look correct while proving the wrong thing.

The PDF looking correct was the warning. The system only became correct when the source of correctness moved out of the template.

Reachable Is Not the Same as Correct

Kingsley Onoh — Mon, 11 May 2026 08:19:56 +0000

The CLI could create a credit note.

That was the bad news.

The umbrella documents compose command was built to make all document types reachable through one stable surface. Pass a type, an entity, and a JSON body. The server looks up the per-type Zod schema, validates the body, renders the document, and returns the same envelope shape every time.

For generic documents, that is exactly right. A letterhead, proposal, quote, statement, receipt, minutes document, reference letter, or compliance letter can be a schema-gated render through the generic pipeline.

Credit notes were different. So were invoices, pro-forma invoices, and board resolutions.

Those document types do not just render. They create specialized rows, allocate numbers, enforce state machines, inherit fields from source invoices, attach payment behavior, and feed later compliance paths. When documents compose --type credit_note went through the generic renderer, it produced a document row but bypassed createCreditNote(). That meant the credit note skipped the inheritance logic that copies terms from the original invoice.

The symptom appeared later as a Factur-X validation problem. The XML layer complained about missing payment terms. The real bug was earlier: the API made a specialized document reachable through the wrong path.

I had treated coverage as correctness. It was not.

The fix was not to delete the umbrella. The umbrella is still the right operator surface. The fix was to split the server path into two routes inside the same endpoint: specialized dispatch first, generic render second.

The registry is explicit:

export const COMPOSE_SPECIALIZED_DISPATCH: Partial<Record<TemplateType, ComposeDispatcher>> = {
  invoice: invoiceDispatcher,
  pro_forma_invoice: proFormaDispatcher,
  credit_note: creditNoteDispatcher,
  board_resolution: boardResolutionDispatcher,
};

That tiny map carries a big rule. If a type has specialized business logic, compose must call the specialized service. If it does not, compose falls through to renderDocument(). Exactly one path fires.

The dispatcher contract is intentionally boring. It maps the umbrella body into the specialized service input, calls the service, and wraps the returned row in the same envelope shape as a rendered document. It does not recompute totals. It does not duplicate inheritance. It does not allocate its own number. If the specialized service owns the rule, the dispatcher is only an adapter.

Here is the credit note path:

const row = await createCreditNote({
  db: args.db,
  pool: args.pool,
  entityId: args.entityId,
  input,
  apiKeyId: args.apiKeyId ?? null,
  requestId: args.requestId ?? null,
});

return wrapRowAsOutput({
  documentId: row.document_id,
  entityId: args.entityId,
  type: 'credit_note',
  documentNumber: row.document_number ?? '',
  status: row.status ?? 'draft',
  locale: args.locale ?? 'en',
});

That looks like extra ceremony until you compare it to the failure. The old path returned a rendered artifact while skipping the domain rule. The new path returns a draft row because specialized documents often require a later send or execute step to produce their final artifact. The response shape stays stable, but the lifecycle is honest.

What surprised me was that the bug survived because both layers were internally correct. The CLI sent a valid body. The compose endpoint validated it. The generic renderer produced a document. The Factur-X validator was right to reject the later XML. No single module was obviously broken in isolation.

The boundary was wrong.

That forced a different kind of test. Unit tests on the dispatcher are not enough. The integration test has to assert the side effect that only the specialized service can create: invoice numbering, credit-note inheritance, board-resolution numbering, generic letterhead finalization. The test is not "does compose return 201?" It is "did the right domain path fire?"

This is one of the more useful patterns in the project because it applies outside documents. An umbrella command is good when operators need one muscle memory. It becomes dangerous when the umbrella erases domain-specific behavior. Reachability is a UX property. Correctness is a domain property.

The compose endpoint now carries both.

Confidence Is Not Ownership

Kingsley Onoh — Wed, 06 May 2026 16:00:28 +0000

What should a finance queue do when two credible records point at the same case?

An invoice discrepancy and a contract breach can describe the same dispute. They can also describe two different disputes with the same counterparty, the same currency, and nearly the same amount. That is the trap in a finance operations queue. The data looks related before anyone has proved ownership.

The Workbench ingests exceptions from invoice reconciliation, transaction reconciliation, contract lifecycle events, webhook dead letters, manual operator entry, and signed Hub fanout. Every source carries its own identifiers. Some are reliable. Some are only reliable inside the upstream tool that produced them.

I had to decide what the system should do when a new exception looks like it belongs to an existing dispute.

The tempting version is simple: compute a score, pick the highest dispute, attach the exception. That makes demos feel clean. Exceptions flow in, disputes become richer, and the queue stays small.

It is also how a finance system quietly corrupts its own audit trail.

Once an exception is attached to a dispute, every later action inherits that fact. SLA timers, resolution playbooks, Notification Hub events, audit PDF exports, and operator comments all treat the relationship as true. If the relationship was only probable, the system has converted probability into evidence.

That conversion is the real design problem.

The Scoring Shape

The correlator in src/drw/domain/correlator.clj uses seven signals. Source reference and entity id each carry 0.15. Counterparty carries 0.25. Currency carries 0.10. Amount carries 0.15. Date carries 0.10. Category carries 0.10.

Those weights are not magic in the sense of being secret. They are visible because this is a spec project. But the structure matters more than the numbers.

Counterparty is the gate. A candidate dispute is not eligible unless it belongs to the same tenant, is not terminal, has the same counterparty, and falls within the correlation window. Only then does scoring begin.

That means the correlator is not a general similarity search. It is a tenant-scoped dispute ownership test.

The amount signal has a 10 percent tolerance and only scores when the currency also matches. The date signal checks whether the exception was observed within 72 hours of the dispute creation time. The source reference and entity id signals compare against exceptions already attached to the candidate dispute, not just fields on the dispute itself.

That last part matters. A dispute becomes easier to recognize as it accumulates evidence. The first invoice mismatch may create the dispute. A later webhook dead letter with the same upstream reference can now match the attached evidence, even if the dispute record itself does not carry that reference.

The core function is plain Clojure:

(defn score-candidates
  ([tenant-id exception disputes attached]
   (score-candidates tenant-id exception disputes attached {}))
  ([tenant-id exception disputes attached opts]
   (let [cfg (merge default-config opts)
         review (get-in cfg [:thresholds :review])]
     (->> disputes
          (map-indexed vector)
          (filter (fn [[_ dispute]]
                    (candidate-eligible? tenant-id exception dispute cfg)))
          (map (fn [[idx dispute]]
                 (assoc (score-candidate tenant-id exception dispute attached cfg)
                        :sort-index idx)))
          (filter #(>= (:score %) review))
          (sort-by (juxt (comp - :score) :sort-index))
          (mapv #(dissoc % :sort-index))))))

There are two details in that function I care about.

First, eligibility happens before scoring. A cross-tenant dispute receives no score. A terminal dispute receives no score. A different counterparty receives no score. The function does not let a high amount or date match compensate for a broken boundary.

Second, ties preserve input order through :sort-index. That is not glamorous. It prevents unstable review queues where two equal candidates swap positions between renders and make operators think the system changed its mind.

What I Was Wrong About

I initially treated the correlation score as the hard part.

It was not. The harder question was what the system does with the score.

There are three possible outcomes in src/drw/domain/exceptions.clj. If no candidate passes review, the exception creates a new dispute and attaches immediately. If the best candidate hits the auto-merge band and auto-merge is explicitly enabled, the exception attaches and records an auto-merged correlation. Otherwise, the system creates pending correlation records and emits a dispute.correlation_pending event.

That middle branch is intentionally hard to reach. The .env.example values set AUTO_MERGE_THRESHOLD=0.92 and REVIEW_THRESHOLD=0.70, while the source correlator defaults are lower for unit-level behavior. The runtime config is stricter because this is finance operations. False attachment costs more than a larger review queue.

What surprised me is that pending correlation became a domain object, not a UI convenience.

The queue needed an id, a tenant id, an exception id, a target dispute id, a score, a rationale, a status, a decided-by user, and decision timestamps. That is a lot of structure for something that could have been a modal row.

But the moment an operator accepts or rejects a candidate, that decision becomes part of the case history. A rejected match is useful evidence. It says someone looked at the overlap and decided the exception did not belong there. If the same upstream source sends a related item later, the prior rejection explains why the system did not combine the cases earlier.

That is why correlation records live next to exceptions and disputes instead of inside a transient UI response.

The Failure Mode Hidden In Good Matches

The most dangerous false match is not a ridiculous one.

It is the match that looks reasonable.

Same counterparty. Same currency. Amount within 10 percent. Observed inside three days. Category is billing. If those signals point to the wrong open dispute, the system does not look broken. It looks efficient.

The damage appears later. A Workflow Engine playbook starts against the wrong case. A Notification Hub event tells an operator that the dispute is ready for resolution. The audit PDF now contains an exception that belongs somewhere else. Nobody sees the root mistake because every downstream artifact is internally consistent.

That is the kind of bug that worries me more than a 500 response.

A 500 stops the flow. A wrong attachment keeps moving.

The design answer was to make confidence create a decision, not mutate the dispute. A review-band candidate becomes work for an operator. The UI exposes accept and reject actions. The API carries the same boundary. The audit log records correlation creation and later decisions.

The Workbench still supports auto-merge, but it is a policy choice. It has to be enabled. The score has to clear the higher band. The code does not pretend the existence of a scoring function means the business has accepted the risk.

Why Tenant Scope Belongs Inside The Algorithm

Tenant isolation is usually discussed at the HTTP layer. API key comes in, tenant id gets attached to the request, handlers filter queries.

That is necessary, but it is not enough here.

Correlation is a cross-entity operation by nature. It compares a new exception against many existing disputes and attached exceptions. If the algorithm accepts a list that accidentally contains another tenant's disputes, the HTTP layer is already too far away to save it.

So score-candidate checks tenant equality itself. score-candidates filters candidates through candidate-eligible?, which repeats tenant, status, counterparty, and time-window checks before scoring.

This is defensive duplication with a purpose. The route should pass tenant-scoped collections. The domain should still reject anything outside the tenant boundary. In a single-tenant fixture, this looks redundant. In a two-tenant test, it is the difference between "the route behaved" and "the invariant held."

The same philosophy appears in reports. The audit PDF renderer captures a tenant snapshot, renders with strict token lookup, and the setup check renders two tenants to make sure one tenant's identity literals never appear in the other tenant's output.

The theme is the same: do not trust a boundary because a previous layer probably handled it.

The Result

The finished local build passed 160 tests with 869 assertions. The full-flow E2E drives invoice adapter polling into exception creation, assignment, investigation, Workflow Engine resolution polling, Notification Hub event capture, and audit PDF generation.

The number I care about most is smaller: the dashboard guard. It caught a practical operations failure. The first dashboard shape rendered every dispute link in the tenant fixture. The fix capped the overview at 50 open disputes and kept totals intact.

That is the Workbench in miniature. Preserve the facts. Limit the surface. Make the operator decide when the machine only has a probability.

Confidence is useful. Ownership is a human or policy decision.

Why I Made WebSocket Delivery the Disposable Part of the Tracking System

Kingsley Onoh — Wed, 06 May 2026 14:53:35 +0000

The uncomfortable failure is not a carrier outage. That one is loud. The polling loop logs it, retries it, and moves on.

The failure I had to design around is quieter: the gateway receives a real carrier update, writes it to the database, then Redis is down at the exact moment the WebSocket fanout should happen. The client misses the live push. The shipment state is still correct. The event timeline is still correct. But the thing the user was watching in real time never moves.

That sounds like a broken real-time system until you decide which part is allowed to be temporary.

I made the database the truth and the WebSocket stream the delivery layer. That one decision shaped the rest of the gateway.

The real boundary

DHL, DPD, and GLS do not send one clean stream of facts. The adapters have to handle different request shapes, different status codes, and different timestamp rules. DHL uses DHL-API-Key and returns shipment events under shipments[0].events. DPD can return HTML for an error response, so DpdAdapter checks the content type before it ever tries to parse JSON. GLS sends a date like 06.05.2026 and a local time string, so GlsAdapter has to parse CET into UTC.

That is all adapter work. Once an event crosses into the core, it has one shape: carrier, tracking number, carrier status, normalized status, carrier timestamp, and a deterministic dedupKey.

The key is generated from four values:

export function generateDedupKey(input: DedupKeyInput): string {
  const carrier = normalizeCarrier(input.carrier);
  const trackingNumber = assertRequiredString(input.trackingNumber, "trackingNumber");
  const carrierStatus = assertRequiredString(input.carrierStatus, "carrierStatus");
  const carrierTimestampIso = normalizeTimestamp(input.carrierTimestamp);

  return [carrier, trackingNumber, carrierStatus, carrierTimestampIso].join(":");
}

That function is plain, but it carries the system. A carrier can send the same scan twice. A poll can retry after a timeout. A process can restart and fetch the same history again. If the fact is the same, the key is the same.

What surprised me was how much of the gateway became simpler once duplicate handling moved into the event identity instead of the polling loop. The poller does not need to remember what it saw last time. The adapters do not need per-carrier duplicate caches. The WebSocket layer does not decide whether something is new. The processor decides once, against PostgreSQL.

The processor is the load-bearing part

I was wrong at first about where the "real-time" complexity would live. I expected it to be WebSockets: connection cleanup, heartbeats, subscription maps, broadcast performance. Those were real problems, but they were not the hard correctness problem.

The hard part was making sure Redis never became the source of truth by accident.

createEventProcessor starts with a dedup lookup, then inserts into tracking_events with onConflictDoNothing(). That second guard matters because two workers can pass the lookup at the same time. The database still gets the final vote.

After the insert, it updates the shipment projection only when the new carrier timestamp is newer than last_event_at. Older events are still persisted. They just do not move the visible shipment state backwards.

The important shape is this:

const updatedRows = await options.db
  .update(shipments)
  .set({
    currentStatus: event.normalizedStatus,
    lastEventAt: event.carrierTimestamp,
    updatedAt: new Date()
  })
  .where(
    and(
      eq(shipments.id, event.shipmentId),
      or(isNull(shipments.lastEventAt), lt(shipments.lastEventAt, event.carrierTimestamp))
    )
  )
  .returning({ id: shipments.id });
projectionUpdated = updatedRows.length > 0;

That is the line between history and projection. An out-of-order scan belongs in history. It does not belong on the dashboard as the current state.

Then Redis publish happens after the database work. If the database write fails, the processor returns failed and does not publish. If Redis publish fails, the processor still returns processed with published: false. That asymmetry is deliberate. A WebSocket update can be missed. A carrier fact cannot be invented.

The integration tests capture both sides. One test inserts an invalid shipment id and asserts that Redis stays empty. Another makes the publisher throw redis unavailable and asserts the PostgreSQL event still exists. Those tests are more useful than a happy-path WebSocket demo because they pin down which failure the business can recover from.

Why not broadcast directly?

The simple version is tempting. A client subscribes to a tracking number. The processor receives an event. It loops over sockets and calls send().

That works until the process restarts, the send throws, or the event arrives in a worker that does not own the socket. Even in a single-process version, direct send ties event processing to live delivery. That is the dependency I wanted to avoid.

Redis Streams gave the gateway a narrow delivery contract. The processor writes one stream entry to tracking:events. The broadcaster consumes as group ws-broadcaster, parses the payload, looks up connection ids by tracking number, and sends JSON to sockets registered in the current process.

The broadcaster code has a small but telling rule: malformed stream entries are acknowledged.

At first that felt wrong. Acknowledging a bad message means admitting it will never be delivered. But leaving invalid JSON pending forever is worse. It blocks operational visibility and makes the group look unhealthy for a message that cannot succeed on retry. The test for that case writes { as the payload and asserts it gets logged and acked.

The other Redis detail is XAUTOCLAIM. If a consumer reads a stream entry and dies before acknowledging it, the message sits pending. The broadcaster reclaims stuck messages after 60 seconds. That is enough for the current single-service deployment, and it creates a clear upgrade path for multi-process delivery later.

The polling loop stays boring on purpose

The polling engine is less clever than the rest of the system. That is good.

It loads enabled carrier configs from the database, starts one loop per carrier, queries active shipments excluding delivered, returned, and deleted_at, then batches them by POLL_BATCH_SIZE. The default is 10.

Each shipment call runs through withExponentialBackoff. RateLimitError and CarrierError are retryable by default. The delay is baseDelayMs * 2 ** attempt, with the base coming from carrier_configs.backoff_base_ms.

The scheduler has one job: skip overlapping cycles. If a carrier cycle is still running when the next interval fires, it returns a structured skipped result with reason cycle_already_running. It does not start a second poller and hope the processor dedup saves the day.

That skip behavior matters because the PRD target is 100 active shipments across 3 carriers, with a carrier polling cycle under 30 seconds and WebSocket delivery under 200ms once the event enters the pipeline. If the poller stacks, the gateway creates its own load spike and every downstream metric lies.

The timestamp bug that exposed the database shape

The most useful gotcha came from pagination, not carrier integration. Cursor pagination repeated a shipment on page 2 even though the SQL condition compared the sort timestamp against the cursor timestamp.

The schema uses PostgreSQL timestamp without time zone. Passing a JavaScript Date through Drizzle and node-postgres introduced enough interpretation drift that the comparison did not behave like the stored value. The fix was to format the cursor as a PostgreSQL timestamp literal and cast it with ::timestamp.

That is why shipments.routes.ts has this helper:

function toPgTimestamp(value: Date): string {
  return value.toISOString().replace("T", " ").replace("Z", "");
}

It looks like formatting trivia. It is actually a pagination invariant. If the cursor repeats rows, clients may process the same shipment page twice or miss a page when they try to recover.

The part I left deliberately unfinished

The stats endpoint exposes a limitation instead of pretending the system has a metric it does not own yet. It returns active shipments and today's event counts by carrier, but error_rate and errors are null. The response includes a metrics_limitations entry that says carrier poll failure counters are not persisted yet.

That is not a polished answer, but it is the honest one. The polling engine logs retry attempts and per-cycle errors, and the build journal records the summary fields: carrier, shipments polled, events found, deduplicated events, and errors. Those numbers exist at runtime. They do not yet survive as queryable history.

I prefer that over a fake rate derived from current memory. A dashboard number that resets on restart is worse than no number because it invites operational decisions from incomplete evidence. If this gateway were being deployed for a client, the next increment would be a small poll-cycle table or metrics sink before anyone relied on /api/stats for carrier health.

The same honesty shows up in the success criteria. The system has the local loop, the production-start check, the Docker Compose smoke proof, and the mock-carrier journey. It does not have the 24-hour simulated-load soak. That missing box is not paperwork. It is the difference between "the architecture handles the failure model" and "this process survived a day of real runtime pressure."

The result

The finished local build passed 134 Vitest tests. The unit and integration coverage command passed at 83.32% statement and line coverage. The WebSocket E2E test publishes a Redis stream event and asserts delivery under 200ms after the broadcaster group exists. The mock-carrier demo registers DHL, DPD, and GLS shipments, subscribes over WebSocket, processes events through the real event processor, verifies duplicate suppression, and confirms the REST API shows in-transit shipments.

The 24-hour soak is still deferred. That matters. A gateway like this is not production-proven just because the happy path works for a few minutes.

But the failure model is in the right order. Carrier facts land in PostgreSQL first. Shipment state is a projection. Redis carries delivery. WebSockets are allowed to miss a push. The timeline is not allowed to lie.

Why I Made the Ledger Refuse Single Rows

Kingsley Onoh — Tue, 05 May 2026 19:35:17 +0000

The bug I kept designing around was not "wrong penalty amount."

Wrong amounts are visible. A supplier sees a credit note for EUR 8400, checks the contract, and pushes back. The operator can investigate. The trail exists.

The quieter failure is a penalty row with no mirror. One side of the system says the buyer is owed money. The other side never records the supplier-facing debit. The total looks correct in the dashboard because the credit row exists. The settlement builder can even pick it up. But the accounting story is incomplete, and it only becomes obvious when someone asks why the counterparty view does not match the buyer view.

That is the sort of error append-only systems can preserve forever.

So I made the ledger refuse single rows.

The system calculates supplier SLA penalties. A missed response target, a delivery window breach, a compounding daily penalty, a tier crossing, or a per-ticket miss becomes money owed. That money then needs to move through three phases: accrual, possible reversal, and settlement. Each phase has a tempting shortcut.

For accrual, the shortcut is one row with amount_cents and direction.

For reversal, the shortcut is updating the original row.

For settlement, the shortcut is marking the original row as settled.

All three shortcuts make the data easier to write and harder to defend.

The Shape That Forced the Design

The PRD had a hard constraint: penalty_ledger is append-only. Disputes, withdrawals, and corrections use compensating entries. Settlement membership lives outside the ledger. That sounds clean until the first application flow has to implement it.

The accrual worker receives a breach, loads the contract and clause, calls RulesEngine.calculatePenalty, and gets one answer back: no penalty, a domain error, an accrued amount, or a capped amount. The amount is not the ledger. It is only a financial fact waiting to become a record.

The ledger record has more obligations:

a credit side and mirror side
same amount and currency
same tenant, contract, counterparty, clause, and breach
same accrual period
same entry kind
same compensation reference when reversing

If any of those differ, the pair is not a pair. It is two rows that happen to be written near each other.

I originally thought the database trigger was the center of the solution. Block update and delete. Done. That part exists:

create trigger penalty_ledger_block_update
before update on penalty_ledger
for each row execute function block_penalty_ledger_mutation();

create trigger penalty_ledger_block_delete
before delete on penalty_ledger
for each row execute function block_penalty_ledger_mutation();

But that only stops mutation after insertion. It does not stop a bad insert. A database that blocks updates can still store a malformed truth forever.

That is where the F# domain layer earns its place.

The Load-Bearing Function

The most important code in the ledger path is not the insert statement. It is LedgerPair.create.

let private validate credit mirror =
    if Money.cents credit.Amount <= 0L || Money.cents mirror.Amount <= 0L then
        Error LedgerAmountMustBePositive
    elif
        credit.Direction <> LedgerDirection.CreditOwedToUs
        || mirror.Direction <> LedgerDirection.Mirror
    then
        Error LedgerPairDirectionInvalid
    elif credit.EntryKind <> mirror.EntryKind then
        Error LedgerPairKindInvalid
    elif credit.Amount <> mirror.Amount then
        Error(LedgerPairMismatch "amount must match")
    elif
        credit.AccrualPeriodStart <> mirror.AccrualPeriodStart
        || credit.AccrualPeriodEnd <> mirror.AccrualPeriodEnd
    then
        Error(LedgerPairMismatch "period must match")
    elif not (sameContext credit mirror) then
        Error(LedgerPairMismatch "tenant contract counterparty clause breach context must match")
    elif credit.CompensatesLedgerId <> mirror.CompensatesLedgerId then
        Error(LedgerPairMismatch "compensating ledger reference must match")
    elif credit.AccrualPeriodEnd < credit.AccrualPeriodStart then
        Error PeriodInvalid
    else
        Ok()

That function is deliberately narrow. It does not know about HTTP, Hangfire, PDF rendering, Invoice Recon, or even settlement grouping. It only knows what makes two candidate rows a valid accounting pair.

The private LedgerPair type matters. Callers cannot construct one directly. They can propose two LedgerEntryCandidate records, but only the domain module can expose the pair. That means the application layer does not get to "remember" to validate. It has no valid object unless validation has already happened.

This is the part I got wrong at first in my head: I thought append-only was mainly a persistence rule. It is not. It is a construction rule. Once bad ledger rows are inserted, append-only protects the bad rows just as strongly as the good ones.

Why The Rules Engine Stays Pure

The penalty math also stays outside the database. RulesEngine.calculatePenalty takes a PenaltyCalculationInput and returns a result. No connection string. No repository. No clock except the explicit AsOf value passed in.

That was not aesthetic. The engine needs deterministic recompute. Given the same contract, clause, breach, prior accruals, and timestamp, it should return the same penalty. The snapshot test covers twelve cases: flat penalties, capped penalties, monthly fee proration, tier crossing, overflow tier behavior, currency mismatch, daily compounding cap, missing units, inactive clauses, and pre-contract breaches.

The awkward case is previous accruals. Caps depend on what was already credited. Tiered penalties may need to accrue only the difference between the previous tier and the new one. Compounding daily penalties need to avoid adding the same days twice. So the rules engine receives PreviousAccruals, filters to credit-side rows for the same clause, and calculates incremental money from that prior state.

That looks like a database concern until it breaks. If the rules engine quietly queried the database itself, replay tests would become setup-heavy and timing-sensitive. By making prior state an input, the application layer owns retrieval and the domain layer owns the calculation.

Reversal Without Mutation

The reversal path is where the design either holds or collapses.

When a supplier disputes a breach and wins, the system cannot update the original accrual to zero. It cannot delete the row. It cannot mark the row as false and pretend the old financial position never existed.

It writes a reversal pair.

ReversalEngine.uncompensatedCreditAccruals first looks for accrual rows that do not already have a reversal pointing at them. Then reversalCandidate copies the amount, period, tenant, contract, counterparty, clause, and breach from the original accrual, changes the entry kind to Reversal, and sets CompensatesLedgerId.

That little filter is important. Without it, the same breach could be reversed twice. Append-only would preserve both reversals, and the system would show the supplier owed less than zero. The code does not rely on a user avoiding the button. It filters uncompensated credit rows and writes every reversal inside one transaction with the status change.

The design tradeoff is verbosity. A simple breach flow creates two rows on accrual. Accrual plus reversal creates four rows. A ledger explorer has to explain direction, entry kind, and compensation references. The UI has more work because the database refuses to simplify history for display.

I accept that. Accounting systems should make the audit easy and the write path strict.

Settlement Is Not a Ledger Mutation

Settlement introduced a second temptation: add settlement_id to penalty_ledger.

That would make queries simple. Find uncommitted rows where settlement_id is null. Mark them when the PDF is built. Done.

It would also puncture the ledger invariant. If building a settlement mutates a ledger row, the ledger is no longer an append-only record of penalty facts. It becomes a workflow table.

The repo uses settlement_ledger_entries instead. SettlementsRepository.listUncommittedAccruals selects credit-side accruals for the period, excludes rows that already have active settlement membership, and excludes rows with reversals. Then SettlementsRepository.insert writes the settlement row and membership rows in the same transaction.

That means settlement membership is append-like metadata around the ledger, not a change to the ledger event itself.

The cost is an extra table and a more careful query. The gain is that a settlement can be cancelled or released without rewriting what the penalty ledger said at the time of accrual.

The Part That Still Needs Pressure

The full suite passes: 12 domain tests, 6 data tests, 30 application tests, 10 API tests, and 3 UI tests. The dashboard load audit has tooling for 10000 breaches and 5000 settlements. The fake staging flow proves a Contract Lifecycle event can become an accrual, settlement, Invoice Recon outbox message, and settlement.posted hub event in the local harness.

But the PRD still has open success criteria around staged NATS-to-ledger time, Invoice Recon posting p95, and million-row ledger query behavior. That is the honest limit of the current proof. The invariants are strong. The operational envelope still needs live pressure.

What surprised me is how much of the system exists to protect against its own future convenience. Every obvious shortcut would make one screen or one query easier. Single-row ledger entries. Updating rows for reversals. Marking ledger rows as settled. Reading tenant identity live during PDF posting.

Each shortcut is fine until the first external audit, supplier dispute, or tenant rename.

The lesson I took from this build is narrow: append-only is not a database trigger. It is a system-wide refusal to let later workflow needs rewrite earlier financial facts.

Why I Refused to Re-Query the Tenant Row at Alert Dispatch Time

Kingsley Onoh — Sun, 26 Apr 2026 17:37:42 +0000

What does an audit-grade risk alert look like six months after the band crossing it describes, when the tenant has renamed itself twice in the interim, when the vendor has been merged into another, when the scoring rule has been retuned three times since? In a system that re-queries the source rows at dispatch time, the answer is whatever is true now. In this system, the answer is whatever was true at 14:03 on the Tuesday the alert was written.

The concrete shape of the problem: a risk alert is created at 14:03 Tuesday for a vendor crossing from medium to high. The Notification Hub is down for emergency Postgres maintenance. The dispatcher falls back to the failed-alert retry queue and tries again every 30 minutes. By Friday afternoon, when the Hub is back, the email body must still read Acme GmbH, the legal name that was current Tuesday, not the post-Wednesday rebrand to Acme Industrial GmbH that the operator entered through the settings UI on Wednesday morning. A live re-query of the tenants row at send time would emit the new name and reference an event at a company that didn't exist when it happened.

That's the constraint this whole architecture is designed around. Not the failure mode I worried about most. The one I knew would happen and didn't have a clean answer to.

The Real Problem

Most engineers I've talked to assume the issue is performance. They look at a table with a JSONB column called delivery_payload holding a fully-rendered tenant snapshot, and they ask why I'd denormalize. The tenants row has 11 identity columns. Why pickle them into JSON when I could just SELECT them again at dispatch time?

Performance has nothing to do with it. The dispatcher is fast either way. What matters is that an alert is a fact about what was true at a moment in time, and a re-query produces a fact about what is true now. Those two facts are not interchangeable. An auditor reading the alert ledger six months later wants to know what the operator saw on Tuesday afternoon, not what the system shows today.

Same problem on the report side. A vendor scorecard PDF generated in March needs to reprint byte-for-byte (or close enough to it) when an auditor downloads it again in September. If the underlying tenant or vendor row has been mutated in between, a live re-query produces a different document. The PDF that was approved by procurement leadership in March no longer exists on disk. It exists in the operator's email and nowhere else.

The mistake is one of category. The system was treating mutable rows as the source of truth for an immutable record. SQL was doing exactly what SQL does: returning the current value of the column. The error was in the architecture asking the question that way at all.

The Constraints

Three things made this hard.

Tenant identity is mutable by design. A CPO can update their legal name, registration number, address, and brand colors at any time through the settings UI. We don't lock fields after creation; that would be operationally hostile to the business reality of mergers and rebrands. So the row is the live identity, full stop.

Vendors get merged. When the operator confirms that two vendor_aliases rows point at the same legal entity, the system collapses them and the surviving vendor inherits the signals from the merged one. If an alert fires referring to "Vendor A" and that vendor gets merged into "Vendor B" three weeks later, a re-query at dispatch time would emit "Vendor B" in the email body. That isn't just confusing; it's wrong. The alert was about A.

Retries can run for days. The notification hub has a circuit breaker (5-failure rolling window, 60-second cooldown). If the Hub is genuinely unhealthy for a sustained period, the failed-alert retry job picks alerts back up every 30 minutes and tries again. There is no upper bound. I have seen alerts retry across 72-hour outages. Anything the dispatcher reads at send time can have moved arbitrarily far from where it was at create time.

The obvious moves were all bad. Locking the tenant row after first alert? Operationally hostile. Versioning every field on the tenant table with effective-date ranges? A second source of truth and a permanent maintenance burden. Re-creating the alert if the tenant changed? You can't, because the alert is a record of an event that already happened.

The Design

The architecture has two halves: storage immutability and rendering immutability. Both are necessary. Either alone is insufficient.

On the storage side, every alert carries a delivery_payload JSONB column populated at insertion. The capture path lives in lib/alerts/capture_payload.rb and is called from the alert dispatcher exactly once, at the moment the risk_alerts row is INSERTed:

# lib/alerts/capture_payload.rb
def call(vendor_score:)
  tenant = vendor_score.tenant
  vendor = vendor_score.vendor
  tenant_snapshot = Tenants::CaptureSnapshot.call(tenant.id)

  payload = {
    event_type: event_type,
    tenant: tenant_snapshot,
    vendor: vendor_block(vendor, tenant_snapshot),
    score: score_block(vendor_score, previous_composite, previous_band, direction),
    top_contributors: top_contributors_block(vendor_score),
    deep_links: deep_links_block(vendor),
    created_at: Time.now.utc.iso8601
  }

  deep_freeze(payload)
end

Tenants::CaptureSnapshot is the single canonical builder for the tenant identity block. Eleven columns from the tenants row plus a snapshot_at timestamp, returned as a frozen Hash. The shape is locked. Adding a column to tenants does not automatically add it to the snapshot. Every addition is a deliberate change to the snapshot, the templates that bind to it, and the test fixtures.

That's the first half. The dispatcher (Alerts::HubDispatchJob) then has exactly one rule: read alert.delivery_payload and never query tenants, vendors, or vendor_scores again. The job's class doc is a paragraph explaining this. The integration test fires an alert, mutates the underlying tenant row, then runs the dispatcher and asserts the emitted Hub event still contains the original literal values. Without that test, a casual refactor could re-introduce a Tenant.find(alert.tenant_id) call in the dispatcher, and the regression would only show up in production after the first sustained Hub outage.

The second half is in-memory immutability. A frozen top-level Hash doesn't prevent a careless caller from mutating a nested Hash:

# This still works on a top-level-frozen Hash:
payload[:tenant][:legal_name] = "Sneaky GmbH"
# raises FrozenError only if every nested level is also frozen

So the capture path walks the structure recursively:

def deep_freeze(obj)
  case obj
  when Hash
    obj.each_value { |v| deep_freeze(v) }
    obj.freeze
  when Array
    obj.each { |v| deep_freeze(v) }
    obj.freeze
  when String
    obj.freeze
  else
    obj
  end
end

A Sidekiq worker that loads the JSONB column gets a fresh Ruby Hash from JSON.parse, which is mutable, but the in-memory structure built by CapturePayload.call is physically immutable at every level the moment it leaves the method. That removes a class of bugs where a logging middleware or a serializer transformation accidentally mutates the snapshot in flight before it's emitted to the Hub.

The rendering side is the same idea applied to templates. Every Hub Liquid template registers with strict_variables: true. Every report ERB template uses a small helper called Reports::StrictFetch that walks dotted paths against the captured render context and raises StrictFetchError on any unresolved segment. A template that references {{ tenant.legal_nme }} (typo) doesn't silently emit an empty string. It raises during the test, before anything ships.

The CI gate that locks all of this is in test/integration/report_template_lint_test.rb. It renders every report template against captured render contexts for two distinct tenants, then parses each template for f("…") calls without a default: argument to enumerate the mandatory token set. The test will fail loudly if anyone adds a new template token without extending the snapshot, or adds a field to the snapshot but forgets to update a downstream template.

I had this turned into the cleverest part of the suite by accident. After the first version of the test passed, I noticed it would also pass if I weakened StrictFetch to silently return nil for missing paths. So I added a deliberate-failure regression test: a synthetic broken template that references {{ tenant.this_field_does_not_exist_anywhere }}. If anyone weakens the strict-fetch contract, that meta-test fails. Without it, the regression test for the regression test, the gate could rot.

What Surprised Me

The byte-identical re-render test for the four report types did not work the first time, and the failure mode taught me something I would not have figured out from documentation.

I picked the legal-footer line as my canary literal: Reg: HRB-123456 · Tax: DE987654321 · Berlin, DE. The PDF visually rendered correctly. But pdf-reader text extraction returned a nil for that line. After two hours of debugging, I figured out that wkhtmltopdf encodes the · HTML entity as a non-standard glyph escape that pdf-reader's content-stream decoder doesn't recognize, and the entire PDF text object containing that escape gets dropped from the extraction.

The fix was to pick canary literals from lines that don't contain · separators (header legal_name, address line1, contact email are all safe). But the lesson was the deeper one: byte-identical PDF re-rendering is impossible because wkhtmltopdf embeds creation timestamps and random object IDs in every render. CSV outputs are bytewise equal across renders; PDFs are content-equal via text extraction. I had to pick the right level of the immutability claim.

The Result

A 30-day audit-reprint test runs on every CI build. It captures a render context, generates the PDF and CSV, runs Timecop.travel(30.days), mutates the underlying tenants row, and re-renders. CSV output is exactly byte-equal. PDF text-extracted output contains the original tenant literals (Acme GmbH, Hauptstraße 10, procurement@acme-gmbh.example) and not the new ones. The same gate runs separately for the alert side: the dispatcher integration test fires an alert, mutates the tenant, runs the dispatcher, and asserts the emitted Hub event contains the pre-mutation legal name.

The architecture is denormalized. Every alert duplicates the tenant identity. Every report context duplicates the entire scoring snapshot. Storage cost is real but small (typical alert payload is ~3KB; a year of band-crossing alerts at 50 vendors per tenant is well under 100MB per tenant). The cost is paid once at write time and never paid again at read time.

The takeaway, if there is one: a snapshot has a different job from a normalized table. The normalized rows are operational state, useful for the live dashboard and the next score recompute, and entirely correct to mutate when the business changes. The frozen JSONB is the legal record. It's what you point at six months later when an auditor asks you to prove what the operator saw on a Tuesday afternoon in April.

The Per-Tenant Rate Limit That Wasn't Per-Tenant

Kingsley Onoh — Sun, 26 Apr 2026 17:33:43 +0000

Tenant A blocked correctly at request 11. Tenant B also blocked at request 11.

Phase 7 added a rate limit on POST /api/events that reads from tenants.config.rate_limits.events_per_minute. The default cap is 200 requests per minute. A tenant who needs more can be raised to 1000 via an admin PATCH. The integration test seeded two tenants (one on the default, one bumped to 100) and asserted that tenant A got blocked at request 11 while tenant B kept going. Tenant B was supposed to get nine more requests through.

The resolver function was fine. The unit tests on resolveTenantEventsRateLimit() all passed. Pass a tenant config with no override, get 200. Pass a tenant with events_per_minute: 100, get 100. Pass null, get 200. Pass anything over 1000, get clamped to 1000. Five tests, all green. The function did exactly what it was supposed to do.

The function was being called with null.

Where the Resolver Was Called With Null

@fastify/rate-limit accepts a max option that can be a function. The function receives the request and returns the cap to apply. Putting the resolver behind that callback was the design: each request asks "what is this tenant's limit?" at the moment the rate check runs, so changing a tenant's config takes effect immediately without a process restart.

For the resolver to do its job, it needs request.tenant already populated. That happens in authPlugin, which reads the X-API-Key header, looks up the tenant row, and attaches the tenant config to the request. The auth plugin runs in the onRequest lifecycle hook.

By default, @fastify/rate-limit also runs in onRequest. Fastify fires hooks in registration order within the same lifecycle stage. The rate-limit plugin was registered before the route was registered, and the route's auth runs as part of the route's plugin chain. The two onRequest hooks fired in an order that meant rate-limit's max callback ran first, with request.tenant still undefined.

The resolver, faithful to its contract, returned the default 200 for every request because every request looked like a tenant with no config override. Tenant A with the default and tenant B with 100 both got the same global cap. The endpoint behaved exactly as if the per-tenant feature didn't exist, but with no error and no warning, because both halves of the system were doing what they were told.

Why Moving Auth Wasn't an Option

The fix could not move auth. authPlugin runs in onRequest for every protected route in the entire API, and most of those routes do not need rate limiting. Pushing auth to preHandler to accommodate one route changes the lifecycle for every route, which moves a documented contract for one accidental dependency. That is the sort of fix that creates three new bugs to make one go away.

The fix also could not run rate-limit globally with a different hook. The rate-limit plugin can be registered globally with a custom hook stage, but that overrides the stage for all rate-limited routes. Every other rate-limited route in the codebase, including the admin endpoints with their own static caps, would be affected by a change made for events.routes.ts.

What was needed was a per-route override that moved only this rate check to a later hook stage, leaving the rest of the system on defaults.

hook: 'preHandler' Plus a Real keyGenerator

@fastify/rate-limit exposes a per-route option called hook inside the route's config.rateLimit object. Setting it to 'preHandler' means this specific route's rate check fires in the preHandler stage instead of onRequest. Auth has already run by then. request.tenant is populated. The resolver gets a real tenant config and returns the right cap.

config: {
  rateLimit: {
    hook: 'preHandler' as const,
    max: (request) => resolveTenantEventsRateLimit(request.tenant ?? null),
    keyGenerator: (request) => request.tenantId ?? request.ip ?? 'anonymous',
    timeWindow: '1 minute',
  },
},

Three things in that block matter, and the hook line is only one of them.

The keyGenerator is the second multi-tenant fix. The default key generator is the request IP. In a single-tenant deployment that is fine. In a multi-tenant deployment behind a shared NAT (two tenants whose servers happen to live in the same cloud region, two CI pipelines running tests behind the same egress address) the bucket gets shared. Tenant A burns through tenant B's allowance. The cap stops being per-tenant in a different way: the resolver returns the right number, but the bucket it counts against is wrong.

Setting keyGenerator to request.tenantId partitions the bucket per authenticated tenant. The fallback to request.ip covers the unauthenticated case (which the route does not allow, but defensive code is cheap and explicit defaults beat surprising ones).

The third thing is the resolver clamp at 1000. The admin route validates input at 1-1000, but a tenant whose config gets corrupted, or an old tenant from before the validation existed, could in principle have a number out of range stored in their JSONB config. The resolver caps the return value defensively. A misconfigured tenant cannot DoS the platform by setting events_per_minute: 999999 directly in the database.

Why a Single-Tenant Test Would Have Shipped This

The hook-ordering issue was not in the documentation for @fastify/rate-limit because it is not a bug in the plugin. The plugin defaults to onRequest because that is the right stage for rate limiting in 99% of cases. You want to reject over-limit requests as early as possible, before any work happens. The Hub's case is the 1% where the rate check depends on data that is only available after another hook has run. The plugin gives you the escape hatch, but you only know to use it once you have hit the problem.

The integration test that caught this was not a sophisticated test. It seeded two tenants, fired eleven requests for each, and asserted on the response codes. An earlier draft only seeded tenant A with a cap of 10 and asserted it got rate-limited at request 11. Tenant A was the only tenant. The test passed. I added tenant B with cap 100 specifically to verify that the per-tenant bucket worked, and the test failed with both tenants getting blocked.

If I had only tested one tenant, the rate limit would have shipped looking like it worked. Every tenant on the platform would silently be on the same global cap, and the only way to discover it would be a tenant raising a support ticket about being blocked at unexpected request counts. That is the kind of bug that runs in production for months because nothing visibly breaks.

The test fix was as simple as the production fix. Changing hook: 'preHandler' and keyGenerator: req.tenantId made the assertion pass. Total LOC change: two lines. Total time spent debugging from "why does tenant B also block" to "Fastify hook ordering": about an hour, most of it spent printing request.tenant from inside the max callback and confirming it was undefined.

What Else the Patch Quietly Got Right

H7 ships with per-tenant rate limiting that actually scopes per tenant. Tenant default is 200 requests per minute, raised to a configurable value via PATCH /api/admin/tenants/:id/rate-limit, capped defensively at 1000. The admin route uses spread-merge so updating rate_limits.events_per_minute preserves the rest of tenants.config (the channel credentials, the dedup window, the sandbox flag) without overwriting them. The PATCH was simple to write. Forgetting to use spread-merge would have been a different production incident: an admin update silently wiping every tenant's Resend API key.

Test coverage went beyond the spec. The resolver got five tests covering null config, missing override, explicit override, cap-at-1000, and the default. The integration suite got five more covering the per-tenant bucket, admin happy path, range validation at both ends, 404 on missing tenant, and config preservation across the update. The admin tests assert that channel credentials and dedup_window survive the PATCH untouched. The spread-merge contract is the thing that has to keep working, not just on the day it was written, but on every future change to the admin route.

The hook-ordering trick is now in the project's pattern 004 (per-route rate limit) as a note for the next route that needs a dynamic per-tenant cap. It will not be the last one. The same pattern will apply to per-tenant request limits on the suppressions endpoints, the templates endpoints, and any future endpoint where the cap depends on tenant config.

The fix is one line. The understanding it required is the lifecycle of every plugin in the request chain and the order in which Fastify will call them. That is the kind of detail that does not show up in a unit test for a resolver function.