
Why I Made WebSocket Delivery the Disposable Part of the Tracking System

The uncomfortable failure is not a carrier outage. That one is loud. The polling loop logs it, retries it, and moves on.

The failure I had to design around is quieter: the gateway receives a real carrier update, writes it to the database, then Redis is down at the exact moment the WebSocket fanout should happen. The client misses the live push. The shipment state is still correct. The event timeline is still correct. But the thing the user was watching in real time never moves.

That sounds like a broken real-time system until you decide which part is allowed to be temporary.

I made the database the truth and the WebSocket stream the delivery layer. That one decision shaped the rest of the gateway.

The real boundary

DHL, DPD, and GLS do not send one clean stream of facts. The adapters have to handle different request shapes, different status codes, and different timestamp rules. DHL authenticates with a DHL-API-Key header and returns shipment events under shipments[0].events. DPD can return HTML for an error response, so DpdAdapter checks the content type before it ever tries to parse JSON. GLS sends a date like 06.05.2026 plus a local time string, so GlsAdapter has to parse that CET local time into UTC.

That is all adapter work. Once an event crosses into the core, it has one shape: carrier, tracking number, carrier status, normalized status, carrier timestamp, and a deterministic dedupKey.
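
The repo's actual type is not shown in this post, but the normalized shape described above would look roughly like this (an illustrative TypeScript sketch; the field names are assumptions):

// Illustrative sketch of the normalized event shape. Names are
// assumptions; the repo's actual type may differ.
interface NormalizedTrackingEvent {
  carrier: "dhl" | "dpd" | "gls";
  trackingNumber: string;
  carrierStatus: string;    // raw status text from the carrier
  normalizedStatus: string; // mapped into the gateway's own vocabulary
  carrierTimestamp: string; // ISO 8601, already converted to UTC
  dedupKey: string;         // deterministic identity, generated below
}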

The key is generated from four values:

export function generateDedupKey(input: DedupKeyInput): string {
  const carrier = normalizeCarrier(input.carrier);
  const trackingNumber = assertRequiredString(input.trackingNumber, "trackingNumber");
  const carrierStatus = assertRequiredString(input.carrierStatus, "carrierStatus");
  const carrierTimestampIso = normalizeTimestamp(input.carrierTimestamp);

  return [carrier, trackingNumber, carrierStatus, carrierTimestampIso].join(":");
}

That function is plain, but it carries the system. A carrier can send the same scan twice. A poll can retry after a timeout. A process can restart and fetch the same history again. If the fact is the same, the key is the same.
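
The helpers it leans on are not shown in the post. A minimal sketch of what each one has to guarantee, as hypothetical reconstructions rather than the repo's code:

// Hypothetical reconstructions of the helpers generateDedupKey relies on.
function assertRequiredString(value: unknown, field: string): string {
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`${field} is required`);
  }
  return value.trim();
}

function normalizeCarrier(value: string): string {
  // Lowercase and trim so "DHL" and "dhl " produce the same key.
  return assertRequiredString(value, "carrier").toLowerCase();
}

function normalizeTimestamp(value: string | Date): string {
  // Collapse every timestamp representation into one canonical UTC ISO string.
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new Error("carrierTimestamp is not a valid date");
  }
  return date.toISOString();
}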

What surprised me was how much of the gateway became simpler once duplicate handling moved into the event identity instead of the polling loop. The poller does not need to remember what it saw last time. The adapters do not need per-carrier duplicate caches. The WebSocket layer does not decide whether something is new. The processor decides once, against PostgreSQL.

The processor is the load-bearing part

I was wrong at first about where the "real-time" complexity would live. I expected it to be WebSockets: connection cleanup, heartbeats, subscription maps, broadcast performance. Those were real problems, but they were not the hard correctness problem.

The hard part was making sure Redis never became the source of truth by accident.

createEventProcessor starts with a dedup lookup, then inserts into tracking_events with onConflictDoNothing(). That second guard matters because two workers can pass the lookup at the same time. The database still gets the final vote.
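
A sketch of that double guard, assuming a Drizzle trackingEvents table with a unique index on its dedup key column (table and column names are assumptions; db and event come from the processor's own context):

import { eq } from "drizzle-orm";

// First guard: a cheap lookup before doing any work.
const existing = await db
  .select({ id: trackingEvents.id })
  .from(trackingEvents)
  .where(eq(trackingEvents.dedupKey, event.dedupKey))
  .limit(1);
if (existing.length > 0) return { status: "duplicate" };

// Second guard: the unique index gets the final vote when two
// workers pass the lookup at the same time.
const inserted = await db
  .insert(trackingEvents)
  .values(event)
  .onConflictDoNothing({ target: trackingEvents.dedupKey })
  .returning({ id: trackingEvents.id });
const isNewEvent = inserted.length > 0;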

After the insert, it updates the shipment projection only when the new carrier timestamp is newer than last_event_at. Older events are still persisted. They just do not move the visible shipment state backwards.

The important shape is this:

import { and, eq, isNull, lt, or } from "drizzle-orm";

// Only advance the projection when the incoming carrier timestamp
// is strictly newer than what the shipment already shows.
const updatedRows = await options.db
  .update(shipments)
  .set({
    currentStatus: event.normalizedStatus,
    lastEventAt: event.carrierTimestamp,
    updatedAt: new Date()
  })
  .where(
    and(
      eq(shipments.id, event.shipmentId),
      // First event ever, or strictly newer than the last one applied.
      or(isNull(shipments.lastEventAt), lt(shipments.lastEventAt, event.carrierTimestamp))
    )
  )
  .returning({ id: shipments.id });
projectionUpdated = updatedRows.length > 0;

That is the line between history and projection. An out-of-order scan belongs in history. It does not belong on the dashboard as the current state.

The Redis publish happens only after the database work. If the database write fails, the processor returns failed and does not publish. If the Redis publish fails, the processor still returns processed with published: false. That asymmetry is deliberate. A WebSocket update can be missed. A carrier fact cannot be invented.
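
In code, the asymmetry is just a try/catch placed after the persistence step, not around it. A sketch, where persistEvent, publisher, and logger are illustrative stand-ins rather than the repo's API:

// Illustrative helpers; names and shapes are assumptions.
declare function persistEvent(event: object): Promise<object>;
declare const publisher: { publish(stream: string, payload: string): Promise<void> };
declare const logger: { warn(ctx: object, msg: string): void };

async function processAndPublish(event: object) {
  // Database failure aborts the whole attempt: facts are never invented.
  const persisted = await persistEvent(event); // throws -> "failed", no publish

  // Redis failure is tolerated: delivery is the disposable layer.
  let published = true;
  try {
    await publisher.publish("tracking:events", JSON.stringify(persisted));
  } catch (error) {
    published = false;
    logger.warn({ error }, "event persisted but not published");
  }
  return { status: "processed" as const, published };
}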

The integration tests capture both sides. One test inserts an invalid shipment id and asserts that Redis stays empty. Another makes the publisher throw redis unavailable and asserts the PostgreSQL event still exists. Those tests are more useful than a happy-path WebSocket demo because they pin down which failure the business can recover from.

Why not broadcast directly?

The simple version is tempting. A client subscribes to a tracking number. The processor receives an event. It loops over sockets and calls send().

That works until the process restarts, the send throws, or the event arrives in a worker that does not own the socket. Even in a single-process version, direct send ties event processing to live delivery. That is the dependency I wanted to avoid.

Redis Streams gave the gateway a narrow delivery contract. The processor writes one stream entry to tracking:events. The broadcaster consumes as group ws-broadcaster, parses the payload, looks up connection ids by tracking number, and sends JSON to sockets registered in the current process.
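
The post does not show the client calls, but with ioredis the contract is roughly one XADD on the producer side and a consumer-group read on the other. A sketch under those assumptions; the stream key and group name come from the article, the rest is illustrative:

import Redis from "ioredis";

const redis = new Redis();

// Producer side: one stream entry per processed event.
async function publishEvent(event: object) {
  return redis.xadd("tracking:events", "*", "payload", JSON.stringify(event));
}

// Consumer side: the broadcaster reads new entries as its consumer group.
async function readBatch(consumer: string) {
  return redis.xreadgroup(
    "GROUP", "ws-broadcaster", consumer,
    "COUNT", 10,
    "BLOCK", 5000,
    "STREAMS", "tracking:events", ">"
  );
}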

The broadcaster code has a small but telling rule: malformed stream entries are acknowledged.

At first that felt wrong. Acknowledging a bad message means admitting it will never be delivered. But leaving invalid JSON pending forever is worse. It blocks operational visibility and makes the group look unhealthy for a message that cannot succeed on retry. The test for that case writes { as the payload and asserts it gets logged and acked.
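
Continuing the ioredis sketch above, the rule fits in a few lines. deliverToSubscribers is an illustrative stand-in for the fanout, and entries is the list of [id, fields] pairs unpacked from the group read:

declare function deliverToSubscribers(update: unknown): Promise<void>;

for (const [id, fields] of entries) {
  let update: unknown;
  try {
    update = JSON.parse(fields[1]); // fields = ["payload", "<json>"]
  } catch (error) {
    // Invalid JSON can never succeed on retry: log it, ack it, and keep
    // the pending list meaningful for messages that might still deliver.
    console.error("malformed stream entry acknowledged", id, error);
    await redis.xack("tracking:events", "ws-broadcaster", id);
    continue;
  }
  await deliverToSubscribers(update);
  await redis.xack("tracking:events", "ws-broadcaster", id); // ack after delivery
}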

The other Redis detail is XAUTOCLAIM. If a consumer reads a stream entry and dies before acknowledging it, the message sits pending. The broadcaster reclaims stuck messages after 60 seconds. That is enough for the current single-service deployment, and it creates a clear upgrade path for multi-process delivery later.
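
With ioredis that reclaim is a single call. A sketch: the 60-second threshold comes from the article, while the consumer name and batch size are assumptions:

// Reclaim entries that have sat unacknowledged for 60s or more.
const [nextCursor, reclaimed] = await redis.xautoclaim(
  "tracking:events", "ws-broadcaster", "broadcaster-1",
  60_000, // minimum idle time in ms before an entry is claimable
  "0-0",  // scan from the start of the pending list
  "COUNT", 10
);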

The polling loop stays boring on purpose

The polling engine is less clever than the rest of the system. That is good.

It loads enabled carrier configs from the database, starts one loop per carrier, queries active shipments while excluding delivered, returned, and soft-deleted (deleted_at) rows, then batches them by POLL_BATCH_SIZE. The default is 10.

Each shipment call runs through withExponentialBackoff. RateLimitError and CarrierError are retryable by default. The delay is baseDelayMs * 2 ** attempt, with the base coming from carrier_configs.backoff_base_ms.
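
A minimal version of that wrapper, using the delay formula from the article. The error classes are the gateway's own; they are declared here only so the sketch stands alone, and the attempt cap is an assumption:

declare class RateLimitError extends Error {}
declare class CarrierError extends Error {}

async function withExponentialBackoff<T>(
  fn: () => Promise<T>,
  baseDelayMs: number,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const retryable =
        error instanceof RateLimitError || error instanceof CarrierError;
      if (!retryable || attempt >= maxAttempts - 1) throw error;
      // Delay formula from the article: baseDelayMs * 2 ** attempt.
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      );
    }
  }
}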

The scheduler has one job: skip overlapping cycles. If a carrier cycle is still running when the next interval fires, it returns a structured skipped result with reason cycle_already_running. It does not start a second poller and hope the processor dedup saves the day.
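
The overlap guard is essentially one flag per carrier. A sketch with assumed names; the success branch of the result shape is illustrative:

type CycleResult =
  | { skipped: true; reason: "cycle_already_running" }
  | { skipped: false; eventsFound: number }; // success shape assumed

declare function pollCarrier(carrier: string): Promise<CycleResult>;

const runningCarriers = new Set<string>();

async function runCycle(carrier: string): Promise<CycleResult> {
  if (runningCarriers.has(carrier)) {
    // Never stack pollers; report the skip as a structured result.
    return { skipped: true, reason: "cycle_already_running" };
  }
  runningCarriers.add(carrier);
  try {
    return await pollCarrier(carrier);
  } finally {
    runningCarriers.delete(carrier);
  }
}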

That skip behavior matters because the PRD target is 100 active shipments across 3 carriers, with a carrier polling cycle under 30 seconds and WebSocket delivery under 200ms once the event enters the pipeline. If the poller stacks, the gateway creates its own load spike and every downstream metric lies.

The timestamp bug that exposed the database shape

The most useful gotcha came from pagination, not carrier integration. Cursor pagination repeated a shipment on page 2 even though the SQL condition compared the sort timestamp against the cursor timestamp.

The schema uses PostgreSQL timestamp without time zone. Passing a JavaScript Date through Drizzle and node-postgres introduced enough interpretation drift that the comparison did not behave like the stored value. The fix was to format the cursor as a PostgreSQL timestamp literal and cast it with ::timestamp.

That is why shipments.routes.ts has this helper:

// "2026-05-06T10:15:00.000Z" -> "2026-05-06 10:15:00.000"
function toPgTimestamp(value: Date): string {
  return value.toISOString().replace("T", " ").replace("Z", "");
}

It looks like formatting trivia. It is actually a pagination invariant. If the cursor repeats rows, clients can process the same shipment twice, or skip a page entirely when they try to recover.
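
The helper only matters because of where the cast lands in the query. Roughly, using Drizzle's sql template; which column is the sort key is an assumption here:

import { sql } from "drizzle-orm";

// Compare against an explicit timestamp literal, cast on the database
// side, so no driver-level Date interpretation can drift.
const cursorCondition = sql`${shipments.createdAt} < ${toPgTimestamp(cursor)}::timestamp`;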

The part I left deliberately unfinished

The stats endpoint exposes a limitation instead of pretending the system has a metric it does not own yet. It returns active shipments and today's event counts by carrier, but error_rate and errors are null. The response includes a metrics_limitations entry that says carrier poll failure counters are not persisted yet.
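
As a TypeScript shape, that honest contract looks something like this. Only error_rate, errors, and metrics_limitations are named in the text; the other field names are illustrative:

// Illustrative response shape for /api/stats; several names assumed.
interface StatsResponse {
  active_shipments: number;
  events_today_by_carrier: Record<string, number>;
  error_rate: null; // not owned yet: poll failure counters are not persisted
  errors: null;
  metrics_limitations: string[];
}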

That is not a polished answer, but it is the honest one. The polling engine logs retry attempts and per-cycle errors, and the build journal records the summary fields: carrier, shipments polled, events found, deduplicated events, and errors. Those numbers exist at runtime. They do not yet survive as queryable history.

I prefer that over a fake rate derived from current memory. A dashboard number that resets on restart is worse than no number because it invites operational decisions from incomplete evidence. If this gateway were being deployed for a client, the next increment would be a small poll-cycle table or metrics sink before anyone relied on /api/stats for carrier health.

The same honesty shows up in the success criteria. The system has the local loop, the production-start check, the Docker Compose smoke proof, and the mock-carrier journey. It does not have the 24-hour simulated-load soak. That missing box is not paperwork. It is the difference between "the architecture handles the failure model" and "this process survived a day of real runtime pressure."

The result

The finished local build passed 134 Vitest tests. The combined unit and integration coverage run landed at 83.32% statement and line coverage. The WebSocket E2E test publishes a Redis stream event and asserts delivery under 200ms after the broadcaster group exists. The mock-carrier demo registers DHL, DPD, and GLS shipments, subscribes over WebSocket, processes events through the real event processor, verifies duplicate suppression, and confirms the REST API shows in-transit shipments.

The 24-hour soak is still deferred. That matters. A gateway like this is not production-proven just because the happy path works for a few minutes.

But the failure model is in the right order. Carrier facts land in PostgreSQL first. Shipment state is a projection. Redis carries delivery. WebSockets are allowed to miss a push. The timeline is not allowed to lie.
