If you’ve ever built retail software, you know the happy-path demo is the easy part.
The hard part is everything around it:
- internet goes down,
- HQ APIs lag or fail,
- terminals still need to sell,
- receipts still need to print,
- and nothing can be lost.
I built ApexEdge, a Rust-powered store hub orchestrator that sits between POS/mPOS clients and HQ systems:
POS/mPOS <-> ApexEdge <-> HQ

Repo: https://github.com/AncientiCe/apex-edge
This post is about the actual engineering problems I had to solve, and the architecture patterns that made the system reliable in production-like conditions.
The Core Constraint: Stores Must Keep Selling
Retail can’t block on cloud availability. That drove my first principle:
The store hub is the source of operational truth during a transaction.
In practice, that means:
- local persistence for catalog/prices/promos/customers/config,
- local cart + checkout orchestration,
- local document generation (receipt, merchant copy, kitchen chit, etc.),
- async sync with HQ, not inline dependency.
If HQ is unavailable, checkout should still complete locally.
Synchronization is eventually consistent, but sales flow is immediate.
Why a Hub Instead of POS Calling HQ Directly?
Direct POS -> HQ can work for tiny setups, but at scale it creates fragile coupling:
- every terminal becomes an integration client,
- token/session handling is duplicated per app/device,
- every command depends on WAN quality,
- retries/idempotency become inconsistent across clients.
The hub model centralizes this:
- one northbound contract for POS commands,
- one southbound contract for HQ submission/sync,
- one place to enforce idempotency, retries, conflict handling, and observability.
Contract-Driven Commands Instead of “Random Endpoints”
Instead of exposing many ad-hoc mutable endpoints, I route checkout behavior through a command envelope:
POST /pos/command
Examples:
- create_cart
- add_line_item
- set_customer
- set_tendering
- add_payment
- finalize_order
This gave me major wins:
- Compatibility discipline: versioned command envelope.
- Idempotency at the boundary: safe retries from POS.
- Unified observability: command metrics by operation/outcome.
- Deterministic testing: journey tests mirror real checkout flows.
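To make the envelope idea concrete, here is a minimal sketch of what a versioned command envelope might look like. The field names are illustrative, not ApexEdge's actual schema.

```rust
/// A minimal, versioned command envelope (illustrative field names,
/// not the real ApexEdge schema).
#[derive(Debug, Clone, PartialEq)]
pub struct CommandEnvelope {
    /// Schema version, bumped on breaking changes.
    pub version: u32,
    /// Operation name, e.g. "add_line_item".
    pub command: String,
    /// Client-generated key that makes retries safe.
    pub idempotency_key: String,
    /// Cart/order scope the command applies to.
    pub cart_id: Option<String>,
    /// Opaque payload for the specific operation.
    pub payload: String,
}

impl CommandEnvelope {
    pub fn new(command: &str, idempotency_key: &str) -> Self {
        Self {
            version: 1,
            command: command.to_string(),
            idempotency_key: idempotency_key.to_string(),
            cart_id: None,
            payload: String::new(),
        }
    }
}
```

A single envelope shape means versioning, idempotency, and metrics labeling all hang off one type instead of being re-invented per endpoint.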
Problem #1: “Exactly Once” Is a Lie (So I Designed for At-Least-Once)
Networks duplicate requests. Clients retry. Humans double-click.
So I assume at-least-once delivery and make handlers idempotent:
- the command envelope includes an idempotency_key,
- the server stores command results keyed by scope,
- a repeated command with the same key returns the same logical response.
This prevents duplicate line items, duplicate payments, and duplicate order submission side effects.
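The mechanics can be sketched in a few lines: cache results by (scope, key), and replay the stored response on retry instead of re-executing side effects. This is a simplified in-memory stand-in for what would be a durable table in practice; names are illustrative.

```rust
use std::collections::HashMap;

/// At-least-once-safe command handling sketch: results are cached by
/// (scope, idempotency_key), so a retried command replays the original
/// response instead of re-running side effects.
pub struct IdempotencyStore {
    results: HashMap<(String, String), String>,
}

impl IdempotencyStore {
    pub fn new() -> Self {
        Self { results: HashMap::new() }
    }

    /// Runs `handler` at most once per (scope, key); later calls with
    /// the same key return the stored result without invoking it again.
    pub fn execute<F>(&mut self, scope: &str, key: &str, handler: F) -> String
    where
        F: FnOnce() -> String,
    {
        let k = (scope.to_string(), key.to_string());
        if let Some(cached) = self.results.get(&k) {
            return cached.clone();
        }
        let result = handler();
        self.results.insert(k, result.clone());
        result
    }
}
```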
Problem #2: Reliable HQ Submission Without Blocking Checkout
Inline finalize -> call HQ is brittle. If HQ times out, do you fail the sale?
I don’t.
I use a durable outbox:
- finalize order locally,
- atomically write HQ submission payload into outbox table,
- background dispatcher posts to HQ with retry/backoff,
- mark accepted/retry/dead-letter based on result.
This separates customer-facing latency from external dependency reliability.
Operationally, this is huge:
- cashiers are not blocked by HQ instability,
- submissions are durable across process restarts,
- dead-letter rows are inspectable and replayable.
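The state machine behind the outbox is small: a row is pending until it is accepted, retried with backoff on failure, and dead-lettered once it exhausts its budget. The sketch below shows one dispatcher pass over a single row; the status names, retry limit, and backoff curve are illustrative, not the real schema.

```rust
/// Illustrative outbox row states, not the real ApexEdge schema.
#[derive(Debug, Clone, PartialEq)]
pub enum OutboxStatus {
    Pending,
    Accepted,
    DeadLetter,
}

#[derive(Debug, Clone)]
pub struct OutboxRow {
    pub payload: String,
    pub attempts: u32,
    pub status: OutboxStatus,
}

const MAX_ATTEMPTS: u32 = 5;

/// One dispatcher pass: try to submit, record the outcome, and
/// dead-letter rows that exhaust their retry budget.
pub fn dispatch_once<F>(row: &mut OutboxRow, submit: F)
where
    F: Fn(&str) -> Result<(), String>,
{
    if row.status != OutboxStatus::Pending {
        return;
    }
    row.attempts += 1;
    match submit(&row.payload) {
        Ok(()) => row.status = OutboxStatus::Accepted,
        Err(_) if row.attempts >= MAX_ATTEMPTS => row.status = OutboxStatus::DeadLetter,
        Err(_) => {} // stays Pending; the real dispatcher schedules backoff
    }
}

/// Exponential backoff delay in milliseconds for a given attempt.
pub fn backoff_ms(attempt: u32) -> u64 {
    100u64.saturating_mul(1u64 << attempt.min(10))
}
```

The key property: the row outlives the process, so a crash between "finalize order" and "HQ accepted" loses nothing.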
Problem #3: Keeping Catalog/Pricing Fresh Without Breaking Ongoing Sales
HQ sync is asynchronous NDJSON ingest with checkpoints per entity:
- catalog
- categories
- price book
- tax rules
- promotions
- customers
- coupons
- inventory
- print templates
Design goals:
- ingest in batches,
- move checkpoints only on successful entity application,
- tolerate unknown entities for forward compatibility,
- surface sync state via API for UI visibility.
This avoids all-or-nothing fragility and makes partial progress explicit.
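The checkpoint discipline can be sketched as: apply NDJSON lines in order, advance the checkpoint only past lines that applied successfully, and stop at the first failure so the next sync pass resumes exactly there. The apply callback stands in for real per-entity handling.

```rust
/// Checkpointed batch ingest sketch: the checkpoint only advances past
/// lines that were applied successfully, so a retry resumes at the
/// failure point instead of restarting the feed.
pub fn ingest_batch<F>(lines: &[&str], checkpoint: usize, mut apply: F) -> usize
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut cp = checkpoint;
    for line in &lines[checkpoint..] {
        match apply(line) {
            Ok(()) => cp += 1,
            Err(_) => break, // stop; resume here on the next sync pass
        }
    }
    cp
}
```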
Problem #4: Inventory Truth vs. Checkout Experience
Stock rules are trickier than “if qty > 0”.
I enforce availability at add-to-cart time:
- inactive items are blocked,
- out-of-stock items are blocked,
- quantity checks can return insufficient stock errors.
But I also handle incomplete sync states pragmatically:
- if inventory is not yet synced for an item, I allow add-to-cart (configurable policy at architecture level).
That tradeoff favors operational continuity while still applying strict checks where data exists.
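The policy above fits in one function: strict where inventory is known, permissive where it has not yet synced. Types and names are illustrative, not ApexEdge's real domain model.

```rust
/// Illustrative add-to-cart availability outcomes.
#[derive(Debug, PartialEq)]
pub enum Availability {
    Ok,
    Inactive,
    OutOfStock,
    InsufficientStock { available: u32 },
}

pub struct Item {
    pub active: bool,
    /// None = inventory not yet synced for this item.
    pub on_hand: Option<u32>,
}

pub fn check_add_to_cart(item: &Item, qty: u32) -> Availability {
    if !item.active {
        return Availability::Inactive;
    }
    match item.on_hand {
        None => Availability::Ok, // configurable policy: allow when unsynced
        Some(0) => Availability::OutOfStock,
        Some(n) if n < qty => Availability::InsufficientStock { available: n },
        Some(_) => Availability::Ok,
    }
}
```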
Problem #5: Document Generation Should Be Local and Deterministic
Receipts are not optional, and they can’t depend on round-tripping to HQ.
I generate documents in the hub:
- render from synced templates + order/cart view models,
- persist document artifacts,
- expose retrieval endpoints for POS clients.
POS remains responsible for printer/device UX, but generation is centralized and reproducible.
Bonus: template updates can be distributed through sync instead of app redeploys.
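At its core, deterministic local rendering is just "synced template + view model in, document out". The sketch below uses a toy {{placeholder}} syntax to show the shape; the real template format is a separate design decision.

```rust
use std::collections::HashMap;

/// Deterministic document rendering sketch: a synced template with
/// {{placeholder}} tokens is filled from an order/cart view model.
/// The template syntax here is illustrative.
pub fn render_document(template: &str, model: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (key, value) in model {
        out = out.replace(&format!("{{{{{}}}}}", key), value);
    }
    out
}
```

Because rendering is a pure function of template and model, the same order always yields the same document, which makes reprints and audits trivial.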
Problem #6: Security for Shared, Real-World Devices
mPOS fleets need practical trust bootstrap:
- generate pairing codes,
- pair device once,
- exchange external identity token + device proof,
- receive local hub access/refresh tokens,
- protect operational routes behind hub auth.
This model gives controlled device onboarding without hardcoding secrets into POS binaries.
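The one-time-code exchange can be sketched as below. Everything here is a stand-in: the token generation is not real crypto and the types are illustrative; the interesting property is that a pairing code is single-use and yields device-scoped hub tokens.

```rust
/// Pairing flow sketch: a one-time pairing code is exchanged for local
/// hub access/refresh tokens. Illustrative only; token generation here
/// is a placeholder, not production crypto.
pub struct Hub {
    pending_pairing_codes: Vec<String>,
}

pub struct HubTokens {
    pub access: String,
    pub refresh: String,
}

impl Hub {
    pub fn pair_device(&mut self, pairing_code: &str, device_id: &str) -> Result<HubTokens, String> {
        // A pairing code is single-use: remove it on success.
        let pos = self
            .pending_pairing_codes
            .iter()
            .position(|c| c == pairing_code)
            .ok_or_else(|| "invalid or already-used pairing code".to_string())?;
        self.pending_pairing_codes.remove(pos);
        Ok(HubTokens {
            access: format!("access-{device_id}"),
            refresh: format!("refresh-{device_id}"),
        })
    }
}
```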
Problem #7: You Can’t Operate What You Can’t See
I instrumented behavior ownership across routes and background flows:
- command counts + latencies + outcomes,
- outbox dispatch attempts/duration/DLQ,
- sync ingest batches + durations + outcomes,
- DB operation outcomes,
- HTTP-level request metrics.
In an outage, these metrics answer:
- is checkout still flowing?
- are HQ submissions backlogged?
- is sync stuck on a specific entity?
- are failures concentrated at DB, network, or domain validation?
Without this, “it feels slow” becomes your only signal.
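The labeling scheme matters more than the metrics library. A sketch of command metrics keyed by (operation, outcome), which is what makes the questions above answerable per command type; a real deployment would back this with a metrics crate.

```rust
use std::collections::HashMap;

/// Command metrics sketch keyed by (operation, outcome). A real system
/// would use a metrics crate; the labeling scheme is the point.
#[derive(Default)]
pub struct CommandMetrics {
    counts: HashMap<(String, String), u64>,
}

impl CommandMetrics {
    pub fn record(&mut self, operation: &str, outcome: &str) {
        *self
            .counts
            .entry((operation.to_string(), outcome.to_string()))
            .or_insert(0) += 1;
    }

    pub fn count(&self, operation: &str, outcome: &str) -> u64 {
        *self
            .counts
            .get(&(operation.to_string(), outcome.to_string()))
            .unwrap_or(&0)
    }
}
```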
Testing Strategy That Actually Catches Regressions
I leaned hard on behavior-level tests:
- in-process smoke tests for health/ready/basic command flows,
- full journey tests from cart creation to finalized order + document generation + outbox payload assertions,
- crate-level tests for domain, storage, sync, outbox, printing, and contracts.
This matters because distributed correctness is mostly integration correctness.
Architecture Pattern Summary
What worked for me:
- Local-first transaction boundary
- Command envelope + idempotency
- Durable outbox for external submission
- Checkpointed async ingest for HQ sync
- Local document generation
- Device trust + token exchange
- First-class metrics on all critical paths
If you’re building store, warehouse, or edge-heavy systems, this combination gives resilience without requiring “perfect network” fantasies.
Tradeoffs (No Silver Bullets)
What you pay for this architecture:
- more moving parts than direct API calls,
- stronger schema/contract discipline required,
- eventual consistency complexity,
- operational tooling for DLQ replay and sync introspection.
Still worth it for domains where downtime equals lost revenue.
Practical Advice If You’re Starting Today
- Start with idempotency keys on day one.
- Model outbox as a product feature, not a reliability patch.
- Keep sync entity-specific with independent checkpoints.
- Treat metrics as acceptance criteria.
- Run end-to-end journey tests before adding more endpoints.
- Document behavior ownership explicitly (who owns which route/flow/metric).
Closing
The biggest shift for me was this:
I stopped optimizing for request/response elegance and started optimizing for business continuity under failure.
That changes everything from API shape to data model to test strategy.
If you’re designing offline-capable systems and want to compare patterns, I’m happy to share a deeper follow-up on:
- idempotency schema design,
- outbox retry and DLQ policy,
- sync conflict handling,
- or observability dashboards that worked in practice.