If you’ve ever built retail software, you know the happy-path demo is the easy part.
The hard part is everything around it:
- internet goes down,
- HQ APIs lag or fail,
- terminals still need to sell,
- receipts still need to print,
- and nothing can be lost.
I built ApexEdge, a Rust-powered store hub orchestrator that sits between POS/mPOS clients and HQ systems:
POS/mPOS <-> ApexEdge <-> HQ

Repo: https://github.com/AncientiCe/apex-edge
This post is about the actual engineering problems I had to solve, and the architecture patterns that made the system reliable in production-like conditions.
The Core Constraint: Stores Must Keep Selling
Retail can’t block on cloud availability. That drove my first principle:
The store hub is the source of operational truth during a transaction.
In practice, that means:
- local persistence for catalog/prices/promos/customers/config,
- local cart + checkout orchestration,
- local document generation (receipt, merchant copy, kitchen chit, etc.),
- async sync with HQ, not inline dependency.
If HQ is unavailable, checkout should still complete locally.
Synchronization is eventually consistent, but sales flow is immediate.
Why a Hub Instead of POS Calling HQ Directly?
Direct POS -> HQ can work for tiny setups, but at scale it creates fragile coupling:
- every terminal becomes an integration client,
- token/session handling is duplicated per app/device,
- every command depends on WAN quality,
- retries/idempotency become inconsistent across clients.
The hub model centralizes this:
- one northbound contract for POS commands,
- one southbound contract for HQ submission/sync,
- one place to enforce idempotency, retries, conflict handling, and observability.
Contract-Driven Commands Instead of “Random Endpoints”
Instead of exposing many ad-hoc mutable endpoints, I route checkout behavior through a command envelope:
POST /pos/command
Examples:
- create_cart
- add_line_item
- set_customer
- set_tendering
- add_payment
- finalize_order
This gave me major wins:
- Compatibility discipline: versioned command envelope.
- Idempotency at the boundary: safe retries from POS.
- Unified observability: command metrics by operation/outcome.
- Deterministic testing: journey tests mirror real checkout flows.
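To make the envelope idea concrete, here is a minimal sketch of what a versioned command envelope might look like. The field names are illustrative, not ApexEdge's actual schema.

```rust
/// A minimal, versioned command envelope (illustrative field names,
/// not the real ApexEdge schema).
#[derive(Debug, Clone, PartialEq)]
pub struct CommandEnvelope {
    /// Schema version, bumped on breaking changes.
    pub version: u32,
    /// Operation name, e.g. "add_line_item".
    pub command: String,
    /// Client-generated key that makes retries safe.
    pub idempotency_key: String,
    /// Cart/order scope the command applies to.
    pub cart_id: Option<String>,
    /// Opaque payload for the specific operation.
    pub payload: String,
}

impl CommandEnvelope {
    pub fn new(command: &str, idempotency_key: &str) -> Self {
        Self {
            version: 1,
            command: command.to_string(),
            idempotency_key: idempotency_key.to_string(),
            cart_id: None,
            payload: String::new(),
        }
    }
}
```

A single envelope shape means versioning, idempotency, and metrics labeling all hang off one type instead of being re-invented per endpoint.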
Problem #1: “Exactly Once” Is a Lie (So I Designed for At-Least-Once)
Networks duplicate requests. Clients retry. Humans double-click.
So I assume at-least-once delivery and make handlers idempotent:
- the command envelope includes an idempotency_key,
- the server stores command results keyed by scope,
- a repeated command with the same key returns the same logical response.
This prevents duplicate line items, duplicate payments, and duplicate order submission side effects.
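The mechanics can be sketched in a few lines: cache results by (scope, key), and replay the stored response on retry instead of re-executing side effects. This is a simplified in-memory stand-in for what would be a durable table in practice; names are illustrative.

```rust
use std::collections::HashMap;

/// At-least-once-safe command handling sketch: results are cached by
/// (scope, idempotency_key), so a retried command replays the original
/// response instead of re-running side effects.
pub struct IdempotencyStore {
    results: HashMap<(String, String), String>,
}

impl IdempotencyStore {
    pub fn new() -> Self {
        Self { results: HashMap::new() }
    }

    /// Runs `handler` at most once per (scope, key); later calls with
    /// the same key return the stored result without invoking it again.
    pub fn execute<F>(&mut self, scope: &str, key: &str, handler: F) -> String
    where
        F: FnOnce() -> String,
    {
        let k = (scope.to_string(), key.to_string());
        if let Some(cached) = self.results.get(&k) {
            return cached.clone();
        }
        let result = handler();
        self.results.insert(k, result.clone());
        result
    }
}
```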
Problem #2: Reliable HQ Submission Without Blocking Checkout
Inline finalize -> call HQ is brittle. If HQ times out, do you fail the sale?
I don’t.
I use a durable outbox:
- finalize order locally,
- atomically write HQ submission payload into outbox table,
- background dispatcher posts to HQ with retry/backoff,
- mark accepted/retry/dead-letter based on result.
This separates customer-facing latency from external dependency reliability.
Operationally, this is huge:
- cashiers are not blocked by HQ instability,
- submissions are durable across process restarts,
- dead-letter rows are inspectable and replayable.
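The state machine behind the outbox is small: a row is pending until it is accepted, retried with backoff on failure, and dead-lettered once it exhausts its budget. The sketch below shows one dispatcher pass over a single row; the status names, retry limit, and backoff curve are illustrative, not the real schema.

```rust
/// Illustrative outbox row states, not the real ApexEdge schema.
#[derive(Debug, Clone, PartialEq)]
pub enum OutboxStatus {
    Pending,
    Accepted,
    DeadLetter,
}

#[derive(Debug, Clone)]
pub struct OutboxRow {
    pub payload: String,
    pub attempts: u32,
    pub status: OutboxStatus,
}

const MAX_ATTEMPTS: u32 = 5;

/// One dispatcher pass: try to submit, record the outcome, and
/// dead-letter rows that exhaust their retry budget.
pub fn dispatch_once<F>(row: &mut OutboxRow, submit: F)
where
    F: Fn(&str) -> Result<(), String>,
{
    if row.status != OutboxStatus::Pending {
        return;
    }
    row.attempts += 1;
    match submit(&row.payload) {
        Ok(()) => row.status = OutboxStatus::Accepted,
        Err(_) if row.attempts >= MAX_ATTEMPTS => row.status = OutboxStatus::DeadLetter,
        Err(_) => {} // stays Pending; the real dispatcher schedules backoff
    }
}

/// Exponential backoff delay in milliseconds for a given attempt.
pub fn backoff_ms(attempt: u32) -> u64 {
    100u64.saturating_mul(1u64 << attempt.min(10))
}
```

The key property: the row outlives the process, so a crash between "finalize order" and "HQ accepted" loses nothing.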
Problem #3: Keeping Catalog/Pricing Fresh Without Breaking Ongoing Sales
HQ sync is asynchronous NDJSON ingest with checkpoints per entity:
- catalog
- categories
- price book
- tax rules
- promotions
- customers
- coupons
- inventory
- print templates
Design goals:
- ingest in batches,
- move checkpoints only on successful entity application,
- tolerate unknown entities for forward compatibility,
- surface sync state via API for UI visibility.
This avoids all-or-nothing fragility and makes partial progress explicit.
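The checkpoint discipline can be sketched as: apply NDJSON lines in order, advance the checkpoint only past lines that applied successfully, and stop at the first failure so the next sync pass resumes exactly there. The apply callback stands in for real per-entity handling.

```rust
/// Checkpointed batch ingest sketch: the checkpoint only advances past
/// lines that were applied successfully, so a retry resumes at the
/// failure point instead of restarting the feed.
pub fn ingest_batch<F>(lines: &[&str], checkpoint: usize, mut apply: F) -> usize
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut cp = checkpoint;
    for line in &lines[checkpoint..] {
        match apply(line) {
            Ok(()) => cp += 1,
            Err(_) => break, // stop; resume here on the next sync pass
        }
    }
    cp
}
```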
Problem #4: Inventory Truth vs. Checkout Experience
Stock rules are trickier than “if qty > 0”.
I enforce availability at add-to-cart time:
- inactive items are blocked,
- out-of-stock items are blocked,
- quantity checks can return insufficient stock errors.
But I also handle incomplete sync states pragmatically:
- if inventory is not yet synced for an item, I allow add-to-cart (configurable policy at architecture level).
That tradeoff favors operational continuity while still applying strict checks where data exists.
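The policy above fits in one function: strict where inventory is known, permissive where it has not yet synced. Types and names are illustrative, not ApexEdge's real domain model.

```rust
/// Illustrative add-to-cart availability outcomes.
#[derive(Debug, PartialEq)]
pub enum Availability {
    Ok,
    Inactive,
    OutOfStock,
    InsufficientStock { available: u32 },
}

pub struct Item {
    pub active: bool,
    /// None = inventory not yet synced for this item.
    pub on_hand: Option<u32>,
}

pub fn check_add_to_cart(item: &Item, qty: u32) -> Availability {
    if !item.active {
        return Availability::Inactive;
    }
    match item.on_hand {
        None => Availability::Ok, // configurable policy: allow when unsynced
        Some(0) => Availability::OutOfStock,
        Some(n) if n < qty => Availability::InsufficientStock { available: n },
        Some(_) => Availability::Ok,
    }
}
```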
Problem #5: Document Generation Should Be Local and Deterministic
Receipts are not optional, and they can’t depend on round-tripping to HQ.
I generate documents in the hub:
- render from synced templates + order/cart view models,
- persist document artifacts,
- expose retrieval endpoints for POS clients.
POS remains responsible for printer/device UX, but generation is centralized and reproducible.
Bonus: template updates can be distributed through sync instead of app redeploys.
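At its core, deterministic local rendering is just "synced template + view model in, document out". The sketch below uses a toy {{placeholder}} syntax to show the shape; the real template format is a separate design decision.

```rust
use std::collections::HashMap;

/// Deterministic document rendering sketch: a synced template with
/// {{placeholder}} tokens is filled from an order/cart view model.
/// The template syntax here is illustrative.
pub fn render_document(template: &str, model: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (key, value) in model {
        out = out.replace(&format!("{{{{{}}}}}", key), value);
    }
    out
}
```

Because rendering is a pure function of template and model, the same order always yields the same document, which makes reprints and audits trivial.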
Problem #6: Security for Shared, Real-World Devices
mPOS fleets need practical trust bootstrap:
- generate pairing codes,
- pair device once,
- exchange external identity token + device proof,
- receive local hub access/refresh tokens,
- protect operational routes behind hub auth.
This model gives controlled device onboarding without hardcoding secrets into POS binaries.
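The one-time-code exchange can be sketched as below. Everything here is a stand-in: the token generation is not real crypto and the types are illustrative; the interesting property is that a pairing code is single-use and yields device-scoped hub tokens.

```rust
/// Pairing flow sketch: a one-time pairing code is exchanged for local
/// hub access/refresh tokens. Illustrative only; token generation here
/// is a placeholder, not production crypto.
pub struct Hub {
    pending_pairing_codes: Vec<String>,
}

pub struct HubTokens {
    pub access: String,
    pub refresh: String,
}

impl Hub {
    pub fn pair_device(&mut self, pairing_code: &str, device_id: &str) -> Result<HubTokens, String> {
        // A pairing code is single-use: remove it on success.
        let pos = self
            .pending_pairing_codes
            .iter()
            .position(|c| c == pairing_code)
            .ok_or_else(|| "invalid or already-used pairing code".to_string())?;
        self.pending_pairing_codes.remove(pos);
        Ok(HubTokens {
            access: format!("access-{device_id}"),
            refresh: format!("refresh-{device_id}"),
        })
    }
}
```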
Problem #7: You Can’t Operate What You Can’t See
I instrumented behavior ownership across routes and background flows:
- command counts + latencies + outcomes,
- outbox dispatch attempts/duration/DLQ,
- sync ingest batches + durations + outcomes,
- DB operation outcomes,
- HTTP-level request metrics.
In an outage, these metrics answer:
- is checkout still flowing?
- are HQ submissions backlogged?
- is sync stuck on a specific entity?
- are failures concentrated at DB, network, or domain validation?
Without this, “it feels slow” becomes your only signal.
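The labeling scheme matters more than the metrics library. A sketch of command metrics keyed by (operation, outcome), which is what makes the questions above answerable per command type; a real deployment would back this with a metrics crate.

```rust
use std::collections::HashMap;

/// Command metrics sketch keyed by (operation, outcome). A real system
/// would use a metrics crate; the labeling scheme is the point.
#[derive(Default)]
pub struct CommandMetrics {
    counts: HashMap<(String, String), u64>,
}

impl CommandMetrics {
    pub fn record(&mut self, operation: &str, outcome: &str) {
        *self
            .counts
            .entry((operation.to_string(), outcome.to_string()))
            .or_insert(0) += 1;
    }

    pub fn count(&self, operation: &str, outcome: &str) -> u64 {
        *self
            .counts
            .get(&(operation.to_string(), outcome.to_string()))
            .unwrap_or(&0)
    }
}
```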
Testing Strategy That Actually Catches Regressions
I leaned hard on behavior-level tests:
- in-process smoke tests for health/ready/basic command flows,
- full journey tests from cart creation to finalized order + document generation + outbox payload assertions,
- crate-level tests for domain, storage, sync, outbox, printing, and contracts.
This matters because distributed correctness is mostly integration correctness.
Architecture Pattern Summary
What worked for me:
- Local-first transaction boundary
- Command envelope + idempotency
- Durable outbox for external submission
- Checkpointed async ingest for HQ sync
- Local document generation
- Device trust + token exchange
- First-class metrics on all critical paths
If you’re building store, warehouse, or edge-heavy systems, this combination gives resilience without requiring “perfect network” fantasies.
Tradeoffs (No Silver Bullets)
What you pay for this architecture:
- more moving parts than direct API calls,
- stronger schema/contract discipline required,
- eventual consistency complexity,
- operational tooling for DLQ replay and sync introspection.
Still worth it for domains where downtime equals lost revenue.
Practical Advice If You’re Starting Today
- Start with idempotency keys on day one.
- Model outbox as a product feature, not a reliability patch.
- Keep sync entity-specific with independent checkpoints.
- Treat metrics as acceptance criteria.
- Run end-to-end journey tests before adding more endpoints.
- Document behavior ownership explicitly (who owns which route/flow/metric).
Closing
The biggest shift for me was this:
I stopped optimizing for request/response elegance and started optimizing for business continuity under failure.
That changes everything from API shape to data model to test strategy.
If you’re designing offline-capable systems and want to compare patterns, I’m happy to share a deeper follow-up on:
- idempotency schema design,
- outbox retry and DLQ policy,
- sync conflict handling,
- or observability dashboards that worked in practice.