Prince Raj

Posted on Jun 4

How I Built a Hotel AI Platform in Go (And Every Honest Technical Debt We're Carrying)

#go #ai #architecture #backend

Building Stayzr meant solving real problems: PMS integration, high-throughput webhook handling, and AI that actually knows your property. Here's how we architected it.

The Stack (What's Running in Production)

Backend: Go 1.23 with Fiber framework, pgx/v5 connection pooling, Bun ORM over PostgreSQL, Redis for caching/sessions, OpenTelemetry for tracing

AI Agents Service: Python 3.11 + FastAPI (Uvicorn), LangChain primitives, Qdrant for knowledge base, ChromaDB for conversation memory

Frontend: Next.js 15 / React admin UI + marketing site

3rd-party Integrations: Mews (PMS), WhatsApp Business/Meta, Resend + Postmark (email), Azure Blob Storage (files), Gemini + OpenAI (LLM + embeddings), Infisical (secrets), SigNoz + Oneuptime (observability)

It's a polyglot monorepo: Go where throughput and concurrency matter (API, dispatch, sync), Python where the LLM/RAG ecosystem lives.

Why Go Over Python/Node/Java?

For the parts handling concurrent I/O — PMS sync workers, email dispatch worker, webhook fan-in — Go's goroutines + channels let us run in-process worker pools without pulling in a broker or heavyweight async runtime.

The dispatch worker is a for{ select } loop over a ticker and wake channel — simple and effective for our use case.

We kept Python only for the agents service because that's where LangChain, Gemini/OpenAI SDKs, and vector-store clients live. The honest answer: Go for systems work, Python where AI tooling requires it.

Multi-Tenancy: Row-Level Isolation

Shared database, shared schema, row-level isolation by organizationId. Every tenant-scoped table carries an organizationId, with a TenantDB wrapper in the data layer that auto-appends organization_id = $N to queries.
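To make the "auto-appends organization_id = $N" idea concrete, here is a simplified sketch. It is not the real TenantDB wrapper (which sits on top of Bun); `TenantScope`, `ForTenant`, and `Where` are hypothetical names, and the sketch only shows how the tenant predicate and its placeholder index can be derived from the existing arguments.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoTenant is returned when a scope is requested without an organization.
var ErrNoTenant = errors.New("tenant scope requires an organization ID")

// TenantScope is a hypothetical, simplified stand-in for a TenantDB wrapper.
type TenantScope struct {
	orgID string
}

// ForTenant refuses to build a scope without an organization ID, so an
// unscoped query cannot be produced by accident.
func ForTenant(orgID string) (*TenantScope, error) {
	if orgID == "" {
		return nil, ErrNoTenant
	}
	return &TenantScope{orgID: orgID}, nil
}

// Where wraps the caller's condition and appends organization_id = $N,
// where N is the next free positional placeholder.
func (t *TenantScope) Where(cond string, args ...any) (string, []any) {
	scoped := append(append([]any{}, args...), t.orgID)
	tenantCond := fmt.Sprintf("organization_id = $%d", len(scoped))
	if cond == "" {
		return tenantCond, scoped
	}
	// Parentheses keep an OR inside cond from escaping the tenant filter.
	return fmt.Sprintf("(%s) AND %s", cond, tenantCond), scoped
}
```

For example, `Where("status = $1", "CONFIRMED")` yields `(status = $1) AND organization_id = $2` with the org ID appended as the second argument.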

Middleware (MultiTenantContext / RequireTenant) resolves the org from the X-Organization-ID header, query param, cookie, or JWT claim. Below org we scope further by propertyId (a hotel can have multiple properties).

The AI memory store enforces the same boundary differently: every guest's conversation memory lives in a collection keyed org:{orgId}:guest:{guestId}, with a redundant guestId metadata filter as defense-in-depth.

PMS Integration: The Hardest Part

We currently integrate with Mews, a modern cloud-based PMS. The architecture is built to support multiple providers (scaffolding for Opera and Cloudbeds already exists).

The Architecture Supports Multiple Providers

There's a generic Transformer[TLocal, TRemote] interface for bidirectional mapping (remote ↔ local model) with conflict-resolution strategies: REMOTE_WINS, LOCAL_WINS, MERGE_FIELDS, MANUAL_REVIEW.
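Here is one way such an interface could look with Go generics. Only the `Transformer[TLocal, TRemote]` name and the four strategy names come from the description above; the method names, the `Guest` and `RemoteCustomer` shapes, and the `Resolve` helper are assumptions for illustration.

```go
package main

import "errors"

// ConflictStrategy names how a genuine local/remote conflict is settled.
type ConflictStrategy string

const (
	RemoteWins   ConflictStrategy = "REMOTE_WINS"
	LocalWins    ConflictStrategy = "LOCAL_WINS"
	MergeFields  ConflictStrategy = "MERGE_FIELDS"
	ManualReview ConflictStrategy = "MANUAL_REVIEW"
)

// Transformer maps one entity type in both directions.
// The method names are hypothetical.
type Transformer[TLocal, TRemote any] interface {
	ToLocal(remote TRemote) (TLocal, error)
	ToRemote(local TLocal) (TRemote, error)
}

// Hypothetical, heavily trimmed shapes for a guest on each side.
type RemoteCustomer struct{ ID, Email string }
type Guest struct{ ExternalID, Email string }

type guestTransformer struct{}

func (guestTransformer) ToLocal(r RemoteCustomer) (Guest, error) {
	if r.ID == "" {
		return Guest{}, errors.New("remote customer has no ID")
	}
	return Guest{ExternalID: r.ID, Email: r.Email}, nil
}

func (guestTransformer) ToRemote(g Guest) (RemoteCustomer, error) {
	return RemoteCustomer{ID: g.ExternalID, Email: g.Email}, nil
}

// Compile-time check that guestTransformer satisfies the interface.
var _ Transformer[Guest, RemoteCustomer] = guestTransformer{}

// Resolve applies a strategy to two conflicting versions. It reports
// needsReview when a human should decide; in that case the local value is
// kept so nothing is overwritten in the meantime (an assumption).
func Resolve[T any](s ConflictStrategy, local, remote T, merge func(local, remote T) T) (result T, needsReview bool) {
	switch s {
	case RemoteWins:
		return remote, false
	case LocalWins:
		return local, false
	case MergeFields:
		if merge != nil {
			return merge(local, remote), false
		}
		return local, true // no merge function supplied: defer to a human
	default: // MANUAL_REVIEW and anything unrecognised
		return local, true
	}
}
```

Keeping the strategy separate from the transformer means the same mapping code can be reused while the conflict policy varies per entity type or per tenant.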

Provider capabilities (API versions, base URLs, auth methods, supported entities, field mappings, rate limits, retry policy, webhook support) live in a pms_provider_configs table as JSONB. Adding a provider is "insert a row + implement the client."

Mews is interesting because it's REST but POST-only, with credentials in every request body rather than headers, and separate gross-pricing/net-pricing tokens.

The Real Challenge: Bidirectional Sync

Getting sync working wasn't hard because of HTTP. It was hard because of identity reconciliation and bidirectional sync without clobbering local edits.

We keep a pms_entity_mappings table translating local IDs ↔ Mews external IDs per entity type. The sync orchestrator diffs incoming records against local state, skips writes when there are no real changes, and routes genuine conflicts through the resolution strategy.

Getting "don't overwrite a staff member's local correction with stale remote data" right is the part that requires careful engineering.

Rate Limits, Downtime, Slow Responses

Three layers:

Token-bucket rate limiter in the Mews client (≈200 requests / 30s). Requests block on the bucket rather than getting 429'd.
Retry with exponential backoff + jitter (1s, 2s, 4s…) on 429/500/502/503/504, capped at 3 attempts.
Durable sync queue (pms_sync_queue table) processed by a polling worker. Failed items record attempt count, last error, and nextRetryAt, then retry later, so outages degrade to delayed sync, not lost sync.

Caching PMS Data

PMS data (guests, reservations, rooms) is synced into PostgreSQL and read from there, so Postgres acts as the cache. Redis holds auth, permission, and session data plus rate-limit counters, all with short TTLs (1 to 15 minutes).

Invalidation is event-driven: Mews webhooks (Reservation.Updated, Customer.Updated, Message.Added) enqueue re-sync of affected entities. No classic cache-invalidation race because there's no separate cache tier.

Webhooks: Ack-Fast, Process-Async

How We Handle Inbound Webhooks

WhatsApp/Meta: Handled in the Go backend, verified via X-Hub-Signature-256 (HMAC-SHA256). Resolve business number → reconcile guest by phone → find-or-create conversation → store message → hand off async.
Email (campaigns): Resend webhooks (Svix-style signatures) and Postmark webhooks for delivery, open, bounce, complaint, inbound replies.
PMS: Mews webhooks at POST /webhooks/pms/mews/{integrationId}, HMAC-SHA256 validated, processed async after immediate 200.

Every webhook: verify signature → ack fast → process in goroutine.

Throughput Architecture

The intake path is cheap: verify + persist + return 200, then process out-of-band. WhatsApp processing fans out to goroutines; PMS work lands in a durable queue that drains at a controlled rate.

A spike becomes a deeper queue, not dropped messages or webhook timeouts. Throughput is bounded by the downstream workers, not by webhook intake.

Message Queue Strategy

We use Go goroutines + worker pools for fan-out and a Postgres-backed queue table (pms_sync_queue, dispatch batches) where durability across restarts matters.

Postgres-as-a-queue gives us:

Transactional enqueue
Visibility into stuck jobs via plain SQL
Zero new infra to operate

When a single Postgres queue stops keeping up, that's the signal to introduce a real broker.

Message Persistence

For durable paths, queue rows track state: PENDING → PROCESSING → COMPLETED/FAILED with attempt counts and nextRetryAt. A crash mid-process leaves a retryable row.

Email dispatch uses idempotency keys (batchID:contactID) so retries can't double-send.

AI Concierge: API-Based, Multi-Provider, Custom ReAct

How It Works

API-based, multi-provider. The default LLM is Gemini (gemini-2.0-flash-exp), with OpenAI, DeepSeek, and Azure OpenAI selectable via an LLM factory.

It's a custom ReAct loop with a router and specialist agents. A router LLM classifies guest intent and delegates to one of five specialists:

Booking
Services
Property Info
Atlas/knowledge
Catch-all Concierge

Each specialist runs a bounded ReAct loop (3–4 turns), binding tools and executing them against the Go backend over HTTP.

How AI Knows Property-Specific Info

RAG over Qdrant. Each property's documents (CSVs, text, markdown) are chunked, embedded with OpenAI text-embedding-3-small (1536-dim), and stored in the stayzr_knowledge collection with payload indexes on organizationId, propertyId, category, and fileName.

Retrieval is scoped to the property (an exact propertyId match, or org-wide docs with a null property). Query enrichment anchors vague questions ("what's nearby?") to the property's actual city before embedding. Multi-pass search (strict threshold 0.3 → relaxed 0.15 → top-k fallback) ensures near-misses still return results.

Results carry a matchTier tag so downstream logic knows how much to trust them.
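The tiering logic is easy to show in isolation. The sketch below is in Go for consistency with the rest of this post, although the agents service itself is Python, and it filters an in-memory candidate list instead of issuing separate Qdrant queries. The thresholds and the matchTier idea come from the text; the `Hit` type and the tier labels "strict", "relaxed", and "fallback" are assumptions.

```go
package main

import (
	"math"
	"sort"
)

// Hit is a hypothetical search result with a similarity score.
type Hit struct {
	Text      string
	Score     float64
	MatchTier string
}

// MultiPass returns up to k hits from the first tier that yields anything:
// score >= 0.3, then score >= 0.15, then plain top-k. Each returned hit is
// tagged with the tier that admitted it. The input slice is not modified.
func MultiPass(candidates []Hit, k int) []Hit {
	sorted := append([]Hit(nil), candidates...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Score > sorted[j].Score
	})

	passes := []struct {
		tier     string
		minScore float64
	}{
		{"strict", 0.3},
		{"relaxed", 0.15},
		{"fallback", math.Inf(-1)}, // no threshold: plain top-k
	}

	for _, p := range passes {
		var out []Hit
		for _, h := range sorted {
			// Sorted descending, so the first miss ends this pass.
			if h.Score < p.minScore || len(out) == k {
				break
			}
			h.MatchTier = p.tier
			out = append(out, h)
		}
		if len(out) > 0 {
			return out
		}
	}
	return nil
}
```

Downstream code can then treat a "fallback" hit as a hint at best, while a "strict" hit is safe to quote.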

Escalation When AI Doesn't Know

An explicit escalation score sums these signals:

Urgent/frustrated/negative sentiment
All-tools-failed
Low confidence (<0.4)
Explicit "I want human/manager" requests

Score ≥0.7 → escalate, 0.3–0.7 → monitor, else none.
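A small sketch shows how the score and the bands fit together. The 0.7 and 0.3 thresholds and the 0.4 confidence cutoff come from the text; the individual signal weights are invented for illustration, and the sketch is in Go for consistency even though the agents service is Python.

```go
package main

// Signals collects the inputs named above for one guest turn.
type Signals struct {
	NegativeSentiment bool    // urgent, frustrated, or negative tone
	AllToolsFailed    bool    // every tool call in the turn failed
	HumanRequested    bool    // guest explicitly asked for a person
	Confidence        float64 // model confidence in [0, 1]
}

// Hypothetical weights; the real ones are not given in the article.
const (
	weightSentiment     = 0.3
	weightToolsFailed   = 0.3
	weightLowConfidence = 0.2
	weightHumanRequest  = 0.7
)

// Score sums the weights of the signals that fired, capped at 1.
func Score(s Signals) float64 {
	score := 0.0
	if s.NegativeSentiment {
		score += weightSentiment
	}
	if s.AllToolsFailed {
		score += weightToolsFailed
	}
	if s.Confidence < 0.4 {
		score += weightLowConfidence
	}
	if s.HumanRequested {
		score += weightHumanRequest
	}
	if score > 1 {
		score = 1
	}
	return score
}

type Action string

const (
	Escalate Action = "escalate"
	Monitor  Action = "monitor"
	None     Action = "none"
)

// Decide maps a score onto the three bands. Treating exactly 0.3 as
// "monitor" is an assumption about the lower boundary.
func Decide(score float64) Action {
	switch {
	case score >= 0.7:
		return Escalate
	case score >= 0.3:
		return Monitor
	default:
		return None
	}
}
```

With these example weights, an explicit request for a human is enough on its own to escalate, while a single soft signal only reaches the monitor band.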

On hard failure, the specialist returns a human-handoff line rather than hallucinating. Anti-hallucination guardrails detect ungrounded details, and a duplicate-reply suppressor stops the bot from parroting the same answer twice without a new tool result.

Prompt Engineering Only

100% prompt engineering — no fine-tuning. Every prompt is a builder function (buildRouterPrompt, buildBookingPrompt) injecting dynamic context: property basics, amenities, policies, local time, recent guest memory, active booking state.

For a multi-tenant product where every property's facts differ, RAG + structured prompts beats fine-tuning.

Scalability & Performance

Current Architecture

Go services are mostly stateless (state in Postgres/Redis), so horizontal scaling is available (run N backend processes behind Nginx). Today we're vertically scaled with queue-absorbs-spike behavior.

Caching Strategy

Redis for:

Org data (5 min TTL)
User permissions (15 min)
Session validation (5 min)
Sliding-window rate-limit counters (1 min)

Pool sized at 100 connections. PMS/business data in Postgres, not Redis-cached.

WhatsApp Rate Limiting

The token-bucket limiter pattern that paces requests is already in the codebase for the Mews client, and the email dispatch worker has a per-minute global rate limit (default 60 RPM). WhatsApp sending can adopt the same pattern with a different ceiling when volume justifies it.

Observability

OpenTelemetry end-to-end. OTel collector ships traces/logs to SigNoz and Oneuptime (dual export).

Prometheus scrapes node/postgres/redis/nginx/cadvisor every 15s with Alertmanager rules. Blackbox exporter probes public health endpoints (api.stayzr.com/health, marketing site, admin).

The Go backend is instrumented with the OTel SDK and an OTLP HTTP exporter. Structured logs (zap in Go) flow into the OTel/SigNoz pipeline, and PM2 log files are parsed by the collector's filelog receiver.

Security & Privacy

Webhook Authentication

Every inbound webhook is signature-verified:

WhatsApp/Meta via X-Hub-Signature-256 (HMAC-SHA256)
Mews via HMAC-SHA256
Resend via Svix-style HMAC over id.timestamp.body
Postmark via server token/signature
Agents service via X-Agent-Signature HMAC + agent-ID allowlist

No unauthenticated webhook endpoints.

Encryption

In transit: TLS 1.2/1.3 everywhere via Nginx + Let's Encrypt, strong cipher suites, HSTS.

At rest: Disk/storage-level encryption. Application-level encryption for OAuth tokens via secure token codec.

PCI Scope

We don't store card data. Payment provider configs exist (Stripe, Razorpay), and the integration model lets the processor handle cards, so we stay out of PCI scope by never touching PANs.

Lessons Learned

Biggest Technical Wins

Postgres-as-a-queue works, until it doesn't. When it stops keeping up, that's when you introduce RabbitMQ or Kafka. Not before.
PMS APIs are idiosyncratic: they're not clean REST. Expect POST-only endpoints, body auth, partial webhooks, and differing data models. Build for reconciliation and conflict from the start.
AI latency is hundreds of milliseconds to seconds: if you're calling an external LLM plus RAG plus tools, optimize for correctness rather than making speed claims you can't back up.
Documentation matters: keep the README aligned with the actual stack. Documentation debt is invisible until someone reads it and the map doesn't match the territory.

What Works Well

Goroutines for I/O-bound work — in-process worker pools without external broker
RAG + structured prompts — beats fine-tuning for multi-tenant with different property facts
Event-driven invalidation — webhooks trigger re-sync, no cache-invalidation races
Idempotency keys — prevent double-send on retries

What's Next

PMS Integrations

Opera and Cloudbeds are next (config scaffolding exists), and the service-layer abstraction should make a second provider straightforward.

AI Features

Multilingual support is partially in place: the agent runtime handles locales, and migrations canonicalize guest languages. The agent runtime's tool-using design makes additional modalities feasible.

Scale

The architecture supports horizontal scaling (stateless Go services, state in Postgres/Redis). The next infrastructure investment is adding redundancy as we scale beyond current design partners.

For Developers Building Similar Systems

If you're building in hospitality tech or B2B SaaS with:

PMS/hotel integrations
High-throughput webhook handling
AI concierge with property-specific knowledge
Multi-tenant architecture

I'm happy to share patterns. Drop a comment or reach out.

Check Out Stayzr (If You're a Hotel Operator)

We're actively onboarding design partners. If you're a hotel operator drowning in guest messages, manual back-office work, or B2B travel agent requests, I'd love to show you what's possible.

30-day free trial, no strings attached.

👉 Visit Stayzr

If you found this useful, I'm planning more deep-dives on:

Go concurrency patterns for high-throughput webhook systems
PMS transformer abstraction (code walkthrough)
RAG pipeline for property-specific knowledge (Qdrant + multi-pass search)

Drop a comment if you want to see any of these.
