A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat, and the answer comes back grounded in live inventory and your return policy, cited and accurate.
This is not three separate AI projects. It is one unified platform, and this article shows you how to build it on GCP.
## The Three Layers of an AI-Native Retail Platform
Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:
| Layer | What It Does | GCP Services |
|---|---|---|
| Real-Time Personalization | Surfaces relevant products from millions of SKUs in < 120ms | Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run |
| Multi-Agent Operations | Coordinates inventory, pricing, supplier, and customer agents in parallel | Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run |
| Agentic RAG | Answers complex queries grounded in live data + policy docs | Vertex AI Search, Gemini, BigQuery (as a live tool) |
The key insight is that all three layers share the same data backbone: BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.
## Unified Architecture Overview
```
+------------------------------------------------------------------+
|                     FRONTEND / API GATEWAY                       |
+-----------+------------------+-------------------+---------------+
            |                  |                   |
    +-------v-------+ +--------v--------+ +--------v----------+
    | PERSONALI-    | |  MULTI-AGENT    | |   AGENTIC RAG     |
    |   ZATION      | |  ORCHESTRATOR   | |  (Customer Q&A)   |
    |   ENGINE      | |  (Gemini 1.5)   | |  (Gemini +        |
    |  (Cloud Run)  | |  (Vertex AI     | |   Vertex Search)  |
    +-------+-------+ |   Reasoning)    | +--------+----------+
            |         +--------+--------+          |
            |                  |                   |
            +------------------v-------------------+
                               |
              +----------------v----------------+
              |      GOOGLE CLOUD PUB/SUB       |
              |      (Shared Event Spine)       |
              +----+-----------+-----------+----+
                   |           |           |
           +-------v---+ +-----v----+ +---v-----------+
           | Dataflow  | |Specialist| |  Vertex AI    |
           | Streaming | |  Agents  | | Search Index  |
           +-------+---+ +-----+----+ +---+-----------+
                   |           |           |
                +--v-----------v-----------v--+
                |          BIGQUERY           |
                | (Shared Operational Store)  |
                +-----------------------------+
```
## Layer 1: Real-Time Personalization Engine
### The Core Problem
Daily batch recommendations ignore the most powerful signal available: what the user is doing right now. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.
**Design principle:** Personalization is a retrieval problem. Given a user and their context right now, find the items most likely to convert, in under 120ms.
### The Six-Stage Pipeline
**Stage 1: Event Capture (Pub/Sub)**
Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget: it does not wait for a response.
```json
{
  "event_type": "CART_ADD",
  "user_id": "u_8821",
  "sku_id": "SKU-4471",
  "session_id": "s_992abc",
  "ts": "2026-03-22T14:03:11Z",
  "context": { "device": "mobile", "location": "Atlanta, GA" }
}
```
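On the client side, the fire-and-forget pattern amounts to building this payload and publishing without awaiting the result. A minimal sketch of the payload builder (the project and topic names in the commented publish call are hypothetical):

```python
import json
from datetime import datetime, timezone

def build_event(event_type: str, user_id: str, sku_id: str,
                session_id: str, context: dict) -> bytes:
    """Serialize a clickstream event into a Pub/Sub message payload."""
    event = {
        "event_type": event_type,
        "user_id": user_id,
        "sku_id": sku_id,
        "session_id": session_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "context": context,
    }
    return json.dumps(event).encode("utf-8")

# Fire-and-forget publish: publish() returns a future that the request
# path deliberately does not await (names below are illustrative).
# publisher = pubsub_v1.PublisherClient()
# topic = publisher.topic_path("my-project", "clickstream-events")
# publisher.publish(topic, build_event("CART_ADD", "u_8821", "SKU-4471",
#                                      "s_992abc", {"device": "mobile"}))
```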
**Stage 2: Stream Enrichment (Dataflow)**
A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:
- Session feature update → Vertex AI Feature Store (< 5s latency)
- Interaction log → BigQuery (for offline model training)
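In a real deployment this logic lives inside a Beam `DoFn`; stripped of the Beam scaffolding, the per-event transform looks roughly like this (field names are assumptions, not a fixed schema):

```python
def enrich_event(event: dict, item_catalog: dict) -> tuple[dict, dict]:
    """Join a raw clickstream event with item metadata and emit the two
    outputs of the streaming stage: a session-feature update and a
    denormalized interaction-log row for BigQuery."""
    meta = item_catalog.get(event["sku_id"], {})
    feature_update = {
        "entity_id": event["user_id"],
        "last_event_type": event["event_type"],
        "last_category": meta.get("category", "unknown"),
        "last_price_band": meta.get("price_band", "unknown"),
    }
    log_row = {**event, **meta}  # wide row used later for model training
    return feature_update, log_row

catalog = {"SKU-4471": {"category": "footwear", "price_band": "mid"}}
update, row = enrich_event(
    {"event_type": "CART_ADD", "user_id": "u_8821", "sku_id": "SKU-4471"},
    catalog,
)
```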
**Stage 3: Feature Assembly (Vertex AI Feature Store)**
At query time, three feature groups are fetched in a single low-latency call:
```python
feature_store_client.read_feature_values(
    entity_type="user",
    entity_ids=[user_id],
    feature_selector={
        "id_matcher": {
            "ids": ["purchase_history", "session_clicks", "device_type", "location"]
        }
    }
)
```
**Stage 4: ANN Retrieval (Vertex AI Matching Engine)**
The assembled user context vector is submitted to Matching Engine, Google's managed ANN index, which returns the top 50 candidate SKUs from a catalog of millions in under 10ms.
```python
response = index_endpoint.find_neighbors(
    deployed_index_id="retail_item_embeddings",
    queries=[user_context_vector],
    num_neighbors=50
)
```
**Under the hood:** Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.
**Stage 5: Re-Ranking (Vertex AI Prediction)**
A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:
- Current inventory level
- Promotional pricing flag
- User's price sensitivity segment
- Real-time trend score
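The re-ranking step is a scoring function over those signals. A toy sketch of the idea (the weights and field names here are illustrative, not the deployed model, which would be a trained Vertex AI Prediction endpoint):

```python
def rerank(candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Re-score ANN candidates with business signals the embedding
    index cannot see. Weights are hypothetical, not tuned."""
    def score(c: dict) -> float:
        s = c["ann_score"]
        s += 0.2 if c.get("on_promo") else 0.0     # promotional pricing flag
        s += 0.1 * c.get("trend_score", 0.0)       # real-time trend score
        if c.get("inventory", 0) < 5:              # soft-demote low stock
            s -= 0.3
        return s
    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    {"sku": "A", "ann_score": 0.90, "inventory": 2},
    {"sku": "B", "ann_score": 0.85, "inventory": 40, "on_promo": True},
]
ranked = rerank(candidates, top_k=2)  # B's promo flag outranks A's raw score
```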
**Stage 6: Serve (Cloud Run)**
Top 10 results plus display metadata are returned to the frontend. End-to-end: < 120ms at p99.
### Handling Cold Start
| Scenario | Strategy |
|---|---|
| New user (no history) | Serve contextual top-trending items by device + time + location |
| New item (no interactions) | Content-based embedding from product description + image on ingestion |
| After first click | Session features kick in within 5 seconds |
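The routing between the cold-start fallback and the normal retrieval path can be sketched in a few lines (the lookup keys and the `ann_retrieve` stub are assumptions standing in for the pipeline above):

```python
def ann_retrieve(user_id: str, session_clicks: list) -> list[str]:
    return ["SKU-PERSONALIZED"]  # stand-in for the Matching Engine call

def recommend(user_id: str, history: list, session_clicks: list,
              trending_by_context: dict, context: dict) -> list[str]:
    """With no purchase history and no session signal yet, fall back to
    contextual trending items; otherwise use personalized retrieval."""
    if not history and not session_clicks:
        key = (context["device"], context["daypart"])
        return trending_by_context.get(key, trending_by_context["default"])
    return ann_retrieve(user_id, session_clicks)

trending = {
    ("mobile", "evening"): ["SKU-100", "SKU-101"],
    "default": ["SKU-001"],
}
cold = recommend("u_new", [], [], trending,
                 {"device": "mobile", "daypart": "evening"})
```

As soon as the first click lands, `session_clicks` is non-empty and the same call switches to the personalized path.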
## Layer 2: Multi-Agent Operations
### The Core Problem
A single LLM handling all retail operations hits three walls: context overload, sequential latency, and unmaintainable prompts. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context, reasoning quality degrades.
**Design principle:** Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.
### Agent Architecture
```
          Operator / System Trigger
                     |
                     v
    +----------------------------------+
    |        ORCHESTRATOR AGENT        |
    |         Gemini 1.5 Pro           |
    |   Vertex AI Reasoning Engine     |
    |   - Decomposes tasks             |
    |   - Routes to specialists        |
    |   - Synthesizes final response   |
    +-----+----------+----------+-----+
          |  Pub/Sub |          |
          v          v          v
+---------+ +--------+ +----------+ +----------+
|Inventory| |Pricing | |Supplier  | |Customer  |
|Agent    | |Agent   | |Agent     | |Agent     |
|BigQuery | |BQ ML   | |Vertex AI | |Agentic   |
|         | |        | |Search    | |RAG <------- Layer 3
+---------+ +--------+ +----------+ +----------+
```
Notice: the Customer Agent IS Layer 3. Agentic RAG is not separate; it is the intelligence layer of the Customer Agent. This is where the three layers connect.
### A Reorder Request, Traced End-to-End
**Input:** "Should we reorder SKU-991?"
**Step 1 (Decompose):** Orchestrator identifies three parallel sub-tasks.
```python
tasks = orchestrator.decompose(query)
# → [
#   {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#   {"agent": "supplier", "task": "get_eta_and_cost", "sku": "SKU-991"},
#   {"agent": "pricing", "task": "get_reorder_cost", "sku": "SKU-991"}
# ]
```
**Step 2 (Dispatch):** All three tasks are published to Pub/Sub simultaneously.
**Step 3 (Execute in Parallel):** Each Cloud Run agent handles its task independently:
```python
# Inventory Agent
stock = bq_client.query("""
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = 'SKU-991' AND store_id = 'DC-ATL'
""").result()

# Pricing Agent (BigQuery ML) -- ML.PREDICT is a table function,
# so it belongs in the FROM clause
reorder_cost = bq_client.query("""
    SELECT * FROM ML.PREDICT(MODEL `retail.pricing_model`,
      (SELECT * FROM pricing_signals WHERE sku_id = 'SKU-991'))
""").result()
```
**Step 4 (Synthesize):**
```
Orchestrator → "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days.
                Current stock: 8 units (below reorder threshold of 15)."
```
Total time is the slowest agent's latency, not the sum of all three.
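The max-not-sum property is easy to demonstrate with a local stand-in for the Pub/Sub fan-out; here `ThreadPoolExecutor` plays the role of the bus, and the agent functions and payloads are illustrative stubs:

```python
from concurrent.futures import ThreadPoolExecutor

def inventory_agent(sku): return {"agent": "inventory", "stock": 8}
def supplier_agent(sku):  return {"agent": "supplier", "eta_days": 3, "cost": 4.20}
def pricing_agent(sku):   return {"agent": "pricing", "reorder_cost": 210.0}

def dispatch_parallel(sku: str) -> dict:
    """Fan out one SKU to all specialist agents concurrently and
    collect results keyed by agent name."""
    agents = [inventory_agent, supplier_agent, pricing_agent]
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = pool.map(lambda fn: fn(sku), agents)
    return {r["agent"]: r for r in results}

results = dispatch_parallel("SKU-991")
```

If each stub instead slept for one second, the whole dispatch would still finish in about one second rather than three, which is exactly what the Pub/Sub fan-out buys at production scale.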
### The Pub/Sub Design: Why It Matters
Three properties you get for free:
- Loose coupling: agents have no direct dependency on each other, only on topic names
- Fault tolerance: if an agent crashes, the message is retained and redelivered on recovery
- Independent scaling: each Cloud Run agent scales on its own Pub/Sub queue depth
### Shared Memory: The `agent_decision_log` Table
Every orchestrated request is fully logged:
```sql
CREATE TABLE retail.agent_decision_log (
  request_id      STRING,
  ts              TIMESTAMP,
  agent_called    STRING,
  tools_used      ARRAY<STRING>,
  input_payload   JSON,
  output_payload  JSON,
  latency_ms      INT64,
  confidence      FLOAT64
);
```
This table powers weekly evaluation reports and feeds back into model fine-tuning: your audit trail is also your training dataset.
## Layer 3: Agentic RAG for Retail Knowledge
### The Core Problem
Standard RAG (embed query → retrieve chunks → generate) fails in retail because:
- A single customer question often spans multiple knowledge domains (policy + inventory + product specs)
- Inventory data goes stale in minutes; you cannot index it as static documents
- Retrieval confidence varies, and a system that cannot detect low-confidence answers will hallucinate
**Design principle:** RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.
### Three Retrieval Sources
**1. Policy & Compliance Index (Vertex AI Search)**
Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).
**BM25 matters here:** product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.
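One common way to combine a dense (semantic) ranking with a sparse (BM25) ranking is reciprocal rank fusion. Vertex AI Search does its own fusion internally; this toy RRF sketch, with made-up doc ids, just shows why the combination helps:

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked doc-id lists: each doc
    scores 1/(k + rank) per list it appears in, summed across lists."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_warranty", "doc_returns", "doc_sizing"]   # semantic hits
sparse = ["doc_part_4471", "doc_returns"]                # BM25 nails the part number
fused  = rrf_fuse(dense, sparse)
```

`doc_returns` wins because both retrievers rank it; the exact-match part-number doc still surfaces even though the semantic side missed it entirely.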
**2. Product Catalog Index (Vertex AI Search)**
Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so "waterproof jacket similar to this one" works.
**3. Live Operational Data (BigQuery as a Tool)**
Inventory levels, order status, real-time pricing: not indexed as documents but called as a live tool. This is the key architectural decision that prevents stale answers.
```python
tools = [
    VertexAISearchTool(index="retail_policy_index"),
    VertexAISearchTool(index="retail_product_index"),
    BigQueryTool(query_template=INVENTORY_QUERY)  # live call, not indexed
]
```
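The routing decision behind that tool list can be sketched per sub-query. In production the LLM itself chooses the tool; this keyword heuristic is only a stand-in to make the routing policy concrete (keyword lists and tool names are illustrative):

```python
LIVE_KEYWORDS = ("in stock", "inventory", "order status", "price", "available")
POLICY_KEYWORDS = ("return", "warranty", "policy", "bopis")

def route_sub_query(sub_query: str) -> str:
    """Pick the retrieval source for one sub-query. Anything touching
    live operational data must go to BigQuery, never to a static index."""
    q = sub_query.lower()
    if any(kw in q for kw in LIVE_KEYWORDS):
        return "bigquery_tool"
    if any(kw in q for kw in POLICY_KEYWORDS):
        return "retail_policy_index"
    return "retail_product_index"

route_sub_query("is the 40V battery in stock at Cumming, GA?")
```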
### Query Decomposition in Action
Customer query: "Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"
**Agent plan:**

```
Sub-query A → Policy Index:  "online purchase battery return policy in-store"
Sub-query B → BigQuery Tool: SELECT units_available
                             FROM inventory_snapshot
                             WHERE sku_id='SKU-4471' AND store='GA-CUMMING'

Validate Sub-query A: relevance score > 0.82 threshold ✓
Validate Sub-query B: live data, timestamp 2 minutes ago ✓
```

**Synthesized answer:**

```
"Yes: online purchases can be returned in-store within 90 days (Policy §3.2).
 The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA
 as of 14:07 EST today."
```
Every fact is cited. No hallucination. No "please check the website."
### The Self-Correction Loop
```python
MAX_RETRIES = 3

def answer_or_escalate(query: str, index_id: str):
    original_query = query
    for attempt in range(MAX_RETRIES):
        result = vertex_search.retrieve(query, index=index_id)
        if result.confidence_score >= THRESHOLD:
            return result
        # Reformulate: broaden scope, try synonyms, switch retrieval mode
        query = agent.reformulate(query, attempt)
    # After max retries: escalate to the human agent queue
    return escalate_to_human(original_query)
```
This loop means your system knows what it does not know, and routes accordingly.
## How the Three Layers Connect
The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:
```
1. Customer browses → Pub/Sub event → Personalization Engine
   surfaces relevant products (Layer 1)
2. Inventory drops below threshold → Pub/Sub alert →
   Orchestrator Agent dispatches reorder across 3 specialist
   agents in parallel (Layer 2)
3. Customer asks "Is this in stock?" → Customer Agent (Layer 2)
   → Agentic RAG (Layer 3) queries BigQuery live + policy index
   → grounded, cited answer in < 2s
4. All events → BigQuery agent_decision_log + interaction_log
   → weekly eval reports + model retraining for Layers 1 & 3
```
The feedback loop is the platform. Every interaction trains the next version of every model.
## Observability: One Dashboard, Three Layers
All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:
| Metric | Layer | Source Table |
|---|---|---|
| Recommendation CTR by segment | Personalization | interaction_log |
| ANN retrieval latency p99 | Personalization | serving_metrics |
| Agent task parallelism ratio | Multi-Agent | agent_decision_log |
| Reorder decision accuracy | Multi-Agent | agent_decision_log |
| RAG retrieval precision@5 | Agentic RAG | agent_query_log |
| Re-query rate | Agentic RAG | agent_query_log |
When retrieval precision drops, you know before customers notice.
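The retrieval-precision metric in that table is cheap to compute once each `agent_query_log` row stores the retrieved doc ids alongside human-labeled relevant ones (the column pairing is an assumption about your logging schema):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved docs that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

# Two relevant docs in the top five → precision@5 of 0.4
precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d9"})
```

Averaged over a week of logged queries, a drop in this number is the early-warning signal the dashboard is watching for.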
## Where to Start
Don't try to ship all three layers at once. Here is a proven sequencing:
**Weeks 1–4: Lay the data foundation**
- Set up BigQuery tables: `inventory_snapshot`, `interaction_log`, `agent_decision_log`
- Stand up Pub/Sub topics and the Dataflow streaming job
- This infrastructure is shared by all three layers; do it once, use it everywhere
**Weeks 5–8: Ship Personalization (Layer 1)**
- Train a two-tower model on BigQuery interaction history
- Index item embeddings into Vertex AI Matching Engine
- Wire up Cloud Run serving API
- Measure: recommendation CTR vs. batch baseline
**Weeks 9–12: Add Multi-Agent Ops (Layer 2)**
- Start with two agents: Inventory + Pricing
- Orchestrator on Vertex AI Reasoning Engine
- Add Supplier Agent once the first two are stable
**Weeks 13–16: Add Agentic RAG (Layer 3)**
- Index return policy + product catalog into Vertex AI Search
- Wire the BigQuery inventory tool into the agent
- Deploy as the Customer Agent inside your multi-agent system
The Pub/Sub bus means each new layer plugs in without touching what already works.
## Key Takeaways
- **Share infrastructure, not code.** BigQuery and Pub/Sub serve all three layers. Build them once.
- **The Customer Agent IS Agentic RAG.** Don't build these as separate projects.
- **The `agent_decision_log` is your most valuable table.** It is your audit trail, your eval dataset, and your retraining signal.
- **Personalization cold start is solved by context, not history.** Device + time + location gets you 80% of the way there for new users.
- **Hybrid retrieval beats pure vector search for retail.** BM25 handles part numbers and model codes that semantic search misses.