
Building an AI-Native Retail Platform on GCP: Personalization + Multi-Agent Ops + Agentic RAG as One Unified Stack

A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat β€” the answer comes back grounded in live inventory and your return policy, cited and accurate.

This is not three separate AI projects. It is one unified platform β€” and this article shows you how to build it on GCP.


πŸ—οΈ The Three Layers of an AI-Native Retail Platform

Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:

| Layer | What It Does | GCP Services |
|---|---|---|
| Real-Time Personalization | Surfaces relevant products from millions of SKUs in < 120ms | Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run |
| Multi-Agent Operations | Coordinates inventory, pricing, supplier, and customer agents in parallel | Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run |
| Agentic RAG | Answers complex queries grounded in live data + policy docs | Vertex AI Search, Gemini, BigQuery (as a live tool) |

The key insight: all three layers share the same data backbone β€” BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.


πŸ“ Unified Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        FRONTEND / API GATEWAY                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                  β”‚                   β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ PERSONALI-   β”‚  β”‚  MULTI-AGENT   β”‚  β”‚   AGENTIC RAG    β”‚
    β”‚ ZATION       β”‚  β”‚  ORCHESTRATOR  β”‚  β”‚   (Customer Q&A) β”‚
    β”‚ ENGINE       β”‚  β”‚  (Gemini 1.5)  β”‚  β”‚   (Gemini +      β”‚
    β”‚ (Cloud Run)  β”‚  β”‚  (Vertex AI    β”‚  β”‚   Vertex Search) β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β”‚   Reasoning)   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
            β”‚                  β”‚                  β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚         GOOGLE CLOUD PUB/SUB     β”‚
              β”‚         (Shared Event Spine)      β”‚
              β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚          β”‚          β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”Œβ”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Dataflow  β”‚ β”‚Specialistβ”‚ β”‚ Vertex AI     β”‚
          β”‚ Streaming β”‚ β”‚ Agents   β”‚ β”‚ Search Index  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚          β”‚          β”‚
              β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”
              β”‚          BIGQUERY            β”‚
              β”‚   (Shared Operational Store) β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

🎯 Layer 1: Real-Time Personalization Engine

The Core Problem

Daily batch recommendations ignore the most powerful signal available: what the user is doing right now. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.

Design principle: Personalization is a retrieval problem. Given a user and their context right now, find the items most likely to convert β€” in under 120ms.

The Six-Stage Pipeline

Stage 1 β€” Event Capture (Pub/Sub)

Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget β€” it does not wait for a response.

```json
{
  "event_type": "CART_ADD",
  "user_id": "u_8821",
  "sku_id": "SKU-4471",
  "session_id": "s_992abc",
  "ts": "2026-03-22T14:03:11Z",
  "context": { "device": "mobile", "location": "Atlanta, GA" }
}
```
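
Building that payload is the only work done on the request path; with `google-cloud-pubsub`, the publish itself returns a future the caller never waits on. A minimal sketch of the payload builder (field names follow the event above; the topic name in the comment is illustrative):

```python
import json
from datetime import datetime, timezone

def encode_event(event_type: str, user_id: str, sku_id: str,
                 session_id: str, **context) -> bytes:
    """Build the Pub/Sub wire payload for one interaction event."""
    event = {
        "event_type": event_type,
        "user_id": user_id,
        "sku_id": sku_id,
        "session_id": session_id,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "context": context,
    }
    return json.dumps(event).encode("utf-8")

# With the google-cloud-pubsub client, the fire-and-forget publish is then:
# publisher.publish(topic_path, data=encode_event(
#     "CART_ADD", "u_8821", "SKU-4471", "s_992abc", device="mobile"))
```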

Stage 2 β€” Stream Enrichment (Dataflow)

A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:

  • Session feature update β†’ Vertex AI Feature Store (< 5s latency)
  • Interaction log β†’ BigQuery (for offline model training)
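
The per-event logic inside that job can be sketched as a pure function, which in Beam would live inside a DoFn (the metadata schema here is illustrative):

```python
def enrich_event(event: dict, item_metadata: dict) -> tuple[dict, dict]:
    """Join a raw event with catalog metadata; emit both outputs.

    item_metadata maps sku_id -> {"category": ..., "price": ...}
    (an assumed schema, loaded from BigQuery as a side input).
    """
    meta = item_metadata.get(event["sku_id"], {})
    feature_update = {  # -> Vertex AI Feature Store
        "entity_id": event["user_id"],
        "last_event_type": event["event_type"],
        "last_category": meta.get("category"),
        "session_id": event["session_id"],
    }
    log_row = {**event, **meta}  # -> BigQuery interaction_log
    return feature_update, log_row
```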

Stage 3 β€” Feature Assembly (Vertex AI Feature Store)

At query time, three feature groups are fetched in a single low-latency call:

```python
feature_store_client.read_feature_values(
    entity_type="user",
    entity_ids=[user_id],
    feature_selector={
        "id_matcher": {
            "ids": ["purchase_history", "session_clicks", "device_type", "location"]
        }
    }
)
```

Stage 4 β€” ANN Retrieval (Vertex AI Matching Engine)

The assembled user context vector is submitted to Matching Engine β€” Google's managed ANN index. It returns the top 50 candidate SKUs from a catalog of millions in under 10ms.

```python
response = index_endpoint.find_neighbors(
    deployed_index_id="retail_item_embeddings",
    queries=[user_context_vector],
    num_neighbors=50
)
```

Under the hood: Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.
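
The effect of that pre-filter can be illustrated with a toy, exact nearest-neighbor search over a handful of vectors (the managed index does this at scale inside ScaNN; this sketch only shows the filtering principle):

```python
import math

def retrieve(query: list, items: dict, in_stock: set, k: int = 2) -> list:
    """Toy nearest-neighbor search with a pre-filter on stock status.

    items maps sku -> embedding vector; in_stock is the set of
    available skus. Out-of-stock items never enter the candidate set.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    candidates = {sku: v for sku, v in items.items() if sku in in_stock}
    return sorted(candidates, key=lambda s: cosine(query, candidates[s]),
                  reverse=True)[:k]
```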

Stage 5 β€” Re-Ranking (Vertex AI Prediction)

A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:

  • Current inventory level
  • Promotional pricing flag
  • User's price sensitivity segment
  • Real-time trend score
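
A minimal sketch of that re-scoring step (the weights and the candidate schema are illustrative; in production this is a trained model served on Vertex AI Prediction, not hand-tuned rules):

```python
def rerank(candidates: list) -> list:
    """Re-score ANN candidates with signals the embedding cannot see.

    Each candidate dict: {"sku", "ann_score", "inventory",
    "on_promo", "trend"} (an assumed schema).
    """
    def score(c):
        s = c["ann_score"]
        s *= 0.0 if c["inventory"] == 0 else 1.0  # suppress out-of-stock
        s += 0.10 if c["on_promo"] else 0.0        # promotional boost
        s += 0.05 * c["trend"]                     # real-time trend signal
        return s
    return sorted(candidates, key=score, reverse=True)
```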

Stage 6 β€” Serve (Cloud Run)

Top 10 results + display metadata returned to the frontend. End-to-end: < 120ms at p99.

Handling Cold Start

| Scenario | Strategy |
|---|---|
| New user (no history) | Serve contextual top-trending items by device + time + location |
| New item (no interactions) | Content-based embedding from product description + image on ingestion |
| After first click | Session features kick in within 5 seconds |
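
The dispatch between these strategies is simple enough to sketch directly (the strategy names are illustrative labels for the paths described above):

```python
def choose_strategy(user_history_len: int, session_clicks: int) -> str:
    """Pick the recommendation path per the cold-start strategies above."""
    if session_clicks > 0:
        return "session_features"      # first click arrived: real-time features
    if user_history_len == 0:
        return "contextual_trending"   # new user: device + time + location
    return "personalized_ann"          # known user: full retrieval pipeline
```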

πŸ€– Layer 2: Multi-Agent Operations

The Core Problem

A single LLM handling all retail operations hits three walls: context overload, sequential latency, and unmaintainable prompts. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context β€” reasoning quality degrades.

Design principle: Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.

Agent Architecture

```
Operator / System Trigger
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ORCHESTRATOR AGENT             β”‚
β”‚  Gemini 1.5 Pro                 β”‚
β”‚  Vertex AI Reasoning Engine     β”‚
β”‚  - Decomposes tasks             β”‚
β”‚  - Routes to specialists        β”‚
β”‚  - Synthesizes final response   β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
     β”‚  Pub/Sub β”‚          β”‚
     β–Ό          β–Ό          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Inventoryβ”‚ β”‚Pricing β”‚ β”‚Supplier  β”‚ β”‚Customer  β”‚
β”‚Agent    β”‚ β”‚Agent   β”‚ β”‚Agent     β”‚ β”‚Agent     β”‚
β”‚BigQuery β”‚ β”‚BQ ML   β”‚ β”‚Vertex AI β”‚ β”‚Agentic   β”‚
β”‚         β”‚ β”‚        β”‚ β”‚Search    β”‚ β”‚RAG ←────── Layer 3
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

Notice: the Customer Agent IS Layer 3 β€” Agentic RAG is not separate, it is the intelligence layer of the Customer Agent. This is where the three layers connect.

A Reorder Request β€” Traced End-to-End

Input: "Should we reorder SKU-991?"

Step 1 β€” Decompose: Orchestrator identifies three parallel sub-tasks.

```python
tasks = orchestrator.decompose(query)
# β†’ [
#     {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#     {"agent": "supplier",  "task": "get_eta_and_cost", "sku": "SKU-991"},
#     {"agent": "pricing",   "task": "get_reorder_cost", "sku": "SKU-991"}
# ]
```

Step 2 β€” Dispatch: All three tasks published to Pub/Sub simultaneously.

Step 3 β€” Execute in Parallel: Each Cloud Run agent handles its task independently:

```python
# Inventory Agent
stock = bq_client.query("""
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = 'SKU-991' AND store_id = 'DC-ATL'
""").result()

# Pricing Agent (BigQuery ML) -- ML.PREDICT is a table function
reorder_cost = bq_client.query("""
    SELECT * FROM ML.PREDICT(MODEL `retail.pricing_model`,
        (SELECT * FROM pricing_signals WHERE sku_id = 'SKU-991'))
""").result()
```

Step 4 β€” Synthesize:

```
Orchestrator β†’ "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days.
                Current stock: 8 units (below reorder threshold of 15)." βœ…
```

Total time equals the latency of the slowest agent, not the sum of all three.
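
That wall-clock property is easy to demonstrate locally with asyncio stand-ins for the three specialists (a toy sketch; the real agents are Cloud Run services reached over Pub/Sub):

```python
import asyncio
import time

async def agent(name: str, latency: float) -> str:
    """Stand-in for one specialist agent's round trip."""
    await asyncio.sleep(latency)
    return f"{name}: done"

async def orchestrate() -> tuple[list, float]:
    """Dispatch all three tasks concurrently and time the whole fan-out."""
    start = time.monotonic()
    results = await asyncio.gather(
        agent("inventory", 0.05),
        agent("supplier", 0.10),
        agent("pricing", 0.07),
    )
    return results, time.monotonic() - start

results, elapsed = asyncio.run(orchestrate())
# elapsed tracks the slowest agent (~0.10s), not the 0.22s sum
```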

The Pub/Sub Design β€” Why It Matters

Three properties you get for free:

  • Loose coupling: agents have no direct dependency on each other, only on topic names
  • Fault tolerance: if an agent crashes, the message is retained and redelivered on recovery
  • Independent scaling: each Cloud Run agent scales on its own Pub/Sub queue depth

Shared Memory: The agent_decision_log Table

Every orchestrated request is fully logged:

```sql
CREATE TABLE retail.agent_decision_log (
  request_id      STRING,
  ts              TIMESTAMP,
  agent_called    STRING,
  tools_used      ARRAY<STRING>,
  input_payload   JSON,
  output_payload  JSON,
  latency_ms      INT64,
  confidence      FLOAT64
);
```

This table powers weekly evaluation reports and feeds back into model fine-tuning β€” your audit trail is also your training dataset.
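
As a sketch of the evaluation side, a weekly report over this table might aggregate latency and confidence per agent (the aggregation choices here are illustrative; the schema is the one above):

```sql
-- Per-agent volume, tail latency, and average confidence, trailing 7 days
SELECT
  agent_called,
  COUNT(*)                                      AS calls,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)] AS latency_p95_ms,
  AVG(confidence)                               AS avg_confidence
FROM retail.agent_decision_log
WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY agent_called
ORDER BY latency_p95_ms DESC;
```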


πŸ“š Layer 3: Agentic RAG for Retail Knowledge

The Core Problem

Standard RAG (embed query β†’ retrieve chunks β†’ generate) fails in retail because:

  • A single customer question often spans multiple knowledge domains (policy + inventory + product specs)
  • Inventory data goes stale in minutes β€” you cannot index it as static documents
  • Retrieval confidence varies β€” a system that cannot detect low-confidence answers will hallucinate

Design principle: RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.

Three Retrieval Sources

1. Policy & Compliance Index (Vertex AI Search)

Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).

BM25 matters here: product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.
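
One common way to merge the two ranked lists is reciprocal rank fusion, sketched below as a local illustration (this is not a claim about Vertex AI Search's internal blending, just the standard RRF formula with the conventional k=60):

```python
def rrf(dense_ranking: list, sparse_ranking: list, k: int = 60) -> list:
    """Fuse a dense (semantic) and a sparse (BM25) ranking of doc ids.

    Each input list is ordered best-first; each appearance contributes
    1 / (k + rank) to the document's fused score.
    """
    scores: dict = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both lists, like a part number matched by BM25 and semantically by the dense index, floats to the top of the fused list.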

2. Product Catalog Index (Vertex AI Search)

Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so "waterproof jacket similar to this one" works.

3. Live Operational Data (BigQuery as a Tool)

Inventory levels, order status, real-time pricing β€” not indexed as documents but called as a live tool. This is the key architectural decision that prevents stale answers.

```python
tools = [
    VertexAISearchTool(index="retail_policy_index"),
    VertexAISearchTool(index="retail_product_index"),
    BigQueryTool(query_template=INVENTORY_QUERY)  # live call, not indexed
]
```
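
Behind a tool like that is just a parameterized query exposed as a plain callable. A minimal sketch (the `run_query` injection and the column names are illustrative; in production it would wrap `google.cloud.bigquery` with query parameters, never string formatting):

```python
INVENTORY_QUERY = """
    SELECT units_available, last_updated
    FROM retail.inventory_snapshot
    WHERE sku_id = @sku AND store_id = @store
"""

def make_inventory_tool(run_query):
    """Wrap a query runner so the agent sees a simple callable.

    run_query(sql, params) -> list[dict]; injected here so the
    tool can be exercised without a live BigQuery connection.
    """
    def check_inventory(sku: str, store: str) -> dict:
        rows = run_query(INVENTORY_QUERY, {"sku": sku, "store": store})
        return rows[0] if rows else {"units_available": 0, "last_updated": None}
    return check_inventory
```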

Query Decomposition in Action

Customer query: "Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"

```
Agent Plan:
  Sub-query A β†’ Policy Index: "online purchase battery return policy in-store"
  Sub-query B β†’ BigQuery Tool: SELECT units_available
                               FROM inventory_snapshot
                               WHERE sku_id='SKU-4471' AND store='GA-CUMMING'
```

Agent validates Sub-query A: relevance score > 0.82 threshold βœ…

Agent validates Sub-query B: live data, timestamp 2 minutes ago βœ…

Synthesized answer:

```
"Yes β€” online purchases can be returned in-store within 90 days (Policy Β§3.2).
The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA
as of 14:07 EST today."
```

Every fact is cited. No hallucination. No "please check the website."
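
The two validation steps in that trace can be sketched as simple guards (the relevance threshold follows the 0.82 value used above; the freshness window is an illustrative assumption):

```python
from datetime import datetime, timedelta, timezone

RELEVANCE_THRESHOLD = 0.82           # policy-index acceptance bar
MAX_STALENESS = timedelta(minutes=5) # assumed freshness window for live data

def validate_retrieval(relevance_score: float) -> bool:
    """Accept an index hit only if it clears the relevance threshold."""
    return relevance_score >= RELEVANCE_THRESHOLD

def validate_freshness(snapshot_ts: datetime) -> bool:
    """Accept live operational data only if the snapshot is recent."""
    return datetime.now(timezone.utc) - snapshot_ts <= MAX_STALENESS
```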

The Self-Correction Loop

```python
MAX_RETRIES = 3
THRESHOLD = 0.82        # minimum acceptable relevance score
original_query = query  # keep the untouched query for escalation

for attempt in range(MAX_RETRIES):
    result = vertex_search.retrieve(query, index=index_id)

    if result.confidence_score >= THRESHOLD:
        return result

    # Reformulate: broaden scope, try synonyms, switch retrieval mode
    query = agent.reformulate(query, attempt)

# After max retries: escalate to human agent queue
escalate_to_human(original_query)
```

This loop means your system knows what it does not know β€” and routes accordingly.


πŸ”— How the Three Layers Connect

The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:

```
1. Customer browses β†’ Pub/Sub event β†’ Personalization Engine
   surfaces relevant products (Layer 1)

2. Inventory drops below threshold β†’ Pub/Sub alert β†’
   Orchestrator Agent dispatches reorder across 3 specialist
   agents in parallel (Layer 2)

3. Customer asks: "Is this in stock?" β†’ Customer Agent (Layer 2)
   β†’ Agentic RAG (Layer 3) queries BigQuery live + policy index
   β†’ grounded, cited answer in < 2s

4. All events β†’ BigQuery agent_decision_log + interaction_log
   β†’ weekly eval reports + model retraining for Layers 1 & 3
```

The feedback loop is the platform. Every interaction trains the next version of every model.


πŸ“Š Observability β€” One Dashboard, Three Layers

All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:

| Metric | Layer | Source Table |
|---|---|---|
| Recommendation CTR by segment | Personalization | interaction_log |
| ANN retrieval latency p99 | Personalization | serving_metrics |
| Agent task parallelism ratio | Multi-Agent | agent_decision_log |
| Reorder decision accuracy | Multi-Agent | agent_decision_log |
| RAG retrieval precision@5 | Agentic RAG | agent_query_log |
| Re-query rate | Agentic RAG | agent_query_log |

When retrieval precision drops, you know before customers notice.


πŸš€ Where to Start

Don't try to ship all three layers at once. Here is a proven sequencing:

Week 1–4: Lay the data foundation

  • Set up BigQuery tables: inventory_snapshot, interaction_log, agent_decision_log
  • Stand up Pub/Sub topics and Dataflow streaming job
  • This infrastructure is shared by all three layers β€” do it once, use it everywhere

Week 5–8: Ship Personalization (Layer 1)

  • Train a two-tower model on BigQuery interaction history
  • Index item embeddings into Vertex AI Matching Engine
  • Wire up Cloud Run serving API
  • Measure: recommendation CTR vs. batch baseline

Week 9–12: Add Multi-Agent Ops (Layer 2)

  • Start with two agents: Inventory + Pricing
  • Orchestrator on Vertex AI Reasoning Engine
  • Add Supplier Agent once the first two are stable

Week 13–16: Add Agentic RAG (Layer 3)

  • Index return policy + product catalog into Vertex AI Search
  • Wire the BigQuery inventory tool into the agent
  • Deploy as the Customer Agent inside your multi-agent system

The Pub/Sub bus means each new layer plugs in without touching what already works.


πŸ’‘ Key Takeaways

  • Share infrastructure, not code. BigQuery and Pub/Sub serve all three layers. Build them once.
  • The Customer Agent IS Agentic RAG. Don't build these as separate projects.
  • The agent_decision_log is your most valuable table. It is your audit trail, your eval dataset, and your retraining signal.
  • Personalization cold start is solved by context, not history. Device + time + location gets you 80% of the way there for new users.
  • Hybrid retrieval beats pure vector search for retail. BM25 handles part numbers and model codes that semantic search misses.
