A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat, and the answer comes back grounded in live inventory and your return policy, cited and accurate.
This is not three separate AI projects. It is one unified platform, and this article shows you how to build it on GCP.
## The Three Layers of an AI-Native Retail Platform
Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:
| Layer | What It Does | GCP Services |
|---|---|---|
| Real-Time Personalization | Surfaces relevant products from millions of SKUs in < 120ms | Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run |
| Multi-Agent Operations | Coordinates inventory, pricing, supplier, and customer agents in parallel | Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run |
| Agentic RAG | Answers complex queries grounded in live data + policy docs | Vertex AI Search, Gemini, BigQuery (as a live tool) |
The key insight is that all three layers share the same data backbone: BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.
## Unified Architecture Overview
```
+------------------------------------------------------------------+
|                     FRONTEND / API GATEWAY                       |
+-----------+------------------+-------------------+---------------+
            |                  |                   |
    +-------v-------+ +--------v--------+ +--------v----------+
    | PERSONALI-    | |  MULTI-AGENT    | |   AGENTIC RAG     |
    |   ZATION      | |  ORCHESTRATOR   | |  (Customer Q&A)   |
    |   ENGINE      | |  (Gemini 1.5)   | |  (Gemini +        |
    |  (Cloud Run)  | |  (Vertex AI     | |   Vertex Search)  |
    +-------+-------+ |   Reasoning)    | +--------+----------+
            |         +--------+--------+          |
            |                  |                   |
            +------------------v-------------------+
                               |
              +----------------v----------------+
              |      GOOGLE CLOUD PUB/SUB       |
              |      (Shared Event Spine)       |
              +----+-----------+-----------+----+
                   |           |           |
           +-------v---+ +-----v----+ +---v-----------+
           | Dataflow  | |Specialist| |  Vertex AI    |
           | Streaming | |  Agents  | | Search Index  |
           +-------+---+ +-----+----+ +---+-----------+
                   |           |           |
                +--v-----------v-----------v--+
                |          BIGQUERY           |
                | (Shared Operational Store)  |
                +-----------------------------+
```
## Layer 1: Real-Time Personalization Engine
### The Core Problem
Daily batch recommendations ignore the most powerful signal available: what the user is doing right now. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.
**Design principle:** Personalization is a retrieval problem. Given a user and their context right now, find the items most likely to convert, in under 120ms.
### The Six-Stage Pipeline
**Stage 1: Event Capture (Pub/Sub)**
Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget: it does not wait for a response.
```json
{
  "event_type": "CART_ADD",
  "user_id": "u_8821",
  "sku_id": "SKU-4471",
  "session_id": "s_992abc",
  "ts": "2026-03-22T14:03:11Z",
  "context": { "device": "mobile", "location": "Atlanta, GA" }
}
```
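On the client side, the fire-and-forget pattern amounts to building this payload and publishing without awaiting the result. A minimal sketch of the payload builder (the project and topic names in the commented publish call are hypothetical):

```python
import json
from datetime import datetime, timezone

def build_event(event_type: str, user_id: str, sku_id: str,
                session_id: str, context: dict) -> bytes:
    """Serialize a clickstream event into a Pub/Sub message payload."""
    event = {
        "event_type": event_type,
        "user_id": user_id,
        "sku_id": sku_id,
        "session_id": session_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "context": context,
    }
    return json.dumps(event).encode("utf-8")

# Fire-and-forget publish: publish() returns a future that the request
# path deliberately does not await (names below are illustrative).
# publisher = pubsub_v1.PublisherClient()
# topic = publisher.topic_path("my-project", "clickstream-events")
# publisher.publish(topic, build_event("CART_ADD", "u_8821", "SKU-4471",
#                                      "s_992abc", {"device": "mobile"}))
```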
**Stage 2: Stream Enrichment (Dataflow)**
A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:
- Session feature update → Vertex AI Feature Store (< 5s latency)
- Interaction log → BigQuery (for offline model training)
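In a real deployment this logic lives inside a Beam `DoFn`; stripped of the Beam scaffolding, the per-event transform looks roughly like this (field names are assumptions, not a fixed schema):

```python
def enrich_event(event: dict, item_catalog: dict) -> tuple[dict, dict]:
    """Join a raw clickstream event with item metadata and emit the two
    outputs of the streaming stage: a session-feature update and a
    denormalized interaction-log row for BigQuery."""
    meta = item_catalog.get(event["sku_id"], {})
    feature_update = {
        "entity_id": event["user_id"],
        "last_event_type": event["event_type"],
        "last_category": meta.get("category", "unknown"),
        "last_price_band": meta.get("price_band", "unknown"),
    }
    log_row = {**event, **meta}  # wide row used later for model training
    return feature_update, log_row

catalog = {"SKU-4471": {"category": "footwear", "price_band": "mid"}}
update, row = enrich_event(
    {"event_type": "CART_ADD", "user_id": "u_8821", "sku_id": "SKU-4471"},
    catalog,
)
```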
**Stage 3: Feature Assembly (Vertex AI Feature Store)**
At query time, three feature groups are fetched in a single low-latency call:
```python
feature_store_client.read_feature_values(
    entity_type="user",
    entity_ids=[user_id],
    feature_selector={
        "id_matcher": {
            "ids": ["purchase_history", "session_clicks", "device_type", "location"]
        }
    }
)
```
**Stage 4: ANN Retrieval (Vertex AI Matching Engine)**
The assembled user context vector is submitted to Matching Engine, Google's managed ANN index, which returns the top 50 candidate SKUs from a catalog of millions in under 10ms.
```python
response = index_endpoint.find_neighbors(
    deployed_index_id="retail_item_embeddings",
    queries=[user_context_vector],
    num_neighbors=50
)
```
**Under the hood:** Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.
**Stage 5: Re-Ranking (Vertex AI Prediction)**
A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:
- Current inventory level
- Promotional pricing flag
- User's price sensitivity segment
- Real-time trend score
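The re-ranking step is a scoring function over those signals. A toy sketch of the idea (the weights and field names here are illustrative, not the deployed model, which would be a trained Vertex AI Prediction endpoint):

```python
def rerank(candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Re-score ANN candidates with business signals the embedding
    index cannot see. Weights are hypothetical, not tuned."""
    def score(c: dict) -> float:
        s = c["ann_score"]
        s += 0.2 if c.get("on_promo") else 0.0     # promotional pricing flag
        s += 0.1 * c.get("trend_score", 0.0)       # real-time trend score
        if c.get("inventory", 0) < 5:              # soft-demote low stock
            s -= 0.3
        return s
    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    {"sku": "A", "ann_score": 0.90, "inventory": 2},
    {"sku": "B", "ann_score": 0.85, "inventory": 40, "on_promo": True},
]
ranked = rerank(candidates, top_k=2)  # B's promo flag outranks A's raw score
```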
**Stage 6: Serve (Cloud Run)**
Top 10 results plus display metadata are returned to the frontend. End-to-end: < 120ms at p99.
### Handling Cold Start
| Scenario | Strategy |
|---|---|
| New user (no history) | Serve contextual top-trending items by device + time + location |
| New item (no interactions) | Content-based embedding from product description + image on ingestion |
| After first click | Session features kick in within 5 seconds |
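The routing between the cold-start fallback and the normal retrieval path can be sketched in a few lines (the lookup keys and the `ann_retrieve` stub are assumptions standing in for the pipeline above):

```python
def ann_retrieve(user_id: str, session_clicks: list) -> list[str]:
    return ["SKU-PERSONALIZED"]  # stand-in for the Matching Engine call

def recommend(user_id: str, history: list, session_clicks: list,
              trending_by_context: dict, context: dict) -> list[str]:
    """With no purchase history and no session signal yet, fall back to
    contextual trending items; otherwise use personalized retrieval."""
    if not history and not session_clicks:
        key = (context["device"], context["daypart"])
        return trending_by_context.get(key, trending_by_context["default"])
    return ann_retrieve(user_id, session_clicks)

trending = {
    ("mobile", "evening"): ["SKU-100", "SKU-101"],
    "default": ["SKU-001"],
}
cold = recommend("u_new", [], [], trending,
                 {"device": "mobile", "daypart": "evening"})
```

As soon as the first click lands, `session_clicks` is non-empty and the same call switches to the personalized path.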
## Layer 2: Multi-Agent Operations
### The Core Problem
A single LLM handling all retail operations hits three walls: context overload, sequential latency, and unmaintainable prompts. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context, reasoning quality degrades.
**Design principle:** Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.
### Agent Architecture
```
          Operator / System Trigger
                     |
                     v
    +----------------------------------+
    |        ORCHESTRATOR AGENT        |
    |         Gemini 1.5 Pro           |
    |   Vertex AI Reasoning Engine     |
    |   - Decomposes tasks             |
    |   - Routes to specialists        |
    |   - Synthesizes final response   |
    +-----+----------+----------+-----+
          |  Pub/Sub |          |
          v          v          v
+---------+ +--------+ +----------+ +----------+
|Inventory| |Pricing | |Supplier  | |Customer  |
|Agent    | |Agent   | |Agent     | |Agent     |
|BigQuery | |BQ ML   | |Vertex AI | |Agentic   |
|         | |        | |Search    | |RAG <------- Layer 3
+---------+ +--------+ +----------+ +----------+
```
Notice: the Customer Agent IS Layer 3. Agentic RAG is not separate; it is the intelligence layer of the Customer Agent. This is where the three layers connect.
### A Reorder Request, Traced End-to-End
**Input:** "Should we reorder SKU-991?"
**Step 1 (Decompose):** Orchestrator identifies three parallel sub-tasks.
```python
tasks = orchestrator.decompose(query)
# → [
#   {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#   {"agent": "supplier", "task": "get_eta_and_cost", "sku": "SKU-991"},
#   {"agent": "pricing", "task": "get_reorder_cost", "sku": "SKU-991"}
# ]
```
**Step 2 (Dispatch):** All three tasks are published to Pub/Sub simultaneously.
**Step 3 (Execute in Parallel):** Each Cloud Run agent handles its task independently:
```python
# Inventory Agent
stock = bq_client.query("""
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = 'SKU-991' AND store_id = 'DC-ATL'
""").result()

# Pricing Agent (BigQuery ML) -- ML.PREDICT is a table function,
# so it belongs in the FROM clause
reorder_cost = bq_client.query("""
    SELECT * FROM ML.PREDICT(MODEL `retail.pricing_model`,
      (SELECT * FROM pricing_signals WHERE sku_id = 'SKU-991'))
""").result()
```
**Step 4 (Synthesize):**
```
Orchestrator → "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days.
                Current stock: 8 units (below reorder threshold of 15)."
```
Total time is the slowest agent's latency, not the sum of all three.
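The max-not-sum property is easy to demonstrate with a local stand-in for the Pub/Sub fan-out; here `ThreadPoolExecutor` plays the role of the bus, and the agent functions and payloads are illustrative stubs:

```python
from concurrent.futures import ThreadPoolExecutor

def inventory_agent(sku): return {"agent": "inventory", "stock": 8}
def supplier_agent(sku):  return {"agent": "supplier", "eta_days": 3, "cost": 4.20}
def pricing_agent(sku):   return {"agent": "pricing", "reorder_cost": 210.0}

def dispatch_parallel(sku: str) -> dict:
    """Fan out one SKU to all specialist agents concurrently and
    collect results keyed by agent name."""
    agents = [inventory_agent, supplier_agent, pricing_agent]
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = pool.map(lambda fn: fn(sku), agents)
    return {r["agent"]: r for r in results}

results = dispatch_parallel("SKU-991")
```

If each stub instead slept for one second, the whole dispatch would still finish in about one second rather than three, which is exactly what the Pub/Sub fan-out buys at production scale.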
### The Pub/Sub Design: Why It Matters
Three properties you get for free:
- Loose coupling: agents have no direct dependency on each other, only on topic names
- Fault tolerance: if an agent crashes, the message is retained and redelivered on recovery
- Independent scaling: each Cloud Run agent scales on its own Pub/Sub queue depth
### Shared Memory: The `agent_decision_log` Table
Every orchestrated request is fully logged:
```sql
CREATE TABLE retail.agent_decision_log (
  request_id      STRING,
  ts              TIMESTAMP,
  agent_called    STRING,
  tools_used      ARRAY<STRING>,
  input_payload   JSON,
  output_payload  JSON,
  latency_ms      INT64,
  confidence      FLOAT64
);
```
This table powers weekly evaluation reports and feeds back into model fine-tuning: your audit trail is also your training dataset.
## Layer 3: Agentic RAG for Retail Knowledge
### The Core Problem
Standard RAG (embed query → retrieve chunks → generate) fails in retail because:
- A single customer question often spans multiple knowledge domains (policy + inventory + product specs)
- Inventory data goes stale in minutes; you cannot index it as static documents
- Retrieval confidence varies, and a system that cannot detect low-confidence answers will hallucinate
**Design principle:** RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.
### Three Retrieval Sources
**1. Policy & Compliance Index (Vertex AI Search)**
Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).
**BM25 matters here:** product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.
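One common way to combine a dense (semantic) ranking with a sparse (BM25) ranking is reciprocal rank fusion. Vertex AI Search does its own fusion internally; this toy RRF sketch, with made-up doc ids, just shows why the combination helps:

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked doc-id lists: each doc
    scores 1/(k + rank) per list it appears in, summed across lists."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_warranty", "doc_returns", "doc_sizing"]   # semantic hits
sparse = ["doc_part_4471", "doc_returns"]                # BM25 nails the part number
fused  = rrf_fuse(dense, sparse)
```

`doc_returns` wins because both retrievers rank it; the exact-match part-number doc still surfaces even though the semantic side missed it entirely.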
**2. Product Catalog Index (Vertex AI Search)**
Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so "waterproof jacket similar to this one" works.
**3. Live Operational Data (BigQuery as a Tool)**
Inventory levels, order status, real-time pricing: not indexed as documents but called as a live tool. This is the key architectural decision that prevents stale answers.
```python
tools = [
    VertexAISearchTool(index="retail_policy_index"),
    VertexAISearchTool(index="retail_product_index"),
    BigQueryTool(query_template=INVENTORY_QUERY)  # live call, not indexed
]
```
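The routing decision behind that tool list can be sketched per sub-query. In production the LLM itself chooses the tool; this keyword heuristic is only a stand-in to make the routing policy concrete (keyword lists and tool names are illustrative):

```python
LIVE_KEYWORDS = ("in stock", "inventory", "order status", "price", "available")
POLICY_KEYWORDS = ("return", "warranty", "policy", "bopis")

def route_sub_query(sub_query: str) -> str:
    """Pick the retrieval source for one sub-query. Anything touching
    live operational data must go to BigQuery, never to a static index."""
    q = sub_query.lower()
    if any(kw in q for kw in LIVE_KEYWORDS):
        return "bigquery_tool"
    if any(kw in q for kw in POLICY_KEYWORDS):
        return "retail_policy_index"
    return "retail_product_index"

route_sub_query("is the 40V battery in stock at Cumming, GA?")
```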
### Query Decomposition in Action
Customer query: "Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"
**Agent plan:**

```
Sub-query A → Policy Index:  "online purchase battery return policy in-store"
Sub-query B → BigQuery Tool: SELECT units_available
                             FROM inventory_snapshot
                             WHERE sku_id='SKU-4471' AND store='GA-CUMMING'

Validate Sub-query A: relevance score > 0.82 threshold ✓
Validate Sub-query B: live data, timestamp 2 minutes ago ✓
```

**Synthesized answer:**

```
"Yes: online purchases can be returned in-store within 90 days (Policy §3.2).
 The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA
 as of 14:07 EST today."
```
Every fact is cited. No hallucination. No "please check the website."
### The Self-Correction Loop
```python
MAX_RETRIES = 3

def answer_or_escalate(query: str, index_id: str):
    original_query = query
    for attempt in range(MAX_RETRIES):
        result = vertex_search.retrieve(query, index=index_id)
        if result.confidence_score >= THRESHOLD:
            return result
        # Reformulate: broaden scope, try synonyms, switch retrieval mode
        query = agent.reformulate(query, attempt)
    # After max retries: escalate to the human agent queue
    return escalate_to_human(original_query)
```
This loop means your system knows what it does not know, and routes accordingly.
## How the Three Layers Connect
The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:
```
1. Customer browses → Pub/Sub event → Personalization Engine
   surfaces relevant products (Layer 1)
2. Inventory drops below threshold → Pub/Sub alert →
   Orchestrator Agent dispatches reorder across 3 specialist
   agents in parallel (Layer 2)
3. Customer asks "Is this in stock?" → Customer Agent (Layer 2)
   → Agentic RAG (Layer 3) queries BigQuery live + policy index
   → grounded, cited answer in < 2s
4. All events → BigQuery agent_decision_log + interaction_log
   → weekly eval reports + model retraining for Layers 1 & 3
```
The feedback loop is the platform. Every interaction trains the next version of every model.
## Observability: One Dashboard, Three Layers
All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:
| Metric | Layer | Source Table |
|---|---|---|
| Recommendation CTR by segment | Personalization | interaction_log |
| ANN retrieval latency p99 | Personalization | serving_metrics |
| Agent task parallelism ratio | Multi-Agent | agent_decision_log |
| Reorder decision accuracy | Multi-Agent | agent_decision_log |
| RAG retrieval precision@5 | Agentic RAG | agent_query_log |
| Re-query rate | Agentic RAG | agent_query_log |
When retrieval precision drops, you know before customers notice.
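The retrieval-precision metric in that table is cheap to compute once each `agent_query_log` row stores the retrieved doc ids alongside human-labeled relevant ones (the column pairing is an assumption about your logging schema):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved docs that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

# Two relevant docs in the top five → precision@5 of 0.4
precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d9"})
```

Averaged over a week of logged queries, a drop in this number is the early-warning signal the dashboard is watching for.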
## Where to Start
Don't try to ship all three layers at once. Here is a proven sequencing:
**Weeks 1–4: Lay the data foundation**
- Set up BigQuery tables: `inventory_snapshot`, `interaction_log`, `agent_decision_log`
- Stand up Pub/Sub topics and the Dataflow streaming job
- This infrastructure is shared by all three layers; do it once, use it everywhere
**Weeks 5–8: Ship Personalization (Layer 1)**
- Train a two-tower model on BigQuery interaction history
- Index item embeddings into Vertex AI Matching Engine
- Wire up Cloud Run serving API
- Measure: recommendation CTR vs. batch baseline
**Weeks 9–12: Add Multi-Agent Ops (Layer 2)**
- Start with two agents: Inventory + Pricing
- Orchestrator on Vertex AI Reasoning Engine
- Add Supplier Agent once the first two are stable
**Weeks 13–16: Add Agentic RAG (Layer 3)**
- Index return policy + product catalog into Vertex AI Search
- Wire the BigQuery inventory tool into the agent
- Deploy as the Customer Agent inside your multi-agent system
The Pub/Sub bus means each new layer plugs in without touching what already works.
## Key Takeaways
- **Share infrastructure, not code.** BigQuery and Pub/Sub serve all three layers. Build them once.
- **The Customer Agent IS Agentic RAG.** Don't build these as separate projects.
- **The `agent_decision_log` is your most valuable table.** It is your audit trail, your eval dataset, and your retraining signal.
- **Personalization cold start is solved by context, not history.** Device + time + location gets you 80% of the way there for new users.
- **Hybrid retrieval beats pure vector search for retail.** BM25 handles part numbers and model codes that semantic search misses.