DEV Community: prithviraj.veluchamy@gmail.com

LLM Evaluation & Observability in Production Retail Systems on GCP

prithviraj.veluchamy@gmail.com — Thu, 02 Apr 2026 11:06:00 +0000

Most teams know when their LLM is wrong after a customer complains. Production-grade retail AI requires knowing before that — with metrics, traces, and automated eval pipelines that catch drift, hallucination, and degradation continuously. This article shows you how to build that system on GCP.

🧭 Why LLM Observability in Retail Is Different

Traditional ML observability tracks distribution drift on structured features and monitors a single scalar metric — accuracy, RMSE, AUC. LLMs break this model in three ways:

Outputs are unstructured. There is no ground-truth label for "did the agent give a good answer?" arriving in real time.
Failure modes are silent. A hallucinated return policy answer looks identical to a correct one in your latency dashboard.
Context windows change behavior. The same model behaves differently depending on what is in the prompt — retrieved chunks, session history, tool results.

In retail specifically, the stakes are asymmetric. A mis-personalized recommendation costs a click. A hallucinated return policy answer costs a customer, a refund, and potentially a chargeback.

Design principle: Treat every LLM inference as a structured event with inputs, outputs, retrieved context, tool calls, and a confidence signal — not just a latency measurement.

🏗️ The Observability Stack

The system is built on four GCP components working in concert:

Component	Role
Cloud Logging + Log Router	Capture structured inference events from Cloud Run
BigQuery	Central eval store — every inference logged with full context
Vertex AI Evaluation Service	Automated metric computation (ROUGE, BLEU, coherence, groundedness)
Looker Studio	Real-time eval dashboard over BigQuery eval tables

And one additional layer that makes it production-grade:

Component	Role
Vertex AI Pipelines	Scheduled eval pipeline — nightly batch scoring of sampled inferences
Cloud Monitoring + Alerting	Threshold-based alerts when eval metrics degrade

📐 The Inference Event Schema

Everything starts with logging the right data. Every LLM inference — whether personalization re-ranking, agent orchestration, or RAG answer generation — emits a structured event to Cloud Logging, which Log Router sinks to BigQuery.

# Cloud Run inference handler — structured logging
from datetime import datetime, timezone

from google.cloud import logging as gcp_logging

client = gcp_logging.Client()
logger = client.logger("llm_inference_events")

def log_inference_event(
    request_id: str,
    layer: str,              # "personalization" | "multi_agent" | "agentic_rag"
    agent_id: str,
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int,
    retrieved_chunks: list[dict],
    tool_calls: list[dict],
    output: str,
    latency_ms: int,
    self_confidence: float,  # model's own logprob-derived confidence
):
    """Emit one structured inference event to Cloud Logging."""
    logger.log_struct({
        "request_id":        request_id,
        "ts":                datetime.now(timezone.utc).isoformat(),
        "layer":             layer,
        "agent_id":          agent_id,
        "model_id":          model_id,
        "prompt_tokens":     prompt_tokens,
        "completion_tokens": completion_tokens,
        # Keys mirror the BigQuery column names so the sink needs no renaming.
        "retrieved_chunks":  retrieved_chunks,   # [{chunk_id, relevance_score, source}]
        "tool_calls":        tool_calls,          # [{tool_name, input_payload, output_payload, latency_ms}]
        "output":            output,
        "latency_ms":        latency_ms,
        "self_confidence":   self_confidence,
    }, severity="INFO")

This single schema powers everything downstream — dashboards, evals, fine-tuning, and alerting.
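The schema leaves open how `self_confidence` is derived from log-probabilities. One common option, sketched here as an assumption rather than the method this platform necessarily uses, is the geometric mean of the token probabilities, computed from the per-token log-probabilities that some model APIs return. The function name and the example values are illustrative.

```python
import math

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean(logprobs)), in (0, 1]."""
    if not token_logprobs:
        return 0.0  # no tokens generated: treat as zero confidence
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)

# Hypothetical per-token logprobs for two short completions.
confident = confidence_from_logprobs([-0.05, -0.10, -0.02])
uncertain = confidence_from_logprobs([-1.2, -0.9, -2.0])
```

With these inputs the first completion lands well above the 0.6 cut-off used later in the pipeline and the second lands below it. Averaging in log space keeps long completions from being penalised simply for having more tokens.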

🗄️ BigQuery Eval Schema

Log Router sinks the inference events into a partitioned BigQuery table:

CREATE TABLE retail_ai.llm_inference_log (
  request_id        STRING        NOT NULL,
  ts                TIMESTAMP     NOT NULL,
  layer             STRING,       -- personalization | multi_agent | agentic_rag
  agent_id          STRING,
  model_id          STRING,
  prompt_tokens     INT64,
  completion_tokens INT64,
  retrieved_chunks  ARRAY<STRUCT<
    chunk_id        STRING,
    relevance_score FLOAT64,
    source          STRING
  >>,
  tool_calls        ARRAY<STRUCT<
    tool_name       STRING,
    input_payload   JSON,
    output_payload  JSON,
    latency_ms      INT64
  >>,
  output            STRING,
  latency_ms        INT64,
  self_confidence   FLOAT64,
  -- Populated by the nightly eval pipeline:
  groundedness_score  FLOAT64,
  coherence_score     FLOAT64,
  rouge_l             FLOAT64,
  hallucination_flag  BOOL,
  human_label         STRING        -- populated for sampled reviews
)
PARTITION BY DATE(ts)
CLUSTER BY layer, agent_id;

Partitioned by date, clustered by layer and agent — eval queries over 30-day windows run in seconds, not minutes.

⚙️ The Nightly Eval Pipeline (Vertex AI Pipelines)

Automated evaluation runs nightly over a stratified sample of the previous day's inferences. The pipeline has four stages:

[Sample Inferences] → [Compute Automatic Metrics] → [Flag Anomalies] → [Write Eval Results]

Stage 1 — Stratified Sampling

@component
def sample_inferences(
    eval_date: str,
    sample_per_layer: int = 200,
) -> list[dict]:
    """
    Stratified sample: 200 inferences per layer per day.
    Oversample low-confidence inferences (self_confidence < 0.6).
    """
    query = f"""
        WITH ranked AS (
          SELECT *,
            ROW_NUMBER() OVER (
              PARTITION BY layer
              ORDER BY
                -- oversample low-confidence
                CASE WHEN self_confidence < 0.6 THEN 0 ELSE 1 END,
                RAND()
            ) AS rn
          FROM retail_ai.llm_inference_log
          WHERE DATE(ts) = '{eval_date}'
        )
        SELECT * FROM ranked WHERE rn <= {sample_per_layer}
    """
    return bq_client.query(query).to_dataframe().to_dict("records")

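Ranking low-confidence inferences first concentrates the eval budget where failures are most likely. Because those rows are taken before any others, a layer with more than 200 low-confidence inferences in a day yields a sample made up entirely of them, so metrics computed on the sample are deliberately skewed toward hard cases rather than representative of all traffic.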
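The ordering logic in the query can be sketched in plain Python to make the selection behaviour concrete. This is an in-memory stand-in for the SQL, not the pipeline component itself; `sample_per_layer` and the `seed` parameter are names introduced for illustration.

```python
import random
from collections import defaultdict

def sample_per_layer(inferences, n, threshold=0.6, seed=None):
    """Per layer: low-confidence rows first, random order within each group."""
    rng = random.Random(seed)
    by_layer = defaultdict(list)
    for inf in inferences:
        by_layer[inf["layer"]].append(inf)

    sample = []
    for rows in by_layer.values():
        # Sort key mirrors the SQL ORDER BY: (0 if low-confidence else 1, RAND()).
        keyed = [
            ((0 if r["self_confidence"] < threshold else 1, rng.random()), r)
            for r in rows
        ]
        keyed.sort(key=lambda pair: pair[0])
        sample.extend(r for _, r in keyed[:n])
    return sample

rows = [
    {"layer": "agentic_rag", "self_confidence": c}
    for c in (0.9, 0.4, 0.8, 0.5, 0.95)
]
picked = sample_per_layer(rows, n=3, seed=0)
```

Both low-confidence rows are always selected here, and the third slot goes to a randomly chosen high-confidence row.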

Stage 2 — Automatic Metric Computation

Three metric families are computed for every sampled inference:

2a. Groundedness (Retrieval Faithfulness)

For RAG inferences — does the output make claims supported by the retrieved chunks?

@component
def score_groundedness(inferences: list[dict]) -> list[dict]:
    results = []
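    for inf in inferences:
        if not inf["retrieved_chunks"]:
            continue  # skip non-RAG inferences

        # Assumes each chunk dict carries its text under "chunk_text". The log
        # schema stores only chunk_id, relevance_score, and source, so the
        # chunk text has to be joined in before this step.
        context = "\n\n".join(
            chunk["chunk_text"] for chunk in inf["retrieved_chunks"]
        )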
        prompt = f"""
You are an evaluator. Given the retrieved context and the model output below,
score the output's groundedness on a scale of 0.0 to 1.0.
Groundedness = every factual claim in the output is directly supported by the context.
Penalise any claim that is not traceable to the context.

Context:
{context}

Output:
{inf["output"]}

Return ONLY a JSON object: {{"groundedness": <float>}}
"""
        response = gemini.generate_content(prompt)
        score = json.loads(response.text)["groundedness"]
        results.append({**inf, "groundedness_score": score})

    return results
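Calling `json.loads` directly on a judge reply fails whenever the model wraps its JSON in a Markdown fence or adds a sentence around it. A more defensive parser might look like the sketch below; `parse_judge_score` is a hypothetical helper, and returning `None` for unusable replies is a design assumption.

```python
import json
import re

def parse_judge_score(raw: str, key: str):
    """Extract a float score in [0, 1] from a judge reply, or None if unusable."""
    # Take the span from the first "{" to the last "}" so that surrounding
    # prose or Markdown fences are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        value = float(json.loads(match.group(0))[key])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return min(1.0, max(0.0, value))  # clamp out-of-range scores

score = parse_judge_score('{"groundedness": 0.8}', "groundedness")
```

Rows that come back as `None` can be left unscored and retried, which keeps one malformed reply from failing the whole nightly batch.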

2b. Coherence

Does the output make logical sense as a response to the inferred query?

coherence_prompt = """
Rate the coherence of the following retail AI response on a scale of 0.0 to 1.0.
Coherence = the response is logically consistent, fluent, and directly addresses
the implied user need. Penalise contradictions, non-sequiturs, or incomplete answers.

Response: {output}

Return ONLY: {{"coherence": <float>}}
"""

2c. Hallucination Detection

A binary flag for responses that assert specific facts (prices, stock counts, policy terms) not present in the retrieved context or tool outputs:

@component
def flag_hallucinations(inferences: list[dict]) -> list[dict]:
    results = []
    for inf in inferences:
        tool_facts = extract_tool_facts(inf["tool_calls"])
        chunk_facts = extract_chunk_facts(inf["retrieved_chunks"])
        all_grounded_facts = tool_facts | chunk_facts

        hallucination_prompt = f"""
You are a fact-checker for a retail AI system.
Grounded facts available to the model:
{json.dumps(list(all_grounded_facts), indent=2)}

Model output:
{inf["output"]}

Does the output assert any specific factual claim (price, stock count, 
policy term, date, SKU) that is NOT present in the grounded facts above?

Return ONLY: {{"hallucination_detected": true|false, "reason": "<string>"}}
"""
        response = gemini.generate_content(hallucination_prompt)
        result = json.loads(response.text)
        results.append({
            **inf,
            "hallucination_flag": result["hallucination_detected"],
            "hallucination_reason": result.get("reason"),
        })
    return results
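The component relies on `extract_tool_facts` and `extract_chunk_facts`, which are not shown. As one possible shape for the first of these, assuming tool calls follow the `tool_name` / `output_payload` layout of the BigQuery schema, each output field can be flattened into a short string so the set union and `json.dumps` calls above work unchanged. The fact format is an assumption.

```python
import json

def extract_tool_facts(tool_calls: list[dict]) -> set[str]:
    """Flatten each tool's output payload into 'tool.field=value' strings."""
    facts = set()
    for call in tool_calls or []:
        payload = call.get("output_payload") or {}
        if isinstance(payload, str):  # payload may arrive as a JSON string
            payload = json.loads(payload)
        if not isinstance(payload, dict):
            continue  # only flat objects are handled in this sketch
        for field, value in payload.items():
            facts.add(f'{call["tool_name"]}.{field}={value}')
    return facts

calls = [
    {"tool_name": "inventory",
     "output_payload": {"units_available": 3, "store": "GA-CUMMING"}},
    {"tool_name": "pricing", "output_payload": '{"price": 4.2}'},
]
facts = extract_tool_facts(calls)
```

Nested payloads would need a recursive walk; this version keeps only top-level fields.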

Stage 3 — Anomaly Detection

After scoring, the pipeline computes a rolling 7-day baseline per layer and flags any layer whose metrics for the day fall more than 2σ below that baseline. The component below applies the check to groundedness:

@component
def detect_anomalies(eval_date: str) -> dict:
    baseline_query = f"""
        SELECT
          layer,
          AVG(groundedness_score)  AS baseline_groundedness,
          STDDEV(groundedness_score) AS stddev_groundedness,
          AVG(coherence_score)     AS baseline_coherence,
          AVG(CAST(hallucination_flag AS INT64)) AS baseline_hallucination_rate
        FROM retail_ai.llm_inference_log
        WHERE DATE(ts) BETWEEN DATE_SUB('{eval_date}', INTERVAL 7 DAY)
                           AND DATE_SUB('{eval_date}', INTERVAL 1 DAY)
        GROUP BY layer
    """
    baseline = bq_client.query(baseline_query).to_dataframe()

    today_query = f"""
        SELECT
          layer,
          AVG(groundedness_score)  AS today_groundedness,
          AVG(coherence_score)     AS today_coherence,
          AVG(CAST(hallucination_flag AS INT64)) AS today_hallucination_rate
        FROM retail_ai.llm_inference_log
        WHERE DATE(ts) = '{eval_date}'
        GROUP BY layer
    """
    today = bq_client.query(today_query).to_dataframe()

    anomalies = []
    for _, row in today.merge(baseline, on="layer").iterrows():
        if row["today_groundedness"] < (
            row["baseline_groundedness"] - 2 * row["stddev_groundedness"]
        ):
            anomalies.append({
                "layer": row["layer"],
                "metric": "groundedness",
                "today": row["today_groundedness"],
                "baseline": row["baseline_groundedness"],
            })
    return {"anomalies": anomalies}
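The 2σ rule is easy to check by hand on a small series. The sketch below applies it to daily mean scores; note that it takes the standard deviation across the seven daily means, whereas the SQL above computes `STDDEV` over individual inference scores, which gives a wider band. The numbers are made up for illustration.

```python
from statistics import mean, stdev

def is_anomalous(today: float, history: list[float], n_sigma: float = 2.0) -> bool:
    """True when today's value is more than n_sigma std devs below the mean."""
    if len(history) < 2:
        return False  # not enough data for a standard deviation
    baseline, sigma = mean(history), stdev(history)
    return today < baseline - n_sigma * sigma

# Hypothetical daily mean groundedness for the previous 7 days.
history = [0.86, 0.88, 0.85, 0.87, 0.86, 0.89, 0.87]
dropped = is_anomalous(0.80, history)   # well below the band
normal = is_anomalous(0.85, history)    # inside the band
```

For this history the baseline is about 0.869 and σ about 0.013, so the alert line sits near 0.842.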

📊 The Eval Dashboard (Looker Studio over BigQuery)

Five views cover the full platform, spanning the individual layers and a cross-layer summary. Three of the underlying queries follow:

Daily Metric Trends (all layers)

SELECT
  DATE(ts)                              AS eval_date,
  layer,
  AVG(groundedness_score)               AS avg_groundedness,
  AVG(coherence_score)                  AS avg_coherence,
  AVG(CAST(hallucination_flag AS INT64)) AS hallucination_rate,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(99)] AS p99_latency_ms,
  COUNT(*)                              AS inference_count
FROM retail_ai.llm_inference_log
WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 1 DESC, 2;
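`APPROX_QUANTILES(latency_ms, 100)` returns 101 boundary values (minimum, 99 cut points, maximum), so `OFFSET(99)` picks the 99th percentile. The same idea with the Python standard library, as a small exact-computation sketch on synthetic latencies:

```python
from statistics import quantiles

def p99(latencies_ms: list[float]) -> float:
    """99th percentile: the last of the 99 cut points that split the data into 100 groups."""
    return quantiles(latencies_ms, n=100, method="inclusive")[98]

# Synthetic latencies: 1 ms through 1000 ms.
latencies = list(range(1, 1001))
p99_ms = p99(latencies)
```

The BigQuery function is approximate by design, so its result on real data can differ slightly from an exact calculation like this one.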

Low-Confidence Inference Drill-Down

SELECT
  request_id,
  ts,
  agent_id,
  self_confidence,
  groundedness_score,
  hallucination_flag,
  output
FROM retail_ai.llm_inference_log
WHERE DATE(ts) = CURRENT_DATE() - 1
  AND (self_confidence < 0.6 OR hallucination_flag = TRUE)
ORDER BY self_confidence ASC
LIMIT 100;

Retrieval Quality (RAG Layer)

SELECT
  chunk.source                        AS source_doc,
  AVG(chunk.relevance_score)          AS avg_retrieval_score,
  AVG(groundedness_score)             AS avg_groundedness,
  COUNT(*)                            AS times_retrieved
FROM retail_ai.llm_inference_log,
UNNEST(retrieved_chunks) AS chunk
WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1
ORDER BY times_retrieved DESC;

This query surfaces which source documents are being retrieved most frequently and whether they actually contribute to grounded answers — your index quality scorecard.

🚨 Alerting — Cloud Monitoring Policies

Three alert policies cover the most critical failure modes:

Policy 1 — Hallucination Rate Spike

# Cloud Monitoring alert policy
displayName: "Retail AI — Hallucination Rate Spike"
combiner: OR   # required by the AlertPolicy API; with one condition, OR and AND behave the same
conditions:
  - displayName: "hallucination_rate > 5% (7-day rolling)"
    conditionThreshold:
      filter: >
        resource.type="bigquery_table"
        metric.type="custom.googleapis.com/retail_ai/hallucination_rate"
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 3600s   # sustained for 1 hour before alert fires
alertStrategy:
  notificationRateLimit:
    period: 3600s
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/PAGERDUTY_CHANNEL

Policy 2 — Groundedness Drop

Fires when the 24-hour average groundedness score for the RAG layer falls below 0.75 — typically indicating index drift (stale documents) or a retrieval pipeline failure.
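The condition behind this policy can be sketched as a plain function over the scores in the window. The 0.75 threshold comes from the description above; the `min_samples` guard is an added assumption to avoid firing on a handful of scored inferences.

```python
def groundedness_drop(scores_24h: list[float], threshold: float = 0.75,
                      min_samples: int = 30) -> bool:
    """True when the window's mean groundedness is below the threshold."""
    if len(scores_24h) < min_samples:
        return False  # too few scored inferences to trust the average
    return sum(scores_24h) / len(scores_24h) < threshold

healthy = [0.9] * 40
degraded = [0.9] * 10 + [0.6] * 30   # mean = (9 + 18) / 40 = 0.675
```

Here `groundedness_drop(healthy)` is false and `groundedness_drop(degraded)` is true.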

Policy 3 — p99 Latency Breach

# Metric written by the serving layer after every inference
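import time

from google.cloud import monitoring_v3

monitoring_client = monitoring_v3.MetricServiceClient()

monitoring_client.create_time_series(
    name=f"projects/{PROJECT_ID}",
    time_series=[{
        "metric": {
            "type": "custom.googleapis.com/retail_ai/inference_latency_ms",
            "labels": {"layer": layer, "agent_id": agent_id},
        },
        "resource": {"type": "global", "labels": {"project_id": PROJECT_ID}},
        # The Python client expects snake_case proto field names.
        "points": [{
            "interval": {"end_time": {"seconds": int(time.time())}},
            "value": {"int64_value": latency_ms},
        }],
    }]
)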

🔁 Closing the Loop — Eval-Driven Fine-Tuning

The eval pipeline does more than surface problems — it produces the training data for the next model version.

# Weekly: export flagged inferences for human review + supervised fine-tuning
export_query = """
    SELECT
      request_id,
      output                AS model_output,
      groundedness_score,
      hallucination_flag,
      human_label,           -- populated by reviewer workflow
      retrieved_chunks,
      tool_calls
    FROM retail_ai.llm_inference_log
    WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
      AND (hallucination_flag = TRUE OR groundedness_score < 0.70)
      AND human_label IS NOT NULL
    ORDER BY groundedness_score ASC
"""

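from datetime import date

flagged = bq_client.query(export_query).to_dataframe()
eval_date = date.today().isoformat()

# Write to GCS in JSONL format for Vertex AI supervised fine-tuning.
# pandas needs the gcsfs package installed to write to gs:// paths.
flagged.to_json(
    f"gs://retail-ai-tuning/weekly/{eval_date}/flagged_inferences.jsonl",
    orient="records",
    lines=True
)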

The human review workflow — a lightweight Cloud Run app backed by the same BigQuery table — lets domain experts label outputs as correct, incorrect, or needs-revision. Those labels feed directly into Vertex AI supervised fine-tuning jobs, closing the eval-to-improvement loop.

🔗 How This Connects to the Unified Platform

This observability layer is not bolt-on — it is wired into the three-layer retail AI platform:

Platform Layer	What Eval Monitors	Key Metric
Personalization	Re-ranking model output quality, cold-start recommendation coherence	Coherence score by user segment
Multi-Agent Ops	Agent decision accuracy, tool call success rate, orchestrator reasoning quality	Tool call success rate, hallucination rate per agent
Agentic RAG	Retrieval groundedness, index freshness, self-correction trigger rate	Groundedness score, re-query rate
Cross-layer	End-to-end latency, overall hallucination rate, human escalation rate	p99 latency, hallucination rate, escalation rate

The llm_inference_log table is the single source of truth for all of these — one schema, one dashboard, one alert policy set.

💡 Key Takeaways

Log the full context, not just the output. Retrieved chunks, tool call inputs/outputs, and self-confidence scores are what make downstream eval possible.
Use LLM-as-judge for groundedness and coherence. Gemini evaluating Gemini outputs at scale is practical and cost-effective for production eval at retail volumes.
Oversample low-confidence inferences. Your eval budget is limited — concentrate it where failures are most likely.
Retrieval quality IS answer quality. The source_doc query above will tell you which documents are dragging your groundedness scores down. Fix the index before fine-tuning the model.
The eval pipeline is your fine-tuning dataset. Every flagged, human-reviewed inference is a supervised training example. Treat it that way from day one.

🚀 Where to Start

Week 1: Instrument your Cloud Run inference handlers with structured logging using the schema above. Sink to BigQuery via Log Router.
Week 2: Write the five dashboard queries. Stand up Looker Studio. You now have observability.
Week 3: Add the nightly Vertex AI Pipeline for groundedness and hallucination scoring on a 200-inference sample.
Week 4: Set up the three Cloud Monitoring alert policies. Add a human review queue for flagged inferences.
Ongoing: Feed human-labeled inferences back into Vertex AI fine-tuning jobs monthly.

The schema is the hardest part. Get it right in Week 1 and everything else is additive.

Building an AI-Native Retail Platform on GCP: Personalization + Multi-Agent Ops + Agentic RAG as One Unified Stack

prithviraj.veluchamy@gmail.com — Mon, 23 Mar 2026 00:44:33 +0000

A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat — the answer comes back grounded in live inventory and your return policy, cited and accurate.

This is not three separate AI projects. It is one unified platform — and this article shows you how to build it on GCP.

🏗️ The Three Layers of an AI-Native Retail Platform

Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:

Layer	What It Does	GCP Services
Real-Time Personalization	Surfaces relevant products from millions of SKUs in < 120ms	Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run
Multi-Agent Operations	Coordinates inventory, pricing, supplier, and customer agents in parallel	Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run
Agentic RAG	Answers complex queries grounded in live data + policy docs	Vertex AI Search, Gemini, BigQuery (as a live tool)

The key insight: all three layers share the same data backbone — BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.

📐 Unified Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        FRONTEND / API GATEWAY                   │
└───────────┬──────────────────┬───────────────────┬─────────────┘
            │                  │                   │
    ┌───────▼──────┐  ┌────────▼───────┐  ┌───────▼──────────┐
    │ PERSONALI-   │  │  MULTI-AGENT   │  │   AGENTIC RAG    │
    │ ZATION       │  │  ORCHESTRATOR  │  │   (Customer Q&A) │
    │ ENGINE       │  │  (Gemini 1.5)  │  │   (Gemini +      │
    │ (Cloud Run)  │  │  (Vertex AI    │  │    Vertex Search) │
    └───────┬──────┘  │   Reasoning)   │  └───────┬──────────┘
            │         └────────┬───────┘          │
            │                  │                  │
            └──────────────────┼──────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │         GOOGLE CLOUD PUB/SUB     │
              │         (Shared Event Spine)      │
              └───┬──────────┬──────────┬────────┘
                  │          │          │
          ┌───────▼──┐ ┌─────▼────┐ ┌──▼────────────┐
          │ Dataflow  │ │Specialist│ │ Vertex AI     │
          │ Streaming │ │ Agents   │ │ Search Index  │
          └───────┬──┘ └─────┬────┘ └──┬────────────┘
                  │          │          │
              ┌───▼──────────▼──────────▼───┐
              │          BIGQUERY            │
              │   (Shared Operational Store) │
              └─────────────────────────────┘

🎯 Layer 1: Real-Time Personalization Engine

The Core Problem

Daily batch recommendations ignore the most powerful signal available: what the user is doing right now. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.

Design principle: Personalization is a retrieval problem. Given a user and their context right now, find the items most likely to convert — in under 120ms.

The Six-Stage Pipeline

Stage 1 — Event Capture (Pub/Sub)

Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget — it does not wait for a response.

{
  "event_type": "CART_ADD",
  "user_id": "u_8821",
  "sku_id": "SKU-4471",
  "session_id": "s_992abc",
  "ts": "2026-03-22T14:03:11Z",
  "context": { "device": "mobile", "location": "Atlanta, GA" }
}
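Pub/Sub message bodies are raw bytes, so the client has to serialize the event before publishing. A minimal sketch of that step, with a required-field check added as an assumption (the article does not say where validation happens), and `encode_event` as an illustrative name:

```python
import json

REQUIRED_FIELDS = ("event_type", "user_id", "sku_id", "session_id", "ts")

def encode_event(event: dict) -> bytes:
    """Validate required fields and serialize the event as compact UTF-8 JSON."""
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps(event, separators=(",", ":")).encode("utf-8")

event = {
    "event_type": "CART_ADD",
    "user_id": "u_8821",
    "sku_id": "SKU-4471",
    "session_id": "s_992abc",
    "ts": "2026-03-22T14:03:11Z",
    "context": {"device": "mobile", "location": "Atlanta, GA"},
}
payload = encode_event(event)
```

The resulting bytes are what a publisher client would send as the message data.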

Stage 2 — Stream Enrichment (Dataflow)

A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:

Session feature update → Vertex AI Feature Store (< 5s latency)
Interaction log → BigQuery (for offline model training)

Stage 3 — Feature Assembly (Vertex AI Feature Store)

At query time, three feature groups are fetched in a single low-latency call:

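# feature_store_client: aiplatform_v1.FeaturestoreOnlineServingServiceClient
# USER_ENTITY_TYPE: full resource name ending in /entityTypes/user
feature_store_client.read_feature_values(
    request={
        "entity_type": USER_ENTITY_TYPE,
        "entity_id": user_id,
        "feature_selector": {
            "id_matcher": {
                "ids": ["purchase_history", "session_clicks", "device_type", "location"]
            }
        },
    }
)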

Stage 4 — ANN Retrieval (Vertex AI Matching Engine)

The assembled user context vector is submitted to Matching Engine — Google's managed ANN index. It returns the top 50 candidate SKUs from a catalog of millions in under 10ms.

response = index_endpoint.find_neighbors(
    deployed_index_id="retail_item_embeddings",
    queries=[user_context_vector],
    num_neighbors=50
)

Under the hood: Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.
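To make the retrieval step concrete, here is an exact brute-force version of what an ANN index approximates: filter to in-stock items, score by cosine similarity, keep the top k. It is a toy sketch on a four-item catalog with made-up two-dimensional embeddings, not how Matching Engine is implemented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_in_stock(query, items, k):
    """items maps sku -> (embedding, in_stock). Out-of-stock items are filtered first."""
    candidates = [
        (cosine(query, emb), sku)
        for sku, (emb, in_stock) in items.items()
        if in_stock
    ]
    candidates.sort(reverse=True)
    return [sku for _, sku in candidates[:k]]

catalog = {
    "SKU-BOOT":     ([0.9, 0.1], True),
    "SKU-UMBRELLA": ([0.8, 0.3], True),
    "SKU-SNEAKER":  ([0.1, 0.9], True),
    "SKU-PONCHO":   ([0.95, 0.05], False),  # closest match, but out of stock
}
result = top_k_in_stock([1.0, 0.0], catalog, k=2)
```

The poncho would rank first on similarity alone, but the pre-filter removes it before scoring, which is the behaviour described above.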

Stage 5 — Re-Ranking (Vertex AI Prediction)

A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:

Current inventory level
Promotional pricing flag
User's price sensitivity segment
Real-time trend score
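As a stand-in for the re-ranking model, a weighted sum over these signals shows how a candidate with a lower ANN score can still move up. The weights, feature names, and normalised values below are hypothetical; a trained model would learn the combination rather than hard-code it.

```python
# Hypothetical weights; features are assumed to be normalised to [0, 1].
WEIGHTS = {
    "ann_score": 1.0,
    "inventory_level": 0.2,
    "on_promo": 0.3,
    "price_fit": 0.4,     # match to the user's price sensitivity segment
    "trend_score": 0.3,
}

def rerank(candidates: list[dict], weights: dict = WEIGHTS, top_n: int = 10) -> list[dict]:
    """Sort candidates by a weighted sum of their features, highest first."""
    def score(c):
        return sum(w * c[name] for name, w in weights.items())
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    {"sku": "SKU-A", "ann_score": 0.9, "inventory_level": 0.1,
     "on_promo": 0.0, "price_fit": 0.5, "trend_score": 0.2},
    {"sku": "SKU-B", "ann_score": 0.8, "inventory_level": 0.9,
     "on_promo": 1.0, "price_fit": 0.8, "trend_score": 0.6},
]
ranked = rerank(candidates)
```

SKU-A scores 1.18 and SKU-B scores 1.78, so SKU-B is served first despite its lower embedding similarity.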

Stage 6 — Serve (Cloud Run)

Top 10 results + display metadata returned to the frontend. End-to-end: < 120ms at p99.

Handling Cold Start

Scenario	Strategy
New user (no history)	Serve contextual top-trending items by device + time + location
New item (no interactions)	Content-based embedding from product description + image on ingestion
After first click	Session features kick in within 5 seconds

🤖 Layer 2: Multi-Agent Operations

The Core Problem

A single LLM handling all retail operations hits three walls: context overload, sequential latency, and unmaintainable prompts. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context — reasoning quality degrades.

Design principle: Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.

Agent Architecture

Operator / System Trigger
        │
        ▼
┌─────────────────────────────────┐
│  ORCHESTRATOR AGENT             │
│  Gemini 1.5 Pro                 │
│  Vertex AI Reasoning Engine     │
│  - Decomposes tasks             │
│  - Routes to specialists        │
│  - Synthesizes final response   │
└────┬──────────┬──────────┬──────┘
     │  Pub/Sub │          │
     ▼          ▼          ▼
┌─────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐
│Inventory│ │Pricing │ │Supplier  │ │Customer  │
│Agent    │ │Agent   │ │Agent     │ │Agent     │
│BigQuery │ │BQ ML   │ │Vertex AI │ │Agentic   │
│         │ │        │ │Search    │ │RAG ←────── Layer 3
└─────────┘ └────────┘ └──────────┘ └──────────┘

Notice: the Customer Agent IS Layer 3 — Agentic RAG is not separate, it is the intelligence layer of the Customer Agent. This is where the three layers connect.

A Reorder Request — Traced End-to-End

Input: "Should we reorder SKU-991?"

Step 1 — Decompose: Orchestrator identifies three parallel sub-tasks.

tasks = orchestrator.decompose(query)
# → [
#     {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#     {"agent": "supplier",  "task": "get_eta_and_cost", "sku": "SKU-991"},
#     {"agent": "pricing",   "task": "get_reorder_cost", "sku": "SKU-991"}
# ]

Step 2 — Dispatch: All three tasks published to Pub/Sub simultaneously.

Step 3 — Execute in Parallel: Each Cloud Run agent handles its task independently:

# Inventory Agent
stock = bq_client.query("""
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = 'SKU-991' AND store_id = 'DC-ATL'
""").result()

# Pricing Agent (BigQuery ML)
reorder_cost = bq_client.query("""
    SELECT * FROM ML.PREDICT(MODEL `retail.pricing_model`,
        (SELECT * FROM pricing_signals WHERE sku_id = 'SKU-991'))
""").result()

Step 4 — Synthesize:

Orchestrator → "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days. 
                Current stock: 8 units (below reorder threshold of 15)." ✅

Total time = max(slowest agent) — not the sum of all three.
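The latency claim can be demonstrated in-process with `asyncio.gather`, using sleeps as stand-ins for the three agents. This is only an illustration of the fan-out timing; in the platform itself the fan-out goes through Pub/Sub, and the durations here are invented.

```python
import asyncio
import time

async def call_agent(name: str, seconds: float):
    """Simulated agent call that takes `seconds` to respond."""
    await asyncio.sleep(seconds)
    return name, seconds

async def dispatch_all():
    start = time.perf_counter()
    # All three calls run concurrently; gather waits for the slowest one.
    results = await asyncio.gather(
        call_agent("inventory", 0.05),
        call_agent("supplier", 0.15),
        call_agent("pricing", 0.10),
    )
    return dict(results), time.perf_counter() - start

results, elapsed = asyncio.run(dispatch_all())
```

The elapsed time tracks the slowest call (about 0.15 s) rather than the 0.30 s a sequential run would take.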

The Pub/Sub Design — Why It Matters

Three properties you get for free:

Loose coupling: agents have no direct dependency on each other, only on topic names
Fault tolerance: if an agent crashes, the message is retained and redelivered on recovery
Independent scaling: each Cloud Run agent scales with the message volume on its own Pub/Sub subscription

Shared Memory: The `agent_decision_log` Table

Every orchestrated request is fully logged:

CREATE TABLE retail.agent_decision_log (
  request_id      STRING,
  ts              TIMESTAMP,
  agent_called    STRING,
  tools_used      ARRAY<STRING>,
  input_payload   JSON,
  output_payload  JSON,
  latency_ms      INT64,
  confidence      FLOAT64
);

This table powers weekly evaluation reports and feeds back into model fine-tuning — your audit trail is also your training dataset.

📚 Layer 3: Agentic RAG for Retail Knowledge

The Core Problem

Standard RAG (embed query → retrieve chunks → generate) fails retail because:

A single customer question often spans multiple knowledge domains (policy + inventory + product specs)
Inventory data goes stale in minutes — you cannot index it as static documents
Retrieval confidence varies — a system that cannot detect low-confidence answers will hallucinate

Design principle: RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.

Three Retrieval Sources

1. Policy & Compliance Index (Vertex AI Search)

Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).

BM25 matters here: product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.

2. Product Catalog Index (Vertex AI Search)

Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so "waterproof jacket similar to this one" works.

3. Live Operational Data (BigQuery as a Tool)

Inventory levels, order status, real-time pricing — not indexed as documents but called as a live tool. This is the key architectural decision that prevents stale answers.

tools = [
    VertexAISearchTool(index="retail_policy_index"),
    VertexAISearchTool(index="retail_product_index"),
    BigQueryTool(query_template=INVENTORY_QUERY)  # live call, not indexed
]

Query Decomposition in Action

Customer query: "Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"

Agent Plan:
  Sub-query A → Policy Index: "online purchase battery return policy in-store"
  Sub-query B → BigQuery Tool: SELECT units_available 
                               FROM inventory_snapshot 
                               WHERE sku_id='SKU-4471' AND store='GA-CUMMING'

Agent validates Sub-query A: relevance score > 0.82 threshold ✅

Agent validates Sub-query B: live data, timestamp 2 minutes ago ✅

Synthesized answer:

"Yes — online purchases can be returned in-store within 90 days (Policy §3.2). 
The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA 
as of 14:07 EST today."

Every fact is cited. No hallucination. No "please check the website."
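The two validation checks in the trace can be expressed as small predicates. The 0.82 relevance threshold is the one quoted above; the five-minute freshness budget is an assumption added for the sketch, as are the function names.

```python
from datetime import datetime, timedelta, timezone

RELEVANCE_THRESHOLD = 0.82            # threshold quoted in the trace
MAX_DATA_AGE = timedelta(minutes=5)   # assumed freshness budget

def index_hit_is_valid(relevance_score: float) -> bool:
    """Accept an index result only above the relevance threshold."""
    return relevance_score > RELEVANCE_THRESHOLD

def live_data_is_fresh(fetched_at: datetime, now: datetime) -> bool:
    """Accept live tool output only if it is recent enough."""
    return now - fetched_at <= MAX_DATA_AGE

now = datetime(2026, 3, 22, 19, 7, tzinfo=timezone.utc)
policy_ok = index_hit_is_valid(0.91)
inventory_ok = live_data_is_fresh(now - timedelta(minutes=2), now)
stale = live_data_is_fresh(now - timedelta(minutes=30), now)
```

An answer would be synthesized only when every sub-query passes its check; a failed check is what sends the agent into reformulation or escalation.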

The Self-Correction Loop

MAX_RETRIES = 3

def retrieve_with_self_correction(original_query, index_id):
    query = original_query
    for attempt in range(MAX_RETRIES):
        result = vertex_search.retrieve(query, index=index_id)

        if result.confidence_score >= THRESHOLD:
            return result

        # Reformulate: broaden scope, try synonyms, switch retrieval mode
        query = agent.reformulate(query, attempt)

    # After max retries: escalate to human agent queue
    escalate_to_human(original_query)
    return None

This loop means your system knows what it does not know — and routes accordingly.
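The loop can be exercised without any cloud services by passing the retriever and reformulator in as callables. The sketch below uses stub functions and a hypothetical `Retrieval` type; the 0.8 threshold is an arbitrary value for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Retrieval:
    text: str
    confidence_score: float

def retrieve_with_retries(
    query: str,
    retrieve: Callable[[str], Retrieval],
    reformulate: Callable[[str, int], str],
    threshold: float = 0.8,
    max_retries: int = 3,
) -> Optional[Retrieval]:
    """Return the first result at or above threshold, or None after max_retries."""
    for attempt in range(max_retries):
        result = retrieve(query)
        if result.confidence_score >= threshold:
            return result
        query = reformulate(query, attempt)
    return None  # caller escalates to a human

# Stub retriever: confident only when the query mentions "return policy".
def fake_retrieve(query: str) -> Retrieval:
    score = 0.9 if "return policy" in query else 0.4
    return Retrieval(text=f"hit for {query!r}", confidence_score=score)

def fake_reformulate(query: str, attempt: int) -> str:
    return query + " return policy"

hit = retrieve_with_retries("battery refund", fake_retrieve, fake_reformulate)
miss = retrieve_with_retries("battery refund", fake_retrieve, lambda q, a: q)
```

The first call succeeds on the second attempt after one reformulation; the second never improves the query, exhausts its retries, and returns `None`, which is the escalation case.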

🔗 How the Three Layers Connect

The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:

1. Customer browses → Pub/Sub event → Personalization Engine 
                      surfaces relevant products (Layer 1)

2. Inventory drops below threshold → Pub/Sub alert → 
   Orchestrator Agent dispatches reorder across 3 specialist 
   agents in parallel (Layer 2)

3. Customer asks: "Is this in stock?" → Customer Agent (Layer 2) 
   → Agentic RAG (Layer 3) queries BigQuery live + policy index
   → grounded, cited answer in < 2s

4. All events → BigQuery agent_decision_log + interaction_log
   → weekly eval reports + model retraining for Layers 1 & 3

The feedback loop is the platform. Every interaction trains the next version of every model.

📊 Observability — One Dashboard, Three Layers

All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:

Metric	Layer	Source Table
Recommendation CTR by segment	Personalization	`interaction_log`
ANN retrieval latency p99	Personalization	`serving_metrics`
Agent task parallelism ratio	Multi-Agent	`agent_decision_log`
Reorder decision accuracy	Multi-Agent	`agent_decision_log`
RAG retrieval precision@5	Agentic RAG	`agent_query_log`
Re-query rate	Agentic RAG	`agent_query_log`

When retrieval precision drops, you know before customers notice.
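Precision@5 is the share of the top five retrieved chunks that are actually relevant. A minimal sketch, assuming relevance labels are available as a set of chunk IDs (for example from human review); the IDs are invented:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / k

retrieved = ["c1", "c7", "c3", "c9", "c4", "c2"]
relevant = {"c1", "c3", "c4", "c8"}
score = precision_at_k(retrieved, relevant)
```

Three of the top five chunks (c1, c3, c4) are relevant, giving 0.6. Dividing by k rather than by the number of results returned means a retriever that returns fewer than five chunks is penalised for the empty slots.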

🚀 Where to Start

Don't try to ship all three layers at once. One workable sequencing:

Week 1–4: Lay the data foundation

Set up BigQuery tables: inventory_snapshot, interaction_log, agent_decision_log
Stand up Pub/Sub topics and Dataflow streaming job
This infrastructure is shared by all three layers — do it once, use it everywhere

Week 5–8: Ship Personalization (Layer 1)

Train a two-tower model on BigQuery interaction history
Index item embeddings into Vertex AI Matching Engine
Wire up Cloud Run serving API
Measure: recommendation CTR vs. batch baseline

Week 9–12: Add Multi-Agent Ops (Layer 2)

Start with two agents: Inventory + Pricing
Orchestrator on Vertex AI Reasoning Engine
Add Supplier Agent once the first two are stable

Week 13–16: Add Agentic RAG (Layer 3)

Index return policy + product catalog into Vertex AI Search
Wire the BigQuery inventory tool into the agent
Deploy as the Customer Agent inside your multi-agent system

The Pub/Sub bus means each new layer plugs in without touching what already works.

💡 Key Takeaways

Share infrastructure, not code. BigQuery and Pub/Sub serve all three layers. Build them once.
The Customer Agent IS Agentic RAG. Don't build these as separate projects.
The agent_decision_log is your most valuable table. It is your audit trail, your eval dataset, and your retraining signal.
Personalization cold start is solved by context, not history. Device + time + location gives new users relevant results before any history exists.
Hybrid retrieval beats pure vector search for retail. BM25 handles part numbers and model codes that semantic search misses.
