<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: prithviraj.veluchamy@gmail.com</title>
    <description>The latest articles on DEV Community by prithviraj.veluchamy@gmail.com (@prithvirajveluchamy).</description>
    <link>https://dev.to/prithvirajveluchamy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3833799%2Fad1dd232-75f1-465f-82f2-1c561750020d.png</url>
      <title>DEV Community: prithviraj.veluchamy@gmail.com</title>
      <link>https://dev.to/prithvirajveluchamy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prithvirajveluchamy"/>
    <language>en</language>
    <item>
      <title>LLM Evaluation &amp; Observability in Production Retail Systems on GCP</title>
      <dc:creator>prithviraj.veluchamy@gmail.com</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:06:00 +0000</pubDate>
      <link>https://dev.to/prithvirajveluchamy/llm-evaluation-observability-in-production-retail-systems-on-gcp-3hd3</link>
      <guid>https://dev.to/prithvirajveluchamy/llm-evaluation-observability-in-production-retail-systems-on-gcp-3hd3</guid>
      <description>&lt;p&gt;&lt;em&gt;Most teams know when their LLM is wrong after a customer complains. Production-grade retail AI requires knowing before that — with metrics, traces, and automated eval pipelines that catch drift, hallucination, and degradation continuously. This article shows you how to build that system on GCP.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Why LLM Observability in Retail Is Different
&lt;/h2&gt;

&lt;p&gt;Traditional ML observability tracks distribution drift on structured features and monitors a single scalar metric — accuracy, RMSE, AUC. LLMs break this model in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outputs are unstructured.&lt;/strong&gt; There is no ground-truth label for "did the agent give a good answer?" arriving in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes are silent.&lt;/strong&gt; A hallucinated return policy answer looks identical to a correct one in your latency dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context windows change behavior.&lt;/strong&gt; The same model behaves differently depending on what is in the prompt — retrieved chunks, session history, tool results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In retail specifically, the stakes are asymmetric. A mis-personalized recommendation costs a click. A hallucinated return policy answer costs a customer, a refund, and potentially a chargeback.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; Treat every LLM inference as a structured event with inputs, outputs, retrieved context, tool calls, and a confidence signal — not just a latency measurement.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🏗️ The Observability Stack
&lt;/h2&gt;

&lt;p&gt;The system is built on four GCP components working in concert:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Logging + Log Router&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Capture structured inference events from Cloud Run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Central eval store — every inference logged with full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vertex AI Evaluation Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated metric computation (rouge, BLEU, coherence, groundedness)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Looker Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time eval dashboard over BigQuery eval tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And one additional layer that makes it production-grade:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vertex AI Pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scheduled eval pipeline — nightly batch scoring of sampled inferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Monitoring + Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Threshold-based alerts when eval metrics degrade&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  📐 The Inference Event Schema
&lt;/h2&gt;

&lt;p&gt;Everything starts with logging the right data. Every LLM inference — whether personalization re-ranking, agent orchestration, or RAG answer generation — emits a structured event to Cloud Logging, which Log Router sinks to BigQuery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cloud Run inference handler — structured logging
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gcp_logging&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gcp_logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_inference_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_inference_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# "personalization" | "multi_agent" | "agentic_rag"
&lt;/span&gt;    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;self_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# model's own logprob-derived confidence
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_struct&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;             &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# [{chunk_id, score, source}]
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# [{tool, input, output, latency_ms}]
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;self_confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single schema powers everything downstream — dashboards, evals, fine-tuning, and alerting.&lt;/p&gt;




&lt;h2&gt;
  
  
  🗄️ BigQuery Eval Schema
&lt;/h2&gt;

&lt;p&gt;Log Router sinks the inference events into a partitioned BigQuery table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;retail_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_inference_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;        &lt;span class="n"&gt;STRING&lt;/span&gt;        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt;                &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;layer&lt;/span&gt;             &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;-- personalization | multi_agent | agentic_rag&lt;/span&gt;
  &lt;span class="n"&gt;agent_id&lt;/span&gt;          &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_id&lt;/span&gt;          &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;     &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;completion_tokens&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;  &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
    &lt;span class="n"&gt;chunk_id&lt;/span&gt;        &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relevance_score&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;source&lt;/span&gt;          &lt;span class="n"&gt;STRING&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tool_calls&lt;/span&gt;        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;       &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_payload&lt;/span&gt;   &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_payload&lt;/span&gt;  &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;      &lt;span class="n"&gt;INT64&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;output&lt;/span&gt;            &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency_ms&lt;/span&gt;        &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;self_confidence&lt;/span&gt;   &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;-- Populated by the nightly eval pipeline:&lt;/span&gt;
  &lt;span class="n"&gt;groundedness_score&lt;/span&gt;  &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;coherence_score&lt;/span&gt;     &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rouge_l&lt;/span&gt;             &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;hallucination_flag&lt;/span&gt;  &lt;span class="nb"&gt;BOOL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;human_label&lt;/span&gt;         &lt;span class="n"&gt;STRING&lt;/span&gt;        &lt;span class="c1"&gt;-- populated for sampled reviews&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioned by date, clustered by layer and agent — eval queries over 30-day windows run in seconds, not minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ The Nightly Eval Pipeline (Vertex AI Pipelines)
&lt;/h2&gt;

&lt;p&gt;Automated evaluation runs nightly over a stratified sample of the previous day's inferences. The pipeline has four stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Sample Inferences] → [Compute Automatic Metrics] → [Flag Anomalies] → [Write Eval Results]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 1 — Stratified Sampling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@component&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_inferences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sample_per_layer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Stratified sample: 200 inferences per layer per day.
    Oversample low-confidence inferences (self_confidence &amp;lt; 0.6).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        WITH ranked AS (
          SELECT *,
            ROW_NUMBER() OVER (
              PARTITION BY layer
              ORDER BY
                -- oversample low-confidence
                CASE WHEN self_confidence &amp;lt; 0.6 THEN 0 ELSE 1 END,
                RAND()
            ) AS rn
          FROM retail_ai.llm_inference_log
          WHERE DATE(ts) = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        )
        SELECT * FROM ranked WHERE rn &amp;lt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sample_per_layer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oversampling low-confidence inferences ensures the eval pipeline focuses attention where failures are most likely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2 — Automatic Metric Computation
&lt;/h3&gt;

&lt;p&gt;Three metric families are computed for every sampled inference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2a. Groundedness (Retrieval Faithfulness)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For RAG inferences — does the output make claims supported by the retrieved chunks?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@component&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_groundedness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inf&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# skip non-RAG inferences
&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are an evaluator. Given the retrieved context and the model output below,
score the output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s groundedness on a scale of 0.0 to 1.0.
Groundedness = every factual claim in the output is directly supported by the context.
Penalise any claim that is not traceable to the context.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Output:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Return ONLY a JSON object: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;float&amp;gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2b. Coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Does the output make logical sense as a response to the inferred query?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;coherence_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Rate the coherence of the following retail AI response on a scale of 0.0 to 1.0.
Coherence = the response is logically consistent, fluent, and directly addresses
the implied user need. Penalise contradictions, non-sequiturs, or incomplete answers.

Response: {output}

Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;float&amp;gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2c. Hallucination Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A binary flag for responses that assert specific facts (prices, stock counts, policy terms) not present in the retrieved context or tool outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@component&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flag_hallucinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inf&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tool_facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_tool_facts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;chunk_facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_chunk_facts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;all_grounded_facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_facts&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;chunk_facts&lt;/span&gt;

        &lt;span class="n"&gt;hallucination_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a fact-checker for a retail AI system.
Grounded facts available to the model:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_grounded_facts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Model output:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Does the output assert any specific factual claim (price, stock count, 
policy term, date, SKU) that is NOT present in the grounded facts above?

Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true|false, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hallucination_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 3 — Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;After scoring, the pipeline computes rolling 7-day baselines and flags sessions where today's metrics fall more than 2σ below baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@component&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_anomalies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;baseline_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT
          layer,
          AVG(groundedness_score)  AS baseline_groundedness,
          STDDEV(groundedness_score) AS stddev_groundedness,
          AVG(coherence_score)     AS baseline_coherence,
          AVG(CAST(hallucination_flag AS INT64)) AS baseline_hallucination_rate
        FROM retail_ai.llm_inference_log
        WHERE DATE(ts) BETWEEN DATE_SUB(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, INTERVAL 7 DAY)
                           AND DATE_SUB(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, INTERVAL 1 DAY)
        GROUP BY layer
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;today_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT
          layer,
          AVG(groundedness_score)  AS today_groundedness,
          AVG(coherence_score)     AS today_coherence,
          AVG(CAST(hallucination_flag AS INT64)) AS today_hallucination_rate
        FROM retail_ai.llm_inference_log
        WHERE DATE(ts) = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        GROUP BY layer
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;today_query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;anomalies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;today_groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stddev_groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;today_groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomalies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  📊 The Eval Dashboard (Looker Studio over BigQuery)
&lt;/h2&gt;

&lt;p&gt;Five views cover the full platform — one per layer plus a cross-layer summary:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily Metric Trends (all layers)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groundedness_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_groundedness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coherence_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_coherence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hallucination_flag&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hallucination_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;APPROX_QUANTILES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="k"&gt;OFFSET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p99_latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;inference_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_inference_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Low-Confidence Inference Drill-Down&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;self_confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;groundedness_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;hallucination_flag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;output&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_inference_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self_confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;hallucination_flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;self_confidence&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Retrieval Quality (RAG Layer)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;                        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;source_doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevance_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_retrieval_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groundedness_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_groundedness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;times_retrieved&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_inference_log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;times_retrieved&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query surfaces which source documents are being retrieved most frequently and whether they actually contribute to grounded answers — your index quality scorecard.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 Alerting — Cloud Monitoring Policies
&lt;/h2&gt;

&lt;p&gt;Three alert policies cover the most critical failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy 1 — Hallucination Rate Spike&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cloud Monitoring alert policy&lt;/span&gt;
&lt;span class="na"&gt;displayName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Hallucination&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Spike"&lt;/span&gt;
&lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;displayName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(7-day&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rolling)"&lt;/span&gt;
    &lt;span class="na"&gt;conditionThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="s"&gt;resource.type="bigquery_table"&lt;/span&gt;
        &lt;span class="s"&gt;metric.type="custom.googleapis.com/retail_ai/hallucination_rate"&lt;/span&gt;
      &lt;span class="na"&gt;comparison&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COMPARISON_GT&lt;/span&gt;
      &lt;span class="na"&gt;thresholdValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;
      &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3600s&lt;/span&gt;   &lt;span class="c1"&gt;# sustained for 1 hour before alert fires&lt;/span&gt;
&lt;span class="na"&gt;alertStrategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;notificationRateLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3600s&lt;/span&gt;
&lt;span class="na"&gt;notificationChannels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;projects/PROJECT_ID/notificationChannels/PAGERDUTY_CHANNEL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Policy 2 — Groundedness Drop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fires when the 24-hour average groundedness score for the RAG layer falls below 0.75 — typically indicating index drift (stale documents) or a retrieval pipeline failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy 3 — p99 Latency Breach&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Metric written by the serving layer after every inference
&lt;/span&gt;&lt;span class="n"&gt;monitoring_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_time_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;time_series&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom.googleapis.com/retail_ai/inference_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int64Value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;}}]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔁 Closing the Loop — Eval-Driven Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;The eval pipeline does more than surface problems — it produces the training data for the next model version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Weekly: export flagged inferences for human review + supervised fine-tuning
&lt;/span&gt;&lt;span class="n"&gt;export_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT
      request_id,
      output                AS model_output,
      groundedness_score,
      hallucination_flag,
      human_label,           -- populated by reviewer workflow
      retrieved_chunks,
      tool_calls
    FROM retail_ai.llm_inference_log
    WHERE DATE(ts) &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
      AND (hallucination_flag = TRUE OR groundedness_score &amp;lt; 0.70)
      AND human_label IS NOT NULL
    ORDER BY groundedness_score ASC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;export_query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Write to GCS in JSONL format for Vertex AI supervised fine-tuning
&lt;/span&gt;&lt;span class="n"&gt;flagged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://retail-ai-tuning/weekly/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eval_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/flagged_inferences.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The human review workflow — a lightweight Cloud Run app backed by the same BigQuery table — lets domain experts label outputs as correct, incorrect, or needs-revision. Those labels feed directly into Vertex AI supervised fine-tuning jobs, closing the eval-to-improvement loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 How This Connects to the Unified Platform
&lt;/h2&gt;

&lt;p&gt;This observability layer is not bolt-on — it is wired into the three-layer retail AI platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform Layer&lt;/th&gt;
&lt;th&gt;What Eval Monitors&lt;/th&gt;
&lt;th&gt;Key Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Personalization&lt;/td&gt;
&lt;td&gt;Re-ranking model output quality, cold-start recommendation coherence&lt;/td&gt;
&lt;td&gt;Coherence score by user segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Agent Ops&lt;/td&gt;
&lt;td&gt;Agent decision accuracy, tool call success rate, orchestrator reasoning quality&lt;/td&gt;
&lt;td&gt;Tool call success rate, hallucination rate per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic RAG&lt;/td&gt;
&lt;td&gt;Retrieval groundedness, index freshness, self-correction trigger rate&lt;/td&gt;
&lt;td&gt;Groundedness score, re-query rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-layer&lt;/td&gt;
&lt;td&gt;End-to-end latency, overall hallucination rate, human escalation rate&lt;/td&gt;
&lt;td&gt;p99 latency, hallucination rate, escalation rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;llm_inference_log&lt;/code&gt; table is the single source of truth for all of these — one schema, one dashboard, one alert policy set.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log the full context, not just the output.&lt;/strong&gt; Retrieved chunks, tool call inputs/outputs, and self-confidence scores are what make downstream eval possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use LLM-as-judge for groundedness and coherence.&lt;/strong&gt; Gemini evaluating Gemini outputs at scale is practical and cost-effective for production eval at retail volumes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oversample low-confidence inferences.&lt;/strong&gt; Your eval budget is limited — concentrate it where failures are most likely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval quality IS answer quality.&lt;/strong&gt; The &lt;code&gt;source_doc&lt;/code&gt; query above will tell you which documents are dragging your groundedness scores down. Fix the index before fine-tuning the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The eval pipeline is your fine-tuning dataset.&lt;/strong&gt; Every flagged, human-reviewed inference is a supervised training example. Treat it that way from day one.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Where to Start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1:&lt;/strong&gt; Instrument your Cloud Run inference handlers with structured logging using the schema above. Sink to BigQuery via Log Router.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2:&lt;/strong&gt; Write the five dashboard queries. Stand up Looker Studio. You now have observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3:&lt;/strong&gt; Add the nightly Vertex AI Pipeline for groundedness and hallucination scoring on a 200-inference sample.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4:&lt;/strong&gt; Set up the three Cloud Monitoring alert policies. Add a human review queue for flagged inferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; Feed human-labeled inferences back into Vertex AI fine-tuning jobs monthly.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;The schema is the hardest part. Get it right in Week 1 and everything else is additive.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>vertexai</category>
      <category>observability</category>
    </item>
    <item>
      <title>Building an AI-Native Retail Platform on GCP: Personalization + Multi-Agent Ops + Agentic RAG as One Unified Stack</title>
      <dc:creator>prithviraj.veluchamy@gmail.com</dc:creator>
      <pubDate>Mon, 23 Mar 2026 00:44:33 +0000</pubDate>
      <link>https://dev.to/prithvirajveluchamy/building-an-ai-native-retail-platform-on-gcp-personalization-multi-agent-ops-agentic-rag-as-lib</link>
      <guid>https://dev.to/prithvirajveluchamy/building-an-ai-native-retail-platform-on-gcp-personalization-multi-agent-ops-agentic-rag-as-lib</guid>
      <description>&lt;p&gt;&lt;em&gt;A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat — the answer comes back grounded in live inventory and your return policy, cited and accurate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is not three separate AI projects. It is one unified platform — and this article shows you how to build it on GCP.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ The Three Layers of an AI-Native Retail Platform
&lt;/h2&gt;

&lt;p&gt;Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;GCP Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-Time Personalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Surfaces relevant products from millions of SKUs in &amp;lt; 120ms&lt;/td&gt;
&lt;td&gt;Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coordinates inventory, pricing, supplier, and customer agents in parallel&lt;/td&gt;
&lt;td&gt;Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Answers complex queries grounded in live data + policy docs&lt;/td&gt;
&lt;td&gt;Vertex AI Search, Gemini, BigQuery (as a live tool)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;all three layers share the same data backbone&lt;/strong&gt; — BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  📐 Unified Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                        FRONTEND / API GATEWAY                   │
└───────────┬──────────────────┬───────────────────┬─────────────┘
            │                  │                   │
    ┌───────▼──────┐  ┌────────▼───────┐  ┌───────▼──────────┐
    │ PERSONALI-   │  │  MULTI-AGENT   │  │   AGENTIC RAG    │
    │ ZATION       │  │  ORCHESTRATOR  │  │   (Customer Q&amp;amp;A) │
    │ ENGINE       │  │  (Gemini 1.5)  │  │   (Gemini +      │
    │ (Cloud Run)  │  │  (Vertex AI    │  │    Vertex Search) │
    └───────┬──────┘  │   Reasoning)   │  └───────┬──────────┘
            │         └────────┬───────┘          │
            │                  │                  │
            └──────────────────┼──────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │         GOOGLE CLOUD PUB/SUB     │
              │         (Shared Event Spine)      │
              └───┬──────────┬──────────┬────────┘
                  │          │          │
          ┌───────▼──┐ ┌─────▼────┐ ┌──▼────────────┐
          │ Dataflow  │ │Specialist│ │ Vertex AI     │
          │ Streaming │ │ Agents   │ │ Search Index  │
          └───────┬──┘ └─────┬────┘ └──┬────────────┘
                  │          │          │
              ┌───▼──────────▼──────────▼───┐
              │          BIGQUERY            │
              │   (Shared Operational Store) │
              └─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🎯 Layer 1: Real-Time Personalization Engine
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Daily batch recommendations ignore the most powerful signal available: &lt;strong&gt;what the user is doing right now&lt;/strong&gt;. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; Personalization is a retrieval problem. Given a user and their context &lt;em&gt;right now&lt;/em&gt;, find the items most likely to convert — in under 120ms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Six-Stage Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Event Capture (Pub/Sub)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget — it does not wait for a response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CART_ADD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"u_8821"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sku_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SKU-4471"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s_992abc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-22T14:03:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"device"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mobile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Atlanta, GA"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2 — Stream Enrichment (Dataflow)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session feature update → Vertex AI Feature Store (&amp;lt; 5s latency)&lt;/li&gt;
&lt;li&gt;Interaction log → BigQuery (for offline model training)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 3 — Feature Assembly (Vertex AI Feature Store)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At query time, three feature groups are fetched in a single low-latency call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_store_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_feature_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entity_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;feature_selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id_matcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_clicks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 4 — ANN Retrieval (Vertex AI Matching Engine)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The assembled user context vector is submitted to Matching Engine — Google's managed ANN index. It returns the top 50 candidate SKUs from a catalog of millions &lt;strong&gt;in under 10ms&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index_endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_neighbors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;deployed_index_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retail_item_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_context_vector&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;num_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood: Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 5 — Re-Ranking (Vertex AI Prediction)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current inventory level&lt;/li&gt;
&lt;li&gt;Promotional pricing flag&lt;/li&gt;
&lt;li&gt;User's price sensitivity segment&lt;/li&gt;
&lt;li&gt;Real-time trend score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 6 — Serve (Cloud Run)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Top 10 results + display metadata returned to the frontend. End-to-end: &lt;strong&gt;&amp;lt; 120ms at p99&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Cold Start
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New user (no history)&lt;/td&gt;
&lt;td&gt;Serve contextual top-trending items by device + time + location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New item (no interactions)&lt;/td&gt;
&lt;td&gt;Content-based embedding from product description + image on ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After first click&lt;/td&gt;
&lt;td&gt;Session features kick in within 5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🤖 Layer 2: Multi-Agent Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;A single LLM handling all retail operations hits three walls: &lt;strong&gt;context overload&lt;/strong&gt;, &lt;strong&gt;sequential latency&lt;/strong&gt;, and &lt;strong&gt;unmaintainable prompts&lt;/strong&gt;. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context — reasoning quality degrades.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Agent Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Operator / System Trigger
        │
        ▼
┌─────────────────────────────────┐
│  ORCHESTRATOR AGENT             │
│  Gemini 1.5 Pro                 │
│  Vertex AI Reasoning Engine     │
│  - Decomposes tasks             │
│  - Routes to specialists        │
│  - Synthesizes final response   │
└────┬──────────┬──────────┬──────┘
     │  Pub/Sub │          │
     ▼          ▼          ▼
┌─────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐
│Inventory│ │Pricing │ │Supplier  │ │Customer  │
│Agent    │ │Agent   │ │Agent     │ │Agent     │
│BigQuery │ │BQ ML   │ │Vertex AI │ │Agentic   │
│         │ │        │ │Search    │ │RAG ←────── Layer 3
└─────────┘ └────────┘ └──────────┘ └──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice: &lt;strong&gt;the Customer Agent IS Layer 3&lt;/strong&gt; — Agentic RAG is not separate, it is the intelligence layer of the Customer Agent. This is where the three layers connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Reorder Request — Traced End-to-End
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; &lt;code&gt;"Should we reorder SKU-991?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Decompose:&lt;/strong&gt; Orchestrator identifies three parallel sub-tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decompose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → [
#     {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#     {"agent": "supplier",  "task": "get_eta_and_cost", "sku": "SKU-991"},
#     {"agent": "pricing",   "task": "get_reorder_cost", "sku": "SKU-991"}
# ]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Dispatch:&lt;/strong&gt; All three tasks published to Pub/Sub simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Execute in Parallel:&lt;/strong&gt; Each Cloud Run agent handles its task independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inventory Agent
&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SKU-991&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; AND store_id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DC-ATL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pricing Agent (BigQuery ML)
&lt;/span&gt;&lt;span class="n"&gt;reorder_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT ML.PREDICT(MODEL `retail.pricing_model`,
        (SELECT * FROM pricing_signals WHERE sku_id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SKU-991&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;))
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 — Synthesize:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orchestrator → "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days. 
                Current stock: 8 units (below reorder threshold of 15)." ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Total time = max(slowest agent)&lt;/strong&gt; — not the sum of all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pub/Sub Design — Why It Matters
&lt;/h3&gt;

&lt;p&gt;Three properties you get for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loose coupling:&lt;/strong&gt; agents have no direct dependency on each other, only on topic names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance:&lt;/strong&gt; if an agent crashes, the message is retained and redelivered on recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent scaling:&lt;/strong&gt; each Cloud Run agent scales on its own Pub/Sub queue depth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Shared Memory: The &lt;code&gt;agent_decision_log&lt;/code&gt; Table
&lt;/h3&gt;

&lt;p&gt;Every orchestrated request is fully logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_decision_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;      &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt;              &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;agent_called&lt;/span&gt;    &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tools_used&lt;/span&gt;      &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_payload&lt;/span&gt;   &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;output_payload&lt;/span&gt;  &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency_ms&lt;/span&gt;      &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;confidence&lt;/span&gt;      &lt;span class="n"&gt;FLOAT64&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This table powers weekly evaluation reports and feeds back into model fine-tuning — your audit trail is also your training dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Layer 3: Agentic RAG for Retail Knowledge
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Standard RAG (embed query → retrieve chunks → generate) fails retail because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single customer question often spans multiple knowledge domains (policy + inventory + product specs)&lt;/li&gt;
&lt;li&gt;Inventory data goes stale in minutes — you cannot index it as static documents&lt;/li&gt;
&lt;li&gt;Retrieval confidence varies — a system that cannot detect low-confidence answers will hallucinate&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Three Retrieval Sources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Policy &amp;amp; Compliance Index (Vertex AI Search)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).&lt;/p&gt;

&lt;p&gt;BM25 matters here: product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Product Catalog Index (Vertex AI Search)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so &lt;code&gt;"waterproof jacket similar to this one"&lt;/code&gt; works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Live Operational Data (BigQuery as a Tool)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inventory levels, order status, real-time pricing — &lt;strong&gt;not indexed as documents but called as a live tool&lt;/strong&gt;. This is the key architectural decision that prevents stale answers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;VertexAISearchTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retail_policy_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;VertexAISearchTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retail_product_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;BigQueryTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INVENTORY_QUERY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# live call, not indexed
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Query Decomposition in Action
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Customer query:&lt;/strong&gt; &lt;code&gt;"Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;Plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;Sub&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"online purchase battery return policy in-store"&lt;/span&gt;
  &lt;span class="n"&gt;Sub&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;BigQuery&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;units_available&lt;/span&gt; 
                               &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory_snapshot&lt;/span&gt; 
                               &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sku_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'SKU-4471'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'GA-CUMMING'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent validates Sub-query A:&lt;/strong&gt; relevance score &amp;gt; 0.82 threshold ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent validates Sub-query B:&lt;/strong&gt; live data, timestamp 2 minutes ago ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthesized answer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Yes — online purchases can be returned in-store within 90 days (Policy §3.2). 
The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA 
as of 14:07 EST today."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every fact is cited. No hallucination. No "please check the website."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Self-Correction Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vertex_search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="c1"&gt;# Reformulate: broaden scope, try synonyms, switch retrieval mode
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reformulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After max retries: escalate to human agent queue
&lt;/span&gt;&lt;span class="nf"&gt;escalate_to_human&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop means your system knows what it does not know — and routes accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 How the Three Layers Connect
&lt;/h2&gt;

&lt;p&gt;The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Customer browses → Pub/Sub event → Personalization Engine 
                      surfaces relevant products (Layer 1)

2. Inventory drops below threshold → Pub/Sub alert → 
   Orchestrator Agent dispatches reorder across 3 specialist 
   agents in parallel (Layer 2)

3. Customer asks: "Is this in stock?" → Customer Agent (Layer 2) 
   → Agentic RAG (Layer 3) queries BigQuery live + policy index
   → grounded, cited answer in &amp;lt; 2s

4. All events → BigQuery agent_decision_log + interaction_log
   → weekly eval reports + model retraining for Layers 1 &amp;amp; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The feedback loop is the platform.&lt;/strong&gt; Every interaction trains the next version of every model.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Observability — One Dashboard, Three Layers
&lt;/h2&gt;

&lt;p&gt;All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Source Table&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recommendation CTR by segment&lt;/td&gt;
&lt;td&gt;Personalization&lt;/td&gt;
&lt;td&gt;&lt;code&gt;interaction_log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ANN retrieval latency p99&lt;/td&gt;
&lt;td&gt;Personalization&lt;/td&gt;
&lt;td&gt;&lt;code&gt;serving_metrics&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent task parallelism ratio&lt;/td&gt;
&lt;td&gt;Multi-Agent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_decision_log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reorder decision accuracy&lt;/td&gt;
&lt;td&gt;Multi-Agent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_decision_log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG retrieval precision@5&lt;/td&gt;
&lt;td&gt;Agentic RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_query_log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-query rate&lt;/td&gt;
&lt;td&gt;Agentic RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_query_log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When retrieval precision drops, you know before customers notice.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Where to Start
&lt;/h2&gt;

&lt;p&gt;Don't try to ship all three layers at once. Here is a proven sequencing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1–4: Lay the data foundation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up BigQuery tables: &lt;code&gt;inventory_snapshot&lt;/code&gt;, &lt;code&gt;interaction_log&lt;/code&gt;, &lt;code&gt;agent_decision_log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stand up Pub/Sub topics and Dataflow streaming job&lt;/li&gt;
&lt;li&gt;This infrastructure is shared by all three layers — do it once, use it everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 5–8: Ship Personalization (Layer 1)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train a two-tower model on BigQuery interaction history&lt;/li&gt;
&lt;li&gt;Index item embeddings into Vertex AI Matching Engine&lt;/li&gt;
&lt;li&gt;Wire up Cloud Run serving API&lt;/li&gt;
&lt;li&gt;Measure: recommendation CTR vs. batch baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 9–12: Add Multi-Agent Ops (Layer 2)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with two agents: Inventory + Pricing&lt;/li&gt;
&lt;li&gt;Orchestrator on Vertex AI Reasoning Engine&lt;/li&gt;
&lt;li&gt;Add Supplier Agent once the first two are stable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 13–16: Add Agentic RAG (Layer 3)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index return policy + product catalog into Vertex AI Search&lt;/li&gt;
&lt;li&gt;Wire the BigQuery inventory tool into the agent&lt;/li&gt;
&lt;li&gt;Deploy as the Customer Agent inside your multi-agent system&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The Pub/Sub bus means each new layer plugs in without touching what already works.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Share infrastructure, not code.&lt;/strong&gt; BigQuery and Pub/Sub serve all three layers. Build them once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Customer Agent IS Agentic RAG.&lt;/strong&gt; Don't build these as separate projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;agent_decision_log&lt;/code&gt; is your most valuable table.&lt;/strong&gt; It is your audit trail, your eval dataset, and your retraining signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization cold start is solved by context, not history.&lt;/strong&gt; Device + time + location gets you 80% of the way there for new users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid retrieval beats pure vector search for retail.&lt;/strong&gt; BM25 handles part numbers and model codes that semantic search misses.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gcp</category>
      <category>vertexai</category>
      <category>machinelearning</category>
      <category>retail</category>
    </item>
  </channel>
</rss>
