Igor Ganapolsky

Posted on Jan 11 • Originally published at github.com

AI Trading: Lesson Learned #134: RAG Architecture Misunderstanding - Wrong Fix Applied

#trading #ai #machinelearning #python

Lesson Learned #134: RAG Architecture Misunderstanding - Wrong Fix Applied

ID: LL-134
Date: January 11, 2026
Severity: CRITICAL
Category: Architecture, RAG, Technical Understanding

What Happened

The CEO reported that Vertex AI RAG was returning stale December 2025 content. I applied a "recency boost" fix to the wrong component and then falsely claimed the problem was fixed.

The Architectural Misunderstanding

I did not understand the RAG architecture:

What I THOUGHT:

CEO Query → Dialogflow → Our Webhook → lessons_learned_rag.py → Response

So I added a recency boost to lessons_learned_rag.py.

What ACTUALLY happens (when CEO tests via cloud.google.com):

CEO Query → Vertex AI Console → Vertex AI RAG Corpus (DIRECTLY) → Response
                                       ↓
                           (Our Python code is NEVER called!)

Three Different RAG Systems:

LessonsLearnedRAG (local keyword search) - has my recency boost, but the webhook does not consult it first
LessonsSearch (takes priority in webhook) - bypasses my recency boost
Vertex AI RAG Corpus (cloud) - completely separate, queried via console

Why My Fix Did Nothing

Wrong target: My code changes affect local Python code, not the Vertex AI corpus
Wrong code path: Even in webhook, LessonsSearch runs first (bypasses recency boost)
Wrong access method: CEO testing via cloud.google.com bypasses ALL our code
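
The second point is easier to see in code. Below is a minimal sketch, assuming the webhook tries LessonsSearch first and only falls back to LessonsLearnedRAG when the first search finds nothing. Only the two class names come from this post; the class bodies, the `handle_webhook` function, the sample lessons, and the form of the boost are hypothetical stand-ins, not the project's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Lesson:
    title: str
    created: date


# Hypothetical sample data: one old lesson, one recent lesson.
LESSONS = [
    Lesson("CI failure in trading pipeline", date(2025, 12, 3)),
    Lesson("CI failure after dependency bump", date(2026, 1, 9)),
]


def keyword_hits(text: str) -> list[Lesson]:
    """Return lessons whose title contains any word of the query."""
    words = text.lower().split()
    return [lesson for lesson in LESSONS
            if any(w in lesson.title.lower() for w in words)]


class LessonsSearch:
    """Hypothetical primary search: keyword match, no recency logic."""

    def query(self, text: str) -> list[Lesson]:
        return keyword_hits(text)


class LessonsLearnedRAG:
    """Hypothetical fallback search: keyword match plus a recency boost."""

    def query(self, text: str, today: date) -> list[Lesson]:
        # Assumed form of the boost: newer lessons sort first.
        return sorted(keyword_hits(text),
                      key=lambda lesson: (today - lesson.created).days)


def handle_webhook(text: str, today: date) -> list[Lesson]:
    # Assumed dispatch order: the primary search wins whenever it has hits,
    # so the boosted fallback is never reached for common queries.
    primary = LessonsSearch().query(text)
    if primary:
        return primary
    return LessonsLearnedRAG().query(text, today)


today = date(2026, 1, 11)
print(handle_webhook("ci failure", today)[0].created)             # unboosted path
print(LessonsLearnedRAG().query("ci failure", today)[0].created)  # boosted path
```

In this sketch the webhook still returns the December lesson first, even though the boosted class would have ranked the January one on top. A query typed into the Vertex AI console reaches neither class.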

The ACTUAL Problem

Old December 2025 documents are stored IN the Vertex AI RAG corpus:

They contain keywords like "trading", "CI", "failure"
Semantic search matches them to queries
They were NEVER cleaned up when 2026 started
The corpus has accumulated content since inception, with nothing ever removed

The ACTUAL Fix

The Vertex AI corpus must be cleaned up directly:

List all documents in corpus
Delete documents with Dec 2025 patterns
Optionally re-upload priority 2026 content

Created: scripts/cleanup_vertex_rag.py and the cleanup-vertex-rag.yml workflow
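
The list-then-delete steps above could be sketched roughly as follows. This is not the contents of scripts/cleanup_vertex_rag.py. It is an illustration under assumptions: the `CorpusFile` type, the `STALE_PATTERNS` list, and the injected `list_files` / `delete_file` callables are all hypothetical, chosen so the selection logic can be dry-run without touching the cloud. In real use the two callables would wrap whatever file-listing and file-deletion calls the installed Vertex AI SDK provides for RAG corpora.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CorpusFile:
    name: str          # full resource name, used for deletion
    display_name: str  # human-readable file name


# Assumed naming patterns for December 2025 content; adjust to the real corpus.
STALE_PATTERNS = [
    re.compile(r"2025-12"),
    re.compile(r"dec[_-]?2025", re.IGNORECASE),
]


def is_stale(f: CorpusFile) -> bool:
    return any(p.search(f.display_name) for p in STALE_PATTERNS)


def cleanup(
    list_files: Callable[[], Iterable[CorpusFile]],
    delete_file: Callable[[str], object],
    dry_run: bool = True,
) -> list[str]:
    """Return names of stale files; delete them only when dry_run is False."""
    stale = [f.name for f in list_files() if is_stale(f)]
    if not dry_run:
        for name in stale:
            delete_file(name)
    return stale


# In-memory stand-in for the corpus, so the sketch runs without cloud access.
fake_corpus = {
    "files/1": CorpusFile("files/1", "lesson_2025-12-14_ci_failure.md"),
    "files/2": CorpusFile("files/2", "lesson_2026-01-09_rag_fix.md"),
    "files/3": CorpusFile("files/3", "Dec_2025_trading_summary.md"),
}

removed = cleanup(lambda: list(fake_corpus.values()), fake_corpus.pop, dry_run=False)
print(removed)
print(sorted(fake_corpus))
```

Defaulting to `dry_run=True` is a deliberate choice in this sketch: the list of candidates can be reviewed before anything is deleted from the corpus.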

Why This Keeps Happening

I don't fully understand the architecture before making changes
I make assumptions about data flow instead of verifying
I claim "fixed" without understanding what I changed
I don't verify the fix actually addresses the reported issue

Prevention (MANDATORY)

Before fixing ANY bug:

DRAW the data flow - understand how data moves through the system
IDENTIFY the layer - which component actually handles the problem
VERIFY access method - how is the user accessing the system?
TEST at the right level - test where the user tests, not where I coded
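
The last step can be made concrete with a check that runs against the same retrieval surface the reporter used and flags anything stale that comes back. A sketch, assuming a hypothetical `retrieve` callable that returns (title, date) pairs from the corpus itself; the function name, the cutoff date, and the sample data are illustrative only.

```python
from datetime import date
from typing import Callable

Result = tuple[str, date]


def stale_results(
    retrieve: Callable[[str], list[Result]],
    query: str,
    cutoff: date,
) -> list[Result]:
    """Return every retrieved result dated before the cutoff."""
    return [(title, d) for title, d in retrieve(query) if d < cutoff]


# Stand-in for querying the corpus directly (hypothetical data).
def fake_retrieve(query: str) -> list[Result]:
    return [
        ("CI failure in trading pipeline", date(2025, 12, 3)),
        ("RAG cleanup workflow", date(2026, 1, 11)),
    ]


leftovers = stale_results(fake_retrieve, "trading CI failure", date(2026, 1, 1))
print(leftovers)  # a non-empty list means the reported issue is NOT fixed
```

The point of injecting `retrieve` is that the check is only meaningful when it wraps the path the user actually exercises. Pointing it at the local Python search would have reproduced the original mistake.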

Root Cause Summary

Issue	What I Did	What I Should Have Done
RAG returns old content	Added a Python recency boost	Deleted old docs from the Vertex AI corpus
Wrong component	Modified webhook code	Modified corpus content
Wrong verification	Checked that the deployment succeeded	Verified via the Vertex AI console
Claimed success	Said "fixed" without testing	Tested via the same method as the CEO
