my rag bot thinks python is a snake

t12s — Thu, 19 Jun 2025 20:37:58 +0000

remember yesterday when i fixed my hallucination problem? woke up to this gem: "python decorators work like a python snake constricting its prey." my senior engineer just stared at me.

apparently fixing general hallucinations wasn't enough. now my bot was creatively misinterpreting every technical term it could find. kafka became literary analysis. circuit breakers became electrical safety lessons. had to fix this before the whole engineering team revolted.

quick answers for the desperate

Q: How can I detect when my LangChain RAG pipeline hallucinates technical terminology?
pattern matching for danger words works. if your bot explains "python" with "snake" or "kafka" with "author", you've got terminology hallucination. takes ~80ms to check.

Q: What's the most effective way to prevent domain terminology confusion in production RAG systems?
inject correct definitions before the llm sees anything. pre-populate context with your glossary. stopped 95% of our terminology disasters.

Q: Should I use pre-filtering or post-processing for terminology validation?
both. pre-filter removes obviously wrong contexts (python + reptile docs). post-process catches creative interpretations. belt and suspenders.
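
a rough sketch of the pre-filter half, assuming your retrieved docs expose a page_content string the way LangChain documents do. Doc, WRONG_SENSE, and prefilter_docs are names made up for this example, not part of LangChain, and the word lists are placeholders you'd tune for your own stack.

```python
import re
from dataclasses import dataclass, field

# hypothetical stand-in for a retrieved document (mirrors the page_content attribute)
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

# assumed mapping: technical term -> words that signal the wrong sense
WRONG_SENSE = {
    "python": ["snake", "reptile", "constrictor"],
    "kafka": ["franz", "novelist", "metamorphosis"],
}

def prefilter_docs(query, docs):
    """Drop docs that use a queried term in its non-technical sense."""
    q = query.lower()
    words = [w for term, bad in WRONG_SENSE.items() if term in q for w in bad]
    if not words:
        return list(docs)  # no risky term in the query, keep everything
    # whole-word match, optional plural "s"
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")s?\b", re.IGNORECASE
    )
    return [d for d in docs if not pattern.search(d.page_content)]

docs = [
    Doc("Python decorators wrap a function to extend its behavior."),
    Doc("The ball python is a constrictor snake native to Africa."),
]
kept = prefilter_docs("how do python decorators work?", docs)
print([d.page_content for d in kept])
```

this only runs when the query mentions a risky term, so unrelated questions pay nothing for it.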

Q: How do I handle ambiguous technical terms in my RAG pipeline?
force disambiguation in your prompts. explicitly state "Python (programming language, NOT the snake)". sounds dumb, works great.

the morning logs of shame

checked slack. it got worse:

user: "explain our circuit breaker pattern"
bot: "circuit breakers are electrical safety devices that stop current flow..."

user: "what's kafka in our stack?"
bot: "kafka, named after franz kafka, handles messages with existential reliability..."

we use hystrix, not electrical circuits. and that kafka explanation? our cto called it "poetic but useless."

why yesterday's fix missed this

my pattern detection caught lies about features. but terminology? different beast:

llms know multiple meanings (python = snake AND language)
retrieval gets partial matches
bot fills gaps with general knowledge

the 20-minute panic fix

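import re

class TerminologyValidator:
    def __init__(self):
        # the cursed words that break everything:
        # technical term -> words that signal the wrong meaning
        self.danger_terms = {
            "python": ["snake", "reptile", "constrictor"],
            "java": ["coffee", "island", "indonesian"],
            "rust": ["corrosion", "oxidation", "metal"],
            "kafka": ["franz", "author", "metamorphosis"]
        }

    @staticmethod
    def _has_word(word, text):
        # whole-word match, so "rust" doesn't fire on "trust"
        # and "author" doesn't fire on "authorization"
        return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None

    def check_response(self, query, response):
        # returns one entry per wrong-meaning word found in the response
        disasters = []

        for term, bad_contexts in self.danger_terms.items():
            if not self._has_word(term, query):
                continue
            for bad in bad_contexts:
                if self._has_word(bad, response):
                    disasters.append({
                        "term": term,
                        "found": bad,
                        "severity": "fire_me"
                    })

        return disasters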
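
what to do with that list is up to you. here's one minimal option, sketched under assumptions: swap in a fallback message whenever the check finds anything. guard_response, FALLBACK, and toy_check are hypothetical names for this example. toy_check just stands in for check_response so the snippet runs on its own.

```python
FALLBACK = "i can't answer that reliably yet. flagged for a human."

def guard_response(query, response, check):
    """Return (text_to_send, disasters). Uses FALLBACK when check finds anything."""
    disasters = check(query, response)
    if disasters:
        return FALLBACK, disasters
    return response, []

# toy checker with the same (query, response) -> list shape as check_response
def toy_check(query, response):
    if "python" in query.lower() and "snake" in response.lower():
        return [{"term": "python", "found": "snake"}]
    return []

text, found = guard_response(
    "explain python decorators",
    "a python is a snake that wraps functions",
    toy_check,
)
print(text)
print(found)
```

returning the disasters alongside the text means the caller can log them instead of silently dropping the bad answer.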

definition injection that actually works

SAFE_DEFINITIONS = {
    "python": "high-level programming language",
    "circuit breaker": "resilience pattern preventing cascading failures",
    "kafka": "distributed event streaming platform"
}

from langchain_core.documents import Document

def inject_glossary(query, retrieved_docs):
    # find glossary terms mentioned in the query
    query_lower = query.lower()
    terms_found = [term for term in SAFE_DEFINITIONS if term in query_lower]

    if not terms_found:
        return retrieved_docs

    glossary = "\n".join(f"{term}: {SAFE_DEFINITIONS[term]}"
                         for term in terms_found)

    glossary_doc = Document(
        page_content=f"DEFINITIONS:\n{glossary}",
        metadata={"source": "company_glossary"}
    )

    # put our definitions FIRST, without mutating the caller's list
    return [glossary_doc] + list(retrieved_docs)

the prompt that saved my job

TERMINOLOGY_PROMPT = """You are a technical assistant.

CRITICAL: For these terms, ONLY use technical meanings:
- Python (programming language, NEVER the snake)
- Java (programming language, NEVER coffee)
- Kafka (streaming platform, NEVER the author)

Context: {context}
Question: {question}

Answer using technical definitions only:"""
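
hardcoding three terms gets old fast. a possible way to generate the same rules from a dict, so the prompt and the glossary can't drift apart. TERM_SENSES and build_terminology_prompt are made-up names for this sketch, and the output keeps the {context} and {question} placeholders so it still works with plain str.format.

```python
# assumed glossary: term -> (technical meaning, meaning to exclude)
TERM_SENSES = {
    "Python": ("programming language", "the snake"),
    "Java": ("programming language", "coffee"),
    "Kafka": ("streaming platform", "the author"),
}

def build_terminology_prompt(senses):
    """Build the prompt template, one NEVER rule per glossary term."""
    rules = "\n".join(
        f"- {term} ({good}, NEVER {bad})" for term, (good, bad) in senses.items()
    )
    # only the rules line is an f-string; the other braces stay literal placeholders
    return (
        "You are a technical assistant.\n\n"
        "CRITICAL: For these terms, ONLY use technical meanings:\n"
        f"{rules}\n\n"
        "Context: {context}\n"
        "Question: {question}\n\n"
        "Answer using technical definitions only:"
    )

template = build_terminology_prompt(TERM_SENSES)
prompt = template.format(
    context="kafka: distributed event streaming platform",
    question="what's kafka in our stack?",
)
print(prompt)
```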

damage report

morning: 47 terminology disasters
after fix: 2 (both edge cases)
response time: +80ms (worth it)
engineer trust: restored

tomorrow: handling when the bot explains "git" as british slang. because apparently that's also a thing.

detect hallucinations in langchain rag pipelines

t12s — Thu, 19 Jun 2025 02:26:28 +0000

okay so you're building a rag pipeline with langchain and your ai keeps making stuff up. been there. here's what actually works.

the problem: your bot sounds smart but lies

my customer support bot was telling people we had 24/7 support when we only work 9-5. it claimed we had "automatic refund processing" when everything's manual. subtle lies that sound totally reasonable.

the worst part? these aren't obvious hallucinations. they're plausible features we just don't have.

why it happens

your rag pipeline:

retrieves somewhat relevant docs
llm fills in gaps with "helpful" details
you get 70% truth, 30% fiction

detection method 1: see what's happening

first, add openllmetry to see everything:

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_rag_pipeline")

# your existing langchain code stays the same

now each trace shows the prompt (with the retrieved context) next to the completion, so you can see exactly where the llm adds stuff not in your docs.

detection method 2: llm checking (75% accurate)

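def detect_hallucination(context, response):
    # assumes `llm` is a LangChain model you've already created elsewhere
    prompt = f"""
    Context: {context}
    Response: {response}

    Does the response contain information not in the context? YES/NO only.
    """

    result = llm.invoke(prompt)
    # chat models return a message object, completion models return a plain string
    text = getattr(result, "content", result)
    # only trust an answer that starts with YES, not one that merely mentions it
    return text.strip().lower().startswith("yes")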
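
an accuracy number like that only means something if you measure it on your own data. a minimal sketch of how: label a handful of (context, response) pairs by hand and count how often the detector agrees. checker_accuracy and naive_detect are hypothetical names for this example. naive_detect is a dumb stand-in so the snippet runs without an llm, and in practice you'd pass detect_hallucination.

```python
def checker_accuracy(examples, detect):
    """examples: list of (context, response, is_hallucination) tuples."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for ctx, resp, label in examples if detect(ctx, resp) == label)
    return correct / len(examples)

# stand-in detector: flags any response that isn't a verbatim substring of the context
def naive_detect(context, response):
    return response.lower() not in context.lower()

labeled = [
    ("support hours are 9-5 on weekdays", "support hours are 9-5", False),
    ("support hours are 9-5 on weekdays", "we offer 24/7 support", True),
    ("refunds are processed manually", "refunds are handled by hand", False),
    ("refunds are processed manually", "refunds are processed automatically", True),
]

print(checker_accuracy(labeled, naive_detect))  # 0.75
```

the one miss here is the paraphrase ("handled by hand"), and that's exactly the kind of case worth collecting in your labeled set.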

detection method 3: pattern matching

in my support bot, these patterns almost always meant hallucination (tune the list for your own domain):

suspicious_patterns = [
    r'\d+\s*hours?',  # "48 hours"
    r'24/?7',  # "24/7 support"
    r'automatically',  # "automatically processed"
    r'real-time',  # usually a lie
]
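
a list of regexes doesn't do anything by itself, so here's one way to apply it, as a sketch: only flag a match when the same phrase is missing from the retrieved context. flag_unsupported is a name made up for this example, and the "not in the context" rule is my assumption about how to cut false positives, not something the patterns require.

```python
import re

SUSPICIOUS_PATTERNS = [
    r"\d+\s*hours?",   # "48 hours"
    r"24/?7",          # "24/7 support"
    r"automatically",  # "automatically processed"
    r"real-time",
]

def flag_unsupported(response, context):
    """Return suspicious phrases that appear in the response but not in the context."""
    ctx = context.lower()
    flagged = []
    for pattern in SUSPICIOUS_PATTERNS:
        for match in re.finditer(pattern, response, flags=re.IGNORECASE):
            phrase = match.group(0).lower()
            if phrase not in ctx:
                flagged.append(phrase)
    return flagged

context = "Support is available 9-5 on weekdays. Refunds are reviewed within 48 hours."
response = "We offer 24/7 support and refunds are automatically processed within 48 hours."
print(flag_unsupported(response, context))  # ['24/7', 'automatically']
```

"48 hours" passes because the docs actually say it. the other two have nothing backing them.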

the fix: better prompts

this cut my hallucinations by 60%:

ANTI_HALLUCINATION_PROMPT = """
Use ONLY information in the context. 
Do not add details not explicitly mentioned.
If information isn't available, say "I don't have that information."

Context: {context}
Question: {question}
"""

production setup that works

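from langchain_openai import OpenAI
from traceloop.sdk import Traceloop

Traceloop.init(app_name="production_rag")

class ProductionRAG:
    def __init__(self, qa_chain):
        # qa_chain: your retrieval chain, built with the llm and prompt below.
        # assumed to take {"question": ...} and return "answer" plus "source_documents"
        self.llm = OpenAI(temperature=0)
        self.prompt = ANTI_HALLUCINATION_PROMPT
        self.qa_chain = qa_chain

    def _context(self, result):
        # rebuild the context the llm saw from the retrieved docs
        return "\n\n".join(d.page_content for d in result.get("source_documents", []))

    def query(self, question):
        result = self.qa_chain.invoke({"question": question})

        # check for hallucinations against the retrieved context
        if detect_hallucination(self._context(result), result["answer"]):
            # retry with stricter prompt
            strict_q = f"{question}\nOnly use exact information."
            result = self.qa_chain.invoke({"question": strict_q})

        return result["answer"]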

results

before: 30% of responses had hallucinations
after: <5% hallucination rate

cost: ~30% more for checking, worth it

quick wins

add openllmetry (2 lines of code)
use explicit anti-hallucination prompts
implement basic pattern detection
set temperature to 0
track what gets flagged
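
for that last one, tracking can start as small as a counter. a minimal in-memory sketch, assuming you call it wherever a check fires. FlagLog is a hypothetical helper for this example, and in production you'd probably send the same data to your tracing or logging setup instead.

```python
from collections import Counter

class FlagLog:
    """Counts flagged reasons and keeps the first question seen for each."""

    def __init__(self):
        self.counts = Counter()
        self.examples = {}

    def record(self, reason, question):
        self.counts[reason] += 1
        self.examples.setdefault(reason, question)  # keep the first example only

    def top(self, n=3):
        return self.counts.most_common(n)

log = FlagLog()
log.record("24/7", "when is support open?")
log.record("automatically", "how do refunds work?")
log.record("24/7", "can i call at night?")
print(log.top())  # [('24/7', 2), ('automatically', 1)]
```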

the scariest hallucinations are the plausible ones. "24/7 support" when you're 9-5. "automatic processing" when it's manual. with proper detection, you catch them before customers do.

tools that work

openllmetry: see everything
traceloop: track patterns
simple pattern matching: catches 90% of common lies

that's it. detect what your rag pipeline makes up, tell it to stop, verify it listened. your customers will thank you.

DEV Community: t12s