TL;DR: RAG isn't one system — it's a pipeline with 6 stages. When something breaks, follow the data from start to finish. This guide shows you exactly which log fields to check at each stage and what they mean.
The Scenario
A customer messages you at 2 PM on a Tuesday:
"The AI is giving wrong answers."
That's it. No logs. No screenshots. Just vibes.
You have 25 fields scattered across 6 pipeline stages, and somewhere in there is the answer. This guide tells you where to look.
The Pipeline at a Glance
User Query → Embedding → Retrieval → Context Assembly → LLM Call → Response
The mistake most people make: they jump straight to the LLM. "Must be a model problem." It usually isn't. 70% of RAG failures happen before the LLM is ever called — in retrieval and context assembly.
Stage 1: The Query Comes In
Fields:
`request_id` · `user_id` · `timestamp`
Always start with request_id. This is your case number. Every other log field is useless without it because you can't tell which retrieval, which LLM call, which response belongs to this specific complaint.
Then check user_id. One user affected = their data or permissions. Hundreds of users at the same time = infrastructure.
Then check timestamp. Correlate with:
- Recent deployments — did someone push a change?
- Known outages — is the LLM provider having issues?
- Batch jobs — did an embedding re-index just run?
Example: Customer says answers broke "recently." You check timestamps — every bad answer started at 3:47 AM, exactly when a cron job re-indexed the knowledge base with a new embedding model. Mystery solved in 30 seconds.
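Scoping like this is easy to script. A minimal sketch, assuming your logs are available as plain dicts keyed by the field names above (the function name, the `deploy_times` list, and the 30-minute window are illustrative assumptions, not part of any real logging API):

```python
from datetime import datetime

def scope_incident(logs, request_id, deploy_times, window_minutes=30):
    """Pull the records for one request_id and flag any deployment
    that happened within window_minutes of an affected timestamp."""
    records = [r for r in logs if r["request_id"] == request_id]
    suspects = []
    for r in records:
        ts = datetime.fromisoformat(r["timestamp"])
        for d in deploy_times:
            if abs((ts - d).total_seconds()) <= window_minutes * 60:
                suspects.append(d)
    return records, sorted(set(suspects))
```

Run it against the complaint's `request_id` and yesterday's deploy log; if a deployment lands inside the window, that's your first lead.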
Stage 2: The Embedding Step
Fields:
`embedding_model` · `embedding_latency_ms` · `embedding_job_failed`
The user's question gets converted into a vector (a list of numbers) so it can be compared against your document vectors.
The silent killer: If this step uses a different model than what was used to index the documents, the vectors live in different mathematical spaces. It's like searching a Spanish library with a French dictionary. Nothing errors out — the results are just irrelevant.
| Field | What to Look For |
|---|---|
| `embedding_model` | Does it match the model used during indexing? If not, every search result is garbage. |
| `embedding_latency_ms` | Normal: 10-50ms. Above 2000ms: the embedding service is struggling. |
| `embedding_job_failed` | If `true`, the query never got embedded. The LLM is answering with zero context — it's guessing. |
Example: Search quality drops overnight. No deployments, no config changes. The team upgraded from `text-embedding-ada-002` to `text-embedding-3-small` for new queries, but the stored document vectors are still from the old model. Fix: re-index all documents with the new model.
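This whole stage can be validated in one pass. A minimal sketch, assuming a log record is a dict with the fields above and that you know which model built the index (the function name and dict shape are assumptions):

```python
def check_embedding_stage(record, index_model, slow_ms=2000):
    """Return a list of problems found in an embedding-stage log record."""
    problems = []
    if record.get("embedding_job_failed"):
        problems.append("embedding failed: the LLM answered with zero context")
    if record.get("embedding_model") != index_model:
        # The silent killer: query and index vectors in different spaces.
        problems.append(
            f"model mismatch: query uses {record.get('embedding_model')}, "
            f"index was built with {index_model}"
        )
    if record.get("embedding_latency_ms", 0) > slow_ms:
        problems.append("embedding service is struggling")
    return problems
```

An empty list means the fault is further down the pipeline.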
Stage 3: The Retrieval Step
Fields:
`collection_name` · `top_k` · `chunk_size` · `chunk_overlap` · `retrieved_docs` · `result_count` · `similarity_score`
This is where most RAG failures actually happen.
Check result_count first:
| Count | What It Means |
|---|---|
| 0 | Knowledge base is empty, collection doesn't exist, or query is totally unrelated. Check collection_name — staging vs. production mix-ups are more common than you'd think. |
| 1-3 | Might be fine. Might mean your knowledge base is too small or chunks are too large. |
| 50+ | You're flooding the LLM with noise. Lower top_k. |
Then check similarity_score:
| Score | Quality |
|---|---|
| Above 0.7 | Strong matches. Retrieval is working. |
| 0.3 - 0.7 | Mediocre. Docs are somewhat related but might not answer the question. |
| Below 0.3 | Retrieval is grabbing garbage. The system would give better answers with no context at all. |
Then check chunking:
| Problem | Symptom |
|---|---|
| Chunks too large (2000+ tokens) | Similarity score looks decent but the answer is diluted with irrelevant content |
| Chunks too small (50-100 tokens) | Important context is split across chunks that don't get retrieved together |
| No overlap (overlap = 0) | Sentences at chunk boundaries get cut in half. Critical info lost. |
Example: Customer asks "What's our refund policy?" and gets an answer about shipping timelines. The top retrieved doc is a 3000-token chunk titled "Order Processing" that mentions refunds in one sentence buried in paragraph 8. Fix: reduce chunk size to 500 tokens so the refund policy lives in its own chunk.
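The count and score checks above can be folded into one triage function. A minimal sketch, assuming `retrieved_docs` is a list of dicts each carrying a `similarity_score` (the function name and thresholds mirror the tables above; the dict shape is an assumption):

```python
def triage_retrieval(record):
    """Classify a retrieval-stage log record using the thresholds above."""
    count = record["result_count"]
    if count == 0:
        return "empty: check collection_name (staging vs. production?)"
    best = max(d["similarity_score"] for d in record["retrieved_docs"])
    if best < 0.3:
        return "garbage: the system would do better with no context at all"
    if best < 0.7:
        return "mediocre: docs are related but may not answer the question"
    if count >= 50:
        return "noisy: lower top_k"
    return "ok"
```

Anything other than `"ok"` points you at a specific field before you ever look at the LLM.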
Stage 4: Context Assembly
Fields:
`prompt_tokens` · `total_tokens` · `context_truncated` · `system_prompt`
This is where retrieved documents get packed into a prompt and sent to the LLM. The main failure: stuffing more context than the model can handle.
| Field | What to Look For |
|---|---|
| `prompt_tokens` | Approaching the model's context window limit? (GPT-4o: 128K, Claude Sonnet: 200K) |
| `context_truncated` | If `true`, the LLM is working with incomplete information. It's like summarizing a book using only chapters 1-7 out of 20. |
| `system_prompt` | Did someone change it? "Answer only from provided context" vs. "Be helpful" = very different behavior. The first says "I don't know." The second hallucinates. |
Example: Simple questions are correct, complex ones are wrong. Simple questions use 800 tokens, complex ones use 45,000. `context_truncated` is `true` for every complex query. Fix: set a max context budget and prioritize higher-scoring docs.
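That fix is a simple greedy pack: take the best-scoring docs first and never exceed the budget. A minimal sketch, assuming each doc is a dict with a `similarity_score` and a precomputed `tokens` count (both field names are assumptions):

```python
def assemble_context(docs, budget_tokens):
    """Pack highest-scoring docs first; skip any that would blow the budget."""
    picked, used = [], 0
    for doc in sorted(docs, key=lambda d: d["similarity_score"], reverse=True):
        if used + doc["tokens"] > budget_tokens:
            continue  # too big to fit; try the next-best doc instead
        picked.append(doc)
        used += doc["tokens"]
    return picked, used
```

Because the budget is enforced here, `context_truncated` should never fire downstream; low-value docs get dropped instead of high-value ones getting cut mid-chunk.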
Stage 5: The LLM Call
Fields:
`model` · `temperature` · `max_tokens` · `api_version` · `status_code` · `retry_count` · `latency_ms` · `cache_hit`
Check status_code first:
| Code | Meaning | Action |
|---|---|---|
| 200 | Success. Problem is elsewhere. | Move on. |
| 429 | Rate limited. | Check retry_count — high count means retry storm making it worse. |
| 500 | Provider's problem. | Retry or failover. |
| 503 | Model overloaded. | Common during peak hours. Wait or switch models. |
Then check configuration:
| Field | What to Look For |
|---|---|
| `model` | Is it the model you expect? Config drift is real — someone changes an env var and production silently downgrades. |
| `temperature` | For RAG, should be 0.0-0.3. At 1.0, the model is improvising instead of sticking to the context. |
| `latency_ms` | Normal: 1-5 seconds. 15-30 seconds: the model is overloaded or generating very long responses. |
| `cache_hit` | Answers seem outdated? A cache layer might be serving stale responses. |
Example: Customer reports "inconsistent" answers — same question, different answer each time. You check `temperature`: it's set to 0.8. Every request is a roll of the dice. Fix: set it to 0.1 for factual RAG.
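The status-code and configuration checks combine into one triage step. A minimal sketch, assuming the LLM-call log record is a dict with the fields above (the function name and return strings are illustrative):

```python
def triage_llm_call(record):
    """Map an LLM-call log record to the next debugging action."""
    code = record["status_code"]
    if code == 200:
        if record.get("temperature", 0) > 0.3:
            return "success, but temperature is too high for factual RAG"
        return "success: the problem is elsewhere"
    if code == 429:
        if record.get("retry_count", 0) > 3:
            return "rate limited: a retry storm is making it worse"
        return "rate limited: back off"
    if code == 500:
        return "provider's problem: retry or failover"
    if code == 503:
        return "model overloaded: wait or switch models"
    return f"unexpected status {code}"
```
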
Stage 6: The Response
Fields:
`completion_tokens` · `finish_reason` · `error_message`
Check finish_reason:
| Value | Meaning | Fix |
|---|---|---|
| `stop` | Model finished naturally. This is good. | — |
| `length` | Hit the `max_tokens` limit. Answer cut off mid-sentence. | Increase `max_tokens` or add "Be concise" to the system prompt. |
| `content_filter` | Blocked by safety filters. User sees an error for a legitimate question. | Adjust content filter settings. |
Check completion_tokens:
| Pattern | Likely Issue |
|---|---|
| Very low (10-20 tokens) | Model defaulting to "I don't know" — retrieval probably returned nothing useful |
| Very high (4000+ tokens) | Model is rambling — tighten the system prompt |
And always check error_message. Sometimes the answer is literally written in the error. Read it before you start investigating.
Example: Users report the AI "cuts off mid-sentence." `finish_reason` is `length` on every affected request, and `max_tokens` is set to 256 — not enough for detailed technical answers. Fix: increase it to 1024.
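The response-stage rules reduce to a few branches. A minimal sketch over a log-record dict with the fields above (function name and messages are illustrative):

```python
def triage_response(record):
    """Classify a response-stage log record using the tables above."""
    reason = record["finish_reason"]
    if reason == "length":
        return "truncated: increase max_tokens"
    if reason == "content_filter":
        return "blocked: adjust content filter settings"
    tokens = record["completion_tokens"]
    if tokens <= 20:
        return "suspiciously short: retrieval probably returned nothing useful"
    if tokens >= 4000:
        return "rambling: tighten the system prompt"
    return "ok"
```
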
The 10-Step Checklist
When a ticket comes in, work through this in order:
| Step | What to Check |
|---|---|
| 1 | Get the request_id |
| 2 | Check timestamp — correlate with deployments/outages |
| 3 | Check user_id — one user or many? |
| 4 | Check embedding_job_failed — did embedding work? |
| 5 | Check result_count + similarity_score — did retrieval return good docs? |
| 6 | Check context_truncated — did the full context reach the LLM? |
| 7 | Check status_code — did the LLM call succeed? |
| 8 | Check model + temperature — is the LLM configured correctly? |
| 9 | Check finish_reason — did the response complete? |
| 10 | Check error_message — does it just tell you? |
Steps 1-3 scope the problem. Steps 4-6 catch 70% of issues. Steps 7-10 catch the rest.
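The checklist's ordering matters: check fields in pipeline order and stop at the first one that trips. A minimal sketch, assuming one merged log record per request (missing fields are treated as healthy; the table of checks is an illustrative condensation of steps 4-9):

```python
CHECKS = [
    ("embedding_job_failed", lambda r: r.get("embedding_job_failed")),
    ("result_count",         lambda r: r.get("result_count", 1) == 0),
    ("similarity_score",     lambda r: r.get("similarity_score", 1.0) < 0.3),
    ("context_truncated",    lambda r: r.get("context_truncated")),
    ("status_code",          lambda r: r.get("status_code", 200) != 200),
    ("finish_reason",        lambda r: r.get("finish_reason") == "length"),
]

def first_failure(record):
    """Walk the checklist in pipeline order; return the first field that trips."""
    for field, tripped in CHECKS:
        if tripped(record):
            return field
    return None
```

Because the walk is ordered, an upstream failure (bad retrieval) is reported before its downstream symptom (a short, hallucinated response).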
Common Patterns Quick Reference
| Symptom | Likely Cause | Check These Fields |
|---|---|---|
| Wrong answers for everyone | Embedding model mismatch or bad re-index | `embedding_model`, `similarity_score` |
| Wrong answers for one user | Missing docs in their collection | `collection_name`, `result_count`, `user_id` |
| Incomplete answers | Response truncation | `finish_reason`, `max_tokens`, `context_truncated` |
| Inconsistent answers | Temperature too high or cache issues | `temperature`, `cache_hit` |
| Slow responses | LLM overload or too much context | `latency_ms`, `prompt_tokens`, `retry_count` |
| No response at all | API failure or rate limiting | `status_code`, `error_message`, `embedding_job_failed` |
| Hallucinated answers | No relevant docs retrieved | `result_count`, `similarity_score`, `system_prompt` |
| Outdated answers | Stale cache or stale index | `cache_hit`, `timestamp`, `embedding_job_failed` |
The Bottom Line
Follow the pipeline. Query → Embedding → Retrieval → Context → LLM → Response. Six stages, 25 fields, one direction.
Start at the beginning. Follow the data. The logs will tell you where it broke.