TL;DR: RAG isn't one system — it's a pipeline with 6 stages. When something breaks, follow the data from start to finish. This guide shows you exactly which log fields to check at each stage and what they mean.
The Scenario
A customer messages you at 2 PM on a Tuesday:
"The AI is giving wrong answers."
That's it. No logs. No screenshots. Just vibes.
You have 25 fields scattered across 6 pipeline stages, and somewhere in there is the answer. This guide tells you where to look.
The Pipeline at a Glance
User Query → Embedding → Retrieval → Context Assembly → LLM Call → Response
The mistake most people make: they jump straight to the LLM. "Must be a model problem." It usually isn't. 70% of RAG failures happen before the LLM is ever called — in retrieval and context assembly.
Stage 1: The Query Comes In
Fields:
`request_id` · `user_id` · `timestamp`
Always start with request_id. This is your case number. Every other log field is useless without it because you can't tell which retrieval, which LLM call, which response belongs to this specific complaint.
Then check user_id. One user affected = their data or permissions. Hundreds of users at the same time = infrastructure.
Then check timestamp. Correlate with:
- Recent deployments — did someone push a change?
- Known outages — is the LLM provider having issues?
- Batch jobs — did an embedding re-index just run?
Example: Customer says answers broke "recently." You check timestamps — every bad answer started at 3:47 AM, exactly when a cron job re-indexed the knowledge base with a new embedding model. Mystery solved in 30 seconds.
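Scoping like this is easy to script. A minimal sketch, assuming your logs are available as plain dicts keyed by the field names above (the function name, the `deploy_times` list, and the 30-minute window are illustrative assumptions, not part of any real logging API):

```python
from datetime import datetime

def scope_incident(logs, request_id, deploy_times, window_minutes=30):
    """Pull the records for one request_id and flag any deployment
    that happened within window_minutes of an affected timestamp."""
    records = [r for r in logs if r["request_id"] == request_id]
    suspects = []
    for r in records:
        ts = datetime.fromisoformat(r["timestamp"])
        for d in deploy_times:
            if abs((ts - d).total_seconds()) <= window_minutes * 60:
                suspects.append(d)
    return records, sorted(set(suspects))
```

Run it against the complaint's `request_id` and yesterday's deploy log; if a deployment lands inside the window, that's your first lead.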
Stage 2: The Embedding Step
Fields:
`embedding_model` · `embedding_latency_ms` · `embedding_job_failed`
The user's question gets converted into a vector (a list of numbers) so it can be compared against your document vectors.
The silent killer: If this step uses a different model than what was used to index the documents, the vectors live in different mathematical spaces. It's like searching a Spanish library with a French dictionary. Nothing errors out — the results are just irrelevant.
| Field | What to Look For |
|---|---|
| `embedding_model` | Does it match the model used during indexing? If not, every search result is garbage. |
| `embedding_latency_ms` | Normal: 10-50ms. Above 2000ms: the embedding service is struggling. |
| `embedding_job_failed` | If `true`, the query never got embedded. The LLM is answering with zero context — it's guessing. |
Example: Search quality drops overnight. No deployments, no config changes. The team upgraded from `text-embedding-ada-002` to `text-embedding-3-small` for new queries, but the stored document vectors are still from the old model. Fix: re-index all documents with the new model.
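This whole stage can be validated in one pass. A minimal sketch, assuming a log record is a dict with the fields above and that you know which model built the index (the function name and dict shape are assumptions):

```python
def check_embedding_stage(record, index_model, slow_ms=2000):
    """Return a list of problems found in an embedding-stage log record."""
    problems = []
    if record.get("embedding_job_failed"):
        problems.append("embedding failed: the LLM answered with zero context")
    if record.get("embedding_model") != index_model:
        # The silent killer: query and index vectors in different spaces.
        problems.append(
            f"model mismatch: query uses {record.get('embedding_model')}, "
            f"index was built with {index_model}"
        )
    if record.get("embedding_latency_ms", 0) > slow_ms:
        problems.append("embedding service is struggling")
    return problems
```

An empty list means the fault is further down the pipeline.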
Stage 3: The Retrieval Step
Fields:
`collection_name` · `top_k` · `chunk_size` · `chunk_overlap` · `retrieved_docs` · `result_count` · `similarity_score`
This is where most RAG failures actually happen.
Check result_count first:
| Count | What It Means |
|---|---|
| 0 | Knowledge base is empty, collection doesn't exist, or query is totally unrelated. Check collection_name — staging vs. production mix-ups are more common than you'd think. |
| 1-3 | Might be fine. Might mean your knowledge base is too small or chunks are too large. |
| 50+ | You're flooding the LLM with noise. Lower top_k. |
Then check similarity_score:
| Score | Quality |
|---|---|
| Above 0.7 | Strong matches. Retrieval is working. |
| 0.3 - 0.7 | Mediocre. Docs are somewhat related but might not answer the question. |
| Below 0.3 | Retrieval is grabbing garbage. The system would give better answers with no context at all. |
Then check chunking:
| Problem | Symptom |
|---|---|
| Chunks too large (2000+ tokens) | Similarity score looks decent but the answer is diluted with irrelevant content |
| Chunks too small (50-100 tokens) | Important context is split across chunks that don't get retrieved together |
| No overlap (overlap = 0) | Sentences at chunk boundaries get cut in half. Critical info lost. |
Example: Customer asks "What's our refund policy?" and gets an answer about shipping timelines. The top retrieved doc is a 3000-token chunk titled "Order Processing" that mentions refunds in one sentence buried in paragraph 8. Fix: reduce chunk size to 500 tokens so the refund policy lives in its own chunk.
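The count and score checks above can be folded into one triage function. A minimal sketch, assuming `retrieved_docs` is a list of dicts each carrying a `similarity_score` (the function name and thresholds mirror the tables above; the dict shape is an assumption):

```python
def triage_retrieval(record):
    """Classify a retrieval-stage log record using the thresholds above."""
    count = record["result_count"]
    if count == 0:
        return "empty: check collection_name (staging vs. production?)"
    best = max(d["similarity_score"] for d in record["retrieved_docs"])
    if best < 0.3:
        return "garbage: the system would do better with no context at all"
    if best < 0.7:
        return "mediocre: docs are related but may not answer the question"
    if count >= 50:
        return "noisy: lower top_k"
    return "ok"
```

Anything other than `"ok"` points you at a specific field before you ever look at the LLM.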
Stage 4: Context Assembly
Fields:
`prompt_tokens` · `total_tokens` · `context_truncated` · `system_prompt`
This is where retrieved documents get packed into a prompt and sent to the LLM. The main failure: stuffing more context than the model can handle.
| Field | What to Look For |
|---|---|
| `prompt_tokens` | Approaching the model's context window limit? (GPT-4o: 128K, Claude Sonnet: 200K) |
| `context_truncated` | If `true`, the LLM is working with incomplete information. It's like summarizing a book using only chapters 1-7 out of 20. |
| `system_prompt` | Did someone change it? "Answer only from provided context" vs. "Be helpful" = very different behavior. The first says "I don't know." The second hallucinates. |
Example: Simple questions are correct, complex ones are wrong. Simple questions use 800 tokens, complex ones use 45,000. `context_truncated` is `true` for every complex query. Fix: set a max context budget and prioritize higher-scoring docs.
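That fix is a simple greedy pack: take the best-scoring docs first and never exceed the budget. A minimal sketch, assuming each doc is a dict with a `similarity_score` and a precomputed `tokens` count (both field names are assumptions):

```python
def assemble_context(docs, budget_tokens):
    """Pack highest-scoring docs first; skip any that would blow the budget."""
    picked, used = [], 0
    for doc in sorted(docs, key=lambda d: d["similarity_score"], reverse=True):
        if used + doc["tokens"] > budget_tokens:
            continue  # too big to fit; try the next-best doc instead
        picked.append(doc)
        used += doc["tokens"]
    return picked, used
```

Because the budget is enforced here, `context_truncated` should never fire downstream; low-value docs get dropped instead of high-value ones getting cut mid-chunk.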
Stage 5: The LLM Call
Fields:
`model` · `temperature` · `max_tokens` · `api_version` · `status_code` · `retry_count` · `latency_ms` · `cache_hit`
Check status_code first:
| Code | Meaning | Action |
|---|---|---|
| 200 | Success. Problem is elsewhere. | Move on. |
| 429 | Rate limited. | Check retry_count — high count means retry storm making it worse. |
| 500 | Provider's problem. | Retry or failover. |
| 503 | Model overloaded. | Common during peak hours. Wait or switch models. |
Then check configuration:
| Field | What to Look For |
|---|---|
| `model` | Is it the model you expect? Config drift is real — someone changes an env var and production silently downgrades. |
| `temperature` | For RAG, should be 0.0-0.3. At 1.0, the model is improvising instead of sticking to the context. |
| `latency_ms` | Normal: 1-5 seconds. 15-30 seconds: the model is overloaded or generating very long responses. |
| `cache_hit` | Answers seem outdated? A cache layer might be serving stale responses. |
Example: Customer reports "inconsistent" answers — same question, different answer each time. You check `temperature`: it's set to 0.8. Every request is a roll of the dice. Fix: set it to 0.1 for factual RAG.
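The status-code and configuration checks combine into one triage step. A minimal sketch, assuming the LLM-call log record is a dict with the fields above (the function name and return strings are illustrative):

```python
def triage_llm_call(record):
    """Map an LLM-call log record to the next debugging action."""
    code = record["status_code"]
    if code == 200:
        if record.get("temperature", 0) > 0.3:
            return "success, but temperature is too high for factual RAG"
        return "success: the problem is elsewhere"
    if code == 429:
        if record.get("retry_count", 0) > 3:
            return "rate limited: a retry storm is making it worse"
        return "rate limited: back off"
    if code == 500:
        return "provider's problem: retry or failover"
    if code == 503:
        return "model overloaded: wait or switch models"
    return f"unexpected status {code}"
```
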
Stage 6: The Response
Fields:
`completion_tokens` · `finish_reason` · `error_message`
Check finish_reason:
| Value | Meaning | Fix |
|---|---|---|
| `stop` | Model finished naturally. This is good. | — |
| `length` | Hit the `max_tokens` limit. Answer cut off mid-sentence. | Increase `max_tokens` or add "Be concise" to the system prompt. |
| `content_filter` | Blocked by safety filters. User sees an error for a legitimate question. | Adjust content filter settings. |
Check completion_tokens:
| Pattern | Likely Issue |
|---|---|
| Very low (10-20 tokens) | Model defaulting to "I don't know" — retrieval probably returned nothing useful |
| Very high (4000+ tokens) | Model is rambling — tighten the system prompt |
And always check error_message. Sometimes the answer is literally written in the error. Read it before you start investigating.
Example: Users report the AI "cuts off mid-sentence." `finish_reason` is `length` on every affected request, and `max_tokens` is set to 256 — not enough for detailed technical answers. Fix: increase it to 1024.
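The response-stage rules reduce to a few branches. A minimal sketch over a log-record dict with the fields above (function name and messages are illustrative):

```python
def triage_response(record):
    """Classify a response-stage log record using the tables above."""
    reason = record["finish_reason"]
    if reason == "length":
        return "truncated: increase max_tokens"
    if reason == "content_filter":
        return "blocked: adjust content filter settings"
    tokens = record["completion_tokens"]
    if tokens <= 20:
        return "suspiciously short: retrieval probably returned nothing useful"
    if tokens >= 4000:
        return "rambling: tighten the system prompt"
    return "ok"
```
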
The 10-Step Checklist
When a ticket comes in, work through this in order:
| Step | What to Check |
|---|---|
| 1 | Get the request_id |
| 2 | Check timestamp — correlate with deployments/outages |
| 3 | Check user_id — one user or many? |
| 4 | Check embedding_job_failed — did embedding work? |
| 5 | Check result_count + similarity_score — did retrieval return good docs? |
| 6 | Check context_truncated — did the full context reach the LLM? |
| 7 | Check status_code — did the LLM call succeed? |
| 8 | Check model + temperature — is the LLM configured correctly? |
| 9 | Check finish_reason — did the response complete? |
| 10 | Check error_message — does it just tell you? |
Steps 1-3 scope the problem. Steps 4-6 catch 70% of issues. Steps 7-10 catch the rest.
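The checklist's ordering matters: check fields in pipeline order and stop at the first one that trips. A minimal sketch, assuming one merged log record per request (missing fields are treated as healthy; the table of checks is an illustrative condensation of steps 4-9):

```python
CHECKS = [
    ("embedding_job_failed", lambda r: r.get("embedding_job_failed")),
    ("result_count",         lambda r: r.get("result_count", 1) == 0),
    ("similarity_score",     lambda r: r.get("similarity_score", 1.0) < 0.3),
    ("context_truncated",    lambda r: r.get("context_truncated")),
    ("status_code",          lambda r: r.get("status_code", 200) != 200),
    ("finish_reason",        lambda r: r.get("finish_reason") == "length"),
]

def first_failure(record):
    """Walk the checklist in pipeline order; return the first field that trips."""
    for field, tripped in CHECKS:
        if tripped(record):
            return field
    return None
```

Because the walk is ordered, an upstream failure (bad retrieval) is reported before its downstream symptom (a short, hallucinated response).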
Common Patterns Quick Reference
| Symptom | Likely Cause | Check These Fields |
|---|---|---|
| Wrong answers for everyone | Embedding model mismatch or bad re-index | `embedding_model`, `similarity_score` |
| Wrong answers for one user | Missing docs in their collection | `collection_name`, `result_count`, `user_id` |
| Incomplete answers | Response truncation | `finish_reason`, `max_tokens`, `context_truncated` |
| Inconsistent answers | Temperature too high or cache issues | `temperature`, `cache_hit` |
| Slow responses | LLM overload or too much context | `latency_ms`, `prompt_tokens`, `retry_count` |
| No response at all | API failure or rate limiting | `status_code`, `error_message`, `embedding_job_failed` |
| Hallucinated answers | No relevant docs retrieved | `result_count`, `similarity_score`, `system_prompt` |
| Outdated answers | Stale cache or stale index | `cache_hit`, `timestamp`, `embedding_job_failed` |
The Bottom Line
Follow the pipeline. Query → Embedding → Retrieval → Context → LLM → Response. Six stages, 25 fields, one direction.
Start at the beginning. Follow the data. The logs will tell you where it broke.