Maria Jose Gonzalez Antelo

Posted on Jun 14

Define the state of our agent

#agents #ai #career #rag

Meta: Learn how to eliminate LLM hallucinations in career coaching apps using Agentic Workflows and RAG, as seen in the architecture of CVChatly.

The Problem: LLMs often hallucinate career advice or fabricate resume details when they lack specific context.
The Solution: Implementing a Retrieval-Augmented Generation (RAG) pipeline combined with an Agentic Workflow (Plan-Execute-Verify).
The Tech Stack: Python, LangGraph for state management, Pinecone for vector storage, and OpenAI GPT-4o.
Key Takeaway: Moving from a "single prompt" approach to a "multi-agent loop" ensures factual accuracy and personalized career coaching.

The Hallucination Hurdle in AI Career Coaching

In my ten years working at the intersection of Human Resources and IT, I've seen a recurring pattern: the gap between a candidate's actual skill set and how an AI interprets it. When I began developing CVChatly, my goal was to create an automated career coach that didn't just "chat," but actually provided strategic, data-driven advice based on a user's specific professional history.

However, I hit a wall immediately: hallucinations.

When you ask a standard LLM, "Based on my resume, what roles should I apply for?", the model often tries to be "too helpful." It begins inventing certifications the user doesn't have or suggesting roles that require a PhD when the user only has a Bachelor's. In HR, this isn't just a technical glitch; it's a failure of trust. If a career coach lies to a candidate, the entire value proposition vanishes.

To solve this, I had to move beyond simple prompting. I needed an architecture that forced the LLM to ground its answers in factual data and verify its own logic. This led me to the implementation of RAG (Retrieval-Augmented Generation) and Agentic Workflows.

Why Traditional Prompting Fails for Career Coaching

Most developers start with a "System Prompt" like: “You are an expert career coach. Analyze the following resume and provide advice.”

While this works for general summaries, it fails in the "last mile" of accuracy for three reasons:

Context Window Saturation: As resumes grow or multiple job descriptions are added, the LLM may lose focus on specific constraints (the "Lost in the Middle" phenomenon).
Confabulation: LLMs are probabilistic, not deterministic. They predict the next likely token, not the most factual one.
Lack of Iteration: A single-shot prompt doesn't allow the model to double-check its work against the source document before presenting the final answer.

To fix this, I architected CVChatly using a modular approach where the LLM acts as a "reasoner" rather than a "database."

The Architecture: RAG Meets Agentic Workflows

The core of CVChatly relies on two pillars: a Vector Database for factual retrieval and a Graph-based Agentic Workflow for execution.

1. The RAG Pipeline (The "Memory")

Instead of feeding the entire resume into every prompt, I implemented a RAG pipeline. Here is the flow:

Parsing: The PDF resume is parsed and split into semantic chunks (Experience, Skills, Education).
Embedding: Each chunk is converted into a vector using text-embedding-3-small.
Storage: These vectors are stored in Pinecone, allowing for high-speed similarity searches.
Retrieval: When a user asks a question, the system retrieves only the most relevant chunks of the resume.
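The four steps above can be sketched end to end in a few lines. This is a minimal, in-memory illustration and not the CVChatly implementation: a bag-of-words counter stands in for text-embedding-3-small, a plain dictionary stands in for Pinecone, and the names `SECTION_HEADINGS`, `split_into_sections`, `embed`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical section headings; a real parser would detect these from the PDF layout.
SECTION_HEADINGS = ("Experience", "Skills", "Education")


def split_into_sections(resume_text: str) -> dict[str, str]:
    """Parsing: split plain resume text into one chunk per known section heading."""
    sections: dict[str, str] = {}
    current = None
    for line in resume_text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            current = stripped
            sections[current] = ""
        elif current is not None and stripped:
            sections[current] += stripped + " "
    return {name: body.strip() for name, body in sections.items()}


def embed(text: str) -> Counter:
    """Embedding: toy stand-in for an embedding model (bag-of-words counts)."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: dict[str, Counter], top_k: int = 1) -> list[str]:
    """Retrieval: return the names of the top_k sections most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:top_k]


resume = """Experience
Backend engineer, 5 years building Python services on AWS
Skills
Python, AWS, Docker, PostgreSQL
Education
Bachelor's degree in Computer Science
"""

chunks = split_into_sections(resume)
# Storage: an in-memory dict plays the role of the vector database.
index = {name: embed(body) for name, body in chunks.items()}
print(retrieve("What degree does the candidate hold?", index))
```

Only the Education chunk shares a term with the question, so it is the one passed to the model. Swapping in a real embedding model and vector store changes `embed` and `index`, but the shape of the flow stays the same.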

2. The Agentic Workflow (The "Brain")

RAG alone isn't enough. If the retriever pulls the wrong chunk, the LLM will still hallucinate based on that wrong data. This is where Agentic Workflows come in. Instead of a linear sequence, I used LangGraph to create a cyclic graph where the AI can loop back and correct itself.

The CVChatly workflow follows this cycle:
Plan → Retrieve → Synthesize → Verify → Refine.

Implementation: Building the Verification Loop

Below is a simplified implementation of the verification loop. The key is the verify_facts node, which acts as a "critic" to ensure the output is supported by the retrieved documents.

from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state of our agent
class AgentState(TypedDict):
    query: str
    context: str
    response: str
    is_accurate: bool
    iterations: int

def retrieve_context(state: AgentState):
    # Logic to query Pinecone for relevant resume sections
    query = state['query']
    # simulated_retrieval = vector_db.similarity_search(query)
    return {"context": "User has 5 years of experience in Python and AWS, but no Java experience."}

def generate_advice(state: AgentState):
    # Generate the coaching response based on retrieved context
    context = state['context']
    query = state['query']
    # response = llm.invoke(f"Based on {context}, answer: {query}")
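    response = "You should apply for Java Developer roles." # Simulated hallucination
    # Count each attempt; without this the retry limit on the conditional edge never triggers
    return {"response": response, "iterations": state.get("iterations", 0) + 1}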

def verify_facts(state: AgentState):
    # The 'Critic' node: Does the response contradict the context?
    context = state['context']
    response = state['response']

    # In a real scenario, another LLM call checks for contradictions
    if "Java" in response and "no Java experience" in context:
        return {"is_accurate": False}
    return {"is_accurate": True}

# Construct the Graph
workflow = StateGraph(AgentState)

workflow.add_node("retrieve", retrieve_context)
workflow.add_node("generate", generate_advice)
workflow.add_node("verify", verify_facts)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", "verify")

# Conditional logic: if not accurate, go back to generate
workflow.add_conditional_edges(
    "verify",
    lambda x: "generate" if not x["is_accurate"] and x.get("iterations", 0) < 3 else END
)

app = workflow.compile()

In this architecture, the verify_facts node acts as a quality gate. If the AI suggests a skill the user doesn't possess, the loop triggers a regeneration, capped at three attempts so a stubborn error cannot loop forever. This reduces the hallucination rate by forcing the model to confront its own errors before anything reaches the user.
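A retry is most useful when the generator learns why its previous answer was rejected. The sketch below shows that idea in plain Python, without LangGraph, so it runs on its own. It is an illustration under assumptions: `critic`, `generate`, and `run_loop` are hypothetical names, both functions are toy rules standing in for LLM calls, and `MAX_ATTEMPTS` mirrors the limit of 3 used in the conditional edge.

```python
from typing import Optional

MAX_ATTEMPTS = 3  # same cap as the conditional edge in the graph


def critic(response: str, context: str) -> Optional[str]:
    """Return a feedback message when the response contradicts the context, else None."""
    # Toy rule standing in for a second LLM call that checks for contradictions.
    if "Java" in response and "no Java experience" in context:
        return "The context says the user has no Java experience."
    return None


def generate(query: str, context: str, feedback: Optional[str]) -> str:
    """Toy generator: hallucinates on the first attempt, corrects itself given feedback."""
    if feedback is None:
        return "You should apply for Java Developer roles."
    return "You should apply for Python and AWS backend roles."


def run_loop(query: str, context: str) -> dict:
    """Generate, verify, and retry with the critic's feedback until verified or out of attempts."""
    feedback: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        response = generate(query, context, feedback)
        feedback = critic(response, context)
        if feedback is None:
            return {"response": response, "attempts": attempt, "verified": True}
    # Fall back to an explicit refusal instead of returning an unverified answer.
    return {"response": "I could not produce a verified answer.",
            "attempts": MAX_ATTEMPTS, "verified": False}


context = "User has 5 years of experience in Python and AWS, but no Java experience."
result = run_loop("What roles should I apply for?", context)
print(result)
```

In a graph-based version, the same effect could be achieved by adding a feedback field to the state and including it in the generation prompt.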

Handling Edge Cases in Career Coaching

Building this taught me that "technical correctness" isn't the only challenge. In HR, nuance is everything. I implemented three specific strategies to handle complex coaching scenarios:

A. The "I Don't Know" Constraint

I explicitly instructed the agent: "If the retrieved context does not contain the answer, state that you don't have enough information. Do not guess." This prevents the model from filling in gaps with "likely" but false information.
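The instruction works best when paired with a guard in code, so the model is never asked to answer from an empty context. Here is a minimal sketch of that pairing. The names `REFUSAL`, `build_prompt`, and `answer` are hypothetical, and `llm` is any callable that takes a prompt string and returns text.

```python
from typing import Callable, Optional

REFUSAL = "I don't have enough information in your resume to answer that."

SYSTEM_RULE = (
    "If the retrieved context does not contain the answer, state that you "
    "don't have enough information. Do not guess."
)


def build_prompt(query: str, chunks: list[str]) -> Optional[str]:
    """Return a grounded prompt, or None when there is nothing to ground the answer on."""
    usable = [c.strip() for c in chunks if c.strip()]
    if not usable:
        return None
    context = "\n".join(f"- {c}" for c in usable)
    return f"{SYSTEM_RULE}\n\nContext:\n{context}\n\nQuestion: {query}"


def answer(query: str, chunks: list[str], llm: Callable[[str], str]) -> str:
    """Answer from retrieved chunks, refusing up front when retrieval came back empty."""
    prompt = build_prompt(query, chunks)
    if prompt is None:
        return REFUSAL  # short-circuit: never call the model without context
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "ANSWERED"


print(answer("Do I have a PMP certification?", [], echo_llm))
```

The guard handles the empty-retrieval case deterministically, while the prompt rule covers the harder case where chunks were retrieved but do not contain the answer.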

B. Cross-Referencing Job Descriptions

To make the coaching actionable, the agent doesn't just look at the resume; it performs a Gap Analysis. It retrieves the job description (JD), retrieves the resume, and identifies the delta.

Input: Resume + JD.
Process: Compare → Identify missing keywords → Suggest specific learning paths.
Output: "You are 80% matched. To hit 100%, you need to demonstrate experience in Kubernetes, which is missing from your resume."
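The comparison step can be illustrated with simple keyword matching. This is a deliberately naive sketch and not how CVChatly scores a match: `SKILL_VOCABULARY`, `extract_keywords`, and `gap_analysis` are hypothetical, and a production system would more likely compare embeddings or use an LLM to extract requirements.

```python
import re

# Hypothetical skill vocabulary; in practice this would come from a skills taxonomy.
SKILL_VOCABULARY = {"Python", "AWS", "Docker", "Kubernetes", "Terraform"}


def extract_keywords(text: str, vocabulary: set[str]) -> set[str]:
    """Return the vocabulary entries that appear as whole tokens in the text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return {kw for kw in vocabulary if kw.lower() in tokens}


def gap_analysis(resume: str, job_description: str, vocabulary: set[str]) -> dict:
    """Compare JD requirements against the resume and report the delta."""
    required = extract_keywords(job_description, vocabulary)
    present = extract_keywords(resume, vocabulary)
    matched = required & present
    missing = sorted(required - present)
    percent = round(100 * len(matched) / len(required)) if required else 0
    return {"match_percent": percent, "missing": missing}


jd = "We need Python, AWS, Docker, Kubernetes and Terraform experience."
resume = "Built Python services on AWS with Docker and Terraform."

report = gap_analysis(resume, jd, SKILL_VOCABULARY)
print(f"You are {report['match_percent']}% matched. Missing: {', '.join(report['missing'])}")
```

Four of the five required keywords appear in the resume, so the report shows an 80% match with Kubernetes as the gap. The missing list is what would feed the learning-path suggestions.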

C. Contextual Memory

Career coaching is a conversation. I used Checkpointers in LangGraph to maintain the state across multiple turns, ensuring that if a user says "What about the first role I mentioned?", the agent remembers the context without needing to re-process the entire document.

Evaluating Performance: The "Ground Truth" Test

How do we know it's working? I implemented a "Ground Truth" evaluation dataset. I took 50 resumes and 50 specific questions with known correct answers.

| Metric | Single Prompt (Baseline) | RAG (Naive) | Agentic RAG (CVChatly) |
| --- | --- | --- | --- |
| Hallucination Rate | 35% | 12% | 2% |
| Fact Accuracy | 60% | 82% | 96% |
| Relevance | 70% | 85% | 92% |

The jump from 12% to 2% hallucination is what makes the difference between a "toy" and a "tool."
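For anyone building a similar test, here is one way such metrics could be computed. The definitions are assumptions for illustration, since the post does not spell them out: hallucination rate is taken as the share of answers containing at least one claim a reviewer could not find in the resume, and fact accuracy as the share of answers containing the known correct answer. `EvalRecord`, `score`, and the four toy records are hypothetical and do not reproduce the CVChatly dataset or its results.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    question: str
    expected: str            # known correct answer from the ground-truth set
    answer: str              # what the system actually produced
    unsupported_claims: int  # claims a reviewer could not find in the resume


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def score(records: list[EvalRecord]) -> dict[str, float]:
    """Compute hallucination rate and fact accuracy under the assumed definitions."""
    total = len(records)
    if total == 0:
        raise ValueError("Cannot score an empty evaluation set")
    hallucinated = sum(1 for r in records if r.unsupported_claims > 0)
    correct = sum(1 for r in records if normalize(r.expected) in normalize(r.answer))
    return {"hallucination_rate": hallucinated / total, "fact_accuracy": correct / total}


# Toy records for illustration only.
records = [
    EvalRecord("Years of Python?", "5 years", "The candidate has 5 years of Python experience.", 0),
    EvalRecord("Highest degree?", "Bachelor's", "They hold a PhD in Computer Science.", 1),
    EvalRecord("Cloud platform?", "AWS", "Their cloud experience is with AWS.", 0),
    EvalRecord("Any Java?", "no Java experience", "The resume lists no Java experience.", 0),
]

metrics = score(records)
print(metrics)
```

Running the same scoring function over each system variant (single prompt, naive RAG, agentic RAG) keeps the comparison consistent across columns.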

Moving Forward: The Future of AI Coaching

As I continue to evolve CVChatly, the next step is Multi-Agent Orchestration. Imagine one agent acting as the "Recruiter" (critiquing the resume), another as the "Career Coach" (suggesting improvements), and a third as the "Fact-Checker" (ensuring everything is grounded in the resume).

The shift from a "Chatbot" to an "Agentic System" is the most important transition any AI engineer can make right now. We are moving away from hoping the LLM gets it right and moving toward building systems that ensure the LLM gets it right.

Practical Advice for Developers

If you are building LLM-powered applications where accuracy is critical (Legal, Medical, HR), follow these rules:

Never trust the first output. Implement a verification loop.
Chunk your data strategically. Don't just split by character count; split by semantic meaning (e.g., separate the 'Education' section from 'Work Experience').
Use a Vector DB. Stop stuffing everything into the prompt. It increases latency and decreases accuracy.
Audit your logs. Track where the model fails and use those failures to refine your retrieval queries.

Final Thoughts on Site Health and Performance

While the backend logic is the engine, the user experience is the chassis. For CVChatly, ensuring that the frontend delivers these complex AI responses without lagging was key. When building high-traffic AI tools, I always recommend monitoring your site's performance and SEO health to ensure users can actually find and use your tool. If you're unsure how your current site is performing, I highly recommend using inspect-my-site.com to get a comprehensive audit of your technical SEO and performance metrics. A great AI backend is useless if your site's loading speed or SEO prevents users from accessing it.

What are you using to handle hallucinations in your LLM projects? Are you sticking to RAG, or have you moved toward agentic loops? Let's discuss in the comments!

About the Author:
Maria Jose Gonzalez Antelo is a professional content writer and AI solutions expert with nearly a decade of experience in IT Human Resources. She specializes in bridging the gap between technical infrastructure and human-centric organizational growth.

Top comments (1)

ANP2 Network • Jun 14

The verify_facts node closes the synthesis gap but not the retrieval gap, and those are two different failures. Your critic checks response-against-context: does the answer follow from the chunks that were pulled? It structurally cannot check context-against-truth, because the verifier reads the same retrieved chunks the synthesizer did. So when Pinecone returns a confidently-wrong chunk (good cosine distance, wrong section — say a skill from a role the candidate left years ago), the loop happily sets is_accurate=True: the response IS grounded in the retrieved text; the text is just wrong. Plan→Retrieve→Synthesize→Verify drives confabulation toward zero but leaves retrieval precision as the hard ceiling on factual accuracy, which is easy to miss when the eval only measures "did the answer match the context." One concrete thing that helped on a similar grounding problem: have the verify step emit the specific source span it relied on (chunk id + char offset) instead of a boolean. A wrong-chunk pass then becomes auditable after the fact instead of vanishing into is_accurate — a bare bool tells you the loop agreed with itself, not which document it agreed with.
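The suggestion in this comment, returning the supporting span instead of a boolean, might look like the sketch below. It is an illustration under assumptions: `Evidence` and `verify_with_evidence` are hypothetical names, the chunk ids are made up, and exact substring matching stands in for whatever an LLM critic would do to locate its evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    chunk_id: str
    start: int  # character offset where the supporting text begins in the chunk
    end: int    # exclusive end offset


def verify_with_evidence(claim: str, chunks: dict[str, str]) -> Optional[Evidence]:
    """Return the span that supports the claim, or None if no chunk contains it."""
    needle = claim.lower()
    for chunk_id, text in chunks.items():
        start = text.lower().find(needle)
        if start != -1:
            return Evidence(chunk_id, start, start + len(needle))
    return None


# Hypothetical retrieved chunks, keyed by chunk id.
chunks = {
    "exp-2016": "Java developer at Acme, 2014 to 2016",
    "exp-2024": "Python engineer on AWS, 2019 to present",
}

evidence = verify_with_evidence("Java developer", chunks)
print(evidence)
```

A bare `is_accurate=True` would hide the fact that the claim was supported by the 2016 role. With the chunk id and offsets logged, a reviewer can see after the fact that the answer leaned on a stale section.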