DEV Community: Akhilesh

106. LangGraph: Stateful Agent Workflows

Akhilesh — Sat, 06 Jun 2026 18:03:44 +0000

LangChain chains flow in one direction. Input enters, output exits, done.

Real agent workflows are not linear. A plan might need revision. A search might return empty results and require a different approach. A code review might fail and send the work back for fixes. A task might need to branch differently depending on what kind of input it receives.

LangGraph models agent workflows as directed graphs. Nodes are actions. Edges are transitions, either fixed or conditional. The agent's state flows through the graph, taking different paths based on what each node finds.

The result: agent workflows that are inspectable, debuggable, resumable from any node, and capable of complex conditional logic without becoming spaghetti code.

What LangGraph Adds to LangChain

print("LangGraph: What It Adds")
print()
print("LangChain gives you: components (LLMs, retrievers, tools)")
print("LangGraph gives you: orchestration as a directed graph")
print()

differences = {
    "Control flow":   ("Linear chain",        "Conditional graph with loops and branches"),
    "State":          ("Passed between steps", "Typed state shared across all nodes"),
    "Debugging":      ("Trace the chain",      "Visualize the graph, step through nodes"),
    "Resume":         ("Start over",           "Checkpoint any node, resume from there"),
    "Parallelism":    ("RunnableParallel only", "Parallel branches in the graph"),
    "Human-in-loop":  ("No built-in pause",    "Pause graph, wait for human, resume"),
}

print(f"{'Feature':<18} {'LangChain':<30} {'LangGraph'}")
print("=" * 80)
for feature, (lc, lg) in differences.items():
    print(f"{feature:<18} {lc:<30} {lg}")

print()
print("Install:")
print("  pip install langgraph langchain-core langchain-anthropic")

Core Concepts

import os
import json
from typing import TypedDict, Annotated, List, Optional, Literal
from langgraph.graph import StateGraph, END, START
from langgraph.checkpoint.memory import MemorySaver
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, AIMessage, BaseMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import anthropic
import warnings
warnings.filterwarnings("ignore")

llm    = ChatAnthropic(model="claude-3-5-haiku-20241022",
                        api_key=os.environ.get("ANTHROPIC_API_KEY"),
                        max_tokens=500)
parser = StrOutputParser()

class ConversationState(TypedDict):
    messages:      Annotated[List[BaseMessage], lambda x, y: x + y]
    user_input:    str
    intent:        str
    response:      str
    needs_retrieval: bool

print("LangGraph Core Concepts:")
print()
print("1. STATE: A TypedDict shared across all nodes")
print("   - Nodes read from state and write back to state")
print("   - Annotated fields define how values merge (append vs replace)")
print()
print("2. NODES: Functions that take state and return updated state")
print("   - def my_node(state: MyState) -> dict:")
print("       # do work, return updates")
print("       return {'field': new_value}")
print()
print("3. EDGES: Connections between nodes")
print("   - Unconditional: always go to next node")
print("   - Conditional: function decides which node comes next")
print()
print("4. COMPILE: StateGraph(MyState).compile() turns the graph into a runnable")
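To make the merge rule in point 1 concrete, here is a plain-Python sketch of how an update could be folded into state: a field annotated with a reducer is combined with its old value, and every other field is overwritten. `DemoState` and `apply_update` are hypothetical names used only for illustration. The sketch mimics the behaviour described above and is not LangGraph's internal code.

```python
import operator
from typing import Annotated, List, TypedDict, get_type_hints


class DemoState(TypedDict):
    # Reducer attached via Annotated: new values are appended to the old list.
    messages: Annotated[List[str], operator.add]
    # No reducer: a new value simply replaces the old one.
    intent: str


def apply_update(schema, state: dict, update: dict) -> dict:
    """Return a new state with `update` merged in, honouring any reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata and callable(metadata[0]) else None
        merged[key] = reducer(state[key], value) if reducer and key in state else value
    return merged


state = {"messages": ["hi"], "intent": ""}
state = apply_update(DemoState, state, {"messages": ["hello!"], "intent": "general"})
state = apply_update(DemoState, state, {"intent": "math"})
print(state)  # {'messages': ['hi', 'hello!'], 'intent': 'math'}
```

The list grew while the string was replaced, which is the append-versus-replace distinction a node author has to keep in mind when returning partial updates.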

Minimal Example: Intent Router

class RouterState(TypedDict):
    user_input:   str
    intent:       str
    response:     str

def detect_intent(state: RouterState) -> dict:
    """Node 1: Classify the user's intent."""
    prompt  = f"""Classify this input into exactly one category:
    - math: involves calculation or numbers
    - code: involves programming
    - general: anything else

    Input: {state['user_input']}
    Respond with only the category name."""

    intent = (llm | parser).invoke(prompt).strip().lower()
    if intent not in ["math", "code", "general"]:
        intent = "general"

    print(f"  [detect_intent] → '{intent}'")
    return {"intent": intent}

def handle_math(state: RouterState) -> dict:
    """Node 2a: Handle math queries."""
    response = (llm | parser).invoke(
        f"Solve this math problem clearly: {state['user_input']}")
    print(f"  [handle_math] → answered")
    return {"response": response}

def handle_code(state: RouterState) -> dict:
    """Node 2b: Handle code queries."""
    response = (llm | parser).invoke(
        f"Answer this coding question with example code: {state['user_input']}")
    print(f"  [handle_code] → answered")
    return {"response": response}

def handle_general(state: RouterState) -> dict:
    """Node 2c: Handle general queries."""
    response = (llm | parser).invoke(state["user_input"])
    print(f"  [handle_general] → answered")
    return {"response": response}

def route_by_intent(state: RouterState) -> Literal["handle_math", "handle_code", "handle_general"]:
    """Conditional edge: decide which handler to call."""
    return f"handle_{state['intent']}" if state["intent"] in ["math", "code"] else "handle_general"

graph = StateGraph(RouterState)
graph.add_node("detect_intent", detect_intent)
graph.add_node("handle_math",    handle_math)
graph.add_node("handle_code",    handle_code)
graph.add_node("handle_general", handle_general)

graph.add_edge(START, "detect_intent")
graph.add_conditional_edges("detect_intent", route_by_intent)
graph.add_edge("handle_math",    END)
graph.add_edge("handle_code",    END)
graph.add_edge("handle_general", END)

router_app = graph.compile()

print("Intent Router Graph:")
test_inputs = [
    "What is 15% of 840?",
    "How do I reverse a list in Python?",
    "What is the capital of Japan?",
]
for inp in test_inputs:
    print(f"\nInput: '{inp}'")
    result = router_app.invoke({"user_input": inp, "intent": "", "response": ""})
    print(f"Response: {result['response'][:120]}...")
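The exact-match check in `detect_intent` sends replies such as "Math." or "Category: code" to the general handler, because neither string equals a bare label. A slightly more forgiving normaliser is one option. In the sketch below, `normalize_intent` is a hypothetical helper (not part of LangGraph), and treating a reply that names two labels as ambiguous is an assumption.

```python
import re

VALID_INTENTS = ("math", "code", "general")


def normalize_intent(raw: str, default: str = "general") -> str:
    """Map a free-form model reply onto one of the known intent labels."""
    # Keep only runs of letters so punctuation, casing, and whitespace are ignored.
    words = re.findall(r"[a-z]+", raw.lower())
    matches = [w for w in words if w in VALID_INTENTS]
    # Accept the reply only if it names exactly one distinct label.
    return matches[0] if len(set(matches)) == 1 else default


for reply in ["math", " Math.\n", "Category: code", "math or code?", "no idea"]:
    print(f"{reply!r:22} -> {normalize_intent(reply)}")
```

Because the helper needs no model call, the routing rules can be unit-tested separately from the graph.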

Multi-Step Agent with Loops

class ResearchState(TypedDict):
    topic:         str
    search_queries: List[str]
    findings:      List[str]
    draft:         str
    review_score:  int
    final_output:  str
    iteration:     int

def plan_research(state: ResearchState) -> dict:
    """Generate search queries for the topic."""
    import re
    response = (llm | parser).invoke(
        f"Generate 3 specific search queries to research: {state['topic']}\n"
        f"Output as a numbered list, one query per line.")
    # Strip only the list marker ("1." or "1)"). lstrip("123. ") would also eat
    # leading digits that belong to the query itself, e.g. "2023 trends".
    queries = [re.sub(r"^\d+[.)]\s*", "", line.strip())
               for line in response.split("\n")
               if re.match(r"^\d+[.)]", line.strip())][:3]
    print(f"  [plan] Generated {len(queries)} queries")
    return {"search_queries": queries}

def execute_search(state: ResearchState) -> dict:
    """Simulate searching and gathering findings."""
    findings = []
    for i, query in enumerate(state["search_queries"], 1):
        finding = (llm | parser).invoke(
            f"Provide 2-3 key facts about: {query}\nBe specific and concise.")
        findings.append(f"Query {i}: {query}\n{finding}")
        print(f"  [search {i}] gathered findings")
    return {"findings": findings}

def write_draft(state: ResearchState) -> dict:
    """Write a draft based on findings."""
    findings_text = "\n\n".join(state["findings"])
    draft = (llm | parser).invoke(
        f"Write a 3-paragraph summary about '{state['topic']}' "
        f"based on these findings:\n\n{findings_text}")
    print(f"  [write] draft written ({len(draft.split())} words)")
    return {"draft": draft, "iteration": state.get("iteration", 0) + 1}

def review_draft(state: ResearchState) -> dict:
    """Score the draft quality."""
    response = (llm | parser).invoke(
        f"Rate this text quality 1-10 (just the number):\n\n{state['draft']}")
    import re
    score_match = re.search(r'\b([0-9]|10)\b', response)
    score = int(score_match.group()) if score_match else 7
    print(f"  [review] score: {score}/10 (iteration {state.get('iteration', 1)})")
    return {"review_score": score}

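def revise_draft(state: ResearchState) -> dict:
    """Improve the draft based on review."""
    revised = (llm | parser).invoke(
        f"Improve this text. Make it clearer and more informative:\n\n{state['draft']}")
    print(f"  [revise] draft improved")
    # Count each revision. Without this, should_revise never sees iteration reach 3
    # and a draft that keeps scoring below 8 loops until the recursion limit.
    return {"draft": revised, "iteration": state.get("iteration", 1) + 1}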

def finalize(state: ResearchState) -> dict:
    """Package the final output."""
    print(f"  [finalize] approved after {state.get('iteration', 1)} iteration(s)")
    return {"final_output": state["draft"]}

def should_revise(state: ResearchState) -> Literal["revise_draft", "finalize"]:
    """Decide: revise or finalize based on score and iteration count."""
    iteration = state.get("iteration", 1)
    score     = state.get("review_score", 7)
    if score < 8 and iteration < 3:
        print(f"  [routing] score {score} < 8, iteration {iteration} < 3 → revise")
        return "revise_draft"
    print(f"  [routing] score {score} or max iterations → finalize")
    return "finalize"

research_graph = StateGraph(ResearchState)
research_graph.add_node("plan_research",  plan_research)
research_graph.add_node("execute_search", execute_search)
research_graph.add_node("write_draft",    write_draft)
research_graph.add_node("review_draft",   review_draft)
research_graph.add_node("revise_draft",   revise_draft)
research_graph.add_node("finalize",       finalize)

research_graph.add_edge(START,            "plan_research")
research_graph.add_edge("plan_research",  "execute_search")
research_graph.add_edge("execute_search", "write_draft")
research_graph.add_edge("write_draft",    "review_draft")
research_graph.add_conditional_edges("review_draft", should_revise)
research_graph.add_edge("revise_draft",   "review_draft")
research_graph.add_edge("finalize",       END)

research_app = research_graph.compile()

print("\nResearch Agent with Review Loop:")
print("=" * 55)
result = research_app.invoke({
    "topic":          "transformer attention mechanisms",
    "search_queries": [],
    "findings":       [],
    "draft":          "",
    "review_score":   0,
    "final_output":   "",
    "iteration":      0,
})
print(f"\nFinal Output Preview:\n{result['final_output'][:300]}...")
print(f"\nCompleted in {result['iteration']} iteration(s)")
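Since the loop's exit depends on both the score and the iteration counter, it can be worth checking the routing rule on its own with a stubbed reviewer before spending API calls. The sketch below replays the review and revise cycle in plain Python. `run_review_loop` and the fixed score sequences are made up for illustration, and the sketch assumes the counter goes up by one on every revision.

```python
from typing import Iterable, List


def should_revise(score: int, iteration: int,
                  threshold: int = 8, max_iterations: int = 3) -> str:
    """Same rule as the routing function above, taking plain values."""
    return "revise_draft" if score < threshold and iteration < max_iterations else "finalize"


def run_review_loop(scores: Iterable[int]) -> List[str]:
    """Replay write -> review -> (revise -> review)* -> finalize.

    `scores` stands in for the reviewer's successive ratings.
    """
    path = ["write_draft"]
    iteration = 1                      # write_draft counts as iteration 1
    for score in scores:
        path.append("review_draft")
        step = should_revise(score, iteration)
        path.append(step)
        if step == "finalize":
            return path
        iteration += 1                 # each revision counts as one more iteration
    raise RuntimeError("ran out of stub scores before the loop finished")


print(run_review_loop([9]))            # good first draft: no revision
print(run_review_loop([5, 6, 4]))      # never reaches 8: stops after two revisions
```

If the counter were not incremented on revision, the second call would keep choosing `revise_draft` for as long as scores stayed below the threshold, which is the failure mode to watch for in any graph with a cycle.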

Checkpointing: Pause and Resume

print("\nLangGraph Checkpointing: Pause and Resume Any State")
print()

memory = MemorySaver()
checkpointed_app = research_graph.compile(checkpointer=memory)

thread_config = {"configurable": {"thread_id": "research_session_001"}}

print("Running with checkpointing enabled...")
result = checkpointed_app.invoke(
    {
        "topic":          "neural network activation functions",
        "search_queries": [],
        "findings":       [],
        "draft":          "",
        "review_score":   0,
        "final_output":   "",
        "iteration":      0,
    },
    config=thread_config
)

state_snapshot = checkpointed_app.get_state(thread_config)
print(f"\nCheckpoint saved. State snapshot:")
print(f"  Next nodes available: {state_snapshot.next}")
print(f"  Topic: {state_snapshot.values.get('topic', 'N/A')}")
print(f"  Iteration: {state_snapshot.values.get('iteration', 0)}")
print()
print("To continue an interrupted run on this thread later:")
print("  result = app.invoke(None, config=thread_config)")
print("  (pass None to resume from saved state; a finished run has no next node left)")
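Conceptually, a checkpointer records the state after each node under the thread's ID, so a later call can pick up where that thread left off instead of starting over. The following is a toy in-memory model of that idea in plain Python. `TinyCheckpointer` and `run_pipeline` are invented for this sketch and say nothing about how `MemorySaver` is implemented.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Tuple[str, Callable[[dict], dict]]


class TinyCheckpointer:
    """Stores (index of the next node, state) per thread_id."""

    def __init__(self) -> None:
        self._saved: Dict[str, Tuple[int, dict]] = {}

    def put(self, thread_id: str, next_index: int, state: dict) -> None:
        self._saved[thread_id] = (next_index, dict(state))

    def get(self, thread_id: str) -> Tuple[int, dict]:
        return self._saved.get(thread_id, (0, {}))


def run_pipeline(nodes: List[Node], saver: TinyCheckpointer, thread_id: str,
                 initial: Optional[dict] = None,
                 stop_before: Optional[str] = None) -> dict:
    """Run nodes in order, checkpointing after each; resume if a checkpoint exists."""
    index, state = saver.get(thread_id)
    if initial is not None:            # fresh input starts the thread over
        index, state = 0, dict(initial)
        saver.put(thread_id, index, state)
    while index < len(nodes):
        name, fn = nodes[index]
        if name == stop_before:
            break                      # pause: the checkpoint already points at this node
        state = {**state, **fn(state)}
        index += 1
        saver.put(thread_id, index, state)
    return state


nodes = [
    ("plan",  lambda s: {"queries": [f"{s['topic']} basics"]}),
    ("draft", lambda s: {"draft": f"Notes on {s['queries'][0]}"}),
    ("final", lambda s: {"final_output": s["draft"].upper()}),
]
saver = TinyCheckpointer()
paused = run_pipeline(nodes, saver, "session_001", {"topic": "relu"}, stop_before="final")
print("paused before:", nodes[saver.get("session_001")[0]][0])
resumed = run_pipeline(nodes, saver, "session_001")   # no input: continue from the checkpoint
print("resumed:", resumed["final_output"])
```

The second call passes no input, mirroring the `invoke(None, config=thread_config)` pattern: the thread ID alone is enough to find the saved position and state.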

Human-in-the-Loop

print("\nHuman-in-the-Loop: Pause for Human Approval")
print()

from langgraph.types import interrupt

class ApprovalState(TypedDict):
    content:    str
    approved:   bool
    feedback:   str

def generate_content(state: ApprovalState) -> dict:
    content = (llm | parser).invoke(
        "Write a 2-sentence marketing tagline for a machine learning platform.")
    print(f"  [generate] content ready for review")
    return {"content": content, "approved": False}

def human_review(state: ApprovalState) -> dict:
    """This node PAUSES the graph and waits for human input."""
    print(f"\n  [PAUSED] Human review required.")
    print(f"  Content: {state['content']}")
    print(f"  (In production: send notification, wait for response via API)")

    decision = interrupt({
        "message":   "Please review and approve/reject this content",
        "content":   state["content"],
        "actions":   ["approve", "reject"],
    })
    return {"approved": decision == "approve", "feedback": str(decision)}

def handle_approved(state: ApprovalState) -> dict:
    print(f"  [approved] Content published")
    return {}

def handle_rejected(state: ApprovalState) -> dict:
    print(f"  [rejected] Content sent back for revision")
    return {}

def route_approval(state: ApprovalState) -> Literal["handle_approved", "handle_rejected"]:
    return "handle_approved" if state["approved"] else "handle_rejected"

hitl_graph = StateGraph(ApprovalState)
hitl_graph.add_node("generate_content", generate_content)
hitl_graph.add_node("human_review",     human_review)
hitl_graph.add_node("handle_approved",  handle_approved)
hitl_graph.add_node("handle_rejected",  handle_rejected)

hitl_graph.add_edge(START, "generate_content")
hitl_graph.add_edge("generate_content", "human_review")
hitl_graph.add_conditional_edges("human_review", route_approval)
hitl_graph.add_edge("handle_approved", END)
hitl_graph.add_edge("handle_rejected", END)

print("Human-in-loop graph structure:")
print("  generate_content → human_review → [approved: publish] OR [rejected: revise]")
print()
print("In production:")
print("  1. Graph runs to 'human_review' node and pauses at interrupt()")
print("  2. State saved to checkpointer (compile with checkpointer=...; Redis, Postgres)")
print("  3. Human receives notification (email, Slack, UI)")
print("  4. Human responds via API endpoint")
print("  5. Graph resumes on the same thread_id via Command(resume=<decision>)")
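One detail is worth knowing before relying on this pattern. When a paused graph is resumed (in LangGraph, by invoking it with `Command(resume=...)` from `langgraph.types` on the same thread), the interrupted node runs again from its first line, and `interrupt` returns the supplied value instead of pausing. Anything placed before the `interrupt` call, such as the print statements in `human_review`, therefore executes twice. The plain-Python model below imitates that behaviour. `PauseSignal`, `fake_interrupt`, and `run_node` are stand-ins written for this sketch, not LangGraph APIs.

```python
from typing import Any, Callable, List, Optional

_resume_value: Optional[Any] = None     # set only while a node is being resumed
side_effects: List[str] = []            # records what ran before the pause point


class PauseSignal(Exception):
    """Raised to unwind the node when no human answer is available yet."""

    def __init__(self, payload: dict) -> None:
        super().__init__("waiting for human input")
        self.payload = payload


def fake_interrupt(payload: dict) -> Any:
    """First run: pause by raising. Resumed run: hand back the human's answer."""
    if _resume_value is None:
        raise PauseSignal(payload)
    return _resume_value


def human_review(state: dict) -> dict:
    side_effects.append("notified reviewer")      # runs on BOTH passes
    decision = fake_interrupt({"content": state["content"]})
    return {"approved": decision == "approve", "feedback": str(decision)}


def run_node(node: Callable[[dict], dict], state: dict, resume: Any = None) -> dict:
    """Run a node; on pause return the payload, on resume re-run it from the top."""
    global _resume_value
    _resume_value = resume
    try:
        return {**state, **node(state)}
    except PauseSignal as pause:
        return {**state, "__paused__": pause.payload}
    finally:
        _resume_value = None


state = {"content": "Ship models faster.", "approved": False, "feedback": ""}
first = run_node(human_review, state)                       # pauses
second = run_node(human_review, state, resume="approve")    # re-runs, then completes
print(first["__paused__"], second["approved"], side_effects)
```

The practical consequence is to keep notifications and other non-repeatable side effects out of the lines before the pause, or to make them idempotent.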

Parallel Branches

print("\nParallel Execution in LangGraph")
print()

class ParallelState(TypedDict):
    topic:       str
    perspective1: str
    perspective2: str
    perspective3: str
    synthesis:   str

def get_technical(state: ParallelState) -> dict:
    resp = (llm | parser).invoke(
        f"Give a 2-sentence technical perspective on: {state['topic']}")
    return {"perspective1": resp}

def get_business(state: ParallelState) -> dict:
    resp = (llm | parser).invoke(
        f"Give a 2-sentence business perspective on: {state['topic']}")
    return {"perspective2": resp}

def get_ethical(state: ParallelState) -> dict:
    resp = (llm | parser).invoke(
        f"Give a 2-sentence ethical perspective on: {state['topic']}")
    return {"perspective3": resp}

def synthesize(state: ParallelState) -> dict:
    synthesis = (llm | parser).invoke(
        f"Synthesize these three perspectives into one balanced paragraph:\n\n"
        f"Technical: {state['perspective1']}\n"
        f"Business:  {state['perspective2']}\n"
        f"Ethical:   {state['perspective3']}")
    return {"synthesis": synthesis}

parallel_graph = StateGraph(ParallelState)
parallel_graph.add_node("get_technical", get_technical)
parallel_graph.add_node("get_business",  get_business)
parallel_graph.add_node("get_ethical",   get_ethical)
parallel_graph.add_node("synthesize",    synthesize)

parallel_graph.add_edge(START, "get_technical")
parallel_graph.add_edge(START, "get_business")
parallel_graph.add_edge(START, "get_ethical")
parallel_graph.add_edge("get_technical", "synthesize")
parallel_graph.add_edge("get_business",  "synthesize")
parallel_graph.add_edge("get_ethical",   "synthesize")
parallel_graph.add_edge("synthesize", END)

parallel_app = parallel_graph.compile()

print("Running three parallel perspective nodes:")
import time
start = time.time()
result = parallel_app.invoke({
    "topic":       "using AI to make hiring decisions",
    "perspective1": "",
    "perspective2": "",
    "perspective3": "",
    "synthesis":   ""
})
elapsed = time.time() - start
print(f"Completed in {elapsed:.1f}s")
print(f"\nSynthesis:\n{result['synthesis'][:300]}...")
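The three perspective nodes can run in the same step because each writes to its own key. If parallel branches wrote to one shared key, that key would need a reducer (as in `ConversationState` earlier) to combine the updates. The sketch below models fan-out and fan-in with a thread pool and a list-append merge. The stub branch functions and `fan_out` are hypothetical and stand in for the LLM-backed nodes.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def technical(state: dict) -> dict:
    return {"perspectives": [f"technical view of {state['topic']}"]}


def business(state: dict) -> dict:
    return {"perspectives": [f"business view of {state['topic']}"]}


def ethical(state: dict) -> dict:
    return {"perspectives": [f"ethical view of {state['topic']}"]}


def fan_out(state: dict, branches: List[Callable[[dict], dict]],
            reducers: Dict[str, Callable]) -> dict:
    """Run branches concurrently on the same input, then merge their updates."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        updates = list(pool.map(lambda fn: fn(state), branches))   # keeps branch order
    merged, written = dict(state), set()
    for update in updates:
        for key, value in update.items():
            if key in reducers:
                merged[key] = reducers[key](merged.get(key, []), value)
            elif key in written:
                raise ValueError(f"two branches wrote to '{key}' without a reducer")
            else:
                merged[key] = value
            written.add(key)
    return merged


result = fan_out({"topic": "AI hiring", "perspectives": []},
                 [technical, business, ethical],
                 {"perspectives": operator.add})
print(result["perspectives"])
```

Calling `fan_out` with an empty `reducers` dict raises the conflict error, which is the situation separate `perspective1` to `perspective3` keys avoid.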

Reference Links

print("\nLangGraph Reference Links:")
print()

refs = {
    "Official Documentation": [
        ("LangGraph docs",                  "langchain-ai.github.io/langgraph"),
        ("LangGraph tutorials",             "langchain-ai.github.io/langgraph/tutorials/introduction"),
        ("How-To guides",                   "langchain-ai.github.io/langgraph/how-tos"),
        ("Conceptual guides",               "langchain-ai.github.io/langgraph/concepts"),
        ("LangGraph Cloud (deployment)",    "langchain-ai.github.io/langgraph/cloud"),
    ],
    "Key Concepts Deep Dives": [
        ("State management",                "langchain-ai.github.io/langgraph/concepts/low_level/#state"),
        ("Checkpointing / persistence",     "langchain-ai.github.io/langgraph/concepts/persistence"),
        ("Human-in-the-loop",               "langchain-ai.github.io/langgraph/concepts/human_in_the_loop"),
        ("Multi-agent with LangGraph",      "langchain-ai.github.io/langgraph/tutorials/multi_agent"),
        ("Streaming",                       "langchain-ai.github.io/langgraph/how-tos/streaming-tokens"),
    ],
    "Cheat Sheets": [
        ("LangGraph GitHub",                "github.com/langchain-ai/langgraph"),
        ("LangGraph Python API ref",        "langchain-ai.github.io/langgraph/reference/graphs"),
        ("StateGraph methods",              "langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.StateGraph"),
        ("LangSmith tracing with LangGraph","docs.smith.langchain.com/how_to_guides/tracing/langgraph"),
    ],
    "Examples": [
        ("Agent supervisor pattern",        "langchain-ai.github.io/langgraph/tutorials/multi_agent/agent_supervisor"),
        ("Plan-and-execute agent",          "langchain-ai.github.io/langgraph/tutorials/plan-and-execute/plan-and-execute"),
        ("ReAct agent from scratch",        "langchain-ai.github.io/langgraph/tutorials/introduction"),
        ("RAG with LangGraph",              "langchain-ai.github.io/langgraph/tutorials/rag/langgraph_agentic_rag"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<45} {url}")
    print()

Try This

Create langgraph_practice.py.

Part 1: build the intent router from this post. Extend it with two more intents: "translation" and "summarization." Add appropriate handler nodes. Test with 10 different inputs and verify routing is correct.

Part 2: build a workflow with a revision loop. Create a writing agent that: generates content → evaluates readability → revises if score below threshold → repeats up to 3 times. Track iteration count in state. Does quality improve across iterations?

Part 3: add checkpointing. Compile your graph with MemorySaver(). Run a workflow partway (for example, by compiling with interrupt_before set to one of your node names), then inspect the checkpoint state with app.get_state(). Modify one field with app.update_state() and resume with app.invoke(None, config=thread_config). Verify the modified state is used.

Part 4: parallel branches. Build a graph that simultaneously generates three different versions of a piece of writing (formal, casual, technical), then synthesizes them into one final version. Time the parallel vs sequential execution.

What's Next

You can now build complex stateful agent workflows. The next post is the Phase 10 capstone: a complete research assistant that plans, searches, writes, reviews, and publishes reports autonomously, using everything from posts 101 to 106.

105. LangChain: Orchestrating AI Applications

Akhilesh — Thu, 04 Jun 2026 10:34:30 +0000

You have spent four posts building agents from scratch. Raw API calls. Custom tool loops. Manual memory management. Now see it in ten lines.

chain = prompt | llm | parser

LangChain wraps the patterns you built by hand into composable, reusable components. The criticism: it abstracts too much and debugging is hard. The counter: you now know what is under the hood, so the abstractions are navigable. You built the engine. Now drive the car.

Setup

# pip install langchain langchain-core langchain-community
# pip install langchain-anthropic langchain-openai
# pip install langchain-chroma sentence-transformers faiss-cpu

import os
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document
import warnings
warnings.filterwarnings("ignore")

llm = ChatAnthropic(
    model      = "claude-3-5-haiku-20241022",
    api_key    = os.environ.get("ANTHROPIC_API_KEY"),
    max_tokens = 500,
    temperature= 0.7,
)

parser = StrOutputParser()
response = llm.invoke("What is LangChain in one sentence?")
print(f"LLM response: {response.content}")

LCEL: The Pipe Operator

basic_chain = (
    ChatPromptTemplate.from_template("Explain {concept} to a beginner in 2 sentences.")
    | llm
    | parser
)

result = basic_chain.invoke({"concept": "vector embeddings"})
print(f"Chain output: {result}")
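The `|` in these chains is ordinary Python operator overloading: each component defines `__or__` so that `a | b` produces a new object whose `invoke` feeds the output of `a` into `b`. A stripped-down illustration follows. `Step` is a toy class written for this sketch, and the three stages are stand-ins for a prompt, a model, and a parser. It is not LangChain's `Runnable` implementation.

```python
from typing import Any, Callable


class Step:
    """Wraps a function and lets steps be chained with the | operator."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b  ->  a new Step that runs a, then passes the result to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for prompt, model, and parser (no API call is made).
fake_prompt = Step(lambda d: f"Explain {d['concept']} to a beginner.")
fake_llm = Step(lambda text: {"content": f"[model reply to: {text}]"})
fake_parser = Step(lambda msg: msg["content"])

chain = fake_prompt | fake_llm | fake_parser
print(chain.invoke({"concept": "vector embeddings"}))
# [model reply to: Explain vector embeddings to a beginner.]
```

Seen this way, a chain is just nested function composition, which is why any stage can be swapped for another object that exposes the same `invoke` method.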

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from typing import List

class CodeReview(BaseModel):
    has_bugs:    bool       = Field(description="Whether code has bugs")
    quality:     int        = Field(description="Quality score 1-10")
    issues:      List[str]  = Field(description="Specific issues found")
    suggestions: List[str]  = Field(description="Improvement suggestions")

pydantic_parser = PydanticOutputParser(pydantic_object=CodeReview)

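# Pass the format instructions in as a template variable. They contain literal
# JSON braces, which the prompt template would otherwise misread as variables.
review_prompt = ChatPromptTemplate.from_messages([
    ("system", "Review Python code. {format_instructions}"),
    ("human",  "Review:\n```python\n{code}\n```"),
]).partial(format_instructions=pydantic_parser.get_format_instructions())

review_chain = review_prompt | llm | pydantic_parser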

buggy = "def divide(a, b):\n    return a / b\n\nprint(divide(10, 0))"
review = review_chain.invoke({"code": buggy})
print(f"\nCode Review:")
print(f"  Has bugs: {review.has_bugs}")
print(f"  Quality:  {review.quality}/10")
print(f"  Issues:   {review.issues}")

Memory and History

from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

store = {}

def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

chat_prompt = ChatPromptTemplate.from_messages([
    ("system",      "You are a helpful tutor."),
    ("placeholder", "{chat_history}"),
    ("human",       "{input}"),
])

chain_with_history = RunnableWithMessageHistory(
    chat_prompt | llm | parser,
    get_session_history,
    input_messages_key   = "input",
    history_messages_key = "chat_history",
)

config = {"configurable": {"session_id": "student_001"}}

turns = [
    "My name is Priya. I am learning about neural networks.",
    "Can you explain what backpropagation does?",
    "What is my name and what am I studying?",
]

print("\nMulti-turn conversation with memory:")
for turn in turns:
    response = chain_with_history.invoke({"input": turn}, config=config)
    print(f"  User:  {turn}")
    print(f"  Agent: {response[:100]}...")
    print()

print(f"History: {len(store['student_001'].messages)} messages stored")

RAG with LangChain

documents = [
    Document(page_content="The transformer uses self-attention mechanisms.",
             metadata={"source": "transformer_paper"}),
    Document(page_content="BERT is pretrained using masked language modeling.",
             metadata={"source": "bert_paper"}),
    Document(page_content="GPT uses autoregressive next-token prediction.",
             metadata={"source": "gpt_paper"}),
    Document(page_content="LoRA adds low-rank matrices to frozen pretrained weights.",
             metadata={"source": "lora_paper"}),
    Document(page_content="Fine-tuning adapts pretrained models to downstream tasks.",
             metadata={"source": "fine_tuning_guide"}),
]

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter    = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
split_docs  = splitter.split_documents(documents)

embeddings  = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(split_docs, embeddings)
retriever   = vectorstore.as_retriever(search_kwargs={"k": 3})

def format_docs(docs):
    return "\n\n".join([
        f"[{i+1}] Source: {d.metadata['source']}\n{d.page_content}"
        for i, d in enumerate(docs)
    ])

rag_prompt = ChatPromptTemplate.from_template("""
Answer ONLY from the context below. Cite sources as [1], [2].
If not in context: "I cannot find this in the documents."

Context:
{context}

Question: {question}
""")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | parser
)

queries = [
    "How does BERT get pretrained?",
    "What is LoRA?",
    "Who is the president of France?",
]

print("\nRAG Chain Demo:")
for query in queries:
    answer = rag_chain.invoke(query)
    print(f"  Q: {query}")
    print(f"  A: {answer[:120]}...")
    print()
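At its core, `as_retriever(search_kwargs={"k": 3})` embeds the query and returns the three nearest stored vectors. As a rough illustration of that ranking step, the sketch below uses word-count vectors in place of a real embedding model and cosine similarity as the score. `embed`, `cosine`, and `top_k` are toy helpers, so the sketch matches on shared words only and does not capture the semantic similarity a sentence-transformer provides. The vector store may also use a different distance metric.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

DOCS = [
    "The transformer uses self-attention mechanisms.",
    "BERT is pretrained using masked language modeling.",
    "GPT uses autoregressive next-token prediction.",
    "LoRA adds low-rank matrices to frozen pretrained weights.",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every document against the query and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


for score, doc in top_k("How is BERT pretrained?", DOCS):
    print(f"{score:.2f}  {doc}")
```

A query with no overlapping words scores zero against every document here, which is one reason the prompt tells the model to say so when the answer is not in the context.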

Agents in LangChain

from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
import math

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression. Supports +,-,*,/,sqrt,pi,e."""
    try:
        result = eval(expression, {"__builtins__": {}},
                      {"sqrt": math.sqrt, "pi": math.pi, "e": math.e})
        return str(round(float(result), 6))
    except Exception as e:
        return f"Error: {e}"

@tool
def word_count(text: str) -> str:
    """Count words and characters in text."""
    return f"Words: {len(text.split())}, Characters: {len(text)}"

@tool
def reverse_string(text: str) -> str:
    """Reverse a string."""
    return text[::-1]

agent_prompt = ChatPromptTemplate.from_messages([
    ("system",      "You are a helpful assistant. Use tools when needed."),
    ("placeholder", "{chat_history}"),
    ("human",       "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent    = create_tool_calling_agent(llm, [calculator, word_count, reverse_string], agent_prompt)
executor = AgentExecutor(agent=agent, tools=[calculator, word_count, reverse_string],
                          verbose=False, max_iterations=5)

test_queries = [
    "What is the square root of 1764?",
    "Count the words in 'To be or not to be that is the question'",
    "Reverse the string 'LangChain'",
]

print("\nAgent with Tools:")
for query in test_queries:
    result = executor.invoke({"input": query, "chat_history": []})
    print(f"  Q: {query}")
    print(f"  A: {result['output']}")
    print()
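Passing an empty `__builtins__` to `eval` does not make it safe for untrusted input, because an expression can still reach arbitrary objects through attribute access. Since the agent forwards model-generated text to this tool, a more defensive option is to parse the expression and evaluate only whitelisted node types. The sketch below does that with the standard `ast` module. `safe_eval` is a hypothetical replacement for the body of `calculator`, and the set of allowed operators is an assumption based on the tool's docstring.

```python
import ast
import math
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_NAMES = {"pi": math.pi, "e": math.e}
_FUNCS = {"sqrt": math.sqrt}


def safe_eval(expression: str) -> float:
    """Evaluate arithmetic using only whitelisted operators, names, and functions."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Name) and node.id in _NAMES:
            return _NAMES[node.id]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        # Attribute access, subscripts, lambdas, etc. all end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return float(walk(ast.parse(expression, mode="eval")))


print(safe_eval("sqrt(1764)"))  # 42.0
print(safe_eval("2 * pi * 3"))
try:
    safe_eval("__import__('os').system('ls')")
except ValueError as err:
    print("rejected:", err)
```

Inside the tool, the `try` block could call `safe_eval(expression)` in place of `eval(...)` and keep the existing error handling unchanged.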

When to Use LangChain vs From Scratch

comparison = {
    "Simple single LLM call":          ("From scratch", "Overkill to add LangChain"),
    "RAG with multiple retrievers":     ("LangChain",    "Pre-built components save time"),
    "Multi-provider LLM switching":     ("LangChain",    "Unified interface is valuable"),
    "Custom agent with full control":   ("From scratch", "Full visibility, easier debug"),
    "Rapid prototyping":                ("LangChain",    "Much faster to build"),
    "Production with LangSmith tracing":("LangChain",    "Observability built in"),
}

print("LangChain vs From Scratch:")
print(f"{'Scenario':<42} {'Recommendation':<15} {'Reason'}")
print("=" * 85)
for scenario, (rec, reason) in comparison.items():
    print(f"{scenario:<42} {rec:<15} {reason}")

LangSmith: Debugging LangChain

print("\nLangSmith: Essential for Production LangChain")
print()
print("Setup (3 lines):")
print("  export LANGSMITH_API_KEY='your_key'")
print("  export LANGSMITH_TRACING='true'")
print("  export LANGCHAIN_PROJECT='my_project'")
print()
print("That is it. Every run is traced automatically.")
print()
print("What you see in LangSmith:")
features = [
    "Full trace of every LLM call with tokens and cost",
    "Timing breakdown per chain step",
    "Side-by-side diff of prompt changes",
    "User feedback collection",
    "Dataset creation from production traces",
    "A/B testing of different prompts",
]
for f in features:
    print(f"  • {f}")
print()
print("Without LangSmith, debugging LangChain in production is painful.")
print("It is not optional for serious applications.")

Reference Links

print("\nLangChain Reference Links:")
print()

refs = {
    "Official Docs": [
        ("LangChain Python docs",           "python.langchain.com"),
        ("LCEL reference",                  "python.langchain.com/docs/expression_language"),
        ("LangGraph (stateful agents)",     "langchain-ai.github.io/langgraph"),
        ("LangSmith docs",                  "docs.smith.langchain.com"),
        ("All integrations",                "python.langchain.com/docs/integrations"),
    ],
    "Cheat Sheets": [
        ("LCEL cheatsheet",                 "python.langchain.com/docs/expression_language/interface"),
        ("Output parsers guide",            "python.langchain.com/docs/modules/model_io/output_parsers"),
        ("v0.3 migration guide",            "python.langchain.com/docs/versions/migrating_chains"),
        ("Memory types reference",          "python.langchain.com/docs/modules/memory"),
    ],
    "Tutorials": [
        ("DeepLearning.AI LangChain course", "learn.deeplearning.ai/langchain"),
        ("LangChain RAG tutorial",           "python.langchain.com/docs/use_cases/question_answering"),
        ("Build a chatbot tutorial",         "python.langchain.com/docs/use_cases/chatbots"),
        ("LangChain cookbook",               "github.com/langchain-ai/langchain/tree/master/cookbook"),
        ("LangGraph agent tutorial",         "langchain-ai.github.io/langgraph/tutorials/introduction"),
    ],
    "Community": [
        ("LangChain GitHub (star it)",      "github.com/langchain-ai/langchain"),
        ("Discord",                         "discord.gg/langchain"),
        ("Awesome LangChain resources",     "github.com/kyrolabs/awesome-langchain"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<45} {url}")
    print()

Try This

Create langchain_practice.py.

Part 1: build three LCEL chains using the pipe operator: a translation chain, a code review chain that outputs structured JSON, and a topic classification chain.

Part 2: build a RAG chain with FAISS and all-MiniLM-L6-v2. Load 20 documents. Test 10 queries. Compare retrieval quality versus the manual cosine similarity approach from post 85.

Part 3: build a LangChain agent with four tools. Give it three multi-step tasks. Enable verbose=True and compare the reasoning trace to the raw agent you built in post 101.

Part 4: add conversation history with RunnableWithMessageHistory. Build a 5-turn conversation. Inspect the stored history. Start a new session and confirm it has no memory of the previous one.

What's Next

LangChain provides the scaffolding. LangGraph adds state machines: model agent behavior as a directed graph where nodes are actions and edges are conditional transitions. Complex agents become debuggable, resumable, and production-ready. Next post.

104. Code Agents: Writing, Running, and Fixing Code

Akhilesh — Mon, 01 Jun 2026 17:46:45 +0000

103. Agent Memory: Short-Term, Long-Term, and Episodic

Akhilesh — Sun, 31 May 2026 06:53:59 +0000

Every conversation with an LLM starts from zero.

You explain your project. You explain your preferences. You explain your constraints. You spend five minutes providing context. You come back tomorrow. You do it all again.

The model remembers nothing between sessions. The context window closes. The state is gone. Every interaction is the agent's first day on the job.

Human productivity depends on memory. We remember what worked last time. We build on past experience. We know our tools, our colleagues, our recurring problems. We do not start from scratch daily.

Agents with memory do this. They remember past conversations. They recall relevant facts. They store successful strategies. They build up a model of the user's preferences and project context over time.

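This post builds the first three types of agent memory from scratch. The fourth, procedural memory, lives in the model weights and changes only through fine-tuning.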

The Four Types of Agent Memory

import os
import json
import time
import hashlib
from typing import List, Dict, Optional, Any, Tuple
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
import anthropic
import numpy as np

print("The Four Types of Agent Memory:")
print()

memory_types = {
    "Working Memory (In-Context)": {
        "speed":       "Instant",
        "capacity":    "Limited by context window (~200K tokens)",
        "persistence": "Session only — gone when conversation ends",
        "best_for":    "Current conversation, active task state",
        "implementation": "messages list in API call",
    },
    "Semantic Memory (Vector Store)": {
        "speed":       "Fast (milliseconds)",
        "capacity":    "Millions of embeddings",
        "persistence": "Persistent across sessions",
        "best_for":    "Knowledge base, past conversations, documents",
        "implementation": "ChromaDB, Pinecone, FAISS",
    },
    "Episodic Memory (Structured Store)": {
        "speed":       "Fast (key-value lookup)",
        "capacity":    "Unlimited",
        "persistence": "Persistent across sessions",
        "best_for":    "User preferences, facts, past actions, outcomes",
        "implementation": "SQLite, Redis, JSON files",
    },
    "Procedural Memory (Weights)": {
        "speed":       "Instant (baked in)",
        "capacity":    "Model-dependent",
        "persistence": "Requires fine-tuning to update",
        "best_for":    "Skills, domain knowledge, behavioral patterns",
        "implementation": "Fine-tuning, LoRA adapters",
    },
}

for name, info in memory_types.items():
    print(f"  {name}:")
    for key, val in info.items():
        print(f"    {key:<18}: {val}")
    print()

Type 1: Working Memory (Conversation History)

class WorkingMemory:
    """
    Short-term memory that lives in the context window.
    Automatically manages the sliding window to stay within token limits.
    """

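    def __init__(self, max_turns: int = 20, max_tokens: int = 50000):
        self.turns:      List[Dict] = []
        self.max_turns   = max_turns
        self.max_tokens  = max_tokens

    def add(self, role: str, content: str):
        self.turns.append({
            "role":      role,
            "content":   content,
            "timestamp": datetime.utcnow().isoformat(),
            "tokens":    int(len(content.split()) * 1.3)  # rough estimate
        })
        self._trim_if_needed()

    def _over_limit(self) -> bool:
        total_tokens = sum(t["tokens"] for t in self.turns)
        return (len(self.turns) > self.max_turns * 2
                or total_tokens > self.max_tokens)

    def _trim_if_needed(self):
        # Drop the oldest messages until both the turn and token limits hold
        while len(self.turns) > 1 and self._over_limit():
            self.turns.pop(0)
        # Keep the window starting on a user message so the trimmed
        # history still opens with a complete exchange
        while len(self.turns) > 1 and self.turns[0]["role"] != "user":
            self.turns.pop(0)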

    def get_messages(self) -> List[Dict]:
        return [{"role": t["role"], "content": t["content"]} for t in self.turns]

    def get_recent(self, n_turns: int = 5) -> List[Dict]:
        recent = self.turns[-(n_turns * 2):]
        return [{"role": t["role"], "content": t["content"]} for t in recent]

    def summarize_old(self, keep_last: int = 5) -> str:
        """Build a one-line summary of older turns. Does not remove them from memory."""
        if len(self.turns) <= keep_last * 2:
            return ""
        old_turns = self.turns[:-(keep_last * 2)]
        summary_parts = []
        for turn in old_turns:
            if turn["role"] == "user":
                summary_parts.append(f"User asked about: {turn['content'][:50]}")
        return "Previous conversation summary: " + "; ".join(summary_parts)

    def clear(self):
        self.turns = []

    def __len__(self):
        return len(self.turns) // 2

wm = WorkingMemory(max_turns=10)
wm.add("user",      "My name is Rahul and I am building a recommendation system.")
wm.add("assistant", "Great! What type of recommendations? User-based or item-based?")
wm.add("user",      "User-based collaborative filtering for an e-commerce platform.")
wm.add("assistant", "For user-based CF, you will need a user-item interaction matrix...")

print("Working Memory Demo:")
print(f"  Current turns:  {len(wm)}")
print(f"  Messages in context: {len(wm.get_messages())}")
print()
print("  Recent context:")
for msg in wm.get_messages():
    print(f"    [{msg['role']:<10}]: {msg['content'][:60]}...")

Type 2: Semantic Memory (Vector Store)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticMemory:
    """
    Long-term memory stored as embeddings.
    Retrieves relevant past context by semantic similarity.
    Think of this as the agent's 'searchable journal.'
    """

    def __init__(self, embed_model: str = "all-MiniLM-L6-v2",
                 persist_path: str = "./agent_memory"):
        self.embedder      = SentenceTransformer(embed_model)
        self.persist_path  = Path(persist_path)
        # parents=True so nested paths such as ./memory_<id>/semantic can be created
        self.persist_path.mkdir(parents=True, exist_ok=True)

        self._entries:      List[Dict]  = []
        self._embeddings:   Optional[np.ndarray] = None
        self._load()

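    def remember(self, content: str, memory_type: str = "conversation",
                 metadata: Optional[Dict] = None):
        """Store a memory with embedding. Identical content is stored once."""
        entry_id = hashlib.md5(content.encode()).hexdigest()[:8]
        if any(e["id"] == entry_id for e in self._entries):
            return  # already stored; avoids duplicates when the demo is re-run
        entry = {
            "id":          entry_id,
            "content":     content,
            "type":        memory_type,
            "timestamp":   datetime.utcnow().isoformat(),
            "metadata":    metadata or {}
        }
        self._entries.append(entry)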

        new_emb = self.embedder.encode([content])
        self._embeddings = (
            new_emb if self._embeddings is None
            else np.vstack([self._embeddings, new_emb])
        )
        self._save()

    def recall(self, query: str, top_k: int = 3,
               memory_type: Optional[str] = None,
               min_score: float = 0.3) -> List[Dict]:
        """Retrieve most relevant memories for a query."""
        if not self._entries:
            return []

        query_emb   = self.embedder.encode([query])
        scores      = cosine_similarity(query_emb, self._embeddings)[0]
        ranked_idxs = np.argsort(scores)[::-1]

        results = []
        for idx in ranked_idxs:
            if len(results) >= top_k:
                break
            entry = self._entries[idx]
            score = float(scores[idx])

            if score < min_score:
                continue
            if memory_type and entry["type"] != memory_type:
                continue

            results.append({**entry, "relevance_score": round(score, 4)})

        return results

    def forget(self, memory_id: str):
        """Remove a specific memory."""
        idx = next((i for i, e in enumerate(self._entries)
                    if e["id"] == memory_id), None)
        if idx is not None:
            self._entries.pop(idx)
            self._embeddings = np.delete(self._embeddings, idx, axis=0)
            self._save()

    def _save(self):
        data_path = self.persist_path / "memories.json"
        with open(data_path, "w") as f:
            json.dump(self._entries, f, indent=2)

        if self._embeddings is not None:
            np.save(self.persist_path / "embeddings.npy", self._embeddings)

    def _load(self):
        data_path = self.persist_path / "memories.json"
        emb_path  = self.persist_path / "embeddings.npy"

        if data_path.exists():
            with open(data_path) as f:
                self._entries = json.load(f)

        if emb_path.exists():
            self._embeddings = np.load(emb_path)

    def __len__(self):
        return len(self._entries)

sm = SemanticMemory(persist_path="./test_agent_memory")

sm.remember("User is building a recommendation system for e-commerce", "preference")
sm.remember("User prefers Python and PyTorch over TensorFlow", "preference")
sm.remember("Previous session: debugged a cosine similarity bug in the recommendation engine", "episode")
sm.remember("User's company uses PostgreSQL for the main database", "fact")
sm.remember("User struggled with cold-start problem for new users", "episode")
sm.remember("Solved cold-start by using content-based features initially", "solution")

print("Semantic Memory Demo:")
print(f"  Stored memories: {len(sm)}")
print()

queries = [
    "What database does this user use?",
    "Has this user had problems with new users?",
    "What tools does this user prefer?",
]

for query in queries:
    results = sm.recall(query, top_k=2)
    print(f"  Query: '{query}'")
    for r in results:
        print(f"    [{r['relevance_score']:.3f}] ({r['type']}) {r['content'][:60]}")
    print()
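
To make the ranking inside recall concrete, here is a minimal sketch of cosine-similarity retrieval using only the standard library. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names `cosine` and `top_matches` are hypothetical; they are not part of the SemanticMemory class above.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(query_vec: Sequence[float],
                stored: List[Tuple[str, Sequence[float]]],
                top_k: int = 2,
                min_score: float = 0.3) -> List[Tuple[str, float]]:
    """Score every stored vector, drop weak matches, return the best top_k."""
    scored = [(cosine(query_vec, vec), label) for label, vec in stored]
    scored = [s for s in scored if s[0] >= min_score]
    scored.sort(reverse=True)
    return [(label, round(score, 3)) for score, label in scored[:top_k]]

# Toy vectors (illustrative values, not real embeddings)
stored = [
    ("database fact",      [1.0, 0.0, 0.0]),
    ("cold-start episode", [0.0, 1.0, 0.0]),
    ("tool preference",    [0.7, 0.7, 0.0]),
]
matches = top_matches([0.9, 0.1, 0.0], stored)
for label, score in matches:
    print(f"[{score:.3f}] {label}")
```

The cold-start entry falls below min_score and is filtered out, which mirrors how recall skips weak matches before filling top_k.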

Type 3: Episodic Memory (Structured Store)

import sqlite3
from contextlib import contextmanager

class EpisodicMemory:
    """
    Structured memory for facts, preferences, and past events.
    Uses SQLite for persistence. Think of it as the agent's 'fact file.'
    """

    def __init__(self, db_path: str = "./agent_episodes.db"):
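        self.db_path = db_path
        # sqlite3 will not create missing directories, so make the parent first
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)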
        self._init_db()

    def _init_db(self):
        with self._conn() as conn:
            conn.executescript("""
                CREATE TABLE IF NOT EXISTS facts (
                    key         TEXT PRIMARY KEY,
                    value       TEXT NOT NULL,
                    category    TEXT DEFAULT 'general',
                    confidence  REAL DEFAULT 1.0,
                    created_at  TEXT,
                    updated_at  TEXT,
                    source      TEXT
                );

                CREATE TABLE IF NOT EXISTS episodes (
                    id          INTEGER PRIMARY KEY AUTOINCREMENT,
                    action      TEXT NOT NULL,
                    result      TEXT,
                    success     INTEGER DEFAULT 1,
                    context     TEXT,
                    timestamp   TEXT,
                    session_id  TEXT
                );

                CREATE TABLE IF NOT EXISTS preferences (
                    key         TEXT PRIMARY KEY,
                    value       TEXT NOT NULL,
                    updated_at  TEXT
                );
            """)

    @contextmanager
    def _conn(self):
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn
            conn.commit()
        finally:
            conn.close()

    def store_fact(self, key: str, value: str,
                   category: str = "general",
                   confidence: float = 1.0,
                   source: str = ""):
        now = datetime.utcnow().isoformat()
        with self._conn() as conn:
            conn.execute("""
                INSERT OR REPLACE INTO facts
                VALUES (?, ?, ?, ?, COALESCE((SELECT created_at FROM facts WHERE key=?), ?), ?, ?)
            """, (key, value, category, confidence, key, now, now, source))

    def get_fact(self, key: str) -> Optional[Dict]:
        with self._conn() as conn:
            row = conn.execute(
                "SELECT * FROM facts WHERE key = ?", (key,)).fetchone()
            return dict(row) if row else None

    def get_facts_by_category(self, category: str) -> List[Dict]:
        with self._conn() as conn:
            rows = conn.execute(
                "SELECT * FROM facts WHERE category = ? ORDER BY updated_at DESC",
                (category,)).fetchall()
            return [dict(r) for r in rows]

    def log_episode(self, action: str, result: str = "",
                     success: bool = True, context: str = "",
                     session_id: str = ""):
        with self._conn() as conn:
            conn.execute("""
                INSERT INTO episodes (action, result, success, context, timestamp, session_id)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (action, result, int(success), context,
                  datetime.utcnow().isoformat(), session_id))

    def get_recent_episodes(self, n: int = 10,
                             success_only: bool = False) -> List[Dict]:
        query = "SELECT * FROM episodes"
        if success_only:
            query += " WHERE success = 1"
        query += " ORDER BY timestamp DESC LIMIT ?"
        with self._conn() as conn:
            return [dict(r) for r in conn.execute(query, (n,)).fetchall()]

    def set_preference(self, key: str, value: str):
        with self._conn() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO preferences VALUES (?, ?, ?)",
                (key, value, datetime.utcnow().isoformat()))

    def get_preference(self, key: str, default: str = "") -> str:
        with self._conn() as conn:
            row = conn.execute(
                "SELECT value FROM preferences WHERE key = ?", (key,)).fetchone()
            return row["value"] if row else default

    def get_all_preferences(self) -> Dict[str, str]:
        with self._conn() as conn:
            rows = conn.execute("SELECT key, value FROM preferences").fetchall()
            return {r["key"]: r["value"] for r in rows}

em = EpisodicMemory(db_path="./test_episodes.db")

em.store_fact("user_name",        "Rahul",           category="identity")
em.store_fact("user_role",        "ML Engineer",      category="identity")
em.store_fact("project_type",     "recommendation",   category="project")
em.store_fact("db_technology",    "PostgreSQL",        category="tech_stack")
em.store_fact("preferred_lang",   "Python",            category="preference")
em.store_fact("preferred_ml_lib", "PyTorch",           category="preference")

em.log_episode("Helped debug cosine similarity", "Fixed shape mismatch",
               success=True, session_id="sess_001")
em.log_episode("Explained collaborative filtering", "User understood",
               success=True, session_id="sess_001")
em.log_episode("Tried matrix factorization approach", "Memory error on large data",
               success=False, session_id="sess_002")

em.set_preference("response_style", "concise with code examples")
em.set_preference("explanation_depth", "intermediate")

print("Episodic Memory Demo:")
print()
print("  User Facts:")
for fact in em.get_facts_by_category("identity"):
    print(f"    {fact['key']}: {fact['value']}")

print()
print("  Recent Episodes:")
for ep in em.get_recent_episodes(3):
    status = "✓" if ep["success"] else "✗"
    print(f"    {status} {ep['action'][:50]}: {ep['result'][:40]}")

print()
print("  Preferences:")
for key, val in em.get_all_preferences().items():
    print(f"    {key}: {val}")
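
store_fact keeps the original created_at by combining INSERT OR REPLACE with a COALESCE subquery. As an alternative sketch, SQLite's ON CONFLICT clause can express the same "insert or update" intent more directly. This assumes the SQLite bundled with your Python is version 3.24 or newer; the reduced table and the helper name `upsert_fact` are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        key        TEXT PRIMARY KEY,
        value      TEXT NOT NULL,
        created_at TEXT,
        updated_at TEXT
    )
""")

def upsert_fact(key: str, value: str, now: str) -> None:
    """Insert a new fact, or update value/updated_at if the key already exists."""
    conn.execute("""
        INSERT INTO facts (key, value, created_at, updated_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(key) DO UPDATE SET
            value      = excluded.value,
            updated_at = excluded.updated_at
    """, (key, value, now, now))

# The second call updates the row but leaves created_at untouched
upsert_fact("db_technology", "MySQL",      "2026-05-01T10:00:00")
upsert_fact("db_technology", "PostgreSQL", "2026-05-31T09:00:00")

row = conn.execute(
    "SELECT value, created_at, updated_at FROM facts WHERE key = ?",
    ("db_technology",)).fetchone()
print(row)
```

Because the existing row is updated in place rather than deleted and re-inserted, created_at survives without a subquery.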

Bringing It All Together: Memory-Augmented Agent

class MemoryAgent:
    """
    A complete agent that combines working, semantic, and episodic memory
    (procedural memory is the model itself).
    Personalizes responses based on accumulated memory.
    """

    def __init__(self, agent_id: str = "agent_default",
                 model: str = "claude-3-5-haiku-20241022"):
        self.agent_id = agent_id
        self.client   = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.model    = model

        self.working_memory  = WorkingMemory(max_turns=15)
        self.semantic_memory = SemanticMemory(
            persist_path=f"./memory_{agent_id}/semantic")
        self.episodic_memory = EpisodicMemory(
            db_path=f"./memory_{agent_id}/episodic.db")

        self._session_id = hashlib.md5(
            str(time.time()).encode()).hexdigest()[:8]

    def _build_memory_context(self, query: str) -> str:
        """Assemble relevant memories into a context block."""
        parts = []

        prefs = self.episodic_memory.get_all_preferences()
        if prefs:
            parts.append("User preferences: " +
                         "; ".join(f"{k}={v}" for k, v in prefs.items()))

        key_facts = self.episodic_memory.get_facts_by_category("identity")
        key_facts += self.episodic_memory.get_facts_by_category("project")
        if key_facts:
            facts_str = "; ".join(f"{f['key']}={f['value']}" for f in key_facts[:5])
            parts.append(f"Known facts: {facts_str}")

        relevant_memories = self.semantic_memory.recall(query, top_k=3)
        if relevant_memories:
            mem_str = "\n".join(
                f"- [{m['type']}] {m['content']}" for m in relevant_memories)
            parts.append(f"Relevant past context:\n{mem_str}")

        recent_episodes = self.episodic_memory.get_recent_episodes(3, success_only=True)
        if recent_episodes:
            ep_str = "; ".join(ep["action"][:40] for ep in recent_episodes)
            parts.append(f"Recent successful actions: {ep_str}")

        return "\n\n".join(parts) if parts else ""

    def chat(self, user_message: str, verbose: bool = False) -> str:
        self.working_memory.add("user", user_message)

        memory_context = self._build_memory_context(user_message)

        system = f"""You are a helpful AI assistant with memory of past interactions.
Use the provided context to personalize your responses.

{f'Memory context:{chr(10)}{memory_context}' if memory_context else ''}

Adapt your response to the user's known preferences and expertise level."""

        response = self.client.messages.create(
            model      = self.model,
            max_tokens = 800,
            system     = system,
            messages   = self.working_memory.get_messages()
        )
        answer = response.content[0].text
        self.working_memory.add("assistant", answer)

        if verbose:
            # Count matches before storing this turn, so the new entry does not match itself
            used_memories = len(self.semantic_memory.recall(user_message, top_k=3))
            print(f"  [Memory] Used {used_memories} relevant memories, "
                  f"{len(self.working_memory)} conversation turns in context")

        self.semantic_memory.remember(
            f"User asked: {user_message[:100]}",
            memory_type = "conversation",
            metadata    = {"session": self._session_id}
        )
        self.episodic_memory.log_episode(
            action     = f"Answered: {user_message[:50]}",
            result     = "Success",
            session_id = self._session_id
        )

        return answer

mem_agent = MemoryAgent(agent_id="rahul_session")

mem_agent.episodic_memory.store_fact("user_name", "Rahul", "identity")
mem_agent.episodic_memory.store_fact("project",   "e-commerce recommender", "project")
mem_agent.episodic_memory.set_preference("explanation_depth", "intermediate")
mem_agent.semantic_memory.remember(
    "User previously struggled with cold-start problem", "episode")

print("\nMemory-Augmented Agent Demo:")
print("=" * 60)

questions = [
    "Can you remind me where we left off with my recommendation system?",
    "What approach did we decide to use for new users?",
    "I want to add diversity to the recommendations. Any ideas?",
]

for q in questions:
    print(f"\nUser: {q}")
    answer = mem_agent.chat(q, verbose=True)
    print(f"Agent: {answer[:200]}...")

Memory Management: Forgetting and Summarization

class MemoryManager:
    """Handles memory maintenance: summarization, pruning, importance scoring."""

    def __init__(self, semantic_memory: SemanticMemory,
                 episodic_memory: EpisodicMemory):
        self.semantic = semantic_memory
        self.episodic = episodic_memory

    def summarize_session(self, session_id: str,
                           llm_client=None) -> str:
        """Compress a session into a summary memory.

        The summary is a simple heuristic: the five most recent logged actions,
        truncated. llm_client is accepted so an LLM-written summary can replace
        it later, but it is not used yet.
        """
        episodes = [
            ep for ep in self.episodic.get_recent_episodes(50)
            if ep.get("session_id") == session_id
        ]

        if not episodes:
            return ""

        summary = (
            f"Session {session_id}: " +
            "; ".join(ep["action"][:30] for ep in episodes[:5])
        )

        self.semantic.remember(
            summary,
            memory_type = "session_summary",
            metadata    = {"session_id": session_id}
        )
        return summary

    def get_memory_stats(self) -> Dict:
        return {
            "semantic_memories":     len(self.semantic),
            "total_episodes":        len(self.episodic.get_recent_episodes(1000)),
            "successful_episodes":   len(self.episodic.get_recent_episodes(1000, success_only=True)),
            "stored_preferences":    len(self.episodic.get_all_preferences()),
            "stored_facts":          len(self.episodic.get_facts_by_category("identity") +
                                         self.episodic.get_facts_by_category("project")),
        }

mm = MemoryManager(mem_agent.semantic_memory, mem_agent.episodic_memory)

print("\nMemory Statistics:")
stats = mm.get_memory_stats()
for key, value in stats.items():
    print(f"  {key:<30}: {value}")
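
The MemoryManager above covers summarization and statistics; the forgetting side can be sketched as an age-based pruning pass. This is a minimal sketch under assumptions: the helper name `stale_memory_ids`, the 90-day cutoff, and the protected types are all illustrative choices, and the entries are plain dicts shaped like the ones SemanticMemory stores.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

def stale_memory_ids(entries: List[Dict], now: datetime,
                     max_age_days: int = 90,
                     keep_types: Tuple[str, ...] = ("preference", "fact")) -> List[str]:
    """Return ids of entries older than max_age_days, skipping protected types."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        e["id"] for e in entries
        if e["type"] not in keep_types
        and datetime.fromisoformat(e["timestamp"]) < cutoff
    ]

# Sample entries with made-up ids and timestamps
now = datetime(2026, 5, 31)
entries = [
    {"id": "a1", "type": "conversation", "timestamp": "2026-01-10T08:00:00"},
    {"id": "b2", "type": "preference",   "timestamp": "2025-11-02T08:00:00"},
    {"id": "c3", "type": "conversation", "timestamp": "2026-05-20T08:00:00"},
]
to_forget = stale_memory_ids(entries, now)
print(to_forget)
```

Each returned id could then be passed to SemanticMemory.forget. Protecting preferences and facts is a design choice: old conversation snippets lose value quickly, while a stated preference usually stays true until the user changes it.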

Reference Links

print("\nAgent Memory Reference Links:")
print()

refs = {
    "Papers": [
        ("MemGPT: Memory in LLM OS",           "arxiv.org/abs/2310.08560"),
        ("Generative Agents (Stanford)",        "arxiv.org/abs/2304.03442"),
        ("Memory-Augmented LLM Survey",         "arxiv.org/abs/2312.17512"),
        ("Cognitive Architectures for LLMs",    "arxiv.org/abs/2309.02427"),
        ("Reflexion: Verbal Reinforcement",     "arxiv.org/abs/2303.11366"),
    ],
    "Implementations": [
        ("MemGPT GitHub",                        "github.com/cpacker/MemGPT"),
        ("LangChain Memory docs",                "python.langchain.com/docs/modules/memory"),
        ("LlamaIndex Memory module",             "docs.llamaindex.ai/en/stable/module_guides/storing/index_stores"),
        ("Zep: Long-term memory for agents",     "getzep.com"),
        ("Mem0: Memory layer for AI",            "mem0.ai"),
    ],
    "Tutorials": [
        ("Building agents with memory (Anthropic)", "github.com/anthropics/anthropic-cookbook"),
        ("LangGraph memory persistence",             "langchain-ai.github.io/langgraph/how-tos/persistence"),
        ("Vector memory with ChromaDB",              "docs.trychroma.com/usage-guide"),
    ],
    "Cheat Sheets": [
        ("SQLite Python reference",              "docs.python.org/3/library/sqlite3.html"),
        ("Sentence Transformers quickstart",     "sbert.net/docs/quickstart.html"),
        ("NumPy array operations",               "numpy.org/doc/stable/reference/routines.array-manipulation"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<48} {url}")
    print()

Try This

Create agent_memory_practice.py.

Part 1: build the memory system from this post. Initialize WorkingMemory, SemanticMemory, and EpisodicMemory. Run a 5-turn conversation. After each turn, store the exchange in both semantic (embedding) and episodic (SQLite) memory. Verify both stores contain the data.

Part 2: test cross-session recall. Start a new conversation. Without providing any prior context, ask the agent something that requires remembering a fact from the previous session. Does it retrieve the relevant memory and personalize the response?

Part 3: memory retrieval comparison. Take 10 queries. For each, retrieve top 3 results from semantic memory. Also retrieve results from episodic memory by category. Compare what each memory type surfaces. When is each one more useful?

Part 4: memory decay. Add a "recency weight" to semantic memory recall: recent memories score higher than old ones. Implement this by multiplying the cosine similarity score by a decay factor based on age. Does it change which memories get retrieved?
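
One possible starting point for Part 4, as a minimal sketch: scale each similarity score by an exponential decay on the memory's age. The helper name `recency_weighted`, the 7-day half-life, and the sample scores are assumptions for illustration; tune them against your own data.

```python
from datetime import datetime, timedelta

def recency_weighted(score: float, timestamp: str, now: datetime,
                     half_life_days: float = 7.0) -> float:
    """Scale a similarity score so it halves every half_life_days of age."""
    age_days = (now - datetime.fromisoformat(timestamp)).total_seconds() / 86400
    decay = 0.5 ** (max(age_days, 0.0) / half_life_days)
    return score * decay

now = datetime(2026, 5, 31, 12, 0, 0)
candidates = [
    # (content, cosine similarity, stored timestamp), all made-up values
    ("old but very similar", 0.80, (now - timedelta(days=21)).isoformat()),
    ("recent, less similar", 0.55, (now - timedelta(days=1)).isoformat()),
]
ranked = sorted(candidates,
                key=lambda c: recency_weighted(c[1], c[2], now),
                reverse=True)
for content, score, ts in ranked:
    print(f"{recency_weighted(score, ts, now):.3f}  {content}")
```

With these numbers the three-week-old memory drops below the recent one despite its higher raw similarity, which is exactly the kind of reordering the exercise asks you to look for.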

What's Next

Agents with memory are powerful. Agents that can write and execute code are transformative. The next post covers code agents: agents that write Python, run it, observe the output, and iteratively improve their code until it solves the problem. Agent modes in coding assistants such as GitHub Copilot and Cursor follow a similar loop.

102. Multi-Agent Systems: When One Agent Is Not Enough

Akhilesh — Sat, 30 May 2026 05:14:45 +0000

One agent is powerful but limited.

Ask it to research a topic, write an article, review that article, check the code examples, and format everything for publishing. It has to do everything sequentially. When it makes a mistake in step 2, it might not catch it until step 7. It has one perspective. One "voice." One set of strengths and weaknesses.

Now imagine three specialized agents working on the same task. A research agent that searches exhaustively and compiles sources. A writing agent that takes those sources and drafts the article with a clear structure. A review agent that reads the draft critically and flags errors, gaps, and unsupported claims. Each one knows its job deeply. They check each other's work. They have different system prompts that give them different strengths.

This is how complex knowledge work actually gets done. Not one person doing everything. A team of specialists coordinated toward a shared goal.

Multi-agent systems bring this pattern to AI.

The Core Patterns

import os
import json
import time
from typing import List, Dict, Callable, Optional, Any
from dataclasses import dataclass, field
from enum import Enum
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

class Pattern(Enum):
    ORCHESTRATOR_WORKER = "orchestrator_worker"
    SEQUENTIAL_PIPELINE = "sequential_pipeline"
    PARALLEL_EXECUTION  = "parallel_execution"
    DEBATE              = "debate"
    CRITIC_REVIEW       = "critic_review"

print("Multi-Agent Patterns:")
print()

patterns = {
    "Orchestrator-Worker": {
        "description": "One LLM breaks down tasks, delegates to specialized workers, aggregates results",
        "best_for":    "Complex tasks that can be decomposed into subtasks",
        "example":     "Research assistant: orchestrator delegates to researcher, writer, editor"
    },
    "Sequential Pipeline": {
        "description": "Output of one agent becomes input to the next in a fixed chain",
        "best_for":    "Multi-stage transformation: draft → edit → format → publish",
        "example":     "Content pipeline: researcher → writer → fact-checker → publisher"
    },
    "Parallel Execution": {
        "description": "Multiple agents work simultaneously on independent subtasks",
        "best_for":    "Tasks with independent components that can run concurrently",
        "example":     "Market research: agent A covers Asia, agent B covers Europe simultaneously"
    },
    "Debate/Adversarial": {
        "description": "Two agents argue opposing positions, a judge evaluates and decides",
        "best_for":    "Decision-making, fact-checking, reducing overconfidence",
        "example":     "Agent A argues for approach X, Agent B argues against, judge decides"
    },
    "Critic-Review": {
        "description": "Creator agent produces output, critic agent evaluates and gives feedback",
        "best_for":    "Quality assurance, catching blind spots, improving output quality",
        "example":     "Writer produces article, critic identifies weaknesses, writer revises"
    },
}

for name, info in patterns.items():
    print(f"  {name}:")
    print(f"    {info['description']}")
    print(f"    Best for: {info['best_for']}")
    print(f"    Example:  {info['example']}")
    print()
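
Of the patterns listed above, the sequential pipeline is the simplest to sketch. The example below uses plain functions as stand-ins for agents so it runs without an API key; `run_pipeline` and the three stage functions are hypothetical names, and a real pipeline would call an LLM inside each stage.

```python
from typing import Callable, List

Stage = Callable[[str], str]

def run_pipeline(stages: List[Stage], initial_input: str) -> str:
    """Feed each stage's output into the next stage, in order."""
    data = initial_input
    for stage in stages:
        data = stage(data)
    return data

# Stand-in stages; each would wrap an LLM call in a real system
def researcher(topic: str) -> str:
    return f"notes on {topic}"

def writer(notes: str) -> str:
    return f"draft based on {notes}"

def fact_checker(draft: str) -> str:
    return f"{draft} [checked]"

result = run_pipeline([researcher, writer, fact_checker], "vector databases")
print(result)
```

The fixed order is both the strength and the limit of this pattern: it is easy to reason about, but a mistake in an early stage flows straight through unless a later stage is built to catch it.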

Building a Base Agent Class

@dataclass
class AgentMessage:
    from_agent: str
    to_agent:   str
    content:    str
    message_type: str = "task"
    metadata:   Dict  = field(default_factory=dict)

class BaseAgent:
    """Foundation agent that all specialized agents inherit from."""

    def __init__(self, name: str, role: str, system_prompt: str,
                 model: str = "claude-3-5-haiku-20241022",
                 tools: List[Dict] = None):
        self.name          = name
        self.role          = role
        self.system_prompt = system_prompt
        self.model         = model
        self.tools         = tools or []
        self.history: List[AgentMessage] = []

    def think(self, message: str,
               context: List[Dict] = None,
               max_tokens: int = 1000) -> str:
        messages = list(context or [])
        messages.append({"role": "user", "content": message})

        kwargs = {
            "model":      self.model,
            "max_tokens": max_tokens,
            "system":     self.system_prompt,
            "messages":   messages,
        }
        if self.tools:
            kwargs["tools"] = self.tools

        response = client.messages.create(**kwargs)

        if response.stop_reason == "tool_use":
            return self._handle_tool_use(response, messages, max_tokens)

        # Join all text blocks; the first block is not guaranteed to be text
        return "".join(b.text for b in response.content if b.type == "text")

    def _handle_tool_use(self, response, messages, max_tokens):
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []

        for block in response.content:
            if block.type == "tool_use":
                result = self._execute_tool(block.name, block.input)
                tool_results.append({
                    "type":        "tool_result",
                    "tool_use_id": block.id,
                    "content":     json.dumps(result)
                })

        messages.append({"role": "user", "content": tool_results})

        final = client.messages.create(
            model=self.model, max_tokens=max_tokens,
            system=self.system_prompt, messages=messages,
            tools=self.tools
        )
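        # The follow-up reply may contain another tool_use block, which has no .text,
        # so collect only the text blocks
        return "".join(b.text for b in final.content if b.type == "text")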

    def _execute_tool(self, tool_name: str, tool_input: Dict) -> Any:
        return {"error": f"Tool {tool_name} not implemented in {self.name}"}

    def __repr__(self):
        return f"Agent({self.name}, role={self.role})"

print("BaseAgent class built.")

Pattern 1: Orchestrator-Worker

class OrchestratorAgent(BaseAgent):
    """Breaks down complex goals and delegates to specialized workers."""

    def __init__(self, workers: List[BaseAgent]):
        super().__init__(
            name   = "Orchestrator",
            role   = "coordinator",
            system_prompt = f"""You are an orchestrator that delegates tasks to specialized agents.

Available workers:
{self._format_workers(workers)}

To delegate a task, respond with JSON:
{{
  "delegations": [
    {{"agent": "agent_name", "task": "specific task description", "priority": 1}},
    ...
  ],
  "execution": "sequential" or "parallel"
}}

After receiving worker results, synthesize them into a final coherent answer."""
        )
        self.workers = {w.name: w for w in workers}

    def _format_workers(self, workers):
        return "\n".join(f"- {w.name} ({w.role}): handles {w.role} tasks"
                         for w in workers)

    def run(self, goal: str, verbose: bool = True) -> str:
        if verbose:
            print(f"\n{'='*60}")
            print(f"Orchestrator Goal: {goal}")
            print(f"{'='*60}")

        plan_prompt = f"""Goal: {goal}

Create a delegation plan. Which agents should handle which parts?
Respond with the JSON delegation format."""

        plan_json = self.think(plan_prompt)

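        # The model may wrap the JSON in prose, so try the raw reply first,
        # then the outermost {...} span, and fall back to a single delegation.
        import re
        match = re.search(r'\{.*\}', plan_json, re.DOTALL)
        plan  = None
        for candidate in (plan_json, match.group() if match else None):
            if candidate is None:
                continue
            try:
                plan = json.loads(candidate)
                break
            except json.JSONDecodeError:
                continue
        if not isinstance(plan, dict):
            plan = {"delegations": [{"agent": list(self.workers.keys())[0],
                                      "task": goal, "priority": 1}],
                     "execution": "sequential"}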

        if verbose:
            print(f"\nPlan: {plan.get('execution', 'sequential')} execution")
            for d in plan.get("delegations", []):
                print(f"  → {d['agent']}: {d['task'][:60]}")

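        # Delegations run one after another; the plan's "execution" field is reported above but not acted on.
        # Delegations that name an unknown agent are skipped.
        worker_results = {}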
        for delegation in plan.get("delegations", []):
            agent_name = delegation["agent"]
            task       = delegation["task"]

            if agent_name in self.workers:
                if verbose:
                    print(f"\n[{agent_name}] working on: {task[:50]}...")
                result = self.workers[agent_name].think(task)
                worker_results[agent_name] = result
                if verbose:
                    print(f"[{agent_name}] done: {result[:100]}...")

        synthesis_prompt = f"""Original goal: {goal}

Worker results:
{json.dumps(worker_results, indent=2)}

Synthesize these results into a single, coherent, well-structured answer."""

        final_answer = self.think(synthesis_prompt)
        return final_answer

research_agent = BaseAgent(
    name          = "Researcher",
    role          = "research",
    system_prompt = """You are a research specialist. Your job is to find and synthesize information.
Always cite sources, be thorough, and organize findings clearly.
Present information as bullet points with key facts highlighted."""
)

writer_agent = BaseAgent(
    name          = "Writer",
    role          = "writing",
    system_prompt = """You are a technical writer. Your job is to turn research into clear, engaging prose.
Write in an accessible but precise style.
Structure content with clear headings and logical flow.
Target audience: developers and data scientists."""
)

critic_agent = BaseAgent(
    name          = "Critic",
    role          = "review",
    system_prompt = """You are a critical reviewer. Your job is to find flaws and gaps.
Be constructive but rigorous. Identify:
- Factual errors or unsupported claims
- Missing important information
- Unclear or confusing passages
- Structural improvements needed
Score quality 1-10 and explain your rating."""
)

orchestrator = OrchestratorAgent(
    workers = [research_agent, writer_agent, critic_agent]
)

print("\nOrchestrator-Worker system ready.")
print(f"Workers: {list(orchestrator.workers.keys())}")

result = orchestrator.run(
    "Explain the key differences between BERT and GPT, including their architectures, "
    "training objectives, and best use cases.",
    verbose=True
)
print(f"\nFinal Answer:\n{result[:500]}...")
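
The orchestrator depends on the model replying with the JSON delegation format, and models often wrap that JSON in extra prose or name a worker that does not exist. Here is a minimal sketch of a parser that tolerates both. The function name `parse_delegation_plan` and the empty-plan fallback are assumptions made for illustration; they are not part of the classes above.

```python
import json
import re
from typing import Dict, List


def parse_delegation_plan(reply: str, known_agents: List[str]) -> Dict:
    """Extract a delegation plan from a model reply (hypothetical helper)."""
    # Try the whole reply first, then the outermost {...} span inside it.
    candidates = [reply]
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match:
        candidates.append(match.group())

    for candidate in candidates:
        try:
            plan = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if not isinstance(plan, dict):
            continue

        raw = plan.get("delegations")
        raw = raw if isinstance(raw, list) else []
        # Keep only well-formed delegations aimed at agents that exist.
        delegations = [
            d for d in raw
            if isinstance(d, dict) and d.get("agent") in known_agents and d.get("task")
        ]
        execution = plan.get("execution")
        if execution not in ("sequential", "parallel"):
            execution = "sequential"
        return {"delegations": delegations, "execution": execution}

    # Nothing parseable: return an empty plan and let the caller decide.
    return {"delegations": [], "execution": "sequential"}


reply = ('Sure! Here is the plan:\n'
         '{"delegations": [{"agent": "Researcher", "task": "find facts", "priority": 1},'
         ' {"agent": "Ghost", "task": "x", "priority": 2}], "execution": "parallel"}\n'
         'Let me know if you need changes.')
print(parse_delegation_plan(reply, ["Researcher", "Writer"]))
```

Returning an empty plan instead of guessing a worker keeps the decision visible to the caller, which can then retry the planning prompt or fall back to a default agent.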

Pattern 2: Sequential Pipeline

class Pipeline:
    """Agents run in sequence, output flows to next agent as input."""

    def __init__(self, agents: List[BaseAgent], verbose: bool = True):
        self.agents  = agents
        self.verbose = verbose
        self.outputs = {}

    def run(self, initial_input: str) -> str:
        current = initial_input

        for i, agent in enumerate(self.agents):
            if self.verbose:
                print(f"\n[Stage {i+1}/{len(self.agents)}] {agent.name}")
                print(f"  Input:  {current[:80]}...")

            prompt = (
                f"Previous stage output:\n{current}\n\nYour task: {agent.role}"
                if i > 0 else current
            )
            current = agent.think(prompt)
            self.outputs[agent.name] = current

            if self.verbose:
                print(f"  Output: {current[:80]}...")

        return current

draft_agent = BaseAgent(
    name          = "Drafter",
    role          = "Write a first draft. Do not worry about perfection, focus on getting ideas down.",
    system_prompt = "You are a first-draft writer. Write quickly and completely. Cover all the key points."
)

editor_agent = BaseAgent(
    name          = "Editor",
    role          = "Edit the draft for clarity, concision, and flow. Fix any awkward sentences.",
    system_prompt = "You are a skilled editor. Improve clarity and remove redundancy while preserving meaning."
)

formatter_agent = BaseAgent(
    name          = "Formatter",
    role          = "Format the edited content with proper markdown, headers, and structure.",
    system_prompt = "You are a content formatter. Add appropriate markdown formatting, headers, and bullet points."
)

pipeline = Pipeline(
    agents  = [draft_agent, editor_agent, formatter_agent],
    verbose = True
)

print("\nSequential Pipeline: Draft → Edit → Format")
final = pipeline.run(
    "Write a brief explanation of how neural networks learn through backpropagation.")
print(f"\nFinal formatted output:\n{final[:400]}...")
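
Before spending tokens, it can help to dry-run the pipeline wiring with stand-in agents. The sketch below assumes a hypothetical `StubAgent` that exposes the same `name`, `role`, and `think()` surface as the agents above and records every prompt it receives. `run_stages` mirrors the prompt construction in `Pipeline.run` and is not a replacement for it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StubAgent:
    """Hypothetical stand-in: no API call, just records prompts."""
    name: str
    role: str
    seen: List[str] = field(default_factory=list)

    def think(self, prompt: str) -> str:
        self.seen.append(prompt)
        return f"<{self.name} output>"


def run_stages(agents: List[StubAgent], initial_input: str) -> Tuple[str, Dict[str, str]]:
    """Feed each stage's output into the next stage's prompt."""
    current = initial_input
    outputs: Dict[str, str] = {}
    for i, agent in enumerate(agents):
        # The first stage gets the raw input; later stages get the previous output plus their role.
        prompt = (f"Previous stage output:\n{current}\n\nYour task: {agent.role}"
                  if i > 0 else current)
        current = agent.think(prompt)
        outputs[agent.name] = current
    return current, outputs


drafter = StubAgent("Drafter", "Write a draft.")
editor = StubAgent("Editor", "Edit the draft.")
final_text, outputs = run_stages([drafter, editor], "Explain backprop.")
print(editor.seen[0])
```

Inspecting `seen` shows exactly what each stage was asked, which is usually where pipeline bugs hide.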

Pattern 3: Parallel Execution

import concurrent.futures
import threading
import time
from typing import Dict, List

class ParallelAgentRunner:
    """Run multiple agents simultaneously on independent subtasks."""

    def __init__(self, agents_and_tasks: List[tuple],
                 max_workers: int = 4, verbose: bool = True):
        self.agents_and_tasks = agents_and_tasks
        self.max_workers      = max_workers
        self.verbose          = verbose
        self._lock            = threading.Lock()

    def run(self) -> Dict[str, str]:
        results = {}
        start   = time.time()

        def run_agent(agent_task_pair):
            agent, task = agent_task_pair
            if self.verbose:
                with self._lock:
                    print(f"  → [{agent.name}] started: {task[:50]}...")
            result = agent.think(task)
            if self.verbose:
                with self._lock:
                    print(f"  ✓ [{agent.name}] done ({time.time()-start:.1f}s)")
            return agent.name, result

        with concurrent.futures.ThreadPoolExecutor(
            max_workers=self.max_workers
        ) as executor:
            futures = {executor.submit(run_agent, pair): pair
                       for pair in self.agents_and_tasks}
            for future in concurrent.futures.as_completed(futures):
                name, result = future.result()
                results[name] = result

        elapsed = time.time() - start
        if self.verbose:
            print(f"\nAll agents completed in {elapsed:.1f}s total")
        return results

asia_agent  = BaseAgent("Asia_Researcher",   "researcher",
    "You research the Asian tech market. Focus on China, Japan, South Korea, India.")
europe_agent = BaseAgent("Europe_Researcher", "researcher",
    "You research the European tech market. Focus on UK, Germany, France, Nordics.")
us_agent    = BaseAgent("US_Researcher",     "researcher",
    "You research the US tech market. Focus on Silicon Valley, NYC, emerging hubs.")

topic = "the adoption and trends in AI/ML technology in 2024"

parallel_runner = ParallelAgentRunner(
    agents_and_tasks = [
        (asia_agent,   f"Research {topic} in Asia"),
        (europe_agent, f"Research {topic} in Europe"),
        (us_agent,     f"Research {topic} in the United States"),
    ],
    verbose = True
)

print("\nParallel Execution: 3 regional researchers running simultaneously")
parallel_results = parallel_runner.run()

synthesizer = BaseAgent(
    name          = "Synthesizer",
    role          = "synthesis",
    system_prompt = "You synthesize multiple research reports into one coherent global overview."
)

global_report = synthesizer.think(
    f"Synthesize these regional research reports into a global overview:\n\n" +
    "\n\n".join(f"=== {name} ===\n{result}"
                for name, result in parallel_results.items())
)
print(f"\nGlobal synthesis:\n{global_report[:400]}...")
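
One thing to watch in parallel runs: `future.result()` re-raises any exception from the worker, so a single failed call can discard the results of the others. Below is a small sketch of per-task error isolation. The function `run_parallel` and its result shape are illustrative assumptions, and plain callables stand in for `agent.think`.

```python
import concurrent.futures
from typing import Callable, Dict, List, Tuple


def run_parallel(tasks: List[Tuple[str, Callable[[], str]]],
                 max_workers: int = 4) -> Dict[str, Dict]:
    """Run named zero-argument callables in threads, capturing failures per task."""
    results: Dict[str, Dict] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fn): name for name, fn in tasks}
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                results[name] = {"ok": True, "output": future.result()}
            except Exception as exc:
                # Record the failure instead of letting it cancel the whole batch.
                results[name] = {"ok": False, "error": str(exc)}
    return results


def ok_task() -> str:
    return "fine"


def failing_task() -> str:
    raise RuntimeError("rate limited")


print(run_parallel([("A", ok_task), ("B", failing_task)]))
```

With real agents, each callable could be something like `lambda: agent.think(task)`, and the synthesizer can then skip or flag the regions that failed.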

Pattern 4: Debate Agent

class DebateSystem:
    """Two agents argue opposing sides, a judge evaluates."""

    def __init__(self, model: str = "claude-3-5-haiku-20241022"):
        self.proposer = BaseAgent(
            name          = "Proposer",
            role          = "advocate",
            system_prompt = """You are an advocate for the proposition.
Make the strongest possible case FOR the position you are assigned.
Use evidence, logic, and compelling arguments. Be persuasive.""",
            model=model
        )
        self.opponent = BaseAgent(
            name          = "Opponent",
            role          = "critic",
            system_prompt = """You are a critic of the proposition.
Make the strongest possible case AGAINST the position presented.
Find flaws, gaps, counterexamples, and alternative views. Be rigorous.""",
            model=model
        )
        self.judge = BaseAgent(
            name          = "Judge",
            role          = "arbitrator",
            system_prompt = """You are an impartial judge evaluating a debate.
Assess both sides fairly. Identify the strongest arguments from each side.
Make a reasoned final verdict with clear justification.
Format: [FOR arguments] [AGAINST arguments] [Verdict] [Reasoning]""",
            model=model
        )

    def debate(self, proposition: str, rounds: int = 2,
                verbose: bool = True) -> Dict:
        if verbose:
            print(f"\nDebate: '{proposition}'")
            print("=" * 60)

        # Keep each side's history starting with a user turn and record the
        # arguments separately. This assumes think() sends `context` followed
        # by the new prompt as a user message.
        context_p, context_o   = [], []
        for_args, against_args = [], []

        for round_num in range(1, rounds + 1):
            if verbose:
                print(f"\n--- Round {round_num} ---")

            prop_prompt = f"Round {round_num}: Argue FOR: '{proposition}'"
            prop_arg    = self.proposer.think(prop_prompt, context=context_p)
            context_p  += [{"role": "user",      "content": prop_prompt},
                           {"role": "assistant", "content": prop_arg}]
            for_args.append(prop_arg)
            if verbose:
                print(f"FOR:     {prop_arg[:150]}...")

            opp_prompt = (f"Round {round_num}: Counter this argument for "
                          f"'{proposition}':\n{prop_arg}")
            opp_arg    = self.opponent.think(opp_prompt, context=context_o)
            context_o += [{"role": "user",      "content": opp_prompt},
                          {"role": "assistant", "content": opp_arg}]
            against_args.append(opp_arg)
            if verbose:
                print(f"AGAINST: {opp_arg[:150]}...")

            # The proposer sees the rebuttal before its next round; the
            # opponent already received the proposer's text in its prompt.
            context_p.append({"role": "user",
                               "content": f"Opponent says: {opp_arg}"})

        all_args = "\n\n".join(
            [f"FOR:\n{arg}" for arg in for_args] +
            [f"AGAINST:\n{arg}" for arg in against_args]
        )

        verdict = self.judge.think(
            f"Proposition: '{proposition}'\n\nDebate arguments:\n{all_args}\n\nDeliver your verdict.")

        if verbose:
            print(f"\nJudge's Verdict:\n{verdict[:300]}...")

        return {
            "proposition":       proposition,
            "for_arguments":     for_args,
            "against_arguments": against_args,
            "verdict":           verdict
        }

debate = DebateSystem()
result = debate.debate(
    proposition = "Large Language Models will replace most software engineering jobs within 10 years",
    rounds      = 1,
    verbose     = True
)
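
The judge receives every FOR argument followed by every AGAINST argument. With more than one round, an interleaved transcript may be easier to follow, because each rebuttal sits next to the argument it answers. A minimal sketch, assuming a hypothetical helper `format_transcript` that takes the two argument lists returned by `debate()`:

```python
from itertools import zip_longest
from typing import List


def format_transcript(for_args: List[str], against_args: List[str]) -> str:
    """Interleave both sides round by round (illustrative helper)."""
    parts = []
    rounds = zip_longest(for_args, against_args)
    for round_num, (pro, con) in enumerate(rounds, start=1):
        # zip_longest pads the shorter side with None if a round is incomplete.
        if pro is not None:
            parts.append(f"Round {round_num} FOR:\n{pro}")
        if con is not None:
            parts.append(f"Round {round_num} AGAINST:\n{con}")
    return "\n\n".join(parts)


print(format_transcript(["p1", "p2"], ["o1"]))
```

Whether the judge actually rules differently on an interleaved transcript is something to test on your own propositions rather than assume.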

Pattern 5: Critic-Review Loop

class CriticReviewLoop:
    """Creator produces, critic evaluates, loop until quality threshold met."""

    def __init__(self, creator: BaseAgent, critic: BaseAgent,
                 max_iterations: int = 3, quality_threshold: float = 8.0):
        self.creator           = creator
        self.critic            = critic
        self.max_iterations    = max_iterations
        self.quality_threshold = quality_threshold

    def run(self, task: str, verbose: bool = True) -> Dict:
        history  = []
        feedback = ""

        for iteration in range(1, self.max_iterations + 1):
            if verbose:
                print(f"\n--- Iteration {iteration} ---")

            creation_prompt = (
                f"{task}\n\nFeedback from previous attempt:\n{feedback}\nImprove accordingly."
                if feedback else task
            )
            content = self.creator.think(creation_prompt)
            history.append({"iteration": iteration, "content": content})

            if verbose:
                print(f"[{self.creator.name}]: {content[:120]}...")

            critique = self.critic.think(
                f"Evaluate this content (score 1-10 and feedback):\n\n{content}"
            )
            if verbose:
                print(f"[{self.critic.name}]: {critique[:120]}...")

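            # Look for "N/10" or "score: N" rather than the first bare digit,
            # which is often a list number. An unparseable critique counts as 0
            # so the loop keeps iterating instead of passing silently.
            import re
            cleaned = re.sub(r'\b1\s*-\s*10\b', '', critique)  # drop echoes of the "1-10" scale
            score_match = (
                re.search(r'(\d+(?:\.\d+)?)\s*/\s*10\b', cleaned)
                or re.search(r'score\D{0,20}?(\d+(?:\.\d+)?)', cleaned, re.IGNORECASE)
            )
            score = float(score_match.group(1)) if score_match else 0.0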

            if score >= self.quality_threshold:
                if verbose:
                    print(f"\n✓ Quality threshold reached (score={score})")
                break

            feedback = critique

        return {
            "final_content": content,
            "iterations":    iteration,
            "history":       history
        }

code_writer = BaseAgent(
    name="CodeWriter", role="code_creator",
    system_prompt="You write clean, well-documented Python code. Include docstrings and type hints.")

code_reviewer = BaseAgent(
    name="CodeReviewer", role="code_critic",
    system_prompt="""You review Python code rigorously. Check for:
- Correctness and edge cases
- Code clarity and documentation  
- PEP 8 compliance
- Error handling
Score 1-10 and give specific actionable feedback.""")

review_loop = CriticReviewLoop(
    creator           = code_writer,
    critic            = code_reviewer,
    max_iterations    = 3,
    quality_threshold = 8.0
)

print("\nCritic-Review Loop: write and improve code iteratively")
result = review_loop.run(
    "Write a Python function that finds the longest palindrome substring in a string.")
print(f"\nFinal code after {result['iterations']} iteration(s):")
print(result["final_content"][:400])
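
The loop's stopping decision rests on pulling a number out of free-form critique text, so the parser deserves its own test. The sketch below is one possible approach, not the only one: `extract_score` is a hypothetical helper that prefers explicit "N/10" or "score: N" forms and returns `None` when it finds nothing usable.

```python
import re
from typing import Optional


def extract_score(critique: str) -> Optional[float]:
    """Return a 1-10 score from a critique, or None if none is found."""
    # Remove echoes of the scale itself, e.g. "Score (1-10): 8".
    cleaned = re.sub(r"\b1\s*-\s*10\b", "", critique)
    patterns = [
        r"(\d+(?:\.\d+)?)\s*/\s*10\b",                 # "8/10", "8.5 / 10"
        r"(?:score|rating)\D{0,20}?(\d+(?:\.\d+)?)",   # "Score: 8", "rating is 7.5"
    ]
    for pattern in patterns:
        match = re.search(pattern, cleaned, re.IGNORECASE)
        if match:
            score = float(match.group(1))
            if 1 <= score <= 10:
                return score
    return None


numbered = "1. Correctness: fine.\n2. Docs missing.\nScore: 6/10"
print(extract_score(numbered))
```

A critique that opens with a numbered list is the case that trips up a "first digit wins" regex. Here the list numbers are ignored because neither pattern matches them.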

When Multi-Agent Adds Real Value

print("\nWhen to Use Multi-Agent Systems:")
print()

use_cases = {
    "Use multi-agent when": [
        "Tasks naturally decompose into specialized subtasks",
        "Quality requires multiple independent perspectives",
        "Parallel execution would save significant time",
        "Different parts of the task need different 'personalities' or constraints",
        "One agent's output quality is not good enough and critique helps",
        "Tasks exceed a single context window",
    ],
    "Stick with single agent when": [
        "Task is straightforward and fits one context window",
        "Coordination overhead would outweigh the benefits",
        "You need predictable, debuggable behavior",
        "Latency is critical (multi-agent adds round trips)",
        "Budget is tight (each agent call costs tokens)",
        "You are still prototyping (complexity kills iteration speed)",
    ],
}

for category, points in use_cases.items():
    print(f"  {category}:")
    for point in points:
        print(f"    {'✓' if 'Use' in category else '✗'} {point}")
    print()

Reference Links

print("Essential Multi-Agent Reference Links:")
print()

refs = {
    "Papers": [
        ("Society of Mind (Minsky, 1986)",      "en.wikipedia.org/wiki/Society_of_Mind"),
        ("LLM-based Multi-Agent Survey",         "arxiv.org/abs/2402.01680"),
        ("AutoGen: Multi-agent conversations",   "arxiv.org/abs/2308.08155"),
        ("MetaGPT: Meta programming agents",     "arxiv.org/abs/2308.00352"),
        ("ChatDev: Software development agents", "arxiv.org/abs/2307.07924"),
    ],
    "Frameworks": [
        ("AutoGen (Microsoft)",          "github.com/microsoft/autogen"),
        ("CrewAI",                       "crewai.com"),
        ("LangGraph (stateful graphs)",  "langchain-ai.github.io/langgraph"),
        ("Semantic Kernel (Microsoft)",  "learn.microsoft.com/semantic-kernel"),
        ("Agency Swarm",                 "github.com/VRSEN/agency-swarm"),
        ("Camel-AI",                     "github.com/camel-ai/camel"),
    ],
    "Tutorials": [
        ("Anthropic multi-agent cookbook",       "github.com/anthropics/anthropic-cookbook/tree/main/patterns/agents"),
        ("DeepLearning.AI Multi-agent course",   "learn.deeplearning.ai/multi-ai-agent-systems"),
        ("LangGraph multi-agent tutorial",       "langchain-ai.github.io/langgraph/tutorials"),
        ("AutoGen docs and examples",            "microsoft.github.io/autogen"),
    ],
    "Blog Posts": [
        ("Lilian Weng: LLM Powered Autonomous Agents", "lilianweng.github.io/posts/2023-06-23-agent"),
        ("Andrej Karpathy: Software 2.0",              "karpathy.medium.com/software-2-0-a64152b37c35"),
        ("Anthropic: Building effective agents",        "anthropic.com/research/building-effective-agents"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<48} {url}")
    print()

Try This

Create multi_agent_practice.py.

Part 1: implement the orchestrator-worker pattern from scratch. Create three specialized agents: a researcher (mock web search), a summarizer, and a formatter. Give the orchestrator a goal like "Research and summarize the key concepts of reinforcement learning." Verify it delegates appropriately.

Part 2: build a sequential pipeline with four stages. Stage 1: brainstorm 10 ideas for a blog post on a technical topic. Stage 2: select the best three and outline each. Stage 3: write one paragraph for each. Stage 4: format into a complete post with headings.

Part 3: implement the critic-review loop. Write a code generation task (sort algorithm, data structure, utility function). Run 3 iterations of write-critique-improve. Does the code quality measurably improve across iterations?

Part 4: debate two real technical positions. Example: "Python is better than JavaScript for backend development." Run two rounds. Print both sides' arguments and the judge's verdict. Does the debate surface arguments you had not considered?

What's Next

Agents need memory to be truly useful across sessions. The next post covers agent memory systems: how to store past actions, how to recall relevant past experience, and how to build agents that improve over time rather than starting fresh every conversation.

101. AI Agents: When LLMs Start Taking Actions

Akhilesh — Fri, 29 May 2026 11:03:32 +0000

Everything you have built so far is reactive.

User sends a message. System processes it. System sends a response. Done.

An agent is different. An agent receives a goal, not a message. It decides what steps to take to achieve that goal. It uses tools. It observes the results. It adjusts its plan. It continues until the goal is achieved or it determines the goal cannot be achieved.

"Summarize this document" is a task. One call. One response.

"Research recent papers on transformer efficiency, write a comparison table, and save it as a CSV" is a goal. An agent needs to search the web multiple times, decide which papers are relevant, extract data from multiple sources, format it consistently, handle failures, and write to disk. Five to twenty tool calls. Dynamic decisions at each step.

This is the frontier of AI engineering. Agents are brittle. They fail in surprising ways. They are also what makes AI systems feel genuinely useful rather than just responsive.

What Makes Something an Agent

print("Agent vs Non-Agent:")
print()
print("NON-AGENT (chain/pipeline):")
print("  - Fixed sequence of steps")
print("  - Steps determined at design time")
print("  - No ability to react to intermediate results")
print("  - Predictable, debuggable, less capable")
print()
print("AGENT:")
print("  - LLM decides what to do at each step")
print("  - Steps determined at runtime based on observations")
print("  - Can loop, backtrack, try alternative approaches")
print("  - Powerful, unpredictable, capable of novel solutions")
print()

agent_properties = {
    "Perception":   "Receives inputs: user goal, tool results, memory",
    "Reasoning":    "LLM decides what to do next given current state",
    "Action":       "Executes tools, writes files, calls APIs, searches",
    "Memory":       "Maintains context across multiple steps",
    "Goal":         "Works toward a specified objective, not just responding",
}

print("The five properties of an agent:")
for prop, description in agent_properties.items():
    print(f"  {prop:<15}: {description}")

print()
print("The ReAct pattern (Reason + Act):")
print("  Thought: 'I need to find the population of Tokyo'")
print("  Action:  search_web('Tokyo population 2024')")
print("  Observation: '13.96 million in city proper, 37.4M metro'")
print("  Thought: 'I have the answer, now I can respond'")
print("  Answer:  'Tokyo's population is approximately 13.96 million...'")
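
The Thought, Action, Observation cycle above can be exercised without a model by scripting the decisions. This is a toy sketch: `react_loop`, `scripted_policy`, and the tuple-based step format are assumptions made for illustration, and the policy function stands in for the LLM.

```python
from typing import Callable, Dict, List, Tuple

# A step is ("action", tool_name, argument) or ("answer", text, "").
Step = Tuple[str, str, str]


def react_loop(policy: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               goal: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate between asking the policy and running the tool it picks."""
    trace = [f"Goal: {goal}"]
    for _ in range(max_steps):
        kind, name, arg = policy(trace)
        if kind == "answer":
            trace.append(f"Answer: {name}")
            return name, trace
        tool = tools.get(name)
        observation = tool(arg) if tool else f"error: unknown tool '{name}'"
        trace.append(f"Action: {name}({arg!r})")
        trace.append(f"Observation: {observation}")
    return "Stopped: step limit reached.", trace


def scripted_policy(trace: List[str]) -> Step:
    """Hypothetical policy: search once, then answer with the last observation."""
    prefix = "Observation: "
    observations = [line for line in trace if line.startswith(prefix)]
    if not observations:
        return ("action", "search_web", "Tokyo population")
    return ("answer", observations[-1][len(prefix):], "")


tools = {"search_web": lambda query: "13.96 million in city proper"}
answer, trace = react_loop(scripted_policy, tools, "Find Tokyo's population")
print("\n".join(trace))
```

The step limit matters even in this toy version: a policy that never answers will otherwise loop forever.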

Building an Agent From Scratch

import json
import os
from typing import List, Dict, Callable, Any, Optional
from dataclasses import dataclass, field
import anthropic

@dataclass
class Tool:
    name:        str
    description: str
    fn:          Callable
    schema:      Dict

    def to_api_format(self) -> Dict:
        return {
            "name":         self.name,
            "description":  self.description,
            "input_schema": self.schema
        }

class AgentMemory:
    def __init__(self, max_steps: int = 20):
        self.steps:     List[Dict] = []
        self.max_steps  = max_steps

    def add_step(self, role: str, content: Any):
        self.steps.append({"role": role, "content": content})

    def get_messages(self) -> List[Dict]:
        return self.steps.copy()

    def __len__(self):
        return len(self.steps)

class Agent:
    """
    A simple but complete agent using Claude with tool use.
    Implements the ReAct (Reason + Act) loop.
    """

    def __init__(self, tools: List[Tool], system_prompt: str = "",
                 model: str = "claude-3-5-haiku-20241022",
                 max_steps: int = 15, verbose: bool = True):
        self.client      = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.tools       = {t.name: t for t in tools}
        self.system      = system_prompt or self._default_system()
        self.model       = model
        self.max_steps   = max_steps
        self.verbose     = verbose

    def _default_system(self) -> str:
        return """You are a helpful AI agent. You have access to tools to help complete tasks.
Use tools when needed. Think step by step. When you have enough information to answer, respond directly.
If a task fails, explain what went wrong and what you tried."""

    def _execute_tool(self, tool_name: str, tool_input: Dict) -> str:
        if tool_name not in self.tools:
            return json.dumps({"error": f"Tool '{tool_name}' not found"})
        try:
            result = self.tools[tool_name].fn(**tool_input)
            return json.dumps(result) if not isinstance(result, str) else result
        except Exception as e:
            return json.dumps({"error": str(e)})

    def run(self, goal: str) -> str:
        memory   = AgentMemory(self.max_steps)
        api_tools = [t.to_api_format() for t in self.tools.values()]

        memory.add_step("user", goal)

        if self.verbose:
            print(f"\n{'='*60}")
            print(f"Agent Goal: {goal}")
            print(f"{'='*60}")

        for step in range(self.max_steps):
            response = self.client.messages.create(
                model      = self.model,
                max_tokens = 1024,
                system     = self.system,
                tools      = api_tools,
                messages   = memory.get_messages()
            )

            if response.stop_reason == "end_turn":
                answer = next(
                    (b.text for b in response.content if b.type == "text"), "")
                if self.verbose:
                    print(f"\n✓ Final Answer: {answer[:200]}")
                return answer

            if response.stop_reason == "tool_use":
                memory.add_step("assistant", response.content)
                tool_results = []

                for block in response.content:
                    if block.type == "tool_use":
                        if self.verbose:
                            print(f"\n[Step {step+1}] 🔧 {block.name}({json.dumps(block.input)[:80]})")

                        result = self._execute_tool(block.name, block.input)

                        if self.verbose:
                            print(f"         ↳ {result[:120]}")

                        tool_results.append({
                            "type":        "tool_result",
                            "tool_use_id": block.id,
                            "content":     result
                        })

                memory.add_step("user", tool_results)
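            else:
                # Any other stop reason (for example "max_tokens") ends the run early.
                return f"Agent stopped early (stop_reason={response.stop_reason})."

        return "Agent reached maximum steps without completing the task."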


print("Agent class built. Now we need tools.")

Building the Tool Library

import math
import datetime
import random

def calculator(expression: str) -> Dict:
    """Evaluate an arithmetic expression limited to numbers, operators, sqrt, pi, and e."""
    import re
    try:
        # Allow only digits, arithmetic operators, and the names sqrt, pi, e.
        if not re.fullmatch(r"(?:sqrt|pi|e|[0-9+\-*/()., ])+", expression):
            return {"error": "Invalid characters in expression"}
        # Builtins are disabled; only the three whitelisted names resolve.
        result = eval(expression, {"__builtins__": {}},
                      {"sqrt": math.sqrt, "pi": math.pi, "e": math.e})
        return {"result": round(float(result), 6), "expression": expression}
    except Exception as e:
        return {"error": str(e)}

def web_search(query: str, max_results: int = 3) -> Dict:
    """Simulated web search (replace with real API in production)."""
    mock_results = {
        "transformer architecture": [
            {"title": "Attention Is All You Need", "snippet": "Introduces the transformer architecture using self-attention mechanisms.", "url": "arxiv.org/abs/1706.03762"},
            {"title": "BERT paper", "snippet": "Bidirectional encoder representations from transformers for NLP.", "url": "arxiv.org/abs/1810.04805"},
        ],
        "python list comprehension": [
            {"title": "Python Docs", "snippet": "List comprehensions provide a concise way to create lists: [expr for item in iterable if condition]", "url": "docs.python.org"},
        ],
        "climate change": [
            {"title": "IPCC Report 2023", "snippet": "Global surface temperature increased by 1.1°C above pre-industrial levels.", "url": "ipcc.ch/report/ar6"},
            {"title": "NASA Climate", "snippet": "CO2 levels reached 421 ppm in 2023, highest in 3 million years.", "url": "climate.nasa.gov"},
        ],
    }
    query_lower = query.lower()
    for key, results in mock_results.items():
        if any(word in query_lower for word in key.split()):
            return {"query": query, "results": results[:max_results]}
    return {"query": query, "results": [
        {"title": f"Result for '{query}'",
         "snippet": f"Information about {query}. This is a simulated search result.",
         "url": f"example.com/search?q={query.replace(' ', '+')}"}
    ]}

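def get_current_time(timezone: str = "UTC") -> Dict:
    """Return the current time in an IANA timezone such as "UTC" or "Asia/Tokyo"."""
    from zoneinfo import ZoneInfo  # standard library, Python 3.9+
    try:
        tz = datetime.timezone.utc if timezone.upper() == "UTC" else ZoneInfo(timezone)
    except Exception:
        return {"error": f"Unknown timezone '{timezone}'"}
    now = datetime.datetime.now(tz)
    return {
        "datetime": now.strftime("%Y-%m-%d %H:%M:%S"),
        "timezone": timezone,
        "date":     now.strftime("%B %d, %Y"),
        "day":      now.strftime("%A")
    }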

def write_file(filename: str, content: str) -> Dict:
    try:
        with open(filename, "w") as f:
            f.write(content)
        return {"status": "success", "filename": filename,
                "bytes_written": len(content)}
    except Exception as e:
        return {"error": str(e)}

def read_file(filename: str) -> Dict:
    try:
        with open(filename, "r") as f:
            content = f.read()
        return {"filename": filename, "content": content,
                "lines": content.count("\n") + 1}
    except FileNotFoundError:
        return {"error": f"File '{filename}' not found"}

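def python_repl(code: str) -> Dict:
    """Execute Python code and return captured stdout.

    Warning: exec() runs the code with full builtins and no sandbox, so only
    expose this tool to trusted input or run it in an isolated environment.
    """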
    import io, contextlib
    output = io.StringIO()
    try:
        with contextlib.redirect_stdout(output):
            exec(code, {"__builtins__": __builtins__})
        return {"output": output.getvalue(), "error": None}
    except Exception as e:
        return {"output": output.getvalue(), "error": str(e)}

TOOLS = [
    Tool(
        name="calculator",
        description="Evaluate mathematical expressions. Supports +,-,*,/,(,),sqrt,pi,e",
        fn=calculator,
        schema={
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression to evaluate"}
            },
            "required": ["expression"]
        }
    ),
    Tool(
        name="web_search",
        description="Search the web for current information on any topic",
        fn=web_search,
        schema={
            "type": "object",
            "properties": {
                "query":       {"type": "string",  "description": "Search query"},
                "max_results": {"type": "integer", "description": "Number of results", "default": 3}
            },
            "required": ["query"]
        }
    ),
    Tool(
        name="get_current_time",
        description="Get the current date and time",
        fn=get_current_time,
        schema={
            "type": "object",
            "properties": {
                "timezone": {"type": "string", "description": "Timezone name", "default": "UTC"}
            }
        }
    ),
    Tool(
        name="write_file",
        description="Write text content to a file",
        fn=write_file,
        schema={
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "File name to write"},
                "content":  {"type": "string", "description": "Content to write"}
            },
            "required": ["filename", "content"]
        }
    ),
    Tool(
        name="read_file",
        description="Read content from a file",
        fn=read_file,
        schema={
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "File name to read"}
            },
            "required": ["filename"]
        }
    ),
    Tool(
        name="python_repl",
        description="Execute Python code and return the output",
        fn=python_repl,
        schema={
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"]
        }
    ),
]

print(f"Tool library ready: {len(TOOLS)} tools")
for tool in TOOLS:
    print(f"  • {tool.name}: {tool.description[:50]}")
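
Each tool declares an input schema, but `_execute_tool` passes the model's arguments straight to the function. A small pre-check can turn a confusing `TypeError` into a message the model can act on. The sketch below covers only `required` fields and top-level primitive types. `validate_tool_input` is a hypothetical helper, not part of the Anthropic SDK or a full JSON Schema validator.

```python
from typing import Any, Dict, List

# Map JSON Schema type names to the Python types accepted for them.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
               "boolean": bool, "object": dict, "array": list}


def validate_tool_input(schema: Dict[str, Any], tool_input: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in tool_input:
            errors.append(f"missing required field '{name}'")
    for name, value in tool_input.items():
        if name not in properties:
            errors.append(f"unexpected field '{name}'")
            continue
        type_name = properties[name].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or missing type: nothing to check
        # bool is a subclass of int in Python, so reject it unless a boolean is expected.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(
                f"field '{name}' should be {type_name}, got {type(value).__name__}")
    return errors


search_schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "max_results": {"type": "integer"}},
    "required": ["query"],
}
print(validate_tool_input(search_schema, {"max_results": "3"}))
```

Returning the error list as the tool result gives the model a chance to correct its call on the next step.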

Running the Agent

agent = Agent(tools=TOOLS, max_steps=10, verbose=True)

tasks = [
    "What is 15% of 847 plus the square root of 144?",
    "Search for information about the transformer architecture, then write a 3-sentence summary to a file called 'transformer_summary.txt'",
    "What day of the week is it today? Then calculate how many days until the next New Year's Day.",
]

for task in tasks[:1]:
    print(f"\n{'#'*60}")
    result = agent.run(task)
    print(f"\nResult: {result}")

Output:

============================================================
Agent Goal: What is 15% of 847 plus the square root of 144?
============================================================

[Step 1] 🔧 calculator({"expression": "847 * 0.15 + sqrt(144)"})
         ↳ {"result": 139.05, "expression": "847 * 0.15 + sqrt(144)"}

✓ Final Answer: 15% of 847 is 127.05, and the square root of 144 is 12.
The sum is 127.05 + 12 = 139.05

Multi-Step Agent: Research and Write

research_task = """
Search for information about BERT and GPT transformer models.
Compare them by searching for both separately.
Then write a markdown comparison table to a file called 'llm_comparison.md'
with columns: Model, Type, Pretraining Objective, Best Use Case.
"""

print("Running multi-step research agent:")
result = agent.run(research_task)
print(f"\nFinal result: {result}")

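try:
    file_result = read_file("llm_comparison.md")
    if "error" not in file_result:
        print("\nFile created successfully:")
        print(file_result["content"])
    else:
        print(f"\nFile not created: {file_result['error']}")
except Exception as e:   # report the failure instead of swallowing it silently
    print(f"\nCould not read llm_comparison.md: {e}")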

Agent Failure Modes: What Goes Wrong

print("\nAgent Failure Modes You Will Encounter:")
print()

failure_modes = {
    "Infinite loops": {
        "description": "Agent keeps calling the same tool expecting different results",
        "example":     "Search fails → search again → search again → max steps",
        "fix":         "Add step counter, detect repeated tool calls, add termination conditions"
    },
    "Tool hallucination": {
        "description": "Agent invents tool parameters that do not match the schema",
        "example":     "Calls calculator({'math': '2+2'}) instead of {'expression': '2+2'}",
        "fix":         "Validate inputs against schema before execution, strict schema definitions"
    },
    "Goal drift": {
        "description": "Agent pursues a sub-goal and forgets the original goal",
        "example":     "Asked to 'find a restaurant', agent spends all steps on dietary research",
        "fix":         "Include original goal in every message, add goal-check in system prompt"
    },
    "Over-tool-use": {
        "description": "Agent calls tools for things it already knows",
        "example":     "Uses calculator to compute 2+2, searches web for 'what is Python'",
        "fix":         "Better system prompt guidance, cost-awareness in tool descriptions"
    },
    "Cascading errors": {
        "description": "Early tool failure propagates through all subsequent steps",
        "example":     "File read fails → all downstream processing fails silently",
        "fix":         "Error handling in tool functions, check for error keys in results"
    },
    "Context window overflow": {
        "description": "Many tool calls accumulate and exceed context limit",
        "example":     "20+ tool calls with large results → API error",
        "fix":         "Summarize tool results, limit result size, truncate old history"
    },
}

for mode, info in failure_modes.items():
    print(f"  {mode}:")
    print(f"    What: {info['description']}")
    print(f"    Example: {info['example']}")
    print(f"    Fix: {info['fix']}")
    print()
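
The "detect repeated tool calls" fix for infinite loops can be sketched as a small counter keyed on the tool name and its arguments. `RepeatGuard` is a hypothetical helper, not part of the Agent class in this post, and the limit of two identical calls is an arbitrary choice.

```python
import json
from collections import Counter

class RepeatGuard:
    """Counts identical (tool, arguments) calls and flags when a limit is exceeded."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def should_stop(self, tool_name: str, args: dict) -> bool:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats

guard = RepeatGuard(max_repeats=2)
calls = [("web_search", {"query": "BERT"})] * 3   # the same call three times in a row

for step, (name, args) in enumerate(calls, start=1):
    if guard.should_stop(name, args):
        print(f"Step {step}: repeated call to {name}, stopping")
        break
    print(f"Step {step}: running {name}")
```

Instead of stopping outright, the loop could also return a message such as "you already tried this" as the tool result and let the model choose another approach.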

Planning Agents: Think Before Acting

PLANNING_SYSTEM = """You are a planning agent. For complex tasks:
1. First create a plan as numbered steps
2. Execute each step using available tools
3. Verify each step succeeded before proceeding
4. If a step fails, adjust the plan

Always show your reasoning before calling tools.
Format thoughts as: 'Thought: [your reasoning]'"""

planning_agent = Agent(
    tools         = TOOLS,
    system_prompt = PLANNING_SYSTEM,
    max_steps     = 15,
    verbose       = True
)

print("Planning agent for complex multi-step task:")
complex_task = """
Calculate the compound interest on $10,000 at 7% annual rate for 10 years.
Then generate a Python script that prints a table showing the balance at the end of each year.
Save the script as 'compound_interest.py'.
"""
result = planning_agent.run(complex_task)
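
To check the planning agent's answer, it helps to have the expected numbers on hand. The sketch below computes them directly, assuming interest compounds once per year (the task does not state the compounding frequency). `yearly_balances` is a name introduced here for illustration.

```python
# Reference calculation for the task above, assuming annual compounding.
def yearly_balances(principal: float, rate: float, years: int) -> list:
    """Balance at the end of each year under annual compounding."""
    return [principal * (1 + rate) ** year for year in range(1, years + 1)]

balances = yearly_balances(10_000, 0.07, 10)

for year, balance in enumerate(balances, start=1):
    print(f"Year {year:>2}: ${balance:,.2f}")

print(f"Interest earned: ${balances[-1] - 10_000:,.2f}")
```

Under that assumption the final balance is about $19,671.51, so the script the agent saves should end on the same figure.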

Agent Evaluation

import time

def evaluate_agent(agent, test_cases):
    """Evaluate agent on a set of test cases."""
    results = []
    for case in test_cases:
        start = time.perf_counter()
        try:
            answer = agent.run(case["goal"])
            success = case["check"](answer)
        except Exception as e:
            answer  = str(e)
            success = False
        elapsed = time.perf_counter() - start

        results.append({
            "goal":    case["goal"][:40],
            "success": success,
            "time":    elapsed,
            "steps":   "N/A",
        })

    print("\nAgent Evaluation:")
    print(f"{'Goal':<42} {'Success':>8} {'Time':>8}")
    print("=" * 62)
    for r in results:
        print(f"{r['goal']:<42} {'✓' if r['success'] else '✗':>8} {r['time']:>7.1f}s")

    accuracy = sum(r["success"] for r in results) / len(results)
    avg_time = sum(r["time"] for r in results) / len(results)
    print(f"\nAccuracy: {accuracy:.0%}  |  Avg time: {avg_time:.1f}s")

test_suite = [
    {
        "goal":  "Calculate 17 * 23 + 144",
        "check": lambda a: "535" in a
    },
    {
        "goal":  "Search for Python list comprehension syntax",
        "check": lambda a: any(w in a.lower() for w in ["for", "if", "[", "list"])
    },
    {
        "goal":  "Write 'Hello World' to hello.txt then read it back",
        "check": lambda a: "hello" in a.lower() or "world" in a.lower()
    },
]

evaluate_agent(agent, test_suite)

Reference Links

print("\nEssential Agent Reference Links:")
print()

refs = {
    "Papers": [
        ("ReAct: Reason + Act in LLMs",        "arxiv.org/abs/2210.03629"),
        ("Toolformer: Teaching LLMs to use tools", "arxiv.org/abs/2302.04761"),
        ("AutoGPT: Autonomous agents",          "github.com/Significant-Gravitas/AutoGPT"),
        ("AgentBench: Evaluating agents",       "arxiv.org/abs/2308.03688"),
        ("Chain-of-Thought Prompting",          "arxiv.org/abs/2201.11903"),
    ],
    "Frameworks": [
        ("LangChain Agents",        "python.langchain.com/docs/modules/agents"),
        ("LlamaIndex Agents",       "docs.llamaindex.ai/en/stable/use_cases/agents"),
        ("Anthropic Tool Use",      "docs.anthropic.com/en/docs/build-with-claude/tool-use"),
        ("OpenAI Assistants API",   "platform.openai.com/docs/assistants/overview"),
        ("CrewAI (multi-agent)",    "crewai.com"),
        ("AutoGen (Microsoft)",     "github.com/microsoft/autogen"),
    ],
    "Tutorials": [
        ("Build an AI Agent from Scratch", "towardsdatascience.com/ai-agents-from-scratch"),
        ("Anthropic Cookbook: Agents",     "github.com/anthropics/anthropic-cookbook/tree/main/tool_use"),
        ("DeepLearning.AI Agent Courses",  "learn.deeplearning.ai"),
        ("LangGraph (stateful agents)",    "langchain-ai.github.io/langgraph"),
    ],
    "Cheat Sheets": [
        ("Agent design patterns",           "lilianweng.github.io/posts/2023-06-23-agent"),
        ("Tool use best practices",         "docs.anthropic.com/en/docs/build-with-claude/tool-use"),
        ("Prompt engineering for agents",   "learnprompting.org/docs/advanced/agents"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<42} {url}")
    print()

Try This

Create agent_practice.py.

Part 1: tool library. Implement at least five tools: calculator, web search (mock), time/date, file read/write, and one domain-specific tool of your choice (weather lookup, stock prices, unit converter). Test each tool function directly before plugging into the agent.

Part 2: single-step tasks. Run the agent on five tasks that require exactly one tool call. Verify it calls the right tool with the right arguments each time.

Part 3: multi-step tasks. Run on three tasks requiring 3-5 tool calls each. Examples: "Search for X, compute a calculation on the result, save to file." Track how many steps each task takes. Does the agent complete them correctly?

Part 4: failure injection. Modify one tool to randomly fail 30% of the time. Run a task that depends on that tool 10 times. Does the agent handle failures gracefully? Does it retry? Adjust the system prompt to make it more resilient.
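
For Part 4, one way to inject failures is to wrap an existing tool function. The sketch below assumes tools report failure by returning a dict with an "error" key, as the failure-modes section suggests. `make_flaky` and `get_weather` are hypothetical names used only for this example.

```python
import functools
import random

def make_flaky(fn, failure_rate: float = 0.3, rng: random.Random = None):
    """Wrap a tool so it returns an error dict on a random fraction of calls."""
    rng = rng or random.Random()

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            return {"error": f"{fn.__name__} failed (injected)"}
        return fn(*args, **kwargs)

    return wrapper

def get_weather(city: str) -> dict:
    """Hypothetical stand-in tool with a fixed answer."""
    return {"city": city, "temp_c": 21}

# A seeded generator keeps the failure pattern reproducible between runs
flaky_weather = make_flaky(get_weather, failure_rate=0.3, rng=random.Random(0))

results = [flaky_weather("Paris") for _ in range(10)]
failures = sum("error" in r for r in results)
print(f"{failures} of 10 calls failed")
```

Register the wrapped function in place of the original tool, then watch whether the agent retries, switches tools, or gives up.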

What's Next

Single agents work alone. Multi-agent systems divide complex tasks between specialized agents: a researcher agent, a writer agent, a code reviewer agent, each doing what it does best, coordinated by an orchestrator. That is the next post.

99. Build a Chatbot With Memory

Akhilesh — Thu, 28 May 2026 05:08:49 +0000

You ask a chatbot: "What's the capital of France?"

It says: "Paris."

You ask: "What's the population there?"

It says: "Where?"

That's a stateless chatbot. Every message is treated as a completely new conversation. It has no idea what "there" refers to. It has no memory.

Real conversation doesn't work like this. Context carries forward. References accumulate. The chatbot needs to know what came before.

This post builds a chatbot with memory. One that knows what you said two messages ago, what topic you're discussing, and what decisions were made earlier.

What You'll Learn Here

Why LLMs are stateless and how to fake memory
The conversation history pattern: how it actually works
Context window limits and why they matter
Sliding window memory: keep the last N messages
Summary memory: compress old conversations
Entity memory: remember specific facts about the user
Building a full multi-turn chatbot with LangChain
Persisting memory across sessions

Why LLMs Are Stateless

Every time you call an LLM API, it starts fresh. It has zero memory of previous calls. The only context it has is what you put in the current prompt.

The trick that makes chatbots work: you include the entire conversation history in every prompt.

Turn 1:
  USER: What's the capital of France?
  → Send to LLM: "User: What's the capital of France?"
  → LLM replies: "Paris"

Turn 2:
  USER: What's the population there?
  → Send to LLM:
      "User: What's the capital of France?
       Assistant: Paris.
       User: What's the population there?"
  → LLM sees full context, knows "there" = Paris

Turn 3:
  → Send EVERYTHING from turns 1, 2, and now 3

Every message appends to a growing list. That list goes into every subsequent prompt. The LLM can refer back to it because it's in the current context.
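
The append-and-resend loop described above can be sketched in a few lines. `echo_model` is a hypothetical stand-in for a real LLM call; it only reports how much context it received.

```python
# Sketch of the append-and-resend pattern.
def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: reports how many lines of context it saw."""
    return f"(model saw {len(prompt.splitlines())} lines of context)"

history = []

def ask(user_message: str) -> str:
    history.append(("User", user_message))
    # The whole history is rendered into the prompt on every call
    prompt = "\n".join(f"{role}: {text}" for role, text in history)
    reply = echo_model(prompt)
    history.append(("Assistant", reply))
    return reply

print(ask("What's the capital of France?"))
print(ask("What's the population there?"))
```

The second call sees three lines of context (two user messages and one reply), which is what lets a real model resolve "there".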

Simple. But it has a hard limit: the context window.

The Context Window Problem

Every LLM has a maximum number of tokens it can process at once. GPT-3.5-turbo: 16k tokens. GPT-4 Turbo: 128k tokens (the original GPT-4 shipped with 8k). Llama 2 7B: 4k tokens.

A long conversation fills up that window. When the conversation exceeds the limit, you can't just include everything. You need a strategy.

# Estimate token count (rough: 1 token ≈ 4 characters for English)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def estimate_conversation_tokens(messages: list) -> int:
    total = 0
    for msg in messages:
        total += estimate_tokens(msg['content'])
        total += 4   # overhead per message (role, formatting)
    return total

# Show how fast a conversation fills up
messages = []
example_turns = [
    ("user", "Tell me about machine learning."),
    ("assistant", "Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed. It includes supervised learning, where models are trained on labeled examples, unsupervised learning, where patterns are found without labels, and reinforcement learning, where agents learn through trial and error."),
    ("user", "What about deep learning specifically?"),
    ("assistant", "Deep learning is a subset of machine learning that uses neural networks with many layers. These networks learn hierarchical representations of data, making them especially powerful for images, audio, and text. The transformer architecture, introduced in 2017, has become the foundation for most modern deep learning systems."),
    ("user", "Can you give me examples of real applications?"),
    ("assistant", "Sure! Real applications include image classification in medical diagnosis, natural language processing for translation and chatbots, recommendation systems on Netflix and Spotify, fraud detection in banking, and autonomous driving. Deep learning powers most of these through pattern recognition at scale."),
]

print(f"{'Turn':<6} {'New tokens':<14} {'Total tokens':<14} {'% of 4k limit'}")
print("-" * 50)
for role, content in example_turns:
    messages.append({'role': role, 'content': content})
    total = estimate_conversation_tokens(messages)
    new   = estimate_tokens(content)
    print(f"{len(messages):<6} {new:<14} {total:<14} {total/4000:.1%}")

Output (illustrative; the len // 4 heuristic prints somewhat different counts for these exact strings):

Turn   New tokens     Total tokens   % of 4k limit
--------------------------------------------------
1      12             16             0.4%
2      73             93             2.3%
3      13             110            2.8%
4      65             179            4.5%
5      15             198            5.0%
6      72             274            6.9%

A long conversation about a complex topic can easily hit 2000-3000 tokens. Add RAG context and system prompts, and you're at the limit fast.
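
One simple strategy is to cap the prompt by estimated tokens rather than by message count. The sketch below reuses the rough 4-characters-per-token heuristic from above. `trim_to_budget` is a name introduced here, and a real system would use the model's own tokenizer for accurate counts.

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4   # same rough heuristic as above

def trim_to_budget(messages: list, max_tokens: int) -> list:
    """Keep the newest messages whose estimated total fits within max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):                    # walk from newest to oldest
        cost = estimate_tokens(msg["content"]) + 4    # +4 for per-message overhead
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))                       # restore chronological order

# Five messages of 400 characters each, so about 104 estimated tokens per message
demo = [{"role": "user", "content": "x" * 400} for _ in range(5)]
print(len(trim_to_budget(demo, max_tokens=250)))      # prints 2
```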

Strategy 1: Sliding Window Memory

Keep only the last N messages. Simple and effective.

from collections import deque
from typing import List, Optional

class SlidingWindowChatbot:
    def __init__(self, model_pipeline, window_size: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model         = model_pipeline
        self.window_size   = window_size  # max messages to keep
        self.system_prompt = system_prompt
        self.history       = deque(maxlen=window_size)

    def chat(self, user_message: str) -> str:
        # Add user message to history
        self.history.append({'role': 'user', 'content': user_message})

        # Build the prompt with history
        messages = [
            {'role': 'system', 'content': self.system_prompt}
        ] + list(self.history)

        # Call the model (using a simple text format for demo)
        prompt = self._format_prompt(messages)
        response = self.model(prompt)

        # Add assistant response to history
        self.history.append({'role': 'assistant', 'content': response})

        return response

    def _format_prompt(self, messages: List[dict]) -> str:
        formatted = ""
        for msg in messages:
            if msg['role'] == 'system':
                formatted += f"System: {msg['content']}\n\n"
            elif msg['role'] == 'user':
                formatted += f"Human: {msg['content']}\n"
            else:
                formatted += f"Assistant: {msg['content']}\n"
        formatted += "Assistant:"
        return formatted

    def get_history(self) -> list:
        return list(self.history)

    def clear(self):
        self.history.clear()
        print("Conversation history cleared.")

# Simulate a conversation (using a mock model for demo)
def mock_model(prompt: str) -> str:
    # In production: replace with real LLM call
    # Match on the latest user message only; earlier turns stay in the prompt,
    # so matching on the whole prompt would keep returning the first answer.
    human_lines = [line for line in prompt.splitlines() if line.startswith("Human:")]
    question    = human_lines[-1].lower() if human_lines else ""
    context     = prompt.lower()

    if "capital of france" in question:
        return "The capital of France is Paris."
    elif "population" in question and "paris" in context:
        return "Paris has a population of approximately 2.1 million in the city proper, and about 12 million in the greater metropolitan area."
    elif "famous landmark" in question:
        return "Paris is famous for the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe."
    elif "eiffel tower" in question:
        return "The Eiffel Tower was built between 1887 and 1889, designed by engineer Gustave Eiffel. It stands 330 meters tall."
    else:
        return "I understand. Could you tell me more?"

bot = SlidingWindowChatbot(mock_model, window_size=6)

# Simulate multi-turn conversation
turns = [
    "What's the capital of France?",
    "What's the population there?",
    "What are some famous landmarks in that city?",
    "Tell me more about the Eiffel Tower.",
    "When was it built?",
]

for user_input in turns:
    print(f"\nUser: {user_input}")
    response = bot.chat(user_input)
    print(f"Bot:  {response}")

print(f"\nHistory has {len(bot.get_history())} messages (max {bot.window_size})")

Output:

User: What's the capital of France?
Bot:  The capital of France is Paris.

User: What's the population there?
Bot:  Paris has a population of approximately 2.1 million in the city proper...

User: What are some famous landmarks in that city?
Bot:  Paris is famous for the Eiffel Tower, the Louvre Museum...

User: Tell me more about the Eiffel Tower.
Bot:  The Eiffel Tower was built between 1887 and 1889...

User: When was it built?
Bot:  I understand. Could you tell me more?

History has 6 messages (max 6)

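The bot understands "there" (Paris) and "that city" (Paris) from context. The sliding window keeps the last 6 messages. The final turn falls through to the generic reply only because the mock has no rule for "When was it built?"; a real LLM would resolve "it" to the Eiffel Tower from the history.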
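
The eviction itself comes from `deque(maxlen=...)`: once the deque is full, each append silently drops the oldest item. A small standalone demonstration:

```python
from collections import deque

window = deque(maxlen=3)

for i in range(1, 6):
    window.append(f"message {i}")
    print(list(window))

# final window: ['message 3', 'message 4', 'message 5']
```

Note that the window counts messages, not turns. A window of 6 holds three user/assistant exchanges, and an odd window size means the oldest kept message can be an assistant reply with no matching question.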

Strategy 2: Summary Memory

When history gets long, summarize old messages and keep recent ones in full.

class SummaryMemoryChatbot:
    def __init__(self, model_pipeline, summarizer_pipeline,
                 max_recent: int = 6, summary_threshold: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model       = model_pipeline
        self.summarizer  = summarizer_pipeline
        self.max_recent  = max_recent
        self.threshold   = summary_threshold
        self.system      = system_prompt
        self.history     = []
        self.summary     = ""     # compressed memory of older turns

    def _maybe_summarize(self):
        if len(self.history) < self.threshold:
            return

        # Summarize the oldest half of history
        n_to_summarize = len(self.history) // 2
        old_messages   = self.history[:n_to_summarize]
        self.history   = self.history[n_to_summarize:]

        # Format old messages as text
        old_text = "\n".join([
            f"{m['role'].title()}: {m['content']}"
            for m in old_messages
        ])

        # Summarize (in production, call LLM to summarize)
        new_summary_input = f"{self.summary}\n\n{old_text}" if self.summary else old_text
        self.summary = self._summarize(new_summary_input)

        print(f"[Memory] Summarized {n_to_summarize} messages into summary")

    def _summarize(self, text: str) -> str:
        # In production: call LLM with a summarization prompt
        # Here: mock it. The mock ignores `text` and returns a fixed summary.
        return "[Summary of earlier conversation: The user asked about France, Paris, its population (~2.1M), and Paris landmarks including the Eiffel Tower.]"

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}\n"]

        if self.summary:
            parts.append(f"[Earlier conversation summary]: {self.summary}\n")

        for msg in self.history[-self.max_recent:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")

        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        self._maybe_summarize()

        prompt   = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})

        return response

    def memory_status(self):
        print(f"Summary: {'yes' if self.summary else 'none'}")
        print(f"Recent messages in full: {min(len(self.history), self.max_recent)}")
        print(f"Total history: {len(self.history)}")

summary_bot = SummaryMemoryChatbot(mock_model, None, max_recent=6, summary_threshold=8)

for user_input in turns * 2:  # repeat to trigger summarization
    response = summary_bot.chat(user_input)

summary_bot.memory_status()
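
The mocked `_summarize` step would, in a real system, send the old messages to a model with a summarization prompt. A minimal sketch of how such a prompt might be assembled follows. `build_summary_prompt` and its wording are assumptions for illustration, not a fixed recipe.

```python
def build_summary_prompt(previous_summary: str, old_messages: list) -> str:
    """Assemble a prompt asking a model to fold old messages into a running summary."""
    transcript = "\n".join(
        f"{m['role'].title()}: {m['content']}" for m in old_messages
    )
    parts = [
        "Summarize the conversation below in 2-3 sentences.",
        "Keep names, numbers, and decisions.",
    ]
    if previous_summary:
        # Carry the earlier summary forward so older context is not lost
        parts.append(f"Existing summary:\n{previous_summary}")
    parts.append(f"New messages:\n{transcript}")
    parts.append("Updated summary:")
    return "\n\n".join(parts)

old = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(build_summary_prompt("", old))
```

The string returned here is what you would pass to the summarizer model inside `_summarize`, storing its reply as the new `self.summary`.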

Strategy 3: Entity Memory

Extract and store specific facts about the user or conversation entities.

import re
from typing import Dict

class EntityMemoryChatbot:
    def __init__(self, model_pipeline,
                 system_prompt: str = "You are a helpful assistant."):
        self.model   = model_pipeline
        self.system  = system_prompt
        self.history = []
        self.entities: Dict[str, str] = {}   # entity store

    def _extract_entities(self, message: str):
        # Simplified entity extraction (in production: use NER model or LLM).
        # (?i:...) makes only the trigger phrase case-insensitive, so [A-Z] still
        # requires a capitalized name or place and "I'm studying" is not read as a name.
        patterns = {
            'name':     r"(?i:my name is|I am|I'm)\s+([A-Z][a-z]+)",
            'location': r"(?i:I live in|I'm from|I'm in)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)",
            'job':      r"(?i:I am a|I work as a|I'm a)\s+([a-z]+(?:\s+[a-z]+)?)",
            'topic':    r"(?i:I want to learn about|I'm studying|I need help with)\s+([A-Za-z\s]+)"
        }

        for entity_type, pattern in patterns.items():
            match = re.search(pattern, message)
            if match:
                self.entities[entity_type] = match.group(1).strip()

    def _build_entity_context(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known facts about the user:"]
        for entity, value in self.entities.items():
            lines.append(f"  - {entity}: {value}")
        return "\n".join(lines)

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}"]

        entity_ctx = self._build_entity_context()
        if entity_ctx:
            parts.append(entity_ctx)

        for msg in self.history[-8:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")

        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self._extract_entities(user_message)
        self.history.append({'role': 'user', 'content': user_message})

        prompt   = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})

        return response

# Test entity memory
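def entity_mock_model(prompt: str) -> str:
    # Match on the latest user message; the entity block keeps "name" and "Alex"
    # in every prompt, so matching on the whole prompt would always hit the first branch.
    human_lines = [line for line in prompt.splitlines() if line.startswith("Human:")]
    question    = human_lines[-1].lower() if human_lines else ""

    if "my name is" in question and "Alex" in prompt:
        return "Nice to meet you, Alex!"
    elif "recommend" in question and "Alex" in prompt:
        return "Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn."
    elif "course" in question:
        return "For machine learning, the Andrew Ng Coursera course is excellent for beginners."
    else:
        return "Tell me more about what you'd like to learn."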

entity_bot = EntityMemoryChatbot(entity_mock_model)

conversations = [
    "Hi, my name is Alex.",
    "I want to learn about machine learning.",
    "Can you recommend something?",
    "Are there any courses?",
]

for user_input in conversations:
    print(f"\nUser: {user_input}")
    response = entity_bot.chat(user_input)
    print(f"Bot:  {response}")

print(f"\nExtracted entities: {entity_bot.entities}")

Output:

User: Hi, my name is Alex.
Bot:  Nice to meet you, Alex!

User: I want to learn about machine learning.
Bot:  Tell me more about what you'd like to learn.

User: Can you recommend something?
Bot:  Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn.

User: Are there any courses?
Bot:  For machine learning, the Andrew Ng Coursera course is excellent for beginners.

Extracted entities: {'name': 'Alex', 'topic': 'machine learning'}

The bot remembers the user's name and topic across all turns.
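
Regex patterns miss anything phrased differently from the trigger phrases. The comment in `_extract_entities` suggests using an LLM instead; one possible shape for that is sketched below. The prompt wording, the `extract_entities_with_llm` helper, and `stub_model` are all assumptions for illustration. The stub returns a fixed JSON string so the example runs without an API key.

```python
import json

ALLOWED_KEYS = {"name", "location", "job", "topic"}

def extract_entities_with_llm(model, message: str) -> dict:
    """Ask a model for user facts as JSON and keep only the expected keys."""
    prompt = (
        "Extract facts about the user from the message as a JSON object "
        "with optional keys name, location, job, topic. Return {} if none.\n"
        f"Message: {message}\nJSON:"
    )
    raw = model(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}            # invalid JSON from the model: leave the store unchanged
    if not isinstance(data, dict):
        return {}
    return {k: str(v) for k, v in data.items() if k in ALLOWED_KEYS and v}

def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return '{"name": "Alex", "topic": "machine learning", "mood": "curious"}'

entities = {}
entities.update(
    extract_entities_with_llm(stub_model, "Hi, I'm Alex and I'm studying machine learning.")
)
print(entities)   # {'name': 'Alex', 'topic': 'machine learning'}
```

Filtering to a fixed key set keeps unexpected model output (here, "mood") out of the entity store.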

Full Chatbot With the OpenAI API

import openai
import json
from datetime import datetime

class ProductionChatbot:
    def __init__(
        self,
        system_prompt: str = "You are a helpful AI assistant.",
        model: str = "gpt-3.5-turbo",
        max_history: int = 20,
        max_tokens: int = 500,
        temperature: float = 0.7
    ):
        self.client      = openai.OpenAI()
        self.model       = model
        self.max_history = max_history
        self.max_tokens  = max_tokens
        self.temperature = temperature
        self.history     = []
        self.system      = system_prompt
        self.created_at  = datetime.now()

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})

        # Trim history if too long
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        # Build message list for API
        messages = [
            {'role': 'system', 'content': self.system}
        ] + self.history

        # Call API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )

        assistant_message = response.choices[0].message.content
        self.history.append({'role': 'assistant', 'content': assistant_message})

        return assistant_message

    def save_conversation(self, filepath: str):
        data = {
            'created_at': self.created_at.isoformat(),
            'saved_at':   datetime.now().isoformat(),
            'model':      self.model,
            'system':     self.system,
            'messages':   self.history
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(self.history)} messages to {filepath}")

    def load_conversation(self, filepath: str):
        with open(filepath, 'r') as f:
            data = json.load(f)
        self.history = data['messages']
        self.system  = data.get('system', self.system)
        print(f"Loaded {len(self.history)} messages from {filepath}")

    def reset(self):
        self.history = []
        print("Conversation reset.")

    def get_stats(self) -> dict:
        n_user      = sum(1 for m in self.history if m['role'] == 'user')
        n_assistant = sum(1 for m in self.history if m['role'] == 'assistant')
        total_chars = sum(len(m['content']) for m in self.history)

        return {
            'turns':            n_user,
            'total_messages':   len(self.history),
            'estimated_tokens': total_chars // 4,
            'history_depth':    len(self.history)
        }

# Usage
# bot = ProductionChatbot(
#     system_prompt="You are a helpful ML tutor specializing in practical examples.",
#     model="gpt-3.5-turbo",
#     max_history=20
# )
# response = bot.chat("Explain overfitting to me.")
# print(response)
# bot.save_conversation('session_001.json')
print("ProductionChatbot ready (requires OPENAI_API_KEY)")

LangChain Memory: The Easy Way

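# Note: ConversationChain and the Conversation*Memory classes are deprecated in recent
# LangChain releases (the docs point to RunnableWithMessageHistory or LangGraph
# persistence instead). This example targets the older API.
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory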
from langchain.chains import ConversationChain
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# Create LLM
gen_pipe = hf_pipeline('text-generation', model='gpt2', max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=gen_pipe)

# Buffer memory: keeps all messages
buffer_memory = ConversationBufferMemory()

# Summary memory: automatically summarizes when too long
# summary_memory = ConversationSummaryMemory(llm=llm)

# Build conversation chain
conversation = ConversationChain(
    llm=llm,
    memory=buffer_memory,
    verbose=False
)

# Chat
result = conversation.predict(input="Hello, my name is Alex.")
print(f"Bot: {result[:100]}...")

result = conversation.predict(input="What is my name?")
print(f"Bot: {result[:100]}...")

# Inspect memory
print(f"\nMemory buffer:\n{buffer_memory.buffer}")

Persisting Memory Across Sessions

import json
import os
from datetime import datetime   # used by _save_session for the last_updated timestamp

class PersistentChatbot:
    def __init__(self, model_pipeline, session_id: str,
                 storage_dir: str = './chat_sessions',
                 max_history: int = 50):
        self.model       = model_pipeline
        self.session_id  = session_id
        self.storage_dir = storage_dir
        self.max_history = max_history
        self.history     = []
        self.metadata    = {}

        os.makedirs(storage_dir, exist_ok=True)
        self._load_session()

    def _session_path(self) -> str:
        return os.path.join(self.storage_dir, f"{self.session_id}.json")

    def _load_session(self):
        path = self._session_path()
        if os.path.exists(path):
            with open(path, 'r') as f:
                data = json.load(f)
            self.history  = data.get('history', [])
            self.metadata = data.get('metadata', {})
            print(f"Loaded session '{self.session_id}' with {len(self.history)} messages")
        else:
            print(f"New session '{self.session_id}' started")

    def _save_session(self):
        data = {
            'session_id':   self.session_id,
            'last_updated': datetime.now().isoformat(),
            'history':      self.history,
            'metadata':     self.metadata
        }
        with open(self._session_path(), 'w') as f:
            json.dump(data, f, indent=2)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})

        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        response = self.model(self._format_prompt())
        self.history.append({'role': 'assistant', 'content': response})
        self._save_session()

        return response

    def _format_prompt(self) -> str:
        parts = []
        for msg in self.history[-10:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def list_sessions(self) -> list:
        sessions = []
        for f in os.listdir(self.storage_dir):
            if f.endswith('.json'):
                sessions.append(f.replace('.json', ''))
        return sessions

# Usage
persistent_bot = PersistentChatbot(mock_model, session_id='user_alex_001')
persistent_bot.chat("What's the capital of France?")
persistent_bot.chat("What's the population there?")

print(f"\nSaved sessions: {persistent_bot.list_sessions()}")
print(f"History length: {len(persistent_bot.history)} messages")
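
`_save_session` rewrites the session file after every turn, so a crash in the middle of a write could leave a truncated JSON file behind. One common safeguard, sketched here as an optional variation and not part of the class above, is to write to a temporary file and then swap it into place. `save_json_atomic` is a hypothetical helper name.

```python
import json
import os
import tempfile

def save_json_atomic(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        # os.replace overwrites the target in one step; on POSIX this is atomic
        # when both paths are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)   # clean up the partial temp file
        raise

demo_dir  = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "session.json")
save_json_atomic(demo_path, {"history": [{"role": "user", "content": "hi"}]})

with open(demo_path) as f:
    print(json.load(f))
```

With this approach a reader sees either the old file or the new one, never a half-written one.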

Chatbot Quality Checklist

checklist = {
    "Memory management": [
        "Does the bot remember context from 5+ turns ago?",
        "Does it handle coreferences correctly? ('there', 'it', 'they')",
        "Does it avoid repeating information the user already gave?"
    ],
    "Context window": [
        "Does it handle very long conversations without breaking?",
        "Is there a graceful fallback when history is too long?",
        "Are summarized messages accurate and not lossy?"
    ],
    "Conversation quality": [
        "Does it stay on topic through the conversation?",
        "Does it refer to earlier decisions correctly?",
        "Does it handle topic switches gracefully?"
    ],
    "Persistence": [
        "Does it save conversations for later use?",
        "Can it resume from a previous session?",
        "Is the storage format readable and debuggable?"
    ],
    "Edge cases": [
        "What happens if the user asks about something not in memory?",
        "What happens if the user contradicts themselves?",
        "Does it handle very short or very long user messages?"
    ]
}

for category, items in checklist.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  [ ] {item}")

Quick Cheat Sheet

Memory type	When to use	How it works
Buffer (all history)	Short conversations	Keep all messages, pass everything
Sliding window	Medium conversations	Keep last N messages only
Summary memory	Long conversations	Summarize old messages, keep recent in full
Entity memory	User-specific facts	Extract and store named entities
Persistent memory	Multi-session chatbots	Save/load from disk or database

Pattern	Code
Add to history	`history.append({'role': 'user', 'content': msg})`
Trim history	`history = history[-max_size:]`
Build messages	`[{'role': 'system', 'content': system}] + history`
Save session	`json.dump({'history': history}, f)`
Load session	`history = json.load(f)['history']`
LangChain buffer	`ConversationBufferMemory()`
LangChain summary	`ConversationSummaryMemory(llm=llm)`
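
The patterns in the table can be combined into one small helper. The following is a minimal sketch, assuming a hypothetical `SessionHistory` class (not part of the series code) and one JSON file per session:

```python
import json
import os
import tempfile

class SessionHistory:
    """Hypothetical helper: sliding-window history with JSON persistence."""

    def __init__(self, max_size: int = 6):
        self.max_size = max_size
        self.history = []

    def add(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
        # Sliding window: keep only the most recent max_size messages
        self.history = self.history[-self.max_size:]

    def build_messages(self, system: str) -> list:
        # System prompt first, then the trimmed history
        return [{"role": "system", "content": system}] + self.history

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"history": self.history}, f)

    def load(self, path: str) -> None:
        with open(path, encoding="utf-8") as f:
            self.history = json.load(f)["history"]


session = SessionHistory(max_size=4)
for i in range(6):
    session.add("user", f"message {i}")

path = os.path.join(tempfile.gettempdir(), "demo_session.json")
session.save(path)

restored = SessionHistory(max_size=4)
restored.load(path)
print(len(restored.history))            # 4
print(restored.history[0]["content"])   # message 2
```

The window size and file location here are illustrative; pick values that match your model's context limit and storage setup.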

Practice Challenges

Level 1:
Build a SlidingWindowChatbot that talks to GPT-2 locally. Have a 10-turn conversation about a topic of your choice. Print the full history at the end. Verify the bot correctly references things from earlier turns.

Level 2:
Implement SummaryMemoryChatbot with a real summarization call. After every 8 turns, summarize the first half using a small T5 model. Test with a 20-turn conversation. Print the summary after it triggers. Is the summary accurate?

Level 3:
Build PersistentChatbot that stores conversations to disk. Start a conversation, close it, restart the program, load the session, and continue the conversation. Verify the bot remembers what was said in the previous session. Add a /history command that prints a summary of previous sessions.

References

Final post, Post 100: OpenAI API: Build With GPT-4. API setup, chat completions, function calling, streaming, and cost management. The last post in the series wraps everything together.

100. OpenAI API: Build With GPT-4 (Post 100: The Final Chapter)

Akhilesh — Thu, 28 May 2026 05:08:19 +0000

Post 1 was Python variables.

Post 100 is GPT-4.

You've come from writing your first for loop to understanding transformer architectures, building neural networks from scratch, fine-tuning LLMs with LoRA, building RAG pipelines, and creating chatbots with memory.

This final post puts it all together. The OpenAI API is how most people actually ship AI products. Chat completions, function calling, streaming, vision, embeddings. Everything you need to build something real.

Let's finish strong.

What You'll Learn Here

API setup and authentication
Chat completions: the core pattern
System prompts: controlling model behavior
Function calling: giving LLMs tools
Streaming: responses that appear word by word
Vision: analyzing images with GPT-4V
Embeddings via API: fast, high quality
Token counting and cost management
Rate limits and error handling
A complete project: an AI assistant with tools

Setup

pip install openai tiktoken

import openai
import os

# Set your API key
# Option 1: environment variable (recommended)
# export OPENAI_API_KEY='sk-...'

# Option 2: set directly (never commit this to git)
# openai.api_key = 'sk-...'

client = openai.OpenAI()  # reads OPENAI_API_KEY from environment

# Test the connection
models = client.models.list()
print("Connected to OpenAI API")
print(f"Available models (sample): {[m.id for m in list(models)[:5]]}")

Chat Completions: The Core Pattern

Every interaction with GPT-4 goes through the same API call. Messages are a list of role-content pairs: system, user, and assistant.

# Simplest possible call
response = client.chat.completions.create(
    model="gpt-4o-mini",   # fast and cheap for most tasks
    messages=[
        {"role": "user", "content": "What is machine learning in one sentence?"}
    ]
)

print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
print(f"Model: {response.model}")

Output:

Machine learning is a field of AI where computers learn patterns from data to make predictions or decisions without being explicitly programmed for each task.

Tokens used: 52
Model: gpt-4o-mini

# With system prompt and multiple turns
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a senior ML engineer who explains concepts clearly and concisely. Use analogies when helpful. Never use jargon without explaining it."
        },
        {
            "role": "user",
            "content": "Explain overfitting."
        }
    ],
    temperature=0.7,        # creativity (0=deterministic, 2=very random)
    max_tokens=300,         # cap the response length
    top_p=0.95,             # nucleus sampling
    frequency_penalty=0.1,  # penalize repeated tokens
    presence_penalty=0.1,   # penalize already-mentioned topics
)

print(response.choices[0].message.content)

System Prompts: Controlling Model Behavior

The system prompt is the single most powerful tool for shaping GPT-4's behavior.

system_prompts = {
    "concise_explainer": """
You explain technical concepts in 3 sentences or less.
Always use a concrete real-world example.
Never use bullet points.
""",
    "code_reviewer": """
You are a senior Python engineer doing code review.
Point out bugs, style issues, and performance problems.
Format your response as:
BUGS: (list any bugs)
STYLE: (list style issues)
PERFORMANCE: (list performance concerns)
SUGGESTIONS: (overall recommendation)
""",
    "socratic_tutor": """
You are a Socratic tutor. Never give direct answers.
Instead, guide the student with questions that help them discover the answer themselves.
When they get something right, affirm it and ask a deeper follow-up question.
""",
    "strict_json": """
You always respond in valid JSON format only.
No markdown. No explanation. Just raw JSON.
Never include anything outside the JSON structure.
"""
}

# Test the code reviewer persona
code_to_review = """
def get_user_data(user_ids):
    results = []
    for id in user_ids:
        data = database.query(f"SELECT * FROM users WHERE id = {id}")
        results.append(data)
    return results
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",  "content": system_prompts["code_reviewer"]},
        {"role": "user",    "content": f"Review this code:\n```python{code_to_review}```"}
    ]
)
print(response.choices[0].message.content)

Structured Output: JSON Mode

Use JSON mode when you need responses you can parse reliably. The messages must mention JSON somewhere, as the system prompt below does.

import json

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Extract structured information from the text. Return JSON only."
        },
        {
            "role": "user",
            "content": """
Extract the following from this text:
- person_name
- job_title
- company
- key_skills (list)

Text: "Sarah Chen is a Senior ML Engineer at Anthropic. She specializes in 
transformer architectures, reinforcement learning from human feedback, and 
large-scale distributed training."
"""
        }
    ],
    response_format={"type": "json_object"}   # guarantees syntactically valid JSON, not a particular schema
)

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))

Output:

{
  "person_name": "Sarah Chen",
  "job_title": "Senior ML Engineer",
  "company": "Anthropic",
  "key_skills": [
    "transformer architectures",
    "reinforcement learning from human feedback",
    "large-scale distributed training"
  ]
}
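
JSON mode constrains the output to valid JSON syntax, but it does not guarantee that the keys you asked for are present. The following is a minimal defensive parser, assuming a hypothetical `parse_json_response` helper and the four keys requested in the prompt above:

```python
import json

# Keys requested in the extraction prompt
REQUIRED_KEYS = {"person_name", "job_title", "company", "key_skills"}

def parse_json_response(raw: str, required=REQUIRED_KEYS) -> dict:
    """Parse model output and check that the expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model did not return valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = set(required) - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data

sample = (
    '{"person_name": "Sarah Chen", "job_title": "Senior ML Engineer", '
    '"company": "Anthropic", "key_skills": ["transformer architectures"]}'
)
print(parse_json_response(sample)["company"])   # Anthropic
```

Raising `ValueError` keeps the failure explicit, so the caller can decide whether to retry the request or fall back.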

Function Calling: Giving GPT-4 Tools

Function calling lets the model request tools (functions) that you define. The model decides when to call a function and with what arguments. You execute it and send results back.

This is how AI agents work.

import json

# Define tools the model can use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'London' or 'Tokyo'"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature units"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search the company knowledge base for relevant documents",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results to return",
                        "default": 3
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Mock function implementations
def get_weather(city: str, units: str = "celsius") -> dict:
    # In production: call a real weather API
    return {
        "city": city,
        "temperature": 18,
        "units": units,
        "condition": "partly cloudy",
        "humidity": "65%"
    }

def search_documents(query: str, max_results: int = 3) -> list:
    # In production: call your vector database
    return [
        {"title": f"Document about {query}", "snippet": f"Relevant content for: {query}", "score": 0.92},
        {"title": f"Guide to {query}", "snippet": f"Comprehensive overview of {query}", "score": 0.87},
    ][:max_results]

# Dispatch function calls
def execute_function(name: str, arguments: dict):
    if name == "get_weather":
        return get_weather(**arguments)
    elif name == "search_documents":
        return search_documents(**arguments)
    else:
        return {"error": f"Unknown function: {name}"}

# Full function calling loop
def agent_chat(user_message: str) -> str:
    messages = [
        {"role": "system",  "content": "You are a helpful assistant with access to weather data and a document search tool. Use tools when needed."},
        {"role": "user",    "content": user_message}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
            tool_choice="auto"   # model decides when to use tools
        )

        message = response.choices[0].message

        # If no tool calls: we have the final answer
        if not message.tool_calls:
            return message.content

        # Process tool calls
        messages.append(message)   # add assistant message with tool_calls

        for tool_call in message.tool_calls:
            function_name = tool_call.function.name
            arguments     = json.loads(tool_call.function.arguments)

            print(f"  [Tool call] {function_name}({arguments})")

            result = execute_function(function_name, arguments)

            messages.append({
                "role":         "tool",
                "tool_call_id": tool_call.id,
                "content":      json.dumps(result)
            })

# Test it
queries = [
    "What's the weather like in Tokyo right now?",
    "Search for documents about machine learning best practices.",
    "What's the weather in Paris and London? Compare them.",
]

for query in queries:
    print(f"\nUser: {query}")
    answer = agent_chat(query)
    print(f"Bot:  {answer}")

Output:

User: What's the weather like in Tokyo right now?
  [Tool call] get_weather({'city': 'Tokyo', 'units': 'celsius'})
Bot:  The current weather in Tokyo is 18°C and partly cloudy, with humidity at 65%.

User: Search for documents about machine learning best practices.
  [Tool call] search_documents({'query': 'machine learning best practices', 'max_results': 3})
Bot:  I found 2 relevant documents about machine learning best practices...

User: What's the weather in Paris and London? Compare them.
  [Tool call] get_weather({'city': 'Paris', 'units': 'celsius'})
  [Tool call] get_weather({'city': 'London', 'units': 'celsius'})
Bot:  Both Paris and London currently show 18°C with partly cloudy conditions...
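
Because the model writes the arguments, a tool call can name an unknown function or pass malformed arguments. One way to keep the loop alive is to return the error to the model as the tool result. The following is a sketch under that assumption; `TOOL_REGISTRY` and `safe_execute` are hypothetical names, and `get_weather` is a stand-in for the mock above:

```python
import json

def get_weather(city: str, units: str = "celsius") -> dict:
    # Stand-in for the mock implementation shown earlier
    return {"city": city, "temperature": 18, "units": units}

# Registry: tool name -> Python callable
TOOL_REGISTRY = {"get_weather": get_weather}

def safe_execute(name: str, raw_arguments: str) -> str:
    """Run a tool call and always return a JSON string for the 'tool' message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        arguments = json.loads(raw_arguments)
        return json.dumps(func(**arguments))
    except (json.JSONDecodeError, TypeError) as e:
        # Malformed JSON, or unexpected/missing keyword arguments
        return json.dumps({"error": f"Bad arguments for {name}: {e}"})

print(safe_execute("get_weather", '{"city": "Tokyo"}'))
print(safe_execute("get_weather", '{"town": "Tokyo"}'))
print(safe_execute("get_stock", "{}"))
```

A registry also means adding a tool is one dictionary entry rather than another `elif` branch.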

Streaming: Word-by-Word Responses

Instead of waiting for the full response, stream it token by token. Total generation time stays the same, but the first words appear sooner, so the UI feels much faster.

import sys

def stream_response(user_message: str, system: str = "You are a helpful assistant."):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": user_message}
        ],
        stream=True   # enable streaming
    )

    full_response = ""
    print("Bot: ", end="", flush=True)

    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            token = chunk.choices[0].delta.content
            print(token, end="", flush=True)
            full_response += token

    print()   # newline at end
    return full_response

response = stream_response("Explain gradient descent in 3 sentences.")
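
The accumulation logic can be tested without an API call. The following is a sketch that assumes fake chunk objects built with `SimpleNamespace`, mirroring only the `choices[0].delta.content` path used above; `fake_chunk` and `collect_stream` are hypothetical names:

```python
from types import SimpleNamespace

def fake_chunk(content):
    """Stand-in with the same attribute path as a streamed chunk (illustrative only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def collect_stream(chunks) -> str:
    """Join the text deltas of a stream, skipping chunks that carry no text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:        # guard against chunks with no choices
            continue
        token = chunk.choices[0].delta.content
        if token is not None:
            parts.append(token)
    return "".join(parts)

chunks = [
    fake_chunk("Gradient "),
    fake_chunk(None),
    fake_chunk("descent"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(chunks))   # Gradient descent
```

Collecting parts in a list and joining once avoids repeated string concatenation on long responses.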

Vision: Analyze Images With GPT-4V

import base64
from pathlib import Path

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, question: str) -> str:
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",   # requires a vision-capable model such as gpt-4o
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"   # or "low" for faster/cheaper
                        }
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        max_tokens=300
    )

    return response.choices[0].message.content

# Also works with image URLs directly
def analyze_image_url(url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": url}},
                    {"type": "text",      "text": question}
                ]
            }
        ]
    )
    return response.choices[0].message.content

# Example
# result = analyze_image_url(
#     "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Eiffel_Tower.jpg/640px-Eiffel_Tower.jpg",
#     "What is in this image? Describe it in detail."
# )
print("Vision API ready - pass an image path or URL with your question")
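
The `analyze_image` function above hardcodes `image/jpeg` in the data URL, which is inaccurate for PNG or WebP files. The following is a small sketch that derives the MIME type from the file extension; `build_data_url` is a hypothetical helper, and the demo file holds placeholder bytes rather than a real image:

```python
import base64
import mimetypes
import os
import tempfile

def build_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Demo with placeholder bytes (only the URL prefix matters here)
path = os.path.join(tempfile.gettempdir(), "demo.png")
with open(path, "wb") as f:
    f.write(b"placeholder bytes")

print(build_data_url(path)[:22])   # data:image/png;base64,
```

The result can be passed as the `url` value in the `image_url` content part.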

Embeddings via API

# High-quality embeddings from OpenAI
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list:
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

def get_embeddings_batch(texts: list, model: str = "text-embedding-3-small") -> list:
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

# Available embedding models
embedding_models = {
    "text-embedding-3-small": {"dim": 1536, "cost": "$0.02/1M tokens", "quality": "good"},
    "text-embedding-3-large": {"dim": 3072, "cost": "$0.13/1M tokens", "quality": "best"},
    "text-embedding-ada-002": {"dim": 1536, "cost": "$0.10/1M tokens", "quality": "older"},
}

print("OpenAI Embedding Models:")
for model, info in embedding_models.items():
    print(f"  {model}: dim={info['dim']}, cost={info['cost']}, quality={info['quality']}")

# Example usage
# embedding = get_embedding("Machine learning is fascinating.")
# print(f"Embedding dimension: {len(embedding)}")
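
Embeddings are usually compared with cosine similarity. The following is a minimal sketch in pure Python, using toy 3-dimensional vectors in place of real API embeddings; `cosine_similarity` and the document names are illustrative:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings returned by get_embeddings_batch
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

# Rank documents from most to least similar to the query
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)   # ['doc_a', 'doc_c', 'doc_b']
```

For more than a few thousand vectors, a NumPy implementation or a vector database will be much faster than this loop.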

Token Counting and Cost Management

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def count_message_tokens(messages: list, model: str = "gpt-4o-mini") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    tokens_per_message = 3   # every message has role + content overhead
    tokens_per_name    = 1

    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(encoding.encode(str(value)))
            if key == "name":
                total += tokens_per_name

    total += 3   # every reply primed with <|start|>assistant<|message|>
    return total

# Pricing (as of 2024, check openai.com/pricing for current rates)
pricing = {
    "gpt-4o":           {"input": 5.00,  "output": 15.00},   # per 1M tokens
    "gpt-4o-mini":      {"input": 0.15,  "output": 0.60},
    "gpt-3.5-turbo":    {"input": 0.50,  "output": 1.50},
    "text-embedding-3-small": {"input": 0.02, "output": 0},
}

def estimate_cost(n_input_tokens: int, n_output_tokens: int, model: str) -> float:
    if model not in pricing:
        return 0
    p        = pricing[model]
    cost_in  = (n_input_tokens / 1_000_000) * p["input"]
    cost_out = (n_output_tokens / 1_000_000) * p["output"]
    return cost_in + cost_out

# Example: estimate cost before making a call
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain transformer architecture in detail."}
]
model = "gpt-4o-mini"

input_tokens     = count_message_tokens(messages, model)
estimated_output = 300   # estimate based on max_tokens

cost = estimate_cost(input_tokens, estimated_output, model)

print(f"Input tokens:    {input_tokens}")
print(f"Estimated output: {estimated_output}")
print(f"Estimated cost:  ${cost:.6f}")

# Track actual usage across calls
class UsageTracker:
    def __init__(self):
        self.total_input_tokens  = 0
        self.total_output_tokens = 0
        self.total_calls         = 0
        self.model_usage         = {}

    def track(self, response, model: str):
        usage = response.usage
        self.total_input_tokens  += usage.prompt_tokens
        self.total_output_tokens += usage.completion_tokens
        self.total_calls         += 1

        if model not in self.model_usage:
            self.model_usage[model] = {'input': 0, 'output': 0, 'calls': 0}
        self.model_usage[model]['input']  += usage.prompt_tokens
        self.model_usage[model]['output'] += usage.completion_tokens
        self.model_usage[model]['calls']  += 1

    def report(self):
        print(f"Total API calls:    {self.total_calls}")
        print(f"Total input tokens: {self.total_input_tokens:,}")
        print(f"Total output tokens:{self.total_output_tokens:,}")
        total_cost = sum(
            estimate_cost(info['input'], info['output'], model)
            for model, info in self.model_usage.items()
        )
        print(f"Estimated cost:     ${total_cost:.4f}")

tracker = UsageTracker()
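
The same arithmetic can be turned around to answer a budgeting question: how many calls fit in a fixed budget? The following is a worked sketch using the gpt-4o-mini prices from the table above (2024 figures, so treat them as assumptions) and an assumed call size of 500 input and 300 output tokens:

```python
# Prices per 1M tokens for gpt-4o-mini, copied from the pricing table above
PRICE_IN, PRICE_OUT = 0.15, 0.60

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens / 1M * price per million, for each direction."""
    return (input_tokens / 1_000_000) * PRICE_IN + (output_tokens / 1_000_000) * PRICE_OUT

per_call = cost_per_call(500, 300)      # assumed typical call size
budget = 5.00                           # assumed monthly budget in dollars
calls_in_budget = int(budget / per_call)

print(f"Per call: ${per_call:.6f}")           # Per call: $0.000255
print(f"Calls per $5: {calls_in_budget:,}")   # Calls per $5: 19,607
```

Output tokens dominate here: 300 output tokens cost more than 500 input tokens, which is why capping response length matters.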

Error Handling and Rate Limits

import time
import random
from openai import RateLimitError, APIError, APIConnectionError

def robust_api_call(messages: list, model: str = "gpt-4o-mini",
                    max_retries: int = 3, **kwargs) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            return response.choices[0].message.content

        except RateLimitError as e:
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise

        except APIConnectionError as e:
            if attempt < max_retries - 1:
                print(f"Connection error. Retrying... ({attempt+1}/{max_retries})")
                time.sleep(2)
            else:
                raise

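        except APIError as e:
            # Not every APIError carries status_code, so read it defensively; retry on any 5xx
            status = getattr(e, "status_code", None)
            if status is not None and status >= 500 and attempt < max_retries - 1: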
                print(f"Server error. Retrying...")
                time.sleep(1)
            else:
                raise

    raise Exception("Max retries exceeded")

# Usage
# result = robust_api_call(
#     messages=[{"role": "user", "content": "Hello"}],
#     model="gpt-4o-mini",
#     temperature=0.7
# )
print("Robust API call function ready with exponential backoff")
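
To see what the backoff schedule looks like, the wait times can be computed without touching the API. The following sketch uses the same formula as above, `2 ** attempt` plus up to one second of jitter. The `cap` parameter and the `backoff_delays` name are additions for illustration, not part of the function above:

```python
import random

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0, rng=None) -> list:
    """Seconds to wait before each retry: base * 2^attempt plus up to 1s jitter, capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries - 1):   # no wait after the final attempt
        delay = base * (2 ** attempt) + rng.uniform(0, 1)
        delays.append(min(delay, cap))
    return delays

# Seeded generator so the demo is repeatable
print([round(d, 1) for d in backoff_delays(5, rng=random.Random(0))])
```

The jitter spreads retries from many clients over time, and the cap stops the wait from growing without bound when `max_retries` is large.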

Complete Project: An AI Assistant With Tools and Memory

Bringing it all together: an assistant that remembers the conversation, calls tools, and tracks its own usage.

import json
import time
from typing import List, Optional
from collections import deque

class AIAssistant:
    def __init__(
        self,
        name: str = "Assistant",
        system_prompt: str = "You are a helpful AI assistant.",
        model: str = "gpt-4o-mini",
        max_history: int = 20,
        tools: Optional[list] = None,
        temperature: float = 0.7
    ):
        self.client      = openai.OpenAI()
        self.name        = name
        self.model       = model
        self.temperature = temperature
        self.tools       = tools or []
        self.history     = deque(maxlen=max_history)
        self.system      = system_prompt
        self.usage       = {"calls": 0, "tokens": 0}

    def _get_messages(self) -> list:
        return [{"role": "system", "content": self.system}] + list(self.history)

    def _execute_tool(self, name: str, args: dict) -> str:
        # Override this method to add your own tools
        return json.dumps({"error": f"Tool '{name}' not implemented"})

    def chat(self, user_message: str, stream: bool = False) -> str:
        self.history.append({"role": "user", "content": user_message})

        kwargs = {
            "model":       self.model,
            "messages":    self._get_messages(),
            "temperature": self.temperature,
        }
        if self.tools:
            kwargs["tools"]        = self.tools
            kwargs["tool_choice"]  = "auto"
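        # Streaming is only handled in the no-tools branch below; with tools, make a normal call
        if stream and not self.tools: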
            kwargs["stream"] = True

        # Handle function calling loop
        while True:
            if stream and not self.tools:
                # Stream without tools
                response_text = ""
                stream_resp   = self.client.chat.completions.create(**kwargs)
                print(f"{self.name}: ", end="", flush=True)
                for chunk in stream_resp:
                    if chunk.choices[0].delta.content:
                        token = chunk.choices[0].delta.content
                        print(token, end="", flush=True)
                        response_text += token
                print()
                self.history.append({"role": "assistant", "content": response_text})
                return response_text

            # Non-streaming or tools
            response = self.client.chat.completions.create(**kwargs)
            message  = response.choices[0].message

            self.usage["calls"]  += 1
            self.usage["tokens"] += response.usage.total_tokens

            if not message.tool_calls:
                self.history.append({"role": "assistant", "content": message.content})
                return message.content

            # Process tool calls
            self.history.append(message)
            kwargs["messages"] = self._get_messages()

            for tool_call in message.tool_calls:
                fn_name = tool_call.function.name
                fn_args = json.loads(tool_call.function.arguments)

                result = self._execute_tool(fn_name, fn_args)

                self.history.append({
                    "role":         "tool",
                    "tool_call_id": tool_call.id,
                    "content":      result
                })

            kwargs["messages"] = self._get_messages()

    def summarize_history(self) -> str:
        if not self.history:
            return "No conversation history."

        summary_prompt = "Summarize this conversation in 2-3 sentences:\n\n"
        for msg in self.history:
            if isinstance(msg, dict) and msg.get('role') in ['user', 'assistant']:
                role = msg['role'].title()
                content = msg.get('content', '')
                if content:
                    summary_prompt += f"{role}: {content[:100]}...\n"

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=150
        )
        return response.choices[0].message.content

    def clear(self):
        self.history.clear()

    def stats(self) -> dict:
        return {
            "name":    self.name,
            "model":   self.model,
            "calls":   self.usage["calls"],
            "tokens":  self.usage["tokens"],
            "history": len(self.history),
        }


# Concrete assistant with tools
class MLTutorAssistant(AIAssistant):
    def __init__(self):
        super().__init__(
            name="ML Tutor",
            system_prompt="""You are an expert ML tutor who has read 'How Machines Learn: A Complete Guide from Zero to AI Engineer'.
You teach clearly with concrete examples and code snippets.
You remember what the student has learned so far in this session.
When relevant, reference concepts from earlier in the conversation.""",
            model="gpt-4o-mini",
            max_history=20,
            tools=[
                {
                    "type": "function",
                    "function": {
                        "name": "get_code_example",
                        "description": "Get a working Python code example for an ML concept",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "concept": {"type": "string", "description": "The ML concept to get code for"},
                                "difficulty": {"type": "string", "enum": ["beginner", "intermediate", "advanced"]}
                            },
                            "required": ["concept"]
                        }
                    }
                }
            ]
        )

    def _execute_tool(self, name: str, args: dict) -> str:
        if name == "get_code_example":
            concept    = args.get("concept", "")
            difficulty = args.get("difficulty", "beginner")
            # In production: pull from a real code database
            example = {
                "concept":    concept,
                "difficulty": difficulty,
                "code": f"# Example: {concept}\nimport sklearn\n# ... working code here",
                "explanation": f"This code demonstrates {concept} at {difficulty} level."
            }
            return json.dumps(example)
        return json.dumps({"error": f"Unknown tool: {name}"})


# Demo the complete assistant
print("=" * 60)
print("ML Tutor Assistant Demo")
print("=" * 60)

tutor = MLTutorAssistant()

demo_questions = [
    "What is overfitting and how do I detect it?",
    "Can you give me a code example for that?",
    "How is this related to the bias-variance tradeoff?",
]

for question in demo_questions:
    print(f"\nStudent: {question}")
    answer = tutor.chat(question)
    print(f"Tutor: {answer[:200]}...")

print(f"\n{tutor.stats()}")
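
One caveat with a fixed-size `deque`: when it drops the oldest message, it can discard an assistant message that holds `tool_calls` while keeping the matching `tool` result. The API rejects a `tool` message that does not follow its tool call. The following is a sketch of a safer trim, assuming all history entries are plain dicts; `trim_history` is a hypothetical helper:

```python
def trim_history(messages: list, max_len: int) -> list:
    """Keep at most max_len recent messages without starting on an orphaned tool result."""
    trimmed = list(messages)[-max_len:] if max_len > 0 else []
    # Drop leading tool results whose assistant tool_calls message was cut off
    while trimmed and trimmed[0].get("role") == "tool":
        trimmed.pop(0)
    return trimmed

history = [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temperature\": 18}"},
    {"role": "assistant", "content": "It is 18°C in Tokyo."},
]

print([m["role"] for m in trim_history(history, max_len=2)])   # ['assistant']
print([m["role"] for m in trim_history(history, max_len=3)])   # ['assistant', 'tool', 'assistant']
```

With `max_len=2` the window would start on the tool result, so it is dropped. With `max_len=3` the tool call and its result stay together.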

Cost Optimization Tips

cost_tips = {
    "1. Model selection": {
        "tip": "Use gpt-4o-mini for most tasks. Only upgrade to gpt-4o when quality is genuinely insufficient.",
        "saving": "95% cost reduction vs gpt-4o"
    },
    "2. Prompt caching": {
        "tip": "OpenAI automatically caches prompt prefixes of 1024+ tokens. Put long, static system prompts first so repeated calls share the cached prefix.",
        "saving": "50% discount on cached tokens"
    },
    "3. Batch API": {
        "tip": "Use the Batch API for tasks that don't need real-time responses.",
        "saving": "50% discount vs real-time"
    },
    "4. Token counting": {
        "tip": "Count tokens before sending. Trim unnecessary context. Remove long system prompts when not needed.",
        "saving": "10-40% depending on your prompts"
    },
    "5. Max tokens": {
        "tip": "Always set max_tokens. Without it, the model can generate very long responses.",
        "saving": "Prevents runaway costs"
    },
    "6. Temperature for deterministic tasks": {
        "tip": "Use temperature=0 for classification, extraction, formatting. Near-deterministic output is more consistent and easier to cache in your own application.",
        "saving": "Better hit rates for your own response cache"
    },
    "7. Local models for testing": {
        "tip": "Use Ollama or llama.cpp during development. Only hit the API when testing production behavior.",
        "saving": "90%+ during development"
    },
}

print("Cost Optimization Tips:")
for tip_name, info in cost_tips.items():
    print(f"\n{tip_name}")
    print(f"  Tip:    {info['tip']}")
    print(f"  Saving: {info['saving']}")
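
Tip 6 refers to caching responses in your own application, which is separate from OpenAI's prompt caching. The following is a minimal in-memory sketch, assuming a hypothetical `cached_call` wrapper and a fake model function in place of a real API call:

```python
import hashlib
import json

_cache = {}   # in-memory only; swap for Redis or a database in a real service

def cache_key(model: str, messages: list, temperature: float) -> str:
    """Stable hash of everything that determines the response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(model: str, messages: list, call_fn, temperature: float = 0.0) -> str:
    """Return a stored answer when the exact same request was seen before."""
    key = cache_key(model, messages, temperature)
    if key not in _cache:
        _cache[key] = call_fn(model, messages, temperature)
    return _cache[key]

calls = []
def fake_llm(model, messages, temperature):
    # Stand-in for a real API call; records how often it is invoked
    calls.append(1)
    return "positive"

msgs = [{"role": "user", "content": "Classify: great product"}]
print(cached_call("gpt-4o-mini", msgs, fake_llm))   # positive
print(cached_call("gpt-4o-mini", msgs, fake_llm))   # positive
print(len(calls))                                   # 1
```

The second call is served from the cache, so the fake model runs only once. This only makes sense for tasks where the same input should always give the same answer.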

The Complete Journey: 100 Posts in One View

PHASE 1: Python That Actually Works (Posts 1-10)
  Variables, functions, OOP, error handling, file I/O

PHASE 2: Math for ML (Posts 11-20)
  Linear algebra, calculus, probability, statistics

PHASE 3: Data Wrangling Tools (Posts 27-39)
  NumPy, Pandas, Matplotlib, Seaborn, EDA

PHASE 4: SQL for Data (Posts 40-45)
  SELECT, JOINs, window functions, Python + SQL

PHASE 5: Dev Tools (Posts 46-50)
  Git, GitHub, Jupyter, Colab, virtual environments

PHASE 6: Machine Learning Core (Posts 51-71)
  Linear/logistic regression, trees, XGBoost, SVM,
  KNN, Naive Bayes, evaluation metrics, clustering,
  PCA, feature engineering, hyperparameter tuning

PHASE 7: Deep Learning (Posts 72-86)
  Neural networks, backprop, PyTorch, training loops,
  GPUs, CNNs, transfer learning, RNNs, autoencoders, GANs

PHASE 8: NLP and LLMs (Posts 87-100)
  Text preprocessing, tokenization, embeddings, attention,
  transformers, BERT, GPT, HuggingFace, fine-tuning,
  LoRA, vector search, RAG, chatbots, OpenAI API

What You Can Build Now

After 100 posts, you can build:

Classification systems for any domain
Regression models for prediction problems
Document intelligence pipelines with RAG
Custom chatbots with memory and tools
Image classification with CNNs
Fine-tuned domain models with LoRA
Semantic search engines
End-to-end ML pipelines with proper evaluation

The fundamentals don't change. Models come and go. APIs change. Architectures evolve. But gradient descent, overfitting, precision vs recall, the training loop, attention mechanisms: these ideas will still matter in 10 years.

You now understand them. Not just how to use them. Why they work.

Quick Cheat Sheet: OpenAI API

Task	Code
Basic chat	`client.chat.completions.create(model=..., messages=[...])`
System prompt	`{"role": "system", "content": "..."}` in messages
JSON output	`response_format={"type": "json_object"}`
Function calling	`tools=[...]`, `tool_choice="auto"`
Streaming	`stream=True`, iterate over chunks
Vision	Add `{"type": "image_url", "image_url": {"url": "..."}}` to content
Embeddings	`client.embeddings.create(input=text, model="text-embedding-3-small")`
Count tokens	`tiktoken.encoding_for_model(model).encode(text)`
Cost estimate	tokens / 1M * price_per_million
Retry on error	Catch `RateLimitError`, exponential backoff
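
The last two rows can be sketched without calling the API. In this sketch `estimate_cost` and `retry_with_backoff` are hypothetical helper names, the price is a made-up number, and a plain `RuntimeError` stands in for the SDK's `RateLimitError` so the example runs offline.

```python
import random
import time

def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Cost = tokens / 1M * price per million tokens."""
    return tokens / 1_000_000 * price_per_million

def retry_with_backoff(fn, retry_on=(RuntimeError,), max_retries=5,
                       base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise                                      # out of retries: surface the error
            delay = base_delay * (2 ** attempt)            # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, 0.1 * delay))  # small jitter

# Demo: a simulated call that fails twice, then succeeds
calls = {"n": 0}

def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited (simulated)")
    return "ok"

waits = []   # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, len(waits))
print(estimate_cost(1_500_000, 2.0))   # hypothetical price of $2 per 1M tokens
```

With the real SDK you would pass `retry_on=(openai.RateLimitError,)` and wrap the `client.chat.completions.create(...)` call in a small function.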

Practice Challenges

Level 1:
Build a simple Q&A bot using the OpenAI API. Give it a custom system prompt that defines a persona. Test it with 10 questions and evaluate response quality.

Level 2:
Add function calling to the assistant. Define at least two tools: one that retrieves weather and one that searches Wikipedia. Verify the model correctly decides when to call each tool.

Level 3:
Build a complete AI assistant that combines: a custom system prompt, conversation memory (sliding window), RAG (ChromaDB with 20+ documents), at least 2 function tools, streaming output, and usage/cost tracking. Deploy it as a simple command-line chatbot.

This Is Post 100.

You started from zero. You learned Python, math, data wrangling, machine learning, deep learning, and large language models. You built classifiers, regressors, neural networks, transformers, RAG pipelines, and chatbots.

One hundred posts. One complete journey.

The field will keep moving. New architectures will appear. Better models will ship. Benchmarks will fall. But the person who understands why things work is never left behind by what comes next.

That's you now.

Go build something.

98. RAG: Give Your AI Access to Your Documents

Akhilesh — Tue, 26 May 2026 20:23:00 +0000

You ask ChatGPT about your company's internal policies. It makes something up. It sounds confident. It's wrong.

That's the hallucination problem. LLMs generate text based on what they learned during training. If the answer wasn't in the training data, they fabricate one that sounds plausible.

RAG (Retrieval Augmented Generation) fixes this. Before generating, the system retrieves relevant documents from your own knowledge base. The LLM reads those documents and generates an answer grounded in real content.

Your documents. Your data. Accurate answers.

What You'll Learn Here

Why RAG beats fine-tuning for knowledge-heavy tasks
The complete RAG pipeline: chunk, embed, retrieve, generate
Chunking strategies that actually work
Building RAG from scratch with sentence-transformers and a local LLM
Building RAG with LangChain for real projects
Evaluating RAG: what good looks like and what breaks it
Common failure modes and how to fix them

RAG vs Fine-Tuning: When to Use Which

Both adapt an LLM to your use case, but they solve different problems. Fine-tuning changes how the model behaves. RAG changes what it can look up.

Fine-tuning:
  - Best for: teaching style, format, behavior
  - Updates model weights
  - Needs retraining when data changes
  - Can't cite sources easily
  - Expensive to update frequently

RAG:
  - Best for: factual knowledge, documents, databases
  - No weight updates
  - Update knowledge base anytime, instantly
  - Can cite exact source passages
  - Perfect for private or frequently changing data

Rule of thumb:
  Behavior/style change → fine-tune
  Knowledge/facts/documents → RAG
  Both → fine-tune + RAG

The Complete RAG Pipeline

1. INDEXING (done once, offline)
   Load documents
   → Split into chunks
   → Embed each chunk
   → Store in vector database

2. RETRIEVAL (done at query time)
   User sends question
   → Embed the question
   → Find top-k similar chunks
   → Return chunks as context

3. GENERATION (done at query time)
   Build prompt: question + retrieved chunks
   → Send to LLM
   → LLM generates answer grounded in chunks
   → Return answer to user
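
The three stages fit in a few lines once the embedding model and the LLM are swapped for stand-ins. The sketch below is a toy, not the pipeline built in the rest of this post: `embed` is a hypothetical bag-of-words counter rather than a neural embedding, the "vector database" is a Python list, and generation stops at building the prompt.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# 1. INDEXING: chunk (one chunk per document here), embed, store
docs = [
    "Overfitting happens when a model memorizes training data.",
    "Cross-validation splits data into k folds.",
    "Transformers use self-attention over tokens.",
]
index = [(chunk, embed(chunk)) for chunk in docs]

# 2. RETRIEVAL: embed the question, rank chunks by similarity
def retrieve(question: str, top_k: int = 1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# 3. GENERATION: build the prompt an LLM would receive
def build_prompt(question: str, chunks) -> str:
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "What is overfitting in a model?"
top = retrieve(question)
print(build_prompt(question, top))
```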

Step 1: Chunking Documents

The most underrated step. How you split documents dramatically affects retrieval quality.

import re
from typing import List

# Strategy 1: Fixed-size chunking
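def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    # overlap must be smaller than chunk_size, otherwise start never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start  = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap   # overlap to preserve context at boundaries
    return chunks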

# Strategy 2: Sentence-aware chunking (better)
def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks    = []
    current   = ""

    for sentence in sentences:
        if len(current) + len(sentence) <= max_chunk_size:
            current += " " + sentence if current else sentence
        else:
            if current:
                chunks.append(current.strip())
            current = sentence

    if current:
        chunks.append(current.strip())

    return chunks

# Strategy 3: Paragraph-aware chunking (often best for structured docs)
def chunk_by_paragraphs(text: str, max_chunk_size: int = 800) -> List[str]:
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks     = []
    current    = ""

    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chunk_size:
            current += "\n\n" + para if current else para
        else:
            if current:
                chunks.append(current.strip())
            current = para

    if current:
        chunks.append(current.strip())

    return chunks

# Test on sample text
sample_text = """
Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.

Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning finds patterns in data without labels. 
Clustering algorithms group similar examples together. 
Dimensionality reduction simplifies data while preserving structure.

Reinforcement learning trains an agent to take actions in an environment to maximize reward.
It learns through trial and error, receiving feedback from the environment.
Applications include game playing, robotics, and recommendation systems.
"""

chunks_fixed = chunk_fixed(sample_text, chunk_size=200, overlap=30)
chunks_sents = chunk_by_sentences(sample_text, max_chunk_size=300)
chunks_paras = chunk_by_paragraphs(sample_text, max_chunk_size=400)

print(f"Fixed chunks:     {len(chunks_fixed)}")
print(f"Sentence chunks:  {len(chunks_sents)}")
print(f"Paragraph chunks: {len(chunks_paras)}")

print(f"\nParagraph chunk 1:\n'{chunks_paras[0]}'")
print(f"\nParagraph chunk 2:\n'{chunks_paras[1]}'")

Output:

Fixed chunks:     7
Sentence chunks:  4
Paragraph chunks: 4

Paragraph chunk 1:
'Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.'

Paragraph chunk 2:
'Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.'

Chunking guidelines:

Chunk size 300-600 characters works well for most use cases
Always include overlap (50-100 chars) so context isn't lost at boundaries
Paragraph chunking preserves semantic units better than fixed-size
Smaller chunks: better precision (more specific retrieval)
Larger chunks: better recall (more context per chunk)
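
The sentence and paragraph chunkers above add no overlap, so here is one way to combine the two guidelines. This is a sketch under assumptions: `chunk_sentences_with_overlap` is a hypothetical helper, and overlap is counted in whole sentences rather than characters so no sentence is cut in half.

```python
import re
from typing import List

def chunk_sentences_with_overlap(text: str, max_chunk_size: int = 500,
                                 overlap_sentences: int = 1) -> List[str]:
    """Pack sentences into chunks; each new chunk starts with the last
    `overlap_sentences` sentences of the previous chunk."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []

    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # carry the tail of the finished chunk into the next one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One is first. Two is second. Three is third. Four is fourth."
for chunk in chunk_sentences_with_overlap(demo, max_chunk_size=30):
    print(repr(chunk))
```

Because the carried-over sentences are added before the size check, a chunk can run slightly past `max_chunk_size`. Treat the limit as a target, not a hard cap.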

Step 2: Building the Index

from sentence_transformers import SentenceTransformer
import chromadb
import numpy as np

# Knowledge base: a collection of ML documentation
knowledge_base = {
    'doc1.txt': """
        Linear regression predicts a continuous output variable from input features.
        It fits a straight line (or hyperplane in multiple dimensions) through the data.
        The model minimizes the mean squared error between predictions and true values.
        The learned equation is: y = w1*x1 + w2*x2 + ... + b
        Used for: house price prediction, sales forecasting, temperature prediction.
    """,
    'doc2.txt': """
        Logistic regression is used for binary classification despite its name.
        It applies a sigmoid function to the linear combination of features.
        Output is a probability between 0 and 1.
        The decision boundary is where the probability equals 0.5.
        Used for: spam detection, disease diagnosis, fraud detection.
    """,
    'doc3.txt': """
        Random forests combine many decision trees to reduce overfitting.
        Each tree is trained on a random subset of data (bagging).
        Each split considers a random subset of features.
        Final prediction is the majority vote (classification) or average (regression).
        Feature importance can be extracted from the forest.
    """,
    'doc4.txt': """
        XGBoost builds trees sequentially, each one correcting errors from the previous.
        It uses gradient boosting with regularization to prevent overfitting.
        Learning rate controls how much each tree contributes.
        Early stopping prevents overtraining.
        Dominates Kaggle competitions on tabular data.
    """,
    'doc5.txt': """
        Cross-validation gives a reliable estimate of model performance.
        K-fold CV splits data into k equal parts, trains on k-1, tests on 1.
        This is repeated k times with different test sets.
        Average score across folds is the final estimate.
        Prevents optimistic bias from a single train/test split.
    """,
    'doc6.txt': """
        The confusion matrix shows all four prediction outcomes.
        True positives: correctly predicted positive.
        True negatives: correctly predicted negative.
        False positives: incorrectly predicted positive (Type I error).
        False negatives: incorrectly predicted negative (Type II error).
        Precision = TP / (TP + FP). Recall = TP / (TP + FN).
    """,
    'doc7.txt': """
        Overfitting occurs when a model performs well on training data but poorly on test data.
        Signs: large gap between train and validation accuracy.
        Causes: model too complex, too little data, training too long.
        Fixes: regularization, dropout, more data, early stopping, simpler model.
        The bias-variance tradeoff describes the fundamental tension.
    """,
    'doc8.txt': """
        Transformers use self-attention to process sequences in parallel.
        Self-attention computes relationships between all token pairs simultaneously.
        Multi-head attention runs several attention operations in parallel.
        Positional encoding adds position information to token embeddings.
        Layer normalization and residual connections stabilize training.
    """,
}

class RAGIndexer:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model       = SentenceTransformer(model_name)
        self.chroma      = chromadb.Client()
        self.collection  = self.chroma.create_collection(
            name='rag_knowledge_base',
            metadata={'hnsw:space': 'cosine'}
        )

    def index_documents(self, documents: dict, chunk_size: int = 400):
        all_chunks = []
        all_ids    = []
        all_meta   = []

        for doc_name, content in documents.items():
            chunks = chunk_by_sentences(content, max_chunk_size=chunk_size)
            for i, chunk in enumerate(chunks):
                if len(chunk.strip()) < 30:   # skip tiny chunks
                    continue
                chunk_id = f"{doc_name}_chunk{i}"
                all_chunks.append(chunk.strip())
                all_ids.append(chunk_id)
                all_meta.append({'source': doc_name, 'chunk_idx': i})

        if not all_chunks:
            return

        # Encode all chunks
        print(f"Encoding {len(all_chunks)} chunks...")
        embeddings = self.model.encode(all_chunks, show_progress_bar=False)

        # Add to ChromaDB
        self.collection.add(
            ids        = all_ids,
            documents  = all_chunks,
            embeddings = [e.tolist() for e in embeddings],
            metadatas  = all_meta
        )
        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def retrieve(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = self.model.encode([query])[0].tolist()

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        retrieved = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            retrieved.append({
                'text':       doc,
                'source':     meta['source'],
                'similarity': 1 - dist   # ChromaDB returns distance, not similarity
            })

        return retrieved

# Build the index
indexer = RAGIndexer()
indexer.index_documents(knowledge_base)

# Test retrieval
query   = "How do I prevent a model from overfitting?"
results = indexer.retrieve(query, top_k=3)

print(f"\nQuery: '{query}'")
print("-" * 60)
for i, r in enumerate(results):
    print(f"\n{i+1}. [{r['similarity']:.3f}] From: {r['source']}")
    print(f"   {r['text'][:150]}...")

Output:

Indexed 16 chunks from 8 documents

Query: 'How do I prevent a model from overfitting?'
------------------------------------------------------------

1. [0.712] From: doc7.txt
   Overfitting occurs when a model performs well on training data but poorly on test data...

2. [0.531] From: doc4.txt
   XGBoost builds trees sequentially, each one correcting errors from the previous...

3. [0.489] From: doc3.txt
   Random forests combine many decision trees to reduce overfitting...

Step 3: Generation With Retrieved Context

# Using a local model via HuggingFace Transformers
from transformers import pipeline

# flan-t5-base is small enough to run on CPU; for a real project use a larger model or an LLM API
generator = pipeline(
    'text2text-generation',
    model='google/flan-t5-base',
    max_new_tokens=200
)

def build_rag_prompt(question: str, context_chunks: List[dict]) -> str:
    context = "\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer this."

Context:
{context}

Question: {question}

Answer:"""

    return prompt

class RAGPipeline:
    def __init__(self, indexer: RAGIndexer, generator_pipeline):
        self.indexer   = indexer
        self.generator = generator_pipeline

    def answer(self, question: str, top_k: int = 3, verbose: bool = False) -> dict:
        # Step 1: Retrieve relevant chunks
        chunks = self.indexer.retrieve(question, top_k=top_k)

        # Step 2: Build prompt
        prompt = build_rag_prompt(question, chunks)

        if verbose:
            print("=== RETRIEVED CONTEXT ===")
            for c in chunks:
                print(f"[{c['source']}] sim={c['similarity']:.3f}: {c['text'][:100]}...")
            print("\n=== PROMPT ===")
            print(prompt[:500] + "...")

        # Step 3: Generate answer
        result = self.generator(prompt)[0]['generated_text']

        return {
            'question': question,
            'answer':   result.strip(),
            'sources':  [c['source'] for c in chunks],
            'chunks':   chunks
        }

# Build the RAG pipeline
rag = RAGPipeline(indexer, generator)

# Ask questions
questions = [
    "What causes overfitting and how do I fix it?",
    "How is precision different from recall?",
    "What makes XGBoost good for competitions?",
    "How do transformers process sequences?",
]

for question in questions:
    result = rag.answer(question)
    print(f"\nQ: {question}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}")
    print("-" * 60)

Using the OpenAI API for Better Generation

For production quality, use a real LLM API. The retrieval stays the same. Only the generation step changes.

# Replace the generator with OpenAI API
# pip install openai

import openai

def generate_with_openai(prompt: str, model: str = 'gpt-3.5-turbo') -> str:
    client = openai.OpenAI()   # reads OPENAI_API_KEY from environment

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant. Answer questions based only on the provided context. If the context does not contain enough information, say so clearly.'
            },
            {
                'role': 'user',
                'content': prompt
            }
        ],
        temperature=0.1,   # low temperature for factual answers
        max_tokens=300
    )

    return response.choices[0].message.content

# Integrate into RAG pipeline
class RAGWithOpenAI:
    def __init__(self, indexer: RAGIndexer):
        self.indexer = indexer

    def answer(self, question: str, top_k: int = 3) -> dict:
        chunks = self.indexer.retrieve(question, top_k=top_k)
        prompt = build_rag_prompt(question, chunks)
        answer = generate_with_openai(prompt)

        return {
            'question': question,
            'answer':   answer,
            'sources':  list(dict.fromkeys(c['source'] for c in chunks))   # dedupe, keep rank order
        }

# rag_openai = RAGWithOpenAI(indexer)
# result = rag_openai.answer("What causes overfitting?")
print("OpenAI RAG pipeline ready (requires OPENAI_API_KEY)")

LangChain: RAG in 30 Lines

LangChain abstracts the entire RAG pipeline into composable components.

pip install langchain langchain-community langchain-chroma langchain-text-splitters

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma   # matches the langchain-chroma package installed above
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_core.documents import Document
# Legacy chain: on LangChain 1.x, import it from langchain_classic.chains instead
from langchain.chains import RetrievalQA
from transformers import pipeline as hf_pipeline

# 1. Load documents

docs = [
    Document(
        page_content=content,
        metadata={'source': name}
    )
    for name, content in knowledge_base.items()
]

# 2. Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

# 3. Embed and store
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever   = vectorstore.as_retriever(search_kwargs={'k': 3})

# 4. Generation model
gen_pipe = hf_pipeline('text2text-generation', model='google/flan-t5-base', max_new_tokens=200)
llm      = HuggingFacePipeline(pipeline=gen_pipe)

# 5. Chain it together
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type='stuff',            # stuff all chunks into one prompt
    return_source_documents=True
)

# 6. Ask questions
result = qa_chain.invoke({'query': 'What causes overfitting?'})
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")

Common RAG Failure Modes and Fixes

failures = {
    "Retrieval finds wrong chunks": {
        "symptoms": "Answer is off-topic or doesn't address the question",
        "causes":   ["Chunk too large (contains many topics)", "Poor embedding model for domain"],
        "fixes":    ["Smaller chunks (200-400 chars)", "Domain-specific embedding model",
                     "Hybrid search (keyword + semantic)"]
    },
    "Chunks miss key information": {
        "symptoms": "Model says 'I don't know' but answer is in the documents",
        "causes":   ["Chunk boundary cut the relevant sentence",
                     "top_k too small", "Query and document phrasing too different"],
        "fixes":    ["Add overlap between chunks", "Increase top_k to 5-7",
                     "Query expansion (rephrase query multiple ways and merge results)"]
    },
    "Model ignores retrieved context": {
        "symptoms": "Answer doesn't match the retrieved chunks at all",
        "causes":   ["LLM is too small", "Prompt not clear about using only context"],
        "fixes":    ["Use larger/better LLM", "Stronger prompt instructions",
                     "Lower temperature"]
    },
    "Too much irrelevant context": {
        "symptoms": "Model is confused, answer is vague",
        "causes":   ["top_k too high", "All chunks have low similarity scores"],
        "fixes":    ["Filter chunks below similarity threshold",
                     "Reduce top_k to 2-3", "Check if query is answerable"]
    },
    "Hallucination despite retrieval": {
        "symptoms": "Model generates facts not in the retrieved context",
        "causes":   ["Model overrides context with training knowledge",
                     "Prompt not clear enough"],
        "fixes":    ["Explicit 'only use context' instruction in system prompt",
                     "Ask model to quote from context", "Use a model with stronger instruction following"]
    }
}

for failure, info in failures.items():
    print(f"\n{failure}")
    print(f"  Symptoms: {info['symptoms']}")
    print(f"  Fixes:")
    for fix in info['fixes']:
        print(f"    - {fix}")
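
Two of the fixes above, filtering by similarity and checking whether the query is answerable at all, can be sketched in one helper. The name `filter_chunks`, the 0.4 cutoff, and the sample scores are assumptions for illustration. The input mimics the list of dicts returned by `RAGIndexer.retrieve`.

```python
from typing import List, Optional

def filter_chunks(chunks: List[dict], min_similarity: float = 0.4,
                  max_chunks: int = 3) -> Optional[List[dict]]:
    """Keep the best chunks above the cutoff; return None if nothing qualifies."""
    kept = [c for c in chunks if c['similarity'] >= min_similarity]
    kept.sort(key=lambda c: c['similarity'], reverse=True)
    return kept[:max_chunks] or None

# Made-up retrieval results in the same shape as RAGIndexer.retrieve output
retrieved = [
    {'text': 'Overfitting occurs when...', 'source': 'doc7.txt', 'similarity': 0.71},
    {'text': 'XGBoost builds trees...',    'source': 'doc4.txt', 'similarity': 0.53},
    {'text': 'Transformers use...',        'source': 'doc8.txt', 'similarity': 0.18},
]

usable = filter_chunks(retrieved)
if usable is None:
    # Nothing relevant was retrieved: refuse instead of letting the LLM guess
    print("I don't have information about this.")
else:
    print([c['source'] for c in usable])
```

The right cutoff depends on the embedding model and the data, so tune it on a handful of questions you know the answer to.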

Evaluating RAG Quality

# Simple evaluation: does the answer contain key information?
def evaluate_rag_answer(answer: str, expected_keywords: List[str]) -> dict:
    answer_lower   = answer.lower()
    found_keywords = [k for k in expected_keywords if k.lower() in answer_lower]
    coverage       = len(found_keywords) / len(expected_keywords)

    return {
        'coverage':         coverage,
        'found_keywords':   found_keywords,
        'missing_keywords': [k for k in expected_keywords if k not in found_keywords]
    }

# Test cases
test_cases = [
    {
        'question': "What causes overfitting?",
        'keywords': ['complex', 'training', 'gap', 'regularization']
    },
    {
        'question': "How does cross-validation work?",
        'keywords': ['k-fold', 'split', 'average', 'estimate']
    },
]

print("RAG Evaluation Results:")
print("-" * 60)
for test in test_cases:
    result = rag.answer(test['question'])
    eval_  = evaluate_rag_answer(result['answer'], test['keywords'])

    print(f"\nQ: {test['question']}")
    print(f"A: {result['answer'][:150]}...")
    print(f"Keyword coverage: {eval_['coverage']:.1%}")
    print(f"Missing: {eval_['missing_keywords']}")

For production, use RAGAS (Retrieval Augmented Generation Assessment), which evaluates faithfulness, answer relevancy, and context precision automatically.
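
Before reaching for a framework, it can help to score the retrieval step on its own. The sketch below computes hit rate@k and mean reciprocal rank (MRR) from ranked source lists. It is not RAGAS, and the function names and sample data are hypothetical.

```python
from typing import List

def hit_rate_at_k(ranked: List[List[str]], expected: List[str], k: int = 3) -> float:
    """Fraction of queries whose expected source appears in the top k results."""
    hits = sum(1 for r, e in zip(ranked, expected) if e in r[:k])
    return hits / len(expected)

def mean_reciprocal_rank(ranked: List[List[str]], expected: List[str]) -> float:
    """Average of 1/rank of the expected source (counts as 0 when it is missing)."""
    total = 0.0
    for r, e in zip(ranked, expected):
        if e in r:
            total += 1.0 / (r.index(e) + 1)
    return total / len(expected)

# Made-up example: retrieved sources per query, best match first
ranked_sources = [
    ['doc7.txt', 'doc4.txt', 'doc3.txt'],   # expected source at rank 1
    ['doc1.txt', 'doc5.txt', 'doc2.txt'],   # expected source at rank 2
    ['doc8.txt', 'doc3.txt', 'doc1.txt'],   # expected source not retrieved
]
expected_sources = ['doc7.txt', 'doc5.txt', 'doc6.txt']

print(f"hit rate@3: {hit_rate_at_k(ranked_sources, expected_sources):.2f}")
print(f"MRR:        {mean_reciprocal_rank(ranked_sources, expected_sources):.2f}")
```

If retrieval scores poorly here, no prompt change will rescue the generated answers.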

Quick Cheat Sheet

Step	What it does	Key decision
Chunking	Split docs into pieces	Size 300-600 chars, overlap 50-100
Embedding	Convert chunks to vectors	all-MiniLM-L6-v2 to start
Indexing	Store in vector DB	ChromaDB for dev, Pinecone for prod
Retrieval	Find top-k similar chunks	k=3 to 5 usually works
Generation	Build prompt + call LLM	Include retrieved context explicitly

Problem	Quick fix
Wrong chunks retrieved	Smaller chunks, better embedding model
Answer not in chunks	Add overlap, increase top-k
Model ignores context	Stronger prompt, lower temperature
Too slow	Smaller embedding model, FAISS ANN index
Hallucinations	Explicit "only use context" in system prompt

Practice Challenges

Level 1:
Pick any 10 Wikipedia articles on a topic you know. Chunk them, embed them, and store in ChromaDB. Ask 5 questions where you already know the answer. Did RAG get them right?

Level 2:
Compare three chunking strategies (fixed-size, sentence-aware, paragraph-aware) on the same document set. For each strategy, retrieve the top-3 chunks for 5 queries. Which strategy retrieves more relevant chunks by eye?

Level 3:
Build a complete RAG pipeline with source citations. For each answer, show which document chunk it came from and highlight the specific sentence that grounded the answer. Add a similarity threshold: if the top-k chunks all score below 0.4, return "I don't have information about this" instead of guessing.

Next up, Post 99: Build a Chatbot With Memory. Conversation history, context management, multi-turn dialogue. We build a chatbot that actually remembers what you said earlier in the conversation.

97. Embeddings and Vector Search: Semantic Search That Works

Akhilesh — Mon, 25 May 2026 18:04:54 +0000

Traditional search works on keywords. You type "cheap hotel", it looks for documents containing those exact words.

Someone asks "affordable accommodation near the beach". Your documents say "budget-friendly lodging by the coast". Zero keyword overlap. Zero results. Search fails.

Embeddings fix this. They convert text into vectors of numbers where similar meanings end up geometrically close. "Cheap" and "affordable" land near each other in vector space. "Hotel" and "accommodation" land near each other. Semantic similarity becomes distance.

Embedding-based retrieval sits behind much of modern search: document Q&A tools, note-taking assistants, and code assistants that pull relevant context into a prompt.

What You'll Learn Here

What embeddings are and how they encode meaning
Cosine similarity: measuring how close two vectors are
Sentence transformers: the right models for semantic search
Building a semantic search engine from scratch
FAISS: fast approximate nearest neighbor search at scale
ChromaDB: a vector database for production use
Practical patterns for document retrieval

What Embeddings Actually Are

An embedding is a dense vector of floating point numbers. Every piece of text maps to one vector.

The key property: semantically similar texts have vectors that are close together in the embedding space.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a sentence embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Embed some sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Dogs love to play fetch.",
    "Machine learning is a subset of AI.",
    "Artificial intelligence includes ML.",
]

embeddings = model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence → {embeddings.shape[1]}-dimensional vector")
print(f"\nFirst embedding (first 8 dims): {embeddings[0][:8].round(4)}")

Output:

Embedding shape: (5, 384)
Each sentence → 384-dimensional vector

First embedding (first 8 dims): [ 0.0234 -0.1823  0.0912  0.3421 -0.0541  0.2134 -0.0823  0.1234]

384 numbers represent the meaning of an entire sentence. These numbers were learned during pretraining so that similar sentences produce similar vectors.

Cosine Similarity: Measuring Semantic Distance

Raw Euclidean distance is sensitive to vector length as well as direction. Two texts on the same topic can produce vectors of different magnitude and end up far apart even though they point the same way.

Cosine similarity measures the angle between vectors, not their magnitude. It ranges from -1 to 1. Same direction = 1. Perpendicular = 0. Opposite = -1.
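
A tiny worked example makes the three reference points concrete. This is a sketch with hand-picked 2-d vectors, not real embeddings, and the helper name `cosine` is only for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

v = [1.0, 2.0]

print(cosine(v, [2.0, 4.0]))      # same direction, twice the length: about 1
print(cosine(v, [-2.0, 1.0]))     # perpendicular: 0
print(cosine(v, [-1.0, -2.0]))    # opposite direction: about -1
print(math.dist(v, [2.0, 4.0]))   # Euclidean distance still sees the length difference
```

The first and last lines compare the same pair of vectors: cosine similarity treats them as identical in direction, while Euclidean distance reports a gap because one is longer.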

import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare all pairs
print("Cosine similarity between sentences:")
print(f"{'Pair':<55} {'Similarity'}")
print("-" * 70)

pairs = [
    (0, 1, "cat on mat vs feline on rug"),
    (0, 2, "cat on mat vs dogs play fetch"),
    (3, 4, "ML subset AI vs AI includes ML"),
    (0, 3, "cat on mat vs ML is AI"),
]

for i, j, desc in pairs:
    sim = cosine_sim(embeddings[i], embeddings[j])
    print(f"{desc:<55} {sim:.4f}")

Output:

Cosine similarity between sentences:
Pair                                                    Similarity
----------------------------------------------------------------------
cat on mat vs feline on rug                             0.8341
cat on mat vs dogs play fetch                           0.4123
ML subset AI vs AI includes ML                          0.8912
cat on mat vs ML is AI                                  0.1234

"Cat on mat" and "feline on rug" score 0.83. Same concept, different words. "ML subset AI" and "AI includes ML" score 0.89. Semantically equivalent.

"Cat on mat" and "ML is AI" score 0.12. Completely different topics.

Sentence Transformers: The Right Models

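Word-level models like Word2Vec produce one vector per word. The usual way to get a sentence vector from them is to average the word vectors, which throws away word order and sentence structure. Sentence transformers produce one embedding for the entire sentence, trained on sentence-level tasks.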
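
Why averaging loses structure is easy to see with toy vectors. The 2-d word vectors below are made up for illustration (real Word2Vec vectors typically have 100 to 300 dimensions), and `average_embedding` is a hypothetical helper.

```python
# Hypothetical 2-d word vectors, hand-picked for the demo
word_vectors = {
    "dog":   [0.9, 0.1],
    "bites": [0.2, 0.8],
    "man":   [0.4, 0.5],
}

def average_embedding(sentence: str):
    """Sentence vector as the mean of its word vectors (rounded to hide float noise)."""
    vectors = [word_vectors[w] for w in sentence.lower().split()]
    return [round(sum(dim) / len(vectors), 6) for dim in zip(*vectors)]

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

print(a)
print(b)
print(a == b)   # same words, different meaning, identical vector
```

A mean does not care about order, so the two sentences collapse to the same point. A sentence-level model reads the whole sequence and can keep them apart.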

from sentence_transformers import SentenceTransformer

# Popular embedding models

models_info = {
    'all-MiniLM-L6-v2': {
        'dim': 384,
        'size': '80MB',
        'speed': 'very fast',
        'quality': 'good',
        'note': 'Best starting point. Fast and accurate.'
    },
    'all-mpnet-base-v2': {
        'dim': 768,
        'size': '420MB',
        'speed': 'medium',
        'quality': 'excellent',
        'note': 'Best quality for semantic search.'
    },
    'paraphrase-multilingual-MiniLM-L12-v2': {
        'dim': 384,
        'size': '470MB',
        'speed': 'fast',
        'quality': 'good',
        'note': 'Supports 50+ languages.'
    },
    'text-embedding-3-small (OpenAI API)': {
        'dim': 1536,
        'size': 'API',
        'speed': 'API latency',
        'quality': 'very high',
        'note': 'Hosted API, no local model. Billed per token.'
    }
}

print(f"{'Model':<45} {'Dim':<6} {'Size':<10} {'Quality'}")
print("-" * 70)
for name, info in models_info.items():
    print(f"{name:<45} {info['dim']:<6} {info['size']:<10} {info['quality']}")

# Load the recommended default
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Building a Semantic Search Engine From Scratch

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# A knowledge base of documents
documents = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "Machine learning algorithms learn patterns from data without being explicitly programmed.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "The transformer architecture uses self-attention mechanisms to process sequential data.",
    "BERT is a bidirectional transformer pretrained on masked language modeling.",
    "GPT uses a decoder-only transformer trained on next-token prediction.",
    "Fine-tuning adapts a pretrained model to a specific task using domain data.",
    "LoRA reduces the number of trainable parameters by using low-rank decomposition.",
    "Vector databases store embeddings and support fast nearest-neighbor search.",
    "RAG combines retrieval with generation to give LLMs access to external knowledge.",
    "Cosine similarity measures the angle between two vectors in embedding space.",
    "Tokenization breaks text into smaller units called tokens before feeding to a model.",
    "Backpropagation computes gradients by applying the chain rule backward through a network.",
    "Overfitting occurs when a model learns the training data too well and fails on new data.",
    "Cross-validation gives a more reliable estimate of model performance than a single split.",
]

class SemanticSearch:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model     = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents):
        self.documents  = documents
        print(f"Encoding {len(documents)} documents...")
        self.embeddings = self.model.encode(documents, show_progress_bar=True)
        print(f"Indexed {len(documents)} documents. Embedding shape: {self.embeddings.shape}")

    def search(self, query, top_k=3):
        # Encode the query
        query_embedding = self.model.encode([query])

        # Compute cosine similarity with all documents
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]

        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score':    similarities[idx],
                'index':    idx
            })
        return results

# Build the search engine
search_engine = SemanticSearch()
search_engine.index(documents)

# Test queries
queries = [
    "How do transformers work?",
    "What is the difference between BERT and GPT?",
    "How can I make training more efficient?",
    "What happens when a model memorizes training data?",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = search_engine.search(query, top_k=3)
    for i, r in enumerate(results):
        print(f"  {i+1}. [{r['score']:.3f}] {r['document'][:80]}...")

Output:

Query: 'How do transformers work?'
------------------------------------------------------------
  1. [0.712] The transformer architecture uses self-attention mechanisms...
  2. [0.634] BERT is a bidirectional transformer pretrained on masked...
  3. [0.601] GPT uses a decoder-only transformer trained on next-token...

Query: 'What is the difference between BERT and GPT?'
------------------------------------------------------------
  1. [0.823] BERT is a bidirectional transformer pretrained on masked...
  2. [0.798] GPT uses a decoder-only transformer trained on next-token...
  3. [0.612] The transformer architecture uses self-attention mechanisms...

Query: 'How can I make training more efficient?'
------------------------------------------------------------
  1. [0.651] LoRA reduces the number of trainable parameters by using...
  2. [0.589] Fine-tuning adapts a pretrained model to a specific task...
  3. [0.534] Machine learning algorithms learn patterns from data...

Query: 'What happens when a model memorizes training data?'
------------------------------------------------------------
  1. [0.714] Overfitting occurs when a model learns the training data...
  2. [0.543] Cross-validation gives a more reliable estimate of model...
  3. [0.498] Fine-tuning adapts a pretrained model to a specific task...

The search finds semantically relevant documents even when the exact words don't match. "How can I make training more efficient?" correctly retrieves the LoRA document, which never uses the word "efficient".
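
Under the hood, the ranking is just cosine similarity followed by a sort. Here is a minimal standard-library sketch of that step. The 3-dimensional vectors are invented for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the helper names `cosine` and `top_k` are hypothetical, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.9, 0.1, 0.0]

for idx, score in top_k(toy_query, toy_docs, k=2):
    print(f"doc {idx}: {score:.3f}")
```

If the vectors are L2-normalized first, the denominator is 1 and the dot product alone gives the same ranking.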

FAISS: Fast Search at Scale

The brute-force approach (compare the query to every document) works for thousands of documents. For millions, you need approximate nearest neighbor (ANN) search. FAISS (Facebook AI Similarity Search) is a widely used library for both exact and approximate search.

pip install faiss-cpu   # or faiss-gpu for GPU support

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Generate sample embeddings (simulating a large corpus)
model       = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
dimension   = 384   # all-MiniLM-L6-v2 embedding size

# Simulate 10,000 documents
np.random.seed(42)
fake_embeddings = np.random.randn(10000, dimension).astype('float32')
# Normalize for cosine similarity (FAISS uses inner product)
faiss.normalize_L2(fake_embeddings)

# Build FAISS index
# IndexFlatIP: exact inner product search (cosine similarity after L2 normalization)
index = faiss.IndexFlatIP(dimension)
index.add(fake_embeddings)
print(f"FAISS index size: {index.ntotal} vectors")

# Search
query_embedding = np.random.randn(1, dimension).astype('float32')
faiss.normalize_L2(query_embedding)

k = 5
distances, indices = index.search(query_embedding, k)

print(f"\nTop {k} nearest neighbors:")
for dist, idx in zip(distances[0], indices[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")

# For very large datasets: use IVF index (approximate, faster)
# IVF = Inverted File Index, partitions space into clusters

n_clusters = 100   # number of partitions (sqrt of dataset size is a good rule)
quantizer  = faiss.IndexFlatIP(dimension)
ivf_index  = faiss.IndexIVFFlat(quantizer, dimension, n_clusters, faiss.METRIC_INNER_PRODUCT)

# Must train IVF index before adding vectors
ivf_index.train(fake_embeddings)
ivf_index.add(fake_embeddings)

# Tune nprobe: how many clusters to search (higher = more accurate, slower)
ivf_index.nprobe = 10

distances_ivf, indices_ivf = ivf_index.search(query_embedding, k)
print(f"\nIVF index results (approximate but faster):")
for dist, idx in zip(distances_ivf[0], indices_ivf[0]):
    print(f"  Index {idx}: similarity={dist:.4f}")

# Benchmark: exact vs approximate
import time

# Exact search
start = time.time()
for _ in range(100):
    index.search(query_embedding, k)
exact_time = (time.time() - start) / 100

# Approximate search
start = time.time()
for _ in range(100):
    ivf_index.search(query_embedding, k)
approx_time = (time.time() - start) / 100

print(f"\nSearch time per query:")
print(f"  Exact (IndexFlatIP): {exact_time*1000:.2f}ms")
print(f"  Approximate (IVF):   {approx_time*1000:.2f}ms")
print(f"  Speedup: {exact_time/approx_time:.1f}x")
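
To see what `nprobe` trades off, here is a toy pure-Python version of the IVF idea: bucket each vector under its best-matching centroid, then score only the vectors in the `nprobe` best buckets. This is a sketch of the concept, not how FAISS is implemented (FAISS learns its centroids during `train` and uses optimized kernels). The centroids are hand-picked and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its best-matching centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1):
    """Score only the vectors stored in the nprobe best-matching buckets."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [vid for c in ranked[:nprobe] for vid in buckets[c]]
    candidates.sort(key=lambda vid: dot(query, vectors[vid]), reverse=True)
    return candidates[:k]

# Toy 2-D vectors and two hand-picked centroids (illustrative only)
vectors   = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
buckets   = build_ivf(vectors, centroids)

query = [0.8, 0.6]
print("nprobe=1:", ivf_search(query, vectors, centroids, buckets, k=2, nprobe=1))
print("nprobe=2:", ivf_search(query, vectors, centroids, buckets, k=2, nprobe=2))
```

With `nprobe=1` the search returns vectors 0 and 1 and misses vector 2, the true best match, because it lives in the other bucket. With `nprobe=2` every bucket is scanned and vector 2 comes back first, at the cost of doing a full brute-force search.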

ChromaDB: A Vector Database for Real Projects

FAISS is powerful but low-level. ChromaDB adds persistence, metadata filtering, and a clean API. Good for production use.

pip install chromadb

import chromadb
from sentence_transformers import SentenceTransformer

# Create a ChromaDB client
client = chromadb.Client()   # in-memory; use chromadb.PersistentClient('./chroma_db') for persistence

# Create a collection
collection = client.create_collection(
    name='ml_knowledge_base',
    metadata={'hnsw:space': 'cosine'}   # use cosine similarity
)

# Your documents with metadata
docs = [
    {
        'id': 'doc1',
        'text': 'Python is a high-level programming language known for simplicity.',
        'metadata': {'topic': 'programming', 'difficulty': 'beginner'}
    },
    {
        'id': 'doc2',
        'text': 'Machine learning algorithms learn patterns from data.',
        'metadata': {'topic': 'ml', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc3',
        'text': 'Neural networks are inspired by biological neural networks.',
        'metadata': {'topic': 'deep_learning', 'difficulty': 'intermediate'}
    },
    {
        'id': 'doc4',
        'text': 'BERT is a bidirectional transformer pretrained on MLM.',
        'metadata': {'topic': 'nlp', 'difficulty': 'advanced'}
    },
    {
        'id': 'doc5',
        'text': 'LoRA reduces trainable parameters using low-rank decomposition.',
        'metadata': {'topic': 'fine_tuning', 'difficulty': 'advanced'}
    },
]

# Add documents (ChromaDB can use its own embedding model or you provide embeddings)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

collection.add(
    ids       = [d['id'] for d in docs],
    documents = [d['text'] for d in docs],
    embeddings= [model.encode(d['text']).tolist() for d in docs],
    metadatas = [d['metadata'] for d in docs]
)

print(f"Collection size: {collection.count()}")

# Basic query
results = collection.query(
    query_embeddings=[model.encode("How do transformers work?").tolist()],
    n_results=3
)

print("\nQuery: 'How do transformers work?'")
for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"  {i+1}. [{1-dist:.3f}] {doc}")   # ChromaDB returns distance, convert to similarity

# Filter by metadata
results_filtered = collection.query(
    query_embeddings=[model.encode("machine learning concepts").tolist()],
    n_results=3,
    where={'difficulty': 'advanced'}   # only return advanced documents
)

print("\nQuery with filter (difficulty=advanced):")
for doc, meta in zip(results_filtered['documents'][0], results_filtered['metadatas'][0]):
    print(f"  [{meta['topic']}] {doc}")

# Update and delete
collection.update(
    ids=['doc1'],
    documents=['Python is a versatile high-level programming language.'],
    embeddings=[model.encode('Python is a versatile high-level programming language.').tolist()]
)

collection.delete(ids=['doc5'])
print(f"\nAfter update and delete: {collection.count()} documents")
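
Query results come back as parallel lists of lists, one inner list per query. The sketch below shows one way to turn cosine distances into similarities and drop weak matches. `filter_by_similarity` is a hypothetical helper, and `fake_results` is a hand-written stand-in that mimics the shape of a `collection.query` result, assuming the collection uses cosine space as configured above.

```python
def filter_by_similarity(results, min_similarity=0.5, query_index=0):
    """Keep hits whose cosine similarity (1 - distance) meets the threshold."""
    kept = []
    for doc_id, doc, dist in zip(results['ids'][query_index],
                                 results['documents'][query_index],
                                 results['distances'][query_index]):
        similarity = 1.0 - dist
        if similarity >= min_similarity:
            kept.append({'id': doc_id, 'document': doc, 'similarity': similarity})
    return kept

# Hand-written stand-in for a query result (distances are made up)
fake_results = {
    'ids':       [['doc4', 'doc2', 'doc1']],
    'documents': [['BERT is a bidirectional transformer pretrained on MLM.',
                   'Machine learning algorithms learn patterns from data.',
                   'Python is a high-level programming language known for simplicity.']],
    'distances': [[0.35, 0.62, 0.91]],
}

for hit in filter_by_similarity(fake_results, min_similarity=0.3):
    print(f"{hit['id']}: {hit['similarity']:.2f}")
```

A threshold like this is useful when "no good answer" should return nothing rather than the least-bad document.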

Batch Encoding: Processing Large Datasets Efficiently

from sentence_transformers import SentenceTransformer
import numpy as np
import time

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Simulate a large dataset
large_corpus = [f"This is document number {i} about topic {i % 10}." for i in range(5000)]

# Efficient batch encoding
print("Encoding 5000 documents...")
start = time.time()

embeddings = model.encode(
    large_corpus,
    batch_size=64,           # process 64 at a time
    show_progress_bar=True,
    normalize_embeddings=True  # L2 normalize for cosine similarity
)

elapsed = time.time() - start
print(f"\nDone in {elapsed:.1f}s")
print(f"Speed: {len(large_corpus)/elapsed:.0f} docs/second")
print(f"Embeddings shape: {embeddings.shape}")

Evaluating Embedding Quality

Not all embedding models perform equally on all tasks. Test before committing.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_embeddings(model_name, test_pairs):
    """
    test_pairs: list of (sent1, sent2, label) where label=1 means similar, 0 means different
    """
    model = SentenceTransformer(model_name)

    sents1 = [p[0] for p in test_pairs]
    sents2 = [p[1] for p in test_pairs]
    labels = [p[2] for p in test_pairs]

    emb1 = model.encode(sents1)
    emb2 = model.encode(sents2)

    similarities = [cosine_similarity([e1], [e2])[0][0] for e1, e2 in zip(emb1, emb2)]

    # Threshold at 0.5 to predict similar/different
    preds = [1 if s > 0.5 else 0 for s in similarities]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)

    return accuracy, similarities

test_pairs = [
    ("cheap hotel", "affordable accommodation", 1),
    ("machine learning", "artificial intelligence", 1),
    ("cat on the mat", "deep learning model", 0),
    ("how to code in python", "python programming tutorial", 1),
    ("stock market crash", "cooking recipes", 0),
    ("neural network", "deep learning", 1),
    ("fix bug in code", "debug software", 1),
    ("the weather today", "quantum physics research", 0),
]

for model_name in ['sentence-transformers/all-MiniLM-L6-v2',
                    'sentence-transformers/all-mpnet-base-v2']:
    acc, sims = evaluate_embeddings(model_name, test_pairs)
    print(f"\n{model_name.split('/')[-1]}:")
    print(f"  Accuracy on test pairs: {acc:.1%}")
    for (s1, s2, label), sim in zip(test_pairs, sims):
        status = 'correct' if (sim > 0.5) == label else 'WRONG'
        print(f"  [{status}] sim={sim:.3f} | '{s1[:25]}' vs '{s2[:25]}'")
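
The fixed 0.5 cutoff is a convenience, and different models can put their scores on different scales. A rough sketch of sweeping candidate thresholds instead is shown below. The similarity scores are made up for illustration, and `accuracy_at` and `best_threshold` are hypothetical helper names.

```python
def accuracy_at(threshold, similarities, labels):
    """Fraction of pairs classified correctly when 'similar' means score > threshold."""
    preds = [1 if s > threshold else 0 for s in similarities]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def best_threshold(similarities, labels, candidates=None):
    """Return (threshold, accuracy) for the first candidate with the highest accuracy."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
    return max(((t, accuracy_at(t, similarities, labels)) for t in candidates),
               key=lambda pair: pair[1])

# Made-up similarity scores and labels, for illustration only
sims   = [0.71, 0.44, 0.08, 0.83, 0.02, 0.62, 0.47, 0.11]
labels = [1,    1,    0,    1,    0,    1,    1,    0]

print(f"Accuracy at 0.5: {accuracy_at(0.5, sims, labels):.1%}")
t, acc = best_threshold(sims, labels)
print(f"Best threshold: {t:.2f} (accuracy {acc:.1%})")
```

Tuning the threshold on the same pairs you report accuracy on overstates the result, so hold out a separate set of pairs if the number matters.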

Common Embedding Patterns

# Pattern 1: Asymmetric search (short queries matched against longer passages)
# Both sides are encoded with the same model, one trained on query-passage pairs (MS MARCO)

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

bi_encoder = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-v4')

# Documents
passages = [
    "LoRA stands for Low-Rank Adaptation and is used for efficient fine-tuning.",
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Python was created by Guido van Rossum and first released in 1991.",
]

# Short query
query = "What is LoRA?"

query_emb    = bi_encoder.encode(query)
passage_embs = bi_encoder.encode(passages)

sims = cosine_similarity([query_emb], passage_embs)[0]
top  = np.argmax(sims)
print(f"Query: '{query}'")
print(f"Best match [{sims[top]:.3f}]: '{passages[top]}'")

# Pattern 2: Clustering embeddings to find topics
from sklearn.cluster import KMeans

sentences = [
    "Python is great for data science.",
    "R is used for statistical computing.",
    "Machine learning requires lots of data.",
    "Deep learning uses neural networks.",
    "Java is widely used in enterprise software.",
    "JavaScript powers the web frontend.",
    "Supervised learning uses labeled data.",
    "Unsupervised learning finds hidden patterns.",
]

model      = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

print("\nClustered sentences:")
for cluster_id in range(3):
    print(f"\nCluster {cluster_id}:")
    for sent, label in zip(sentences, labels):
        if label == cluster_id:
            print(f"  - {sent}")

Quick Cheat Sheet

Concept	What it means
Embedding	Dense vector representing text semantics
Cosine similarity	Cosine of the angle between vectors. 1=same direction, 0=orthogonal, -1=opposite
L2 normalization	Scale vectors to unit length before cosine/dot product
FAISS IndexFlatIP	Exact search with inner product (cosine after L2 norm)
FAISS IVF	Approximate search, partitions space into clusters
ChromaDB	Vector database with persistence and metadata filtering
nprobe	FAISS IVF: number of clusters to search. Higher=more accurate
Batch encoding	Encode many texts at once for efficiency

Task	Code
Load model	`SentenceTransformer('all-MiniLM-L6-v2')`
Encode text	`model.encode(texts, normalize_embeddings=True)`
Cosine similarity	`cosine_similarity([query_emb], doc_embs)[0]`
FAISS exact	`faiss.IndexFlatIP(dim)`
FAISS approximate	`faiss.IndexIVFFlat(quantizer, dim, n_clusters, faiss.METRIC_INNER_PRODUCT)`
ChromaDB add	`collection.add(ids=ids, documents=docs, embeddings=embs, metadatas=metas)`
ChromaDB search	`collection.query(query_embeddings, n_results=5)`
Top-k results	`np.argsort(similarities)[::-1][:k]`

Practice Challenges

Level 1:
Build a semantic search engine on a topic you care about. Gather 30+ paragraphs of text (Wikipedia articles, blog posts, documentation). Encode them with all-MiniLM-L6-v2. Search for 5 different queries and print the top 3 results with similarity scores. Are the results actually relevant?

Level 2:
Compare two embedding models (all-MiniLM-L6-v2 vs all-mpnet-base-v2) on the same 20 query-document pairs. Which one finds more relevant results? Is the quality difference worth the size difference?

Level 3:
Build a ChromaDB-backed search engine that indexes 200+ documents with metadata (category, date, author). Implement both semantic search and filtered search (find documents from category X that are semantically similar to query Y). Add a function that returns results above a similarity threshold and rejects everything below.

Next up, Post 98: RAG: Give Your AI Access to Your Documents. Retrieval Augmented Generation combines semantic search with LLM generation. Ask questions about any document and get accurate, grounded answers.

96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop

Akhilesh — Sun, 24 May 2026 18:10:28 +0000

GPT-2 has 117M parameters. LLaMA-2 has 7B. GPT-3 has 175B.

Full fine-tuning means updating every single parameter. For GPT-2 that's manageable. For a 7B LLaMA-2 model, the float32 gradients alone take 28GB of GPU memory, on top of 28GB for the weights. For GPT-3 it's basically impossible without a cluster.

LoRA (Low-Rank Adaptation) solves this. Instead of updating the full weight matrices, it adds tiny trainable modules next to them. The original weights stay frozen. Only the tiny modules train. At the end you merge them back.

You go from needing several A100s to needing a single consumer GPU. For small models, sometimes just a CPU.

What You'll Learn Here

Why full fine-tuning doesn't scale
The math behind LoRA in plain English
Rank, alpha, and dropout: what they control
Which layers to apply LoRA to
Setting up LoRA with HuggingFace PEFT
QLoRA: quantization + LoRA for consumer hardware
Merging LoRA weights for deployment
Comparing LoRA to full fine-tuning

The Problem With Full Fine-Tuning at Scale

# Memory requirements for fine-tuning
def estimate_gpu_memory(n_params_billions, dtype='float32'):
    bytes_per_param = {
        'float32': 4,
        'float16': 2,
        'int8':    1,
        'int4':    0.5
    }

    bpp       = bytes_per_param[dtype]
    model_gb  = n_params_billions * 1e9 * bpp / 1e9

    # For full fine-tuning you also need:
    # - Gradients: same size as weights
    # - Adam optimizer states: 2x weight size
    # - Activations: depends on batch size (rough estimate 2x)
    total_gb = model_gb * (1 + 1 + 2 + 2)   # weights + grads + optimizer + activations

    return model_gb, total_gb

print(f"{'Model':<15} {'Params':<10} {'Weights':<12} {'Full FT Memory'}")
print("-" * 50)
for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    w_gb, total = estimate_gpu_memory(params, 'float32')
    print(f"{name:<15} {params:<10} {w_gb:.1f} GB      {total:.0f} GB")

Output:

Model           Params     Weights      Full FT Memory
--------------------------------------------------
GPT-2           0.117      0.5 GB      3 GB
LLaMA-7B        7          28.0 GB     168 GB
LLaMA-13B       13         52.0 GB     312 GB
GPT-3           175        700.0 GB    4200 GB

By this rough estimate, full fine-tuning LLaMA-7B in float32 needs about 168GB of GPU memory. A single A100 tops out at 80GB, so you need at least 3 of them, at a cost of $30,000+.
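
The GPU count follows from a ceiling division. The sketch below reuses the same rough 6x multiplier (weights, gradients, Adam states, activations) and assumes 80GB cards. It counts memory only, so treat it as a lower bound. `full_ft_gb` and `gpus_needed` are hypothetical helper names.

```python
import math

def full_ft_gb(n_params_billions, bytes_per_param=4, multiplier=6):
    """Rough full fine-tuning memory: weights + gradients + Adam states + activations."""
    return n_params_billions * bytes_per_param * multiplier

def gpus_needed(total_gb, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gpu_memory_gb)

for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    total = full_ft_gb(params)
    print(f"{name:<10} {total:>7.1f} GB -> {gpus_needed(total)} x 80GB GPU")
```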

LoRA changes this dramatically.

How LoRA Works: The Math

A pretrained weight matrix W has shape (d_out, d_in). Full fine-tuning updates W directly:

W_new = W_pretrained + ΔW

ΔW has the same shape as W. That's the problem. It's huge.

LoRA's insight: the update ΔW doesn't need to have full rank. Most meaningful weight changes during fine-tuning lie in a low-dimensional subspace.

Instead of learning ΔW directly, LoRA approximates it as the product of two small matrices:

ΔW ≈ B × A

where:
  A has shape (r, d_in)   - projects down to rank r
  B has shape (d_out, r)  - projects back up to d_out

r << min(d_in, d_out)

During forward pass:

output = x @ W^T + x @ A^T @ B^T × (alpha/r)
       = (pretrained part) + (LoRA part)

W stays frozen. Only A and B train. Total parameters: r * (d_in + d_out) instead of d_in * d_out.

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16, dropout=0.1):
        super().__init__()

        self.original = original_layer
        self.rank     = rank
        self.alpha    = alpha
        self.scaling  = alpha / rank

        # Freeze the original layer
        for param in self.original.parameters():
            param.requires_grad = False

        # LoRA matrices A and B
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Linear(in_features,  rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        self.dropout = nn.Dropout(dropout)

        # Initialize: A with Kaiming-uniform noise, B with zeros
        # B=0 means the LoRA delta is zero at init, so the wrapped layer starts out identical to the original
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        # Original output (frozen)
        original_out = self.original(x)

        # LoRA delta
        lora_out = self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

        return original_out + lora_out

    def parameter_count(self):
        original_params = sum(p.numel() for p in self.original.parameters())
        lora_params     = sum(p.numel() for p in self.lora_A.parameters()) + \
                          sum(p.numel() for p in self.lora_B.parameters())
        return original_params, lora_params

# Test LoRA layer
original_linear = nn.Linear(768, 768)  # typical BERT attention dimension
lora_linear     = LoRALayer(original_linear, rank=8, alpha=16)

original_params, lora_params = lora_linear.parameter_count()
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters:     {lora_params:,}")
print(f"Parameter reduction: {lora_params/original_params:.1%} of original")

x   = torch.randn(2, 10, 768)
out = lora_linear(x)
print(f"\nInput shape:  {x.shape}")
print(f"Output shape: {out.shape}")

Output:

Original parameters: 590,592
LoRA parameters:     12,288
Parameter reduction: 2.1% of original

Input shape:  torch.Size([2, 10, 768])
Output shape: torch.Size([2, 10, 768])

12,288 parameters instead of 590,592. Same output shape. 2.1% of the original.

Rank, Alpha, and What They Control

import pandas as pd

# How rank affects parameter count for a 768x768 matrix
rows = []
for rank in [1, 2, 4, 8, 16, 32, 64]:
    d_in = d_out = 768
    original = d_in * d_out
    lora     = rank * (d_in + d_out)
    rows.append({
        'Rank': rank,
        'LoRA params': lora,
        'Original params': original,
        '% of original': f"{lora/original:.2%}",
        'Reduction factor': f"{original//lora}x"
    })

print(pd.DataFrame(rows).to_string(index=False))

Output:

 Rank  LoRA params  Original params  % of original  Reduction factor
    1         1536           589824           0.26%            384x
    2         3072           589824           0.52%            192x
    4         6144           589824           1.04%             96x
    8        12288           589824           2.08%             48x
   16        24576           589824           4.17%             24x
   32        49152           589824           8.33%             12x
   64        98304           589824          16.67%              6x

Rank (r): how many dimensions to use in the low-rank approximation. Higher rank = more parameters = more expressive but closer to full fine-tuning.

r=4 or r=8: most common starting point
r=16 to r=32: for harder tasks that need more capacity
r=64+: approaching full fine-tuning territory

Alpha (α): scaling factor for the LoRA output. Controls how much influence LoRA has relative to the frozen model.

Usually set to alpha = rank (scaling = 1.0)
Or alpha = 2 * rank (scaling = 2.0, LoRA has more influence)
Common: rank=8, alpha=16 (scaling=2)

Dropout: regularization inside LoRA. Typically 0.05 to 0.1.

Which Layers to Apply LoRA To

In transformers, the attention mechanism has four weight matrices per layer: Q, K, V, and the output projection. The feed-forward block adds two more (three in gated designs such as LLaMA).

# Common LoRA target modules for different architectures

lora_targets = {
    'BERT / RoBERTa': {
        'targets': ['query', 'key', 'value', 'dense'],
        'note': "Attention projections; 'dense' also matches the feed-forward dense layers"
    },
    'GPT-2': {
        'targets': ['c_attn', 'c_proj'],
        'note': 'Combined QKV and output projection'
    },
    'LLaMA / Mistral': {
        'targets': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'note': 'All attention projections, sometimes gate_proj too'
    },
    'Minimal (fastest)': {
        'targets': ['q_proj', 'v_proj'],
        'note': 'Only query and value, fewer params but often enough'
    }
}

for arch, info in lora_targets.items():
    print(f"\n{arch}:")
    print(f"  Targets: {info['targets']}")
    print(f"  Note:    {info['note']}")

The original LoRA paper reports that adapting only Q and V performs about as well as adapting all four attention matrices under the same parameter budget, so Q and V alone are a reasonable starting point.
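
To put numbers on the saving, here is a small sketch that counts LoRA parameters for two target sets. It assumes a 12-layer encoder with 768x768 attention projections (the shape of BERT-base or roberta-base) and rank 8. The function names are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

def count_lora_params(n_layers, targets_per_layer, d_in, d_out, rank):
    """Total LoRA parameters when every targeted matrix has the same shape."""
    return n_layers * targets_per_layer * lora_params(d_in, d_out, rank)

n_layers, hidden, rank = 12, 768, 8   # assumed BERT-base-like shape

qv_only  = count_lora_params(n_layers, 2, hidden, hidden, rank)
all_four = count_lora_params(n_layers, 4, hidden, hidden, rank)

print(f"Q and V only:    {qv_only:,} LoRA parameters")
print(f"Q, K, V and out: {all_four:,} LoRA parameters")
```

Under these assumptions, targeting only Q and V halves the adapter size: 294,912 parameters instead of 589,824.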

LoRA With HuggingFace PEFT

pip install peft

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = 'roberta-base'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,       # sequence classification
    r=8,                               # rank
    lora_alpha=16,                     # alpha
    lora_dropout=0.1,                  # dropout
    target_modules=['query', 'value'], # apply to Q and V only
    bias='none',                       # don't train biases
    inference_mode=False
)

# Wrap the model with LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()

Output:

trainable params: 629,764 || all params: 125,277,444 || trainable%: 0.5025

0.5% of parameters. Everything else is frozen.

# Training with LoRA is identical to regular fine-tuning
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset
import evaluate
import numpy as np

# Load data
dataset   = load_dataset('imdb')
small_train = dataset['train'].select(range(2000))
small_val   = dataset['test'].select(range(500))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir='./lora_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-4,          # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    evaluation_strategy='epoch',   # renamed to eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"LoRA fine-tuning accuracy: {results['eval_accuracy']:.3f}")

Saving and Loading LoRA Weights

LoRA's other big advantage: the saved checkpoint is tiny. You only save the LoRA matrices, not the full model.

from peft import PeftModel

# Save only the LoRA weights
model.save_pretrained('./lora_weights')   # saves adapter_config.json and adapter_model.bin (adapter_model.safetensors in newer PEFT versions)
print("LoRA weights saved")

import os
for f in os.listdir('./lora_weights'):
    size = os.path.getsize(f'./lora_weights/{f}') / 1e6
    print(f"  {f}: {size:.1f} MB")

Output:

LoRA weights saved
  adapter_config.json: 0.0 MB
  adapter_model.bin: 2.4 MB     <- only 2.4 MB instead of 500+ MB!

# Load: start with base model, then load LoRA adapter
base_model_for_load = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
loaded_lora_model = PeftModel.from_pretrained(base_model_for_load, './lora_weights')
loaded_lora_model.eval()
print("LoRA model loaded successfully")

Merging LoRA for Deployment

After training, you can merge the LoRA weights into the base model. Then you have one clean model with no overhead at inference time.

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Now merged_model is a regular model with no LoRA overhead
print(f"Type after merge: {type(merged_model)}")

# Save the merged model
merged_model.save_pretrained('./merged_model')
tokenizer.save_pretrained('./merged_model')

# Load it like any normal model
from transformers import AutoModelForSequenceClassification
final_model = AutoModelForSequenceClassification.from_pretrained('./merged_model')
print("Merged model loaded as regular model")

# Check: no LoRA parameters, just the full model
n_params = sum(p.numel() for p in final_model.parameters())
print(f"Parameters: {n_params:,}")
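
Merging works because the LoRA path is linear: x @ W^T + s * x @ A^T @ B^T equals x @ (W + s * B @ A)^T, where s = alpha/r. The sketch below checks that identity on tiny hand-written matrices using plain Python lists. It illustrates the algebra only and is not PEFT's implementation. All values and helper names are invented for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def add(X, Y, scale=1.0):
    """Elementwise X + scale * Y."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Toy shapes: d_in=3, d_out=2, rank=1
W = [[0.5, -1.0,  2.0],
     [1.5,  0.0, -0.5]]          # (d_out, d_in), frozen
A = [[1.0, 2.0, -1.0]]           # (rank, d_in)
B = [[0.5],
     [-2.0]]                     # (d_out, rank)
alpha, rank = 2, 1
scaling = alpha / rank

x = [[1.0, 2.0, 3.0]]            # one input row

# Unmerged: frozen path plus scaled LoRA path
unmerged = add(matmul(x, transpose(W)),
               matmul(matmul(x, transpose(A)), transpose(B)),
               scale=scaling)

# Merged: fold the update into a single weight matrix first
W_merged = add(W, matmul(B, A), scale=scaling)
merged   = matmul(x, transpose(W_merged))

print("unmerged:", unmerged)
print("merged:  ", merged)
```

Both paths give the same output, which is why the merged model needs no extra matrices at inference time.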

QLoRA: 4-bit Quantization + LoRA

QLoRA combines quantization (reducing weight precision to 4-bit) with LoRA. This lets you fine-tune 7B+ models on a single consumer GPU.

pip install bitsandbytes

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize to 4-bit
    bnb_4bit_quant_type='nf4',            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in fp16
    bnb_4bit_use_double_quant=True        # double quantization (saves more memory)
)

# Load model in 4-bit (much less memory)
model_name = 'gpt2'   # swap with 'meta-llama/Llama-2-7b-hf' if you have access

qlora_base = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto'          # automatically handles multi-GPU or CPU offload
)

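# Disable the KV cache during training (it conflicts with gradient checkpointing)
qlora_base.config.use_cache           = False
# LLaMA-specific setting; it has no effect on GPT-2
qlora_base.config.pretraining_tp      = 1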

# Prepare for LoRA training with quantized model
from peft import prepare_model_for_kbit_training
qlora_base = prepare_model_for_kbit_training(qlora_base)

# Apply LoRA config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn', 'c_proj'],  # GPT-2 specific; 'c_proj' matches both the attention and MLP projections
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

qlora_model = get_peft_model(qlora_base, qlora_config)
qlora_model.print_trainable_parameters()

Output:

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364

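# Memory savings with QLoRA
# Approximate memory just to hold a 7B model's weights. Training adds gradients,
# optimizer states, and activations on top, which is where full fine-tuning explodes.
memory_estimates = {
    'Full fine-tuning (fp32)':     '~28 GB weights, all of them trainable',
    'Full fine-tuning (fp16)':     '~14 GB weights, all of them trainable',
    'LoRA (fp16 base)':            '~14 GB frozen weights + small adapters',
    'QLoRA (4-bit + LoRA)':        '~3.5 GB frozen weights + small adapters',
}

print("Weight memory for a 7B parameter model:")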
for method, memory in memory_estimates.items():
    print(f"  {method:<35}: {memory}")
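
The weight-storage part of these figures is easy to sanity-check: parameters times bytes per parameter. The sketch below covers weights only. It leaves out gradients, optimizer states, activations, and the small metadata overhead that quantization adds. `weights_gb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = {'fp32': 4, 'fp16': 2, 'int8': 1, '4-bit': 0.5}

def weights_gb(n_params, dtype):
    """Memory to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

n_params = 7e9   # a 7B-parameter model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:<6} {weights_gb(n_params, dtype):>5.1f} GB")
```

Going from fp16 to 4-bit cuts the frozen weights from 14 GB to 3.5 GB, which is what lets a 7B base model sit on a consumer GPU while the LoRA adapters train.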

LoRA vs Full Fine-Tuning: Benchmark Comparison

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate, numpy as np, time, torch

model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

dataset    = load_dataset('imdb')
small_train = dataset['train'].select(range(1000))
small_val   = dataset['test'].select(range(300))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

def run_experiment(use_lora, rank=8):
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    if use_lora:
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS, r=rank,
            lora_alpha=rank*2, lora_dropout=0.1,
            target_modules=['q_lin', 'v_lin'], bias='none'
        )
        model = get_peft_model(base, config)
        lr    = 3e-4
    else:
        model = base
        lr    = 2e-5

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())

    args = TrainingArguments(
        output_dir=f'./exp_{"lora" if use_lora else "full"}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy='epoch',   # renamed to eval_strategy in newer transformers releases
        report_to='none',
        logging_steps=999
    )

    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics
    )

    start   = time.time()
    trainer.train()
    elapsed = time.time() - start
    results = trainer.evaluate()

    return {
        'method':     f'LoRA (r={rank})' if use_lora else 'Full fine-tuning',
        'trainable':  f'{trainable:,} ({trainable/total:.1%})',
        'accuracy':   f"{results['eval_accuracy']:.3f}",
        'time_s':     f"{elapsed:.0f}s"
    }

print("Running comparison (this takes a few minutes)...")
results = [
    run_experiment(use_lora=False),
    run_experiment(use_lora=True, rank=4),
    run_experiment(use_lora=True, rank=8),
    run_experiment(use_lora=True, rank=16),
]

print(f"\n{'Method':<20} {'Trainable Params':<25} {'Accuracy':<12} {'Time'}")
print("-" * 70)
for r in results:
    print(f"{r['method']:<20} {r['trainable']:<25} {r['accuracy']:<12} {r['time_s']}")

Typical output:

Method               Trainable Params          Accuracy     Time
----------------------------------------------------------------------
Full fine-tuning     66,955,010 (100%)         0.934        148s
LoRA (r=4)           147,968 (0.22%)           0.921        102s
LoRA (r=8)           295,168 (0.44%)           0.928        108s
LoRA (r=16)          589,824 (0.88%)           0.931        115s

LoRA with r=8 gets 99.4% of full fine-tuning accuracy with 0.44% of the parameters and 73% of the training time. For larger models, the savings are even more dramatic.
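
The percentages in that summary follow directly from the table. As a quick sanity check, here is a small sketch that recomputes them; the numbers are copied from the typical output above, and `relative_to_full` is just an illustrative helper name.

```python
# Numbers copied from the "Typical output" table above.
runs = {
    "Full fine-tuning": {"trainable": 66_955_010, "accuracy": 0.934, "time_s": 148},
    "LoRA (r=4)":       {"trainable": 147_968,    "accuracy": 0.921, "time_s": 102},
    "LoRA (r=8)":       {"trainable": 295_168,    "accuracy": 0.928, "time_s": 108},
    "LoRA (r=16)":      {"trainable": 589_824,    "accuracy": 0.931, "time_s": 115},
}

def relative_to_full(runs, baseline="Full fine-tuning"):
    """Express each run as a percentage of the full fine-tuning baseline."""
    base = runs[baseline]
    return {
        name: {
            "accuracy_pct": 100 * r["accuracy"] / base["accuracy"],
            "params_pct":   100 * r["trainable"] / base["trainable"],
            "time_pct":     100 * r["time_s"] / base["time_s"],
        }
        for name, r in runs.items()
    }

summary = relative_to_full(runs)
for name, s in summary.items():
    print(f"{name:<18} acc {s['accuracy_pct']:.1f}%  "
          f"params {s['params_pct']:.2f}%  time {s['time_pct']:.0f}%")
```

Running the same calculation on your own results is a cheap way to decide whether a higher rank is buying enough accuracy to justify the extra parameters.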

When to Use LoRA vs Full Fine-Tuning

Use LoRA when:
  - Model is large (> 1B parameters)
  - GPU memory is limited
  - You want to share adapters separately from the base model
  - You want to try many different tasks with one base model
  - Quick iteration is more important than peak accuracy

Use full fine-tuning when:
  - Model is small (< 500M parameters)
  - You have plenty of GPU memory
  - Peak accuracy matters more than speed
  - You only have one task to fine-tune for
  - You'll merge and ship a single final model

Quick Cheat Sheet

Concept	What it means
Rank (r)	Dimensions of LoRA matrices. r=8 is a good default.
Alpha (α)	Scaling. Set to 2*r or same as r.
Target modules	Which weight matrices to apply LoRA to. Start with Q and V.
Scaling factor	alpha/rank. Controls LoRA strength.
Merge and unload	Bake LoRA into base weights. One clean model for deployment.
QLoRA	4-bit quantization + LoRA. Fine-tune 7B on 4GB GPU.
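
To make the "Scaling factor" and "Merge and unload" rows concrete, here is a dependency-free sketch with tiny made-up matrices. It assumes the usual LoRA formulation, where the adapted layer computes W x + (alpha / r) * B (A x), and checks that baking the update into W gives the same output. The helper names (`matmul`, `rand_matrix`, `lora_param_count`) are illustrative, not part of the PEFT library.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def lora_param_count(d_in, d_out, r):
    """Trainable values for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4
scale = alpha / r                 # the scaling factor: alpha / rank

W = rand_matrix(d_out, d_in)      # frozen base weight
A = rand_matrix(r, d_in)          # trained LoRA matrix (random here for illustration)
B = rand_matrix(d_out, r)         # trained LoRA matrix (random here for illustration)
x = rand_matrix(d_in, 1)          # one input column vector

# Adapter path: base output plus the scaled low-rank update
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: fold the update into the weight once, then run a plain layer
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matmul(W_merged, x)

max_diff = max(abs(a[0] - m[0]) for a, m in zip(y_adapter, y_merged))
print(f"max difference between adapter and merged outputs: {max_diff:.2e}")
print(f"768x768 matrix: full = {768 * 768:,} params, LoRA r=8 = {lora_param_count(768, 768, 8):,}")
```

The two paths agree up to floating-point error, which is why a merged model needs no adapter code at inference time.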

Task	Code
Configure LoRA	`LoraConfig(r=8, lora_alpha=16, target_modules=[...])`
Apply to model	`get_peft_model(base_model, lora_config)`
Check params	`model.print_trainable_parameters()`
Save adapters	`model.save_pretrained('./lora_weights')`
Load adapters	`PeftModel.from_pretrained(base_model, './lora_weights')`
Merge weights	`model.merge_and_unload()`
QLoRA setup	`BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')`

Practice Challenges

Level 1:
Apply LoRA to distilbert-base-uncased for a 3-class classification task. Use r=4 then r=16. Print the trainable parameter counts for both. Fine-tune each for 2 epochs. Compare accuracy vs parameter count.

Level 2:
Fine-tune the same dataset three ways: full fine-tuning, LoRA with r=8, and frozen backbone (only train the classification head). Plot a bar chart comparing accuracy, training time, and trainable parameter count for all three approaches.

Level 3:
Set up QLoRA with bitsandbytes on any GPT-style model. Verify it loads in 4-bit. Fine-tune on a small instruction dataset for 1 epoch. Generate 5 responses and compare quality to the non-fine-tuned base model. Report GPU memory usage before and after loading.

Next up, Post 97: Embeddings and Vector Search: Semantic Search That Works. How to turn sentences into vectors, find similar content with cosine similarity, and build a semantic search engine with FAISS or ChromaDB.

95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job

Akhilesh — Sat, 23 May 2026 13:30:18 +0000

A general language model knows a little about everything.

It knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format.

Fine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch.

This post covers how to do it properly.

What You'll Learn Here

What fine-tuning actually does to a pretrained model
The three types of fine-tuning and when to use each
Preparing datasets for instruction fine-tuning
Full fine-tuning with the HuggingFace Trainer
Evaluating fine-tuned models properly
Catastrophic forgetting and how to avoid it
Tips that actually make a difference

What Fine-Tuning Does

A pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge.

Fine-tuning continues training on a smaller, task-specific dataset. The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it.

Pretrained model:
  - Knows language deeply
  - Broad but shallow domain knowledge
  - No concept of your task format

After fine-tuning:
  - Still knows language
  - Deep knowledge of your domain
  - Understands your task format
  - Responds in your required style

The weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise.

Three Types of Fine-Tuning

Type 1: Full Fine-Tuning
Update all weights. Best results. Expensive. Needs lots of data. Risk of catastrophic forgetting.

Type 2: Feature Extraction (Frozen backbone)
Freeze the pretrained model. Only train a new head (classification layer, etc.). Fast. Needs very little data. Limited adaptation.

Type 3: Parameter-Efficient Fine-Tuning (LoRA, adapters)
Add small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Best of both worlds. Covered in depth in Post 96.

# Type 1: Full fine-tuning
for param in model.parameters():
    param.requires_grad = True   # all params update

# Type 2: Frozen backbone
for param in model.base_model.parameters():
    param.requires_grad = False  # freeze backbone
# only classifier head trains

# Type 3: LoRA (simplified)
# Covered in Post 96

Dataset Preparation

Good data beats a good model almost every time. This is where most fine-tuning projects live or die.

For classification fine-tuning:

from datasets import Dataset, DatasetDict
import pandas as pd

# Your labeled data
data = {
    'text': [
        "The patient presented with acute chest pain radiating to the left arm.",
        "The quarterly earnings exceeded analyst expectations by 15%.",
        "The defendant claims he was not present at the scene of the crime.",
        "Treatment with metformin reduced HbA1c levels significantly.",
        "Revenue growth was driven by strong performance in cloud services.",
        "The prosecution presented DNA evidence linking the suspect to the crime.",
        "MRI results showed no signs of cerebral hemorrhage.",
        "Operating margins expanded by 200 basis points year over year.",
        "The jury found the defendant not guilty on all counts.",
        "The patient was discharged after a three-day hospitalization.",
    ],
    'label': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal
}

df = pd.DataFrame(data)

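# Train/val split
# With stratify, the validation set needs at least one row per class.
# test_size=0.2 would hold out 2 of 10 rows for 3 classes and raise ValueError,
# so hold out 3 rows instead.
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, test_size=3, random_state=42, stratify=df['label'])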

train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset   = Dataset.from_pandas(val_df.reset_index(drop=True))

dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})
print(dataset)

For instruction fine-tuning (making a model follow prompts):

# Instruction format used by most modern LLMs
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

# Example instruction dataset
instruction_data = [
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Patient reports persistent cough and shortness of breath for 3 weeks.',
        'output': 'symptom'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',
        'output': 'treatment'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.',
        'output': 'diagnosis'
    },
]

for example in instruction_data:
    print(format_instruction(example))
    print("-" * 50)

Data Quality Checklist

Before fine-tuning, verify your data:

import pandas as pd
import numpy as np

def audit_dataset(df, text_col='text', label_col='label'):
    print("=" * 50)
    print("DATASET AUDIT REPORT")
    print("=" * 50)

    # Size
    print(f"\nTotal examples: {len(df):,}")

    # Class distribution
    print(f"\nClass distribution:")
    dist = df[label_col].value_counts(normalize=True)
    for label, pct in dist.items():
        count = df[label_col].value_counts()[label]
        print(f"  Class {label}: {count} ({pct:.1%})")

    # Imbalance check
    max_class = dist.max()
    min_class = dist.min()
    ratio     = max_class / min_class
    if ratio > 5:
        print(f"  WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights.")

    # Text length
    lengths = df[text_col].str.len()
    print(f"\nText length:")
    print(f"  Min:    {lengths.min()}")
    print(f"  Max:    {lengths.max()}")
    print(f"  Median: {lengths.median():.0f}")
    print(f"  Mean:   {lengths.mean():.0f}")

    # Long texts warning
    if lengths.max() > 512 * 4:  # rough estimate of 512 tokens
        print(f"  WARNING: Some texts may exceed token limits. Check truncation strategy.")

    # Duplicates
    n_dupes = df[text_col].duplicated().sum()
    if n_dupes > 0:
        print(f"\n  WARNING: {n_dupes} duplicate texts found. Remove before training.")

    # Missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        print(f"\n  WARNING: {missing} missing values found.")
    else:
        print(f"\nNo missing values.")

    print("=" * 50)

audit_dataset(pd.DataFrame(data))
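
The audit flags duplicates but does not remove them. A minimal standard-library sketch of the cleanup step follows. It also checks for validation texts that already appear in the training set, which is an extra check the audit above does not perform. The function names and the sample rows are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def drop_duplicates(rows):
    """Keep the first occurrence of each normalized text. rows: list of (text, label)."""
    seen, kept = set(), []
    for text, label in rows:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept

def leaked_texts(train_rows, val_rows):
    """Validation texts that also appear (after normalization) in the training rows."""
    train_keys = {normalize(text) for text, _ in train_rows}
    return [text for text, _ in val_rows if normalize(text) in train_keys]

# Made-up rows for illustration
rows = [
    ("MRI results showed no signs of cerebral hemorrhage.", 0),
    ("mri results showed  no signs of cerebral hemorrhage. ", 0),  # near-verbatim copy
    ("Operating margins expanded by 200 basis points.", 1),
]
clean = drop_duplicates(rows)
print(f"{len(rows)} rows -> {len(clean)} after de-duplication")

val_rows = [("Operating margins expanded by 200 basis points.", 1)]
print("Leaked into validation:", leaked_texts(clean, val_rows))
```

De-duplicate before splitting. Otherwise a copy can land on each side of the split and inflate validation scores.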

Full Fine-Tuning for Sequence Classification

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
import evaluate
import numpy as np
import torch

model_name  = 'distilbert-base-uncased'
num_labels  = 3
label_names = ['medical', 'finance', 'legal']

id2label = {i: l for i, l in enumerate(label_names)}
label2id = {l: i for i, l in enumerate(label_names)}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,       # DataCollator will pad dynamically
        max_length=256
    )

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val   = val_dataset.map(tokenize_function, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Metrics
accuracy = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1  = f1_metric.compute(
        predictions=predictions, references=labels, average='weighted'
    )['f1']
    return {'accuracy': acc, 'f1': f1}

# Training arguments
training_args = TrainingArguments(
    output_dir='./checkpoints/domain_classifier',

    # Training schedule
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,

    # Optimization
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,            # warmup for 10% of steps
    lr_scheduler_type='cosine',  # cosine decay after warmup

    # Evaluation
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,

    # Logging
    logging_steps=10,
    logging_dir='./logs',
    report_to='none',

    # Efficiency
    fp16=torch.cuda.is_available(),  # mixed precision on GPU
    dataloader_num_workers=0,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train
print("Starting fine-tuning...")
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal Results:")
print(f"  Accuracy: {results['eval_accuracy']:.3f}")
print(f"  F1:       {results['eval_f1']:.3f}")
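
The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings above shape the learning rate over training. Here is a rough sketch of that shape, assuming the common linear-warmup then cosine-decay formula; the library's exact step bookkeeping may differ, and `lr_at_step` is a hypothetical helper, not a transformers function.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 100
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]

for s in (0, 5, 10, 55, 100):
    print(f"step {s:>3}: lr = {schedule[s]:.2e}")
```

The learning rate starts at zero, climbs during the first 10% of steps, peaks at the configured value, and then decays smoothly. The slow start protects the pretrained weights from large early updates.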

Evaluating a Fine-Tuned Model Properly

Accuracy alone isn't enough. Look at per-class performance, confusion matrix, and error cases.

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Get predictions on validation set
model.eval()
all_preds  = []
all_labels = []

val_dataloader = trainer.get_eval_dataloader()

with torch.no_grad():
    for batch in val_dataloader:
        batch   = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        preds   = torch.argmax(outputs.logits, dim=-1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

# Classification report
print("Classification Report:")
print(classification_report(all_labels, all_preds, target_names=label_names))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix - Fine-tuned DistilBERT')
plt.tight_layout()
plt.savefig('fine_tune_confusion.png', dpi=100)
plt.show()

# Error analysis: look at what the model gets wrong
errors = []
texts  = val_df['text'].tolist()

for i, (pred, true) in enumerate(zip(all_preds, all_labels)):
    if pred != true:
        errors.append({
            'text':      texts[i],
            'true':      label_names[true],
            'predicted': label_names[pred]
        })

print(f"\nErrors ({len(errors)} out of {len(all_labels)}):")
for e in errors:
    print(f"\n  True: {e['true']}, Predicted: {e['predicted']}")
    print(f"  Text: '{e['text'][:80]}...'")

Error analysis is often the most valuable step. Understanding why the model gets specific examples wrong tells you what data to add next.
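
To see why accuracy alone can mislead, here is a small pure-Python sketch that derives per-class precision, recall, and F1 from a confusion matrix. The labels and predictions are made up for illustration; in practice you would pass in `all_labels` and `all_preds` from the loop above.

```python
def confusion(y_true, y_pred, n_classes):
    """cm[i][j] = number of examples with true class i predicted as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_report(cm, names):
    report = {}
    for i, name in enumerate(names):
        tp        = cm[i][i]
        predicted = sum(row[i] for row in cm)   # column sum
        actual    = sum(cm[i])                  # row sum
        precision = tp / predicted if predicted else 0.0
        recall    = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Made-up predictions: the model confuses legal text with finance
names  = ["medical", "finance", "legal"]
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2]

cm       = confusion(y_true, y_pred, len(names))
report   = per_class_report(cm, names)
accuracy = sum(cm[i][i] for i in range(len(names))) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
for name, m in report.items():
    print(f"{name:<8} precision {m['precision']:.2f}  recall {m['recall']:.2f}  f1 {m['f1']:.2f}")
```

Overall accuracy here is 0.80, yet the model recovers only one in three legal examples. That is exactly the kind of gap a single number hides.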

Catastrophic Forgetting: The Real Risk

When you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade.

# Signs of catastrophic forgetting:
# 1. Model performs well on your task but fails on general text
# 2. Perplexity on general text spikes
# 3. Model generates incoherent text outside your domain

# Prevent it with:

# 1. Low learning rate (2e-5 is usually safe for BERT-based models)
training_args_safe = TrainingArguments(
    learning_rate=2e-5,        # not 1e-3 or 1e-4
    weight_decay=0.01,         # L2 regularization
    warmup_ratio=0.1,
    num_train_epochs=3,        # not 50
    output_dir='./safe_ft'
)

# 2. Freeze early layers (they contain general language knowledge)
def freeze_early_layers(model, n_frozen_layers=4):
    # Freeze embedding layers
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    # Freeze first n transformer layers
    for layer in model.distilbert.transformer.layer[:n_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")

freeze_early_layers(model, n_frozen_layers=4)

# 3. Working with a small dataset? Consider LoRA (Post 96) instead of full fine-tuning
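
The second sign above, a perplexity spike on general text, is easy to quantify once you have per-token log-probabilities from the model before and after fine-tuning. A minimal sketch follows, assuming natural-log probabilities. The log-prob values and the 1.5x alert threshold are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that spreads probability evenly over 4 tokens has perplexity 4
uniform = perplexity([math.log(0.25)] * 8)

# Made-up log-probs for the same general-domain sentence
before_ft = [-1.2, -0.8, -1.5, -1.0]   # pretrained model
after_ft  = [-2.9, -3.4, -2.7, -3.1]   # after aggressive fine-tuning

ppl_before = perplexity(before_ft)
ppl_after  = perplexity(after_ft)
ratio      = ppl_after / ppl_before

print(f"uniform check:  {uniform:.2f}")
print(f"before: {ppl_before:.2f}  after: {ppl_after:.2f}  ratio: {ratio:.1f}x")
if ratio > 1.5:   # hypothetical threshold; pick one that suits your use case
    print("General-text perplexity rose sharply: possible catastrophic forgetting.")
```

Keep a small held-out set of general text and track this ratio at each checkpoint alongside your task metric.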

Instruction Fine-Tuning a Generative Model

For causal LLMs (GPT-style), you format the data as prompts and completions.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch

# Load a small generative model
model_name = 'gpt2'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.use_cache = False   # disable the KV cache during training (it must be off if you enable gradient checkpointing)

# Instruction dataset
instructions = [
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\n\n### Response:\n",
        'completion': "Machine learning allows computers to learn from data and make decisions without explicit programming."
    },
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\n\n### Response:\n",
        'completion': "The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair."
    },
]

# Tokenize: concatenate prompt + completion, mask prompt in loss
def tokenize_instruction(example, max_length=256):
    full_text = example['prompt'] + example['completion'] + tokenizer.eos_token

    tokenized = tokenizer(
        full_text,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )

    input_ids      = tokenized['input_ids'][0]
    attention_mask = tokenized['attention_mask'][0]
    labels         = input_ids.clone()

    # Mask the prompt tokens in loss (we only want to train on completions)
    prompt_ids = tokenizer(example['prompt'], return_tensors='pt')['input_ids'][0]
    prompt_len = len(prompt_ids)
    labels[:prompt_len] = -100   # -100 is ignored in CrossEntropyLoss

    # Mask padding too: pad_token is eos_token here, so without this
    # the loss would also be computed on every padded position
    labels[attention_mask == 0] = -100

    return {
        'input_ids':      input_ids,
        'attention_mask': attention_mask,
        'labels':         labels
    }

tokenized_data = [tokenize_instruction(ex) for ex in instructions]

# Convert to dataset
import torch

class InstructionDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

train_ds = InstructionDataset(tokenized_data)

# Fine-tune
training_args = TrainingArguments(
    output_dir='./instruct_model',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size = 4
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=5,
    save_steps=50,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
)

trainer.train()
print("Instruction fine-tuning complete")
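
The `-100` labels are what keep prompt tokens out of the loss. Here is a pure-Python sketch of that masking, with made-up logits over a 3-token vocabulary. It ignores the one-position label shift that causal language models apply internally, and it returns 0.0 when every position is ignored, which is a simplification of this sketch.

```python
import math

IGNORE_INDEX = -100   # default ignore_index of PyTorch's CrossEntropyLoss

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue                      # prompt or padding position: no loss
        m = max(row)                      # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]       # -log softmax(row)[label]
        count += 1
    return total / count if count else 0.0

# Made-up example: 2 prompt positions (masked) followed by 2 completion positions
logits = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2], [0.1, 0.2, 3.0], [1.5, 0.3, 0.2]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]

loss            = masked_cross_entropy(logits, labels)
completion_only = masked_cross_entropy(logits[2:], labels[2:])

# Changing the logits at masked positions leaves the loss untouched
perturbed      = [[9.0, -9.0, 0.0], [-5.0, 5.0, 5.0]] + logits[2:]
loss_perturbed = masked_cross_entropy(perturbed, labels)

print(f"loss with mask:       {loss:.4f}")
print(f"completion-only loss: {completion_only:.4f}")
print(f"after perturbing masked positions: {loss_perturbed:.4f}")
```

All three numbers match, which shows the model is only ever graded on the response it should produce.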

Testing Your Fine-Tuned Model

# Test the fine-tuned generative model
model.eval()

def generate_response(prompt, max_new_tokens=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generated = output[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Test prompt
test_prompt = """### Instruction:
Summarize this in one sentence.

### Input:
Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation.

### Response:
"""

response = generate_response(test_prompt)
print(f"Generated response:\n{response}")
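
The `temperature` argument controls how sharply sampling favors the most likely token. A small sketch of the idea follows, assuming the standard approach of dividing logits by the temperature before the softmax. The logits are made up, and the helper names are illustrative rather than anything from transformers.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.7):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
for t in (0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))

rng = random.Random(0)
print("sampled token index:", sample_token(logits, temperature=0.7, rng=rng))
```

Lower temperatures concentrate probability on the top token, which tends to give more focused summaries. Higher temperatures flatten the distribution and add variety.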

Fine-Tuning Best Practices

# Summary of what actually works

best_practices = {
    'learning_rate': {
        'BERT-based (classification)': '2e-5 to 5e-5',
        'GPT-based (generation)':      '1e-5 to 3e-5',
        'Frozen backbone':             '1e-4 to 1e-3 for head only'
    },
    'batch_size': {
        'recommendation': '16 or 32 if memory allows',
        'small GPU':      'batch=4 + gradient_accumulation=4'
    },
    'epochs': {
        'BERT classification': '2 to 4',
        'GPT generation':      '1 to 3',
        'note':                'More epochs = more overfitting risk'
    },
    'data_size': {
        'frozen backbone':  'Works with 100+ examples',
        'full fine-tuning': 'Need 1000+ for reliable results',
        'instruction FT':   '1000 to 10000 good examples'
    },
    'stopping': {
        'recommendation': 'Always use early stopping',
        'metric':         'Monitor validation loss, not training loss'
    }
}

for category, details in best_practices.items():
    print(f"\n{category.upper()}:")
    for k, v in details.items():
        print(f"  {k}: {v}")
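
The small-GPU recommendation works because accumulating gradients over several small batches reproduces the gradient of one large batch. Here is a toy check on a one-parameter linear model with made-up data. It assumes equal-sized micro-batches and a loss averaged within each batch, which is the usual setup.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data: 16 examples from y = 3x + 1
xs = [float(i) for i in range(1, 17)]
ys = [3 * x + 1 for x in xs]
w  = 0.5

# One big batch of 16
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of 4, each gradient scaled by 1 / accumulation_steps
micro_batch_size   = 4
accumulation_steps = 4
accumulated_grad   = 0.0
for k in range(accumulation_steps):
    lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
    accumulated_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(f"full batch gradient:  {full_batch_grad:.6f}")
print(f"accumulated gradient: {accumulated_grad:.6f}")
```

The optimizer step sees the same gradient either way. Only the peak memory differs, because just four examples are held in memory at a time.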

Quick Cheat Sheet

Decision	Guidance
How much data do I have?	< 500: freeze backbone. 500-5k: full fine-tune. > 5k: full fine-tune comfortably
Which model to start with?	DistilBERT for speed, RoBERTa for accuracy
Learning rate	2e-5 for BERT, 1e-5 for GPT, never > 5e-5 for full fine-tuning
Epochs	2-4, use early stopping
Catastrophic forgetting	Lower LR, freeze early layers, fewer epochs
Model not learning	Raise LR, check data quality, check label correctness
Model overfitting	Lower LR, add dropout, add more data, use LoRA

Task	Code
Load model	`AutoModelForSequenceClassification.from_pretrained(name, num_labels=N)`
Tokenize	`tokenizer(texts, truncation=True, padding=False, max_length=256)`
Train	`Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)`
Early stop	`EarlyStoppingCallback(early_stopping_patience=2)`
Save	`trainer.save_model('./my_model')`
Predict	`trainer.predict(test_dataset)`
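
For intuition about the "Early stop" row, here is a simplified sketch of patience-based stopping: training halts once the monitored metric has failed to improve for `patience` evaluations in a row. This illustrates the idea only, not the library's implementation. `stop_epoch` and the metric histories are made up.

```python
def stop_epoch(metric_history, patience=2, greater_is_better=True):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best, bad_evals = None, 0
    for epoch, value in enumerate(metric_history, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best

        if improved:
            best, bad_evals = value, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None

# Made-up validation F1 per epoch: peaks at epoch 2, then stalls
f1_history = [0.70, 0.78, 0.77, 0.76, 0.80]
print("stop at epoch:", stop_epoch(f1_history, patience=2))

# Made-up validation loss (lower is better): keeps improving, so no early stop
loss_history = [0.90, 0.70, 0.71, 0.69]
print("stop at epoch:", stop_epoch(loss_history, patience=2, greater_is_better=False))
```

In the first run, training stops at epoch 4 and never reaches the better epoch-5 score. That is the trade-off of a short patience, and it is why pairing early stopping with `load_best_model_at_end=True` matters.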

Practice Challenges

Level 1:
Download any small labeled text dataset from the HuggingFace hub. Fine-tune distilbert-base-uncased on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline.

Level 2:
Fine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size?

Level 3:
Create your own instruction dataset of 50+ examples for a specific task (code explanation, medical text classification, legal summarization). Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality.

Next up, Post 96: LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.