LangChain vs LlamaIndex vs Raw API Calls: What I Chose After 3 Production Projects

#ai #llm #python #programming

When you're building an LLM application for production, the first decision you'll hit is: should I use a framework, and which one? LangChain and LlamaIndex are the two dominant Python frameworks, but raw API calls are always an option. After shipping three production systems — a document Q&A service, an internal security alert classifier, and a multi-step research agent — I have strong opinions on this.

What Each Option Actually Is

LangChain is a general-purpose LLM application framework. It provides chains (sequential steps), agents (tool-calling loops), memory abstractions, and integrations with dozens of services. It's now split into langchain-core, langchain-community, and provider-specific packages like langchain-openai.

LlamaIndex (formerly GPT Index) focuses primarily on data ingestion and retrieval — RAG pipelines. It has strong abstractions for document loading, chunking, indexing, and querying. It recently added agent support, but retrieval is where it shines.

Raw API calls means calling the LLM provider's SDK directly and building your own pipeline logic without a framework in between.
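To show how little sits underneath a "raw" call, here is a minimal sketch that goes one level below the SDK and builds the HTTP request by hand with the standard library. It assumes OpenAI's Chat Completions endpoint and response shape; the helper names build_chat_request and extract_text are made up for illustration, and in practice the provider SDK does this for you.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key: str, model: str, system: str, user: str) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions HTTP request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the assistant message out of a Chat Completions JSON response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Sending is one more step once a real key is available:
# with urllib.request.urlopen(build_chat_request(key, "gpt-4o-mini", "Be brief.", "Hi")) as r:
#     print(extract_text(r.read()))
```

Everything else in a raw-API pipeline (retries, validation, routing) is ordinary application code that you own.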

Project 1: Document Q&A Service (LlamaIndex Won)

The first system was a Q&A service over a 50,000-page internal knowledge base. Users ask questions; the system retrieves relevant chunks and generates an answer.

I started with LangChain's RetrievalQA chain. The abstraction looked clean until I needed to tune retrieval: hybrid search (BM25 + vector), custom reranking, and source deduplication. I spent more time fighting the abstractions than using them.
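For context on what that tuning involves: hybrid search means merging a keyword (BM25) ranking with a vector ranking into one list. One common way to do that is reciprocal rank fusion (RRF). The sketch below is illustrative only and is not the code from this project: the chunk IDs are made up, and k=60 is simply the commonly used default for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk IDs returned by each retriever, best match first.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c4", "c3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c4', 'c7']
```

Chunks that both retrievers agree on (c1, c3) rise to the top, which is the behavior you want from a hybrid retriever before reranking.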

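Switching to LlamaIndex was the right move. Its RetrieverQueryEngine gave me exactly the composability I needed. The basic shape is a retriever, a list of node postprocessors, and a query engine that wires them together: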

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)

retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.72)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)

response = query_engine.query("What is the data retention policy for logs?")
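print(response.response)
for node in response.source_nodes:
    # Metadata keys depend on how documents were ingested; adjust "filename" to match yours.
    source = node.metadata.get("filename", "unknown")
    # score can be None for some retrievers, so guard the float formatting.
    score = f"{node.score:.3f}" if node.score is not None else "n/a"
    print(f"  Source: {source} (score: {score})")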

LlamaIndex's node pipeline gave me control over every step without subclassing anything. The reranker plugged in through the same node_postprocessors list as the similarity cutoff above, source attribution worked out of the box, and the abstractions matched how I think about retrieval.
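If the idea of a postprocessor pipeline is new, the following plain-Python sketch shows the pattern: each stage takes a list of scored chunks and returns a filtered or reordered list. This is not LlamaIndex's API. ScoredChunk, similarity_cutoff, dedupe_by_source, and run_stages are hypothetical names, and the sample chunks are invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScoredChunk:
    source: str  # document the chunk came from (hypothetical field)
    text: str
    score: float


Stage = Callable[[list[ScoredChunk]], list[ScoredChunk]]


def similarity_cutoff(threshold: float) -> Stage:
    """Return a stage that drops chunks scoring below the threshold."""
    def stage(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
        return [c for c in chunks if c.score >= threshold]
    return stage


def dedupe_by_source(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """Keep only the best-scoring chunk per source document."""
    best: dict[str, ScoredChunk] = {}
    for chunk in chunks:
        if chunk.source not in best or chunk.score > best[chunk.source].score:
            best[chunk.source] = chunk
    return sorted(best.values(), key=lambda c: c.score, reverse=True)


def run_stages(chunks: list[ScoredChunk], stages: list[Stage]) -> list[ScoredChunk]:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        chunks = stage(chunks)
    return chunks


# Invented retrieval results for illustration.
retrieved = [
    ScoredChunk("retention.md", "Logs are kept for a fixed period.", 0.91),
    ScoredChunk("retention.md", "Audit logs follow a separate rule.", 0.80),
    ScoredChunk("onboarding.md", "Welcome to the team.", 0.55),
    ScoredChunk("siem.md", "Log forwarding is configured per host.", 0.76),
]

kept = run_stages(retrieved, [similarity_cutoff(0.72), dedupe_by_source])
for chunk in kept:
    print(f"{chunk.source}: {chunk.score:.2f}")
```

Because every stage has the same signature, adding a reranker or swapping the order is a one-line change to the list, which is the property that made tuning retrieval manageable.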

Winner: LlamaIndex for RAG-heavy workloads.

Project 2: Security Alert Classifier (Raw API Calls Won)

The second system classified incoming SIEM alerts into severity tiers and routed them to the right team. One language model call per alert, structured JSON output, written to a database.

I initially used LangChain's StructuredOutputParser. It worked, but it added ~200ms of overhead, pulled in transitive dependencies I didn't need, and broke twice during minor LangChain version bumps. The framework was solving a problem I didn't have.

Raw API calls with instructor for structured output validation turned out to be simpler, faster, and more maintainable:

import instructor
import openai
from pydantic import BaseModel
from enum import Enum

class Severity(str, Enum):
    critical = "critical"
    high = "high"
    medium = "medium"
    low = "low"
    informational = "informational"

class AlertClassification(BaseModel):
    severity: Severity
    category: str
    assigned_team: str
    reasoning: str
    false_positive_probability: float

client = instructor.from_openai(openai.OpenAI())

def classify_alert(alert_text: str) -> AlertClassification:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=AlertClassification,
        messages=[
            {
                "role": "system",
                "content": "You are a security operations expert. Classify SIEM alerts accurately.",
            },
            {"role": "user", "content": f"Alert:\n{alert_text}"},
        ],
    )

result = classify_alert("Failed SSH login from 185.220.101.42 - 847 attempts in 60s")
print(result.severity, result.assigned_team, result.false_positive_probability)

Under forty lines. No framework ceremony. This has been running in production for eight months with zero framework-related incidents. If you are building security tooling and want a reference on what to validate at the model output layer, the security hardening checklists we publish cover adjacent threat-modeling ground.
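To make "validate at the model output layer" concrete, here is a hand-rolled sketch of the validate-and-retry pattern that a library like instructor automates. It is not instructor's implementation. The function names and the call_model callable (which takes the alert text plus feedback from the previous failed attempt) are assumptions for illustration; the field names mirror the AlertClassification model above.

```python
import json

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low", "informational"}


def validate_classification(raw: str) -> dict:
    """Parse model output and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")

    prob = data.get("false_positive_probability")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
        raise ValueError("false_positive_probability must be a number in [0, 1]")

    for field in ("category", "assigned_team", "reasoning"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} must be a non-empty string")
    return data


def classify_with_retries(call_model, alert_text: str, max_attempts: int = 3) -> dict:
    """Call the model until its output validates, feeding each error back."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(alert_text, feedback)
        try:
            return validate_classification(raw)
        except ValueError as exc:
            feedback = f"Previous output was rejected: {exc}"
    raise RuntimeError(f"no valid classification after {max_attempts} attempts")
```

The point of the sketch is the boundary: nothing downstream (the database write, the routing) ever sees output that has not passed the schema check.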

Winner: Raw API calls for single-step or lightly orchestrated tasks.

Project 3: Multi-Step Research Agent (LangChain Won — Barely)

The third system was a research agent: given a topic, search the web, fetch pages, summarize each, cross-reference findings, produce a structured report. Four to seven tool calls per run with conditional branching.

Building this with raw API calls meant reinventing the tool-calling loop, error recovery, and state management. I got three-quarters of the way through before switching to LangChain's agent abstraction.
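For a sense of what "reinventing the loop" means, here is a stripped-down sketch of a tool-calling loop with an iteration cap and basic error recovery. It is not the code I abandoned, and the action format (dicts with a "type" of "tool" or "final") is an assumption made up for this example. The model is replaced by a scripted stand-in so the sketch runs without a network or an API key.

```python
def run_agent(model_step, tools: dict, task: str, max_iterations: int = 8) -> str:
    """Minimal tool-calling loop.

    model_step(history) must return either
      {"type": "tool", "name": str, "args": dict} or
      {"type": "final", "output": str}.
    """
    history = [{"role": "user", "content": task}]  # the state the loop carries
    for _ in range(max_iterations):
        action = model_step(history)
        if action["type"] == "final":
            return action["output"]

        tool = tools.get(action["name"])
        if tool is None:
            observation = f"Error: unknown tool {action['name']!r}"
        else:
            try:
                observation = tool(**action["args"])
            except Exception as exc:  # error recovery: report to the model, do not crash
                observation = f"Error: {exc}"
        history.append(
            {"role": "tool", "name": action["name"], "content": str(observation)}
        )
    raise RuntimeError(f"no final answer after {max_iterations} iterations")


# Demo with scripted stand-ins (no real LLM, no real HTTP).
def fake_fetch(url: str) -> str:
    return f"contents of {url}"


def scripted_model(history: list) -> dict:
    # First turn: request a tool call. Once an observation exists, finish.
    if history[-1]["role"] == "user":
        return {"type": "tool", "name": "fetch_url", "args": {"url": "https://example.com"}}
    return {"type": "final", "output": f"Summary based on: {history[-1]['content']}"}


print(run_agent(scripted_model, {"fetch_url": fake_fetch}, "Summarize example.com"))
```

This version is the easy 20%. Streaming intermediate steps, parallel tool calls, token budgeting, and resumable state are where the hand-rolled approach gets expensive.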

LangChain's create_tool_calling_agent with AgentExecutor handled the loop correctly, including max iterations, early stopping, and streaming intermediate steps:

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
import httpx

@tool
def fetch_url(url: str) -> str:
    "Fetch the text content of a URL."
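    try:
        resp = httpx.get(url, timeout=10, follow_redirects=True)
        # Treat 4xx/5xx as errors so the agent does not summarize an error page.
        resp.raise_for_status()
        # Truncate to keep the tool output within a reasonable context budget.
        return resp.text[:4000]
    except Exception as e:
        # Return the error as text so the agent loop can react instead of crashing.
        return f"Error: {e}"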

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [fetch_url]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a research assistant. Use tools to gather information before answering."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, max_iterations=8, verbose=False)

result = executor.invoke({"input": "What are the latest CVEs affecting Apache HTTP Server?"})
print(result["output"])

I said "barely" because LangChain still caused pain: abstractions leak, documentation frequently lags the API, and some community integrations are unmaintained. But for a stateful multi-step agent loop, the alternative — building it myself — was worse.

Winner: LangChain for complex multi-step agents, with reservations.

How to Choose

Based on these three projects, my decision table is straightforward:

Scenario	Use
Pure RAG / retrieval pipeline	LlamaIndex
Document ingestion at scale	LlamaIndex
Simple LLM call with structured output	Raw API + instructor
Single-step classification or extraction	Raw API calls
Multi-step agent with tools	LangChain
Experimental prototype	Anything

The key insight: frameworks add value when they manage state or complex pipelines you would otherwise write yourself. If you are making one or two API calls and parsing the result, a framework is overhead, not an asset.

The Takeaway

Do not default to LangChain because it is popular. Start with raw API calls — they are simpler, faster to debug, and immune to framework version churn. Reach for LlamaIndex when your problem is fundamentally about retrieval and indexing. Use LangChain when you need a stateful multi-step agent loop and do not want to own the implementation.

The abstraction tax is real. Every framework you add becomes a dependency you maintain — including its bugs, its breaking changes, and its community's opinions about your architecture. The lightweight path is usually the right one until it is not.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.