Anurag Bagri

Understanding AI: From LLMs to MCP

Large Language Models (LLMs) form the foundation of today’s AI. At their core, an LLM like GPT-4 processes tokens (sub-word text units) through a deep neural network. Each token is converted into a high-dimensional embedding (a numeric vector capturing semantic meaning). For example, the sentence “Hello world” might be broken into tokens like "Hello", "Ġworld" and each token is mapped to a vector of hundreds or thousands of dimensions. These embeddings allow the model to understand relationships between words. GPT-4 also has a large context window (e.g. up to 8K or even 32K tokens in extended versions), meaning it can “remember” and attend to that many tokens in a single conversation. In practice, you might use GPT-4 in code like this:

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Explain the concept of a context window."}
    ]
)
print(response.choices[0].message.content)


Here, GPT-4 reads the system and user messages as tokens, embeds them into vectors, and generates a response. The context window lets the model incorporate long conversations or documents into its output. If a conversation exceeds the window size, older tokens are dropped or summarized, which can lead to loss of information. Large context windows address this limitation by allowing more prior text to influence the output. Embeddings and vector representations also enable similarity comparisons: two sentences with similar meaning will have vectors that are close under measures like cosine similarity.
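
Embeddings and cosine similarity are easy to demonstrate directly. Below is a small sketch (the example sentences are invented, and it reuses the text-embedding-ada-002 model mentioned later in this post) that embeds two paraphrases and compares them:

import math
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY")

def embed(text: str) -> list[float]:
    # Convert a piece of text into its embedding vector
    return client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = embed("The cat sat on the mat.")
v2 = embed("A kitten is resting on the rug.")
print(cosine_similarity(v1, v2))  # paraphrases score close to 1.0

Two paraphrases typically score noticeably higher than two unrelated sentences, which is exactly the property vector databases exploit later in this post.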

Agents (LangChain)

As LLMs matured, developers needed ways to act, not just chat. Agents—often built with frameworks like LangChain—turn LLMs into dynamic actors that reason, make decisions, and use external tools. Instead of a single prompt-response, an agent runs in a loop: it analyzes input, maybe calls a function or searches the web, and then decides next steps. LangChain lets you create an agent with built-in reasoning and tool usage. For example, you might give an agent a search tool and a calculator, then ask it a question:

from langchain.agents import create_agent
from langchain.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web for up-to-date information."""
    # Imagine this calls an actual search API
    return f"Top news results for '{query}'."

@tool
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression."""
    try:
        # eval is fine for a demo, but never use it on untrusted input
        return str(eval(expression))
    except Exception:
        return "Calculation error."

agent = create_agent(model="gpt-4", tools=[web_search, calculator])
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Who won the World Cup in 2022 and what is 2023 - 1980?"}]}
)
print(result["messages"][-1].content)


Here, the agent uses GPT-4 as its reasoning model and has two tools. When asked a compound question, the agent can first use the web_search tool to find the World Cup winner, then use calculator to compute the arithmetic. This surpasses a standalone LLM by allowing tool use and multi-step thinking. Agents address LLM limitations (like not having up-to-date data or complex reasoning) by orchestrating the model’s outputs with external knowledge and logic.

Prompt Engineering

Even with a powerful LLM, how you frame the question matters a lot. Prompt engineering is the practice of designing prompts to get the best results. Effective strategies include giving the model a clear role (using a system message), providing examples (few-shot learning), and encouraging step-by-step reasoning. For instance, you might use chain-of-thought prompting to get a thorough answer:

messages = [
    {"role": "system", "content": "You are a brilliant tutor."},
    {"role": "user",   "content": "Explain the solution step by step to: 2345 * 789."}
]
response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)

In this example, specifying a helpful persona encourages the model to provide clear steps. Key prompt tips include:

Be explicit: Clearly specify the task and format.

Use examples: Show input/output pairs to guide the model (few-shot).

Structured prompts: Ask for bullet points, numbered steps, or specific styles (e.g. “answer as JSON”).

Clarify the scope: Tell the model what to ignore or include to focus its reasoning.

Prompt engineering fills gaps where raw LLM outputs might be unfocused or hallucinated. By carefully crafting prompts, we steer GPT-4 to use its capabilities effectively and avoid previous issues like one-line answers or irrelevant details.
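
To make the few-shot and structured-output tips concrete, here is a small sketch (the sentiment-labeling task and the example reviews are invented for illustration; it reuses the client created earlier):

messages = [
    {"role": "system", "content": "You label customer reviews. Respond only with JSON like {\"sentiment\": \"positive\"}."},
    # Few-shot examples: show the model the exact input/output pattern we want
    {"role": "user", "content": "Review: The delivery was fast and the product works great."},
    {"role": "assistant", "content": "{\"sentiment\": \"positive\"}"},
    {"role": "user", "content": "Review: It broke after two days and support never replied."},
    {"role": "assistant", "content": "{\"sentiment\": \"negative\"}"},
    # The real query follows the same pattern as the examples
    {"role": "user", "content": "Review: Decent value, though the packaging was damaged."}
]
response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)  # expected: a single JSON object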

Vector Databases

LLMs have fixed context limits and no built-in memory, so we use vector databases to store and retrieve knowledge. In a vector DB, each piece of text (a document, paragraph, or chunk) is converted into an embedding vector (often 768 or 1536 dimensions). For example, using OpenAI’s embedding API:

embedding = client.embeddings.create(
    model="text-embedding-ada-002", input="Apple is looking at buying a UK startup for $1 billion"
)
vector = embedding.data[0].embedding  # a 1536-dimensional list of floats

These vectors go into a database (like Pinecone, Chroma, or FAISS). When a query comes, we embed the query and compute similarity scores (usually cosine similarity or dot product) against stored vectors. The top-scoring chunks (often overlapping segments of larger text) are returned. Chunk overlap means when splitting long documents into pieces, we overlap the splits slightly to avoid cutting apart relevant phrases.
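
Chunking with overlap is simple to sketch in plain Python; the 200-character chunk size and 50-character overlap below are arbitrary, and real pipelines usually split on sentence or paragraph boundaries rather than raw characters:

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so relevant phrases are not cut apart."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by chunk_size minus overlap so consecutive chunks share some text
        start += chunk_size - overlap
    return chunks

document = "Apple is looking at buying a UK startup for $1 billion. " * 20
chunks = chunk_text(document)
print(len(chunks), "chunks ready to be embedded and stored in the vector DB")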

The semantic meaning of vectors is key: similar content yields close vectors, so even if the wording changes, retrieval still finds relevant info. Dimensionality impacts how much nuance can be captured; typical embeddings use hundreds or thousands of dimensions. Vector DBs solve context-limit issues by effectively expanding an LLM’s memory: we can retrieve past data or knowledge on-the-fly and feed it into the model.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation combines vector search with LLM answers to produce grounded, accurate outputs. In RAG, a user’s query triggers a semantic search in a vector database; the retrieved documents become context for the LLM’s answer. For example:

# Step 1: Embed the query
query = "Who won the World Cup in 2022?"
q_embed = client.embeddings.create(model="text-embedding-ada-002", input=query).data[0].embedding

# Step 2: Search the vector DB (pseudocode)
results = vector_db.search(q_embed, top_k=3)  # returns top 3 relevant chunks

# Step 3: Construct a prompt with retrieved info
documents_text = "\n".join([doc.text for doc in results])
messages = [
    {"role": "system", "content": "You answer using provided documents."},
    {"role": "user",   "content": f"Documents: {documents_text}\n\nQuestion: {query}"}
]
answer = client.chat.completions.create(model="gpt-4", messages=messages)
print(answer.choices[0].message.content)


Here, RAG grounds GPT-4’s answer with real documents. It addresses limitations of plain LLMs: it counters hallucinations by providing factual context, and it handles knowledge beyond the model’s training cutoff. The trade-off is that RAG requires an external knowledge base and careful prompt construction to ensure the model uses the retrieved content correctly.

LangGraph for AI Workflows

As projects grow complex, simple chains and agents can become hard to manage. LangGraph introduces graph-based workflows for AI pipelines. Instead of linear chains, you define nodes (tasks or agents) and edges (data flow). This graph structure brings flow control, branching, and state management into AI systems.

For example, you might have a workflow where one node queries GPT-4 for ideas, another node checks a knowledge base, and a third node synthesizes answers. LangGraph allows branches: if a condition is met, the graph can take one path or another. It also includes memory: using annotations, you can preserve state across runs or checkpoint progress. Human-in-the-loop nodes can pause the flow for manual input, and you can stream intermediate results for real-time monitoring.

A simplified LangGraph workflow (with GPT4_chain as a placeholder for an LLM call) might look like:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ArticleState(TypedDict, total=False):
    input_text: str
    ideas: str
    draft: str

def idea_node(state: ArticleState) -> dict:
    # GPT4_chain stands in for an LLM call that brainstorms ideas
    return {"ideas": GPT4_chain(state["input_text"])}

def write_node(state: ArticleState) -> dict:
    # Turn the brainstormed ideas into a full draft
    return {"draft": GPT4_chain(state["ideas"])}

workflow = StateGraph(ArticleState)
workflow.add_node("generate_ideas", idea_node)
workflow.add_node("write_draft", write_node)
workflow.add_edge(START, "generate_ideas")
workflow.add_edge("generate_ideas", "write_draft")
workflow.add_edge("write_draft", END)

app = workflow.compile()
result = app.invoke({"input_text": "Write an article about AI ethics."})
print(result["draft"])

In practice, LangGraph (available in both Python and JavaScript) supports advanced features like error recovery and modular agent collaboration. By structuring workflows as graphs, LangGraph addresses the limitation of linear or ad-hoc pipelines, making AI systems more maintainable and scalable.
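
The branching mentioned earlier is expressed with conditional edges. Here is a minimal, self-contained sketch; the review/revise nodes and the length-based quality check are hypothetical stand-ins for real LLM or human-in-the-loop steps:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class DraftState(TypedDict, total=False):
    draft: str
    approved: bool

def review_node(state: DraftState) -> dict:
    # Hypothetical quality check; in practice this could be another LLM call or a human review
    return {"approved": len(state["draft"]) > 100}

def revise_node(state: DraftState) -> dict:
    # Hypothetical revision step that expands a too-short draft
    return {"draft": state["draft"] + " (expanded with more detail)"}

def route_after_review(state: DraftState) -> str:
    # Conditional edge: choose the next node based on the current state
    return "done" if state["approved"] else "revise"

graph = StateGraph(DraftState)
graph.add_node("review", review_node)
graph.add_node("revise", revise_node)
graph.add_edge(START, "review")
graph.add_conditional_edges("review", route_after_review, {"done": END, "revise": "revise"})
graph.add_edge("revise", "review")  # loop back for another review pass

app = graph.compile()
print(app.invoke({"draft": "A short draft about AI ethics."}))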

Model Context Protocol (MCP)

As agents rely more on tools, a standard way to integrate those tools becomes crucial. The Model Context Protocol (MCP) is an emerging open standard for exactly this purpose. MCP defines a protocol where tools run in separate processes or servers, and LLM applications communicate with them over a standardized JSON-RPC interface, with each tool described by a JSON schema. This separates tool implementation from agent logic, making tools language-agnostic and interoperable.

For example, an MCP math server might expose an "add" and a "multiply" tool, and your GPT-4 agent can call them without knowing their internal details.
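
As a sketch of what such a server could look like, here is a minimal math_server.py built with the FastMCP helper from the official MCP Python SDK (the tool set is purely illustrative):

# math_server.py
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Math")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

@mcp.tool()
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

if __name__ == "__main__":
    # Serve the tools over stdio so an MCP client can launch this file as a subprocess
    mcp.run(transport="stdio")

In LangChain, you can then use langchain-mcp-adapters to connect an agent to this server and others: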

import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain.agents import create_agent

async def main():
    # Define MCP servers for math and weather tools
    client = MultiServerMCPClient({
        "math": {
            "transport": "stdio",
            "command": "python",
            "args": ["math_server.py"]
        },
        "weather": {
            "transport": "streamable_http",
            "url": "http://localhost:8000/mcp"
        }
    })
    tools = await client.get_tools()  # load tools from both servers (async in recent adapter versions)
    agent = create_agent("gpt-4", tools=tools)

    response = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "If it's 20°C in London, what's that in °F?"}]}
    )
    print(response["messages"][-1].content)

asyncio.run(main())


In this setup, the agent sees “math:add” or “weather:get_weather” as tools. MCP ensures each tool call follows the same protocol, handling communication details. This addresses previous limitations where each tool library had its own interface. With MCP, building an AI system with many specialized tools becomes standardized, safer, and easier to maintain.

Final Words

Together, these components form a robust AI development stack. Each innovation addresses a specific bottleneck: LangChain (and its agents) orchestrates complex LLM workflows and tool usage; prompt engineering ensures precise control over GPT-4's outputs; vector databases and RAG overcome context-window and knowledge limitations by injecting relevant data into the model's input; LangGraph enables dynamic, stateful execution for multi-agent applications; and MCP provides a unified, scalable way to integrate external tools and context. In combination, these layers let GPT-4/ChatGPT systems scale gracefully and adapt in real time, yielding AI applications that are dynamic and deeply context-aware.
