How LLMs Actually Work (And What That Means for Your Architecture Decisions)

#ai #llm #machinelearning #webdev

When I started working with language models I made the same mistake almost everyone makes: I treated the LLM like an intelligent black box, I fed it a prompt, a response came out, and if the response was bad I assumed the model was bad.

I was wrong, the model is almost never the problem; the problem is that I didn't understand how it processes information, and that made my architecture decisions terrible.

This article is not an academic paper, I'm not going to talk about attention matrices or gradients; what I am going to do is explain how an LLM works the way I wish someone had explained it to me before I started building with one.

An LLM doesn't read, it predicts

The first thing to understand, and the one that most changes how you work with these models, is that an LLM doesn't "understand" text the way humans do.

What it does is predict; given a sequence of words, it predicts which word is most likely to come next, then the next, and the next, until it completes a response.

That sounds simple, almost trivial, but the reason that prediction seems intelligent is that the model was trained on massive amounts of human text, books, articles, code, conversations, and it learned the patterns of how humans connect ideas, argue, explain, and answer questions.

It doesn't know anything, it recognizes patterns extremely well.

Why does this matter in practice? Because when an LLM "hallucinates", when it invents a fact, cites a source that doesn't exist, or states something false with complete confidence, it's not lying; it's predicting the most likely response given its training; if its training had more text affirming X than denying X, it will predict X even if X is false.

Understanding this changes how you design your prompts, how you validate responses, and what kind of tasks you assign to the model.

Context is everything, and it has a limit

The second thing to understand is the concept of the context window; every time you interact with an LLM, the model only "sees" what's inside that window, it has no memory of previous conversations, it doesn't remember what you told it yesterday, it only processes what's in the current context.

Think of it like working with someone who has amnesia between meetings; every time you call them they start from zero, the only thing they know is what you show them in that session.

Modern models have enormous context windows, Claude handles up to 200,000 tokens according to Anthropic's documentation, roughly equivalent to an entire book, and GPT-4o handles 128,000 tokens according to OpenAI; but that doesn't mean you can dump everything in and expect the model to process it equally well throughout.

In practice models tend to pay more attention to the beginning and end of the context than the middle; if you put in 50 pages of documents and the critical information is on page 25, there's a real probability the model won't weigh it correctly in its response.

This has direct architecture implications; if you're building a RAG system, which is basically connecting the LLM to your knowledge base, the quality of what you retrieve and how you order it inside the context matters as much as the model you choose.

The difference between a base model and an instruction-tuned one

Something that confuses a lot of people at first is the difference between a base model and a chat or instruction-tuned model.

A base model is the result of training on massive text; if you give it the start of a sentence, it continues it, it's not optimized to follow instructions or have a conversation, it's like an engine without a steering wheel.

An instruction-tuned model, like GPT-4o, Claude Sonnet, or Gemini, is that same base engine but with additional training that teaches it to follow instructions, answer questions, and behave in a useful and safe way; it's what you use when you open ChatGPT or Claude and have a conversation.

Why does this distinction matter? Because when you evaluate whether to fine-tune a model you need to understand whether you're working on the base model or the instruction-tuned one, and that each requires different data and strategies; according to Hugging Face's documentation and the experience of teams that have done this in production, poorly planned fine-tuning can degrade the model's instruction-following behavior, making it less useful in general while improving it on the specific task.

RAG vs fine-tuning, the decision that gets made wrong most often

This is probably the most important architectural decision when building something with LLMs, and it's the one I most often see made without the right analysis.

RAG connects the LLM to an external knowledge base at inference time; when the user asks a question, the system first searches for the most relevant information fragments in your database, puts them in the context along with the question, and the model responds using that specific information.

Fine-tuning adapts the model's weights using your specific data during training; the model literally "learns" your domain and internalizes it.

The general rule I use: RAG for knowledge that changes, fine-tuning for behavior you want to change.

If you have internal documentation that constantly updates, a product catalog that changes, or a knowledge base that grows, RAG; updating a vector index is trivial compared to retraining a model.

If you want the model to respond in a specific tone, follow a particular format, or master a very specialized task where the base model is consistently poor, fine-tuning.

In most enterprise cases I've seen RAG is the right answer; fine-tuning is expensive, requires quality data in volume, and has to be repeated every time the base model updates; according to Weights & Biases data from 2025, more than 70% of enterprise LLM implementations in production use RAG as their primary architecture.

What an LLM can't do, and why that matters

Just as important as understanding what an LLM can do is understanding its real limitations, not the ones that appear in the headlines.

An LLM doesn't reason, it simulates reasoning convincingly because it was trained on text that contains reasoning; when you ask it to solve a complex logic problem it's not following logical steps, it's predicting what text should appear after a logic problem, sometimes it matches the correct answer, sometimes it doesn't.

An LLM doesn't have updated knowledge beyond its training cutoff date; GPT-4o has knowledge through early 2024 according to OpenAI, and for more recent information you need RAG with updated sources or a model with browsing enabled.

An LLM is not deterministic; the same question can produce different responses, that's intentional, there's a parameter called temperature that controls how much randomness there is in the prediction, but it has implications for systems that need consistency.

Is it worth it for your company?

My honest answer: it depends on whether you have a language problem.

If your operation has processes that involve processing, generating, classifying, or summarizing text in volume, documents, emails, support tickets, contracts, reports, an LLM can probably do something useful there; if the bottleneck in your operation is something completely different from language, an LLM is not the solution even if it sounds good in the deck.

What is true is that the cost of experimenting dropped dramatically; Claude's API costs cents per thousand tokens according to Anthropic's documentation, GPT-4o-mini is even cheaper, and you can build a functional prototype in days, not months, and validate whether there's real value before committing serious implementation budget.

What didn't drop is the cost of doing it wrong; a poorly designed LLM system that reaches production is harder to fix than one that was never built, and architecture matters from day one; if you want to go straight to building something with LLMs without getting lost in theory, here's how we do it: LLM Development Services