Lily

Posted on Jun 6

The Developer's Guide to LLMs

#general #interesting #howto #tutorial

The brainstorming skill doesn't fit here — the user has given a complete, explicit specification with no design ambiguity. Writing the article directly.

You paste a function into ChatGPT, it gives you a refactored version in seconds. You wire up an API, ship a "AI-powered" feature, and move on. Then a bug appears — hallucinated imports, a method signature that doesn't exist, context that evaporates mid-conversation — and you realize you've been flying blind.

LLMs are not magic. They're systems with specific mechanics, failure modes, and levers. Once you understand how they actually work, you stop fighting them and start engineering with them.

What an LLM Actually Is

A large language model is a neural network trained to predict the next token in a sequence. That's it. There's no reasoning engine, no fact database, no lookup table. The model has compressed patterns from billions of text documents into billions of numerical weights, and at inference time it samples the most statistically plausible continuation of your input.

This framing matters for developers. LLMs are extremely good at pattern matching and interpolation within their training distribution. They are unreliable for precise factual recall, arithmetic, and anything requiring deterministic correctness. They hallucinate confidently because confidence is baked into the sampling process — the model doesn't know what it doesn't know.

Tokens: The Unit of Everything

LLMs don't see characters or words. They see tokens — chunks of text produced by a tokenizer like BPE (Byte Pair Encoding). English text averages roughly one token per 0.75 words, but this varies widely. authentication might be one token; antidisestablishmentarianism might be five.

Why does this matter for developers?

Pricing is per token. Input and output tokens are often priced differently.
Context limits are in tokens. "128k context" means 128,000 tokens of combined input and output.
Tokenization affects model behavior. Weird tokenization of camelCase identifiers, URLs, or non-English text can degrade output quality.

Use a tokenizer library (Tiktoken for OpenAI models, the tokenizer endpoints for others) to check counts before sending large payloads. Don't guess.

Context Windows: The LLM's Working Memory

The context window is everything the model can "see" at once — your system prompt, conversation history, tool results, and current message. Unlike a database, the model has no persistent memory between API calls. Each call starts fresh.

This creates practical problems. A conversation that starts coherent can degrade as context fills up. Earlier instructions get "diluted" as they drift further from the model's effective attention. Some architectures handle long contexts better than others, but the constraint is real for all of them.

Strategies for managing context:

Summarize history rather than sending raw conversation logs.
Use retrieval to inject only the relevant parts of a knowledge base.
Trim tool outputs — LLM-visible tool results should be compact, not raw API responses.
Front-load your system prompt — models attend more reliably to content at the start and end of context.

Temperature, Top-p, and Sampling

When a model generates a token, it produces a probability distribution over its entire vocabulary. Sampling parameters control how you pick from that distribution.

Temperature scales the distribution. At 0, you always pick the highest-probability token — fully deterministic. At 1, you sample proportionally. Above 1, outputs get more random and incoherent. For code generation, use low temperature (0–0.3). For creative writing or brainstorming, use higher values (0.7–1.0).

Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches p. At top_p=0.9, you only sample from the top 90% of the probability mass, pruning outlier tokens even at higher temperatures.

In practice: for production code generation, start at temperature=0 and only raise it if you need output diversity. For conversational applications, temperature=0.7 with top_p=0.95 is a reliable baseline.

Prompt Engineering for Developers

Prompting is interface design for a stochastic system. The goal is to constrain the model's output distribution toward what you actually need.

A few patterns that reliably work:

Role + task + format. "You are a senior TypeScript developer. Refactor the following function to use async/await. Return only the updated function, no explanation." All three components matter — the role primes behavioral patterns, the task specifies the action, the format prevents bloat.

Provide examples. Few-shot prompting — including two to three input/output pairs — is one of the highest-leverage techniques available. The model pattern-matches against your examples before it pattern-matches against its training data.

Chain-of-thought for complex reasoning. Asking the model to reason step-by-step before giving a final answer improves accuracy on multi-step tasks. The intermediate tokens act as working memory.

Positive over negative instructions. "Don't include explanations" is weaker than "Return only the code block." Tell the model what to do, not what to avoid.

Tool Use and Function Calling

Modern LLMs can invoke tools — functions they call to fetch data, run code, or trigger actions. The model doesn't execute the tool itself; it generates a structured request (usually JSON) that your application interprets and runs, then feeds the result back into context.

This is the backbone of agentic systems. A well-designed tool interface treats the LLM as the orchestration layer and keeps actual execution in deterministic code. The model decides what to do; your code decides how.

For reliable tool use:

Name tools clearly and describe parameters completely — the model's tool selection is only as good as your descriptions.
Return structured, compact results, not raw HTML dumps or paginated API responses.
Include error information in tool results; the model can reason about failures if you explain what happened.
Validate tool call arguments before execution. The model can produce malformed inputs for complex schemas.

RAG: Grounding LLMs in Real Data

Retrieval-Augmented Generation is a pattern where you retrieve relevant documents from an external store and inject them into the prompt before generating a response. The model answers based on retrieved content rather than training data alone.

The basic pipeline: embed the query → retrieve top-k chunks → inject into context → generate.

RAG solves two core problems: knowledge cutoffs (your model doesn't know about post-training events) and hallucination risk (grounding the model in source documents reduces confabulation).

For developers building RAG systems:

Chunk size matters. Too small and you lose semantic context; too large and retrieval precision drops.
Retrieval quality is the bottleneck. Better embedding models and reranking steps consistently beat raw vector search on real tasks.
Instruct the model to cite sources. This adds accountability and makes bugs far easier to diagnose.

Fine-Tuning vs. Prompting

Fine-tuning modifies model weights on a curated dataset. It's appropriate when you need consistent format and style adherence, domain-specific vocabulary, or behavior that's genuinely hard to achieve through prompting alone.

It's not appropriate for injecting knowledge. Fine-tuned models learn behavioral patterns, not factual recall. If you need the model to know your internal documentation, use RAG — not fine-tuning.

For most developer use cases, prompting gets you further than you'd expect. Reserve fine-tuning for cases where you have thousands of high-quality examples and a measurable, reproducible behavior gap that prompting can't close.

Choosing the Right Model

Frontier models are not always the right choice. A smaller, faster, cheaper model often outperforms a frontier model on narrow, well-specified tasks — because you can prompt it more precisely and iterate faster.

Match the model to the task:

Structured extraction, classification, routing → smaller models perform well and cost far less
Complex multi-step reasoning, large codebase generation → frontier models earn their price
Real-time, low-latency applications → optimize for speed with smaller models or prompt caching
High-volume batch jobs → batch API endpoints offer significant cost reductions with the same model

Benchmark on your actual task before committing. Leaderboard performance and real-world task performance are not the same thing.

Takeaway

LLMs are powerful tools with specific, predictable mechanics. Tokens are the unit of cost and capacity. Context is finite and managed deliberately. Sampling parameters shape output behavior. Prompts are interfaces, not incantations. Tool use and RAG extend what models can do without any retraining.

The developers who get the most out of LLMs aren't the ones who trust the models blindly — they're the ones who understand the failure modes, design around them, and measure outputs instead of eyeballing them. Treat an LLM like you'd treat any powerful external dependency: with respect, appropriate skepticism, and good instrumentation.

DEV Community