DEV Community: Cursuri AI

Fine-Tuning vs RAG vs Prompting: How to Actually Decide in 2026

galian — Wed, 15 Jul 2026 11:49:01 +0000

There's a predictable arc to most LLM projects. Something doesn't work, someone says "we should fine-tune it," a month disappears into dataset wrangling and GPU bills, and the model comes back... about as wrong as before — because the actual problem was that it never had the right facts in front of it. Fine-tuning was never going to fix that.

The three techniques — prompting, retrieval-augmented generation (RAG), and fine-tuning — are not a ladder you climb from cheap to fancy. They solve different problems, and choosing the wrong one is expensive in exactly the way that's hard to notice: it looks like progress while it burns weeks.

This is a decision framework. Not "here's what each one is" — you can get that anywhere — but the concrete questions that tell you which one your problem actually needs, and the failure signatures that mean you picked wrong.

The one distinction that resolves most arguments

Before any framework, internalize this split, because it settles most "should we fine-tune?" debates on its own:

RAG changes what the model knows. It injects facts into the context at inference time.
Fine-tuning changes how the model behaves. It adjusts the weights to shift style, format, structure, and task-specific skill.
Prompting changes what the model is told to do right now, using the knowledge and behavior it already has.

So the first question is never "which technique is best?" It's "is my problem a knowledge gap or a behavior gap?"

The model gives outdated, made-up, or "I don't have access to that" answers about your data → knowledge gap → RAG.
The model knows the facts but won't reliably produce the format, tone, or task structure you need → behavior gap → fine-tuning (maybe).
You haven't seriously tried telling it clearly what to do yet → prompting, first, always.

Get this wrong and no amount of engineering saves you. Fine-tuning a model to "know your docs" is the classic error: you can bake a few facts into weights, but they go stale the moment your docs change, you can't cite sources, and you've spent training compute to build a worse version of a lookup. Knowledge that changes belongs in retrieval, not in weights.

Always start with prompting (yes, even now)

Prompting is not the beginner tier you graduate from. In 2026, with frontier models, a well-constructed prompt plus a few good examples solves a startling share of problems that teams assume need training. It's the fastest, cheapest, most inspectable option, and it should be your baseline before you're allowed to say the word "fine-tune."

Reach for prompting when:

You're still discovering what "good output" even looks like. Prompts are editable in seconds; datasets are not.
The task is reasoning, transformation, or generation the model already broadly knows how to do.
You need to ship this week.

The techniques that make prompting punch above its weight are unglamorous but real: precise role and task framing, few-shot examples that demonstrate the exact output shape, chain-of-thought for multi-step reasoning, and rigid output contracts (structured/JSON) so downstream code can trust the result. Most "the model can't do this" conclusions are actually "we asked badly" conclusions. Squeezing the ceiling out of prompting before spending on anything heavier is a discipline in itself — it's the whole point of a prompt engineering masterclass, and the ROI of getting it right first is enormous because everything downstream inherits a better baseline.
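As a rough sketch of two of those techniques, few-shot examples and a rigid output contract, the snippet below assembles a prompt and validates the reply before downstream code touches it. The ticket-triage task, the `FEW_SHOT` examples, the field names, and the helper names are all hypothetical, and the model call itself is left out.

```python
import json

# Hypothetical few-shot examples that demonstrate the exact output shape.
FEW_SHOT = [
    ("I was charged twice this month", {"category": "billing", "urgent": True}),
    ("How do I change my avatar?", {"category": "account", "urgent": False}),
]
ALLOWED_CATEGORIES = {"billing", "account", "technical"}


def build_prompt(ticket: str) -> str:
    """Role and task framing, then few-shot examples, then the new input."""
    lines = [
        "You are a support triage assistant.",
        'Reply with JSON only: {"category": <billing|account|technical>, "urgent": <true|false>}',
        "",
    ]
    for text, label in FEW_SHOT:
        lines += [f"Ticket: {text}", f"Answer: {json.dumps(label)}", ""]
    lines += [f"Ticket: {ticket}", "Answer:"]
    return "\n".join(lines)


def parse_reply(reply: str) -> dict:
    """Enforce the output contract; raise instead of passing bad data downstream."""
    data = json.loads(reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"category", "urgent"}:
        raise ValueError(f"unexpected shape: {data!r}")
    if data["category"] not in ALLOWED_CATEGORIES or not isinstance(data["urgent"], bool):
        raise ValueError(f"invalid values: {data!r}")
    return data
```

Validating the reply in code means a malformed answer fails loudly at the boundary instead of surfacing later as a confusing downstream bug.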

The prompting ceiling — how you know you've hit it: you've iterated seriously, added good examples, and the model still fails — and the failure is either (a) it doesn't know facts it couldn't possibly know, or (b) it can't hold a consistent behavior across inputs no matter how you phrase the instruction. (a) points to RAG. (b) might point to fine-tuning. Not before.

Reach for RAG when the problem is knowledge

RAG is the answer whenever the model needs to work with information it wasn't trained on: your internal documentation, a product catalog, last week's tickets, a knowledge base that updates daily, anything private or fresh.

Choose RAG when:

Answers must be grounded in a specific corpus and you need to cite sources.
The knowledge changes — pricing, policies, docs, inventory. You update an index, not a model.
Hallucination on facts is unacceptable and you need an audit trail of where an answer came from.
The knowledge base is large, or partly access-controlled per user.

The reason RAG beats fine-tuning for knowledge isn't subtle: updating a document store is a cheap, fast re-index; updating weights is a training run. RAG gives you freshness, provenance, and a natural place to enforce per-user access control (filter at retrieval time), none of which fine-tuning can offer. When your facts have a shelf life, retrieval is the right architecture, and building it well (chunking, hybrid search, re-ranking) is where the real engineering lives; that is the substance of a dedicated course on RAG and retrieval-augmented generation.

RAG's own ceiling: retrieval fixes what the model knows, not how it behaves. If your RAG answers are factually correct but come out in the wrong format, wrong tone, or don't follow your house style no matter how you prompt — that residual behavior gap is exactly where fine-tuning finally earns its place, on top of RAG, not instead of it.

Fine-tune when the problem is behavior — and only then

Fine-tuning is the right tool, but for a narrower set of problems than its reputation suggests. It shines at teaching consistent behavior that's hard to specify in a prompt: a very specific output structure, a domain's tone and terminology, a classification or extraction task where you have lots of labeled examples, or a skill the base model does clumsily.

Legitimately reach for fine-tuning when:

You need consistent style, format, or structure at a level prompting can't hold across the full input distribution.
You have a narrow, high-volume, well-defined task (classification, extraction, a specific transformation) and enough quality labeled data.
You want to bake in a behavior so you can drop it from the prompt — shorter prompts, lower per-call cost, faster responses at scale.
Latency or cost at scale matters and a smaller fine-tuned model can match a bigger prompted one.

Two things make modern fine-tuning far less scary than its reputation. First, you almost never do full fine-tuning: parameter-efficient methods like LoRA/QLoRA train a small set of adapter weights, cutting the number of trainable parameters by orders of magnitude and the training memory footprint substantially, while getting most of the benefit. Second, the bottleneck is data quality, not model choice: a few hundred to a few thousand clean, consistent, representative examples usually beat a huge noisy pile. The hard part of fine-tuning was never running the training job. It's building the dataset, choosing PEFT trade-offs, and evaluating the result without fooling yourself, which is precisely the ground a fine-tuning course has to cover to be worth anything.
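To make the adapter idea concrete, here is a minimal NumPy sketch of the low-rank update described in the LoRA paper: the base weight W stays frozen and only two small matrices, A and B, would be trained. The layer size (1024 x 1024), the rank r = 8, and the scaling value are illustrative assumptions, not recommendations, and no actual training happens here.

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 8   # illustrative sizes (assumptions)
alpha = 16                        # LoRA scaling hyperparameter (assumed value)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)      # frozen base weight
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01   # trainable
B = np.zeros((d_out, r), dtype=np.float32)                     # trainable, starts at zero


def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x); W itself is never modified."""
    return W @ x + (alpha / r) * (B @ (A @ x))


full_params = W.size              # what full fine-tuning would update
lora_params = A.size + B.size     # what the adapter updates
print(full_params, lora_params, full_params // lora_params)  # 1048576 16384 64
```

With these numbers the adapter has 64 times fewer trainable parameters than the layer it adapts, and because B starts at zero, the adapted layer initially behaves exactly like the base layer.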

When fine-tuning is the wrong answer — the red flags:

"We'll fine-tune it on our docs so it knows them." → No. That's RAG. Fine-tuned facts go stale and can't be cited.
"We haven't really tried prompting." → Do that first; you may not need to train at all.
"The requirements change weekly." → Fine-tuning bakes behavior in; if the target moves, you're re-training constantly. Keep it in the prompt until it stabilizes.
"We have 40 examples." → Usually not enough for reliable behavior change; strong prompting with those 40 as few-shot examples will likely beat it.

The combinations are the real answer

Framing these as rivals is the beginner mistake. In production, the strongest systems combine them, because they operate on different axes — knowledge, behavior, and instruction — and stack cleanly:

RAG + prompting is the workhorse for most knowledge-grounded assistants: retrieve the right context, then a well-engineered prompt instructs the model to answer only from it and cite sources. No training required.
Fine-tuning + RAG is the high end: fine-tune for the domain's behavior (tone, format, task skill), and use RAG for the facts. The model behaves exactly right and stays current — behavior in the weights, knowledge in the index.
Fine-tuning + prompting collapses a long, brittle instruction into learned behavior, so your prompts get short and your inference gets cheaper.
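The RAG + prompting combination can be sketched in a few lines: number the retrieved chunks, then instruct the model to answer only from them and cite by number. The `Chunk` type, its fields, the file names, and the prompt wording are hypothetical; retrieval and the model call are out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str   # e.g. a document title or path (hypothetical field)
    text: str


def build_grounded_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the sources below. Cite sources as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    Chunk("refund-policy.md", "Annual plans can be refunded within 30 days."),
    Chunk("billing-faq.md", "Refunds are issued to the original payment method."),
]
print(build_grounded_prompt("Can I get a refund on an annual plan?", chunks))
```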

Orchestrating these — deciding which layer owns which responsibility, and routing a request through retrieval, tools, and the model in the right order — is its own engineering discipline, and it's the core of a course on advanced LLM integration. The mental model to keep: knowledge → retrieval, behavior → weights, instruction → prompt. Put each requirement on the axis it actually lives on.

The decision, in one pass

Run your problem through this, in order. Stop at the first that fits:

Have you genuinely exhausted prompting — clear instructions, good few-shot examples, structured output? If not → prompt. (This is where most projects should still be.)
Is the failure a knowledge gap — missing, stale, or private facts; needs citations? → RAG.
Is the failure a behavior gap — format/tone/task consistency the prompt can't hold, and you have quality labeled data and the target is stable? → fine-tune (LoRA first).
Is it both? → RAG for the facts, fine-tuning for the behavior. In that order.

And underneath all of it: you cannot make this decision without evaluation. "It seems better" is not data. Before you choose, build a small eval set — representative inputs with known-good outputs — so you can measure whether prompting already clears the bar, whether RAG actually retrieves the right context, and whether a fine-tune moved the metric or just moved the failures around. Teams that skip this pick techniques by vibes and discover the mistake in production; teams that treat evals as first-class make the cheap correct choice on purpose. The eval set is what turns this framework from an opinion into a decision.
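A small eval harness does not need to be elaborate. The sketch below scores two stand-in systems against the same hypothetical eval set using exact-match accuracy; the cases, labels, and both "systems" are invented for illustration, and many real tasks need a softer metric than exact match.

```python
from typing import Callable

# Hypothetical eval set: representative inputs with known-good outputs.
EVAL_SET = [
    ("refund for annual plan?", "billing"),
    ("cannot log in", "account"),
    ("app crashes on start", "technical"),
    ("update my email", "account"),
]


def evaluate(system: Callable[[str], str]) -> float:
    """Fraction of eval cases where the system's output matches the expected one."""
    hits = sum(1 for text, expected in EVAL_SET if system(text) == expected)
    return hits / len(EVAL_SET)


def baseline(text: str) -> str:
    """Stand-in for 'prompting only'; a real system would call a model here."""
    return "billing" if "refund" in text else "account"


def candidate(text: str) -> str:
    """Stand-in for the variant under test (RAG, a fine-tune, a new prompt)."""
    if "refund" in text:
        return "billing"
    return "technical" if "crash" in text else "account"


print(f"baseline={evaluate(baseline):.2f} candidate={evaluate(candidate):.2f}")
# baseline=0.75 candidate=1.00
```

The point of the design is that both variants face the same fixed cases, so a change in the number reflects the system and not the test.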

Conclusion

The reason so many LLM projects stall isn't a shortage of technique — it's reaching for the wrong one and mistaking motion for progress. Fine-tuning a model to "learn facts," RAG-ing a problem that was really a bad prompt, or grinding on prompts when the model fundamentally lacks the data: each fails in a way that looks like effort.

Anchor on the split and you'll rarely go wrong. Knowledge that changes → RAG. Behavior you can't prompt into place → fine-tuning. Everything else → prompt, and prompt well. Start cheap, measure honestly, and add complexity only when an eval — not a hunch — tells you the current layer has topped out. The best architecture isn't the most sophisticated one; it's the one that puts each requirement on the axis where it actually belongs.

Sources & further reading:

Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models
Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs

This article is educational content. Models, tooling, and cost trade-offs evolve quickly; validate any approach against your own data and current provider documentation before committing to it in production.

Vector Embeddings Explained: Build Semantic Search in Python

galian — Mon, 13 Jul 2026 11:25:24 +0000

Search for "reset my password" in a keyword-based system and a help article titled "How to recover your account credentials" won't match — not one word overlaps. Yet any human knows they mean the same thing. Closing that gap between characters and meaning is what vector embeddings do, and they're the quiet engine behind semantic search, RAG, recommendation systems, and most of the "AI that understands you" experiences shipped since 2023.

This is a practical guide. We'll cover what an embedding actually is, why cosine similarity is the operation you'll use constantly, and then build a real, working semantic search engine in Python — first with pure NumPy so you see the mechanics, then with the tools you'd actually reach for in production. By the end you'll have code that runs and a mental model that transfers to every embedding-powered feature you build next.

What an embedding actually is

An embedding is a list of numbers — a vector — that represents a piece of content as a point in high-dimensional space. An embedding model (a neural network trained on enormous text corpora) reads your text and outputs, say, 384 or 1,536 floating-point numbers. The magic isn't the numbers themselves; it's the property the training instills: texts with similar meaning land close together in that space, and unrelated texts land far apart.

That's the whole idea. "How do I reset my password?" and "I forgot my login credentials" produce vectors that sit near each other. "What's the weather in Cluj?" produces a vector off in a completely different region. The model has effectively turned meaning into geometry — and geometry is something a computer can measure with plain arithmetic.

A few properties worth internalizing before we write code:

Dimensionality is set by the model. A given model outputs vectors of one length by default (e.g. 384 for all-MiniLM-L6-v2, 1,536 for OpenAI's text-embedding-3-small). You can't mix vectors from different models: they live in different spaces.
The individual numbers are not interpretable. Dimension 200 doesn't mean "formality" or "topic." Meaning is distributed across all dimensions at once. Don't try to read them.
Distance is the entire point. You almost never care about a vector's absolute position — only how close it is to other vectors.

Cosine similarity: the one operation you'll use everywhere

To ask "how similar are these two texts?", you compare their vectors. The near-universal choice for text embeddings is cosine similarity: it measures the angle between two vectors, ignoring their magnitude.

Why the angle and not, say, straight-line (Euclidean) distance? Because for text embeddings, direction encodes meaning while length often encodes uninteresting things like text length or token count. Two documents about the same topic point the same way even if one is a sentence and the other a paragraph. Cosine similarity captures exactly that.

The formula is just the dot product of the two vectors divided by the product of their magnitudes:

cos(θ) = (A · B) / (‖A‖ · ‖B‖)

It returns a value from -1 to 1, though for most modern text embeddings you'll see results land in roughly the 0-to-1 range:

~1.0 — nearly identical meaning
~0.5 — loosely related
~0.0 — unrelated

That single number, computed against a corpus of stored vectors, is semantic search. Everything else is optimization.
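A tiny worked example, using hand-picked 2-D vectors in place of real embeddings (an assumption for readability), shows why the angle is the useful signal: scaling a vector changes its Euclidean distance to the original but leaves the cosine similarity at roughly 1.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0])
same_direction = np.array([10.0, 20.0])   # 10x longer, same angle
orthogonal = np.array([-2.0, 1.0])        # 90 degrees from a
opposite = np.array([-1.0, -2.0])         # 180 degrees from a

print(round(cosine(a, same_direction), 6))    # 1.0
print(round(cosine(a, orthogonal), 6))        # 0.0
print(round(cosine(a, opposite), 6))          # -1.0
print(np.linalg.norm(a - same_direction))     # Euclidean distance is large (about 20.1)
```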

Build a semantic search engine from scratch

Let's make it concrete. We'll use sentence-transformers, which runs a capable embedding model locally: no API key, and once the model weights have been downloaded on first use, no network calls, so you can run this offline after that.

pip install sentence-transformers numpy

Step 1 — Embed a corpus

from sentence_transformers import SentenceTransformer
import numpy as np

# A small, fast, widely used model. 384-dimensional output.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your account password",
    "Refund policy for annual subscriptions",
    "Setting up two-factor authentication",
    "Our office hours and contact details",
    "How to recover a locked account",
]

# Encode the whole corpus once. Shape: (5, 384)
doc_embeddings = model.encode(documents)
print(doc_embeddings.shape)  # (5, 384)

That doc_embeddings array is your search index. In a real app you compute it once, when a document is created or updated, and store it — never on every query.
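One simple way to "compute once and store" at small scale is a NumPy archive on disk. The sketch below is an assumption about how you might persist a toy index, not something the libraries above require; random vectors stand in for real model output, and `save_index` / `load_index` are hypothetical helper names.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_index(path: Path, embeddings: np.ndarray, ids: list[str]) -> None:
    """Store vectors and their document IDs together in one .npz file."""
    np.savez(path, embeddings=embeddings, ids=np.array(ids))


def load_index(path: Path) -> tuple[np.ndarray, list[str]]:
    """Load the stored vectors; no embedding model is needed for this step."""
    with np.load(path) as data:
        return data["embeddings"], data["ids"].tolist()


# Random vectors stand in for real model output (assumption for the demo).
vectors = np.random.default_rng(0).standard_normal((5, 384)).astype(np.float32)
doc_ids = [f"d{i}" for i in range(5)]

index_path = Path(tempfile.mkdtemp()) / "index.npz"
save_index(index_path, vectors, doc_ids)
loaded_vectors, loaded_ids = load_index(index_path)
print(loaded_vectors.shape, loaded_ids)
```

At query time you would load this file and embed only the incoming query.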

Step 2 — Cosine similarity in NumPy

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between vector `a` and each row of matrix `b`."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

Five lines, no dependencies beyond NumPy. This is the actual core of semantic search; the rest is plumbing.

Step 3 — Search

def search(query: str, k: int = 3):
    query_vec = model.encode(query)
    scores = cosine_similarity(query_vec, doc_embeddings)
    ranked = np.argsort(scores)[::-1][:k]  # top-k, highest first
    return [(documents[i], float(scores[i])) for i in ranked]

for doc, score in search("I forgot my login credentials"):
    print(f"{score:.2f}  {doc}")

Run it, and you'll get something like:

0.62  How to reset your account password
0.55  How to recover a locked account
0.31  Setting up two-factor authentication

Notice what happened: the query "I forgot my login credentials" shares zero words with "How to reset your account password," yet it ranked first. A keyword search would have returned nothing. That's the payoff — you matched on meaning, not on string overlap. This shift from lexical to semantic matching is the foundation every retrieval-augmented system builds on, and it's the starting point of a structured course on RAG and retrieval-augmented generation that goes from this toy index to production retrieval.

From toy to production: what changes

The NumPy version is perfect for learning and fine for a few thousand documents. Past that, three things force an upgrade.

You need a vector database

Computing cosine similarity against every stored vector on every query is O(n) — fine at 5 documents, painful at 5 million. Vector databases solve this with Approximate Nearest Neighbor (ANN) indexes (HNSW is the common one) that trade a sliver of accuracy for enormous speed, returning near-neighbors in milliseconds over huge corpora.

You have good open-source options:

pgvector — a Postgres extension. If your data already lives in Postgres, this is often the pragmatic choice: vectors and relational data in one place, one backup story.
Chroma / Qdrant / Weaviate / Milvus — purpose-built vector stores with richer filtering and scaling stories.
FAISS — a library (not a server) from Meta for fast similarity search when you want to manage the index yourself.

A minimal Chroma example shows how little the mental model changes:

import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(documents=documents, ids=[f"d{i}" for i in range(len(documents))])

results = collection.query(query_texts=["I forgot my login credentials"], n_results=3)
print(results["documents"])

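Chroma embeds the text for you (using its default embedding model) and handles the index. Same concept, friendlier ergonomics. Note that chromadb.Client() is an in-memory client, so for data that has to survive a restart you would switch to a persistent client.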

Chunking matters more than the model

You rarely embed whole documents. A 30-page PDF becomes one vector that's an average of everything and a good match for nothing. In practice you chunk documents into passages (a few hundred tokens, often with slight overlap) and embed each chunk. Get chunking wrong and even a great embedding model returns mush — which is one of the most common reasons retrieval systems quietly underperform. Chunking strategy, overlap, and metadata are exactly the unglamorous details that separate a demo from a dependable system, and they're covered in depth in a course on advanced LLM integration for production apps.
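A minimal fixed-size chunker with overlap might look like the sketch below. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption; a production chunker would typically use the embedding model's tokenizer and respect sentence or section boundaries. The default sizes are illustrative.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break   # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.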

Choosing and swapping embedding models

Embedding models differ in dimensionality, speed, cost, and language coverage — and critically, you must embed your corpus and your queries with the same model. Change the model and you re-embed everything. For multilingual apps (Romanian included), pick a model with strong multilingual training rather than an English-first one, or your non-English recall will suffer silently. Public benchmarks like the MTEB leaderboard on Hugging Face are the sane starting point for comparing models on retrieval quality rather than vibes.

Hybrid search: when semantic alone isn't enough

Here's a lesson that surprises people: pure semantic search is worse than keyword search for certain queries. Ask for an exact product code, a specific error like ERR_CONN_REFUSED, a person's name, or an acronym, and embeddings can betray you — they match on meaning, and a precise identifier has little semantic meaning to spread around. The embedding for ERR_CONN_REFUSED sits near "connection problems" generally, so a document about a different connection error can outrank the exact match.

The production answer is hybrid search: run both a keyword search (classic lexical scoring like BM25) and a semantic search, then combine the rankings. Keyword search nails exact terms, identifiers, and rare words; semantic search nails paraphrase and intent. Together they cover each other's blind spots.

The standard way to merge the two result lists is Reciprocal Rank Fusion (RRF) — a simple, robust formula that scores each document by its rank in each list rather than by raw scores that live on incompatible scales:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse multiple ranked lists of doc IDs into one score per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:            # e.g. [keyword_results, semantic_results]
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1, as in the usual RRF formula
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

Because RRF only looks at positions, you don't have to normalize BM25 scores against cosine similarities — a notorious apples-to-oranges trap. Most serious vector databases (Weaviate, Qdrant, and others) now offer hybrid search with RRF built in, precisely because "just embeddings" quietly underperforms on real, messy query logs. If you take one production lesson from this article, make it this: measure your recall on real queries, and reach for hybrid the moment exact-match queries show up.
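To see fusion at work, the sketch below feeds two hypothetical result lists, one from keyword search and one from semantic search, through a compact RRF helper that counts ranks from 1. The document IDs and both rankings are invented for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Compact RRF: each list contributes 1 / (k + rank), with ranks starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for a query that mentions ERR_CONN_REFUSED.
keyword_results = ["err-conn-refused", "login-timeouts", "vpn-setup"]
semantic_results = ["network-troubleshooting", "err-conn-refused", "login-timeouts"]

for doc_id, score in rrf([keyword_results, semantic_results]):
    print(f"{score:.4f}  {doc_id}")
```

The exact-match document wins because it ranks high in both lists, even though semantic search alone put a more general article first.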

Where embeddings show up beyond search

Semantic search is the gateway, but the same vectors power a lot more:

RAG (retrieval-augmented generation) — retrieve relevant chunks by embedding similarity, then feed them to an LLM as grounding context. Embeddings are the retrieval half; without good retrieval, generation hallucinates.
Deduplication & clustering — near-duplicate detection and topic clustering fall out of distances almost for free.
Recommendations — "items similar to this one" is a nearest-neighbor query in embedding space.
Classification — embed labeled examples, then classify new items by nearest neighbors, often without training a dedicated model.
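The classification use case can be sketched as a k-nearest-neighbor vote over embeddings. Toy 2-D vectors stand in for real embeddings here (an assumption for the demo), and the labels and `knn_classify` helper are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_classify(query: np.ndarray, examples: np.ndarray, labels: list[str], k: int = 3) -> str:
    """Majority label among the k examples most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    e = examples / np.linalg.norm(examples, axis=1, keepdims=True)
    nearest = np.argsort(e @ q)[::-1][:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy 2-D vectors stand in for real embeddings (assumption for the demo).
examples = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]])
labels = ["billing", "billing", "billing", "technical", "technical", "technical"]

print(knn_classify(np.array([0.95, 0.15]), examples, labels))  # billing
print(knn_classify(np.array([0.10, 0.70]), examples, labels))  # technical
```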

The through-line: any time you need "similar in meaning" rather than "matches exactly," embeddings are the tool. Building these features end to end — API to product, with the retrieval and orchestration wired up properly — is the spine of a hands-on course on building AI applications in Python with the OpenAI and Anthropic SDKs.

Common mistakes that cost you hours

A few traps that catch almost everyone the first time:

Re-embedding the corpus on every query. Embed documents once, store the vectors, embed only the incoming query at search time.
Mixing models. Query vectors and document vectors must come from the same embedding model. A silent mismatch produces garbage rankings with no error.
Forgetting to normalize. If you compute raw dot products instead of cosine similarity (and your vectors aren't already unit-normalized), vectors with larger magnitude get an unfair boost. Normalize, or use a library that does.
Embedding documents that are too large. One vector per giant document averages meaning into uselessness. Chunk first.
Trusting a single similarity threshold forever. The "good enough" cutoff depends on your model and data. Measure it on real queries; don't hardcode 0.8 because a blog post said so.

Conclusion

Vector embeddings are one of those ideas that feels like magic until you see the mechanics — and then it's just geometry. Text becomes a point in space, meaning becomes distance, and search becomes "find the nearest points." You built exactly that in a few lines of Python, from raw NumPy cosine similarity to a Chroma-backed index, and the same core idea scales from a toy corpus to millions of documents behind an ANN index.

Start where we started: embed a handful of your own documents, run a query that shares no keywords with the right answer, and watch it surface anyway. Once that clicks, RAG, recommendations, and semantic features stop looking like separate topics and start looking like one tool applied five ways. If you want the structured path from here — retrieval, chunking, evaluation, and production wiring — Cursuri-AI.ro builds it step by step.

Sources & further reading:

Sentence-Transformers — Official documentation
Hugging Face — MTEB: Massive Text Embedding Benchmark leaderboard
pgvector — Open-source vector similarity search for Postgres
Chroma — Open-source embedding database

This article is educational content. Model names, dimensions, and library APIs evolve; verify current details in the official documentation before building production systems.

Context Engineering for AI Agents: Beyond Prompt Engineering

galian — Thu, 09 Jul 2026 13:08:24 +0000

You wrote a great prompt. It worked beautifully in the playground — one question, one clean answer. Then you wired the same model into an agent that runs twenty steps, calls six tools, and reads back their output, and somewhere around step twelve it started forgetting the goal, calling the wrong tool, or confidently acting on something it misread three steps ago. The prompt didn't get worse. The context did.

This is the gap that context engineering fills. Prompt engineering is about writing one good instruction. Context engineering is about managing the entire set of tokens a model sees at inference time — across a long, multi-step run — so the signal stays high and the model keeps making good decisions. Anthropic frames it as the natural progression of prompt engineering, and if you're building anything agentic in 2026, it's the discipline that separates a demo from a system.

What context engineering actually is

Start with a precise definition. Context is the full set of tokens you include when you sample from a large language model. Not just your prompt — the system instructions, the tool definitions, the examples, the running message history, the retrieved documents, the tool results fed back in. Everything in the window.

Context engineering is the set of strategies for curating and maintaining the optimal set of those tokens during inference. The goal, in one line: find the smallest set of high-signal tokens that reliably produces the outcome you want.

The reason this is a distinct discipline from prompt engineering is the shape of the problem. A prompt is something you write once and it stays put. Context in an agent is dynamic — it grows on every turn as the model reads files, calls tools, and accumulates history. You're not authoring a static string anymore; you're managing a budget that fills up on its own, and deciding continuously what earns a place in it and what gets thrown out. That's an engineering problem, and it's the foundation the whole prompt-to-production journey builds on.

Why "just add more context" is the wrong instinct

The intuitive move, when an agent makes a mistake, is to give it more: more instructions, more examples, more history, more retrieved documents. Sometimes that helps. Very often it makes things worse, and here's why.

A model's effective attention is a finite resource. Every token you add competes with every other token for the model's limited ability to attend to what matters. Past a certain point, adding context doesn't add capability — it dilutes it. The relevant fact is now buried among ten irrelevant ones, and the model attends to the wrong thing.

This shows up empirically. The "lost in the middle" effect, documented by Liu et al., is that models use information most reliably when it sits at the start or end of a long context, and least reliably when it's stuck in the middle. As context grows, retrieval of any single fact inside it gets less reliable, a degradation sometimes called context rot. A 200K-token window does not mean you should put 200K tokens in it. Capacity is not the same as attention.

So the mental model to internalize: context is a budget, not a bucket. You're not trying to fill it. You're trying to spend it on the highest-signal tokens available and refuse everything else. Every technique below is a way to enforce that discipline.

The four things competing for your window

Before the techniques, know your spenders. In a running agent, four categories of tokens fight for the same finite budget:

The system prompt — your instructions, role, constraints. Usually small, high-value, and stable.
Tool definitions — the schemas describing every tool the agent can call. These are sneakily expensive: each tool's description sits in context on every turn, whether or not it's used.
Message history — the accumulating transcript of the conversation and the agent's own steps. This is the one that grows without bound and quietly eats the window.
Retrieved / external data — documents, search results, file contents, database rows pulled in to ground the model.

The overall guidance from Anthropic is worth memorizing: keep each of these informative yet tight. Not empty — an under-specified system prompt or a missing tool leaves the model guessing. But not bloated either. The art is the calibration, and it's different for each category. Let's work through the techniques that manage them.
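Before tuning any one category, it helps to see where the budget is going. The sketch below tallies a hypothetical context snapshot by category; the 4-characters-per-token estimate is a crude assumption (use your provider's tokenizer for real numbers), and the snapshot contents are invented.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def budget_report(context: dict[str, list[str]], window: int) -> dict[str, float]:
    """Share of the window used by each category of context."""
    return {
        name: sum(estimate_tokens(part) for part in parts) / window
        for name, parts in context.items()
    }


# Hypothetical snapshot of an agent's context.
context = {
    "system_prompt": ["You are a code review agent. " * 10],
    "tool_definitions": ['{"name": "read_file", "description": "..."}' * 20],
    "message_history": ["tool result: " + "x" * 4000] * 12,
    "retrieved_data": ["def handler(): ..." * 100],
}

for name, share in budget_report(context, window=20_000).items():
    print(f"{name:18s} {share:6.1%}")
```

In this made-up snapshot the message history dominates, which is the pattern the compaction technique below is meant to address.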

Technique 1: Compaction — summarize the history before it drowns you

Message history is the runaway spender. A long agent run accumulates hundreds of turns of "called tool, got 4KB of JSON back, reasoned about it, called the next tool." Most of those raw tool outputs are dead weight three steps later — you needed the conclusion, not the 4KB.

Compaction is the fix: periodically replace a chunk of verbose history with a tight summary that preserves the decisions, the key facts, and the current state, while dropping the raw noise. When the agent has finished investigating something, you don't need the full transcript of the investigation in context — you need "here's what I found and what it means for the task."

Practical rules:

Compact at natural boundaries, not mid-reasoning — after a sub-task completes, when a phase ends.
Preserve the load-bearing details: open questions, decisions made, constraints discovered, current state. Drop the intermediate chatter and the raw dumps.
Keep the goal pinned. The single most common long-run failure is the agent losing the plot on the original objective. The goal should survive every compaction.

Done well, compaction is what lets an agent run for hundreds of steps without either overflowing its window or forgetting why it started.
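A minimal compaction step, following the rules above, might look like this. It assumes the goal is the first message in the history and that a `summarize` callable exists; here that callable is a stub, whereas a real one would ask a model to extract decisions, facts, and current state. The message shape and `keep_recent` default are illustrative.

```python
from typing import Callable

Message = dict[str, str]   # {"role": ..., "content": ...}


def compact(
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older turns with one summary; keep the goal and the latest turns verbatim."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= keep_recent + 1:
        return history                      # nothing worth compacting yet
    goal, old, recent = history[0], history[1:-keep_recent], history[-keep_recent:]
    summary = {"role": "user", "content": "Summary of earlier work: " + summarize(old)}
    return [goal, summary, *recent]


def fake_summarize(messages: list[Message]) -> str:
    """Stub; a real implementation would ask a model for decisions, facts, and state."""
    return f"{len(messages)} earlier steps condensed."


history = [{"role": "user", "content": "GOAL: fix the token refresh bug"}]
history += [{"role": "assistant", "content": f"step {i}: large tool output"} for i in range(10)]

compacted = compact(history, fake_summarize)
print(len(history), "->", len(compacted))  # 11 -> 6
```

The goal message is passed through untouched on every call, so it survives any number of compactions.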

Technique 2: External memory — let the agent write things down

The window is not the only place to keep information. The most effective long-running agents treat the context window as working memory and offload durable state to external memory — a file, a scratchpad, a structured store the agent reads from and writes to deliberately.

Instead of carrying every fact in-context forever, the agent writes a note ("the auth module uses JWT with 15-minute expiry; the bug is in the refresh path") to a persistent store, and pulls it back only when relevant. The context window stays lean; the knowledge doesn't get lost. This is exactly how a human engineer works — you don't hold the entire codebase in your head, you keep notes and open the file when you need it.

This pattern — persistent, deliberately-managed memory outside the window — is the core of building agents that hold state over long horizons, and it's what turns a stateless model into something that can work a problem across a session without drowning.
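As a sketch of the idea, the class below is a tiny file-backed scratchpad an agent could expose as two tools, one to write a note and one to read it back. The `NoteStore` name, the JSON file layout, and the method names are hypothetical; a real system might use a database or a structured memory service instead.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class NoteStore:
    """Tiny file-backed scratchpad: notes live on disk, not in the context window."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if not path.exists():
            path.write_text("{}", encoding="utf-8")

    def _load(self) -> dict:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, topic: str, note: str) -> None:
        notes = self._load()
        notes[topic] = note
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read(self, topic: str) -> Optional[str]:
        return self._load().get(topic)

    def topics(self) -> list[str]:
        """Cheap to keep in context: just the keys, not the note bodies."""
        return sorted(self._load())


store = NoteStore(Path(tempfile.mkdtemp()) / "notes.json")
store.write("auth", "JWT with 15-minute expiry; the bug is in the refresh path")
print(store.topics())      # ['auth']
print(store.read("auth"))
```

Only the list of topics needs to stay in the window; the note body is pulled back in when the agent decides it is relevant.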

Technique 3: Sub-agents — isolate context so the main thread stays clean

Here's a structural technique that most people never turn on. When a sub-task is going to burn a lot of tokens — "find every call site of this function across the repo," "research these five libraries" — don't do it in the main agent's context. Delegate it to a sub-agent: a separate agent instance with its own isolated window that does the noisy work and reports back a clean result.

The win is context hygiene. The thousands of tokens of file contents and search output that the investigation churns through stay in the sub-agent's context and die with it. The main agent gets back a two-paragraph summary, and its own window stays focused on the actual task instead of silting up with intermediate noise. As a bonus, independent sub-tasks run in parallel instead of serially.

The rule of thumb: delegate when work is independent, parallelizable, or context-heavy; keep it inline when it's sequential and cheap. Knowing when to fan out to sub-agents and how to orchestrate them without stepping on each other is a central skill in building AI agents and automation, and it's one of the highest-leverage context moves you have.
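A minimal sketch of the fan-out pattern, assuming a hypothetical run_subagent function that stands in for a real sub-agent call. The stub fabricates its own "noisy" scratch data so the example runs without any model; the point is only that the scratch data never reaches the main context.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Hypothetical sub-agent: does noisy work in its own scratch context."""
    scratch = [f"raw search output #{i} for {task!r}" for i in range(1000)]
    # Only this short report leaves the function; `scratch` is discarded.
    return {"task": task, "summary": f"{task}: reviewed {len(scratch)} results"}


def delegate(tasks: list[str]) -> list[dict]:
    # Independent tasks fan out in parallel; results come back in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subagent, tasks))


main_context = ["GOAL: pick an HTTP client library"]
for report in delegate(["research httpx", "research aiohttp", "research requests"]):
    main_context.append(report["summary"])

print("\n".join(main_context))
```

The main context grows by one line per sub-task, regardless of how much intermediate output each sub-agent produced.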

Technique 4: Just-in-time retrieval — pull data when needed, not upfront

There are two ways to get external data into an agent. The naive way is to preload: at the start, dump everything the agent might conceivably need into context — the whole document, all the schemas, every config file. The problem is obvious once you see it: you're spending your budget on maybes, and most of it goes unused while crowding out what matters.

The better pattern is just-in-time retrieval: give the agent the ability to fetch data (a search tool, a file reader, a database query) and let it pull exactly what it needs, exactly when it needs it. Instead of "here are all 40 files," it's "here's a tool to read a file — go get the one you need." The agent loads the relevant chunk into context at the moment of use, acts on it, and (with compaction) lets it fall away afterward.

This mirrors how retrieval-augmented systems already work, and getting the retrieval layer right — what to fetch, how to rank it, how much to bring back — is where advanced LLM integration earns its keep. Preloading feels safer; just-in-time is what actually scales.
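To make the difference concrete, the sketch below compares the two approaches on a throwaway 40-file directory. The list_files and read_file helpers are hypothetical tool functions written for this example; character counts are used as a crude stand-in for tokens.

```python
import tempfile
from pathlib import Path

# Build a throwaway 40-file "repo" so the sketch runs anywhere.
repo = Path(tempfile.mkdtemp())
for i in range(40):
    (repo / f"module_{i:02d}.py").write_text(f"# module {i}\n" + "x = 1\n" * 200)


def list_files(root: Path) -> list[str]:
    """Cheap index: file names only, no contents."""
    return sorted(p.name for p in root.iterdir())


def read_file(root: Path, name: str, max_chars: int = 2000) -> str:
    """The tool the agent calls at the moment of use; output is capped."""
    return (root / name).read_text()[:max_chars]


# Preloading: every file goes into context up front.
preloaded = "\n".join(read_file(repo, n) for n in list_files(repo))

# Just-in-time: context holds the index, then the one file the agent asked for.
context = ["FILES: " + ", ".join(list_files(repo))]
context.append(read_file(repo, "module_07.py"))
just_in_time = "\n".join(context)

print(f"preloaded: {len(preloaded)} chars, just-in-time: {len(just_in_time)} chars")
```

The max_chars cap is one simple answer to "how much to bring back"; a real retrieval layer would also need ranking and path validation.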

Technique 5: Tool curation — the failure mode hiding in plain sight

Remember that tool definitions sit in context on every turn. That makes a bloated tool set a double tax: it burns tokens continuously, and it degrades decisions. Anthropic calls out one of the most common failure modes directly — tool sets that cover too much functionality or create ambiguous, overlapping choices about which tool to use.

The tell is a sharp one: if a human engineer can't say for certain which tool should be used in a given situation, an AI agent can't either. Fifteen tools with fuzzy, overlapping responsibilities will produce worse behavior than six sharp, non-overlapping ones — and cost more tokens doing it.

So curate the toolbox like you'd curate an API:

Fewer, sharper tools. Each with a clear, distinct job and an unambiguous "use this when…"
No overlap. Two tools that could both plausibly handle the same request create a decision point where the agent will sometimes pick wrong.
Prune ruthlessly. A tool that's rarely the right choice is paying rent in your context on every single turn. Cut it.

Tool curation is the least glamorous technique here and often the highest-ROI. It's pure subtraction, and subtraction is exactly what context engineering rewards.
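A toolbox audit can start very simply. The sketch below estimates the per-turn cost of a set of tool definitions and flags pairs whose descriptions look alike. The tool list is invented for this example, the four-characters-per-token figure is a rough heuristic rather than a real tokenizer, and word overlap is only a crude proxy for ambiguity, so treat the output as a list of things for a human to review.

```python
import json
from itertools import combinations

# Hypothetical tool definitions, in the JSON-like shape most tool-calling APIs accept.
TOOLS = [
    {"name": "search_code", "description": "Search the repository for a text pattern"},
    {"name": "grep_repo", "description": "Search the repository for a regex pattern"},
    {"name": "read_file", "description": "Read one file by path"},
    {"name": "run_tests", "description": "Run the test suite and report failures"},
]


def approx_tokens(obj) -> int:
    # Rough heuristic (about 4 characters per token); use a real tokenizer for budgeting.
    return len(json.dumps(obj)) // 4


def overlap(a: dict, b: dict) -> float:
    """Jaccard similarity of description words: a crude proxy for ambiguity."""
    wa = set(a["description"].lower().split())
    wb = set(b["description"].lower().split())
    return len(wa & wb) / len(wa | wb)


per_turn = sum(approx_tokens(t) for t in TOOLS)
print(f"tool definitions cost roughly {per_turn} tokens on every turn")

suspects = [
    (a["name"], b["name"])
    for a, b in combinations(TOOLS, 2)
    if overlap(a, b) >= 0.5
]
print("possible overlaps:", suspects)
```

Here search_code and grep_repo are flagged: a human could not say for certain which one to use, so one of them should probably go.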

How you know it's working: measure it

Every technique above is a change to your context, and changes to context are exactly the kind of thing that feels better while being worse — or vice versa. If your only signal is "I ran a few queries and it seemed fine," you're tuning blind, and you'll ship a regression the day you compact one turn too aggressively and the agent starts forgetting a constraint.

The fix is the same as anywhere in production ML: an evaluation set. Assemble representative tasks with known good outcomes, and re-run them every time you change how context is managed. Then you can say "compaction at phase boundaries held task success at 0.9 while cutting average tokens 40%" instead of "I think it's better now." Treating evaluation as first-class rather than an afterthought is what turns context engineering from a craft into engineering — a number that moves when you change something, not a vibe.
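A minimal harness for that comparison might look like the sketch below. EvalCase, RunResult, evaluate, and both agent functions are hypothetical names for this example; the two agents are stubs with hard-coded answers and token counts, standing in for the same real agent run under two context-management strategies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str      # known-good outcome for this task


@dataclass
class RunResult:
    answer: str
    tokens_used: int


def evaluate(agent: Callable[[str], RunResult], cases: list[EvalCase]) -> dict:
    results = [agent(c.prompt) for c in cases]
    passed = sum(r.answer == c.expected for r, c in zip(results, cases))
    return {
        "success_rate": passed / len(cases),
        "avg_tokens": sum(r.tokens_used for r in results) / len(cases),
    }


CASES = [EvalCase("2+2", "4"), EvalCase("3*3", "9"), EvalCase("10-7", "3")]
ANSWERS = {"2+2": "4", "3*3": "9", "10-7": "3"}


# Stubs: in practice these would run the real agent with different context settings.
def agent_full_history(prompt: str) -> RunResult:
    return RunResult(answer=ANSWERS[prompt], tokens_used=1000)


def agent_with_compaction(prompt: str) -> RunResult:
    return RunResult(answer=ANSWERS[prompt], tokens_used=600)


baseline = evaluate(agent_full_history, CASES)
candidate = evaluate(agent_with_compaction, CASES)
token_cut = 1 - candidate["avg_tokens"] / baseline["avg_tokens"]
print(f"success {baseline['success_rate']:.2f} -> {candidate['success_rate']:.2f}, "
      f"tokens cut by {token_cut:.0%}")
```

Exact string matching is the simplest possible check; real tasks usually need a task-specific grader, but the loop around it stays the same.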

Putting it together

Context engineering isn't a framework you install; it's a posture you adopt toward the model's window. The whole discipline collapses to one principle applied relentlessly: spend the finite budget on the smallest set of high-signal tokens that does the job.

In practice, for a real agent, that means:

Compact the history at natural boundaries so a long run doesn't drown in its own transcript.
Offload durable state to external memory instead of carrying it forever.
Delegate noisy, independent work to sub-agents so the main window stays clean.
Retrieve just-in-time instead of preloading everything you might need.
Curate the tools hard — fewer, sharper, no overlap.
Measure with evals so every change is verified, not hoped.

None of these are exotic, and that's the point. The model is already capable. What makes the capability hold up over a twenty-step run isn't a cleverer prompt — it's disciplined management of everything the model reads along the way.

Conclusion

Prompt engineering taught us to write one good instruction. Context engineering is what you need the moment that instruction has to survive a long, tool-using, self-accumulating agent run — which is to say, the moment you build anything real. The failure you saw at step twelve was never the model getting dumber. It was the context getting noisier, and no one curating it.

Adopt the budget mindset, apply the five techniques, and put an eval set behind every change. Do that, and the same model that fell apart at step twelve will run to step fifty and still know exactly what it's doing — because you engineered what it was looking at the whole way.

The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with hands-on, current tracks on context engineering, AI agents, LLM integration, and evaluating AI systems in production.

Sources & further reading:

Anthropic — Effective context engineering for AI agents (definition, the four context components, tool-set failure modes, compaction and memory)
Anthropic — Effective harnesses for long-running agents
Liu et al. — Lost in the Middle: How Language Models Use Long Contexts

This article is educational content. Model behavior, context limits, and tooling evolve quickly; validate approaches against your own workloads and current official documentation.

Run LLMs Locally with Ollama in 2026: The Practical Developer Guide

galian — Sun, 05 Jul 2026 21:11:24 +0000

For years, "run the model locally" was the option you mentioned and then didn't take: the models were too weak, the tooling too fiddly, and the cloud APIs too convenient. In 2026 that calculus has genuinely shifted. Open-weight models in the 12–35B range now handle real coding and agent workloads, Apple Silicon got a dedicated inference engine, and Ollama quietly became a drop-in backend for the same tools you already use against cloud APIs — including Claude Code.

I teach practical AI engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, and local inference has gone from a curiosity module to one of the questions companies ask us most — usually spelled "how do we use LLMs without sending our data anywhere?" This guide is the answer I give developers: what changed, what hardware you actually need, which models are worth pulling, and how to plug it all into a real workflow.

As always with this space: versions and model rankings move monthly. Everything below is verified against Ollama's official blog and docs as of early July 2026 — re-check before you build a budget or an architecture on it.

Why local, and why now

Three arguments have survived contact with production; the rest is mostly vibes.

Privacy and data residency. With a local model, prompts and outputs never leave your machine (or your VPC, if you self-host on a server). For anyone dealing with client data, medical text, legal documents, or GDPR-sensitive workloads, this eliminates the entire "what does the provider do with my data" conversation instead of managing it through contracts. This is the single biggest adoption driver we see in Europe, and it's the backbone of our course on local LLMs, self-hosting, and privacy.

Cost shape. Cloud APIs bill per token; local inference bills you once, in hardware you may already own. For high-volume, latency-tolerant workloads — batch classification, summarization pipelines, internal tooling — a mid-range GPU that's already on someone's desk can absorb work that would otherwise be a real monthly line item. (For low-volume or frontier-quality work, cloud still wins. More on that below.)

No external dependency. A local model doesn't get deprecated, rate-limited, price-changed, or suspended out from under you. After the model-availability surprises of the last year, "at least one workload runs on weights we control" has become a reasonable line item in a resilience plan, not paranoia.

What actually changed in Ollama in 2026

If you last touched Ollama when it was "a nice wrapper around llama.cpp," the 2026 releases are the reason to look again. All of this is from Ollama's official blog:

Anthropic API compatibility (January, v0.14.0). Ollama now exposes a native Anthropic-style /v1/messages endpoint. This is the sleeper feature of the year: Anthropic-native tools — most notably Claude Code — can talk to a local model directly, with no proxy or translation layer. There's a matching OpenAI-compatible endpoint too, so Codex and OpenAI-SDK apps work the same way.
ollama launch (January). A single command that configures and starts a coding agent against a local model — ollama launch claude sets up Claude Code, prompts you to pick a model, and you're in.
Experimental image generation (January). Early days, but the scope of "local model" is no longer text-only.
MLX engine on Apple Silicon (March preview → June release). Ollama moved its Mac inference path to Apple's MLX framework, which exploits unified memory. Ollama's own framing for the June release: its highest performance on Apple Silicon yet — faster output with reduced memory usage.
Ollama 0.30 and 0.31 (June). Version 0.30 brought improved performance and broader GGUF model compatibility through llama.cpp; 0.31 made Gemma 4 significantly faster on Apple Silicon via multi-token prediction (MTP), enabled by default.

The theme is clear: Ollama is positioning itself less as a hobbyist toy and more as the standard local backend for agentic tooling.

Getting started in five minutes

Install Ollama. macOS and Windows have installers at ollama.com/download; on Linux, run:

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model:

ollama pull gemma4
ollama run gemma4

That's a working local chat. Ollama also starts a local server on port 11434, which is where the interesting part begins — every API-based tool you have can point at it.

Useful daily commands: ollama ls (installed models), ollama ps (what's loaded and where — CPU vs GPU), ollama rm <model> (free disk space; models are multi-gigabyte).

Hardware: the honest sizing guide

The rule of thumb that matters: a model quantized to 4 bits needs very roughly 0.5–0.7 GB of memory per billion parameters, plus overhead for context. Everything else follows from that.

Your hardware	What runs comfortably	Experience
8 GB RAM, no GPU	3–8B models, quantized	Fine for chat, drafting, classification. Slow but usable on CPU
16 GB RAM (Apple Silicon)	8–14B models	Good daily-driver territory; MLX made this tier notably faster in 2026
24 GB+ (M-series Pro/Max or a 24 GB GPU)	27–35B models	Where local coding models get genuinely useful
48 GB+ unified memory / multi-GPU	Large MoE models	Server-class local inference

Two nuances that save people disappointment:

Quantization is why any of this works. Models ship in compressed 4–8 bit variants (the GGUF ecosystem) that trade a small quality loss for a 2–4× memory reduction. Ollama's default tags are already quantized — you rarely need to think about it, but it explains why a "27B model" fits in 24 GB.
Mixture-of-experts (MoE) models need memory for their total parameters but compute like their active subset. NVIDIA's Nemotron-3-Super, for example, is a 120B model with 12B active parameters: it runs faster than its size suggests, but you still need the RAM to hold it.

Context length eats memory too — an agent session with 32K+ tokens of context adds real overhead on top of the weights. If you're sizing for coding agents, budget for that, not just the model file.
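The sizing rule is easy to turn into a back-of-the-envelope calculator. The sketch below applies the 0.5 to 0.7 GB per billion parameters figure from above; the function names and the 2 GB default for context overhead are assumptions made for this example, not measured values.

```python
def weight_memory_gb(params_billion: float) -> tuple[float, float]:
    """(low, high) estimate in GB for 4-bit weights: 0.5 to 0.7 GB per billion parameters."""
    return params_billion * 0.5, params_billion * 0.7


def sizing(params_billion: float, memory_gb: float, context_overhead_gb: float = 2.0) -> str:
    # context_overhead_gb is an assumed placeholder, not a measured value: real
    # overhead depends on the model architecture and the context length.
    # For MoE models pass the TOTAL parameter count, not the active subset.
    low, high = weight_memory_gb(params_billion)
    if high + context_overhead_gb <= memory_gb:
        return "fits"
    if low + context_overhead_gb <= memory_gb:
        return "tight"
    return "too big"


for params, mem in [(12, 16), (27, 24), (35, 24)]:
    low, high = weight_memory_gb(params)
    print(f"{params}B on {mem} GB: weights ~{low:.1f}-{high:.1f} GB -> {sizing(params, mem)}")
```

A "tight" result means the model may load but leaves little room for long agent contexts, which is exactly the case to test on real hardware before committing.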

The mid-2026 open-weight lineup worth knowing

Rankings churn monthly, so treat this as a map, not a leaderboard. From Ollama's model library, the families that matter right now:

Gemma 4 (12B–31B) — Google's open family, currently the most-pulled model on Ollama. Multimodal, tuned for reasoning and agentic work, and the main beneficiary of the MLX/MTP speedups on Macs.
Qwen3.5 / Qwen3.6 (0.8B–122B) — the ecosystem's Swiss army knife. Qwen3.5 spans everything from edge-tiny to server-large; Qwen3.6 (27B–35B) focuses on agentic coding. qwen3-coder is Ollama's own recommendation for coding-agent use.
GLM-5 family — flagship-class open weights (GLM-5 is 744B total / 40B active); strong at coding and long-horizon tasks. Too big for most desktops locally, but available as :cloud variants (see below) and self-hostable on serious hardware.
Nemotron-3-Super (120B MoE, 12B active) — NVIDIA's entry, aimed at multi-agent applications.
MiniMax-M3 — notable for a 1M-token context window, if your workload is long-document analysis.
Specialists: GLM-OCR for document understanding, TranslateGemma (4B–27B, 55 languages) for translation, LFM2 (24B) for on-device deployment, Ornith (9B–35B) for agentic coding.

Sensible defaults: on a 16 GB Mac, start with gemma4:12b. On 24 GB+, try qwen3-coder for code and gemma4:27b for general work. Then run your tasks on them — a model's rank on someone's benchmark tells you little about your use case.

The part that changes your workflow: Ollama as a drop-in API

Ollama's server speaks both major API dialects, which means "switch to a local model" is now a base-URL change, not a rewrite.

OpenAI-compatible (/v1) — any OpenAI-SDK app works:

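from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server instead of the cloud API.
# The SDK requires an api_key value; Ollama does not check it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Same chat-completions call as against the cloud API; only the model name changes.
resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
)
print(resp.choices[0].message.content)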

Anthropic-compatible (/v1/messages) — and this is the one with teeth, because it means Claude Code runs against local models. Per Ollama's official docs:

export ANTHROPIC_AUTH_TOKEN=ollama       # accepted but not validated
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model qwen3-coder

Or let Ollama do the wiring for you:

ollama launch claude

Honest caveats before you get excited:

Ollama recommends at least 32K tokens of context for Claude Code. Its model suggestions for coding are qwen3-coder locally (30B, so you want 24 GB+ of VRAM/unified memory) or glm-4.7:cloud / minimax-m2.1:cloud via Ollama's cloud, which keeps the same API surface but runs the weights remotely.
The compatibility layer doesn't cover everything: no prompt caching, no token-counting endpoint, no forced tool selection, no batches API, no PDF inputs (images must be base64). If your workflow leans on those, you'll feel it.
A 30B local model is not Opus, and it isn't trying to be. It's "capable pair of hands on an airplane / on confidential code," not "frontier reasoning."

The pattern that actually works in practice is routing: local models for the private, high-volume, or offline work; frontier cloud models for the hard reasoning. Deciding which tier a task belongs to — and building the escalation path — is an architecture skill, and it's exactly the kind of decision we drill in our AI system architecture course.
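The routing decision itself can be very small. The sketch below is one possible policy, not a recommended one: the Task fields, the tier names, and the local_ok quality check are all hypothetical, and the key design choice it illustrates is that the privacy flag overrides everything else, including escalation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    contains_private_data: bool = False
    needs_frontier_reasoning: bool = False


LOCAL = "local"    # e.g. an open-weight model served by Ollama
CLOUD = "cloud"    # e.g. a frontier model behind a hosted API


def route(task: Task) -> str:
    # Privacy wins over quality: private data never leaves the machine.
    if task.contains_private_data:
        return LOCAL
    if task.needs_frontier_reasoning:
        return CLOUD
    return LOCAL   # default to the cheap tier for routine work


def run_with_escalation(task: Task, local_ok: Callable[[Task], bool]) -> str:
    """Return the tier that ends up handling the task.

    local_ok is a hypothetical quality check on the local attempt,
    for example an eval-style grader.
    """
    if route(task) == CLOUD:
        return CLOUD
    if local_ok(task) or task.contains_private_data:
        return LOCAL   # private work stays local even if quality is poor
    return CLOUD       # escalate a failed, non-private local attempt


print(route(Task("summarize this client contract", contains_private_data=True)))
print(route(Task("design a migration plan", needs_frontier_reasoning=True)))
print(run_with_escalation(Task("classify this ticket"), local_ok=lambda t: False))
```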

When local is the wrong choice

Being a fan of local inference means knowing where it loses:

Frontier-quality reasoning. For the hardest tasks, top cloud models remain clearly ahead of anything you can run on a workstation. If wrong answers are expensive, don't fight this.
Low-volume workloads. If you make a few thousand API calls a month, per-token billing is cheaper than any GPU. Local pays off at volume, at privacy constraints, or at both.
Ops you don't want. A self-hosted model is a service you now run: updates, monitoring, capacity. ollama run on a laptop is trivial; a team-wide inference server is real infrastructure.
Multimodal breadth and long-tail capabilities. Cloud APIs still bundle more (native PDF understanding, larger tool ecosystems, batch APIs) than the local stack replicates.

One more thing people conflate: running a model locally is different from customizing one. If your actual goal is a model that speaks your domain language or follows your house style, that's a fine-tuning question — LoRA adapters on an open-weight base, then serving the result through Ollama. That pipeline (when to fine-tune vs when to just engineer the prompt) is its own discipline, covered in our fine-tuning course.

Frequently asked questions

Is Ollama free?

The tool itself is open source and free. The models each carry their own licenses — Gemma, Qwen, GLM and friends have different terms, some with restrictions on commercial use. Check the license tab on the model's Ollama page before you ship a product on it. Ollama's optional cloud models are a paid service.

What hardware do I need to run LLMs locally in 2026?

As a rule of thumb at 4-bit quantization: 8 GB of RAM runs 3–8B models, 16 GB runs 8–14B comfortably (especially on Apple Silicon with the MLX engine), and 24 GB+ opens up the 27–35B class where local coding models get genuinely useful. More context = more memory on top of the weights.

Can a local model replace GPT or Claude?

For a growing set of tasks — summarization, classification, drafting, routine coding on mid-size codebases — yes, credibly. For frontier reasoning and the highest-stakes accuracy, no. Production teams typically route: local for private/high-volume work, cloud for the hard 10%.

Can I really use Claude Code with Ollama?

Yes. Since Ollama v0.14.0 (January 2026) there's native Anthropic Messages API compatibility: set ANTHROPIC_BASE_URL=http://localhost:11434, run claude --model qwen3-coder — or just ollama launch claude. Expect a capable assistant, not Opus-level reasoning, and note that prompt caching and a few other API features aren't supported through the compatibility layer.

Ollama vs llama.cpp vs vLLM — which should I use?

Ollama for developer experience: one command, model management, dual API compatibility. llama.cpp (which powers Ollama's GGUF path) for maximum control and minimal footprint. vLLM for high-throughput multi-user serving on server GPUs. Most developers should start with Ollama and only move down the stack when they hit a concrete limit.

The skill underneath the tool

Here's the uncomfortable part: pulling a model is the easy 5%. The value shows up when you can answer the questions around it — which model for which task, how to measure whether the local model is good enough for your workload instead of guessing, how to build the routing and fallback so privacy-sensitive work stays local while hard problems escalate to a frontier model. That's engineering judgment, not tooling trivia.

That judgment is what we teach at our AI education platform — hands-on courses built around real repositories and an interactive AI instructor, covering the full local-to-cloud spectrum: self-hosting and privacy, fine-tuning, architecture, and the agentic workflow on top.

Conclusion

In 2026, local LLMs crossed the line from hobby to infrastructure option. Ollama's dual API compatibility means your existing tools — including Claude Code — can run against open weights with a base-URL change; the MLX engine made a 16 GB MacBook a legitimate inference machine; and the open-weight lineup in the 12–35B range is good enough for a real slice of production work.

The play isn't "cancel your API keys." It's knowing which slice of your workload belongs on weights you control — then running it there deliberately, measured, with an escalation path for everything else. Start with ollama run gemma4 tonight; you're one evening away from having an informed opinion.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from local LLMs and self-hosting to agentic coding, evals, and AI system architecture.

Sources: Ollama Blog · Anthropic API compatibility — Ollama Docs · Claude Code with Anthropic API compatibility — Ollama · Ollama Model Library · Download Ollama

Cursor vs GitHub Copilot vs Claude Code: Which AI Coding Tool in 2026?

galian — Mon, 29 Jun 2026 14:59:29 +0000

If you write code for a living in 2026, you're not asking whether to use an AI coding tool — you're asking which one. And the three names that dominate every team's Slack debate are Cursor, GitHub Copilot, and Claude Code. They look similar from a distance (type intent, get code) but they're built on three genuinely different bets about how software gets written.

I've spent serious time in all three on real, multi-file, multi-repo work — not toy demos — and this is the comparison I wish someone had handed me before I burned a month figuring it out. I write and teach about agentic engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, so I'll keep this grounded in how these tools actually behave in production, not in launch-day marketing.

A note before we start: pricing and features in this category change almost monthly. Everything below is a mid-2026 snapshot — verify the current numbers on each tool's official page before you budget for a team.

TL;DR — three different philosophies

Here's the one-sentence version of each, before we go deep:

Cursor is an AI-native editor — it rebuilt the IDE around the agent. Best for developers who want fast, fluid, in-the-flow generation with deep editor integration.
GitHub Copilot is the ecosystem play — it lives where your code, issues, and PRs already are. Best for teams standardized on GitHub who want AI woven through the whole SDLC.
Claude Code is the terminal-first agent — it treats the command line as the primary surface and excels at autonomous, multi-step, multi-file work. Best for engineers comfortable orchestrating agents rather than babysitting autocomplete.

None of them is "the best." They optimize for different moments, and the real skill is knowing which to reach for. Let's break down why.

What is Cursor?

Cursor is an AI-native IDE built as a fork of VS Code, so the editor feels instantly familiar — your extensions, keybindings, and themes mostly carry over. What's different is that the AI isn't bolted on as a plugin; the whole editing experience is designed around it.

Its signature features:

Tab completion — a multi-line, context-aware autocomplete that predicts your next edit, not just the next token. It's the feature people miss most when they switch away.
Composer — Cursor's agentic, multi-file editing mode. You describe a change in natural language and it edits across files, runs commands, and iterates. Cursor now ships Composer 2.5, its own model trained specifically for agentic coding, alongside routing to frontier models from Anthropic, OpenAI, and Google.
Cloud Agents — introduced in the Cursor 3.5 release (May 20, 2026), these run in isolated cloud VMs with terminal and browser access, can work across multiple repos in parallel, and report results back to your IDE asynchronously. It's Cursor's answer to "I want the agent working while I do something else."

Cursor's center of gravity is in-the-flow coding: you stay in the editor, you see every diff, and the AI keeps pace with your thinking. It rewards developers who want speed without giving up granular control over the code.

What is GitHub Copilot?

Copilot is the most widely deployed of the three, and its biggest advantage is gravitational: it lives inside the tools and platform most teams already use. It runs in VS Code, JetBrains IDEs, Visual Studio, and on GitHub itself.

By 2026 Copilot has grown well past autocomplete:

Agent mode became generally available across both VS Code and JetBrains in March 2026 (previously VS Code only) — a multi-step agent that plans, edits across files, and runs commands inside your editor.
The autonomous coding agent is the standout. You assign a GitHub issue to Copilot, and it works asynchronously in the background — analyzing the repo, making changes, and opening a ready-to-review pull request. Assign, walk away, come back to a PR. It's the closest any mainstream tool comes to "fire-and-forget" feature work.
Agentic code review gathers full project context before suggesting changes and can hand fixes straight to the coding agent.
GitHub Spark lets you describe an app in plain English and get generated code with a live preview.

The strategic point: Copilot's value isn't any single feature — it's that AI is now threaded through the entire GitHub-centric SDLC, from issue to PR to review. If your team lives on GitHub, that integration is hard to beat.

One billing change worth flagging: as of June 1, 2026, GitHub moved to GitHub AI Credits (token-based billing) in place of the older Premium Request Units. You're now billed by tokens processed at published model rates, which makes heavy agent usage more transparent — and easier to accidentally overspend if you're not watching.

What is Claude Code?

Claude Code, from Anthropic, takes the opposite stance from Cursor: instead of building an editor, it makes the terminal the primary surface (with IDE extensions available on top). That sounds minimalist until you see what it does with full shell access.

Its defining strengths:

Agentic, multi-file, repo-aware work from the command line — it reads your codebase, makes coordinated changes across many files, runs your tests, and handles git operations and CI-aware workflows natively.
Subagents — reusable agent configurations with their own custom prompts and tool access, so you can define a "reviewer," a "test-writer," or a "migration" agent and invoke it on demand.
Agent teams and multi-agent orchestration — coordinate multiple agent sessions working in parallel, with an agent view dashboard to manage them.

Claude Code runs on Anthropic's models — currently Claude Opus 4.8 as the default, with the newer Claude Fable 5 as the most capable tier — and it's deliberately model-opinionated rather than a router. The tradeoff is real: it's the most powerful for autonomous, complex tasks, and the least hand-holdy. It assumes you're comfortable thinking like an orchestrator of agents rather than a writer of lines.

A word of caution that applies to every agent platform but bites hardest here: parallel agents multiply your token spend. Running ten agents at once consumes your quota roughly ten times faster. The autonomy is exhilarating; the bill is real. Set limits before you scale up.
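One way to set such a limit, independent of any particular tool, is a shared budget that every parallel agent must draw from. The TokenBudget class below is a hypothetical sketch, not a feature of Claude Code or any other product named here; the numbers are arbitrary.

```python
import threading


class TokenBudget:
    """Shared spending cap for a group of parallel agents (illustrative)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.spent = 0
        self._lock = threading.Lock()

    def try_spend(self, tokens: int) -> bool:
        # Atomically reserve tokens; refuse once the cap would be exceeded.
        with self._lock:
            if self.spent + tokens > self.limit:
                return False
            self.spent += tokens
            return True


budget = TokenBudget(limit=100_000)
# Ten agents each asking for 15,000 tokens: only the first six fit under the cap.
granted = [budget.try_spend(15_000) for _ in range(10)]
print(granted.count(True), budget.spent)   # prints: 6 90000
```

The lock matters because parallel agents would otherwise race on the counter and overshoot the cap.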

Head-to-head: the dimensions that actually matter

The editing model

Cursor wins on in-editor flow. Tab completion and inline diffs keep you in control of every change.
Copilot wins on breadth of surface — it's good everywhere your code already is.
Claude Code wins on autonomous depth — it goes furthest without supervision, but you give up the inline, line-by-line feel.

Agents and autonomy

All three now have agents, but the philosophy differs. Cursor's Cloud Agents and Copilot's coding agent are both "assign work, get a result later." Claude Code goes further with explicit multi-agent orchestration and reusable subagents. If your work is increasingly delegating rather than typing, this is the dimension to weigh most — and it's exactly the shift that makes understanding AI agent architecture and automation a genuine career edge rather than a nice-to-have.

Ecosystem and integration

This is Copilot's home turf. The issue-to-PR loop, native code review, and presence across every major IDE make it the path of least resistance for GitHub-standardized teams. Cursor integrates deeply but inside its editor; Claude Code integrates deeply with your shell and git, which is either liberating or intimidating depending on your comfort with the command line.

Models

Cursor routes across many frontier models and adds its own Composer model. Copilot offers a model picker. Claude Code is Anthropic-only by design. If model choice matters to you (and for some workloads it genuinely does), Cursor and Copilot give you more knobs; Claude Code bets that a tightly-integrated, top-tier model beats a buffet.

Pricing, side by side (mid-2026 snapshot)

Tool	Entry	Mid tier	Power / team
Cursor	Hobby (free)	Pro — $20/user/mo	Teams — $40/user/mo (Standard), $120/user/mo (Premium)
GitHub Copilot	Free	Pro — $10/mo · Pro+ — $39/mo	Max — $100/mo · Business / Enterprise seats
Claude Code	Pro — $20/mo	Max 5× — $100/mo	Max 20× — $200/mo · API pay-per-token

A few honest caveats on cost:

Copilot has the cheapest paid entry tier ($10), but token-based AI Credits mean heavy agent use can climb fast beyond the included allotment.
Cursor's $20 Pro includes a fixed amount of frontier-model usage; power users hit the ceiling and either upgrade or switch to its cheaper Auto/Composer routing.
Claude Code's Max tiers are priced for sustained, agent-heavy sessions, and parallel agents multiply that spend rather than add to it.

Prices and tiers shift constantly in this category. Treat the table as a snapshot, not a quote, and confirm before committing a team budget.

So which one should you choose?

Here's the honest, persona-based answer:

Choose Cursor if you want the best in-editor experience, you value fast inline generation and tight control over every diff, and you're happy living inside a (very good) VS Code fork. It's the most natural upgrade for a developer who loves their editor and wants AI to keep pace with their flow.

Choose GitHub Copilot if your team is standardized on GitHub and you want AI woven through the entire lifecycle — issues, PRs, reviews — across whatever IDEs your team already uses. The issue-to-PR autonomous agent alone can change how a team ships. It's the safest institutional bet.

Choose Claude Code if you're comfortable in the terminal, your work skews toward complex multi-file refactors and autonomous tasks, and you want to orchestrate agents rather than supervise autocomplete. It has the highest ceiling for autonomy — and asks the most of you in return.

And the answer most senior engineers actually land on? More than one. Plenty of us keep Cursor open for flow-state editing, lean on Copilot inside the GitHub workflow, and fire up Claude Code for the gnarly autonomous jobs. The tools overlap, but they're not redundant — they're a toolkit. The real meta-skill isn't loyalty to one editor; it's fluency across the category so you instinctively reach for the right one per task.

The skill underneath the tools

Here's the uncomfortable truth that the demos hide: these tools amplify the engineer you already are. Point a powerful agent at a vague intent and you get a fast, confident wall of code you didn't design and can't fully maintain. The developers getting outsized leverage from Cursor, Copilot, and Claude Code aren't the ones who learned the keyboard shortcuts — they're the ones who understand agent architecture, context engineering, and how to specify intent precisely enough that autonomy becomes an asset instead of a liability.

That foundation is exactly what we build at our AI education platform for Eastern Europe — practical, project-based courses taught around real repositories with an interactive AI instructor, not slide decks. If you want to go from "I use these tools" to "I get serious leverage from them," we maintain dedicated, hands-on tracks for using Cursor as a pro and for agentic coding with Claude Code — both built around real multi-file, real-repo workflows rather than toy examples.

Conclusion

In 2026, "AI coding tool" isn't one product category — it's three philosophies wearing similar clothes. Cursor bet on the editor, Copilot bet on the ecosystem, and Claude Code bet on the terminal-native agent. Each is genuinely excellent at the thing it optimized for, and genuinely compromised at the things it didn't.

So don't ask "which is best." Ask "best at what, for whom, doing which task" — and then build the judgment to switch fluently between them. That judgment, not the tool, is what compounds over a career. Try each one on a real feature, not a demo, and you'll feel the differences fast.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from agentic coding and AI agents to context engineering and the modern AI-native IDE workflow.

Sources: Cursor Models & Pricing · GitHub Copilot Plans & Pricing · GitHub Copilot Plans (Docs) · Claude Pricing · Claude Platform Docs — Pricing

Stop Vibe-Checking Your LLM: A Developer's Guide to Evals

galian — Mon, 22 Jun 2026 08:24:22 +0000

You tweaked the system prompt, ran the same two test questions you always run, the answers looked good, and you shipped. A week later support is forwarding you screenshots of the model confidently doing the exact thing your prompt was supposed to stop. You never saw it, because "did it get better?" was answered by vibes.

This is the single most common failure mode in shipping LLM features, and it has nothing to do with which model you picked. If your only quality gate is reading a handful of outputs and nodding, every change you make is a coin flip. You can't tell whether a prompt edit helped, hurt, or just moved the failures somewhere you didn't look. Evals are how you replace the nod with a number.

This is a practical guide to building that number — from a 30-row eval set you can write this afternoon, through code-based checks and LLM-as-judge scoring, to wiring the whole thing into CI so regressions get blocked instead of discovered by users. No new framework to adopt; just the discipline that separates a demo from a system.

Why you can't just `assert output == expected`

Traditional tests work because the output space is small and exact. add(2, 2) is 4 or it's a bug. LLM output breaks all three assumptions that make assertEqual work:

It's non-deterministic. The same prompt can produce different text on two calls. Even at temperature 0 you are not guaranteed byte-identical output across runs or model versions.
It's open-ended. "Summarize this ticket" has thousands of correct answers. None of them are string-equal to your reference, and that's fine — a good summary isn't the summary.
It fails softly. A wrong answer isn't a stack trace. It's a fluent, plausible, well-formatted paragraph that happens to be incorrect. Nothing crashes. Nothing logs an error.

So the goal of an eval isn't "is the output identical to the expected string." It's "does the output satisfy the properties I care about" — is it grounded in the provided context, does it stay on policy, does it actually answer the question, is it valid JSON. You're testing behavior against criteria, not bytes against bytes. Once that clicks, the rest is mechanics.

Start with the eval set, not the metric

The instinct is to reach for a fancy metric first. Wrong order. The asset that makes everything else work is a small, representative eval set: a fixed collection of inputs paired with what a good output looks like (or the criteria a good output must meet). This is your golden dataset, your regression suite, your source of truth.

You do not need thousands of examples to start. Thirty to fifty well-chosen pairs turn LLM tuning from vibes into engineering, because now every change is measured against the same fixed bar. Build the set like this:

Mine real failures. Every time the system gets something wrong in dev or prod, that exact input goes into the eval set with a note on what the right behavior is. Your bug reports are your test cases. This is the highest-signal source you have.
Cover the categories, not just the happy path. Easy questions, ambiguous ones, adversarial ones, out-of-scope ones ("I don't know" is the correct answer and you should test that it says so), and the edge cases specific to your domain.
Freeze it and version it. The eval set lives in your repo next to the code. When you add a case, that's a commit. A moving target can't measure progress.
Keep a holdout. If you start tuning prompts against the eval set, you'll overfit to it. Keep a slice you don't look at until you think you're done.

A minimal eval set is just data — JSON, a CSV, a Python list. Here's the shape:

# evals/dataset.py
EVAL_SET = [
    {
        "id": "refund-window-basic",
        "question": "What is our refund window?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "14 days",
        "must_not_say": ["30 days", "no refunds"],
    },
    {
        "id": "out-of-scope",
        "question": "What's the weather in Cluj tomorrow?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "REFUSE",  # correct behavior: decline, don't invent
    },
    # ... 30-50 of these, grown from real failures
]

That's the foundation. Everything below scores outputs against this set.
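One way to carve out the holdout slice mentioned above is a deterministic split keyed on each case's id, so a case never drifts between the tuning set and the holdout as the list grows. This is a minimal sketch: `split_holdout` and the 20% fraction are illustrative assumptions, not part of any library.

```python
import hashlib

def split_holdout(cases, holdout_fraction=0.2):
    """Deterministically split eval cases into (dev, holdout) by hashing each id."""
    dev, holdout = [], []
    for case in cases:
        digest = hashlib.sha256(case["id"].encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
        (holdout if bucket < holdout_fraction else dev).append(case)
    return dev, holdout

# Hypothetical ids, only to show the call shape
cases = [{"id": f"case-{i}"} for i in range(50)]
dev, holdout = split_holdout(cases)
print(len(dev), len(holdout))
```

Because the bucket depends only on the id, adding new cases never reshuffles old ones. The split is approximate, so the holdout will be near 20% rather than exactly 20%.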

The two halves of every LLM eval

Separate two questions that get mushed together when you eval by eyeball, because they have different fixes:

Did the system retrieve / set up the right context? (a retrieval or pipeline question)
Given that context, did the model produce a good answer? (a generation question)

If you're building RAG, the first half is its own discipline — measuring recall@k and precision@k on questions with known relevant documents tells you whether the right chunk even reached the prompt. That's a deep enough topic that it deserves its own treatment; a dedicated course on RAG and retrieval-augmented generation spends real time there, and the failure modes are different from the ones below. This guide focuses on the second half: scoring the generated answer. The techniques split into two families — code-based checks and model-based judges — and you want both.
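For reference, the two retrieval metrics named above fit in a few lines. This sketch assumes you already have, per question, the ranked list of retrieved document ids and the set of ids a human marked as relevant; the function names and ids are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc_id in top_k if doc_id in relevant_set) / len(top_k)

retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]  # ranked retriever output
relevant = ["doc-2", "doc-4"]                      # hand-labeled ground truth

print(recall_at_k(retrieved, relevant, k=3))     # 0.5: one of two relevant docs made the top 3
print(precision_at_k(retrieved, relevant, k=3))  # one of the three retrieved docs is relevant
```

Note that this version divides precision by the number of documents actually returned, which equals k whenever the retriever returns at least k results.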

Code-based checks: cheaper and more reliable than you think

Before you reach for an LLM to grade an LLM, a surprising amount of quality is checkable with plain code. These checks are deterministic, free, instant, and never hallucinate. Use them for everything they can cover:

Structural validity. If the output should be JSON matching a schema, validate it. A response that doesn't parse is a hard failure, no judgment call needed.
Must-contain / must-not-contain. The answer about a 14-day refund window must contain "14" and must not contain "30." Keyword and regex assertions catch a whole class of factual regressions for free.
Format and bounds. Length limits, required citations present, no leaked system-prompt text, no forbidden phrases (the "as an AI language model" tax), valid enum values.
Semantic similarity. For open-ended answers, embed the output and your reference answer and check that their cosine similarity clears a threshold. It's fuzzy, and it costs an embedding call, but it catches "the answer wandered off topic" without needing a judge model.

# evals/checks.py
import json

def check_structural(output: str, schema_keys: list[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in schema_keys)

def check_must_not_say(output: str, banned: list[str]) -> bool:
    low = output.lower()
    return not any(b.lower() in low for b in banned)
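The semantic-similarity check from the list can be sketched the same way. The `embed` argument is a hypothetical callable that turns text into a vector (in practice, a call to whatever embedding model you use); `toy_embed` below is a stand-in so the example runs on its own, and the 0.8 threshold is an arbitrary starting point to tune against your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def check_semantic(output, reference, embed, threshold=0.8):
    """True if output and reference embed to sufficiently similar vectors."""
    return cosine_similarity(embed(output), embed(reference)) >= threshold

def toy_embed(text):
    """Stand-in embedding: word counts over a tiny fixed vocabulary (illustration only)."""
    vocab = ["refund", "days", "weather", "14"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

print(check_semantic("refund within 14 days", "14 days refund", toy_embed))
print(check_semantic("weather tomorrow", "14 days refund", toy_embed))
```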

The rule of thumb: anything a regex or a schema can catch, don't pay a model to catch. Reserve the expensive, fuzzy judge for the genuinely subjective stuff.

LLM-as-judge: powerful, biased, and fixable

For the subjective half ("is this answer faithful to the source?", "is this helpful?", "is the tone right?") you use a strong model to grade outputs. This is LLM-as-judge, and once calibrated it scales roughly human-level judgment to thousands of examples at the cost of one API call per example. Two metrics carry most of the weight for RAG-style apps:

Faithfulness / groundedness — does every claim in the answer trace back to the provided context, or did the model invent things? This is your hallucination detector.
Answer relevance — does the response actually address the question that was asked, or is it a fluent dodge?

The catch: LLM judges have well-documented biases, and if you ignore them your eval numbers are noise dressed up as signal. The big ones, all reported in the research on using models as evaluators:

Position bias — when comparing two answers, judges favor the one shown first (or in a fixed slot) regardless of quality.
Verbosity bias — judges tend to rate longer, more elaborate answers higher even when a short answer is more correct.
Self-preference — a judge model can favor text written in its own style or by its own family.

You don't abandon the technique; you engineer around the bias:

Score against a rubric, not a vibe. Ask for a 1–5 score with explicit criteria for each level, and require the judge to output its reasoning before the score. A judge forced to justify itself is more consistent.
For pairwise comparisons, randomize and swap. Run each comparison twice with the order flipped; only count it as a win if the judge picks the same answer both times. This cancels position bias directly.
Calibrate against humans. Hand-label 20–30 examples yourself, run the judge on them, and check it agrees with you. If it doesn't, fix the rubric before trusting it on 2,000. An uncalibrated judge is a random number generator with good grammar.
Use a strong model as the judge. Grading is harder than answering. Use a current frontier model for the judge even if your app runs on a smaller, cheaper one.
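The randomize-and-swap step is mostly bookkeeping. In this sketch, `judge` is a hypothetical callable that receives the question and two answers and returns "first" or "second"; how it calls a model is up to you. Any verdict that flips when the order flips is recorded as a tie.

```python
def pairwise_winner(judge, question, answer_a, answer_b):
    """Return "A", "B", or "tie" after judging the pair in both orders."""
    first_pass = judge(question, answer_a, answer_b)
    second_pass = judge(question, answer_b, answer_a)  # same pair, order flipped
    if first_pass == "first" and second_pass == "second":
        return "A"
    if first_pass == "second" and second_pass == "first":
        return "B"
    return "tie"  # the verdict followed the position, so it doesn't count

# Two stub judges, for illustration only
always_first = lambda q, x, y: "first"                                 # pure position bias
prefers_longer = lambda q, x, y: "first" if len(x) >= len(y) else "second"

print(pairwise_winner(always_first, "q", "short", "a much longer answer"))
print(pairwise_winner(prefers_longer, "q", "short", "a much longer answer"))
```

The first stub always comes out as a tie, which is the point: a judge that only reacts to position never produces a win. The second stub shows that the swap does nothing about verbosity bias, which needs the rubric and calibration steps instead.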

# evals/judge.py: sketch of a rubric-based faithfulness judge
import json

JUDGE_PROMPT = """You are grading whether an ANSWER is fully supported by the CONTEXT.

CONTEXT:
{context}

ANSWER:
{answer}

Rules:
- A claim is "supported" only if the CONTEXT states or directly implies it.
- Outside knowledge does NOT count as support.

Output only a JSON object, with one sentence of reasoning before the verdict:
{{"reasoning": "...", "faithful": true|false}}"""

def judge_faithfulness(client, context: str, answer: str) -> bool:
    # client.complete is a placeholder for your provider's text-completion call
    resp = client.complete(
        JUDGE_PROMPT.format(context=context, answer=answer),
        temperature=0,
    )
    # The prompt asks for JSON only, so the whole response should parse
    return bool(json.loads(resp)["faithful"])
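Before trusting a judge like this, the calibration step boils down to comparing its verdicts with your own labels on the same cases. A minimal sketch, with made-up labels and hypothetical helper names:

```python
def agreement_rate(human_labels, judge_labels):
    """Share of cases where the judge's verdict matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        return 0.0
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def disagreements(ids, human_labels, judge_labels):
    """Ids of the cases where judge and human differ, for rubric debugging."""
    return [i for i, h, j in zip(ids, human_labels, judge_labels) if h != j]

ids = ["case-1", "case-2", "case-3", "case-4", "case-5"]
human = [True, True, False, True, False]   # your hand labels
judge = [True, False, False, True, True]   # the judge's verdicts on the same cases

print(agreement_rate(human, judge))      # 0.6
print(disagreements(ids, human, judge))  # ['case-2', 'case-5']
```

The list of disagreements is usually more useful than the rate itself, because reading those cases tells you which rubric line is ambiguous.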

Designing judges that hold up — picking the rubric, calibrating, knowing when a model is the wrong tool for the grade — is exactly the muscle a course on AI evals in production builds, because it's the difference between "the new prompt feels better" and "faithfulness went from 0.78 to 0.91 on the holdout."

Wire it into CI, or it won't survive contact with deadlines

An eval you run by hand when you remember to is an eval you'll stop running the week things get busy. The whole point is to make regressions impossible to ship silently, and that means the eval runs automatically on every change to a prompt, a retrieval setting, or a model version.

The pattern is a regression gate: run the eval set, compute the aggregate score, and fail the build if the score drops below a threshold (or below the last known-good baseline). It looks like an ordinary test suite, because that's what it is.

# tests/test_evals.py
import pytest
from evals.dataset import EVAL_SET
from evals.checks import check_must_not_say
from myapp import answer_question

PASS_THRESHOLD = 0.90  # 90% of eval cases must pass to ship

def run_case(case) -> bool:
    output = answer_question(case["question"], case["context"])
    if case["expected"] == "REFUSE":
        return "i don't know" in output.lower() or "can't" in output.lower()
    if not check_must_not_say(output, case.get("must_not_say", [])):
        return False
    return case["expected"].lower() in output.lower()

def test_eval_suite_meets_threshold():
    results = [run_case(c) for c in EVAL_SET]
    score = sum(results) / len(results)
    failed = [c["id"] for c, ok in zip(EVAL_SET, results) if not ok]
    assert score >= PASS_THRESHOLD, f"Eval score {score:.2f} below {PASS_THRESHOLD}. Failed: {failed}"
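The test above gates on a fixed threshold. The other variant, failing when the score drops below the last known-good baseline, could look like the sketch below. The JSON file layout, the helper names, and the 0.02 tolerance are all assumptions for illustration.

```python
import json
from pathlib import Path

def load_baseline(path, default=0.0):
    """Read the last known-good score, or fall back to default if none is stored."""
    p = Path(path)
    if not p.exists():
        return default
    return json.loads(p.read_text())["score"]

def save_baseline(path, score):
    """Record a new known-good score, e.g. after a merge to main."""
    Path(path).write_text(json.dumps({"score": score}))

def passes_gate(score, baseline, tolerance=0.02):
    """True unless the score fell more than `tolerance` below the baseline."""
    return score >= baseline - tolerance

print(passes_gate(0.91, baseline=0.90))  # True
print(passes_gate(0.85, baseline=0.90))  # False
```

The tolerance plays the same role as a forgiving threshold: it absorbs small run-to-run noise without letting a real regression through.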

A few practical notes that keep this sane in CI:

Pin the model version. Provider model IDs update, and an unpinned model means your eval baseline shifts under you for reasons unrelated to your code. Pin it, and treat a model upgrade as its own deliberate eval run.
Budget for cost and flakiness. LLM calls cost money and occasionally time out. Cache where you can, run the judge-heavy suite on a schedule rather than every commit if needed, and set a slightly forgiving threshold so one stochastic blip doesn't red-X a good PR.
Log the failures, not just the score. When the gate trips, the output should name which cases regressed so the fix is obvious. A bare "0.86 < 0.90" sends you debugging blind.
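For the "cache where you can" note, one simple approach is to key cached responses on the model id plus the exact prompt, so a prompt edit or a model upgrade automatically misses the cache. This is an in-memory sketch; `ResponseCache` and the `call(model, prompt)` signature are hypothetical, and a real CI setup would likely persist the store between runs.

```python
import hashlib
import json

class ResponseCache:
    """Cache of model responses keyed by (model, prompt)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call):
        """Return the cached response, or invoke call(model, prompt) and store the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(model, prompt)
        self._store[key] = result
        return result

calls = []

def fake_call(model, prompt):
    """Stand-in for a real model call; records each invocation."""
    calls.append((model, prompt))
    return f"answer to {prompt}"

cache = ResponseCache()
cache.get_or_call("model-a", "q1", fake_call)
cache.get_or_call("model-a", "q1", fake_call)  # served from cache
cache.get_or_call("model-b", "q1", fake_call)  # different model, so a miss
print(len(calls), cache.hits, cache.misses)    # 2 1 2
```

One trade-off to keep in mind: a cache hides run-to-run variation, so it suits unchanged cases on unchanged prompts, not measuring how stable the model's answers are.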

Now a prompt change is a PR with a number attached. The reviewer sees faithfulness went up and refusal rate held steady, or they see it tanked and the build is red. That's the entire difference between hoping and knowing.

Five mistakes that quietly poison your evals

Even teams that build evals often undermine them. Watch for these:

Testing only the happy path. If every case in your set is a question the system already answers well, your score is a flattering lie. Adversarial and out-of-scope cases are where the signal is.
Tuning on your test set. Optimize prompts against the same examples you grade on and you'll overfit to them. Keep a holdout you don't peek at.
An uncalibrated judge. Trusting an LLM judge you never checked against your own labels is trusting a number you made up. Calibrate first.
One giant blended score. A single average hides that faithfulness improved while refusals broke. Track metrics separately so a regression in one can't be masked by a gain in another.
Letting the set rot. Your product changes; cases that no longer reflect real usage drag the signal down. Prune and grow the set as part of normal work, the same way you maintain any test suite.
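The blended-score mistake is easy to see with numbers. In this sketch (metric names, scores, and thresholds are made up), the average clears 0.90 while the refusal metric on its own does not.

```python
def failing_metrics(scores, thresholds):
    """Return {metric: (score, threshold)} for each metric below its own threshold.

    A metric that has a threshold but no score counts as failing.
    """
    failures = {}
    for metric, threshold in thresholds.items():
        score = scores.get(metric)
        if score is None or score < threshold:
            failures[metric] = (score, threshold)
    return failures

scores = {"faithfulness": 0.99, "refusal": 0.85}
thresholds = {"faithfulness": 0.90, "refusal": 0.90}

blended = sum(scores.values()) / len(scores)
print(blended >= 0.90)                      # True: the average hides the problem
print(failing_metrics(scores, thresholds))  # {'refusal': (0.85, 0.9)}
```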

None of these are exotic. They're the eval equivalent of not testing error paths — obvious in hindsight, easy to skip under deadline.

How this connects to the rest of your LLM stack

Evals aren't a standalone chore; they're the measurement layer that makes every other improvement legible. When you tighten a prompt, evals tell you if it worked — which is why structured prompt engineering and a real eval loop are two halves of the same skill. When you redesign what goes into the context window — what to include, what to cut, how to order it — evals are how you know the redesign helped rather than just felt cleaner; that discipline of deciding what earns a place in the prompt is increasingly called context engineering and has its own dedicated course. And when you wire up function calling, multi-tool orchestration, and the production concerns of a real integration, evals are what keep the whole pipeline honest as it grows — the kind of end-to-end build covered in a deeper course on advanced LLM integration. The pattern is always the same: build the measurement first, then every change becomes verifiable instead of hopeful.

Conclusion

The teams whose LLM features actually hold up in production aren't using a secret model or a magic prompt. They're disciplined about measurement. They have a versioned eval set grown from real failures, code-based checks for everything a regex can catch, calibrated LLM judges for the subjective rest, and a CI gate that blocks regressions before users find them.

Start smaller than you think you can. Write thirty cases this afternoon — half of them things your system currently gets wrong — add three code checks and one rubric-based judge, and put a threshold in your test suite. The first time a red build stops you from shipping a prompt change that would have quietly broken refusals, you'll never go back to vibe-checking. That's the moment an LLM demo becomes an LLM system people can trust.

The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with hands-on, current tracks on evaluating AI systems in production, prompt engineering, RAG, and advanced LLM integration.

Sources & further reading:

Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (documents position, verbosity, and self-enhancement bias in LLM judges)
Liu et al. — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Liang et al. — Holistic Evaluation of Language Models (HELM)

This article is educational content. Techniques and tooling evolve quickly; validate approaches against your own data and current library documentation.

Claude Fable 5: A Developer's Guide to Anthropic's New Top

galian — Wed, 10 Jun 2026 22:22:18 +0000

Anthropic just moved the ceiling again. Claude Fable 5 is the company's most powerful, most intelligent model to date — and it isn't "Opus 4.9." It's a new tier that sits above the entire Opus family. If you build with LLMs, that distinction matters: it changes how you think about model routing, cost, and which tasks deserve your most capable (and most expensive) reasoning.

This is a practical, no-hype guide for developers. We'll cover what Claude Fable 5 actually is, how it slots into Anthropic's 2026 lineup, what changes in the API surface, when the premium is justified, and how to migrate existing code. Everything here is grounded in Anthropic's own model and API documentation — no invented benchmarks.

What Is Claude Fable 5?

Claude Fable 5 is Anthropic's flagship reasoning model, exposed through the API as claude-fable-5. The headline facts:

A new tier above Opus. Until now, "Opus" was the top of the Claude lineup. Fable 5 establishes a level above it — positioned for the hardest reasoning, planning, and long-horizon agentic work.
1M-token context window, with up to 128K tokens of output.
Premium pricing: roughly $10 / $50 per million input / output tokens — about double Opus 4.8's $5 / $25. That price tag is the whole point: Fable 5 is a precision tool you point at the problems that justify it, not a default for every call.
Adaptive thinking only. The fixed "thinking budget" knob is gone. The model decides how much to reason per request.

The mental model to internalize: Fable 5 is the peak of a four-tier lineup, and capability scales with cost. You don't run your whole pipeline on it any more than you'd render every frame of a film at maximum quality regardless of the shot. You route the hard parts to it.

Where Fable 5 Fits in the 2026 Anthropic Lineup

Anthropic's current family is a ladder of capability-vs-cost. Picking the right rung per task is one of the highest-leverage habits an AI engineer can build.

Model	Role	Reach for it when…
Claude Fable 5	Absolute peak capability; premium price	The hardest reasoning, planning, cross-cutting refactors, and long-running agent loops where correctness outweighs cost
Claude Opus 4.8	Top of the Opus family; a strong default in Claude Code	Complex day-to-day work — planning, large refactors, tricky debugging — with a better capability/cost ratio than Fable
Claude Sonnet 4.6	Balanced, fast, 1M context	The bulk of everyday coding, reading, and iteration
Claude Haiku 4.5	Light, fast, cheap	High-volume small operations, classification, auxiliary steps

The practical takeaway: model choice is a cost-and-quality lever. A well-designed system routes each sub-task to the cheapest model that can do it well, and escalates to Fable 5 only where the payoff is real. If you want a structured, side-by-side breakdown of the 2026 models and how to choose between them, there's a dedicated AI model comparison course that goes deeper than any single table can.
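In code, routing can start as nothing more than a lookup table. The task labels below are hypothetical, the mapping simply mirrors the table above, and falling back to Sonnet 4.6 for unknown tasks is an assumption rather than a recommendation from Anthropic.

```python
# Hypothetical task labels mapped to the tiers from the table above
ROUTES = {
    "planning": "Claude Fable 5",
    "cross_cutting_refactor": "Claude Fable 5",
    "large_refactor": "Claude Opus 4.8",
    "debugging": "Claude Opus 4.8",
    "everyday_coding": "Claude Sonnet 4.6",
    "classification": "Claude Haiku 4.5",
}
DEFAULT_MODEL = "Claude Sonnet 4.6"  # assumed fallback for unlabeled tasks

def pick_model(task_type):
    """Return the model tier for a task label, defaulting to the balanced tier."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(pick_model("planning"))        # Claude Fable 5
print(pick_model("classification"))  # Claude Haiku 4.5
print(pick_model("something_else"))  # Claude Sonnet 4.6
```

Keeping the table in one place also makes it something you can run evals against: change one route, rerun the suite, compare cost and quality.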

What Changes in the API

This is the part developers actually care about. Fable 5 shares the modern Claude request surface (the same one introduced with Opus 4.7/4.8), with a couple of sharp edges worth knowing before you ship.

Adaptive thinking, not a token budget

Fable 5 supports a single thinking mode: adaptive. You no longer pass a fixed budget_tokens value — the model regulates its own reasoning depth.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    thinking={"type": "adaptive"},        # adaptive is the only thinking mode
    output_config={"effort": "xhigh"},    # strong default for coding/agentic work
    messages=[
        {"role": "user", "content": "Refactor this module and add unit tests."}
    ],
)

for block in response.content:
    if block.type == "text":
        print(block.text)

A few things that will save you a debugging session:

Don't send temperature, top_p, top_k, or budget_tokens. They're removed on this generation and return 400. Steer behavior with prompting and the effort parameter instead.
Don't send thinking={"type": "disabled"} on Fable 5. Unlike Opus 4.8/4.7, an explicit disabled returns 400 here. To run without thinking, omit the thinking parameter entirely. This is the one genuinely new breaking change relative to the Opus 4.x line — easy to miss.
Thinking text is omitted by default. Thinking blocks still stream, but their content is empty unless you opt in with thinking={"type": "adaptive", "display": "summarized"}. If your UI shows reasoning progress, set this or your users will see a long pause before output.

The effort parameter is your real control knob

output_config.effort accepts low, medium, high, xhigh, and max. It controls how much the model thinks and acts — not just thinking depth. For coding and agentic workloads, xhigh is the sweet spot and is the effort level Claude Code defaults to. Treat effort as something to tune per route: max for correctness-critical work, medium/low for latency-sensitive or simple steps.

Large outputs need streaming

With up to 128K output tokens available, non-streaming requests will hit SDK HTTP timeouts well before that ceiling. For anything above ~16K max_tokens, stream and collect the final message:

with client.messages.stream(
    model="claude-fable-5",
    max_tokens=64000,
    thinking={"type": "adaptive"},
    output_config={"effort": "xhigh"},
    messages=[{"role": "user", "content": "Generate the full migration plan."}],
) as stream:
    message = stream.get_final_message()

What it still supports

Fable 5 keeps the modern toolbox: structured outputs (output_config.format), prompt caching (minimum cacheable prefix ~2,048 tokens), server-side compaction for very long conversations, web search with dynamic filtering, and task budgets (beta) for telling an agent how many tokens it has for a full loop. If you're wiring these into a real application, the patterns matter as much as the model — that's the focus of this hands-on course on building AI apps with the Anthropic and OpenAI SDKs, which walks from raw API calls to a production-shaped product.

Fable 5 for Agentic Coding

The reason Fable 5 is interesting to developers specifically is long-horizon agentic execution: multi-file refactors, overnight runs, and tasks that span dozens of tool calls without a human correcting course.

Three habits get the most out of it:

Give the full task spec up front in one well-formed turn. Fable 5 plans better when it has the complete goal early; drip-feeding requirements across many turns tends to cost more tokens and can hurt output quality.
Run at high or xhigh effort with generous max_tokens. Long-horizon coherence comes partly from the model reasoning more at each step — give it room.
Route deliberately. Use Fable 5 for the planning and the genuinely hard edits; delegate mechanical or high-volume sub-steps to Sonnet 4.6 or Haiku 4.5.

If terminal-first agentic coding is your world, the workflow discipline — CLAUDE.md project memory, plan/edit/review loops, hooks as deterministic guardrails, and model routing across the lineup — is exactly what a dedicated Claude Code mastery course covers end to end. Agent architecture beyond a single tool (orchestration, delegation, parallelism) is its own discipline, well covered in this course on designing autonomous AI agents.

Context is a resource, even at 1M tokens

A 1M-token window is not a license to dump everything into context. Irrelevant context dilutes the model's attention and costs tokens on every turn, no matter how capable the model is. The skill that separates engineers who "get lucky" with agents from those who ship reliable ones is deliberate context engineering — what to load, what to compact, what to persist as memory across sessions. It's enough of a topic to warrant its own course on context engineering and memory for agents.

When Fable 5 Is Actually Worth the Premium

Here's the honest cost reasoning, because "use the best model" is bad engineering advice.

At roughly double the per-token cost of Opus 4.8, Fable 5 pays off when the cost of a wrong answer is high relative to the token bill:

Worth it: a complex cross-service refactor where a subtle regression costs hours of human review; a planning step that determines the trajectory of a long agent run; an analysis where correctness is non-negotiable.
Not worth it: routine edits, summaries, classifications, and the long tail of mechanical sub-tasks — those belong on Sonnet 4.6 or Haiku 4.5.

A useful rule of thumb: let Fable 5 plan and decide, and let cheaper models execute the parts that are already well-specified. That keeps your bill proportional to difficulty instead of flat-out maximal.
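To make the premium concrete, here is the per-call arithmetic using the approximate prices quoted earlier in this guide ($10 / $50 for Fable 5, $5 / $25 for Opus 4.8, per million input / output tokens). The token counts are an arbitrary example.

```python
# (input, output) USD per million tokens, as quoted in this article
PRICES = {
    "Claude Fable 5": (10.0, 50.0),
    "Claude Opus 4.8": (5.0, 25.0),
}

def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of one call at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example call: 20,000 input tokens and 4,000 output tokens
fable = call_cost("Claude Fable 5", 20_000, 4_000)
opus = call_cost("Claude Opus 4.8", 20_000, 4_000)
print(f"Fable 5: ${fable:.2f}  Opus 4.8: ${opus:.2f}")  # Fable 5: $0.40  Opus 4.8: $0.20
```

Twenty cents of difference per call is noise next to an hour of human review, and a real line item across a million mechanical calls, which is the whole routing argument in one number.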

The other lever is effort. Because effort matters more on this generation than on any prior Opus, a Fable 5 call at medium effort can be both cheaper and faster than an Opus 4.8 call at xhigh for some tasks — so benchmark on your own workload rather than assuming "bigger model = always slower and pricier in practice."

Migrating from Opus 4.8 / 4.7

If you're already on the modern Claude surface, moving to Fable 5 is mostly a model-ID swap plus a couple of checks:

Swap the model string to claude-fable-5.
Remove budget_tokens if any remain → use thinking={"type": "adaptive"}.
Strip temperature / top_p / top_k — they 400.
Replace last-assistant-turn prefills with structured outputs (output_config.format) or a system-prompt instruction — prefills 400 on this generation.
Audit for thinking={"type": "disabled"} — it 400s on Fable 5. Omit thinking instead.
Re-tune effort per route — start at high, use xhigh for coding/agentic, reserve max for correctness-critical work.
Set display: "summarized" if you surface reasoning in a UI.
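Most of that checklist is mechanical, so it can be applied to existing request kwargs in one place. This sketch only encodes the rules as stated in this article; `migrate_request` and `has_prefill` are hypothetical helpers, and prefills are only detected, not rewritten, because replacing them is a judgment call.

```python
REMOVED_PARAMS = ("temperature", "top_p", "top_k")

def migrate_request(kwargs):
    """Return a copy of request kwargs adjusted per the checklist above."""
    migrated = dict(kwargs)
    migrated["model"] = "claude-fable-5"
    for name in REMOVED_PARAMS:
        migrated.pop(name, None)  # sampling parameters are stripped
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict):
        if thinking.get("type") == "disabled":
            del migrated["thinking"]  # omit the parameter instead of sending "disabled"
        elif "budget_tokens" in thinking:
            migrated["thinking"] = {"type": "adaptive"}
    return migrated

def has_prefill(kwargs):
    """True if the last message is an assistant turn, i.e. a prefill to replace by hand."""
    messages = kwargs.get("messages") or []
    return bool(messages) and messages[-1].get("role") == "assistant"

old = {
    "model": "old-model-id",  # placeholder, not a real model id
    "max_tokens": 4096,
    "temperature": 0.2,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "Refactor this module."}],
}
new = migrate_request(old)
print(new["model"], new["thinking"], "temperature" in new, has_prefill(new))
```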

Steering this generation is done through prompting and effort rather than sampling parameters, so the quality of your instructions matters more than ever. If your prompts were tuned years ago for older models, they're probably leaving capability on the table — a structured refresh of prompt engineering fundamentals tends to pay for itself quickly on a model this capable.

A Note on Hype vs. Reality

Two guardrails worth keeping as the launch noise settles:

Fable 5 is the most capable model — not necessarily the default everywhere. In Claude Code, for instance, Opus 4.8 remains a strong default; Fable 5 is the tier you select for the hardest work. "Most capable" and "default" are different claims.
Version hygiene matters. Fable 5 is the current peak, Opus 4.8 is the top of the Opus family, and Opus 4.7 is the previous Opus generation. Anything from the Claude 3.x line (or GPT-4-class / Gemini 2.x models) is outdated and shouldn't be treated as current when you're evaluating tutorials or benchmarks. Always confirm model IDs, limits, and pricing against the official docs, since they shift between releases.

TL;DR Cheat Sheet

For quick reference when you wire Claude Fable 5 into a real codebase:

Model ID: claude-fable-5. Context window 1M tokens, output up to 128K.
Thinking: {"type": "adaptive"} is the only mode. To run without it, omit the parameter — never send {"type": "disabled"} (it returns 400).
Effort: output_config.effort is your main control — xhigh for coding and agents, max when correctness is critical, low/medium for simple or latency-sensitive steps.
Removed (all 400 if sent): temperature, top_p, top_k, budget_tokens, and last-assistant-turn prefills.
Reasoning in your UI: add "display": "summarized" to the thinking config, or the thinking text comes back empty.
Large outputs: stream anything above ~16K max_tokens.
Routing: send the hard reasoning to Fable 5; keep routine and high-volume work on Sonnet 4.6 and Haiku 4.5.

Conclusion

Claude Fable 5 isn't just a bigger Opus — it's a new top tier that reframes how you should think about model routing in 2026. The winning pattern is the same as it's always been, just sharper: use the most capable model where correctness compounds, push everything else down the ladder to cheaper models, and tune effort per route. Master that, and Fable 5 becomes a precision instrument rather than a line item that surprises you on the invoice.

If you want to go from "I read about it" to "I ship with it," the courses linked throughout are part of Cursuri-AI.ro, a Romanian AI-learning platform with deep, hands-on tracks on Claude Code, agent architecture, the Anthropic SDK, context engineering, and model selection — all kept current with the 2026 lineup, Fable 5 included.

Found this useful? Save it, and drop your Fable 5 routing strategy in the comments — what are you sending to the top tier, and what stays on Sonnet?

Prompt Caching with Claude: How We Cut AI API Costs by 90% in Production (2026 Guide)

galian — Mon, 01 Jun 2026 09:02:05 +0000

TL;DR — Anthropic's prompt caching gives you a 90% discount on cached input tokens and up to 85% lower latency on long-context calls. But the wins only show up if you understand cache breakpoints, TTLs, and what actually invalidates the cache. This guide walks through 5 production patterns we use, real benchmarks, and the pitfalls that silently kill your hit rate.

The cost problem nobody warns you about

When you ship anything serious with Claude — an agent, a RAG system, a code assistant, a customer support bot — you discover the same uncomfortable truth: your input token bill dwarfs your output bill.

A typical agent loop looks like:

System prompt: ~3,000 tokens (instructions, persona, constraints)
Tool definitions: ~4,000 tokens (JSON schemas for 10–20 tools)
Conversation history: 5,000–50,000 tokens (grows every turn)
RAG context: 5,000–20,000 tokens per query
User message: ~200 tokens
Model output: ~500 tokens

Every single turn, you re-send the same system prompt, the same tool definitions, and most of the conversation history. On Claude Sonnet 4.6 at $3 per million input tokens, a 15,000-token prefix sent across 20 conversation turns costs you $0.90 per conversation in input alone — before you've generated a single useful token of output.

At 10,000 such conversations a day (one per daily active user, say), you're burning $9,000/day just to re-process content you already sent.

This is exactly what prompt caching fixes.

What Claude's prompt caching actually does

Anthropic's prompt caching lets the API store the internal state for a prefix of your prompt and reuse it on subsequent requests. Two numbers matter:

Operation	Pricing relative to base input
Cache write (first time a prefix is seen)	1.25× base input cost
Cache read (subsequent hits)	0.10× base input cost (90% off)

You pay a small one-time premium to write the cache, then every hit after that is 10% of the normal price. Caching pays for itself on the first cache hit: two requests cost 1.25 + 0.10 = 1.35× the base input price with caching, versus 2.00× without.
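To make the break-even arithmetic concrete, here is a minimal sketch that turns the two multipliers from the table into a cost model. The function name is made up for illustration, and it assumes every request after the first lands inside the cache TTL.

```python
def relative_cost(requests: int, cached: bool,
                  write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of sending one prefix `requests` times, in units of one uncached send."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(requests)
    # The first request writes the cache; every later one is assumed to read it.
    return write_mult + read_mult * (requests - 1)


for n in (1, 2, 5, 20):
    print(n, relative_cost(n, cached=False), round(relative_cost(n, cached=True), 2))
```

A single request is the only case where caching costs more (1.25 versus 1.0). From the second request onward the cached total is lower, and the gap widens with every additional hit.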

The mental model

Think of it as a prefix tree with checkpoints. You mark up to 4 points in your prompt with cache_control, and Claude caches everything from the start of the prompt up to each breakpoint. On the next request, if the prefix matches byte-for-byte, you get a cache hit.

The order Claude processes the prompt is fixed:

tools → system → messages (oldest → newest)

Your cache breakpoints must respect that order. You cannot cache a later block without caching everything before it.

The TTL trap

The default cache TTL is 5 minutes, refreshed on every read. A 1-hour TTL is available as a premium option (costs more on write, same on read). Most teams over-pay for the 1-hour cache when 5 minutes would have served them fine — if your traffic is steady, every request refreshes the TTL and the cache effectively lives forever.
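A quick way to sanity-check the TTL choice is to replay your request timestamps against a toy model of the cache. The sketch below is only that, a toy: `simulate_cache` is a hypothetical helper, it tracks a single prefix, and it assumes both writes and reads restart the TTL clock as described above.

```python
def simulate_cache(request_times_s: list[float], ttl_s: float = 300.0) -> dict:
    """Count cache writes and reads for one prefix, given request times in seconds."""
    writes = reads = 0
    expires_at = None
    for t in sorted(request_times_s):
        if expires_at is not None and t < expires_at:
            reads += 1   # prefix still cached
        else:
            writes += 1  # cold or expired: pay the write premium again
        expires_at = t + ttl_s  # a write or a read restarts the clock
    return {"writes": writes, "reads": reads}


steady = [i * 60.0 for i in range(60)]   # one request per minute for an hour
sparse = [i * 600.0 for i in range(6)]   # one request every 10 minutes

print(simulate_cache(steady))                 # 5-minute TTL, steady traffic
print(simulate_cache(sparse))                 # 5-minute TTL, sparse traffic
print(simulate_cache(sparse, ttl_s=3600.0))   # 1-hour TTL, sparse traffic
```

With steady traffic the 5-minute cache is written once and read 59 times. With one request every 10 minutes it is rewritten on every call, and only the 1-hour TTL produces reads.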

Want to go deeper on Claude's API mechanics in production? Prompt caching, tool use, batch API, streaming, and cost optimization are covered in depth in the Advanced LLM Integration course on Cursuri-AI.ro.

Pattern 1: Cache the system prompt and tool definitions

This is the highest-ROI change you can make, and most codebases get it wrong on the first try.

Wrong (no caching):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a senior software engineer. [...3000 tokens of instructions...]",
    tools=[...20 tool definitions, ~4000 tokens...],
    messages=[{"role": "user", "content": "Refactor this function"}],
)

Right (cached):

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior software engineer. [...3000 tokens of instructions...]",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        {
            "name": "read_file",
            "description": "...",
            "input_schema": {...},
        },
        # ... more tools ...
        {
            "name": "last_tool",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"},  # cache breakpoint on the last tool
        },
    ],
    messages=[{"role": "user", "content": "Refactor this function"}],
)

Two things to notice:

cache_control on the last tool caches the tool definitions only, because tools come first in the processing order.
cache_control on the system block caches everything up through the system prompt, which means tools + system.

You typically only need the system breakpoint, since it covers everything before it. The separate tool breakpoint is still useful when the system prompt changes more often than the tools: the tools prefix can keep hitting even when the system block misses.

Reading the response

The API returns cache stats in response.usage:

print(response.usage.cache_creation_input_tokens)  # tokens written to cache (1.25x cost)
print(response.usage.cache_read_input_tokens)      # tokens read from cache (0.10x cost)
print(response.usage.input_tokens)                 # uncached tokens (1x cost)

On the first request: cache_creation_input_tokens is high, cache_read_input_tokens is 0.
On every subsequent request within 5 minutes: cache_creation_input_tokens is 0, cache_read_input_tokens is high. That's the win condition.
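Those three counters are enough to compute what a request actually cost and how much of it came from cache. The following is a rough sketch, not an official formula: `billed_input_cost` is a made-up helper, it takes a plain dict with the same field names as response.usage, and it assumes the $3 per million input tokens and the 1.25× / 0.10× multipliers quoted earlier.

```python
def billed_input_cost(usage: dict, price_per_mtok: float = 3.00) -> dict:
    """Turn usage counters into a dollar cost and a cache hit ratio."""
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = written + read + uncached
    # Uncached tokens at 1x, cache writes at 1.25x, cache reads at 0.10x.
    cost = (uncached * 1.0 + written * 1.25 + read * 0.10) * price_per_mtok / 1_000_000
    return {
        "total_input_tokens": total,
        "cost_usd": round(cost, 6),
        "hit_ratio": read / total if total else 0.0,
    }


# Illustrative numbers for a warm request, not measured data.
warm = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 18_000, "input_tokens": 400}
print(billed_input_cost(warm))
```

Logging this per request gives you the hit ratio you want on a dashboard, and makes a sudden drop easy to spot.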

Pattern 2: Cache conversation history with rolling breakpoints

In a multi-turn agent, the conversation grows on every turn. If you only cache the system prompt, you're still re-sending and re-billing every prior turn at full price.

The trick is to add a second cache breakpoint on the most recent assistant message, so the entire conversation up to that point is cached:

def build_messages_with_cache(history, new_user_message):
    """
    history: list of {"role": "user"|"assistant", "content": str or list of block dicts}
    new_user_message: str
    """
    messages = list(history)
    if messages:
        last = messages[-1]
        content = last["content"]
        # Normalize to a list of content blocks, copying so history is not mutated
        if isinstance(content, str):
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = [dict(block) for block in content]
        # Add cache breakpoint on the last historical message
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
        messages[-1] = {"role": last["role"], "content": blocks}

    messages.append({"role": "user", "content": new_user_message})
    return messages

Now every new turn reads the entire prior conversation from cache. Cost per turn still grows with conversation length, but at roughly a tenth of the uncached rate, because prior turns are billed at the 0.10× cache-read price instead of full price.

The 4-breakpoint budget

Claude allows up to 4 cache breakpoints per request. A common production layout uses all four:

Breakpoint 1: end of tools
Breakpoint 2: end of system prompt
Breakpoint 3: end of "stable" conversation history (turns 1 through N-2)
Breakpoint 4: end of "recent" history (turn N-1)

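This gives you a layered cache: tools rarely change, system rarely changes, old history never changes, recent history is sliding. Because matching is prefix-based, a change in one layer invalidates that layer and every layer after it, while the layers before it still hit.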
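Since the budget is a hard limit, it is worth checking before the request goes out. Here is a small sketch of such a guard. Both helper names are hypothetical, and it assumes the request parts are plain dicts shaped like the examples in this article.

```python
MAX_BREAKPOINTS = 4  # the per-request limit stated above


def count_breakpoints(tools: list[dict], system: list[dict], messages: list[dict]) -> int:
    """Count blocks carrying a cache_control marker across tools, system, and messages."""
    count = sum(1 for tool in tools if "cache_control" in tool)
    count += sum(1 for block in system if "cache_control" in block)
    for message in messages:
        content = message["content"]
        if isinstance(content, list):  # plain-string content cannot carry a marker
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count


def check_breakpoint_budget(tools, system, messages) -> None:
    """Raise before sending if the request would exceed the breakpoint limit."""
    used = count_breakpoints(tools, system, messages)
    if used > MAX_BREAKPOINTS:
        raise ValueError(f"{used} cache breakpoints set; the limit is {MAX_BREAKPOINTS}")


tools = [{"name": "read_file"}, {"name": "grep", "cache_control": {"type": "ephemeral"}}]
system = [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}]
messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "hello", "cache_control": {"type": "ephemeral"}},
    ]},
    {"role": "user", "content": "next question"},
]
print(count_breakpoints(tools, system, messages))
check_breakpoint_budget(tools, system, messages)  # passes: 3 of 4 used
```

A guard like this is most useful in agent loops, where markers left behind on old messages can quietly push a long run over the limit.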

Pattern 3: Cache few-shot examples separately from the user query

Few-shot prompting is one of the highest-leverage techniques in production LLM apps — and one of the most expensive if you don't cache. A typical few-shot block with 5–10 examples can run 8,000–15,000 tokens.

FEW_SHOT_EXAMPLES = """
Example 1:
Input: ...
Output: ...

Example 2:
Input: ...
Output: ...

[... 8 more examples ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a classifier. Categorize support tickets.",
        },
        {
            "type": "text",
            "text": FEW_SHOT_EXAMPLES,
            "cache_control": {"type": "ephemeral"},  # cache the examples
        },
    ],
    messages=[{"role": "user", "content": user_ticket}],
)

Critical rule: put the variable content last. Cache only works on prefix matches. If your user-specific data is in the middle of the prompt, everything after it becomes uncacheable.

Pattern 4: RAG with cached document chunks

RAG systems are notorious for blowing up token bills because the retrieved context is large and unique per query. You can't cache the retrieved chunks themselves (they change), but you can cache the surrounding framework:

def rag_query(user_question, retrieved_chunks):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_INSTRUCTIONS,  # ~2000 tokens, stable
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    f"Context:\n{retrieved_chunks}\n\n"
                    f"Question: {user_question}"
                ),
            }
        ],
    )

For RAG with a stable knowledge base (corporate docs, product manuals, codebases), there's a more advanced pattern: pre-tile your documents into fixed-size cacheable blocks and choose your retrieval strategy to favor returning whole blocks rather than slices. You trade some retrieval precision for massive cost savings on hot documents.
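The tiling idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the pattern as we run it: blocks are measured in characters rather than tokens, the helper names are invented, and retrieval hits are represented as character offsets into the document.

```python
def tile_document(text: str, block_chars: int = 8_000) -> list[str]:
    """Split a document into fixed-size blocks whose boundaries depend only on position."""
    return [text[i:i + block_chars] for i in range(0, len(text), block_chars)]


def blocks_for_hits(hit_offsets: list[int], block_chars: int = 8_000) -> list[int]:
    """Map retrieval hit offsets to whole-block indices, deduplicated and in document order."""
    return sorted({offset // block_chars for offset in hit_offsets})


doc = "x" * 20_000
blocks = tile_document(doc)
print(len(blocks), [len(b) for b in blocks])
print(blocks_for_hits([15_000, 100, 7_999]))
```

Returning whole blocks in a fixed order means two queries that touch the same region of a hot document send an identical prefix, which is what a cache hit needs. Note that inserting text in the middle of a document shifts every later boundary, so this works best for documents that change rarely or only by appending.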

If you build RAG systems for production, the RAG (Retrieval-Augmented Generation) course on Cursuri-AI.ro covers caching strategies, retrieval optimization, hybrid search, and eval pipelines end-to-end.

Pattern 5: Cache tool results in long-running agents

Agent loops are caching's sweet spot. An agent runs tool_call → tool_result → tool_call → tool_result cycles, and each iteration the prompt grows by the new tool result. Without caching, you re-bill the entire history every iteration.

def agent_loop(initial_user_message, tools):
    messages = [{"role": "user", "content": initial_user_message}]

    while True:
        # Add cache breakpoint to the latest message
        cached_messages = messages[:-1] + [
            add_cache_breakpoint(messages[-1])
        ]

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
            tools=tools,
            messages=cached_messages,
        )

        # Stop on anything other than a tool request (end_turn, max_tokens, ...)
        if response.stop_reason != "tool_use":
            return response

        # Append assistant turn + tool results, loop
        messages.append({"role": "assistant", "content": response.content})
        tool_results = execute_tools(response.content)
        messages.append({"role": "user", "content": tool_results})


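def add_cache_breakpoint(message):
    content = message["content"]
    if isinstance(content, str):
        blocks = [{"type": "text", "text": content}]
    else:
        # Copy the blocks so the stored history is never mutated. Mutating in
        # place would leave a marker on every past tool result, and the request
        # would exceed the 4-breakpoint limit after a few iterations.
        blocks = [dict(block) for block in content]
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**message, "content": blocks}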

In a 15-step agent run with a 4,000-token system prompt and 8,000-token tools, this pattern cuts input cost by ~80–88% versus uncached.

Agent loops, tool design, multi-step planning and cost modeling are the focus of the AI Agents & Automation course on Cursuri-AI.ro — built around the same Claude Agent SDK patterns shown here.

Real benchmarks: before vs after

These numbers are from a production code-review agent running on Claude Sonnet 4.6, averaged over 1,000 conversations of 12 turns each.

Metric	Uncached	Cached	Change
Avg input tokens per turn	18,400	18,400	—
Avg billed input cost per turn	$0.0552	$0.0061	−89%
Avg time-to-first-token	1,840 ms	380 ms	−79%
Avg total cost per 12-turn conversation	$0.66	$0.10	−85%
Cache hit rate (warm)	—	96.3%	—

The latency win surprised us as much as the cost win. Cache reads skip reprocessing the cached prefix, and that prompt-processing phase dominates time-to-first-token for long contexts.

The pitfalls that silently kill your hit rate

These are mistakes we've made or seen in production code reviews.

1. Whitespace and formatting drift

Cache hits require byte-exact prefix matches. If your system prompt is built with f-strings and you add a timestamp, conditional newline, or trailing space, you invalidate the cache:

from datetime import datetime

# BREAKS the cache on every request: the timestamp differs on each call
system = f"You are a helpful assistant. Current time: {datetime.now()}"

# Works
system = "You are a helpful assistant."
# Pass time as a separate user message field if needed

Audit your prompts for hidden variability: locale-formatted numbers, dict iteration order in older Pythons, tool definitions where field order changes between deploys.

2. Reordering tool definitions

If you generate tool schemas from a dict and the dict iteration order changes between runs, your cache evaporates. Always sort tool definitions before sending:

tools = sorted(generate_tools(), key=lambda t: t["name"])
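One way to catch this kind of drift before it shows up on the bill is to log a fingerprint of the prefix exactly as you build it. The sketch below is a local proxy only, since the real match happens on Anthropic's side. `prefix_fingerprint` and the sample tools are made up for illustration.

```python
import hashlib
import json


def prefix_fingerprint(tools: list[dict], system_text: str) -> str:
    """Short hash of the prefix as built, so any reordering or edit changes it."""
    payload = json.dumps({"tools": tools, "system": system_text}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def by_name(tool: dict) -> str:
    return tool["name"]


# Hypothetical tool definitions, identical except for their order.
tools_a = [
    {"name": "read_file", "description": "Read a file"},
    {"name": "grep", "description": "Search files"},
]
tools_b = list(reversed(tools_a))
SYSTEM = "You are a helpful assistant."

unsorted_match = prefix_fingerprint(tools_a, SYSTEM) == prefix_fingerprint(tools_b, SYSTEM)
sorted_match = (
    prefix_fingerprint(sorted(tools_a, key=by_name), SYSTEM)
    == prefix_fingerprint(sorted(tools_b, key=by_name), SYSTEM)
)
print(unsorted_match, sorted_match)  # False True
```

If the fingerprint changes between two deploys that were not supposed to touch the prompt, you have found your cache miss before it reaches the hit-rate dashboard.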

3. Wrong breakpoint placement

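Breakpoints must come after the content you want to cache, not before. The breakpoint marks "cache everything up to here." A common rookie mistake is putting the only breakpoint on a user message that changes on every request instead of on the stable system prompt: each request then writes a fresh cache entry that no later request can read.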

4. Caching tiny prefixes

There's a minimum cacheable size:

Claude Sonnet & Opus: 1,024 tokens
Claude Haiku: 2,048 tokens

Below the minimum, the cache_control is silently ignored — the API doesn't error, it just doesn't cache. Always check response.usage.cache_creation_input_tokens > 0 on your first request to confirm the cache actually wrote.

5. Ignoring the 5-minute TTL on bursty traffic

If your traffic is bursty — heavy during business hours, dead overnight — the 5-minute cache will expire between sessions and you'll pay the write premium every time. For bursty patterns, either:

Use the 1-hour TTL (more expensive write, same read price)
Or send a small "keep-alive" request every 4 minutes during expected idle windows

6. Mixing cached and uncached models

Cache is model-specific. If your code falls back from Sonnet 4.6 to Haiku 4.5 on rate limit, the Haiku call has no cache history. Either keep fallback paths uncached, or build separate caches per model.

When NOT to use prompt caching

Caching has overhead. Skip it when:

One-shot calls with no shared prefix — single-request classification, one-off summarization. The 1.25× write premium is pure loss.
High-variability prompts — if each request has different boilerplate, you're paying write premium for nothing.
Prompts below the minimum — short prompts can't be cached.
Cost is already negligible — if you spend $20/month on the API, the engineering time to optimize caching costs more than the savings.

A useful heuristic: if your stable prefix is ≥2,000 tokens AND you make ≥3 requests per 5-minute window with that prefix, cache it.

Putting it together: a production checklist

Before you ship a Claude integration in 2026, run this list:

[ ] System prompt has cache_control set
[ ] Tool definitions are sorted and stable
[ ] User-variable content is at the end of the prompt, not in the middle
[ ] Cache stats (cache_read_input_tokens) are logged and dashboarded
[ ] Cache hit rate is monitored — alert if it drops below 80%
[ ] No timestamps, request IDs, or random data injected into cached blocks
[ ] First-request cache write is verified in tests
[ ] Fallback model paths handle cache absence cleanly
[ ] 5-minute vs 1-hour TTL choice is documented with reasoning

Wrapping up

Prompt caching is the single highest-leverage cost optimization for Claude in production. The mechanics are simple, but the gotchas — formatting drift, reorder bugs, minimum sizes, TTL mismatches — are where teams leave money on the table.

If you treat caching as a first-class concern from day one, you ship AI features that are 5–10× cheaper to operate than the naive implementation. If you bolt it on later, you spend weeks chasing cache misses through your logging.

Where to go deeper

I write about production AI engineering — Claude API, multi-agent systems, RAG, cost optimization — on Cursuri-AI.ro, an interactive learning platform with an always-available AI tutor that walks you through every concept and reviews your code. The four courses most relevant to what's in this article:

Advanced LLM Integration — Claude API in production: prompt caching, tool use, batch API, streaming, error handling, retries
Prompt Engineering Masterclass — structured prompting, few-shot patterns, evaluation, prompt versioning
AI Agents & Automation — agent loops, tool design, multi-agent orchestration, cost modeling
RAG (Retrieval-Augmented Generation) — retrieval, embeddings, hybrid search, caching, eval pipelines

Course content is delivered in Romanian (the platform's primary audience), but the code, frameworks, and patterns are language-agnostic — the IT Pro track is built specifically for engineers shipping AI in production.

What's your cache hit rate in production? Drop a comment with your setup — I'm collecting patterns for a follow-up post on caching at the multi-tenant scale (per-customer cache namespaces, cache warm-up strategies, and the cost model when you have 10,000+ concurrent users).

If this helped, a ❤️ or a 🦄 keeps it visible for other devs hitting the same cost wall. Follow for more deep-dives on Claude in production.

Related reading:

Anthropic's official prompt caching docs: docs.anthropic.com
Claude API pricing: anthropic.com/pricing
Full IT Pro AI engineering catalog: Cursuri-AI.ro/courses

AI for Influencers in 2026: How to Build a Content Engine That Runs Itself

galian — Tue, 19 May 2026 13:34:41 +0000

The influencer economy is no longer about who posts the most. It's about who has built the smartest AI content system behind the scenes.

In 2026, the top 1% of creators aren't outworking everyone else. They're out-engineering them. They've turned what used to be a 60-hour-a-week grind into a streamlined pipeline where AI handles 80% of the production work — and they keep 100% of the creative direction.

Over the past two years, working with hundreds of creators and educators through Cursuri-AI.ro — Eastern Europe's leading AI education platform — I've watched this shift happen in real time. The patterns are consistent, the playbook is replicable, and the gap between those who adopt it and those who don't is widening every month.

This article breaks down exactly how it works, what tools they use, and how you can build the same stack — whether you're an influencer who codes, or a developer building tools for creators.

Why AI Changed the Influencer Game (Permanently)

Three years ago, an influencer's competitive advantage was personality plus consistency. Today, that's table stakes.

The real moat now is operational leverage:

How fast can you identify a trending topic?
How quickly can you produce content across 5+ formats?
How precisely can you target each piece to its platform?
How much of this can run without your direct involvement?

The creators who answered "all of it, mostly automated" are the ones scaling past 1M followers, 7-figure revenues, and 50+ pieces of content per week — solo or with tiny teams.

This isn't theoretical. It's already happening. The question is whether you're building the system or watching others build it.

The 5-Layer AI Stack for Modern Influencers

Every high-output creator I've analyzed runs some version of this five-layer architecture. The tools change. The structure doesn't.

Layer 1: Intelligence (Research & Trend Detection)

Before you create, you need to know what to create.

What it does:

Monitors trending topics, keywords, and conversations in your niche
Analyzes competitor content performance
Identifies content gaps and opportunities
Surfaces audience questions before they become saturated

Tools and APIs:

Perplexity API — for real-time research with citations
Exa AI — semantic search for niche topics
Google Trends API + YouTube Data API — for trend signals
Reddit API + Twitter/X API — for audience listening
BuzzSumo or SparkToro — for content gap analysis

Pro tip: Don't just track what's popular. Track what's about to become popular by monitoring signal velocity (rate of change), not absolute volume.
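Signal velocity is easy to compute once you have per-period counts for each topic. Here is a minimal sketch. The mention counts are invented for illustration, and relative week-over-week growth is just one reasonable definition of velocity.

```python
def velocity(counts: list[int]) -> float:
    """Relative growth between the previous period and the latest one."""
    if len(counts) < 2 or counts[-2] == 0:
        return 0.0
    return (counts[-1] - counts[-2]) / counts[-2]


# Made-up weekly mention counts, oldest first.
weekly_mentions = {
    "ai agents": [9000, 9400, 9700],
    "voice cloning": [120, 260, 610],
    "prompt packs": [3000, 2800, 2500],
}

ranked = sorted(weekly_mentions, key=lambda topic: velocity(weekly_mentions[topic]), reverse=True)
print(ranked)
```

In this toy data "ai agents" has by far the most volume, but "voice cloning" ranks first because it more than doubled in a week. That is the kind of topic the pro tip is pointing at.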

Layer 2: Ideation (Concept & Angle Generation)

This is where most creators waste the most time — staring at a blank page deciding what to make.

What AI does well here:

Generates 30+ angle variations from a single topic
Adapts ideas to your specific voice and audience
Identifies counterintuitive takes that drive engagement
Maps ideas to platform-specific formats

Recommended approach:

Build a custom GPT or Claude project loaded with:

Your past top-performing content (with metrics)
Your audience persona and voice guidelines
Your content pillars and forbidden topics

💡 If you've never structured a voice profile before, this is one of the highest-leverage skills you can develop. We dedicate an entire module to it inside our AI for Content Creators track on Cursuri-AI.ro — including the exact prompts and templates we use internally.

Then prompt it like this:

from anthropic import Anthropic

client = Anthropic()

def generate_content_angles(topic: str, voice_profile: str, n: int = 20):
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4000,
        system=f"""You are a content strategist for an influencer with this profile:
        {voice_profile}

        Generate angles that are specific, counterintuitive, and aligned with their voice.
        Avoid generic takes. Each angle should be testable as a hook.""",
        messages=[{
            "role": "user",
            "content": f"Give me {n} distinct angles for content about: {topic}"
        }]
    )
    return response.content[0].text

angles = generate_content_angles(
    topic="building a personal brand in 2026",
    voice_profile="Direct, data-driven, contrarian, B2B-focused",
    n=25
)
print(angles)

The output of this single function call can fuel a month of content. Cost: ~$0.15.

Layer 3: Production (Multi-Format Content Generation)

This is the heaviest-lifting layer — and where AI compounds value most.

The repurposing principle:

One "pillar" piece (a long-form video, podcast, or article) should generate 10–15 derivative pieces with minimal manual work.

Sample workflow for a 30-minute podcast episode:

Transcription → Whisper API or AssemblyAI ($0.36 for 30 min)
Long-form blog post → Claude/GPT generates structured article from transcript
LinkedIn carousel → 8–10 slide deck with key insights
Twitter/X thread → 10-tweet thread with the strongest takes
Short-form clips → Opus Clip or Riverside AI extracts viral moments
Newsletter → Personalized summary with commentary
YouTube Shorts → Auto-captioned vertical clips
Quote graphics → Designed via Canva API or Bannerbear
Instagram Reels → Repurposed clips with platform-native captions
SEO blog series → 3–5 articles targeting specific search queries

Total human time: 1–2 hours of review and approval, instead of 30+ hours of production.

Layer 4: Distribution (Platform-Native Publishing)

Most creators lose performance here by posting the same content identically across platforms. AI fixes this by adapting each piece to the platform's native expectations.

Adaptive distribution looks like:

LinkedIn → Professional tone, longer-form, hook in first 2 lines
Twitter/X → Punchy, opinionated, thread-friendly
Instagram → Visual-first, emotion-driven captions
TikTok → Hook in 1 second, vertical, trend-aware
YouTube → SEO-optimized titles, timestamps, structured descriptions

Tools:

Buffer, Hypefury, or Typefully — scheduling with AI optimization
Make or n8n — custom automation workflows
Postiz (open source) — self-hosted social scheduling

Layer 5: Optimization (Performance Feedback Loop)

This is the layer most creators skip — and it's the one that compounds the hardest.

What to track:

Hook performance (which first lines drive scroll-stops?)
Format performance (which content types convert best per platform?)
Topic performance (which themes consistently win?)
Audience signals (which content brings in your ICP vs. tourists?)

How AI helps:

Analyzes patterns across hundreds of posts in seconds
Identifies non-obvious performance correlations
Suggests next-week content based on last week's winners
Drafts variations of top performers for retesting

Build a simple dashboard that ingests your analytics from each platform and feeds it back to your ideation layer. This closes the loop — every post makes the next one smarter.

A Minimal Working Example: Content Repurposing Pipeline

Here's a stripped-down Python pipeline that takes a transcript and produces three platform-adapted outputs. Useful as a starting point you can extend.

import json
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-7"

def repurpose_content(transcript: str, voice: str) -> dict:
    """Generate LinkedIn post, Twitter thread, and newsletter from a transcript."""

    prompt = f"""You are an expert content strategist. The creator's voice is: {voice}

    From the transcript below, produce THREE outputs in JSON:
    1. "linkedin": A 200-word LinkedIn post with strong hook
    2. "twitter_thread": An 8-tweet thread (array of strings, max 280 chars each)
    3. "newsletter": A 400-word personal newsletter section

    Each must feel platform-native, not copy-pasted.

    Transcript:
    {transcript}

    Return only valid JSON."""

    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}]
    )

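    text = response.content[0].text
    # Tolerate stray prose or a Markdown fence around the JSON object
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start:end + 1])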


if __name__ == "__main__":
    sample_transcript = """[Your podcast/video transcript here]"""
    voice = "Direct, contrarian, B2B-focused, data-driven"

    outputs = repurpose_content(sample_transcript, voice)

    print("=== LINKEDIN ===")
    print(outputs["linkedin"])
    print("\n=== TWITTER THREAD ===")
    for i, tweet in enumerate(outputs["twitter_thread"], 1):
        print(f"{i}/ {tweet}")
    print("\n=== NEWSLETTER ===")
    print(outputs["newsletter"])

Extend this with:

Whisper for audio-to-text input
A queue system (Redis + Celery) for batch processing
A simple Streamlit UI for non-technical creator team members
Webhook integration with Buffer or Typefully for direct publishing

The 5 Mistakes That Kill AI Content Pipelines

I've audited dozens of creator AI workflows. The same mistakes appear over and over.

1. Treating AI as a Writer Instead of a Drafter

AI-generated text published without human editing is detectable, generic, and erodes trust. Use AI for the first 80%, but always edit the final 20% — that's where your voice lives.

2. Skipping the Voice Calibration Step

Without a documented voice profile (tone, vocabulary, forbidden phrases, examples), every output regresses to the mean. Spend 4 hours documenting your voice once. It pays back for years. If you want a structured framework for this, we walk through the full process in our AI workflow courses.

3. Building Without Measurement

Pipelines without analytics are vibes-based content factories. If you can't tell which output formats win, you're optimizing blind.

4. Over-Automating Distribution

Full automation of posting (no human in the loop) is how creators end up with embarrassing posts going live during global news events. Keep a 1-click approval step at minimum.

5. Choosing Tools Over Architecture

The creators who win don't have the best tools. They have the clearest workflow. Tools change every quarter. Architecture compounds.

What's Coming Next (2026–2027)

A few signals worth watching:

Personalized AI clones — creators training models on their voice/likeness to scale 1:1 audience interaction
Multimodal generation at scale — single prompts producing full video, audio, and graphics in one pass
AI-native platforms — new social networks built around AI-generated content as a first-class citizen
Agent-driven content ops — autonomous agents that research, produce, schedule, and optimize with minimal human input

The creators preparing for this now — by building modular, API-driven systems — will be the ones operating at unprecedented scale by 2027.

FAQ: AI for Influencers

Q: Do I need to code to use AI as an influencer?

No. Many top creators use no-code tools (Zapier, Make, ChatGPT, Claude Projects). But knowing even basic Python unlocks 10x more customization.

Q: Will AI-generated content hurt my reach?

Only if it sounds generic. Platforms penalize low-effort content, not AI assistance. Original voice + AI scaffolding consistently outperforms 100% human or 100% AI.

Q: How much should I budget for AI tools?

A solo creator can build a complete stack for $50–150/month. Larger operations run $500–2000/month. ROI is usually measured in weeks, not months.

Q: Is this ethical? Should I disclose AI usage?

Be transparent about what AI does in your workflow (research, drafting, editing), but you don't need to flag every AI-touched word. The standard: would your audience feel deceived if they saw your process? If no, you're fine.

Q: Which AI model should I use as a creator?

For creative content: Claude tends to lead. For research with citations: Perplexity. For images: Midjourney or Flux. For video: Runway or Sora. Test all of them — they each have strengths.

Conclusion: Build the System, Not the Output

The influencer economy is splitting into two clear tiers.

The first tier still manually crafts every piece of content. They post when they have time. They burn out. They plateau.

The second tier has built systems. AI handles the heavy lifting. They post consistently across every platform. Their content compounds because their architecture compounds.

The gap between these two tiers is widening every month. And by 2027, it will be unbridgeable for those who waited too long to start.

The good news: building your AI content engine doesn't require a team or a six-figure budget. It requires clear thinking, a few APIs, and the willingness to treat content like the engineering problem it actually is.

Start with one layer. Make it work. Add the next.

That's how the top 1% built it. And it's how you build it too.

Want to Go Deeper?

If this resonated and you want a structured path instead of piecing it together from scattered blog posts and YouTube videos:

🎓 Cursuri-AI.ro — Our complete AI education platform covers the entire creator stack: prompting, automation, content pipelines, AI workflows for business, and how to build production-grade AI systems. Interactive courses with an AI tutor that adapts to how you learn — not passive video watching.

Whether you're a creator looking to scale, a developer building tools for the creator economy, or a business owner figuring out how to integrate AI into your operations — start here.

About the Author

I'm the founder of Cursuri-AI.ro, where I help thousands of creators, professionals, and businesses build with AI. I write about AI workflows, content automation, and the engineering side of the creator economy.

If this article helped, drop a reaction and follow for more deep dives. What layer of your content stack are you working on right now? Let me know in the comments — I read every one.

7 Production Patterns for AI Agents That Don't Break in 2026

galian — Wed, 13 May 2026 11:38:37 +0000

A demo agent that loops three times, calls one tool, and returns "Hello, I helped you" is easy. A production agent that handles 10k requests a day across paying customers, without lighting your API bill on fire or hallucinating tool arguments at 3am, is a different animal.

I've shipped AI agents in production for the last 18 months — search, content generation, support triage, document analysis. The same seven patterns keep showing up in every codebase that actually works. None of them are exotic. Most of them are boring. That's the point: production agents are boring on purpose.

Here are the patterns, with Python examples you can drop into your own loop today.

1. The Tool Result Validator

Problem: LLMs hallucinate tool arguments. They will confidently call send_email(to="user@example.com", subject="Refund", body="...") when the user never asked for an email. They will pass user_id="123abc" to a function that requires an integer. They will invent product SKUs that don't exist.

If your tool layer trusts the model's output, every hallucination becomes a production incident.

Pattern: Validate tool arguments at the tool boundary, not inside the tool. Reject early with a structured error the model can recover from.

from pydantic import BaseModel, ValidationError

class SendEmailArgs(BaseModel):
    to: str
    subject: str
    body: str
    requires_user_confirmation: bool = True

def execute_tool(name: str, raw_args: dict) -> dict:
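    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        # The model can hallucinate tool names as well as arguments
        return {
            "status": "error",
            "error_type": "unknown_tool",
            "message": f"No tool named {name!r} is registered.",
        }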
    try:
        args = schema.model_validate(raw_args)
    except ValidationError as e:
        return {
            "status": "error",
            "error_type": "invalid_arguments",
            "message": f"Tool call rejected. Fix these fields: {e.errors()}",
        }

    if name == "send_email" and args.requires_user_confirmation:
        return {"status": "pending_confirmation", "preview": args.model_dump()}

    return TOOLS[name](args)

Gotcha: Always return the validation error back to the model as a tool result. Don't raise it. The agent can usually self-correct in the next turn — but only if it sees the error.
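Feeding the error back can look like the sketch below. It assumes the Anthropic Messages API, where a tool outcome goes back to the model as a tool_result content block in the next user message. `to_tool_result_block` is a hypothetical helper name, and the result dict follows the shape returned by execute_tool above.

```python
import json


def to_tool_result_block(tool_use_id: str, result: dict) -> dict:
    """Wrap a tool outcome (success or validation error) as a tool_result block."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(result),
        # Flag failures so the model treats this as something to fix, not as data.
        "is_error": result.get("status") == "error",
    }


rejected = {
    "status": "error",
    "error_type": "invalid_arguments",
    "message": "Tool call rejected. Fix these fields: ...",
}
block = to_tool_result_block("toolu_example_id", rejected)  # placeholder id
next_message = {"role": "user", "content": [block]}
print(next_message)
```

The loop then appends next_message to the history and calls the model again, which gives it a chance to retry with corrected arguments.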

2. Bounded Memory

Problem: Naive agent loops accumulate every tool call, every observation, every reasoning step into the conversation history. After 15 turns, you're sending 80k tokens per request. Your latency doubles. Your cost goes up 10x. The model starts losing track of what it was doing because the relevant context is buried under five tool dumps.

Pattern: Treat conversation history as a finite resource. Compress aggressively, summarize old turns, and keep tool outputs out of the main thread when you can.

class BoundedMemory:
    def __init__(self, max_tokens: int = 32_000, summarize_at: int = 24_000):
        self.messages: list[dict] = []
        self.max_tokens = max_tokens
        self.summarize_at = summarize_at

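    def add(self, message: dict) -> None:
        self.messages.append(message)
        if self._token_count() > self.summarize_at:
            self._compress()

    def _token_count(self) -> int:
        # Rough estimate (~4 characters per token); swap in your model's tokenizer
        return sum(len(str(m)) for m in self.messages) // 4

    def _compress(self) -> None:
        # Keep the first message (assumed to be the system prompt)
        # plus the last 8 messages (roughly 4 turns) verbatim
        keep_recent = self.messages[-8:]
        to_summarize = self.messages[1:-8]
        if not to_summarize:
            return
        # summarize_with_llm is your own helper: one cheap model call
        summary = summarize_with_llm(to_summarize, max_tokens=2_000)
        self.messages = (
            [self.messages[0]]
            + [{"role": "user", "content": f"<earlier_context>{summary}</earlier_context>"}]
            + keep_recent
        )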

Gotcha: Don't summarize tool call messages — the model needs the exact arguments to chain reasoning. Summarize only the observations, and only when they're old enough that detail no longer matters.
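One way to apply that gotcha, as a rough sketch: leave tool-call messages alone and shorten only the old tool observations. The role/content message shape, the helper name, and the keep_recent and max_chars defaults are assumptions for illustration, not part of the BoundedMemory class.

```python
def trim_old_observations(messages: list[dict], keep_recent: int = 8,
                          max_chars: int = 200) -> list[dict]:
    """Shorten old tool observations; leave every other message untouched."""
    cutoff = max(len(messages) - keep_recent, 0)
    trimmed = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        is_old_observation = (
            i < cutoff
            and msg.get("role") == "tool"
            and isinstance(content, str)
            and len(content) > max_chars
        )
        if is_old_observation:
            # Copy the message so the caller's history is not mutated
            msg = {**msg, "content": content[:max_chars] + " ...[truncated]"}
        trimmed.append(msg)
    return trimmed


history = [
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "search", "arguments": {"q": "refund policy"}}]},
    {"role": "tool", "content": "x" * 5_000},
    {"role": "user", "content": "thanks"},
]
compact = trim_old_observations(history, keep_recent=1)
```

The assistant message with the exact tool arguments survives byte for byte; only the 5,000-character observation shrinks.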

3. The Observable Loop

Problem: Your agent is in production. A user complains it gave them garbage. You have... a final string output and a vague memory of what the loop does. Good luck debugging.

Pattern: Emit a structured event for every state transition in the loop. Every model call, every tool call, every retry, every error. Ship them to whatever observability stack you already use (Datadog, Honeycomb, OpenTelemetry, even just structured JSON to stdout).

import time
import uuid
from contextlib import contextmanager

@contextmanager
def trace_step(run_id: str, step: str, **attrs):
    span_id = str(uuid.uuid4())
    start = time.perf_counter()
    log_event("step.start", run_id=run_id, span_id=span_id, step=step, **attrs)
    try:
        yield span_id
        log_event("step.end", run_id=run_id, span_id=span_id, step=step,
                  status="ok", duration_ms=(time.perf_counter() - start) * 1000)
    except Exception as e:
        log_event("step.end", run_id=run_id, span_id=span_id, step=step,
                  status="error", error=str(e),
                  duration_ms=(time.perf_counter() - start) * 1000)
        raise

def run_agent(task: str) -> str:
    run_id = str(uuid.uuid4())
    memory = BoundedMemory()
    memory.add({"role": "user", "content": task})

    for turn in range(MAX_TURNS):
        with trace_step(run_id, "model_call", turn=turn):
            response = call_model(memory.messages)
        memory.add(response)

        if not response.tool_calls:
            return response.content

        for call in response.tool_calls:
            with trace_step(run_id, "tool_call", tool=call.name, turn=turn):
                result = execute_tool(call.name, call.arguments)
            memory.add({"role": "tool", "tool_call_id": call.id, "content": result})

    return "Max turns exceeded"

Gotcha: Include a stable run_id on every event. When a customer reports an issue, you want one query that returns the entire trace.
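The snippets here call log_event without defining it. A minimal stand-in, assuming structured JSON lines on stdout are enough for your stack, could look like this. The field names "event" and "ts" and the format_event helper are my own choices, not a standard.

```python
import json
import sys
import time

def format_event(event: str, **attrs) -> str:
    """Render one event as a single JSON line."""
    record = {"event": event, "ts": time.time(), **attrs}
    # default=str keeps logging from crashing on values like UUIDs or exceptions
    return json.dumps(record, default=str, sort_keys=True)

def log_event(event: str, **attrs) -> None:
    """Write the event to stdout; a log shipper can pick it up from there."""
    sys.stdout.write(format_event(event, **attrs) + "\n")

log_event("step.start", run_id="run-123", step="model_call", turn=0)
```

Because every line carries run_id, filtering the log stream on that one field returns the whole trace.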

4. Graceful Degradation

Problem: Your agent depends on three external services and a vector store. One of them is having a bad day. Your agent now returns a 500 to the user, even though for this particular query the broken dependency wasn't actually needed.

Pattern: Wrap dependencies in fallback chains. If the primary fails, the agent should know that capability is degraded — not crash.

from typing import Callable

class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, list[Callable]] = {}
        self.health: dict[str, bool] = {}

    def register(self, name: str, *implementations: Callable) -> None:
        self.tools[name] = list(implementations)

    def call(self, name: str, args: dict) -> dict:
        for i, impl in enumerate(self.tools[name]):
            try:
                result = impl(args)
                self.health[f"{name}:{i}"] = True
                return {"status": "ok", "result": result, "tier": i}
            except Exception as e:
                self.health[f"{name}:{i}"] = False
                log_event("tool.fallback", tool=name, tier=i, error=str(e))
                continue
        return {
            "status": "degraded",
            "message": f"Tool '{name}' is unavailable. Try a different approach.",
        }

The crucial bit is the degraded response — it goes back to the model as a tool result, and a well-prompted agent will re-plan. Maybe it tries a different tool. Maybe it tells the user "I can't check live inventory right now, but here's what I know." Either is better than a 500.

Gotcha: Surface the degraded status in your prompt. A line like "If a tool returns status=degraded, do not retry it. Acknowledge the limitation in your final response." prevents the model from looping on a dead service.

5. The Cost Circuit Breaker

Problem: A bug or an adversarial input puts your agent in a tool-calling loop. By the time you notice, you've spent $400 in 20 minutes.

Pattern: Track cumulative cost per run and per session. Hard-stop when limits are exceeded. This is not optional in production — it's the difference between a bad day and a layoff conversation.

class CostBudget:
    def __init__(self, max_usd_per_run: float = 0.50, max_usd_per_user_per_day: float = 5.00):
        self.run_cost = 0.0
        self.max_run = max_usd_per_run
        self.max_day = max_usd_per_user_per_day

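    def charge(self, usage: dict, model: str) -> None:
        cost = compute_cost(usage, model)
        self.run_cost += cost
        # Also add `cost` to the per-user daily counter that precheck_user reads,
        # otherwise the daily limit never triggers
        if self.run_cost > self.max_run:
            raise BudgetExceeded(f"Run exceeded ${self.max_run}")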

    def precheck_user(self, user_id: str) -> None:
        spent_today = redis.get(f"cost:{user_id}:{today()}") or 0
        if float(spent_today) > self.max_day:
            raise BudgetExceeded(f"User {user_id} exceeded daily budget")

Gotcha: Different limits for different surfaces. An internal batch job can have a $5 ceiling per run. A free-tier chat user gets $0.10. A paying enterprise customer gets $2. Hardcoding one number is a footgun.
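A small sketch of per-surface limits, using the three ceilings from the gotcha. The surface names, the BudgetLimits dataclass, and the "unknown surface gets the strictest limit" rule are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BudgetLimits:
    max_usd_per_run: float

# Ceilings taken from the examples in the text; tune them for your own traffic
SURFACE_LIMITS = {
    "internal_batch": BudgetLimits(max_usd_per_run=5.00),
    "free_chat": BudgetLimits(max_usd_per_run=0.10),
    "enterprise": BudgetLimits(max_usd_per_run=2.00),
}

def limits_for(surface: str) -> BudgetLimits:
    """Look up limits; unknown surfaces fall back to the strictest ceiling."""
    strictest = min(SURFACE_LIMITS.values(), key=lambda limits: limits.max_usd_per_run)
    return SURFACE_LIMITS.get(surface, strictest)

free_tier = limits_for("free_chat")
```

You would then pass limits_for(surface).max_usd_per_run into CostBudget instead of relying on its default.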

6. The Deterministic Critic

Problem: "LLM-as-a-judge" sounds clever, but using a model to grade itself is unreliable and slow. Two model calls per output, both hallucinate, both cost money.

Pattern: For checks you can express as code, use code. Reserve LLM grading for genuinely subjective dimensions, and only after the deterministic checks pass.

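import re

# BANNED_PHRASES is assumed to be a precompiled regex (re.compile(...)) defined elsewhere
class OutputCritic: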
    def evaluate(self, output: str, context: dict) -> dict:
        issues = []

        if context.get("must_cite_sources") and not re.search(r"\[\d+\]", output):
            issues.append("missing_citations")

        if context.get("max_length") and len(output) > context["max_length"]:
            issues.append("too_long")

        if BANNED_PHRASES.search(output):
            issues.append("banned_phrase")

        if context.get("must_mention"):
            missing = [k for k in context["must_mention"] if k.lower() not in output.lower()]
            if missing:
                issues.append(f"missing_keywords:{missing}")

        if issues:
            return {"verdict": "reject", "issues": issues, "method": "deterministic"}

        if context.get("subjective_check"):
            return llm_grade(output, context["subjective_check"])

        return {"verdict": "accept", "method": "deterministic"}

When the critic rejects, feed the issues back to the agent as a "revise this" instruction. After two rejections, return whatever you have with a flag — infinite revision loops are their own bug class.

Gotcha: Don't make the critic too strict. If your accept rate is below 70%, your prompt is broken, not your output.
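The "two rejections, then return with a flag" rule can be sketched as a small loop. The generate and evaluate callables, and the shape of the returned dict, are hypothetical stand-ins for your agent call and for OutputCritic.evaluate.

```python
from typing import Callable

def revise_until_accepted(
    generate: Callable[[list], str],
    evaluate: Callable[[str], dict],
    max_rejections: int = 2,
) -> dict:
    """Call generate, feeding critic issues back, until accepted or capped."""
    issues: list = []
    rejections = 0
    while True:
        output = generate(issues)
        verdict = evaluate(output)
        if verdict["verdict"] == "accept":
            return {"output": output, "flagged": False, "rejections": rejections}
        rejections += 1
        if rejections >= max_rejections:
            # Stop revising: return what we have, flagged for review
            return {"output": output, "flagged": True, "rejections": rejections}
        issues = verdict.get("issues", [])


def toy_generate(issues: list) -> str:
    return "short answer" if "too_long" in issues else "a very long answer " * 10

def toy_evaluate(output: str) -> dict:
    if len(output) > 50:
        return {"verdict": "reject", "issues": ["too_long"]}
    return {"verdict": "accept"}

result = revise_until_accepted(toy_generate, toy_evaluate)
```

Here the first draft is rejected once, the revision passes, and the flag stays off. A generator that never improves exits after the second rejection with flagged set to True.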

7. Stateless Replay (Idempotency)

Problem: Your agent half-completed a task — it sent the email, then crashed before logging the result. The user retries. Now they get two emails.

Pattern: Treat every external side-effect as idempotent by design. Use deterministic IDs derived from the input, dedupe at the tool layer, and make agent runs replayable from any saved checkpoint.

import hashlib
import json

def idempotency_key(tool_name: str, args: dict) -> str:
    canonical = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def execute_tool_idempotent(name: str, args: dict, run_id: str) -> dict:
    key = idempotency_key(name, args)
    cache_key = f"tool_result:{run_id}:{key}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    result = TOOLS[name](args)
    redis.setex(cache_key, 3600, json.dumps(result))
    return result

Now if the agent retries the same step within the run, it gets the cached result. If you persist the cache across runs (with a longer TTL), you get cross-run idempotency too — which is what you want for anything that costs money or sends messages.

Gotcha: Be careful what you put in the idempotency key. Timestamps, request IDs, or random nonces in the args will defeat it. Strip them before hashing.
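A sketch of stripping volatile fields before hashing. The field names in VOLATILE_FIELDS are examples I picked, and this version only looks at top-level keys; nested args would need a recursive variant.

```python
import hashlib
import json

# Field names are examples; list whatever volatile fields your tools receive
VOLATILE_FIELDS = frozenset({"timestamp", "request_id", "nonce"})

def stable_idempotency_key(tool_name: str, args: dict) -> str:
    """Hash the tool name and args after dropping volatile top-level fields."""
    stable_args = {k: v for k, v in args.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps({"tool": tool_name, "args": stable_args}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

first = stable_idempotency_key("send_email", {"to": "a@example.com", "timestamp": 1})
retry = stable_idempotency_key("send_email", {"to": "a@example.com", "timestamp": 2})
```

Both calls produce the same key, so the retry hits the cache even though the timestamp changed.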

Putting It Together

A production agent loop using all seven patterns is roughly 200 lines of Python. Not glamorous, but it survives. Here's the skeleton:

def run_agent_production(task: str, user_id: str) -> str:
    run_id = str(uuid.uuid4())
    budget = CostBudget()
    budget.precheck_user(user_id)

    memory = BoundedMemory(max_tokens=32_000)
    memory.add({"role": "system", "content": SYSTEM_PROMPT})
    memory.add({"role": "user", "content": task})

    critic = OutputCritic()

    for turn in range(MAX_TURNS):
        with trace_step(run_id, "model_call", turn=turn):
            response = call_model(memory.messages)
            budget.charge(response.usage, response.model)

        memory.add(response.message)

        if not response.tool_calls:
            verdict = critic.evaluate(response.content, task_context())
            if verdict["verdict"] == "accept":
                return response.content
            memory.add({"role": "user", "content": f"Revise: {verdict['issues']}"})
            continue

        for call in response.tool_calls:
            with trace_step(run_id, "tool_call", tool=call.name, turn=turn):
                args = call.arguments
                result = execute_tool_idempotent(call.name, args, run_id)
            # Tool results are dicts; most chat APIs expect string content
            memory.add({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})

    return "Task incomplete after max turns"

That's the loop. Drop in your favorite model API (Claude, GPT, open source — patterns work the same), wire up your tools with the validator from pattern 1, and you have something that won't embarrass you in production.

What I'd Read Next

Anthropic's "Building effective agents" guide — the canonical reference on when to use agents vs simple workflows.
OpenAI's Agents SDK docs — clean reference implementation of multi-agent handoffs.
For Romanian-speaking developers building agents in production, the AI Agents course on Cursuri-AI.ro goes deeper on these patterns with hands-on exercises.

If you've shipped agents in production, what patterns did I miss? Drop them in the comments — I'll add the best ones to a follow-up post.

Written by a developer who has paged themselves at 3am because an agent went into a tool-calling loop. Don't be that developer. Use the circuit breaker.

Fine-Tuning LLMs in 2026: A Practical Guide for Engineers (LoRA, QLoRA, DPO, GRPO)

galian — Fri, 01 May 2026 20:31:02 +0000

Fine-tuning has gone from "research lab toy" to a first-class production technique for AI engineers. With LoRA-class adapters, modern alignment algorithms (DPO, GRPO, RLVR), and serving stacks like vLLM, you can ship a custom model on a single H100 — sometimes on a single 4090.

But the question isn't can you fine-tune. It's: should you?

This guide is the engineering checklist I wish I'd had two years ago. It covers the decision tree, the modern toolchain, the gotchas, and the EU compliance constraints you can't ignore in 2026.

🇪🇺 Romanian / EU readers: the full hands-on Romanian-language program is at Fine-Tuning și Adaptarea Modelelor AI — Enterprise Edition. It includes a complete end-to-end project, EU AI Act governance, and FinOps modeling.

TL;DR

Don't fine-tune first. Try prompting → RAG → fine-tuning. In that order.
LoRA / QLoRA is the default in 2026. Full fine-tuning is rarely the right call.
Alignment ≠ SFT. SFT teaches format; DPO/GRPO/RLVR teach preferences and reasoning.
Evaluation is the hard part. Loss curves don't tell you if the model is better.
Serving matters. A great fine-tune served badly is just an expensive demo.
EU AI Act applies. Document your data, your evals, and your model card.

1. When fine-tuning is actually the right tool

Most teams reach for fine-tuning too early. Here's the honest decision tree:

Problem	First try	Fine-tune only if
Inconsistent output format	Prompting + structured outputs	Format breaks > 5% even with strict prompts
Knowledge cutoff / private data	RAG (Retrieval-Augmented Generation)	RAG retrieves the right chunks but the model still misuses them
Domain-specific style/voice	System prompt + few-shot	You need it baked in across thousands of calls (latency/cost)
Specialized reasoning (math, code, legal)	Better base model + CoT	You have a clean preference dataset and need stable behavior
Tool use / agents	MCP + good prompts	Tool-call accuracy is below your SLA after prompt iteration

Rule of thumb: if you can't articulate what your fine-tune teaches that a 200-line system prompt can't, you're not ready to fine-tune.

If you're earlier in the journey, the Prompt Engineering Masterclass and Advanced LLM Integration cover the cheaper alternatives in depth.

2. The 2026 technique landscape

Full fine-tuning

Updates every parameter. Maximum capacity, maximum cost, maximum risk of catastrophic forgetting. Justified for: foundational training, large domain shifts, or when you own the inference path and the dataset is huge (>1M high-quality examples).

LoRA (Low-Rank Adaptation)

The original LoRA paper (Hu et al., 2021) is still required reading. You freeze the base weights and train two small low-rank matrices, A and B, for each weight matrix you target (typically the attention projections). A typical adapter is 0.1–1% of the model's parameters.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,              # scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Example output; exact counts depend on the base model and on target_modules:
# trainable params: 8.4M || all params: 7.2B || trainable%: 0.12
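The trainable-parameter count is plain arithmetic, so it is worth checking by hand. The sketch below assumes a hypothetical model with 32 layers, hidden size 4096, and square projection matrices; models with grouped-query attention have smaller k and v projections, so their counts come out lower.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

def lora_trainable_params(d_model: int, n_layers: int,
                          n_target_modules: int, r: int) -> int:
    """Total adapter weights when every targeted projection is d_model x d_model."""
    return n_layers * n_target_modules * lora_params_per_matrix(d_model, d_model, r)

# Assumed shape: 32 layers, hidden size 4096, square projections
all_four = lora_trainable_params(d_model=4096, n_layers=32, n_target_modules=4, r=16)
q_and_v_only = lora_trainable_params(d_model=4096, n_layers=32, n_target_modules=2, r=16)
print(all_four, q_and_v_only)  # 16777216 8388608
```

Under these assumptions, targeting all four projections at r=16 gives about 16.8M trainable weights, and targeting only q_proj and v_proj gives about 8.4M.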

QLoRA

QLoRA (Dettmers et al., 2023) loads the base model in 4-bit (NF4) and trains LoRA adapters on top. This is what lets you fine-tune a 70B model on a single 80GB GPU. Use bitsandbytes + HuggingFace PEFT.

DoRA, OLoRA, rsLoRA

Newer variants that decouple magnitude/direction (DoRA), use orthogonal init (OLoRA), or rescale rank (rsLoRA). Marginal gains in most cases — start with vanilla LoRA, only switch if you've measured a problem.

3. Alignment: SFT is just step one

Supervised Fine-Tuning (SFT) teaches the model what good output looks like. It does not teach preferences, refusals, or reasoning quality. That's what alignment is for.

DPO (Direct Preference Optimization)

DPO (Rafailov et al., 2023) replaces the RLHF pipeline (reward model + PPO) with a single classification-style loss on preference pairs. Simpler, more stable, and the de facto default in 2026.

from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    beta=0.1,                   # KL regularization
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,             # PEFT auto-handles reference
    args=config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,        # recent TRL releases call this argument processing_class
)
trainer.train()
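To make the "classification-style loss" concrete, here is the per-pair DPO loss written out with the standard library. It is a sketch for intuition, not what TRL runs internally: the inputs are assumed to be summed sequence log-probabilities, and the unguarded exp can overflow for extreme values.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy == reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)     # policy now prefers chosen
```

When the policy matches the reference the loss is log 2. It falls as the policy raises the chosen answer's likelihood relative to the rejected one, and beta controls how sharply.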

GRPO and RLVR

GRPO (Group Relative Policy Optimization, popularized by DeepSeek-R1) and RLVR (RL with Verifiable Rewards) are the techniques behind the reasoning-model wave. If you're training for math, code, or anything with a programmatic verifier — these matter.

The HuggingFace TRL library now ships first-class support for SFT, DPO, GRPO, and KTO.
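The "group relative" part of GRPO is easy to show in isolation: sample several answers per prompt, score them, and normalize each reward against its own group. This is a sketch of that one step only; the use of population standard deviation and the eps value are my assumptions, and library implementations may differ in detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 when the verifier passes
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing answers get a positive advantage and failing ones a negative advantage. If every answer in the group scores the same, all advantages are zero and the group contributes no learning signal.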

4. The data pipeline is the moat

A bad dataset will defeat a perfect training loop every time. Things that actually move metrics:

Diversity over volume. 5K diverse examples beat 50K near-duplicates.
Hard negatives. For preference data, pairs where chosen and rejected are almost equally good teach more than obvious wins.
Decontamination. Strip eval-set leakage from training data. Always.
Format consistency. Tokenize early to catch chat-template mismatches before you waste 10 GPU-hours.
PII and licensing. This is where the EU AI Act lives. Document provenance.
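Decontamination can start as a simple n-gram overlap check. The sketch below is deliberately crude: whitespace tokenization, an 8-word window by default, and a "drop on any shared n-gram" rule are all assumptions you should tune. Texts shorter than n words produce no n-grams and are never flagged.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_texts: list, eval_texts: list, n: int = 8) -> list:
    """Drop training examples that share any word n-gram with the eval set."""
    eval_grams: set = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [t for t in train_texts if not (ngrams(t, n) & eval_grams)]

eval_set = ["what is the capital of france"]
train_set = [
    "Q: What is the capital of France today",
    "how do I reset my password",
]
clean = decontaminate(train_set, eval_set, n=3)
```

The first training example shares the 3-gram "what is the" with the eval question and is dropped; the second survives.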

5. The 2026 tooling stack

Here's what a production-grade fine-tuning project looks like today:

Layer	Tool
Training framework	HuggingFace TRL
Adapters	HuggingFace PEFT
Quantization	`bitsandbytes`
Distributed	Accelerate / DeepSpeed ZeRO-3 / FSDP
Experiment tracking	Weights & Biases or MLflow
Serving	vLLM
Eval harness	`lm-evaluation-harness` + custom domain evals
Closed-source baseline	OpenAI fine-tuning for comparison

Wiring all of this into a real CI/CD lifecycle is what separates a notebook experiment from a deployable system. That's the focus of MLOps: Prototype to Production.

6. Evaluation: where most projects quietly fail

Loss curves go down. The model "feels better." You ship. Production complaints spike. Sound familiar?

Build a holistic eval suite before you start training:

Capability evals — domain-specific tasks scored by rubric.
Regression evals — verify the model didn't lose abilities (catastrophic forgetting is real).
Safety evals — refusals, jailbreak resistance, policy adherence.
LLM-as-judge — useful, but bias-corrected with human spot-checks.
Cost & latency — TTFT, throughput, p95 — these are product metrics.

If your eval suite isn't version-controlled and reproducible, you don't have an eval suite. You have vibes.
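A regression eval can be as small as a diff between two score tables. In this sketch the task names, the scores, and the 0.02 tolerance are made up for illustration; tasks missing from the fine-tuned run are skipped here, which you may prefer to treat as a failure.

```python
def find_regressions(base_scores: dict, tuned_scores: dict,
                     tolerance: float = 0.02) -> dict:
    """Return tasks where the fine-tuned score dropped by more than tolerance."""
    drops = {}
    for task, base in base_scores.items():
        tuned = tuned_scores.get(task)
        if tuned is None:
            continue  # task was not re-run; handle separately
        if base - tuned > tolerance:
            drops[task] = round(base - tuned, 4)
    return drops

base = {"math": 0.80, "summarization": 0.70}
tuned = {"math": 0.70, "summarization": 0.71}
regressions = find_regressions(base, tuned)
```

Wire a check like this into CI and fail the build when the returned dict is non-empty.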

7. Serving: the part nobody talks about until it breaks

LoRA adapters can be hot-swapped at inference time. vLLM, SGLang, and TensorRT-LLM all support multi-LoRA serving — meaning you can host one base model and dozens of fine-tuned adapters with near-zero overhead.

# vLLM with LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules legal-adapter=./adapters/legal sales-adapter=./adapters/sales \
  --max-loras 4

This is the architectural unlock that makes fine-tuning economically viable for SaaS multi-tenancy.
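On the client side, vLLM's OpenAI-compatible server selects an adapter by its registered name in the request's model field. A sketch of per-tenant routing follows; the tenant names and the localhost:8000 address are assumptions, and send is defined but not called so the example runs without a server.

```python
import json
import urllib.request

# Hypothetical tenant-to-adapter mapping; adapter names match --lora-modules
TENANT_ADAPTERS = {"acme-legal": "legal-adapter", "acme-sales": "sales-adapter"}
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_chat_request(tenant: str, prompt: str) -> dict:
    """Pick the tenant's adapter, falling back to the base model."""
    return {
        "model": TENANT_ADAPTERS.get(tenant, BASE_MODEL),
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST to the OpenAI-compatible endpoint of a running vLLM server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("acme-legal", "Summarize clause 4.")
```

Tenants without a dedicated adapter fall through to the base model, so onboarding a new customer does not require a deploy.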

8. EU AI Act: not optional in 2026

If you're shipping in the EU, fine-tuning a foundation model can put you in the deployer or provider category under the EU AI Act. Practical consequences:

Model card documenting training data, intended use, limitations.
Risk assessment if the use case touches Annex III (HR, education, critical infrastructure, law enforcement, etc.).
Logging of significant model updates and eval results.
Transparency obligations to end users for AI-generated content.

This isn't lawyer paranoia — auditors are already asking. Bake it into your pipeline from day one.

9. The mistakes I see most often

Fine-tuning before exhausting prompting and RAG. Cheaper, faster, easier to roll back.
Using r=64 because "bigger is better". Most tasks saturate at r=8 to r=16. Measure.
Mismatched chat template between training and inference. Silent quality killer.
Training on the eval set. Decontaminate. Then decontaminate again.
Skipping the SFT-only baseline. You can't claim DPO helped if you didn't measure SFT-only first.
Ignoring catastrophic forgetting. Always run a regression eval against the base model.
Forgetting the FinOps math. A $400 fine-tune that adds $0.002/request to inference is not a win at 1M requests/day.
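The FinOps point in the last item is worth doing as arithmetic, using the numbers from the text. The helper name and the 30-day month are my own framing.

```python
def added_inference_cost(extra_usd_per_request: float,
                         requests_per_day: int, days: int = 30) -> float:
    """Extra spend caused by a per-request cost increase."""
    return extra_usd_per_request * requests_per_day * days

daily = added_inference_cost(0.002, 1_000_000, days=1)     # about $2,000 per day
monthly = added_inference_cost(0.002, 1_000_000, days=30)  # about $60,000 per 30 days
training_cost = 400.0
```

The one-off $400 training bill is a rounding error next to the recurring inference delta, which is why serving cost belongs in the decision before the first training run.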

Where to go next

If you want a structured path that goes from prompt engineering to deploying fine-tuned models in production:

Foundation: Introduction to AI Engineering
Before fine-tuning: Prompt Engineering Masterclass → RAG: Retrieval-Augmented Generation
The full deep dive: Fine-Tuning and Model Adaptation — Enterprise Edition (LoRA/QLoRA/DoRA, DPO/GRPO/RLVR, vLLM serving, EU AI Act, end-to-end project)
Productionization: MLOps: Prototype to Production
Integration layer: MCP — Model Context Protocol

Browse the full IT engineering track at cursuri-ai.ro/cursuri/it.

Closing thought

Fine-tuning in 2026 is no longer about can the model learn the task. It's about whether your dataset, eval suite, serving stack, and governance process are good enough to deserve a custom model. Get those right, and a single adapter can be the difference between a feature that costs you money and a feature that defines your product.

If this resonated, I'd love to hear what fine-tuning problem you're actually stuck on — drop it in the comments. 👇

Originally published on Cursuri-AI.ro — the AI engineering education platform for Romanian and EU professionals.

Claude Opus 4.7 vs GPT-5.5: A Developer's Pragmatic Comparison Guide (2026)

galian — Tue, 28 Apr 2026 10:03:06 +0000

TL;DR — In 2026, choosing an LLM is no longer about picking "the best model." It's about understanding which model solves your specific problem at the lowest total cost and risk. Claude Opus 4.7 brings a 1M token context window and exceptional reasoning. GPT-5.5 brings ecosystem maturity and multimodal strength. The right answer for production is almost always multi-model orchestration, not allegiance.

If you're a backend engineer, ML engineer, or solutions architect choosing a foundation model in 2026, this guide is for you. No marketing fluff. Just patterns I've validated on real projects.

A Quick Note on Honesty

Before we go further: I'm not going to fabricate specs.

Claude Opus 4.7 is verified to ship with a 1M token context window (Anthropic's official spec).
Claude Opus 4.6 remains in active production as the cost-efficient predecessor.
GPT-5.5 is OpenAI's current flagship at the time of writing. For exact context window, pricing, and benchmark numbers, always check OpenAI's official documentation — those numbers shift between point releases, and any blog quoting them risks being stale within a month.

This article focuses on architectural and methodological differences that age well, not spec-sheet trivia that doesn't.

Why This Comparison Matters Differently in 2026

Three years ago, picking a model meant running it through a weekend benchmark and shipping. Today, the calculus has changed:

Context windows have stopped being a bottleneck. With Opus 4.7's 1M token window, the question is no longer "can I fit my codebase?" — it's "should I, given attention dynamics and cost?"
Total Cost of Ownership has become non-trivial. API price-per-token is maybe 30% of what you actually pay in production.
Regulatory pressure is real. The EU AI Act and GDPR are no longer theoretical — they shape architecture decisions for any team with European users.

Engineers who still treat model selection as a 2-hour decision are leaving serious money and reliability on the table.

Architectural Differences That Actually Matter

Context Window

Model	Context Window	Practical Implication
Claude Opus 4.7	1,000,000 tokens	Full enterprise codebases, long-form legal docs, multi-document RAG without chunking compromises
Claude Opus 4.6	(See Anthropic docs)	Cost-optimized workhorse for everyday agentic workloads
GPT-5.5	(See OpenAI docs)	Tight integration with Azure OpenAI, mature tooling ecosystem

The 1M context window is not just bigger — it changes architectural patterns.

When you have a million tokens, you stop building chunked RAG pipelines for many use cases. You stop fighting context truncation. You can pass a full repo, a full deposition, a full quarterly filing — and ask the model to reason over it directly.

But this comes with a real trade-off: attention quality degrades unevenly across very long contexts. Just because you can stuff 800K tokens in doesn't mean the model will reliably find the needle. Always run targeted needle-in-haystack evals on your data structure.
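A needle-in-haystack eval needs little more than a way to plant a known fact at controlled depths. A minimal sketch, with made-up filler text and needle; in practice the filler should be your own documents, and you would send each context to the model and check whether it retrieves the needle.

```python
def build_haystack(filler_docs: list, needle: str, depth: float) -> str:
    """Insert needle at a relative depth (0.0 = start, 1.0 = end) among filler docs."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_docs))
    docs = filler_docs[:position] + [needle] + filler_docs[position:]
    return "\n\n".join(docs)

def needle_cases(filler_docs: list, needle: str,
                 depths: tuple = (0.0, 0.25, 0.5, 0.75, 1.0)) -> list:
    """One context per depth, so recall can be scored by position."""
    return [{"depth": d, "context": build_haystack(filler_docs, needle, d)}
            for d in depths]

filler = [f"Filler paragraph {i}." for i in range(10)]
cases = needle_cases(filler, "The vault code is 4812.")
```

Scoring recall per depth shows where in the window your data starts getting lost, which a single aggregate accuracy number hides.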

Reasoning Style

This is hard to quantify but easy to feel after enough projects:

Claude Opus 4.7 tends to reason more conservatively. It pushes back on ambiguity, asks clarifying questions, and produces structured outputs that hold up well under JSON schema validation.
GPT-5.5 tends to be more proactive and creative. It will often produce a complete answer where Claude would ask "did you mean X or Y?"

Neither is universally better. Conservative reasoning saves you from hallucinated database queries in production. Proactive reasoning ships features faster in a hackathon.

Tool Use & Agentic Workflows

Both models support function calling and agentic loops. In my experience:

Claude's tool use feels more deterministic. JSON schemas hold. Parallel tool calls behave predictably.
GPT's tool use has a more mature ecosystem (Assistants API, more SDK examples, broader community).

If you're building a pure agent system, both work. If you're integrating into an existing Azure / Microsoft stack, GPT-5.5 has lower friction. If you're building a regulated workflow with strict guarantees, Claude's structured output behavior wins on reliability.

When To Choose Each — A Decision Framework

Stop asking "which is best?" Start asking these four questions:

1. What problem am I actually solving?

Long-form document reasoning, code analysis at scale, regulated decision support → Claude Opus 4.7
Multimodal user-facing features, real-time voice, ecosystem-heavy integrations → GPT-5.5
High-volume cost-sensitive agentic workloads → Claude Opus 4.6 (or smaller models)

2. What's my failure cost?

A chatbot that recommends the wrong product costs a sale. An assistant that misreads a contract clause costs a lawsuit. Match the model's reliability profile to your downside risk.

3. Who maintains this in 18 months?

Models get deprecated. Pricing changes. APIs evolve. Pick the model whose migration path you can stomach. If your answer is "we can't migrate" — you've built tech debt, not capability.

4. What's my regulatory surface?

For EU-resident users:

EU AI Act classifies systems by risk tier — high-risk systems carry significant compliance overhead.
GDPR still applies to any prompt containing personal data.
Vendor concentration risk is now a documented audit concern.

Single-vendor architectures are increasingly hard to defend in compliance reviews.

Build Your Own Evaluation Harness (Don't Trust Public Benchmarks)

Public benchmarks measure general capability. Your production system needs domain-specific capability. Here's a minimal evaluation pattern I use:

import anthropic
from openai import OpenAI

anthropic_client = anthropic.Anthropic()
openai_client = OpenAI()

def evaluate_on_task(model_id: str, provider: str, task: dict) -> dict:
    """Run a single task against a model and return structured output."""
    prompt = task["prompt"]
    expected = task["expected"]

    if provider == "anthropic":
        response = anthropic_client.messages.create(
            model=model_id,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.content[0].text
    else:  # openai
        response = openai_client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.choices[0].message.content

    return {
        "model": model_id,
        "task_id": task["id"],
        "output": output,
        "expected": expected,
        "match": evaluate_match(output, expected),
    }


def run_eval_suite(test_cases: list[dict]) -> dict:
    """Compare both models on the same tasks."""
    results = {"claude": [], "gpt": []}
    for task in test_cases:
        results["claude"].append(
            evaluate_on_task("claude-opus-4-7", "anthropic", task)
        )
        results["gpt"].append(
            evaluate_on_task("gpt-5.5", "openai", task)
        )
    return results
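The harness calls evaluate_match without defining it. One possible stand-in, assuming short factual answers, is a normalized substring match. This is only a sketch: it will happily match "12" inside "120", so swap in a stricter or task-specific scorer for numeric or structured outputs.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def evaluate_match(output: str, expected: str) -> bool:
    """True when the normalized expected answer appears in the normalized output."""
    return normalize(expected) in normalize(output)

hit = evaluate_match("The answer is: Paris.", "paris")
miss = evaluate_match("I believe it is London.", "Paris")
```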

A few principles for building your eval suite:

Use real production data (anonymized). Synthetic tasks lie.
Include adversarial cases — ambiguous inputs, near-duplicates, edge cases.
Measure cost-per-correct-answer, not just accuracy.
Run it weekly — model behavior drifts between point releases.
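Cost-per-correct-answer is a one-line metric once each result carries its cost. In this sketch the cost_usd field and the sample numbers are hypothetical; the harness as written does not record cost, so you would need to add it from the usage data each API returns.

```python
def cost_per_correct_answer(results: list) -> float:
    """Total spend divided by the number of correct answers."""
    total_cost = sum(r["cost_usd"] for r in results)
    correct = sum(1 for r in results if r["match"])
    if correct == 0:
        return float("inf")
    return total_cost / correct

cheap_model = [
    {"match": True, "cost_usd": 0.01},
    {"match": False, "cost_usd": 0.01},
    {"match": False, "cost_usd": 0.01},
    {"match": False, "cost_usd": 0.01},
]
pricey_model = [{"match": True, "cost_usd": 0.02}] * 4

cheap_cpc = cost_per_correct_answer(cheap_model)
pricey_cpc = cost_per_correct_answer(pricey_model)
```

With these made-up numbers the model that costs twice as much per call is half the price per correct answer.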

The Hidden Costs Nobody Talks About

API price-per-token is only a fraction of your real cost. Here's the full picture:

Cost Layer	Typical Range	What Drives It
Direct API tokens	20-30% of total	Pricing tier, prompt size
Re-prompting on errors	10-20%	Model reliability, validation strictness
Human-in-the-loop validation	15-30%	Use case sensitivity, regulatory requirements
Caching infrastructure	5-10%	Architecture, library choices
Vendor migration overhead	10-25% (when triggered)	Lock-in level, abstraction quality
Compliance audits	5-15%	Regulatory environment, data sensitivity

A model that's "20% cheaper at the API" can be 2x more expensive in TCO if it triggers more re-prompts or requires heavier human validation.

Multi-Model Orchestration: The Pattern That Wins

In 2026, the production-grade answer is rarely "one model for everything." Common patterns:

┌─────────────────────────────────────────────────────────────┐
│  Router (lightweight model)                                 │
│  ├── Classifies request complexity & sensitivity            │
│  └── Routes to appropriate model                            │
└─────────────────────────────────────────────────────────────┘
            │
   ┌────────┼────────┐
   ▼        ▼        ▼
[Haiku]  [Opus 4.6]  [Opus 4.7]
 cheap    balanced    deep reasoning
 fast     production  complex docs

This pattern routinely cuts costs by 40-60% versus single-model architectures, with no quality loss when the router is well-calibrated.
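A router does not have to start as a model. The sketch below uses plain heuristics as a stand-in for the lightweight classifier in the diagram. The thresholds, the sensitive flag, and the tier labels are assumptions; map the labels to real model IDs in your own config.

```python
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    sensitive: bool = False  # e.g. regulated or contract-related content

# Tier labels mirror the diagram; they are not real model IDs
CHEAP, BALANCED, DEEP = "haiku", "opus-4.6", "opus-4.7"

def route(request: Request, long_input_chars: int = 20_000) -> str:
    """Pick a tier from simple signals; a small classifier model could replace this."""
    if request.sensitive or len(request.text) > long_input_chars:
        return DEEP
    if len(request.text) > 500:
        return BALANCED
    return CHEAP

tier = route(Request("What are your opening hours?"))
```

Log the chosen tier with every request so you can measure how often the router sends traffic to the expensive model and whether quality holds on the cheap one.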

Going Deeper: Resources

If you want to go beyond this article and build genuine expertise in model selection, evaluation, and multi-model architecture, I've put together a structured course covering exactly these topics:

🔗 AI Model Comparison 2026 — Enterprise Edition (course is in Romanian)

It covers:

Full enterprise evaluation methodology — from benchmark to production
How to interpret 2026 benchmarks correctly (signal vs. marketing noise)
Structured selection frameworks based on cost / risk / use case
Complete landscape: Anthropic, OpenAI, Google, Meta, Mistral
Multi-model architectures and cost optimization strategies
Applied case studies with European regulatory context

🔗 Full platform: Cursuri-AI.ro — single subscription, full catalog of AI courses for IT and non-IT professionals.

Closing Thoughts

The real edge in 2026 isn't access to AI — it's methodological maturity in choosing, evaluating, and governing AI. Model access has become a commodity. The competence to architect around models is the scarce resource.

If you take one thing from this article, let it be this:

Stop asking "which model is best?" Start asking "which model best fits this specific decision, and what's my exit if I'm wrong?"

That single shift in framing will save your team thousands of hours and tens of thousands of euros over the next twelve months.

Found this useful? Drop a comment with your current model stack — I'm always curious how teams are actually orchestrating these in production.


The editing model

Agents and autonomy

Ecosystem and integration

Models

Pricing, side by side (mid-2026 snapshot)

So which one should you choose?

The skill underneath the tools

Conclusion

Stop Vibe-Checking Your LLM: A Developer's Guide to Evals

Why you can't just assert output == expected

Start with the eval set, not the metric

The two halves of every LLM eval

Code-based checks: cheaper and more reliable than you think

LLM-as-judge: powerful, biased, and fixable

Wire it into CI, or it won't survive contact with deadlines

Five mistakes that quietly poison your evals

How this connects to the rest of your LLM stack

Conclusion

Claude Fable 5: A Developer's Guide to Anthropic's New Top

What Is Claude Fable 5?

Where Fable 5 Fits in the 2026 Anthropic Lineup

What Changes in the API

Why you can't just `assert output == expected`