DEV Community: galian

We just launched our AI course catalog in English — 20 hands-on courses with an AI professor in every lesson

galian — Fri, 17 Jul 2026 07:21:16 +0000

I'm the founder of Cursuri AI, an AI e-learning platform for professionals. "Cursuri" is simply Romanian for "courses" — the platform started in Romania, where it now runs a catalog of 50 AI courses in Romanian, from prompt engineering fundamentals to building production AI agents.

This week we shipped the thing people kept asking for: the platform is now available in English, with an international catalog of 20 courses built for people who want to actually use AI at work — not just read about it.

This post is the announcement, but I also want to be concrete about what the platform does and what's in the catalog, so you can decide in two minutes whether it's for you.

The idea: learning AI should feel like working with AI

Most online courses are passive: you watch or read, you nod along, and a week later you remember 10% of it. That model is especially broken for AI, where the whole point is interaction — prompting, iterating, getting feedback, correcting course.

So we built the platform around active learning instead:

An AI professor in every lesson. Not a generic chatbot bolted onto the side — an assistant that knows exactly which lesson you're in and answers questions in that context. You can type or just talk to it: it works in text and voice. Stuck on why your RAG retrieval returns garbage, or what a "system prompt" actually is? Ask mid-lesson and get a precise answer. There's a full overview of how it works on the AI professor page.
Interactive AI quizzes. Questions adapt to your level, and every answer comes with a detailed explanation — including why the wrong options are wrong, which is where most of the learning actually happens.
Hands-on exercises on the platform. Coding challenges for the engineering track, realistic work scenarios for the business track. You apply things as you learn them, not "someday later."
Automatic AI summaries. Key points extracted from every lesson, so reviewing before an interview or a meeting takes minutes, not hours.
A personal dashboard. Progress, scores, and insights, so you can see whether you're actually getting better or just clicking "next."

And because AI moves absurdly fast, content is kept up to date — courses get refreshed as the tools and models change, and new courses are added to the bundles at no extra cost.

What's in the English catalog

The 20 launch courses are split into two tracks, because "learning AI" means very different things for a backend engineer and a marketing manager.

IT & Engineering track (10 courses)

This is the track I'm personally most excited about, because it covers the stack of skills that 2026-era AI engineering actually demands:

Introduction to AI Engineering — the on-ramp if you're a developer new to the field
The Complete Prompt Engineering Masterclass
RAG: Retrieval-Augmented Generation in Practice
Advanced LLM Integration in Production Applications
AI Agents: Architecting and Automating Autonomous Systems
Context Engineering and Memory for AI Agents — beyond prompting: managing what your agent knows and remembers
Claude Code Mastery: Agentic Coding from the Terminal — multi-file work, git workflows, CI, MCP
MCP (Model Context Protocol): Building Servers and Integrations
Cursor as a Pro: AI-Native IDE, Composer and Multi-Agent workflows
Vibe Coding: From Prompt to Application with Lovable, v0, Bolt and Replit Agent

If you've been meaning to move from "I use ChatGPT sometimes" to "I ship AI features and agentic workflows," this track is a structured path through exactly that.

Business & Professionals track (10 courses)

No code required — built for managers, marketers, founders, analysts, and office professionals:

AI for Business Leaders
Manager in the AI Era: Leading Your Team Through the AI Transformation
AI for Entrepreneurs and Startups: The Complete Guide
AI for Digital Marketing
AI for Content Creation and Copywriting
AI for Sales and CRM
SEO and AEO/GEO in the AI Era — optimizing for Google, AI Overviews and generative engines
Microsoft 365 Copilot for Office Work: Role-Based Productivity
No-Code Data Analysis with AI — ChatGPT, Excel and SQL for non-programmers
AI Image Generation: The Complete Guide from Prompt to Publishing

The full list with detailed curricula is on the course catalog.

Pricing — simple and honest

Every plan is billed monthly and can be cancelled at any time:

Plan	What you get	Price
Single course	Any one course, full access + AI professor	€49/month + VAT
Business	All 10 Business & Professionals courses	€199/month + VAT
IT Pro	All 10 IT & Engineering courses	€399/month + VAT
All Access	The entire international catalog, both tracks	€499/month + VAT

Two things worth calling out:

You're not forced into a bundle. If you only need the RAG course or the Copilot course, subscribe to just that one for €49/month and cancel when you're done.
New courses are included. Bundle subscribers get every new course we release, as it's released.

Full details and a plan comparison are on the pricing page.

Who this is for (and who it isn't for)

It's for you if you're a professional who wants a structured, hands-on path to using AI in your actual job — whether that job is writing code, running a team, or growing a business. It works well for companies too: there's a dedicated track structure and team offering for organizations that want to upskill whole departments.

It's probably not for you if you want academic ML theory — we don't teach you to derive backpropagation. The focus is applied: tools, workflows, and skills you use the same week you learn them.

The ask

If this sounds useful, take a look at cursuri-ai.ro/en — and if you check out a course, I'd genuinely love feedback. We're a team from Romania, competing on quality and depth of content, so every piece of honest criticism from this community makes the platform better.

And if you have questions about the catalog, the AI professor, or where we're taking the platform next — ask away in the comments. I'll answer everything.

Fine-Tuning vs RAG vs Prompting: How to Actually Decide in 2026

galian — Wed, 15 Jul 2026 11:49:01 +0000

There's a predictable arc to most LLM projects. Something doesn't work, someone says "we should fine-tune it," a month disappears into dataset wrangling and GPU bills, and the model comes back... about as wrong as before — because the actual problem was that it never had the right facts in front of it. Fine-tuning was never going to fix that.

The three techniques — prompting, retrieval-augmented generation (RAG), and fine-tuning — are not a ladder you climb from cheap to fancy. They solve different problems, and choosing the wrong one is expensive in exactly the way that's hard to notice: it looks like progress while it burns weeks.

This is a decision framework. Not "here's what each one is" — you can get that anywhere — but the concrete questions that tell you which one your problem actually needs, and the failure signatures that mean you picked wrong.

The one distinction that resolves most arguments

Before any framework, internalize this split, because it settles 80% of the "should we fine-tune?" debates on its own:

RAG changes what the model knows. It injects facts into the context at inference time.
Fine-tuning changes how the model behaves. It adjusts the weights to shift style, format, structure, and task-specific skill.
Prompting changes what the model is told to do right now, using the knowledge and behavior it already has.

So the first question is never "which technique is best?" It's "is my problem a knowledge gap or a behavior gap?"

The model gives outdated, made-up, or "I don't have access to that" answers about your data → knowledge gap → RAG.
The model knows the facts but won't reliably produce the format, tone, or task structure you need → behavior gap → fine-tuning (maybe).
You haven't seriously tried telling it clearly what to do yet → prompting, first, always.

Get this wrong and no amount of engineering saves you. Fine-tuning a model to "know your docs" is the classic error: you can bake a few facts into weights, but they go stale the moment your docs change, you can't cite sources, and you've spent training compute to build a worse version of a lookup. Knowledge that changes belongs in retrieval, not in weights.

Always start with prompting (yes, even now)

Prompting is not the beginner tier you graduate from. In 2026, with frontier models, a well-constructed prompt plus a few good examples solves a startling share of problems that teams assume need training. It's the fastest, cheapest, most inspectable option, and it should be your baseline before you're allowed to say the word "fine-tune."

Reach for prompting when:

You're still discovering what "good output" even looks like. Prompts are editable in seconds; datasets are not.
The task is reasoning, transformation, or generation the model already broadly knows how to do.
You need to ship this week.

The techniques that make prompting punch above its weight are unglamorous but real: precise role and task framing, few-shot examples that demonstrate the exact output shape, chain-of-thought for multi-step reasoning, and rigid output contracts (structured/JSON) so downstream code can trust the result. Most "the model can't do this" conclusions are actually "we asked badly" conclusions. Squeezing the ceiling out of prompting before spending on anything heavier is a discipline in itself — it's the whole point of a prompt engineering masterclass, and the ROI of getting it right first is enormous because everything downstream inherits a better baseline.
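As a minimal sketch of "few-shot examples plus a rigid output contract", the snippet below assembles a prompt and validates a reply against the expected JSON shape. The ticket-triage task, the field names, and the helpers `build_prompt` and `parse_reply` are all hypothetical, and no model is called: the reply is a hard-coded stand-in.

```python
import json

# Hypothetical few-shot examples: each one demonstrates the exact output shape.
FEW_SHOT = [
    ("I was charged twice this month", {"category": "billing", "urgent": True}),
    ("How do I change my avatar?", {"category": "account", "urgent": False}),
]
ALLOWED_CATEGORIES = {"billing", "account", "technical"}


def build_prompt(ticket: str) -> str:
    """Role framing + few-shot examples + an explicit JSON contract."""
    lines = [
        "You are a support triage assistant.",
        'Reply with JSON only: {"category": <billing|account|technical>, "urgent": <true|false>}',
        "",
    ]
    for text, label in FEW_SHOT:
        lines += [f"Ticket: {text}", f"JSON: {json.dumps(label)}", ""]
    lines += [f"Ticket: {ticket}", "JSON:"]
    return "\n".join(lines)


def parse_reply(reply: str) -> dict:
    """Reject anything that breaks the contract so downstream code can trust it."""
    data = json.loads(reply)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or set(data) != {"category", "urgent"}:
        raise ValueError(f"unexpected shape: {data!r}")
    if data["category"] not in ALLOWED_CATEGORIES or not isinstance(data["urgent"], bool):
        raise ValueError(f"invalid values: {data!r}")
    return data


prompt = build_prompt("The app crashes when I upload a file")
# Stand-in for a model reply; a real call would go through your provider's SDK.
print(parse_reply('{"category": "technical", "urgent": true}'))
```

The validation step is the part that matters downstream: a reply that fails the contract is an error you can count, retry, or log, rather than a malformed value that leaks into your application.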

The prompting ceiling — how you know you've hit it: you've iterated seriously, added good examples, and the model still fails — and the failure is either (a) it doesn't know facts it couldn't possibly know, or (b) it can't hold a consistent behavior across inputs no matter how you phrase the instruction. (a) points to RAG. (b) might point to fine-tuning. Not before.

Reach for RAG when the problem is knowledge

RAG is the answer whenever the model needs to work with information it wasn't trained on: your internal documentation, a product catalog, last week's tickets, a knowledge base that updates daily, anything private or fresh.

Choose RAG when:

Answers must be grounded in a specific corpus and you need to cite sources.
The knowledge changes — pricing, policies, docs, inventory. You update an index, not a model.
Hallucination on facts is unacceptable and you need an audit trail of where an answer came from.
The knowledge base is large, or partly access-controlled per user.

The reason RAG beats fine-tuning for knowledge isn't subtle: updating a document store is cheap and fast; updating weights is a training run. RAG gives you freshness, provenance, and a natural place to enforce per-user access control, none of which fine-tuning offers. When your facts have a shelf life, retrieval is the right architecture, and building it well (chunking, hybrid search, re-ranking) is where the real engineering lives. That is the substance of a dedicated course on RAG and retrieval-augmented generation.

RAG's own ceiling: retrieval fixes what the model knows, not how it behaves. If your RAG answers are factually correct but come out in the wrong format, wrong tone, or don't follow your house style no matter how you prompt — that residual behavior gap is exactly where fine-tuning finally earns its place, on top of RAG, not instead of it.

Fine-tune when the problem is behavior — and only then

Fine-tuning is the right tool, but for a narrower set of problems than its reputation suggests. It shines at teaching consistent behavior that's hard to specify in a prompt: a very specific output structure, a domain's tone and terminology, a classification or extraction task where you have lots of labeled examples, or a skill the base model does clumsily.

Legitimately reach for fine-tuning when:

You need consistent style, format, or structure at a level prompting can't hold across the full input distribution.
You have a narrow, high-volume, well-defined task (classification, extraction, a specific transformation) and enough quality labeled data.
You want to bake in a behavior so you can drop it from the prompt — shorter prompts, lower per-call cost, faster responses at scale.
Latency or cost at scale matters and a smaller fine-tuned model can match a bigger prompted one.

Two things make modern fine-tuning far less scary than its reputation. First, you almost never do full fine-tuning: parameter-efficient methods like LoRA/QLoRA train a small set of adapter weights, cutting the number of trained parameters by orders of magnitude and the memory footprint substantially while getting most of the benefit. Second, the bottleneck is data quality, not model choice: a few hundred to a few thousand clean, consistent, representative examples usually beat a huge noisy pile. The hard part of fine-tuning was never running the training job. It's building the dataset, choosing PEFT trade-offs, and evaluating the result without fooling yourself, which is precisely the ground a fine-tuning course has to cover to be worth anything.
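To make the "tiny set of adapter weights" concrete, here is the parameter arithmetic for one weight matrix, following the low-rank formulation in the LoRA paper listed in the sources. The 4096 x 4096 projection size and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns the update to a (d_out x d_in) weight as B @ A,
    with B of shape (d_out x rank) and A of shape (rank x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # hypothetical projection size
rank = 8             # hypothetical LoRA rank

full = d_out * d_in                               # parameters updated by full fine-tuning
lora = lora_trainable_params(d_out, d_in, rank)   # parameters updated by LoRA

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}):  {lora:,} trainable parameters")
print(f"ratio: {lora / full:.2%}")
```

For this one matrix the adapter trains 65,536 parameters instead of 16,777,216, roughly 0.39% of the original. The forward pass still runs through the full frozen model, so the saving is mostly in gradients, optimizer state, and checkpoint size.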

When fine-tuning is the wrong answer — the red flags:

"We'll fine-tune it on our docs so it knows them." → No. That's RAG. Fine-tuned facts go stale and can't be cited.
"We haven't really tried prompting." → Do that first; you may not need to train at all.
"The requirements change weekly." → Fine-tuning bakes behavior in; if the target moves, you're re-training constantly. Keep it in the prompt until it stabilizes.
"We have 40 examples." → Usually not enough for reliable behavior change; strong prompting with those 40 as few-shot examples will likely beat it.

The combinations are the real answer

Framing these as rivals is the beginner mistake. In production, the strongest systems combine them, because they operate on different axes — knowledge, behavior, and instruction — and stack cleanly:

RAG + prompting is the workhorse for most knowledge-grounded assistants: retrieve the right context, then a well-engineered prompt instructs the model to answer only from it and cite sources. No training required.
Fine-tuning + RAG is the high end: fine-tune for the domain's behavior (tone, format, task skill), and use RAG for the facts. The model behaves exactly right and stays current — behavior in the weights, knowledge in the index.
Fine-tuning + prompting collapses a long, brittle instruction into learned behavior, so your prompts get short and your inference gets cheaper.
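The "RAG + prompting" combination can be sketched as plain prompt assembly: retrieved chunks go in numbered, and the instruction tells the model to answer only from them and cite by number. The chunk structure (`source`, `text`), the wording of the instruction, and the function name are assumptions for illustration; retrieval itself is stubbed with a hard-coded list.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Stand-in for what a retriever would return for this question.
retrieved = [
    {"source": "pricing.md", "text": "Annual plans can be refunded within 30 days."},
    {"source": "faq.md", "text": "Monthly plans are non-refundable."},
]

grounded = build_grounded_prompt("Can I get a refund on an annual plan?", retrieved)
print(grounded)
```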

Orchestrating these — deciding which layer owns which responsibility, and routing a request through retrieval, tools, and the model in the right order — is its own engineering discipline, and it's the core of a course on advanced LLM integration. The mental model to keep: knowledge → retrieval, behavior → weights, instruction → prompt. Put each requirement on the axis it actually lives on.

The decision, in one pass

Run your problem through these questions, in order, and stop at the first one that fits (the last one covers the case where both of the previous two apply):

Have you genuinely exhausted prompting — clear instructions, good few-shot examples, structured output? If not → prompt. (This is where most projects should still be.)
Is the failure a knowledge gap — missing, stale, or private facts; needs citations? → RAG.
Is the failure a behavior gap — format/tone/task consistency the prompt can't hold, and you have quality labeled data and the target is stable? → fine-tune (LoRA first).
Is it both? → RAG for the facts, fine-tuning for the behavior. In that order.
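The checklist above can be written down as a small function, which is a useful way to force a team to answer each question explicitly. This is only a sketch: the `Diagnosis` fields and the returned labels are hypothetical names, and the combined case is checked explicitly rather than by falling through.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    prompting_exhausted: bool
    knowledge_gap: bool
    behavior_gap: bool
    has_quality_labeled_data: bool = False
    target_is_stable: bool = False


def choose_technique(d: Diagnosis) -> str:
    if not d.prompting_exhausted:
        return "prompt"
    # Fine-tuning only qualifies with good data and a stable target.
    can_fine_tune = d.behavior_gap and d.has_quality_labeled_data and d.target_is_stable
    if d.knowledge_gap and can_fine_tune:
        return "RAG, then fine-tune (LoRA first)"
    if d.knowledge_gap:
        return "RAG"
    if can_fine_tune:
        return "fine-tune (LoRA first)"
    # Behavior gap without data or a stable target: keep it in the prompt for now.
    return "prompt"


print(choose_technique(Diagnosis(prompting_exhausted=True, knowledge_gap=True, behavior_gap=False)))
```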

And underneath all of it: you cannot make this decision without evaluation. "It seems better" is not data. Before you choose, build a small eval set — representative inputs with known-good outputs — so you can measure whether prompting already clears the bar, whether RAG actually retrieves the right context, and whether a fine-tune moved the metric or just moved the failures around. Teams that skip this pick techniques by vibes and discover the mistake in production; teams that treat evals as first-class make the cheap correct choice on purpose. The eval set is what turns this framework from an opinion into a decision.
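A small eval set does not need infrastructure to be useful. The sketch below scores two variants of a system against the same inputs with exact-match accuracy; the eval items, the `baseline` and `candidate` stand-ins, and the choice of exact match as the metric are all assumptions for illustration. Real tasks often need a softer metric.

```python
from typing import Callable

# Hypothetical eval set: representative inputs paired with known-good outputs.
EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]


def accuracy(system: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval inputs where the output matches, ignoring case and outer whitespace."""
    hits = sum(system(q).strip().lower() == expected.lower() for q, expected in eval_set)
    return hits / len(eval_set)


# Stand-ins for two variants you might compare (e.g. prompt v1 vs prompt v2).
def baseline(q: str) -> str:
    return {"2 + 2": "4"}.get(q, "unknown")


def candidate(q: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris"}.get(q, "unknown")


print(f"baseline:  {accuracy(baseline, EVAL_SET):.2f}")
print(f"candidate: {accuracy(candidate, EVAL_SET):.2f}")
```

The same harness works unchanged whether the variant is a new prompt, a RAG pipeline, or a fine-tuned model, which is what lets you compare them on equal terms.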

Conclusion

The reason so many LLM projects stall isn't a shortage of technique — it's reaching for the wrong one and mistaking motion for progress. Fine-tuning a model to "learn facts," RAG-ing a problem that was really a bad prompt, or grinding on prompts when the model fundamentally lacks the data: each fails in a way that looks like effort.

Anchor on the split and you'll rarely go wrong. Knowledge that changes → RAG. Behavior you can't prompt into place → fine-tuning. Everything else → prompt, and prompt well. Start cheap, measure honestly, and add complexity only when an eval — not a hunch — tells you the current layer has topped out. The best architecture isn't the most sophisticated one; it's the one that puts each requirement on the axis where it actually belongs.

Sources & further reading:

Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models
Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs

This article is educational content. Models, tooling, and cost trade-offs evolve quickly; validate any approach against your own data and current provider documentation before committing to it in production.

Vector Embeddings Explained: Build Semantic Search in Python

galian — Mon, 13 Jul 2026 11:25:24 +0000

Search for "reset my password" in a keyword-based system and a help article titled "How to recover your account credentials" won't match — not one word overlaps. Yet any human knows they mean the same thing. Closing that gap between characters and meaning is what vector embeddings do, and they're the quiet engine behind semantic search, RAG, recommendation systems, and most of the "AI that understands you" experiences shipped since 2023.

This is a practical guide. We'll cover what an embedding actually is, why cosine similarity is the operation you'll use constantly, and then build a real, working semantic search engine in Python — first with pure NumPy so you see the mechanics, then with the tools you'd actually reach for in production. By the end you'll have code that runs and a mental model that transfers to every embedding-powered feature you build next.

What an embedding actually is

An embedding is a list of numbers — a vector — that represents a piece of content as a point in high-dimensional space. An embedding model (a neural network trained on enormous text corpora) reads your text and outputs, say, 384 or 1,536 floating-point numbers. The magic isn't the numbers themselves; it's the property the training instills: texts with similar meaning land close together in that space, and unrelated texts land far apart.

That's the whole idea. "How do I reset my password?" and "I forgot my login credentials" produce vectors that sit near each other. "What's the weather in Cluj?" produces a vector off in a completely different region. The model has effectively turned meaning into geometry — and geometry is something a computer can measure with plain arithmetic.

A few properties worth internalizing before we write code:

Dimensionality is fixed per model. A given model always outputs the same length (e.g. 384 for all-MiniLM-L6-v2, 1,536 for OpenAI's text-embedding-3-small). You can't mix vectors from different models — they live in different spaces.
The individual numbers are not interpretable. Dimension 200 doesn't mean "formality" or "topic." Meaning is distributed across all dimensions at once. Don't try to read them.
Distance is the entire point. You almost never care about a vector's absolute position — only how close it is to other vectors.

Cosine similarity: the one operation you'll use everywhere

To ask "how similar are these two texts?", you compare their vectors. The near-universal choice for text embeddings is cosine similarity: it measures the angle between two vectors, ignoring their magnitude.

Why the angle and not, say, straight-line (Euclidean) distance? Because for text embeddings, direction encodes meaning while length often encodes uninteresting things like text length or token count. Two documents about the same topic point the same way even if one is a sentence and the other a paragraph. Cosine similarity captures exactly that.

The formula is just the dot product of the two vectors divided by the product of their magnitudes:

cos(θ) = (A · B) / (‖A‖ · ‖B‖)

It returns a value from -1 to 1, though for most modern text embeddings you'll see results land in roughly the 0-to-1 range:

~1.0 — nearly identical meaning
~0.5 — loosely related
~0.0 — unrelated

That single number, computed against a corpus of stored vectors, is semantic search. Everything else is optimization.
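A quick numeric check of the formula, using toy 2-D vectors rather than real embeddings, shows the property that matters: scaling a vector changes its length but not its cosine similarity. The helper name `cos_sim` is just for this illustration.

```python
import numpy as np


def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 0.0])

same_direction = cos_sim(a, 3 * a)                 # longer vector, same direction
diagonal = cos_sim(a, np.array([1.0, 1.0]))        # 45 degrees apart
orthogonal = cos_sim(a, np.array([0.0, 1.0]))      # 90 degrees apart
opposite = cos_sim(a, -a)                          # pointing the other way

print(same_direction, round(diagonal, 4), orthogonal, opposite)
```

The four results are 1, about 0.707, 0, and -1, matching the angle-only interpretation above.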

Build a semantic search engine from scratch

Let's make it concrete. We'll use sentence-transformers, which runs a capable embedding model locally — no API key, no network calls, so you can run this offline right now.

pip install sentence-transformers numpy

Step 1 — Embed a corpus

from sentence_transformers import SentenceTransformer
import numpy as np

# A small, fast, widely used model. 384-dimensional output.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your account password",
    "Refund policy for annual subscriptions",
    "Setting up two-factor authentication",
    "Our office hours and contact details",
    "How to recover a locked account",
]

# Encode the whole corpus once. Shape: (5, 384)
doc_embeddings = model.encode(documents)
print(doc_embeddings.shape)  # (5, 384)

That doc_embeddings array is your search index. In a real app you compute it once, when a document is created or updated, and store it — never on every query.
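One simple way to "compute once and store" for a small corpus is to write the array to disk and load it at start-up. This is a sketch, not the only option: the file name is arbitrary, and random numbers stand in for the output of `model.encode(documents)` so the snippet runs without downloading a model.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the (n_docs, dim) array returned by model.encode(documents).
doc_embeddings = np.random.default_rng(0).normal(size=(5, 384)).astype(np.float32)

# Hypothetical location; in a real app this would be a stable path or object store.
index_path = Path(tempfile.mkdtemp()) / "doc_embeddings.npy"

np.save(index_path, doc_embeddings)  # run at indexing time
loaded = np.load(index_path)         # run at application start-up

print(loaded.shape, np.array_equal(loaded, doc_embeddings))
```

Keep the document texts (or their IDs) stored in the same order as the rows, since the array alone does not record which row belongs to which document.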

Step 2 — Cosine similarity in NumPy

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between vector `a` and each row of matrix `b`."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

Five lines, no dependencies beyond NumPy. This is the actual core of semantic search; the rest is plumbing.

Step 3 — Search

def search(query: str, k: int = 3):
    query_vec = model.encode(query)
    scores = cosine_similarity(query_vec, doc_embeddings)
    ranked = np.argsort(scores)[::-1][:k]  # top-k, highest first
    return [(documents[i], float(scores[i])) for i in ranked]

for doc, score in search("I forgot my login credentials"):
    print(f"{score:.2f}  {doc}")

Run it, and you'll get something like:

0.62  How to reset your account password
0.55  How to recover a locked account
0.31  Setting up two-factor authentication

Notice what happened: the query "I forgot my login credentials" shares zero words with "How to reset your account password," yet it ranked first. A keyword search would have returned nothing. That's the payoff — you matched on meaning, not on string overlap. This shift from lexical to semantic matching is the foundation every retrieval-augmented system builds on, and it's the starting point of a structured course on RAG and retrieval-augmented generation that goes from this toy index to production retrieval.

From toy to production: what changes

The NumPy version is perfect for learning and fine for a few thousand documents. Past that, three things force an upgrade.

You need a vector database

Computing cosine similarity against every stored vector on every query is O(n) — fine at 5 documents, painful at 5 million. Vector databases solve this with Approximate Nearest Neighbor (ANN) indexes (HNSW is the common one) that trade a sliver of accuracy for enormous speed, returning near-neighbors in milliseconds over huge corpora.

You have good open-source options:

pgvector — a Postgres extension. If your data already lives in Postgres, this is often the pragmatic choice: vectors and relational data in one place, one backup story.
Chroma / Qdrant / Weaviate / Milvus — purpose-built vector stores with richer filtering and scaling stories.
FAISS — a library (not a server) from Meta for fast similarity search when you want to manage the index yourself.

A minimal Chroma example shows how little the mental model changes:

import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(documents=documents, ids=[f"d{i}" for i in range(len(documents))])

results = collection.query(query_texts=["I forgot my login credentials"], n_results=3)
print(results["documents"])

Chroma embeds the text for you and handles the index. Same concept, production ergonomics.

Chunking matters more than the model

You rarely embed whole documents. A 30-page PDF becomes one vector that's an average of everything and a good match for nothing, and many embedding models silently truncate input past their maximum sequence length, so most of the document may never be encoded at all. In practice you chunk documents into passages (a few hundred tokens, often with slight overlap) and embed each chunk. Get chunking wrong and even a great embedding model returns mush, which is one of the most common reasons retrieval systems quietly underperform. Chunking strategy, overlap, and metadata are exactly the unglamorous details that separate a demo from a dependable system, and they're covered in depth in a course on advanced LLM integration for production apps.
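A minimal sketch of fixed-size chunking with overlap follows. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption; a production splitter would use the embedding model's tokenizer and usually respect sentence or section boundaries. The function name and defaults are hypothetical.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(sample, size=4, overlap=1)
for chunk in small_chunks:
    print(chunk)
```

With ten words, a window of four, and an overlap of one, this yields three chunks, and the boundary words `w3` and `w6` each appear in two of them. That repetition is what keeps a sentence that straddles a boundary retrievable.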

Choosing and swapping embedding models

Embedding models differ in dimensionality, speed, cost, and language coverage — and critically, you must embed your corpus and your queries with the same model. Change the model and you re-embed everything. For multilingual apps (Romanian included), pick a model with strong multilingual training rather than an English-first one, or your non-English recall will suffer silently. Public benchmarks like the MTEB leaderboard on Hugging Face are the sane starting point for comparing models on retrieval quality rather than vibes.

Hybrid search: when semantic alone isn't enough

Here's a lesson that surprises people: pure semantic search is worse than keyword search for certain queries. Ask for an exact product code, a specific error like ERR_CONN_REFUSED, a person's name, or an acronym, and embeddings can betray you — they match on meaning, and a precise identifier has little semantic meaning to spread around. The embedding for ERR_CONN_REFUSED sits near "connection problems" generally, so a document about a different connection error can outrank the exact match.

The production answer is hybrid search: run both a keyword search (classic lexical scoring like BM25) and a semantic search, then combine the rankings. Keyword search nails exact terms, identifiers, and rare words; semantic search nails paraphrase and intent. Together they cover each other's blind spots.

The standard way to merge the two result lists is Reciprocal Rank Fusion (RRF) — a simple, robust formula that scores each document by its rank in each list rather than by raw scores that live on incompatible scales:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse multiple ranked lists of doc IDs into one score per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:            # e.g. [keyword_results, semantic_results]
        for rank, doc_id in enumerate(ranking, start=1):  # RRF ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

Because RRF only looks at positions, you don't have to normalize BM25 scores against cosine similarities — a notorious apples-to-oranges trap. Most serious vector databases (Weaviate, Qdrant, and others) now offer hybrid search with RRF built in, precisely because "just embeddings" quietly underperforms on real, messy query logs. If you take one production lesson from this article, make it this: measure your recall on real queries, and reach for hybrid the moment exact-match queries show up.
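A worked example makes the fusion behavior easier to see. The snippet below is self-contained: `fuse` is a local sketch of the same reciprocal-rank formula with 1-based ranks, and the two result lists are made-up document IDs, not output from a real search.

```python
def fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) for each document across all ranked lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_results = ["d3", "d1", "d2"]   # hypothetical BM25 order
semantic_results = ["d1", "d2", "d4"]  # hypothetical embedding order

fused = fuse([keyword_results, semantic_results])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

`d1` and `d2` come out on top because they appear in both lists, even though `d3` was the keyword search's first result. Documents that only one retriever found still make the fused list, just lower down.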

Where embeddings show up beyond search

Semantic search is the gateway, but the same vectors power a lot more:

RAG (retrieval-augmented generation) — retrieve relevant chunks by embedding similarity, then feed them to an LLM as grounding context. Embeddings are the retrieval half; without good retrieval, generation hallucinates.
Deduplication & clustering — near-duplicate detection and topic clustering fall out of distances almost for free.
Recommendations — "items similar to this one" is a nearest-neighbor query in embedding space.
Classification — embed labeled examples, then classify new items by nearest neighbors, often without training a dedicated model.
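The classification item in the list above can be sketched as a k-nearest-neighbor vote over cosine similarity. The 2-D vectors here are toy stand-ins for real embeddings, and the labels, the `classify` helper, and `k=3` are assumptions chosen to keep the example readable.

```python
import numpy as np

# Toy 2-D stand-ins for the embeddings of labeled examples.
train_vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]])
train_labels = ["billing", "billing", "technical", "technical"]


def classify(vec: np.ndarray, k: int = 3) -> str:
    """Majority label among the k most cosine-similar labeled examples."""
    unit_train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = unit_train @ (vec / np.linalg.norm(vec))
    nearest = np.argsort(sims)[::-1][:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)


print(classify(np.array([0.7, 0.2])))
print(classify(np.array([0.1, 0.6])))
```

No model is trained here: adding a new class means adding labeled examples, which is why this approach is attractive when labels change often.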

The through-line: any time you need "similar in meaning" rather than "matches exactly," embeddings are the tool. Building these features end to end — API to product, with the retrieval and orchestration wired up properly — is the spine of a hands-on course on building AI applications in Python with the OpenAI and Anthropic SDKs.

Common mistakes that cost you hours

A few traps that catch almost everyone the first time:

Re-embedding the corpus on every query. Embed documents once, store the vectors, embed only the incoming query at search time.
Mixing models. Query vectors and document vectors must come from the same embedding model. A silent mismatch produces garbage rankings with no error.
Forgetting to normalize. If you compute raw dot products instead of cosine similarity (and your vectors aren't already unit-normalized), longer texts get an unfair boost. Normalize, or use a library that does.
Embedding documents that are too large. One vector per giant document averages meaning into uselessness. Chunk first.
Trusting a single similarity threshold forever. The "good enough" cutoff depends on your model and data. Measure it on real queries; don't hardcode 0.8 because a blog post said so.
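The normalization trap from the list above is easy to reproduce with toy vectors. In this sketch the second "document" points 45 degrees away from the query but has a much larger magnitude, so a raw dot product ranks it first while cosine similarity does not. The vectors are invented for the demonstration.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],  # same direction as the query, small magnitude
    [5.0, 5.0],  # 45 degrees off, large magnitude
])

raw_dot = docs @ query

# Normalize both sides to unit length, then the dot product is the cosine.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cosine = unit_docs @ (query / np.linalg.norm(query))

print("raw dot product winner:", int(np.argmax(raw_dot)))
print("cosine winner:         ", int(np.argmax(cosine)))
```

The raw dot product picks the second document and cosine picks the first. If you normalize the stored vectors once at indexing time, a plain dot product at query time gives the cosine ranking without the extra division.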

Conclusion

Vector embeddings are one of those ideas that feels like magic until you see the mechanics — and then it's just geometry. Text becomes a point in space, meaning becomes distance, and search becomes "find the nearest points." You built exactly that in a few lines of Python, from raw NumPy cosine similarity to a Chroma-backed index, and the same core idea scales from a toy corpus to millions of documents behind an ANN index.

Start where we started: embed a handful of your own documents, run a query that shares no keywords with the right answer, and watch it surface anyway. Once that clicks, RAG, recommendations, and semantic features stop looking like separate topics and start looking like one tool applied five ways. If you want the structured path from here — retrieval, chunking, evaluation, and production wiring — Cursuri-AI.ro builds it step by step.

Sources & further reading:

Sentence-Transformers — Official documentation
Hugging Face — MTEB: Massive Text Embedding Benchmark leaderboard
pgvector — Open-source vector similarity search for Postgres
Chroma — Open-source embedding database

This article is educational content. Model names, dimensions, and library APIs evolve; verify current details in the official documentation before building production systems.

Context Engineering for AI Agents: Beyond Prompt Engineering

galian — Thu, 09 Jul 2026 13:08:24 +0000

You wrote a great prompt. It worked beautifully in the playground — one question, one clean answer. Then you wired the same model into an agent that runs twenty steps, calls six tools, and reads back their output, and somewhere around step twelve it started forgetting the goal, calling the wrong tool, or confidently acting on something it misread three steps ago. The prompt didn't get worse. The context did.

This is the gap that context engineering fills. Prompt engineering is about writing one good instruction. Context engineering is about managing the entire set of tokens a model sees at inference time — across a long, multi-step run — so the signal stays high and the model keeps making good decisions. Anthropic frames it as the natural progression of prompt engineering, and if you're building anything agentic in 2026, it's the discipline that separates a demo from a system.

What context engineering actually is

Start with a precise definition. Context is the full set of tokens you include when you sample from a large language model. Not just your prompt — the system instructions, the tool definitions, the examples, the running message history, the retrieved documents, the tool results fed back in. Everything in the window.

Context engineering is the set of strategies for curating and maintaining the optimal set of those tokens during inference. The goal, in one line: find the smallest set of high-signal tokens that reliably produces the outcome you want.

The reason this is a distinct discipline from prompt engineering is the shape of the problem. A prompt is something you write once and it stays put. Context in an agent is dynamic — it grows on every turn as the model reads files, calls tools, and accumulates history. You're not authoring a static string anymore; you're managing a budget that fills up on its own, and deciding continuously what earns a place in it and what gets thrown out. That's an engineering problem, and it's the foundation the whole prompt-to-production journey builds on.

Why "just add more context" is the wrong instinct

The intuitive move, when an agent makes a mistake, is to give it more: more instructions, more examples, more history, more retrieved documents. Sometimes that helps. Very often it makes things worse, and here's why.

A model's effective attention is a finite resource. Every token you add competes with every other token for the model's limited ability to attend to what matters. Past a certain point, adding context doesn't add capability — it dilutes it. The relevant fact is now buried among ten irrelevant ones, and the model attends to the wrong thing.

This shows up empirically. In "Lost in the Middle," Liu et al. found that models use information most reliably when it sits at the start or end of a long context, and least reliably when it's stuck in the middle. As context grows, retrieval of any single fact inside it gets less reliable, a degradation sometimes called context rot. A 200K-token window does not mean you should put 200K tokens in it. Capacity is not the same as attention.

So the mental model to internalize: context is a budget, not a bucket. You're not trying to fill it. You're trying to spend it on the highest-signal tokens available and refuse everything else. Every technique below is a way to enforce that discipline.

The four things competing for your window

Before the techniques, know your spenders. In a running agent, four categories of tokens fight for the same finite budget:

The system prompt — your instructions, role, constraints. Usually small, high-value, and stable.
Tool definitions — the schemas describing every tool the agent can call. These are sneakily expensive: each tool's description sits in context on every turn, whether or not it's used.
Message history — the accumulating transcript of the conversation and the agent's own steps. This is the one that grows without bound and quietly eats the window.
Retrieved / external data — documents, search results, file contents, database rows pulled in to ground the model.

The overall guidance from Anthropic is worth memorizing: keep each of these informative yet tight. Not empty — an under-specified system prompt or a missing tool leaves the model guessing. But not bloated either. The art is the calibration, and it's different for each category. Let's work through the techniques that manage them.
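To make the four spenders visible, it can help to track them side by side. The sketch below is a minimal illustration, not a real tokenizer: the 4-characters-per-token ratio is a crude assumption, and `ContextBudget` and its category names are hypothetical helpers invented for this example.

```python
from dataclasses import dataclass, field

# Assumption: ~4 characters per token. Replace with your provider's token counter.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate for budgeting purposes only."""
    return max(1, len(text) // CHARS_PER_TOKEN) if text else 0


@dataclass
class ContextBudget:
    """Tracks estimated token spend per category against a fixed limit."""
    limit: int
    spend: dict = field(default_factory=lambda: {
        "system_prompt": 0,
        "tool_definitions": 0,
        "message_history": 0,
        "retrieved_data": 0,
    })

    def add(self, category: str, text: str) -> None:
        if category not in self.spend:
            raise KeyError(f"unknown category: {category}")
        self.spend[category] += estimate_tokens(text)

    def total(self) -> int:
        return sum(self.spend.values())

    def share(self) -> dict:
        """Fraction of the current spend taken by each category."""
        total = self.total() or 1
        return {name: used / total for name, used in self.spend.items()}

    def over_budget(self) -> bool:
        return self.total() > self.limit
```

Logging `share()` every few turns is usually enough to spot which category is quietly eating the window.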

Technique 1: Compaction — summarize the history before it drowns you

Message history is the runaway spender. A long agent run accumulates hundreds of turns of "called tool, got 4KB of JSON back, reasoned about it, called the next tool." Most of those raw tool outputs are dead weight three steps later — you needed the conclusion, not the 4KB.

Compaction is the fix: periodically replace a chunk of verbose history with a tight summary that preserves the decisions, the key facts, and the current state, while dropping the raw noise. When the agent has finished investigating something, you don't need the full transcript of the investigation in context — you need "here's what I found and what it means for the task."

Practical rules:

Compact at natural boundaries, not mid-reasoning — after a sub-task completes, when a phase ends.
Preserve the load-bearing details: open questions, decisions made, constraints discovered, current state. Drop the intermediate chatter and the raw dumps.
Keep the goal pinned. The single most common long-run failure is the agent losing the plot on the original objective. The goal should survive every compaction.

Done well, compaction is what lets an agent run for hundreds of steps without either overflowing its window or forgetting why it started.
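The rules above can be sketched in a few lines. This is one possible shape, not a standard API: `compact_history` is a hypothetical helper, `summarize` is whatever you plug in (in practice an LLM call), and `naive_summarize` is only a stand-in so the example runs offline.

```python
from typing import Callable

Message = dict  # {"role": str, "content": str}


def compact_history(
    goal: str,
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older turns with one summary message, keeping the goal pinned."""
    split = max(0, len(history) - keep_recent)
    older, recent = history[:split], history[split:]

    # The goal is re-stated first on every compaction, so it can never be dropped.
    compacted = [{"role": "system", "content": f"GOAL (pinned): {goal}"}]
    if older:
        compacted.append({"role": "system", "content": "SUMMARY: " + summarize(older)})
    compacted.extend(recent)
    return compacted


def naive_summarize(messages: list[Message]) -> str:
    # Stand-in for a real summarizer: keeps the first 60 characters of each turn.
    return " | ".join(m["content"][:60] for m in messages)
```

Call it at a phase boundary rather than on a timer, and pass in only the turns since the last compaction plus the previous summary, so decisions made earlier carry forward.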

Technique 2: External memory — let the agent write things down

The window is not the only place to keep information. The most effective long-running agents treat the context window as working memory and offload durable state to external memory — a file, a scratchpad, a structured store the agent reads from and writes to deliberately.

Instead of carrying every fact in-context forever, the agent writes a note ("the auth module uses JWT with 15-minute expiry; the bug is in the refresh path") to a persistent store, and pulls it back only when relevant. The context window stays lean; the knowledge doesn't get lost. This is exactly how a human engineer works — you don't hold the entire codebase in your head, you keep notes and open the file when you need it.

This pattern — persistent, deliberately-managed memory outside the window — is the core of building agents that hold state over long horizons, and it's what turns a stateless model into something that can work a problem across a session without drowning.
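A scratchpad does not need to be sophisticated to be useful. Here is a minimal sketch, assuming a single JSON file on disk and simple keyword recall; `NoteStore` and its methods are hypothetical names for this example, and a real system might use a database or embeddings instead.

```python
import json
from pathlib import Path


class NoteStore:
    """Tiny file-backed scratchpad: notes live on disk, not in the context window."""

    def __init__(self, path: str):
        self.path = Path(path)
        if not self.path.exists():
            self.path.write_text("[]", encoding="utf-8")

    def _load(self) -> list[dict]:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, topic: str, note: str) -> None:
        """Append a note; the agent calls this when it learns something durable."""
        notes = self._load()
        notes.append({"topic": topic, "note": note})
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def recall(self, keyword: str) -> list[str]:
        """Return only the notes whose topic or text mentions the keyword."""
        kw = keyword.lower()
        return [
            n["note"] for n in self._load()
            if kw in n["topic"].lower() or kw in n["note"].lower()
        ]
```

Exposed to the agent as two tools (write a note, recall by keyword), only the recalled notes re-enter the window, and only when the agent asks for them.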

Technique 3: Sub-agents — isolate context so the main thread stays clean

Here's a structural technique that most people never turn on. When a sub-task is going to burn a lot of tokens — "find every call site of this function across the repo," "research these five libraries" — don't do it in the main agent's context. Delegate it to a sub-agent: a separate agent instance with its own isolated window that does the noisy work and reports back a clean result.

The win is context hygiene. The thousands of tokens of file contents and search output that the investigation churns through stay in the sub-agent's context and die with it. The main agent gets back a two-paragraph summary, and its own window stays focused on the actual task instead of silting up with intermediate noise. As a bonus, independent sub-tasks run in parallel instead of serially.

The rule of thumb: delegate when work is independent, parallelizable, or context-heavy; keep it inline when it's sequential and cheap. Knowing when to fan out to sub-agents and how to orchestrate them without stepping on each other is a central skill in building AI agents and automation, and it's one of the highest-leverage context moves you have.
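The isolation idea can be shown without any LLM at all. In this sketch, each sub-task gets its own scratch list (standing in for an isolated context window), runs in a thread, and hands back only a short summary. `fan_out`, `run_subagent`, and `fake_worker` are hypothetical names; a real worker would be a full agent loop.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_subagent(task: str, worker: Callable[[str, list], str]) -> str:
    """Run one sub-task in its own scratch context and return only the summary."""
    scratch: list[str] = []  # isolated context; discarded when this function returns
    return worker(task, scratch)


def fan_out(tasks: list[str], worker: Callable[[str, list], str],
            max_workers: int = 4) -> dict[str, str]:
    """Run independent sub-tasks in parallel; the caller only ever sees summaries."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(lambda t: run_subagent(t, worker), tasks))
    return dict(zip(tasks, summaries))


def fake_worker(task: str, scratch: list) -> str:
    # Stand-in for a real agent loop: generates lots of noisy intermediate output.
    scratch.extend(f"raw output {i} for {task}" for i in range(1000))
    return f"{task}: examined {len(scratch)} items"
```

The main agent's context grows by one line per sub-task, no matter how much the worker churned through to produce it.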

Technique 4: Just-in-time retrieval — pull data when needed, not upfront

There are two ways to get external data into an agent. The naive way is to preload: at the start, dump everything the agent might conceivably need into context — the whole document, all the schemas, every config file. The problem is obvious once you see it: you're spending your budget on maybes, and most of it goes unused while crowding out what matters.

The better pattern is just-in-time retrieval: give the agent the ability to fetch data (a search tool, a file reader, a database query) and let it pull exactly what it needs, exactly when it needs it. Instead of "here are all 40 files," it's "here's a tool to read a file — go get the one you need." The agent loads the relevant chunk into context at the moment of use, acts on it, and (with compaction) lets it fall away afterward.

This mirrors how retrieval-augmented systems already work, and getting the retrieval layer right — what to fetch, how to rank it, how much to bring back — is where advanced LLM integration earns its keep. Preloading feels safer; just-in-time is what actually scales.
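The contrast between the two approaches fits in three small functions. This is a sketch under simple assumptions (a directory of UTF-8 text files, a character cap as a proxy for a token cap); the function names are hypothetical and not taken from any agent framework.

```python
from pathlib import Path


def preload_all(root: str) -> str:
    """Naive approach: concatenate every file into the context up front."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    return "\n".join(p.read_text(encoding="utf-8") for p in files)


def list_files(root: str) -> list[str]:
    """Cheap index the agent can keep in context: paths only, no contents."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())


def read_file(root: str, relative_path: str, max_chars: int = 4000) -> str:
    """Tool the agent calls on demand; returns at most max_chars of one file."""
    base = Path(root).resolve()
    target = (base / relative_path).resolve()
    if base not in target.parents:
        raise ValueError("path escapes the allowed root")
    return target.read_text(encoding="utf-8")[:max_chars]
```

With `list_files` and `read_file` exposed as tools, the agent carries a list of paths instead of 40 files' worth of contents, and the cap on `read_file` keeps a single oversized file from blowing the budget.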

Technique 5: Tool curation — the failure mode hiding in plain sight

Remember that tool definitions sit in context on every turn. That makes a bloated tool set a double tax: it burns tokens continuously, and it degrades decisions. Anthropic calls out one of the most common failure modes directly — tool sets that cover too much functionality or create ambiguous, overlapping choices about which tool to use.

The tell is a sharp one: if a human engineer can't say for certain which tool should be used in a given situation, an AI agent can't either. Fifteen tools with fuzzy, overlapping responsibilities will produce worse behavior than six sharp, non-overlapping ones — and cost more tokens doing it.

So curate the toolbox like you'd curate an API:

Fewer, sharper tools. Each with a clear, distinct job and an unambiguous "use this when…"
No overlap. Two tools that could both plausibly handle the same request create a decision point where the agent will sometimes pick wrong.
Prune ruthlessly. A tool that's rarely the right choice is paying rent in your context on every single turn. Cut it.

Tool curation is the least glamorous technique here and often the highest-ROI. It's pure subtraction, and subtraction is exactly what context engineering rewards.
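Both halves of the double tax can be roughly estimated before you ship a tool set. The sketch below is a heuristic, not a validated method: it assumes tools are plain dicts with `name` and `description` keys, uses the same crude 4-characters-per-token guess, and flags overlap by shared words (Jaccard similarity), which is only a prompt for a human review.

```python
import json
from itertools import combinations


def definition_cost(tools: list[dict], turns: int, chars_per_token: int = 4) -> int:
    """Estimated tokens spent on tool definitions over a run (they repeat every turn)."""
    per_turn = sum(len(json.dumps(t)) for t in tools) // chars_per_token
    return per_turn * turns


def _words(text: str) -> set[str]:
    # Ignore short words so "the", "a", "for" do not count as overlap.
    return {w.strip(".,").lower() for w in text.split() if len(w) > 3}


def overlapping_pairs(tools: list[dict], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Flag tool pairs whose descriptions share many words (Jaccard similarity)."""
    flagged = []
    for a, b in combinations(tools, 2):
        wa, wb = _words(a["description"]), _words(b["description"])
        union = wa | wb
        if union and len(wa & wb) / len(union) >= threshold:
            flagged.append((a["name"], b["name"]))
    return flagged
```

Any pair this flags is worth the human test from above: can you say for certain which of the two the agent should pick?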

How you know it's working: measure it

Every technique above is a change to your context, and changes to context are exactly the kind of thing that feels better while being worse — or vice versa. If your only signal is "I ran a few queries and it seemed fine," you're tuning blind, and you'll ship a regression the day you compact one turn too aggressively and the agent starts forgetting a constraint.

The fix is the same as anywhere in production ML: an evaluation set. Assemble representative tasks with known good outcomes, and re-run them every time you change how context is managed. Then you can say "compaction at phase boundaries held task success at 0.9 while cutting average tokens 40%" instead of "I think it's better now." Treating evaluation as first-class rather than an afterthought is what turns context engineering from a craft into engineering — a number that moves when you change something, not a vibe.
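An evaluation set can start very small. Here is a minimal sketch of the loop, assuming your agent can be wrapped as a function that returns its output plus the tokens it used; `EvalTask` and `run_eval` are hypothetical names, and the checks are whatever "known good outcome" means for your tasks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(agent: Callable[[str], tuple[str, int]], tasks: list[EvalTask]) -> dict:
    """Run every task; report success rate and average tokens used per task."""
    if not tasks:
        return {"success_rate": 0.0, "avg_tokens": 0.0}
    passed, tokens = 0, 0
    for task in tasks:
        output, used = agent(task.prompt)
        passed += int(task.check(output))
        tokens += used
    return {"success_rate": passed / len(tasks), "avg_tokens": tokens / len(tasks)}
```

Run it once before a context change and once after, on the same tasks, and compare the two reports. That gives you both numbers from the sentence above: task success and average tokens.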

Putting it together

Context engineering isn't a framework you install; it's a posture you adopt toward the model's window. The whole discipline collapses to one principle applied relentlessly: spend the finite budget on the smallest set of high-signal tokens that does the job.

In practice, for a real agent, that means:

Compact the history at natural boundaries so a long run doesn't drown in its own transcript.
Offload durable state to external memory instead of carrying it forever.
Delegate noisy, independent work to sub-agents so the main window stays clean.
Retrieve just-in-time instead of preloading everything you might need.
Curate the tools hard — fewer, sharper, no overlap.
Measure with evals so every change is verified, not hoped.

None of these are exotic, and that's the point. The model is already capable. What makes the capability hold up over a twenty-step run isn't a cleverer prompt — it's disciplined management of everything the model reads along the way.

Conclusion

Prompt engineering taught us to write one good instruction. Context engineering is what you need the moment that instruction has to survive a long, tool-using, self-accumulating agent run — which is to say, the moment you build anything real. The failure you saw at step twelve was never the model getting dumber. It was the context getting noisier, and no one curating it.

Adopt the budget mindset, apply the five techniques, and put an eval set behind every change. Do that, and the same model that fell apart at step twelve will run to step fifty and still know exactly what it's doing — because you engineered what it was looking at the whole way.

The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with hands-on, current tracks on context engineering, AI agents, LLM integration, and evaluating AI systems in production.

Sources & further reading:

Anthropic — Effective context engineering for AI agents (definition, the four context components, tool-set failure modes, compaction and memory)
Anthropic — Effective harnesses for long-running agents
Liu et al. — Lost in the Middle: How Language Models Use Long Contexts

This article is educational content. Model behavior, context limits, and tooling evolve quickly; validate approaches against your own workloads and current official documentation.

Run LLMs Locally with Ollama in 2026: The Practical Developer Guide

galian — Sun, 05 Jul 2026 21:11:24 +0000

For years, "run the model locally" was the option you mentioned and then didn't take: the models were too weak, the tooling too fiddly, and the cloud APIs too convenient. In 2026 that calculus has genuinely shifted. Open-weight models in the 12–35B range now handle real coding and agent workloads, Apple Silicon got a dedicated inference engine, and Ollama quietly became a drop-in backend for the same tools you already use against cloud APIs — including Claude Code.

I teach practical AI engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, and local inference has gone from a curiosity module to one of the questions companies ask us most — usually spelled "how do we use LLMs without sending our data anywhere?" This guide is the answer I give developers: what changed, what hardware you actually need, which models are worth pulling, and how to plug it all into a real workflow.

As always with this space: versions and model rankings move monthly. Everything below is verified against Ollama's official blog and docs as of early July 2026 — re-check before you build a budget or an architecture on it.

Why local, and why now

Three arguments have survived contact with production; the rest is mostly vibes.

Privacy and data residency. With a local model, prompts and outputs never leave your machine (or your VPC, if you self-host on a server). For anyone dealing with client data, medical text, legal documents, or GDPR-sensitive workloads, this eliminates the entire "what does the provider do with my data" conversation instead of managing it through contracts. This is the single biggest adoption driver we see in Europe, and it's the backbone of our course on local LLMs, self-hosting, and privacy.

Cost shape. Cloud APIs bill per token; local inference bills you once, in hardware you may already own. For high-volume, latency-tolerant workloads — batch classification, summarization pipelines, internal tooling — a mid-range GPU that's already on someone's desk can absorb work that would otherwise be a real monthly line item. (For low-volume or frontier-quality work, cloud still wins. More on that below.)

No external dependency. A local model doesn't get deprecated, rate-limited, price-changed, or suspended out from under you. After the model-availability surprises of the last year, "at least one workload runs on weights we control" has become a reasonable line item in a resilience plan, not paranoia.

What actually changed in Ollama in 2026

If you last touched Ollama when it was "a nice wrapper around llama.cpp," the 2026 releases are the reason to look again. All of this is from Ollama's official blog:

Anthropic API compatibility (January, v0.14.0). Ollama now exposes a native Anthropic-style /v1/messages endpoint. This is the sleeper feature of the year: Anthropic-native tools — most notably Claude Code — can talk to a local model directly, with no proxy or translation layer. There's a matching OpenAI-compatible endpoint too, so Codex and OpenAI-SDK apps work the same way.
ollama launch (January). A single command that configures and starts a coding agent against a local model — ollama launch claude sets up Claude Code, prompts you to pick a model, and you're in.
Experimental image generation (January). Early days, but the scope of "local model" is no longer text-only.
MLX engine on Apple Silicon (March preview → June release). Ollama moved its Mac inference path to Apple's MLX framework, which exploits unified memory. Ollama's own framing for the June release: its highest performance on Apple Silicon yet — faster output with reduced memory usage.
Ollama 0.30 and 0.31 (June). Version 0.30 brought improved performance and broader GGUF model compatibility through llama.cpp; 0.31 made Gemma 4 significantly faster on Apple Silicon via multi-token prediction (MTP), enabled by default.

The theme is clear: Ollama is positioning itself less as a hobbyist toy and more as the standard local backend for agentic tooling.

Getting started in five minutes

Install (macOS and Windows have installers at ollama.com/download; on Linux):

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model:

ollama pull gemma4
ollama run gemma4

That's a working local chat. Ollama also starts a local server on port 11434, which is where the interesting part begins — every API-based tool you have can point at it.

Useful daily commands: ollama ls (installed models), ollama ps (what's loaded and where — CPU vs GPU), ollama rm <model> (free disk space; models are multi-gigabyte).

Hardware: the honest sizing guide

The rule of thumb that matters: a model quantized to 4 bits needs very roughly 0.5–0.7 GB of memory per billion parameters, plus overhead for context. Everything else follows from that.

Your hardware	What runs comfortably	Experience
8 GB RAM, no GPU	3–8B models, quantized	Fine for chat, drafting, classification. Slow but usable on CPU
16 GB RAM (Apple Silicon)	8–14B models	Good daily-driver territory; MLX made this tier notably faster in 2026
24 GB+ (M-series Pro/Max or a 24 GB GPU)	27–35B models	Where local coding models get genuinely useful
48 GB+ unified memory / multi-GPU	Large MoE models	Server-class local inference

Two nuances that save people disappointment:

Quantization is why any of this works. Models ship in compressed 4–8 bit variants (the GGUF ecosystem) that trade a small quality loss for a 2–4× memory reduction. Ollama's default tags are already quantized — you rarely need to think about it, but it explains why a "27B model" fits in 24 GB.
Mixture-of-experts (MoE) models need memory for their total parameters but compute like their active subset. NVIDIA's Nemotron-3-Super, for example, is a 120B model with 12B active parameters: it runs faster than its size suggests, but you still need the RAM to hold it.

Context length eats memory too — an agent session with 32K+ tokens of context adds real overhead on top of the weights. If you're sizing for coding agents, budget for that, not just the model file.
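The sizing rule is easy to turn into a quick check. This sketch applies the 0.5–0.7 GB per billion parameters range from above; the 2 GB default for context overhead is an assumption made for illustration, not a measured figure, and `fit_verdict` is a hypothetical helper.

```python
def weights_memory_gb(total_params_billion: float) -> tuple[float, float]:
    """(low, high) estimate for 4-bit weights at 0.5-0.7 GB per billion parameters."""
    return (total_params_billion * 0.5, total_params_billion * 0.7)


def fit_verdict(total_params_billion: float, memory_gb: float,
                context_overhead_gb: float = 2.0) -> str:
    """Classify a model against available memory.

    For MoE models pass the TOTAL parameter count, not the active subset:
    all experts have to be held in memory.
    context_overhead_gb is an assumed placeholder; long agent sessions need more.
    """
    low, high = weights_memory_gb(total_params_billion)
    if high + context_overhead_gb <= memory_gb:
        return "comfortable"
    if low + context_overhead_gb <= memory_gb:
        return "tight"
    return "too large"
```

For example, `fit_verdict(27, 24)` comes out "comfortable" while `fit_verdict(35, 24)` comes out "tight", which is the difference between the bottom and the top of that 27–35B tier on a 24 GB machine.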

The mid-2026 open-weight lineup worth knowing

Rankings churn monthly, so treat this as a map, not a leaderboard. From Ollama's model library, the families that matter right now:

Gemma 4 (12B–31B) — Google's open family, currently the most-pulled model on Ollama. Multimodal, tuned for reasoning and agentic work, and the main beneficiary of the MLX/MTP speedups on Macs.
Qwen3.5 / Qwen3.6 (0.8B–122B) — the ecosystem's Swiss army knife. Qwen3.5 spans everything from edge-tiny to server-large; Qwen3.6 (27B–35B) focuses on agentic coding. qwen3-coder is Ollama's own recommendation for coding-agent use.
GLM-5 family — flagship-class open weights (GLM-5 is 744B total / 40B active); strong at coding and long-horizon tasks. Too big for most desktops locally, but available as :cloud variants (see below) and self-hostable on serious hardware.
Nemotron-3-Super (120B MoE, 12B active) — NVIDIA's entry, aimed at multi-agent applications.
MiniMax-M3 — notable for a 1M-token context window, if your workload is long-document analysis.
Specialists: GLM-OCR for document understanding, TranslateGemma (4B–27B, 55 languages) for translation, LFM2 (24B) for on-device deployment, Ornith (9B–35B) for agentic coding.

Sensible defaults: on a 16 GB Mac, start with gemma4:12b. On 24 GB+, try qwen3-coder for code and gemma4:27b for general work. Then run your tasks on them — a model's rank on someone's benchmark tells you little about your use case.

The part that changes your workflow: Ollama as a drop-in API

Ollama's server speaks both major API dialects, which means "switch to a local model" is now a base-URL change, not a rewrite.

OpenAI-compatible (/v1) — any OpenAI-SDK app works:

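from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server instead of the cloud API.
# The SDK requires an api_key value; Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma4",  # any model you have already pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
)
print(resp.choices[0].message.content)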

Anthropic-compatible (/v1/messages) — and this is the one with teeth, because it means Claude Code runs against local models. Per Ollama's official docs:

export ANTHROPIC_AUTH_TOKEN=ollama       # accepted but not validated
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model qwen3-coder

Or let Ollama do the wiring for you:

ollama launch claude

Honest caveats before you get excited:

Ollama recommends at least 32K tokens of context for Claude Code, and its model suggestions for coding are qwen3-coder locally (30B — you want 24 GB+ of VRAM/unified memory) or glm-4.7:cloud / minimax-m2.1:cloud via Ollama's cloud, which keeps the same API surface but runs the weights remotely.
The compatibility layer doesn't cover everything: no prompt caching, no token-counting endpoint, no forced tool selection, no batches API, no PDF inputs (images must be base64). If your workflow leans on those, you'll feel it.
A 30B local model is not Opus, and it isn't trying to be. It's "capable pair of hands on an airplane / on confidential code," not "frontier reasoning."

The pattern that actually works in practice is routing: local models for the private, high-volume, or offline work; frontier cloud models for the hard reasoning. Deciding which tier a task belongs to — and building the escalation path — is an architecture skill, and it's exactly the kind of decision we drill in our AI system architecture course.
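A routing layer can start as a plain function. The sketch below is one possible policy, not a recommendation for every team: the `Task` flags are hypothetical, the models are passed in as simple callables, and the rule that privacy and offline constraints always win is an assumption you should adjust to your own requirements.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    contains_private_data: bool = False
    needs_frontier_reasoning: bool = False
    offline: bool = False


def route(task: Task) -> str:
    """Return 'local' or 'cloud'. Privacy and offline constraints win over quality."""
    if task.contains_private_data or task.offline:
        return "local"
    if task.needs_frontier_reasoning:
        return "cloud"
    return "local"  # default: routine, high-volume work stays local


def answer(task: Task,
           local_model: Callable[[str], str],
           cloud_model: Callable[[str], str],
           good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Try the routed tier; escalate a weak local draft only when that is allowed."""
    if route(task) == "cloud":
        return "cloud", cloud_model(task.prompt)
    draft = local_model(task.prompt)
    can_escalate = not (task.contains_private_data or task.offline)
    if can_escalate and not good_enough(draft):
        return "cloud", cloud_model(task.prompt)
    return "local", draft
```

Note the one case this policy deliberately does not solve: a private task that also needs frontier reasoning stays local and gets the local model's best effort. Flagging those for human review is a reasonable next step.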

When local is the wrong choice

Being a fan of local inference means knowing where it loses:

Frontier-quality reasoning. For the hardest tasks, top cloud models remain clearly ahead of anything you can run on a workstation. If wrong answers are expensive, don't fight this.
Low-volume workloads. If you make a few thousand API calls a month, per-token billing is cheaper than any GPU. Local pays off at volume, at privacy constraints, or at both.
Ops you don't want. A self-hosted model is a service you now run: updates, monitoring, capacity. ollama run on a laptop is trivial; a team-wide inference server is real infrastructure.
Multimodal breadth and long-tail capabilities. Cloud APIs still bundle more (native PDF understanding, larger tool ecosystems, batch APIs) than the local stack replicates.

One more thing people conflate: running a model locally is different from customizing one. If your actual goal is a model that speaks your domain language or follows your house style, that's a fine-tuning question — LoRA adapters on an open-weight base, then serving the result through Ollama. That pipeline (when to fine-tune vs when to just engineer the prompt) is its own discipline, covered in our fine-tuning course.

Frequently asked questions

Is Ollama free?

The tool itself is open source and free. The models each carry their own licenses — Gemma, Qwen, GLM and friends have different terms, some with restrictions on commercial use. Check the license tab on the model's Ollama page before you ship a product on it. Ollama's optional cloud models are a paid service.

What hardware do I need to run LLMs locally in 2026?

As a rule of thumb at 4-bit quantization: 8 GB of RAM runs 3–8B models, 16 GB runs 8–14B comfortably (especially on Apple Silicon with the MLX engine), and 24 GB+ opens up the 27–35B class where local coding models get genuinely useful. More context = more memory on top of the weights.

Can a local model replace GPT or Claude?

For a growing set of tasks — summarization, classification, drafting, routine coding on mid-size codebases — yes, credibly. For frontier reasoning and the highest-stakes accuracy, no. Production teams typically route: local for private/high-volume work, cloud for the hard 10%.

Can I really use Claude Code with Ollama?

Yes. Since Ollama v0.14.0 (January 2026) there's native Anthropic Messages API compatibility: set ANTHROPIC_BASE_URL=http://localhost:11434, run claude --model qwen3-coder — or just ollama launch claude. Expect a capable assistant, not Opus-level reasoning, and note that prompt caching and a few other API features aren't supported through the compatibility layer.

Ollama vs llama.cpp vs vLLM — which should I use?

Ollama for developer experience: one command, model management, dual API compatibility. llama.cpp (which powers Ollama's GGUF path) for maximum control and minimal footprint. vLLM for high-throughput multi-user serving on server GPUs. Most developers should start with Ollama and only move down the stack when they hit a concrete limit.

The skill underneath the tool

Here's the uncomfortable part: pulling a model is the easy 5%. The value shows up when you can answer the questions around it — which model for which task, how to measure whether the local model is good enough for your workload instead of guessing, how to build the routing and fallback so privacy-sensitive work stays local while hard problems escalate to a frontier model. That's engineering judgment, not tooling trivia.

That judgment is what we teach at our AI education platform — hands-on courses built around real repositories and an interactive AI instructor, covering the full local-to-cloud spectrum: self-hosting and privacy, fine-tuning, architecture, and the agentic workflow on top.

Conclusion

In 2026, local LLMs crossed the line from hobby to infrastructure option. Ollama's dual API compatibility means your existing tools — including Claude Code — can run against open weights with a base-URL change; the MLX engine made a 16 GB MacBook a legitimate inference machine; and the open-weight lineup in the 12–35B range is good enough for a real slice of production work.

The play isn't "cancel your API keys." It's knowing which slice of your workload belongs on weights you control — then running it there deliberately, measured, with an escalation path for everything else. Start with ollama run gemma4 tonight; you're one evening away from having an informed opinion.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from local LLMs and self-hosting to agentic coding, evals, and AI system architecture.

Sources: Ollama Blog · Anthropic API compatibility — Ollama Docs · Claude Code with Anthropic API compatibility — Ollama · Ollama Model Library · Download Ollama

Claude Sonnet 5 Just Made Running Agents Cheap — What Builders Actually Need to Know

galian — Tue, 30 Jun 2026 22:06:37 +0000

Anthropic shipped Claude Sonnet 5 on June 30, 2026, and the framing in the announcement is unusually blunt for a model launch: it's pitched as the most agentic Sonnet yet — a model built to make plans, drive tools like browsers and terminals, and run autonomously at a level that, a few months ago, took something bigger and more expensive.

For anyone building on top of these models — agents, pipelines, coding tools — that's the headline that matters. Not "it's smarter," but "near-frontier capability just got cheaper to run in a loop." I write and teach about agentic engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, so I'll keep this grounded in what changes for people who actually ship on these APIs — not the launch-day benchmark theater.

One disclaimer up front: model pricing and availability in this space change almost monthly, and this is a day-one snapshot. Verify the current numbers on Anthropic's official pages before you wire anything to a budget. I'm deliberately not quoting benchmark scores here — the launch materials presented them in a way that's easy to misread, so for hard numbers go straight to the Sonnet 5 System Card.

The one-sentence version

Sonnet 5 moves "good enough to run agents autonomously" down a price tier — and ships a new tokenizer that can quietly inflate your token counts by up to 35%.

Both halves of that sentence matter, and the second one is the part nobody puts on a launch slide. Let's take them in order.

What's actually new

Stripping the marketing down to verifiable claims from Anthropic's own announcement, here's what Sonnet 5 is:

The most agentic Sonnet so far. It's described as able to "make plans, use tools like browsers and terminals, and run autonomously," with improvements specifically in multi-step tool use — the exact workload that defines an agent rather than a chatbot.
Close to Opus 4.8 — at a lower price. Anthropic's own phrasing is that its "performance is close to that of Opus 4.8, but at lower prices." That's the whole pitch: most of the capability, a fraction of the cost.
A real step up from Sonnet 4.6. Called a "substantial improvement over its predecessor, Sonnet 4.6, on important aspects of agentic performance like reasoning, tool use, coding, and knowledge work."
Safer in agentic contexts. Anthropic reports an "overall lower rate of undesirable behaviors than Sonnet 4.6," plus lower rates of hallucination and sycophancy — which matters more than it sounds when a model is acting in a loop without a human reading every step.
Deliberately weaker at offensive cyber. It shows "substantially poorer performance than models such as Opus 4.8" on dangerous cyber tasks and was "never able to develop a full working exploit." That's a safety design choice, not an oversight — worth knowing if security tooling is your domain.

Two things Anthropic did not publish that I'm not going to invent for you: an official context window and max output token figure for Sonnet 5 weren't stated in the launch materials at the time of writing. If you need those for capacity planning, pull them from the official API docs rather than trusting a blog (including this one). Guessing is how teams ship broken truncation logic.

The economics shift is the real story

Here's why builders should care more than end users.

When you chat with a model, price-per-token is almost noise — you send a few thousand tokens and read the answer. When you run an agent, the model is in a loop: read context, call a tool, read the result, reason, call another tool, repeat. A single "task" can burn hundreds of thousands of tokens across dozens of turns. At that volume, the price-per-million-tokens line is your unit economics.

So a model that lands near Opus-4.8 quality at Sonnet pricing doesn't just make chat cheaper — it changes which agent designs are economically viable at all. Workflows you'd previously gate behind Opus (multi-step research, autonomous refactors, long tool-using runs) become defensible on a Sonnet budget. That's the unlock.

Here's the day-one pricing picture, with the rest of the current Anthropic lineup for context:

Model	Input / 1M tokens	Output / 1M tokens	Notes
Sonnet 5 (intro, through Aug 31 2026)	$2	$10	Promotional launch pricing
Sonnet 5 (standard, from Sep 1 2026)	$3	$15	Same as Sonnet 4.6's tier
Opus 4.8	$5	$25	Top accuracy; default in Claude Code
Haiku 4.5	$1	$5	Cheapest / fastest tier

A few honest notes:

The introductory $2 / $10 runs through August 31, 2026, then settles to $3 / $15 — the same standard tier Sonnet has occupied. So the long-run story isn't "Sonnet got cheaper"; it's "the Sonnet tier got dramatically more capable for the same price."
Sonnet 5 is the default model on Free and Pro plans, and is available to Max, Team, and Enterprise users — in Claude Code, the Claude platform, and the API. So if you're on Claude Code, you may already be one model-switch away from it.
At standard pricing, Opus 4.8 costs roughly 1.7× as much as Sonnet 5 (output $25 vs $15, input $5 vs $3); during the intro window the gap is 2.5×. When you're running agents at scale, that multiple compounds fast, which is exactly why the "close to Opus" claim is worth pressure-testing on your workload, not taking on faith.
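To make the unit-economics point concrete, here is a minimal sketch of cost per agent task using the prices from the table. The dictionary keys are labels for this example rather than API model identifiers, and the token counts for the sample task are hypothetical.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
# The keys are labels for this sketch, not official API model identifiers.
PRICES = {
    "sonnet-5-intro": (2.00, 10.00),
    "sonnet-5-standard": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
    "haiku-4.5": (1.00, 5.00),
}


def cost_per_task(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one agent task, given total tokens summed across all turns."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical task: 400k input tokens and 40k output tokens over many turns.
    for model in PRICES:
        print(f"{model:18s} ${cost_per_task(model, 400_000, 40_000):.2f}")
```

Multiply the per-task figure by your daily task volume and the gap between tiers stops being a rounding error.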

The tokenizer gotcha that will mess up your cost math

This is the part I most want builders to internalize, because it's the easiest way to get a nasty surprise on your next invoice.

Sonnet 5 ships with an updated tokenizer. Anthropic states that the same input text now maps to roughly 1.0–1.35× as many tokens as before, depending on content type. Read that again: identical prompts can consume up to 35% more tokens on Sonnet 5 than you measured on an older model, before any change in per-token price.

Why it bites:

Your cost dashboards, budget alerts, and per-request estimates were calibrated on the old tokenizer. Swap the model without re-measuring and your "same" workload silently costs more.
Code, structured data (JSON/XML), and non-English text tend to sit at the higher end of that multiplier — and those are precisely the inputs agentic and coding workloads are made of.
It interacts with context windows and truncation: more tokens for the same text means you hit limits sooner than your old math predicts.

The fix is boring and non-negotiable: re-baseline. Before you flip production traffic to Sonnet 5, measure real token counts on a representative sample of your prompts with the new tokenizer, recompute cost per task, and update your budgets and alerts. The headline price drop is real, but your effective cost relative to before is (price ratio) × (token inflation), and you can't know the second factor without measuring. At the top of Anthropic's stated range, a 33% price cut combined with 1.35× token inflation nets out to roughly a 10% saving. Anyone who tells you "it's 33% cheaper" did half the arithmetic.
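The re-baselining arithmetic can be sketched in a few lines: the new cost relative to the old one is the per-token price ratio multiplied by the token inflation you measure. The inflation values below are sample points inside the stated 1.0–1.35× range, not measurements.

```python
def effective_cost_ratio(old_price: float, new_price: float, token_inflation: float) -> float:
    """New cost divided by old cost for the same text.

    token_inflation is new_token_count / old_token_count, measured on your own prompts.
    """
    return (new_price / old_price) * token_inflation


if __name__ == "__main__":
    # Intro input price ($2) versus the standard Sonnet tier ($3).
    for inflation in (1.0, 1.15, 1.35):
        ratio = effective_cost_ratio(3.0, 2.0, inflation)
        print(f"inflation {inflation:.2f}x -> {(1 - ratio) * 100:.0f}% cheaper")
```

The same function also covers the post-promotion case: at $3 vs $3 the price ratio is 1.0, so any token inflation above 1.0 is a pure cost increase.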

This is also where good evals earn their keep. A model swap isn't just a cost change; it's a behavior change. Run your task suite on Sonnet 5 against the model you're replacing before you commit — quality, tool-call success rate, and cost together. If you don't have an eval harness yet, this is the launch that should convince you to build one; it's a discipline we treat as core, not optional, in our course on building LLM evals for production.

When to still reach for Opus 4.8

"Close to Opus" is not "Opus." The honest read on where Sonnet 5 fits:

Reach for Sonnet 5 as your default agent workhorse: high-volume tool-using loops, coding assistance, research and summarization, anything where you're paying per turn and the marginal quality of Opus isn't worth ~1.7× the output cost.
Stay on Opus 4.8 for the hardest reasoning, the highest-stakes accuracy, and security-sensitive work where Sonnet 5 is intentionally weaker (offensive-cyber tasks). When a wrong answer is expensive, the price gap is cheap insurance.

The pattern most production teams land on isn't "pick one." It's a router: Sonnet 5 handles the bulk of turns, and you escalate to Opus 4.8 for the steps that genuinely need it — with a human in the loop on the consequential ones. Getting that routing logic right (and knowing which task belongs in which tier) is a real engineering skill, and it's the through-line of our model-comparison course, which treats "which model for which job" as a decision you make with data rather than vibes.
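A router like this can start very small. The sketch below is one possible shape, not a prescribed design: the tier labels, the step categories, and the `failed_quality_check` signal are all hypothetical and would come from your own evals.

```python
from dataclasses import dataclass

# Hypothetical tier labels; substitute the model identifiers your provider documents.
DEFAULT_MODEL = "sonnet-5"
ESCALATION_MODEL = "opus-4.8"

# Hypothetical step categories that justify the more expensive tier.
ESCALATE_KINDS = {"hard_reasoning", "high_stakes", "security_review"}


@dataclass(frozen=True)
class Route:
    model: str
    needs_human_review: bool


def route_step(kind: str, consequential: bool = False, failed_quality_check: bool = False) -> Route:
    """Pick a tier for one agent step; escalate on step kind or on a failed check."""
    escalate = kind in ESCALATE_KINDS or failed_quality_check
    model = ESCALATION_MODEL if escalate else DEFAULT_MODEL
    # Consequential steps get a human checkpoint regardless of the tier chosen.
    return Route(model=model, needs_human_review=consequential)


if __name__ == "__main__":
    print(route_step("summarize"))
    print(route_step("high_stakes", consequential=True))
    print(route_step("summarize", failed_quality_check=True))
```

Keeping the rule set in plain data makes it easy to adjust as eval results show which steps actually need the higher tier.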

A pragmatic migration checklist

If you're considering moving an agent or pipeline to Sonnet 5, here's the order I'd do it in:

Re-baseline tokens. Run a representative sample through the new tokenizer. Recompute cost per task. Update budget alerts.
Run your evals. Quality, tool-call success, latency, and cost, head-to-head against the model you're replacing. No eval suite? Build a small one first — even 30 representative tasks beats a gut call.
Shadow, then canary. Route a slice of real traffic to Sonnet 5, compare outputs, then scale gradually. Don't flip 100% on day one.
Keep an escalation path. Wire Opus 4.8 as the fallback for tasks that fail Sonnet 5's quality bar. Routing beats an all-or-nothing bet.
Re-read your safety posture. Lower hallucination and sycophancy is good news for autonomous runs, but "safer" isn't "supervise nothing." Keep guardrails and human checkpoints where consequences are real.
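For the canary step in the checklist above, one common approach is a deterministic hash-based split, so the same request or user always lands in the same bucket. This is a minimal sketch; `request_id` stands in for whatever stable identifier your system has.

```python
import hashlib


def canary_bucket(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to 'canary' or 'control'.

    Hashing the ID keeps the assignment stable across retries and processes.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 9999], i.e. hundredths of a percent.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "control"


if __name__ == "__main__":
    sample = [f"req-{i}" for i in range(10_000)]
    share = sum(canary_bucket(r, 10) == "canary" for r in sample) / len(sample)
    print(f"canary share at 10%: {share:.1%}")
```

Raising `canary_percent` only ever moves requests from control to canary, so the rollout grows without reshuffling who already sees the new model.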

None of this is exotic. It's the same discipline that separates teams who run agents in production from teams who demo them — and it's exactly the muscle we build in our hands-on track on AI agents and automation, taught around real repositories rather than toy notebooks.

Frequently asked questions

Is Claude Sonnet 5 better than Opus 4.8?

Not across the board. Anthropic positions Sonnet 5's performance as close to Opus 4.8 at a lower price — so for high-volume agentic and coding work it's often the better value, but Opus 4.8 still leads on the hardest reasoning, top-end accuracy, and (deliberately) on offensive-cyber capability. Match the tier to the task instead of picking a favorite.

How much does Claude Sonnet 5 cost?

It launched with introductory pricing of $2 per million input tokens and $10 per million output tokens through August 31, 2026, then moves to a standard $3 / $15 — the same tier Sonnet 4.6 occupied. Your effective cost also depends on the new tokenizer (see below), so measure before you budget.

Does the new tokenizer really change my costs?

Yes. Anthropic states the same input can map to roughly 1.0–1.35× as many tokens under Sonnet 5's updated tokenizer, depending on content type — code and structured data sit at the higher end. Re-measure your real prompts before assuming the headline price drop equals your actual saving.

Can I use Sonnet 5 in Claude Code?

Yes. It's available in Claude Code, the Claude platform, and the API, and it's the default model on Free and Pro plans (and available to Max, Team, and Enterprise). If you're already in Claude Code, switching is a model selection, not a migration.

Should I migrate my agents to Sonnet 5 immediately?

Don't flip production on day one. Re-baseline token counts, run your eval suite head-to-head against your current model, then canary a slice of traffic before scaling — and keep an escalation path to Opus 4.8 for tasks that need it.

The skill underneath the model

Here's the part the launch posts skip: a cheaper, more agentic model doesn't make anyone a better builder. It just makes the consequences of your design bigger — cheaper to be right at scale, and cheaper to be confidently wrong at scale. Point Sonnet 5's autonomy at a vague spec and you get a fast, plausible wall of actions you didn't design and can't fully audit.

The developers getting real leverage from this launch aren't the ones who memorized the new price-per-token. They're the ones who understand agent architecture, context engineering, evals, and cost modeling well enough to know when the cheap-and-autonomous option is the right call and when it's a trap. That foundation — taught around real repositories with an interactive AI instructor, not slide decks — is what we build at our Eastern European AI education platform, including a dedicated, hands-on track on agentic coding with Claude Code.

Conclusion

Claude Sonnet 5 is a genuinely significant release for builders, but not for the reason most coverage leads with. The story isn't a benchmark number. It's that near-frontier agentic capability just moved down a price tier, which changes which agent designs are economically worth shipping. The catch is the new tokenizer: the real saving is the price drop scaled by token inflation, and you only learn the second number by measuring.

So don't migrate on the headline. Re-baseline your tokens, run your evals, canary your traffic, and keep Opus 4.8 one route away for the work that needs it. Do that, and Sonnet 5 is one of the better deals in the 2026 model lineup. Skip it, and you'll find out the hard way — on your invoice.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from agentic coding and AI agents to evals, context engineering, and the modern AI-native workflow.

Sources: Introducing Claude Sonnet 5 — Anthropic · Claude Sonnet 5 System Card · Claude Platform — Pricing · Claude Pricing

Cursor vs GitHub Copilot vs Claude Code: Which AI Coding Tool in 2026?

galian — Mon, 29 Jun 2026 14:59:29 +0000

If you write code for a living in 2026, you're not asking whether to use an AI coding tool — you're asking which one. And the three names that dominate every team's Slack debate are Cursor, GitHub Copilot, and Claude Code. They look similar from a distance (type intent, get code) but they're built on three genuinely different bets about how software gets written.

I've spent serious time in all three on real, multi-file, multi-repo work — not toy demos — and this is the comparison I wish someone had handed me before I burned a month figuring it out. I write and teach about agentic engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, so I'll keep this grounded in how these tools actually behave in production, not in launch-day marketing.

A note before we start: pricing and features in this category change almost monthly. Everything below is a mid-2026 snapshot — verify the current numbers on each tool's official page before you budget for a team.

TL;DR — three different philosophies

Here's the one-sentence version of each, before we go deep:

Cursor is an AI-native editor — it rebuilt the IDE around the agent. Best for developers who want fast, fluid, in-the-flow generation with deep editor integration.
GitHub Copilot is the ecosystem play — it lives where your code, issues, and PRs already are. Best for teams standardized on GitHub who want AI woven through the whole SDLC.
Claude Code is the terminal-first agent — it treats the command line as the primary surface and excels at autonomous, multi-step, multi-file work. Best for engineers comfortable orchestrating agents rather than babysitting autocomplete.

None of them is "the best." They optimize for different moments, and the real skill is knowing which to reach for. Let's break down why.

What is Cursor?

Cursor is an AI-native IDE built as a fork of VS Code, so the editor feels instantly familiar — your extensions, keybindings, and themes mostly carry over. What's different is that the AI isn't bolted on as a plugin; the whole editing experience is designed around it.

Its signature features:

Tab completion — a multi-line, context-aware autocomplete that predicts your next edit, not just the next token. It's the feature people miss most when they switch away.
Composer — Cursor's agentic, multi-file editing mode. You describe a change in natural language and it edits across files, runs commands, and iterates. Cursor now ships Composer 2.5, its own model trained specifically for agentic coding, alongside routing to frontier models from Anthropic, OpenAI, and Google.
Cloud Agents — introduced in the Cursor 3.5 release (May 20, 2026), these run in isolated cloud VMs with terminal and browser access, can work across multiple repos in parallel, and report results back to your IDE asynchronously. It's Cursor's answer to "I want the agent working while I do something else."

Cursor's center of gravity is in-the-flow coding: you stay in the editor, you see every diff, and the AI keeps pace with your thinking. It rewards developers who want speed without giving up granular control over the code.

What is GitHub Copilot?

Copilot is the most widely deployed of the three, and its biggest advantage is gravitational: it lives inside the tools and platform most teams already use. It runs in VS Code, JetBrains IDEs, Visual Studio, and on GitHub itself.

By 2026 Copilot has grown well past autocomplete:

Agent mode became generally available across both VS Code and JetBrains in March 2026 (previously VS Code only) — a multi-step agent that plans, edits across files, and runs commands inside your editor.
The autonomous coding agent is the standout. You assign a GitHub issue to Copilot, and it works asynchronously in the background — analyzing the repo, making changes, and opening a ready-to-review pull request. Assign, walk away, come back to a PR. It's the closest any mainstream tool comes to "fire-and-forget" feature work.
Agentic code review gathers full project context before suggesting changes and can hand fixes straight to the coding agent.
GitHub Spark lets you describe an app in plain English and get generated code with a live preview.

The strategic point: Copilot's value isn't any single feature — it's that AI is now threaded through the entire GitHub-centric SDLC, from issue to PR to review. If your team lives on GitHub, that integration is hard to beat.

One billing change worth flagging: as of June 1, 2026, GitHub moved to GitHub AI Credits (token-based billing) in place of the older Premium Request Units. You're now billed by tokens processed at published model rates, which makes heavy agent usage more transparent — and easier to accidentally overspend if you're not watching.

What is Claude Code?

Claude Code, from Anthropic, takes the opposite stance from Cursor: instead of building an editor, it makes the terminal the primary surface (with IDE extensions available on top). That sounds minimalist until you see what it does with full shell access.

Its defining strengths:

Agentic, multi-file, repo-aware work from the command line — it reads your codebase, makes coordinated changes across many files, runs your tests, and handles git operations and CI-aware workflows natively.
Subagents — reusable agent configurations with their own custom prompts and tool access, so you can define a "reviewer," a "test-writer," or a "migration" agent and invoke it on demand.
Agent teams and multi-agent orchestration — coordinate multiple agent sessions working in parallel, with an agent view dashboard to manage them.

Claude Code runs on Anthropic's models — currently Claude Opus 4.8 as the default, with the newer Claude Fable 5 as the most capable tier — and it's deliberately model-opinionated rather than a router. The tradeoff is real: it's the most powerful for autonomous, complex tasks, and the least hand-holdy. It assumes you're comfortable thinking like an orchestrator of agents rather than a writer of lines.

A word of caution that applies to every agent platform but bites hardest here: parallel agents multiply your token spend. Running ten agents at once consumes your quota roughly ten times faster. The autonomy is exhilarating; the bill is real. Set limits before you scale up.
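The multiplier is easy to put numbers on before you scale up. A rough sketch, assuming every agent burns tokens at the same steady rate (real sessions will vary); the quota and burn rate are hypothetical.

```python
def hours_until_exhausted(quota_tokens: int, tokens_per_agent_hour: int, agents: int) -> float:
    """Hours of runtime before the quota is gone, assuming a uniform burn rate per agent."""
    return quota_tokens / (tokens_per_agent_hour * agents)


def max_agents_for(quota_tokens: int, tokens_per_agent_hour: int, hours_needed: float) -> int:
    """Largest agent count that still lasts hours_needed on the given quota."""
    return int(quota_tokens // (tokens_per_agent_hour * hours_needed))


if __name__ == "__main__":
    # Hypothetical numbers: a 10M-token quota and 500k tokens per agent per hour.
    for agents in (1, 5, 10):
        hours = hours_until_exhausted(10_000_000, 500_000, agents)
        print(f"{agents:2d} agents -> {hours:.1f} hours")
```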

Head-to-head: the dimensions that actually matter

The editing model

Cursor wins on in-editor flow. Tab completion and inline diffs keep you in control of every change.
Copilot wins on breadth of surface — it's good everywhere your code already is.
Claude Code wins on autonomous depth — it goes furthest without supervision, but you give up the inline, line-by-line feel.

Agents and autonomy

All three now have agents, but the philosophy differs. Cursor's Cloud Agents and Copilot's coding agent are both "assign work, get a result later." Claude Code goes further with explicit multi-agent orchestration and reusable subagents. If your work is increasingly delegating rather than typing, this is the dimension to weigh most — and it's exactly the shift that makes understanding AI agent architecture and automation a genuine career edge rather than a nice-to-have.

Ecosystem and integration

This is Copilot's home turf. The issue-to-PR loop, native code review, and presence across every major IDE make it the path of least resistance for GitHub-standardized teams. Cursor integrates deeply but inside its editor; Claude Code integrates deeply with your shell and git, which is either liberating or intimidating depending on your comfort with the command line.

Models

Cursor routes across many frontier models and adds its own Composer model. Copilot offers a model picker. Claude Code is Anthropic-only by design. If model choice matters to you (and for some workloads it genuinely does), Cursor and Copilot give you more knobs; Claude Code bets that a tightly-integrated, top-tier model beats a buffet.

Pricing, side by side (mid-2026 snapshot)

Tool	Entry	Mid tier	Power / team
Cursor	Hobby (free)	Pro — $20/user/mo	Teams — $40/user/mo (Standard), $120/user/mo (Premium)
GitHub Copilot	Free	Pro — $10/mo · Pro+ — $39/mo	Max — $100/mo · Business / Enterprise seats
Claude Code	Pro — $20/mo	Max 5× — $100/mo	Max 20× — $200/mo · API pay-per-token

A few honest caveats on cost:

Copilot has the cheapest paid entry tier ($10), but token-based AI Credits mean heavy agent use can climb fast beyond the included allotment.
Cursor's $20 Pro includes a fixed amount of frontier-model usage; power users hit the ceiling and either upgrade or switch to its cheaper Auto/Composer routing.
Claude Code's Max tiers are priced for sustained, agent-heavy sessions — and again, parallel agents are a multiplier, not an add.

Prices and tiers shift constantly in this category. Treat the table as a snapshot, not a quote, and confirm before committing a team budget.

So which one should you choose?

Here's the honest, persona-based answer:

Choose Cursor if you want the best in-editor experience, you value fast inline generation and tight control over every diff, and you're happy living inside a (very good) VS Code fork. It's the most natural upgrade for a developer who loves their editor and wants AI to keep pace with their flow.

Choose GitHub Copilot if your team is standardized on GitHub and you want AI woven through the entire lifecycle — issues, PRs, reviews — across whatever IDEs your team already uses. The issue-to-PR autonomous agent alone can change how a team ships. It's the safest institutional bet.

Choose Claude Code if you're comfortable in the terminal, your work skews toward complex multi-file refactors and autonomous tasks, and you want to orchestrate agents rather than supervise autocomplete. It has the highest ceiling for autonomy — and asks the most of you in return.

And the answer most senior engineers actually land on? More than one. Plenty of us keep Cursor open for flow-state editing, lean on Copilot inside the GitHub workflow, and fire up Claude Code for the gnarly autonomous jobs. The tools overlap, but they're not redundant — they're a toolkit. The real meta-skill isn't loyalty to one editor; it's fluency across the category so you instinctively reach for the right one per task.

The skill underneath the tools

Here's the uncomfortable truth that the demos hide: these tools amplify the engineer you already are. Point a powerful agent at a vague intent and you get a fast, confident wall of code you didn't design and can't fully maintain. The developers getting outsized leverage from Cursor, Copilot, and Claude Code aren't the ones who learned the keyboard shortcuts — they're the ones who understand agent architecture, context engineering, and how to specify intent precisely enough that autonomy becomes an asset instead of a liability.

That foundation is exactly what we build at our AI education platform for Eastern Europe — practical, project-based courses taught around real repositories with an interactive AI instructor, not slide decks. If you want to go from "I use these tools" to "I get serious leverage from them," we maintain dedicated, hands-on tracks for using Cursor as a pro and for agentic coding with Claude Code — both built around real multi-file, real-repo workflows rather than toy examples.

Conclusion

In 2026, "AI coding tool" isn't one product category — it's three philosophies wearing similar clothes. Cursor bet on the editor, Copilot bet on the ecosystem, and Claude Code bet on the terminal-native agent. Each is genuinely excellent at the thing it optimized for, and genuinely compromised at the things it didn't.

So don't ask "which is best." Ask "best at what, for whom, doing which task" — and then build the judgment to switch fluently between them. That judgment, not the tool, is what compounds over a career. Try each one on a real feature, not a demo, and you'll feel the differences fast.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from agentic coding and AI agents to context engineering and the modern AI-native IDE workflow.

Sources: Cursor Models & Pricing · GitHub Copilot Plans & Pricing · GitHub Copilot Plans (Docs) · Claude Pricing · Claude Platform Docs — Pricing

Kiro: A Practical Guide to AWS's Spec-Driven Agentic IDE

galian — Fri, 26 Jun 2026 21:45:36 +0000

If you've spent any time with AI coding assistants, you know the failure mode: you write a vague prompt, the agent generates a wall of plausible-looking code, and twenty minutes later you're debugging something you didn't design and don't fully understand. Kiro, the agentic IDE from AWS, is a bet that the fix isn't a smarter autocomplete — it's making a specification the unit of work instead of a prompt.

I've been digging into how Kiro actually works, and this is the practical guide I wish I'd had on day one: what spec-driven development really means, how agent hooks and steering files change your workflow, where Kiro fits next to tools like Cursor and Claude Code, and when it's worth it. I write and teach about agentic engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, so I'll keep this grounded in how these tools behave in real projects rather than in launch-day hype.

What is Kiro?

Kiro is an agentic IDE built on the Code OSS platform — the same open-source foundation behind VS Code — which means the editor itself feels immediately familiar. What's different is the engine. Instead of treating each request as a one-off chat turn, Kiro is designed to turn a high-level prompt into a structured spec, then drive implementation, tests, and documentation from that spec.

The headline idea, in Kiro's own framing, is "moving beyond AI coding to agentic engineering." That sounds like marketing until you see the artifacts it produces. A feature request doesn't become a blob of code — it becomes three reviewable files: requirements, design, and tasks. You stay in the loop at each stage. The agent does the typing; you keep the judgment.

It's worth being precise about what Kiro is not: it isn't an AWS cloud service you provision in a console, and it doesn't lock you into AWS infrastructure to write code. It's a desktop IDE. You can point it at any project.

Spec-driven development: the core idea

Most AI coding tools optimize for speed-to-first-keystroke. Spec-driven development optimizes for correctness-to-intent — does the code match what you actually meant? Kiro does this by formalizing the part of engineering we usually skip when we're moving fast: writing down what we're building before we build it.

When you describe a feature, Kiro generates a spec in three phases:

1. Requirements

Kiro turns your prompt into user stories with explicit acceptance criteria, written in EARS notation (Easy Approach to Requirements Syntax). EARS is a lightweight, real technique for writing testable requirements — patterns like "When [trigger], the system shall [response]". The value is that ambiguity gets surfaced before code exists. If your one-line prompt was underspecified, you'll see it in the requirements draft and can correct it in seconds, not after a debugging session.
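To show why the event-driven EARS pattern is testable, here is a small sketch that checks whether a requirement fits the "When [trigger], the system shall [response]" shape and pulls out its parts. The regex and function name are my own illustration, not something Kiro exposes, and the sample requirement is made up.

```python
import re
from typing import Optional

# Event-driven EARS pattern: "When <trigger>, the <system> shall <response>".
EVENT_DRIVEN = re.compile(
    r"^When (?P<trigger>.+?), the (?P<system>.+?) shall (?P<response>.+?)\.?$",
    re.IGNORECASE,
)


def parse_event_driven(requirement: str) -> Optional[dict]:
    """Return the trigger/system/response parts, or None if the text doesn't fit the pattern."""
    match = EVENT_DRIVEN.match(requirement.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_event_driven(
        "When a user requests their order history, the system shall return the five most recent orders."
    ))
    print(parse_event_driven("Show recent orders quickly"))
```

A requirement that fails to parse is usually one that hasn't named its trigger or its observable response yet, which is exactly the ambiguity the requirements phase is meant to surface.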

2. Design

Next, Kiro produces a technical design: the architecture, the components, the data flow, and the implementation approach. This is the document a senior engineer would normally write (or wish a junior had written) before touching the codebase. You review it, push back, and refine.

3. Tasks

Finally, the design becomes a sequenced task list — discrete, trackable units of work the agent implements in order. Because tasks are explicit, you get accountability: you can see what's done, what's in progress, and what's left, instead of trusting a black box.

The payoff is maintainability. A spec that lives in your repo is documentation that doesn't rot, because it is the thing the agent built from. Six months later, the requirements and design files explain why the code looks the way it does.

Agent hooks: automation that runs itself

The second pillar is agent hooks — automated triggers that fire agent prompts or shell commands when something happens in your IDE. Instead of remembering to run the linter, regenerate tests, or scan for secrets, you wire those actions to events once and forget about them.

Hooks can be triggered by:

File events — a file is created, saved, or deleted
Prompt and agent lifecycle events — prompt submit, agent stop, pre/post tool use
Spec task events — before or after a task executes
Manual triggers — a button you press on demand

Under the hood, hooks are just JSON files. Workspace-level hooks live in .kiro/hooks/, and user-level hooks in ~/.kiro/hooks/. You can create them three ways: describe what you want in plain English and let Kiro generate the JSON, fill out a form, or write the JSON by hand. The practical version of this: every time you save a file, a hook can run your tests and a security scan automatically, so problems surface the moment they're introduced — not in CI an hour later.
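Because hooks are plain JSON on disk, you can audit them with ordinary tooling. The sketch below lists and parses whatever sits in the two hook directories named above; it deliberately assumes nothing about file extensions or the hook schema, so check the Kiro Hooks docs for the actual fields.

```python
import json
from pathlib import Path
from typing import Optional


def load_hooks(workspace: Path, home: Optional[Path] = None) -> dict:
    """Parse every file in the workspace and user hook directories as JSON.

    Returns a mapping of file path to parsed content, or to an error string
    when a file is not valid JSON.
    """
    home = home if home is not None else Path.home()
    results = {}
    for directory in (workspace / ".kiro" / "hooks", home / ".kiro" / "hooks"):
        if not directory.is_dir():
            continue
        for path in sorted(p for p in directory.iterdir() if p.is_file()):
            try:
                results[str(path)] = json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError) as error:
                results[str(path)] = f"unreadable: {error}"
    return results


if __name__ == "__main__":
    for path, content in load_hooks(Path.cwd()).items():
        print(path, "->", content)
```

Since workspace hooks live in the repo, a check like this can also run in code review to flag hooks that were added or changed.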

Steering files: stop repeating yourself

If you've ever pasted "remember, we use tabs not spaces, we use Vitest not Jest, and never import from the legacy module" into chat for the hundredth time, steering files are the fix. Steering gives Kiro persistent knowledge about your project through markdown files, so your conventions, libraries, and standards are applied consistently without re-explaining them every session.

Steering files can be scoped two ways:

Workspace steering lives in .kiro/steering/ and applies only to that project
Global steering applies across everything you build

This is essentially context engineering applied to a coding agent — encoding the durable knowledge an agent needs so it behaves like a teammate who's read your style guide, not a contractor seeing the repo for the first time. If you want to go deep on the discipline behind this, persistent memory and context strategy for agents is exactly what our context engineering and agent memory course covers end to end.
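As an illustration, a workspace steering file can be bootstrapped from a script. This sketch writes the conventions from the example above into `.kiro/steering/`; the filename `conventions.md` is hypothetical, and it assumes plain markdown with no required metadata, so confirm any front-matter options in the Kiro Steering docs.

```python
from pathlib import Path

# The conventions quoted earlier, written once instead of pasted into chat.
CONVENTIONS = """\
# Project conventions

- Indent with tabs, not spaces.
- Write tests with Vitest, not Jest.
- Never import from the legacy module.
"""


def write_workspace_steering(project_root: Path, filename: str = "conventions.md") -> Path:
    """Create .kiro/steering/<filename> under project_root and return its path."""
    steering_dir = project_root / ".kiro" / "steering"
    steering_dir.mkdir(parents=True, exist_ok=True)
    target = steering_dir / filename
    target.write_text(CONVENTIONS, encoding="utf-8")
    return target


if __name__ == "__main__":
    print("wrote", write_workspace_steering(Path.cwd()))
```

Committing the file means every teammate and every agent session starts from the same rules.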

MCP and agentic chat

Beyond specs, hooks, and steering, Kiro ships the features you'd expect from a modern AI editor. It supports the Model Context Protocol (MCP) for connecting external tools and data sources to the agent, and it includes an agentic chat with context providers for files, URLs, and docs for the ad-hoc work that doesn't justify a full spec.

MCP support matters more than it sounds. It's the open standard that lets an agent reach your database, your ticketing system, your internal docs — without bespoke glue for each one. If MCP is new to you, building and integrating MCP servers is its own skill set; our MCP course walks through standing up real servers and wiring them into agentic workflows.

Your first hour with Kiro

The fastest way to understand the workflow is to feel the loop once on a real, small feature. In practice it looks like this:

Describe the feature in plain language. Not "build an app" — something concrete like "add an endpoint that returns a user's last five orders, paginated."
Review the requirements. Kiro drafts user stories and acceptance criteria. This is where you catch the ambiguity: did you mean five orders total, or five per page? Fix it in the spec, where it costs nothing.
Review the design. Check that the proposed architecture matches your codebase's conventions — and if it doesn't, that's a sign your steering files need to capture those conventions.
Let it work the task list. The agent implements tasks in sequence; you watch and intervene where judgment is needed.
Wire one hook. Even a single "run tests on save" hook changes how the session feels — feedback becomes immediate instead of deferred.

Do that once and the abstract pitch — "specs as the unit of work" — turns concrete. The discipline isn't heavy; it's front-loaded, and the front-loading is where the bugs you didn't ship were quietly avoided.

Kiro vs. vibe coding

"Vibe coding" — prompting your way to an app on feel, accepting whatever the model produces — is genuinely useful for prototypes, throwaway scripts, and learning. It's also where a lot of teams get burned when that "prototype" quietly becomes production.

Kiro is, in a sense, the structured opposite. The spec phase forces the requirements-and-design thinking that vibe coding skips. That doesn't make vibe coding wrong — it makes them tools for different moments. Reaching for a spec to build a one-off script is overkill; vibe-coding a payment flow is asking for trouble. Knowing which mode fits which task is the actual skill, and it's the through-line of our vibe coding course, which treats prompt-to-app speed and structured engineering as complementary, not rival, philosophies.

How Kiro compares to other agentic tools

Kiro isn't alone — the agentic IDE space is crowded, and the tools overlap. A few honest distinctions:

Cursor is an AI-native editor built around fast in-editor generation, multi-file edits, and an agent mode. Its center of gravity is fluid, in-the-flow coding.
Claude Code is a terminal-first agentic tool that excels at multi-file changes, git operations, and CI-aware work from the command line.
Kiro distinguishes itself by making the spec the artifact — front-loading requirements and design before implementation.

These aren't mutually exclusive; plenty of engineers use more than one and switch by task. The meta-skill is fluency across the category rather than loyalty to one editor. If you want a structured path through these tools, we maintain dedicated, hands-on courses on agentic coding with Claude Code and on Cursor as a pro — both built around real multi-file, real-repo workflows rather than toy demos.

Pricing

At the time of writing, Kiro uses a credit-based model measured in agent interactions, with no daily or weekly rate limits and pre-paid overages so you don't hit a hard wall mid-task:

Free — 50 agent interactions per user per month (fine for experimentation, not serious daily work)
Pro — $19 per user per month for 1,000 agent interactions
Pro+ — $39 per user per month for 3,000 interactions

Pricing and tiers for tools in this category change often, so verify the current numbers on Kiro's official pricing page before you budget for a team. Treat the figures above as a snapshot, not a contract.
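For budgeting, the tiers above reduce to simple arithmetic. A minimal sketch that picks the cheapest plan covering a monthly interaction count without overages (overage pricing isn't listed here, so it is left out) and shows the implied price per interaction:

```python
from typing import Optional

# (monthly price in USD per user, included agent interactions), from the tiers listed above.
PLANS = {"Free": (0, 50), "Pro": (19, 1000), "Pro+": (39, 3000)}


def cheapest_covering_plan(interactions_per_month: int) -> Optional[str]:
    """Cheapest plan whose included interactions cover the usage, or None if none does."""
    covering = [
        (price, name)
        for name, (price, included) in PLANS.items()
        if included >= interactions_per_month
    ]
    return min(covering)[1] if covering else None


def price_per_interaction(plan: str) -> float:
    """Monthly price divided by included interactions."""
    price, included = PLANS[plan]
    return price / included


if __name__ == "__main__":
    for usage in (40, 800, 2500, 5000):
        print(usage, "->", cheapest_covering_plan(usage))
    print(f"Pro:  ${price_per_interaction('Pro'):.3f} per interaction")
    print(f"Pro+: ${price_per_interaction('Pro+'):.3f} per interaction")
```

At these snapshot prices Pro+ works out cheaper per interaction than Pro, so the break-even depends on whether you actually use the larger allotment.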

When Kiro is worth it — and when it isn't

Spec-driven development has a cost: the spec phase is overhead. That overhead pays off when the work is durable and shared, and it's pure friction when the work is disposable.

Kiro shines when:

You're building features meant to live and be maintained, not prototypes you'll throw away
More than one person (or one agent) touches the codebase and conventions matter
You want an auditable trail of why the code is the way it is
You're tired of re-explaining your standards every session

Reach for something lighter when:

You're exploring, prototyping, or scripting something you'll delete tomorrow
The task is small enough that writing the spec costs more than writing the code
You just need a quick answer or a one-file change

The honest take: spec-driven development is a discipline, and Kiro is tooling that makes the discipline cheaper to follow. The tool won't supply the engineering judgment — it removes the excuse not to apply it.

Want to go deeper?

Tools like Kiro lower the cost of doing engineering properly, but they reward people who already understand specs, agent architecture, context management, and the MCP ecosystem underneath. That foundation is what turns an agentic IDE from a faster autocomplete into genuine leverage.

At Cursuri-AI.ro, Eastern Europe's AI education platform, we build practical, project-based courses on exactly this stack — agentic coding, MCP, context engineering, and the modern AI-native IDE workflow — taught around real repositories with an interactive AI instructor, not slide decks. If Kiro made you curious about agentic engineering as a craft rather than a buzzword, that's the rabbit hole our catalog is built for.

Conclusion

Kiro's core bet is simple and, I think, correct: the bottleneck in AI-assisted development was never typing speed — it was the gap between what you meant and what the model built. By making specs the unit of work, adding agent hooks for automation and steering files for persistent context, Kiro turns "AI coding" into something closer to engineering with an agent.

It won't replace judgment, and it isn't the right tool for every task. But for durable, maintainable software built with an AI in the loop, spec-driven development is a genuinely different — and more accountable — way to work. Try it on a real feature, not a toy, and you'll feel the difference fast.

Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from agentic coding and MCP to context engineering and AI-native IDE workflows.

Sources: kiro.dev · Kiro Specs docs · Kiro Hooks docs · Kiro Steering docs

Stop Vibe-Checking Your LLM: A Developer's Guide to Evals

galian — Mon, 22 Jun 2026 08:24:22 +0000

You tweaked the system prompt, ran the same two test questions you always run, the answers looked good, and you shipped. A week later support is forwarding you screenshots of the model confidently doing the exact thing your prompt was supposed to stop. You never saw it, because "did it get better?" was answered by vibes.

This is the single most common failure mode in shipping LLM features, and it has nothing to do with which model you picked. If your only quality gate is reading a handful of outputs and nodding, every change you make is a coin flip. You can't tell whether a prompt edit helped, hurt, or just moved the failures somewhere you didn't look. Evals are how you replace the nod with a number.

This is a practical guide to building that number — from a 30-row eval set you can write this afternoon, through code-based checks and LLM-as-judge scoring, to wiring the whole thing into CI so regressions get blocked instead of discovered by users. No new framework to adopt; just the discipline that separates a demo from a system.

Why you can't just `assert output == expected`

Traditional tests work because the output is deterministic, closed-ended, and fails loudly: add(2, 2) is 4 or it's a bug. LLM output breaks all three assumptions that make assertEqual work:

It's non-deterministic. The same prompt can produce different text on two calls. Even at temperature 0 you are not guaranteed byte-identical output across runs or model versions.
It's open-ended. "Summarize this ticket" has thousands of correct answers. None of them are string-equal to your reference, and that's fine — a good summary isn't the summary.
It fails softly. A wrong answer isn't a stack trace. It's a fluent, plausible, well-formatted paragraph that happens to be incorrect. Nothing crashes. Nothing logs an error.

So the goal of an eval isn't "is the output identical to the expected string." It's "does the output satisfy the properties I care about" — is it grounded in the provided context, does it stay on policy, does it actually answer the question, is it valid JSON. You're testing behavior against criteria, not bytes against bytes. Once that clicks, the rest is mechanics.

Start with the eval set, not the metric

The instinct is to reach for a fancy metric first. Wrong order. The asset that makes everything else work is a small, representative eval set: a fixed collection of inputs paired with what a good output looks like (or the criteria a good output must meet). This is your golden dataset, your regression suite, your source of truth.

You do not need thousands of examples to start. Thirty to fifty well-chosen pairs turn LLM tuning from vibes into engineering, because now every change is measured against the same fixed bar. Build the set like this:

Mine real failures. Every time the system gets something wrong in dev or prod, that exact input goes into the eval set with a note on what the right behavior is. Your bug reports are your test cases. This is the highest-signal source you have.
Cover the categories, not just the happy path. Easy questions, ambiguous ones, adversarial ones, out-of-scope ones ("I don't know" is the correct answer and you should test that it says so), and the edge cases specific to your domain.
Freeze it and version it. The eval set lives in your repo next to the code. When you add a case, that's a commit. A moving target can't measure progress.
Keep a holdout. If you start tuning prompts against the eval set, you'll overfit to it. Keep a slice you don't look at until you think you're done.

A minimal eval set is just data — JSON, a CSV, a Python list. Here's the shape:

# evals/dataset.py
EVAL_SET = [
    {
        "id": "refund-window-basic",
        "question": "What is our refund window?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "14 days",
        "must_not_say": ["30 days", "no refunds"],
    },
    {
        "id": "out-of-scope",
        "question": "What's the weather in Cluj tomorrow?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "REFUSE",  # correct behavior: decline, don't invent
    },
    # ... 30-50 of these, grown from real failures
]

That's the foundation. Everything below scores outputs against this set.
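The "keep a holdout" advice is easy to skip because splitting by hand is fiddly. One way to do it, sketched below, is to hash each case id so the assignment is stable across runs and machines. The function names and the 20% default are assumptions for illustration, not part of any library.

```python
import hashlib


def is_holdout(case_id: str, holdout_percent: int = 20) -> bool:
    """Deterministically assign a case to the holdout slice from its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    # The same id always lands in the same bucket, so adding new cases
    # never reshuffles the existing ones.
    return int(digest, 16) % 100 < holdout_percent


def split_eval_set(cases: list[dict], holdout_percent: int = 20):
    """Return (tuning_cases, holdout_cases) without mutating the input."""
    tuning = [c for c in cases if not is_holdout(c["id"], holdout_percent)]
    holdout = [c for c in cases if is_holdout(c["id"], holdout_percent)]
    return tuning, holdout


# Hypothetical ids, only to show the call shape.
cases = [{"id": f"case-{i}"} for i in range(50)]
tuning, holdout = split_eval_set(cases)
```

Because the split depends only on the id, you can tune prompts against the first list and score the second one only when you think you're done.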

The two halves of every LLM eval

Separate two questions that get mushed together when you eval by eyeball, because they have different fixes:

Did the system retrieve / set up the right context? (a retrieval or pipeline question)
Given that context, did the model produce a good answer? (a generation question)

If you're building RAG, the first half is its own discipline: measuring recall@k and precision@k on questions with known relevant documents tells you whether the right chunk even reached the prompt. That's a deep enough topic that it deserves its own treatment; a dedicated course on retrieval-augmented generation (RAG) spends real time there, and the failure modes are different from the ones below. This guide focuses on the second half: scoring the generated answer. The techniques split into two families, code-based checks and model-based judges, and you want both.
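For reference, recall@k and precision@k are only a few lines each. This is a minimal sketch assuming your retriever returns a ranked list of document ids and you have a hand-labeled set of relevant ids per question; the ids below are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k slots occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranked retrieval output and labels for one question.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}

recall = recall_at_k(retrieved, relevant, k=3)        # 1 of 2 relevant docs found
precision = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 slots is relevant
```

Low recall means the generator never had a chance; low precision means the right chunk is there but buried in noise.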

Code-based checks: cheaper and more reliable than you think

Before you reach for an LLM to grade an LLM, a surprising amount of quality is checkable with plain code. These checks are deterministic, free, instant, and never hallucinate. Use them for everything they can cover:

Structural validity. If the output should be JSON matching a schema, validate it. A response that doesn't parse is a hard failure, no judgment call needed.
Must-contain / must-not-contain. The answer about a 14-day refund window must contain "14" and must not contain "30." Keyword and regex assertions catch a whole class of factual regressions for free.
Format and bounds. Length limits, required citations present, no leaked system-prompt text, no forbidden phrases (the "as an AI language model" tax), valid enum values.
Semantic similarity. For open-ended answers, embed the output and your reference answer and check that their cosine similarity clears a threshold. It's fuzzy, and unlike the checks above it needs an embedding model, but it catches "the answer wandered off topic" without needing a judge model.

# evals/checks.py
import json

def check_structural(output: str, schema_keys: list[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in schema_keys)

def check_must_not_say(output: str, banned: list[str]) -> bool:
    low = output.lower()
    return not any(b.lower() in low for b in banned)

The rule of thumb: anything a regex or a schema can catch, don't pay a model to catch. Reserve the expensive, fuzzy judge for the genuinely subjective stuff.
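The semantic-similarity check mentioned above is the one item in that list without code. Here is a rough sketch of the comparison step only. It assumes you already have embedding vectors from whatever embedding model you use; the vectors and the 0.8 threshold below are placeholders you would tune on your own data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def check_semantic(output_vec: list[float], reference_vec: list[float],
                   threshold: float = 0.8) -> bool:
    """Pass if the output embedding is close enough to the reference."""
    return cosine_similarity(output_vec, reference_vec) >= threshold


# Toy 2-dimensional vectors standing in for real embeddings.
on_topic = check_semantic([1.0, 1.0], [1.0, 0.9])
off_topic = check_semantic([1.0, 0.0], [0.0, 1.0])
```

Pick the threshold by looking at scores for a handful of known-good and known-bad answers rather than guessing.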

LLM-as-judge: powerful, biased, and fixable

For the subjective half ("is this answer faithful to the source?", "is this helpful?", "is the tone right?"), you use a strong model to grade outputs. This is LLM-as-judge, and it scales approximately human-level judgment to thousands of examples for the price of one API call per example. Two metrics carry most of the weight for RAG-style apps:

Faithfulness / groundedness — does every claim in the answer trace back to the provided context, or did the model invent things? This is your hallucination detector.
Answer relevance — does the response actually address the question that was asked, or is it a fluent dodge?

The catch: LLM judges have well-documented biases, and if you ignore them your eval numbers are noise dressed up as signal. The big ones, all reported in the research on using models as evaluators:

Position bias — when comparing two answers, judges favor the one shown first (or in a fixed slot) regardless of quality.
Verbosity bias — judges tend to rate longer, more elaborate answers higher even when a short answer is more correct.
Self-preference — a judge model can favor text written in its own style or by its own family.

You don't abandon the technique; you engineer around the bias:

Score against a rubric, not a vibe. Ask for a 1–5 score with explicit criteria for each level, and require the judge to output its reasoning before the score. A judge forced to justify itself is more consistent.
For pairwise comparisons, randomize and swap. Run each comparison twice with the order flipped; only count it as a win if the judge picks the same answer both times. This cancels position bias directly.
Calibrate against humans. Hand-label 20–30 examples yourself, run the judge on them, and check it agrees with you. If it doesn't, fix the rubric before trusting it on 2,000. An uncalibrated judge is a random number generator with good grammar.
Use a strong model as the judge. Grading is harder than answering. Use a current frontier model for the judge even if your app runs on a smaller, cheaper one.
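The randomize-and-swap rule is easier to see in code. This is a sketch under one assumption: your judge is a callable that takes the question and two answers and returns "first" or "second". That signature is invented here for illustration, so adapt it to however you actually call your judge model.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, question: str,
                      answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging in both presentation orders."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "a"
    if not a_wins_first and not a_wins_second:
        return "b"
    # The verdict flipped with the order, so it reflects position, not quality.
    return "tie"
```

A judge that always picks whatever it sees first produces nothing but ties under this scheme, which is exactly the signal you want: its preferences carried no information.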

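# evals/judge.py: sketch of a rubric-based faithfulness judge
import json

JUDGE_PROMPT = """You are grading whether an ANSWER is fully supported by the CONTEXT.

CONTEXT:
{context}

ANSWER:
{answer}

Rules:
- A claim is "supported" only if the CONTEXT states or directly implies it.
- Outside knowledge does NOT count as support.

Respond with a single JSON object and nothing else. Fill in "reasoning"
(one sentence) before "faithful" so the verdict follows the justification:
{{"reasoning": "...", "faithful": true|false}}"""

def judge_faithfulness(client, context: str, answer: str) -> bool:
    # `client.complete` is a placeholder for your LLM SDK's completion call;
    # it is assumed to return the model's raw text.
    resp = client.complete(
        JUDGE_PROMPT.format(context=context, answer=answer),
        temperature=0,
    )
    # json.loads raises if the judge wrapped the object in extra prose,
    # which is why the prompt asks for the JSON object and nothing else.
    return bool(json.loads(resp)["faithful"])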

Designing judges that hold up — picking the rubric, calibrating, knowing when a model is the wrong tool for the grade — is exactly the muscle a course on AI evals in production builds, because it's the difference between "the new prompt feels better" and "faithfulness went from 0.78 to 0.91 on the holdout."
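Calibration itself is mostly bookkeeping. A minimal sketch, assuming you have boolean pass/fail labels from yourself and from the judge for the same hand-labeled cases (the ids and labels below are invented):

```python
def agreement_rate(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of cases where the judge's verdict matches the human's."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(case_ids: list[str], human_labels: list[bool],
                  judge_labels: list[bool]) -> list[str]:
    """Ids of the cases to reread when fixing the rubric."""
    return [cid for cid, h, j in zip(case_ids, human_labels, judge_labels)
            if h != j]


# Hypothetical calibration run on five hand-labeled cases.
ids = ["c1", "c2", "c3", "c4", "c5"]
human = [True, True, False, True, False]
judge = [True, False, False, True, True]

rate = agreement_rate(human, judge)      # 3 of 5 verdicts match
to_review = disagreements(ids, human, judge)
```

The disagreement list matters more than the rate: reading those cases is usually what tells you which rubric line is ambiguous.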

Wire it into CI, or it won't survive contact with deadlines

An eval you run by hand when you remember to is an eval you'll stop running the week things get busy. The whole point is to make regressions impossible to ship silently, and that means the eval runs automatically on every change to a prompt, a retrieval setting, or a model version.

The pattern is a regression gate: run the eval set, compute the aggregate score, and fail the build if the score drops below a threshold (or below the last known-good baseline). It looks like an ordinary test suite, because that's what it is.

# tests/test_evals.py
import pytest
from evals.dataset import EVAL_SET
from evals.checks import check_must_not_say
from myapp import answer_question

PASS_THRESHOLD = 0.90  # 90% of eval cases must pass to ship

def run_case(case) -> bool:
    output = answer_question(case["question"], case["context"])
    # Normalize curly apostrophes so "can’t" and "can't" hit the same check.
    low = output.lower().replace("\u2019", "'")
    if case["expected"] == "REFUSE":
        # Crude refusal heuristic; swap in a judge if it proves too loose.
        return "i don't know" in low or "can't" in low
    if not check_must_not_say(output, case.get("must_not_say", [])):
        return False
    return case["expected"].lower() in low

def test_eval_suite_meets_threshold():
    results = [run_case(c) for c in EVAL_SET]
    score = sum(results) / len(results)
    failed = [c["id"] for c, ok in zip(EVAL_SET, results) if not ok]
    assert score >= PASS_THRESHOLD, f"Eval score {score:.2f} below {PASS_THRESHOLD}. Failed: {failed}"

A few practical notes that keep this sane in CI:

Pin the model version. Provider model IDs update, and an unpinned model means your eval baseline shifts under you for reasons unrelated to your code. Pin it, and treat a model upgrade as its own deliberate eval run.
Budget for cost and flakiness. LLM calls cost money and occasionally time out. Cache where you can, run the judge-heavy suite on a schedule rather than every commit if needed, and set a slightly forgiving threshold so one stochastic blip doesn't red-X a good PR.
Log the failures, not just the score. When the gate trips, the output should name which cases regressed so the fix is obvious. A bare "0.86 < 0.90" sends you debugging blind.

Now a prompt change is a PR with a number attached. The reviewer sees faithfulness went up and refusal rate held steady, or they see it tanked and the build is red. That's the entire difference between hoping and knowing.
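The "last known-good baseline" variant of the gate can be sketched like this. It assumes you store each run as a mapping from case id to pass/fail; the storage format, the function names, and the 0.02 tolerance are all assumptions for illustration.

```python
def find_regressions(baseline: dict[str, bool],
                     current: dict[str, bool]) -> list[str]:
    """Case ids that passed in the baseline run but fail now."""
    return sorted(cid for cid, passed in baseline.items()
                  if passed and not current.get(cid, False))


def gate(baseline: dict[str, bool], current: dict[str, bool],
         max_score_drop: float = 0.02) -> tuple[bool, list[str]]:
    """Return (score_ok, regressed_ids) for the current run."""
    base_score = sum(baseline.values()) / len(baseline)
    cur_score = sum(current.values()) / len(current)
    score_ok = cur_score >= base_score - max_score_drop
    return score_ok, find_regressions(baseline, current)


# Hypothetical runs: the aggregate score is unchanged, but one case regressed.
baseline = {"refund-window-basic": True, "out-of-scope": True, "edge-case": False}
current = {"refund-window-basic": True, "out-of-scope": False, "edge-case": True}

score_ok, regressed = gate(baseline, current)
```

In this toy run the score holds steady while a refusal case breaks, which is why the gate reports the regressed ids alongside the number.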

Five mistakes that quietly poison your evals

Even teams that build evals often undermine them. Watch for these:

Testing only the happy path. If every case in your set is a question the system already answers well, your score is a flattering lie. Adversarial and out-of-scope cases are where the signal is.
Tuning on your test set. Optimize prompts against the same examples you grade on and you'll overfit to them. Keep a holdout you don't peek at.
An uncalibrated judge. Trusting an LLM judge you never checked against your own labels is trusting a number you made up. Calibrate first.
One giant blended score. A single average hides that faithfulness improved while refusals broke. Track metrics separately so a regression in one can't be masked by a gain in another.
Letting the set rot. Your product changes; cases that no longer reflect real usage drag the signal down. Prune and grow the set as part of normal work, the same way you maintain any test suite.

None of these are exotic. They're the eval equivalent of not testing error paths — obvious in hindsight, easy to skip under deadline.
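The "one giant blended score" mistake is cheap to avoid. A small sketch, assuming each result row carries a metric name and a pass flag (a row shape made up for this example):

```python
from collections import defaultdict


def scores_by_metric(results: list[dict]) -> dict[str, float]:
    """Pass rate per metric instead of one blended average."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["metric"]] += 1
        passes[row["metric"]] += int(row["passed"])
    return {metric: passes[metric] / totals[metric] for metric in totals}


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Metrics whose pass rate is below their own threshold."""
    return sorted(m for m, minimum in thresholds.items()
                  if scores.get(m, 0.0) < minimum)


# Hypothetical results: faithfulness is perfect, refusals are half broken.
results = (
    [{"metric": "faithfulness", "passed": True}] * 4
    + [{"metric": "refusal", "passed": True},
       {"metric": "refusal", "passed": False}]
)

scores = scores_by_metric(results)
failing = failing_metrics(scores, {"faithfulness": 0.9, "refusal": 0.9})
```

Blended, these six rows average about 0.83 and look like a mild dip; split out, it is obvious that refusals are the problem.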

How this connects to the rest of your LLM stack

Evals aren't a standalone chore; they're the measurement layer that makes every other improvement legible. When you tighten a prompt, evals tell you if it worked — which is why structured prompt engineering and a real eval loop are two halves of the same skill. When you redesign what goes into the context window — what to include, what to cut, how to order it — evals are how you know the redesign helped rather than just felt cleaner; that discipline of deciding what earns a place in the prompt is increasingly called context engineering and has its own dedicated course. And when you wire up function calling, multi-tool orchestration, and the production concerns of a real integration, evals are what keep the whole pipeline honest as it grows — the kind of end-to-end build covered in a deeper course on advanced LLM integration. The pattern is always the same: build the measurement first, then every change becomes verifiable instead of hopeful.

Conclusion

The teams whose LLM features actually hold up in production aren't using a secret model or a magic prompt. They're disciplined about measurement. They have a versioned eval set grown from real failures, code-based checks for everything a regex can catch, calibrated LLM judges for the subjective rest, and a CI gate that blocks regressions before users find them.

Start smaller than you think you can. Write thirty cases this afternoon — half of them things your system currently gets wrong — add three code checks and one rubric-based judge, and put a threshold in your test suite. The first time a red build stops you from shipping a prompt change that would have quietly broken refusals, you'll never go back to vibe-checking. That's the moment an LLM demo becomes an LLM system people can trust.

The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with hands-on, current tracks on evaluating AI systems in production, prompt engineering, RAG, and advanced LLM integration.

Sources & further reading:

Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (documents position, verbosity, and self-enhancement bias in LLM judges)
Liu et al. — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Liang et al. — Holistic Evaluation of Language Models (HELM)

This article is educational content. Techniques and tooling evolve quickly; validate approaches against your own data and current library documentation.

How Click Fraud Works — and How to Detect Bot Clicks in Real Time

galian — Thu, 18 Jun 2026 13:35:47 +0000

Every time someone clicks your Google or Bing ad, you pay. The uncomfortable part:
a meaningful share of those clicks were never going to become customers. They come
from bots, competitors burning your budget, click farms, and misconfigured scripts
hammering your landing page. Independent studies, and our own data at ProtectAds,
put invalid traffic at roughly 15–30% of paid-search spend.

If you run PPC, that's not a rounding error. On a €10,000/month budget, it's
€1,500–€3,000 quietly leaking out every month. This article breaks down how click
fraud actually works, why the ad platforms don't fully stop it for you, and why the
"just block the IP" approach falls apart at scale.

First, the vocabulary: invalid traffic vs. click fraud

The ad industry (IAB/MRC) splits Invalid Traffic (IVT) into two buckets:

GIVT — General Invalid Traffic: the obvious stuff. Known data-center IPs, declared bots and crawlers, spiders. Detectable with lists and simple rules.
SIVT — Sophisticated Invalid Traffic: the expensive stuff. Hijacked devices, headless browsers pretending to be Chrome, residential-proxy networks, click farms with real fingerprints, automation that mimics human timing.

Click fraud is the malicious, intent-driven slice of IVT aimed specifically at
your ads — a competitor draining your daily budget, or a click farm monetizing fake
engagement. GIVT you can filter with a blocklist. SIVT is where real detection
earns its keep. (We go deeper on this distinction here.)

Why the platforms don't fully solve this for you

Google and Microsoft do filter obvious invalid clicks and issue some credits. But
their filtering is conservative, opaque, and applied after the fact — you find out
in aggregate, weeks later, with little per-click evidence. They're also structurally
conflicted: invalid clicks are still billed first and credited later, if at all.

That leaves a gap for sophisticated invalid traffic that looks human enough to pass
platform filters but never converts. Closing that gap is an engineering problem:
score every click in real time and act on it before the budget is gone.

Why sophisticated click fraud is so hard to catch

The obvious stuff (GIVT) is easy to filter. The expensive stuff (SIVT) is hard on
purpose — it's built to look human:

It rotates through fresh IPs and residential proxies, so any static blocklist is out of date the moment you save it.
It runs on real or convincingly spoofed devices, so a single attribute rarely gives it away.
It mimics human timing and behavior well enough to slip past simple rules.

That's why "detect fraud" isn't one trick — it's continuous analysis at scale,
correlating signals across many clicks and campaigns over time, with thresholds
tuned so you catch the bad traffic without blocking real buyers. Doing that
reliably, and fast enough to act before the budget is gone, is the hard part — and
it's why most advertisers are better off with a dedicated system than a homegrown
script. (Here's how ProtectAds approaches it.)

Why "just block the IP" doesn't scale

The first instinct is a spreadsheet of bad IPs pasted into Google Ads exclusions.
It breaks down fast:

Problem	Manual blocklist	Automated protection
New IPs every day	You're always behind	Handled continuously
Attackers rotate IPs constantly	Whack-a-mole	Doesn't rely on a static list
Ad platforms cap exclusion lists	Fills up fast	Prioritizes the worst offenders automatically
Evidence for refund claims	None	Per-click log you can export
Across many campaigns/accounts	Doesn't scale	Centralized

IPs are cheap and disposable for attackers; your manual list is neither. Sustainable
protection has to run continuously, automate the exclusions, and keep an audit trail
you can take back to the ad platform.
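The "prioritize the worst offenders" row is the one piece of that table that is simple to illustrate. The sketch below is not how ProtectAds works internally; it only shows the general idea of fitting flagged IPs into a capped exclusion list. It assumes you already have a count of flagged clicks per IP, and the cap value and addresses are placeholders (the IPs come from ranges reserved for documentation).

```python
import heapq


def prioritize_exclusions(invalid_clicks_by_ip: dict[str, int],
                          cap: int) -> list[str]:
    """Pick at most `cap` IPs, worst offenders first."""
    top = heapq.nlargest(cap, invalid_clicks_by_ip.items(),
                         key=lambda item: item[1])
    return [ip for ip, _count in top]


# Hypothetical flagged-click counts per IP.
flagged = {
    "203.0.113.5": 42,
    "198.51.100.7": 3,
    "203.0.113.9": 17,
    "192.0.2.1": 8,
}

exclusions = prioritize_exclusions(flagged, cap=2)
```

With a cap of 2, the two IPs responsible for the most flagged clicks make the list and the rest wait for the next refresh.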

Turning detection into protection

Detection is only half the job — you have to act on it. A working pipeline:

Observe ad clicks as they hit your site.
Assess each one in real time to tell genuine visitors from invalid traffic.
Block offending IPs by pushing them to your campaign exclusion lists automatically — no manual copy-paste.
Report with a per-click, per-campaign log so you can request ad credits with evidence instead of vibes.

This is exactly what ProtectAds does for Google Ads (including Performance Max /
PMax) and Microsoft (Bing) Ads. You connect your account once, and it runs the
detect → block → report loop continuously. Agencies running multiple client
accounts and domains get dedicated agency plans for managing protection at scale.

A note on scope: ProtectAds protects Google Ads and Bing Ads. Meta/Facebook Ads
aren't covered — paid search is where competitor and bot click fraud hits
hardest, so that's where we focus.

How much is this worth to you?

Take your monthly paid-search spend and multiply by 15–30%. That range is your
realistic exposure to invalid traffic. Even at the low end, recovering it usually
dwarfs the cost of detection — which is the entire economic argument for automating
this instead of eyeballing reports once a quarter. (You can sanity-check your own
number with the
click fraud calculator.)
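The arithmetic is simple enough to run yourself. This sketch just applies the 15–30% range quoted above to a monthly budget; the function name is made up and the rates are the article's estimate, not a measured figure for your account.

```python
def invalid_traffic_exposure(monthly_spend: float,
                             low_rate: float = 0.15,
                             high_rate: float = 0.30) -> tuple[float, float]:
    """Return the (low, high) estimate of spend lost to invalid traffic."""
    return monthly_spend * low_rate, monthly_spend * high_rate


low, high = invalid_traffic_exposure(10_000)
print(f"EUR {low:,.0f} to EUR {high:,.0f} per month")
# prints: EUR 1,500 to EUR 3,000 per month
```

Compare the low end of that range with what detection costs you per month; that comparison is the whole business case.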

Try it on your own campaigns

If you're spending on Google or Bing ads and you've ever wondered why your
click-through looks healthy but conversions don't follow, invalid traffic is a prime
suspect. The fastest way to find out is to point real detection at your live
campaigns and watch what it flags.

Start a free ProtectAds trial: connect your Google Ads or Bing account, see the
invalid clicks in your own data, and cancel anytime. Your ad budget is better spent
on people who might actually buy.

Written by the team at ProtectAds —
real-time click fraud detection and protection for Google Ads and Microsoft (Bing) Ads.

Model Context Protocol Explained: Build Your First MCP Server in Python

galian — Thu, 18 Jun 2026 12:58:29 +0000

If you've integrated an LLM with a database, a ticketing system, and an internal API, you've written the same glue three times — and you'll write it again for the next model and the next tool. That M×N integration problem is exactly what the Model Context Protocol (MCP) was built to kill. Instead of every application hand-rolling a bespoke connector for every tool, MCP defines one open standard that any model and any tool can speak.

The analogy its authors at Anthropic use is deliberately mundane: MCP is "a USB-C port for AI applications." You don't wire each device to each laptop with a custom cable; you agree on one connector and everything interoperates. That framing is the whole point, and it's why MCP went from an Anthropic open-source release in late 2024 to something adopted across the industry — including by OpenAI and Google — by 2026.

This is a practical guide. We'll cover what MCP actually is, the three-part architecture, the primitives you'll use every day, and then build a real, working MCP server in Python that a host like Claude Code or an IDE can call. No hand-waving — by the end you'll have code that runs.

What problem MCP actually solves

Before MCP, "give the model access to our systems" meant writing function-calling glue specific to one provider's SDK, one tool's API, and one application's plumbing. Swap the model and you rewrote the tool layer. Add a tool and you touched every app that needed it. With M applications and N tools, you were on the hook for roughly M×N integrations.

MCP turns that into M+N. Tool authors write one MCP server. Application authors add one MCP client. Any host that speaks MCP can use any server that speaks MCP — no per-pair glue. The server you write for your company's CRM works in Claude Code, in your custom agent, and in whatever host ships next year, without changes.

That's the strategic shift: tools and models become decoupled, and the integration surface grows with M+N instead of M×N. Everything below is just the mechanics of how that's achieved.

The architecture: host, client, server

MCP has exactly three roles. Getting these straight makes everything else click.

Host — the LLM application the user interacts with. Claude Code, an AI-enabled IDE, a desktop assistant, or your own agent. The host orchestrates the model and decides which servers to connect to.
Client — a connector that lives inside the host. The host spins up one client per server, and each client keeps a dedicated 1:1 connection to its server. You rarely write this yourself; the host's framework provides it.
Server — a lightweight program that exposes capabilities (tools, data, prompt templates) over the protocol. This is what you build. A server can wrap a local SQLite file, a SaaS API, a filesystem, or anything you can reach with code.

Under the hood, client and server exchange JSON-RPC 2.0 messages over a transport. There are two you care about:

stdio — the server runs as a local subprocess and communicates over standard input/output. Perfect for local tools, dev work, and anything that touches the user's own machine.
Streamable HTTP — the server runs as a remote service reachable over HTTP, with streaming for long-running responses. This is the modern remote transport (it superseded the older HTTP+SSE approach) and it's what you deploy when the server lives somewhere central.

You write your server logic once; choosing stdio vs. HTTP is mostly a deployment decision, not a rewrite.
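To make "JSON-RPC 2.0 over a transport" less abstract, here is roughly what a tool call looks like on the wire. This is a hand-rolled sketch for reading purposes only; in practice the host's client and the SDK build these messages for you. The method name and field layout follow the MCP specification as I understand it, so check the current spec before relying on details. The tool name word_count is just an example.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a tools/call request as one line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # Over stdio, each message is a single line. json.dumps escapes any
    # newlines inside string values, so the only raw newline is the delimiter.
    return json.dumps(message) + "\n"


def parse_result(line: str) -> dict:
    """Return the result payload of a response, or raise on a JSON-RPC error."""
    message = json.loads(line)
    if "error" in message:
        raise RuntimeError(message["error"]["message"])
    return message["result"]


wire = make_tool_call(1, "word_count", {"text": "hello\nMCP world"})

# A response shaped like a successful tool result with one text item.
reply = '{"jsonrpc": "2.0", "id": 1, "result": {"content": [{"type": "text", "text": "3"}]}}'
result = parse_result(reply)
```

The request and response are matched by id, which is what lets a client keep several calls in flight on one connection.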

The three primitives you'll actually use

MCP servers expose capability through three primitives. The distinction between them isn't bureaucratic — it encodes who is in control, which matters enormously for safety and UX.

Tools — model-controlled

Tools are functions the model decides to call: query a database, send an email, hit an API, run a calculation. They can have side effects, so a well-behaved host asks for user approval before executing one. If you've used function calling, tools are the MCP-native, portable version of it. This is the primitive you'll reach for most.

Resources — application-controlled

Resources are read-only data the application pulls into context: a file's contents, a database row, a config blob, a documentation page. They're identified by URI (for example file:///logs/today.log or db://customers/42) and they don't do anything — they inform. The host decides when and whether to load them, which keeps the context window under deliberate control rather than at the model's whim.

Prompts — user-controlled

Prompts are reusable templates the user invokes intentionally — think a slash-command like "summarize this PR" or "draft a release note." They standardize the high-value interactions your server enables so users don't have to re-type elaborate instructions.

The mental model: tools are for the model, resources are for the app, prompts are for the user. Designing on the correct side of that line is the difference between an integration that feels safe and predictable and one that surprises people. That separation of control is also at the heart of building reliable agents, which is why a structured course on designing autonomous AI agents spends real time on it rather than treating every capability as "just a tool."

Build your first MCP server in Python

Enough theory. Let's build a server that exposes a tool, a resource, and a prompt — and actually runs.

The official Python SDK ships a high-level helper, FastMCP, that handles the JSON-RPC plumbing, schema generation, and transport for you. You describe capabilities with decorators; the SDK infers the input schema from your type hints and the description from your docstring.

Setup

The modern toolchain uses uv, but plain pip works too:

# with uv (recommended)
uv init mcp-demo && cd mcp-demo
uv add "mcp[cli]"

# or with pip
pip install "mcp[cli]"

The server

Create server.py:

from mcp.server.fastmcp import FastMCP

# Name your server — hosts show this to the user.
mcp = FastMCP("demo-tools")


@mcp.tool()
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())


@mcp.tool()
def days_between(start: str, end: str) -> int:
    """Return the number of days between two ISO dates (YYYY-MM-DD)."""
    from datetime import date
    s = date.fromisoformat(start)
    e = date.fromisoformat(end)
    return abs((e - s).days)


@mcp.resource("notes://team")
def team_notes() -> str:
    """Expose the team's shared notes as read-only context."""
    # In real life this would read a file, a DB row, or an API.
    return "Release freeze starts Friday. Owner: platform team."


@mcp.prompt()
def code_review(language: str, code: str) -> str:
    """A reusable prompt template for reviewing a code snippet."""
    return (
        f"You are a senior {language} engineer. Review the code below for "
        f"correctness, security, and readability. Be specific.\n\n{code}"
    )


if __name__ == "__main__":
    # Default transport is stdio — ideal for local hosts.
    mcp.run()

That's a complete, valid MCP server. Notice what you did not write: no JSON-RPC handling, no schema definitions, no transport code. The type hints on word_count(text: str) -> int become the tool's input/output schema automatically, and the docstring becomes the description the model reads to decide when to call it. That docstring is not decoration — it's the model's only instruction manual for the tool, so write it like an API contract.

Inspect it before wiring it to a model

The SDK includes a dev inspector so you can poke at your server without an LLM in the loop:

uv run mcp dev server.py

This launches the MCP Inspector, a local UI where you can list the server's tools, resources, and prompts, call them with hand-entered arguments, and see exactly what comes back. Debugging here — before a model is involved — is the single biggest time-saver in MCP development. If a tool misbehaves with the inspector, the problem is your server, not the model.

Connect it to a host

To use the server from Claude Code or another MCP-aware host, you register it. For Claude Code, that's a one-liner:

claude mcp add demo-tools -- uv run server.py

For hosts configured by file, you add an entry pointing at the command that launches your server:

{
  "mcpServers": {
    "demo-tools": {
      "command": "uv",
      "args": ["run", "server.py"]
    }
  }
}

Restart the host, and your tools, resource, and prompt show up — the model can now call word_count, the app can pull in notes://team, and the user can invoke the code_review prompt. The same server.py, unchanged, works in every one of them. That portability is the entire payoff, and pushing a server like this from a local toy to something production-grade — auth, logging, error handling, deployment over Streamable HTTP — is exactly the jump covered in this hands-on course on building real AI applications in Python.

From toy to production: what the quickstart doesn't tell you

The server above works, but shipping MCP to real users surfaces concerns the happy path hides. These are the ones that bite teams:

Authentication and authorization. A remote MCP server is a service on the internet. Streamable HTTP servers support OAuth-based auth, and you need it — an unauthenticated tool that can query your database or send email is an incident waiting to happen. Treat the server's tool surface as your real attack surface.

The model can be tricked into calling tools. Because tools are model-controlled, a prompt-injection payload hidden in a document or web page the model reads can try to coax it into calling a destructive tool. The mitigations are concrete: keep destructive tools behind user approval, scope each server's permissions narrowly, validate every argument server-side, and never assume the model's call is benign just because it's well-formed. This intersection of capability and risk is precisely why agentic systems need a security mindset, not just a features mindset — the subject of a dedicated course on AI security and ethical engineering.
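
To make "validate every argument server-side" concrete, here is a minimal sketch in plain Python. The refund tool, its limits, and the `user_approved` flag are all hypothetical: the flag stands in for whatever approval signal your host or application actually provides, and none of this is MCP SDK API.

```python
from dataclasses import dataclass

# Hypothetical limit for an illustrative refund tool.
MAX_REFUND_CENTS = 50_000


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int


def validate_refund(args: dict) -> RefundRequest:
    """Reject anything outside the narrow shape this tool is meant to accept."""
    order_id = args.get("order_id")
    amount = args.get("amount_cents")

    if not isinstance(order_id, str) or not order_id.isalnum() or len(order_id) > 32:
        raise ValueError("order_id must be an alphanumeric string of at most 32 characters")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise ValueError("amount_cents must be an integer")
    if not 0 < amount <= MAX_REFUND_CENTS:
        raise ValueError(f"amount_cents must be between 1 and {MAX_REFUND_CENTS}")
    return RefundRequest(order_id=order_id, amount_cents=amount)


def refund_tool(args: dict, *, user_approved: bool) -> str:
    """Destructive action: validate first, then require explicit approval."""
    request = validate_refund(args)
    if not user_approved:
        return "Refund not executed: user approval is required for this action."
    # The real side effect (a payment API call) would go here.
    return f"Refunded {request.amount_cents} cents on order {request.order_id}."
```

The ordering is the point: a well-formed call still gets checked against hard limits, and the side effect only runs after both validation and approval pass.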

Tool descriptions are part of your context budget. Every tool's name, description, and schema get loaded into the model's context. Twenty sprawling tools with verbose docstrings quietly eat thousands of tokens and degrade the model's ability to choose well. Curate your tool surface like an API you have to maintain: fewer, sharper tools beat a kitchen sink. Managing what occupies the context window — tools included — is its own discipline, which a course on context engineering for AI agents treats as a first-class skill rather than an afterthought.
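
One way to keep an eye on that budget is a rough per-tool estimate. The sketch below assumes about four characters per token, which is a common rule of thumb and not a real tokenizer, and the tool dictionaries (including the `schema` key and the descriptions) are made up for illustration.

```python
import json

# Rough rule of thumb, not a real tokenizer: about 4 characters per token.
CHARS_PER_TOKEN = 4


def estimate_tool_tokens(tools: list[dict]) -> dict[str, int]:
    """Approximate per-tool context cost from name, description, and schema."""
    costs = {}
    for tool in tools:
        payload = tool["name"] + tool["description"] + json.dumps(tool["schema"])
        costs[tool["name"]] = len(payload) // CHARS_PER_TOKEN
    return costs


# Illustrative tool metadata; the shape is an assumption for this sketch.
tools = [
    {
        "name": "word_count",
        "description": "Count the words in a piece of text.",
        "schema": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
    {
        "name": "search_everything",
        "description": "Searches things. " * 40,  # a deliberately bloated description
        "schema": {"type": "object", "properties": {"q": {"type": "string"}}},
    },
]

costs = estimate_tool_tokens(tools)
for name, tokens in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ~{tokens} tokens")
print(f"total: ~{sum(costs.values())} tokens")
```

The absolute numbers are approximate, but the ranking is usually enough to spot the tool whose description needs trimming.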

Errors must be legible to a model, not just a human. When a tool fails, return a structured, descriptive error the model can reason about and recover from — not a raw stack trace. "Customer 42 not found; verify the ID" lets the model self-correct; a 500 with a Python traceback does not.
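
As a sketch of what a model-legible error could look like, the hypothetical tool below returns a structured result instead of raising. The `ok`/`error` shape and the in-memory customer table are assumptions for illustration, not an MCP SDK type.

```python
# Hypothetical in-memory data; a real tool would query a database or an API.
CUSTOMERS = {7: {"name": "Ada", "plan": "pro"}}


def tool_error(code: str, message: str, hint: str) -> dict:
    """Build an error the model can read: what failed, and what to try next."""
    return {"ok": False, "error": {"code": code, "message": message, "hint": hint}}


def get_customer(customer_id: int) -> dict:
    """Look up a customer, returning a structured result instead of raising."""
    try:
        customer = CUSTOMERS[customer_id]
    except KeyError:
        return tool_error(
            code="not_found",
            message=f"Customer {customer_id} not found",
            hint="Verify the ID and try again.",
        )
    except Exception:
        # Log the full traceback server-side; hand the model only a short summary.
        return tool_error(
            code="internal_error",
            message="Lookup failed unexpectedly",
            hint="Retry once, then report the failure to the user.",
        )
    return {"ok": True, "customer": customer}


print(get_customer(42))
print(get_customer(7))
```

The traceback still belongs in your logs. The model only needs the short, actionable version.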

Stateful vs. stateless. stdio servers are naturally per-session and local; HTTP servers may serve many clients and need you to think about concurrency and isolation. Decide early, because retrofitting state handling is painful.

None of these are reasons to avoid MCP — they're the normal engineering of turning a protocol demo into a dependable system, and the same skills you'd apply to any service boundary apply here.

Why this matters for how you build in 2026

MCP's quiet significance is that it makes tools a portable asset instead of a per-app liability. Write a great server for your internal systems once, and it appreciates: every new host, every new model, every teammate's agent can use it without you lifting a finger. That's the opposite of the function-calling glue we used to throw away every time a model changed.

It also pushes good architecture by default. The host/client/server split forces a clean seam between "the model and the app" and "the capability," which is exactly the boundary you want when models get swapped, upgraded, or — as 2026 has reminded everyone — occasionally yanked. Building agents on top of well-designed MCP servers, with the right model routed to the right step, is where a lot of the real engineering leverage lives now; if you want that end-to-end picture, there's a focused course on the Model Context Protocol and building enterprise integrations that goes far deeper than a single tutorial can.

Conclusion

The Model Context Protocol is not hype — it's plumbing, and good plumbing is what lets a field scale. It replaces M×N bespoke integrations with M+N reusable ones, gives you three clear primitives with sane control boundaries, and lets you ship a server in a dozen lines of Python that works across every MCP-aware host.

Start small: build the demo-tools server above, poke it with the inspector, wire it into a host you already use. Then point it at something real in your own stack — a read-only resource over your logs, a single well-scoped tool over an internal API. The first time you watch a model use a capability you exposed once and never re-integrated, the M+N promise stops being abstract.

Write the server once. Let every model use it.

Sources & further reading:

Anthropic — Introducing the Model Context Protocol
Model Context Protocol — Official specification and documentation
Model Context Protocol — Python SDK

This article is educational content. APIs and SDK details evolve; check the official MCP documentation for the current specification before building production systems.

Anthropic disables access to Fable 5 and Mythos 5 to comply with government directive

galian — Sat, 13 Jun 2026 10:04:07 +0000

On June 12, 2026, a frontier AI model disappeared overnight for its customers worldwide. Not because of an outage, and not because the vendor changed its pricing, but because of a US government export-control directive aimed at non-US nationals. Anthropic had to disable Claude Fable 5 and Claude Mythos 5 for all customers, just three days after making Fable 5, its first "Mythos-class" model, public.

If you build on LLMs, the interesting part of this story isn't the geopolitics. It's the engineering question it forces: what happens to your application when your model is gone tomorrow morning, through no fault of your own?

This is a practical guide to answering that question. We'll cover what actually happened (briefly, and sourced), why a piece of software can fall under export controls at all, and then spend most of our time on the part that matters: how to architect an LLM application so that no single model — from any provider — can take it down.

What happened, briefly and sourced

Anthropic published a statement titled "Statement on the US government directive to suspend access to Fable 5 and Mythos 5." Per that statement, the US government issued an export-control directive requiring the company to suspend access to both models for any non-US national, whether inside or outside the United States.

To comply, Anthropic disabled both models for all customers — not just the targeted group. Everything else stayed online: the statement notes that "access to all other Anthropic models will not be affected." Opus, Sonnet, and Haiku kept working; only the Mythos-class tier went dark.

The government's stated rationale was that it believed it had identified a method of jailbreaking Fable 5's safeguards. Anthropic publicly disagreed with the reasoning — arguing the vulnerability was narrow, that comparable capabilities already exist in other public models (it named GPT-5.5), and that applying this standard across the board "would essentially halt all new model deployments for all frontier model providers." The company called it a misunderstanding and said it was working to restore access.

The takeaway for builders has nothing to do with whether Anthropic or the government is right. It's this: a model you depend on can become unavailable for reasons entirely outside both your control and your vendor's control — and on essentially no notice.

Why software can be "export-controlled" in the first place

It feels strange that an API you call over HTTPS can be subject to export law. It makes sense the moment you see frontier AI the way regulators do: as a dual-use technology.

A dual-use technology serves legitimate and potentially dangerous purposes at the same time. A sufficiently capable model can accelerate both useful research and things nobody wants proliferating — from assisting cyberattacks to sensitive biological domains. That's precisely why Anthropic layers safety classifiers (for cyber and bio) on top of its base model to produce the public Fable 5, while the unrestricted Mythos 5 ships only in a limited program.

Under the US Export Administration Regulations (EAR), the government can restrict the export of strategic technology to non-US persons. The key concept is the "deemed export": giving a foreign national access to controlled technology counts as an export in its own right, even when that person is physically located in the US. That's why the directive targeted "any non-US national, regardless of location," and why the simplest compliant move was to shut the models off for everyone rather than try to filter access by nationality in real time.

If you're a developer in the EU, the UK, India, or anywhere outside the US, the practical reading is blunt: from the directive's point of view, you are the "foreign national," and the model can be pulled out from under you with no warning.

The real lesson: a model is a dependency, not a constant

This episode is a textbook case of vendor risk and business continuity. The model wasn't deprecated, didn't get more expensive, and wasn't beaten by a competitor. It vanished for reasons external to your product and even to your vendor.

For anyone who wired a product, an internal workflow, or a customer promise to one specific model, the message is clear: an LLM is an external dependency — exactly like a cloud provider, a payments processor, or any third-party API. And you manage external dependencies with redundancy, not hope.

The lesson is not "don't use the best model" or "don't trust Anthropic." Anthropic remains one of the most serious vendors in the market, and the transparency of this disclosure supports that. The lesson is architectural: don't build in a way that lets the disappearance of any single model stop your product.

Designing for provider resilience

Resilience here doesn't mean running ten models in parallel. It means being able to switch — quickly, deliberately, and without rewriting your application. Four building blocks get you there.

1. An abstraction layer (don't call the SDK from everywhere)

The single most damaging habit is sprinkling client.messages.create(...) across dozens of files. When you need to switch providers, you're now doing surgery on your whole codebase under time pressure. Put one internal interface between your app and any vendor.

from dataclasses import dataclass
from typing import Protocol

@dataclass
class Completion:
    text: str
    provider: str
    model: str

class LLMProvider(Protocol):
    name: str
    def complete(self, prompt: str, *, max_tokens: int = 1024) -> Completion:
        ...

Every provider — Anthropic, OpenAI, Google, a self-hosted model — implements the same complete() contract. Your application only ever talks to LLMProvider. Swapping a vendor becomes a config change, not a refactor. Designing these seams well is core system-architecture work; if you want the full treatment of boundaries, adapters, and scaling concerns, this course on AI system architecture at scale goes well beyond a single code sample.
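
To show what "implements the same complete() contract" means in practice, here is a small self-contained sketch. `EchoProvider` and `summarize` are hypothetical stand-ins: a real adapter would wrap a vendor SDK call where the stub fabricates a reply.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    provider: str
    model: str


class LLMProvider(Protocol):
    name: str

    def complete(self, prompt: str, *, max_tokens: int = 1024) -> Completion: ...


class EchoProvider:
    """Stand-in adapter (hypothetical). A real one would call a vendor SDK here."""

    def __init__(self, name: str, model: str):
        self.name = name
        self.model = model

    def complete(self, prompt: str, *, max_tokens: int = 1024) -> Completion:
        # Fake a reply; slicing by characters only mimics a length limit.
        text = f"[{self.model}] {prompt}"[:max_tokens]
        return Completion(text=text, provider=self.name, model=self.model)


def summarize(llm: LLMProvider, document: str) -> str:
    """Application code: depends on the interface, never on a vendor SDK."""
    return llm.complete(f"Summarize: {document}", max_tokens=200).text


primary = EchoProvider(name="vendor-a", model="model-a")
fallback = EchoProvider(name="self-hosted", model="model-b")

print(summarize(primary, "quarterly report"))   # same call site...
print(summarize(fallback, "quarterly report"))  # ...different backend
```

Because `LLMProvider` is a `Protocol`, adapters don't need to inherit from it. Any class with a matching `name` and `complete()` satisfies the contract.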

2. A router with failover and a circuit breaker

With a common interface, the router becomes simple: try the primary, fall back on failure, and stop hammering a provider that's clearly down.

import time

class ModelRouter:
    def __init__(self, providers: list[LLMProvider], cooldown_s: int = 30):
        self.providers = providers           # ordered: primary first
        self.cooldown_s = cooldown_s
        self._down_until: dict[str, float] = {}

    def _available(self, p: LLMProvider, now: float) -> bool:
        return now >= self._down_until.get(p.name, 0)

    def complete(self, prompt: str, **kw) -> Completion:
        now = time.monotonic()
        last_error: Exception | None = None
        for p in self.providers:
            if not self._available(p, now):
                continue
            try:
                return p.complete(prompt, **kw)
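            except Exception as e:        # timeout, 4xx/5xx, or a hard 403 like a suspension
                last_error = e
                # Start the cooldown at failure time, not call start: after a slow
                # timeout the breaker would otherwise already be expired.
                self._down_until[p.name] = time.monotonic() + self.cooldown_s   # trip the breaker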
        raise RuntimeError(f"All providers unavailable; last error: {last_error}")

This is deliberately small. The point is the shape: an ordered list of interchangeable providers, a breaker that quarantines a failing one for a cooldown, and an app that never sees the difference. A sudden 403/access revocation — exactly what a suspension looks like to your code — is just another failure that trips the breaker and routes to the next provider.
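
To see the failover happen, you can drive the router with stub providers. The sketch below uses a condensed variant of the router above so it runs on its own. `StubProvider` and `ProviderDown` are hypothetical test doubles, and the stubs return plain strings rather than Completion objects to keep things short.

```python
import time


class ProviderDown(Exception):
    """Hypothetical error an adapter might raise on a timeout, 5xx, or 403."""


class StubProvider:
    """Test double standing in for a real adapter (hypothetical)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.calls = 0

    def complete(self, prompt: str, **kw) -> str:
        self.calls += 1
        if not self.healthy:
            raise ProviderDown(f"{self.name}: 403 access suspended")
        return f"{self.name} answered: {prompt}"


class ModelRouter:
    """Condensed variant of the router above so this snippet runs on its own."""

    def __init__(self, providers, cooldown_s: int = 30):
        self.providers = providers  # ordered: primary first
        self.cooldown_s = cooldown_s
        self._down_until = {}

    def complete(self, prompt: str, **kw):
        now = time.monotonic()
        last_error = None
        for p in self.providers:
            if now < self._down_until.get(p.name, 0):
                continue  # breaker open: skip without calling
            try:
                return p.complete(prompt, **kw)
            except Exception as e:
                last_error = e
                self._down_until[p.name] = time.monotonic() + self.cooldown_s
        raise RuntimeError(f"All providers unavailable; last error: {last_error}")


primary = StubProvider("primary", healthy=False)
secondary = StubProvider("secondary")
router = ModelRouter([primary, secondary], cooldown_s=30)

first = router.complete("hello")   # primary fails, secondary answers
second = router.complete("again")  # primary is skipped while the breaker is open

print(first)
print(second)
print("primary calls:", primary.calls)
```

The primary is called exactly once: the first failure trips the breaker, so the second request goes straight to the secondary without paying for another failed attempt.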

3. Redundancy across jurisdictions, not just vendors

Keep at least two providers ready, ideally under different regulatory regimes. The Fable 5 episode is precisely why jurisdictional diversity matters and not just vendor diversity: a single government action took out one vendor's top tier for a whole class of users. Two US-based providers don't fully protect you from a US-wide policy event; a mix does.
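
A cheap way to keep this honest is to tag each provider in your failover chain and check the chain for concentration. The sketch below is illustrative only: the provider names and jurisdiction tags are placeholders, and the tag is your own assessment, not something a vendor API reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderEntry:
    name: str
    jurisdiction: str  # where the vendor is primarily regulated; your own assessment


# Hypothetical failover chain, primary first.
CHAIN = [
    ProviderEntry("vendor-a", jurisdiction="US"),
    ProviderEntry("vendor-b", jurisdiction="US"),
    ProviderEntry("vendor-c", jurisdiction="EU"),
]


def check_diversity(chain: list[ProviderEntry]) -> list[str]:
    """Return human-readable warnings about concentration risk in the chain."""
    warnings = []
    jurisdictions = {p.jurisdiction for p in chain}
    if len(chain) < 2:
        warnings.append("Fewer than two providers configured: no failover target.")
    if len(jurisdictions) < 2:
        warnings.append(f"All providers share one jurisdiction: {sorted(jurisdictions)}.")
    return warnings


print(check_diversity(CHAIN))      # full chain spans two jurisdictions
print(check_diversity(CHAIN[:2]))  # two vendors, one jurisdiction
```

Running a check like this in CI turns "we should diversify" into something that fails loudly when someone trims the provider list.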

Choosing those alternates well (on real capability, cost, and now regulatory exposure) is a skill in itself. If you want a structured, side-by-side framework instead of vibes, there's a dedicated AI model comparison course that covers how to evaluate and route across the 2026 lineup.

4. A safety net you actually control

For your most critical paths, keep an open-weight model you can self-host (something from the Llama or Qwen families). It doesn't have to be the best model in the world; it has to be yours and good enough to keep you running. A model running on infrastructure you control cannot be switched off remotely by a vendor, whatever directive that vendor receives. That's the difference between "degraded service" and "outage" on the day a managed model disappears.

Portability and switch drills

Two cross-cutting habits make all of the above real:

Portable prompts and evals. If your prompts and evaluation suites are tuned to one model's quirks, you've created a hidden dependency. Treat them as portable artifacts and test them across providers, so a switch doesn't silently tank quality.
Rehearsed failover. A fallback plan you've never executed is an assumption, not a guarantee. Trigger a manual switch to your secondary on a schedule and watch what breaks — discover problems in a drill, not mid-incident.
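
Both habits can start as a very small harness: run the same eval cases against every provider and see which ones are actually fit to fail over to. In this sketch a provider is reduced to a "prompt in, text out" callable, and the eval cases, stub providers, and 0.9 threshold are all assumptions for illustration.

```python
from typing import Callable

# A provider is reduced to "prompt in, text out" for this sketch.
Provider = Callable[[str], str]

# Hypothetical eval cases: a prompt plus a check on the answer.
EVAL_CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Reply with the single word OK.", lambda out: out.strip().upper() == "OK"),
    ("What is 2 + 2? Reply with the number only.", lambda out: out.strip() == "4"),
]


def pass_rate(provider: Provider) -> float:
    """Fraction of eval cases a provider passes."""
    passed = sum(1 for prompt, check in EVAL_CASES if check(provider(prompt)))
    return passed / len(EVAL_CASES)


def compare(providers: dict[str, Provider], min_rate: float = 0.9) -> dict[str, bool]:
    """Run the same suite against every provider; flag the ones fit to fail over to."""
    return {name: pass_rate(fn) >= min_rate for name, fn in providers.items()}


# Stub providers standing in for real adapters.
def strict_stub(prompt: str) -> str:
    return "OK" if "OK" in prompt else "4"


def chatty_stub(prompt: str) -> str:
    return "Sure! The answer is 4."  # right idea, wrong format: fails both checks


report = compare({"primary": strict_stub, "secondary": chatty_stub})
print(report)
```

The chatty stub is the interesting case: the answer is arguably right, but a prompt tuned to one model's formatting habits fails on another. That is exactly the hidden dependency you want a drill to surface.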

If your application is agentic — multi-step loops, tool use, sub-agents — provider resilience gets harder, because a mid-loop failover has to preserve state and tool context. Building that correctly is its own discipline, covered in this course on designing autonomous AI agents, and the end-to-end practice of wiring real applications across SDKs (Anthropic, OpenAI, and self-hosted) is the focus of this hands-on course on building AI apps in Python.

A note on governance, not just plumbing

There's a second, quieter lesson here for technical leaders. The trigger was a safety-and-governance dispute: who decides, on what evidence, and how fast, that a frontier capability is too risky for some users. As AI moves deeper into critical systems, "is this model available?" becomes a governance question as much as an uptime one — and understanding why safety classifiers, dual-use controls, and red-teaming exist is part of building responsibly. That intersection of security, ethics, and engineering is exactly what this course on AI security and ethical engineering is about.

Conclusion

The suspension of Fable 5 and Mythos 5 isn't, at its core, a story about a broken model. It's a story about how quickly a frontier capability can move from "available to everyone" to "unavailable by government order" — and about who pays the bill when it does. Not the vendor. The person who built on that model assuming it would always be there.

Anthropic calls it a misunderstanding and is working to restore access; the models may well return in some form. But resilience isn't built on "probably." It's built on architecture: an abstraction layer, at least one fallback from a different jurisdiction, a safety net you control, and the habit of rehearsing the switch before you need it.

Engineers who treat models like the external dependencies they are will look back on the Fable 5 episode as a useful lesson rather than a costly outage. The difference isn't which model you pick. It's the architecture you wrap around it.

The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with deep, hands-on tracks on system architecture, model selection, agent design, and the practical engineering of LLM applications — kept current with the 2026 lineup.

Sources:

Anthropic — Statement on the US government directive to suspend access to Fable 5 and Mythos 5
Anthropic — Claude Fable 5 and Claude Mythos 5
CNBC — Anthropic disables access to Fable 5 and Mythos 5 to comply with government directive
9to5Mac — Anthropic pulls Claude Mythos 5 and Claude Fable 5 following US government directive
U.S. Bureau of Industry and Security — Export Administration Regulations (EAR)

This article is for informational purposes and is not legal advice. Dates and named individuals drawn from press reporting may be updated as the situation evolves; check official sources for the current access status.