<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vaibhav Doddihal</title>
    <description>The latest articles on DEV Community by Vaibhav Doddihal (@vibbsdod).</description>
    <link>https://dev.to/vibbsdod</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2123059%2F932c1902-7bf3-4d07-92c2-d94d43d16ca8.jpg</url>
      <title>DEV Community: Vaibhav Doddihal</title>
      <link>https://dev.to/vibbsdod</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vibbsdod"/>
    <language>en</language>
    <item>
      <title>The Leap to Agentic AI: Introduction to Multi-Agent Systems</title>
      <dc:creator>Vaibhav Doddihal</dc:creator>
      <pubDate>Thu, 02 Jul 2026 10:28:31 +0000</pubDate>
      <link>https://dev.to/vibbsdod/the-leap-to-agentic-ai-introduction-to-multi-agent-systems-5cdh</link>
      <guid>https://dev.to/vibbsdod/the-leap-to-agentic-ai-introduction-to-multi-agent-systems-5cdh</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blocksimplified.com/blog/leap-to-agentic-ai-multi-agent-systems" rel="noopener noreferrer"&gt;BlockSimplified&lt;/a&gt; — 24 min read&lt;/em&gt;&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;This post is part of my &lt;strong&gt;AI Fluency&lt;/strong&gt; series. We've covered single agents in Module 4; now we're scaling up. Module 5 is about getting multiple agents to work together, which is harder than it sounds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I remember the moment I realized single agents weren't enough. I had built a research assistant that could search the web, summarize articles, and answer questions. It worked well for simple queries. Then I asked it to "research the AI agent landscape, compare the top 5 frameworks, and write a technical blog post with code examples." It choked. The context window filled up, the output became unfocused, and the code examples were hallucinated garbage.&lt;/p&gt;

&lt;p&gt;That's when I started exploring &lt;strong&gt;multi-agent systems&lt;/strong&gt;. The idea is simple: instead of one agent doing everything, you create a team. A researcher agent gathers information. A writer agent crafts the prose. A coder agent handles the technical examples. A reviewer agent catches errors. Each specialist does what it's good at, and together they produce something none could create alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Single Agents Hit a Wall
&lt;/h2&gt;

&lt;p&gt;Single agents are powerful. With the right tools and prompts, they can do impressive things. The question is where they break down.&lt;/p&gt;

&lt;p&gt;Here are the walls I've hit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window limits.&lt;/strong&gt; Complex tasks pile up information fast: research results, previous outputs, tool responses, conversation history. A single agent running a 10-step workflow can run out of context space before it finishes the job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialization beats generalization.&lt;/strong&gt; A system prompt can only stretch one agent so far. Ask it to be a world-class researcher AND writer AND coder AND editor, and you get the jack-of-all-trades problem: competent everywhere, excellent nowhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No second opinion.&lt;/strong&gt; A single agent can hallucinate, make logical errors, or drift off-task, and nobody is watching. A second agent that reviews the first one's work catches mistakes that would otherwise slip through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallelization.&lt;/strong&gt; A single agent works through tasks one at a time. When subtasks are independent, multiple agents can research different parts of a problem at the same time.&lt;/p&gt;
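
&lt;p&gt;A minimal sketch of that parallel fan-out, assuming each agent call can be written as an async function. &lt;code&gt;research_subtopic&lt;/code&gt; is a hypothetical stand-in for a real model or tool call, simulated here with a short sleep:&lt;/p&gt;

```python
import asyncio


async def research_subtopic(topic):
    # Hypothetical stand-in for a real agent call (LLM request, web search, ...).
    await asyncio.sleep(0.01)
    return f"notes on {topic}"


async def research_in_parallel(topics):
    # Launch one research call per independent subtopic and wait for all of them.
    # gather() returns results in the same order as its inputs.
    results = await asyncio.gather(*(research_subtopic(t) for t in topics))
    return dict(zip(topics, results))


if __name__ == "__main__":
    notes = asyncio.run(research_in_parallel(["frameworks", "protocols", "failure modes"]))
    for topic, text in notes.items():
        print(topic, "=", text)
```

&lt;p&gt;With independent subtasks, total wall-clock time is roughly that of the slowest call rather than the sum of all calls.&lt;/p&gt;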




&lt;h2&gt;
  
  
  The Human Team Analogy
&lt;/h2&gt;

&lt;p&gt;Think about how a software team actually ships a feature. Nobody assigns one person to design, build, test, and document it solo. You split the work.&lt;/p&gt;

&lt;p&gt;Here's how that maps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Agent Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Product Manager&lt;/td&gt;
&lt;td&gt;Defines requirements, prioritizes&lt;/td&gt;
&lt;td&gt;Planner Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Researcher&lt;/td&gt;
&lt;td&gt;Investigates solutions, gathers context&lt;/td&gt;
&lt;td&gt;Research Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer&lt;/td&gt;
&lt;td&gt;Writes the code&lt;/td&gt;
&lt;td&gt;Coder Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Reviewer&lt;/td&gt;
&lt;td&gt;Catches bugs, suggests improvements&lt;/td&gt;
&lt;td&gt;Reviewer Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Writer&lt;/td&gt;
&lt;td&gt;Documents the work&lt;/td&gt;
&lt;td&gt;Writer Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA Tester&lt;/td&gt;
&lt;td&gt;Validates the implementation&lt;/td&gt;
&lt;td&gt;Tester Agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each person has deep expertise in their area. They communicate through defined channels (standups, PRs, docs). A project manager coordinates the workflow. Sound familiar?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent orchestration&lt;/strong&gt; is the AI equivalent of project management. Someone (or something) needs to decide: which agent handles this task? In what order? What happens when an agent fails?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdyuixw1md5ydwm8c18ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdyuixw1md5ydwm8c18ah.png" alt="A software team mapped to its AI agent counterparts, shown as two mirrored isometric groups connected by glowing lines" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Role Specialization: What Makes Each Agent Unique
&lt;/h2&gt;

&lt;p&gt;In a multi-agent system, each agent has a distinct &lt;strong&gt;role&lt;/strong&gt;. The role defines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What the agent knows&lt;/strong&gt; (system prompt, context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What the agent can do&lt;/strong&gt; (available tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What the agent is responsible for&lt;/strong&gt; (its piece of the workflow)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a concrete example. Let's say you're building a "research and write" system for technical blog posts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Research Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role: Technical Researcher
Goal: Gather accurate, comprehensive information on the given topic
Backstory: You're a meticulous researcher who digs deep into technical topics.
           You cite sources, verify claims, and organize findings clearly.

Tools: web_search, read_documentation, fetch_github_repos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This agent's entire job is research. It doesn't write prose or format content. It searches, reads, and compiles facts. Its output is structured research notes that another agent will use.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Writer Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role: Technical Writer
Goal: Transform research into engaging, clear technical content
Backstory: You're an experienced technical writer who explains complex topics
           in accessible language. You use analogies, examples, and structure.

Tools: none (pure generation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The writer takes the researcher's output and crafts it into readable content. It doesn't search the web or verify facts; that was done upstream. It focuses purely on writing quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Editor Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role: Technical Editor
Goal: Review content for accuracy, clarity, and consistency
Backstory: You're a detail-oriented editor who catches errors others miss.
           You verify technical claims, improve sentence structure, and ensure
           the content matches the target audience.

Tools: fact_check, grammar_check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The editor is the quality gate. It reviews the writer's output, flags issues, and either approves or requests revisions.&lt;/p&gt;
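
&lt;p&gt;One way to keep these role cards in code is a small data class. This is a sketch, not any framework's API: &lt;code&gt;AgentRole&lt;/code&gt;, &lt;code&gt;system_prompt&lt;/code&gt;, and &lt;code&gt;can_use&lt;/code&gt; are names I made up to mirror the three things a role defines.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    # Field names mirror the role cards above; they are illustrative only.
    role: str
    goal: str
    backstory: str
    tools: tuple = ()

    def system_prompt(self):
        # What the agent knows: role, goal, and backstory folded into one prompt.
        return f"You are a {self.role}. Goal: {self.goal}. {self.backstory}"

    def can_use(self, tool_name):
        # What the agent can do: only the tools listed on its card.
        return tool_name in self.tools


RESEARCHER = AgentRole(
    role="Technical Researcher",
    goal="Gather accurate, comprehensive information on the given topic",
    backstory="You cite sources, verify claims, and organize findings clearly.",
    tools=("web_search", "read_documentation", "fetch_github_repos"),
)

WRITER = AgentRole(
    role="Technical Writer",
    goal="Transform research into engaging, clear technical content",
    backstory="You explain complex topics in accessible language.",
)
```

&lt;p&gt;Keeping the tool list on the role makes the boundary explicit: the writer has no tools, so it cannot quietly start doing its own research.&lt;/p&gt;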




&lt;h2&gt;
  
  
  Collaboration Patterns: Centralized vs. Decentralized
&lt;/h2&gt;

&lt;p&gt;Once you have multiple agents, you need to decide how they collaborate. The two main &lt;strong&gt;agent collaboration patterns&lt;/strong&gt; are centralized and decentralized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized: The Manager Pattern
&lt;/h3&gt;

&lt;p&gt;In centralized orchestration, one agent (the "manager" or "coordinator") controls the workflow. It receives the initial task, breaks it into subtasks, assigns each to the appropriate specialist agent, collects results, and delivers the final output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                                                 │
│              ┌──────────────┐                   │
│              │   MANAGER    │                   │
│              │    AGENT     │                   │
│              └──────┬───────┘                   │
│                     │                           │
│         ┌──────────┼──────────┐                 │
│         ▼          ▼          ▼                 │
│   ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│   │ RESEARCH │ │  WRITER  │ │  EDITOR  │        │
│   │  AGENT   │ │  AGENT   │ │  AGENT   │        │
│   └──────────┘ └──────────┘ └──────────┘        │
│                                                 │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear control flow&lt;/li&gt;
&lt;li&gt;Easy to debug (you can trace every decision through the manager)&lt;/li&gt;
&lt;li&gt;Single point of coordination&lt;/li&gt;
&lt;li&gt;Works well for defined workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manager is a bottleneck&lt;/li&gt;
&lt;li&gt;If the manager fails, everything fails&lt;/li&gt;
&lt;li&gt;Doesn't scale well to large agent networks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams should start here. It's simpler, and you can always evolve to decentralized later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decentralized: Agent-to-Agent
&lt;/h3&gt;

&lt;p&gt;In decentralized patterns, agents communicate directly with each other based on protocols or discovery mechanisms. There's no central manager; agents negotiate, delegate, and collaborate autonomously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐         ┌──────────┐             │
│   │ RESEARCH │◄───────►│  WRITER  │             │
│   │  AGENT   │         │  AGENT   │             │
│   └────┬─────┘         └────┬─────┘             │
│        │                    │                   │
│        │    ┌──────────┐    │                   │
│        └───►│  EDITOR  │◄───┘                   │
│             │  AGENT   │                        │
│             └──────────┘                        │
│                                                 │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;li&gt;Scales to large agent networks&lt;/li&gt;
&lt;li&gt;Agents can dynamically discover collaborators&lt;/li&gt;
&lt;li&gt;Works for marketplace/negotiation scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harder to debug (who decided what?)&lt;/li&gt;
&lt;li&gt;Requires protocols and discovery mechanisms&lt;/li&gt;
&lt;li&gt;More failure modes&lt;/li&gt;
&lt;li&gt;Higher complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Decentralized patterns make sense when you have many agents, dynamic environments, or need agents from different vendors/platforms to collaborate. This is where protocols like &lt;strong&gt;A2A (Agent2Agent)&lt;/strong&gt; come in.&lt;/p&gt;
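
&lt;p&gt;To make the idea concrete, here is a toy sketch of peer-to-peer delegation. It is not the A2A wire format: &lt;code&gt;Peer&lt;/code&gt; and the plain dictionary used as a directory are assumptions that stand in for a real discovery mechanism. Each agent advertises its skills and hands work directly to whichever peer offers the skill it lacks.&lt;/p&gt;

```python
class Peer:
    """Hypothetical peer agent that advertises skills and delegates directly."""

    def __init__(self, name, skills, directory):
        self.name = name
        self.skills = set(skills)
        self.directory = directory
        directory[name] = self  # self-registration stands in for discovery

    def find(self, skill):
        # Look for any other peer that advertises the requested skill.
        for peer in self.directory.values():
            if peer is not self and skill in peer.skills:
                return peer
        return None

    def handle(self, skill, payload):
        if skill in self.skills:
            return f"{self.name} did {skill} on {payload}"
        peer = self.find(skill)
        if peer is None:
            raise LookupError(f"no peer offers {skill}")
        # Delegate peer-to-peer; no manager is involved in the hand-off.
        return peer.handle(skill, payload)
```

&lt;p&gt;Even in this toy version the debugging cost shows up: the caller gets a result back but has no record of which peer produced it unless you add tracing.&lt;/p&gt;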




&lt;h2&gt;
  
  
  Communication: How Agents Talk to Each Other
&lt;/h2&gt;

&lt;p&gt;Agents need to exchange information. The mechanism you choose affects how easy the system is to debug, how reliably it delivers results, and whether agents from different vendors can work together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message Passing
&lt;/h3&gt;

&lt;p&gt;The simplest approach: agents send messages to each other. The message includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who it's from&lt;/li&gt;
&lt;li&gt;Who it's for&lt;/li&gt;
&lt;li&gt;The content (task, results, questions)&lt;/li&gt;
&lt;li&gt;Any relevant context
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified message structure
&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;writer_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi-agent systems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works for simple systems but gets messy as you scale. Who manages the message queue? How do you handle failed delivery? What if agents speak different "languages"?&lt;/p&gt;
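
&lt;p&gt;A small in-memory sketch of that message flow, under a few assumptions: &lt;code&gt;Message&lt;/code&gt; and &lt;code&gt;Mailbox&lt;/code&gt; are hypothetical names, the fields are called &lt;code&gt;sender&lt;/code&gt; and &lt;code&gt;recipient&lt;/code&gt; because &lt;code&gt;from&lt;/code&gt; is a reserved word in Python, and delivery is a per-recipient queue with no persistence or retries.&lt;/p&gt;

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    type: str
    content: dict = field(default_factory=dict)


class Mailbox:
    """In-memory router: one FIFO queue per recipient (hypothetical helper)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, message):
        # Append to the recipient's inbox; order of arrival is preserved.
        self._queues[message.recipient].append(message)

    def receive(self, recipient):
        # Return the oldest pending message, or None when the inbox is empty.
        queue = self._queues[recipient]
        return queue.popleft() if queue else None
```

&lt;p&gt;Everything that makes message passing hard at scale is what this sketch leaves out: durable storage, acknowledgements, and a shared schema for &lt;code&gt;content&lt;/code&gt;.&lt;/p&gt;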

&lt;h3&gt;
  
  
  Shared Memory
&lt;/h3&gt;

&lt;p&gt;Instead of passing messages, agents read from and write to a shared state (like a database or in-memory store). Each agent checks the shared memory for new tasks, updates it with results, and other agents see those updates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│   │ RESEARCH │   │  WRITER  │   │  EDITOR  │    │
│   │  AGENT   │   │  AGENT   │   │  AGENT   │    │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘    │
│        │              │              │          │
│        ▼              ▼              ▼          │
│   ┌──────────────────────────────────────┐      │
│   │         SHARED MEMORY / STATE        │      │
│   │  (Redis, Vector Store, Database)     │      │
│   └──────────────────────────────────────┘      │
│                                                 │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use shared memory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents need access to the same data&lt;/li&gt;
&lt;li&gt;You want to decouple producers from consumers&lt;/li&gt;
&lt;li&gt;State needs to persist across agent restarts&lt;/li&gt;
&lt;/ul&gt;
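
&lt;p&gt;The same idea as a runnable sketch. &lt;code&gt;SharedState&lt;/code&gt; is a hypothetical in-process stand-in for Redis or a database, and &lt;code&gt;writer_step&lt;/code&gt; shows a consumer that waits for the producer's entry instead of being called by it.&lt;/p&gt;

```python
import threading


class SharedState:
    """Tiny in-process blackboard; a stand-in for Redis or a database."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value, author):
        # Record who wrote each entry so later debugging can trace it.
        with self._lock:
            self._data[key] = {"value": value, "author": author}

    def read(self, key):
        # Return the stored value, or None if no agent has written it yet.
        with self._lock:
            entry = self._data.get(key)
            return entry["value"] if entry else None


def writer_step(state):
    # The writer only proceeds once the researcher has published notes.
    notes = state.read("research_notes")
    if notes is None:
        return False
    state.write("draft", f"Article based on: {notes}", author="writer_agent")
    return True
```

&lt;p&gt;The researcher never references the writer here. It only writes &lt;code&gt;research_notes&lt;/code&gt;, which is the decoupling the list above describes.&lt;/p&gt;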

&lt;h3&gt;
  
  
  Communication Protocols: A2A (and where ACP went)
&lt;/h3&gt;

&lt;p&gt;For agents to communicate across platforms or vendors, you need standardized protocols. The picture got a lot clearer in 2025-2026, and it's worth knowing how it shook out so you don't bet on a dead standard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent2Agent (A2A):&lt;/strong&gt; Originally Google's open protocol for agent interoperability, &lt;a href="https://developers.googleblog.com/en/google-cloud-donates-a2a-to-linux-foundation/" rel="noopener noreferrer"&gt;donated to the Linux Foundation in June 2025&lt;/a&gt; under neutral governance. Agents publish "Agent Cards" describing what they can do, and other agents can discover and invoke them. A2A &lt;a href="https://a2a-protocol.org/latest/announcing-1.0/" rel="noopener noreferrer"&gt;reached v1.0 in April 2026&lt;/a&gt; with &lt;strong&gt;Signed Agent Cards&lt;/strong&gt; (cryptographic identity verification so a receiving agent can confirm a card really came from its claimed owner), multi-tenancy, and JSON-RPC/gRPC bindings. By its one-year mark it had &lt;a href="https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year" rel="noopener noreferrer"&gt;150+ supporting organizations&lt;/a&gt; and production integrations across Azure AI Foundry, Amazon Bedrock AgentCore, and Google Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Communication Protocol (ACP):&lt;/strong&gt; IBM's REST-based protocol (launched March 2025 to power the BeeAI platform) for lightweight agent invocation. In August 2025, &lt;a href="https://lfaidata.foundation/communityblog/2025/08/29/acp-joins-forces-with-a2a-under-the-linux-foundations-lf-ai-data/" rel="noopener noreferrer"&gt;ACP merged into A2A under the Linux Foundation&lt;/a&gt; — the BeeAI platform now uses A2A. So if you saw "ACP vs A2A" debates from early 2025, that question has been answered: it's A2A.&lt;/p&gt;

&lt;p&gt;These protocols matter for the future of multi-agent systems. Today, most teams still let frameworks handle communication internally. But as agent ecosystems grow and you need agents from different vendors to collaborate, A2A is becoming the interoperability layer worth learning.&lt;/p&gt;

&lt;p&gt;One concrete sign of maturity: A2A now has an official payments extension, the &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol" rel="noopener noreferrer"&gt;Agent Payments Protocol (AP2)&lt;/a&gt;, built with 60+ payments and tech companies (Mastercard, PayPal, American Express, Coinbase, and others) so agents can securely initiate and authorize transactions on a user's behalf. Agentic commerce is moving from demo to standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Simple Multi-Agent Example
&lt;/h2&gt;

&lt;p&gt;Here's pseudocode for a "research and write" system in which a manager agent coordinates two specialists (centralized orchestration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode for a simple multi-agent system
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_multi_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Manager breaks down the task
&lt;/span&gt;    &lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Manager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coordinate research and writing tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;subtasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# subtasks = ["Research X", "Write article about X"]
&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Research agent handles research
&lt;/span&gt;    &lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_docs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Writer agent handles writing, using research results
&lt;/span&gt;    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Manager reviews and returns
&lt;/span&gt;    &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is about 20 lines of pseudocode, but it captures the core pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A manager agent plans the workflow&lt;/li&gt;
&lt;li&gt;Specialist agents execute their piece&lt;/li&gt;
&lt;li&gt;Results flow from one agent to the next&lt;/li&gt;
&lt;li&gt;The manager delivers the final output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Real implementations add error handling, retries, logging, and more sophisticated orchestration. But the foundation is this simple.&lt;/p&gt;
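
&lt;p&gt;If you want to run the flow end to end before wiring in a model, here is one possible stubbed version. &lt;code&gt;StubAgent&lt;/code&gt;, &lt;code&gt;plan&lt;/code&gt;, and &lt;code&gt;run_pipeline&lt;/code&gt; are hypothetical names, and &lt;code&gt;execute&lt;/code&gt; returns a canned string where a real agent would call an LLM.&lt;/p&gt;

```python
class StubAgent:
    """Hypothetical agent; execute() fakes the model call so the flow is runnable."""

    def __init__(self, role, context=None):
        self.role = role
        self.context = context

    def execute(self, task):
        # A real agent would call an LLM here with its role, context, and task.
        suffix = f" using [{self.context}]" if self.context else ""
        return f"{self.role} output for '{task}'{suffix}"


def plan(task):
    # Stand-in for the manager's planning step: a fixed two-step split.
    return [f"Research {task}", f"Write article about {task}"]


def run_pipeline(task):
    research_task, writing_task = plan(task)
    research = StubAgent("Researcher").execute(research_task)
    # The writer receives the research as context, mirroring Step 3 above.
    draft = StubAgent("Writer", context=research).execute(writing_task)
    return {"research": research, "draft": draft}
```

&lt;p&gt;Because the stubs are deterministic, you can unit-test the orchestration logic (ordering, hand-offs, what each agent sees) separately from model quality.&lt;/p&gt;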




&lt;h2&gt;
  
  
  When Things Go Wrong: Handling Agent Failures
&lt;/h2&gt;

&lt;p&gt;In single-agent systems, failure is straightforward: the agent errors, you handle it. In multi-agent systems, failures cascade. Agent A fails, so Agent B doesn't get input, so Agent C produces garbage.&lt;/p&gt;

&lt;p&gt;This isn't hand-waving. It's measured. UC Berkeley's &lt;a href="https://arxiv.org/abs/2503.13657" rel="noopener noreferrer"&gt;MAST study ("Why Do Multi-Agent LLM Systems Fail?")&lt;/a&gt; hand-annotated 150 conversation traces across seven popular open-source multi-agent frameworks and found 14 distinct failure modes that cluster into three buckets: &lt;strong&gt;system/specification design&lt;/strong&gt; (~41%), &lt;strong&gt;inter-agent misalignment&lt;/strong&gt; (~37%), and &lt;strong&gt;task verification&lt;/strong&gt; (~21%). The headline takeaway: most multi-agent failures aren't reasoning failures — they're coordination and verification failures. Agents act on stale or divergent views of shared state, or nobody checks the final output. That's exactly where you should spend your engineering effort.&lt;/p&gt;

&lt;p&gt;Here's how I think about failure handling:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fail Fast with Clear Errors
&lt;/h3&gt;

&lt;p&gt;Each agent should validate its inputs and outputs. If an agent receives garbage, it should fail immediately with a clear error rather than produce garbage output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_validation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Validate input
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; received empty input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Validate output
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; produced insufficient output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Retry with Backoff
&lt;/h3&gt;

&lt;p&gt;Transient failures (rate limits, network blips) should trigger retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Placeholder for retryable errors (rate limits, timeouts); swap in your client's exception type
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TransientError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;TransientError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Out of attempts: re-raise so the caller sees the original error
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff: 1s, 2s, 4s, ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Fallback Agents
&lt;/h3&gt;

&lt;p&gt;For critical tasks, have a backup. If your primary research agent fails, maybe a simpler agent with different tools can provide basic results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary_research_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;AgentError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Primary researcher failed, using fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback_research_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Human Escalation
&lt;/h3&gt;

&lt;p&gt;Sometimes the right answer is "ask a human." Build escalation paths for high-stakes or ambiguous situations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_with_escalation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;confidence_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;request_human_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
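
&lt;p&gt;These patterns compose. Here is a minimal, self-contained sketch of retry, fallback, and escalation chained together. Everything in it is a hypothetical stand-in rather than part of any framework: &lt;code&gt;AgentError&lt;/code&gt;, &lt;code&gt;TransientError&lt;/code&gt;, &lt;code&gt;Result&lt;/code&gt;, &lt;code&gt;StubAgent&lt;/code&gt;, &lt;code&gt;request_human_review&lt;/code&gt;, and &lt;code&gt;run_resilient&lt;/code&gt; exist only so the example runs on its own.&lt;/p&gt;

```python
import time
from dataclasses import dataclass


class AgentError(Exception):
    """Hypothetical base class for agent failures."""


class TransientError(AgentError):
    """Hypothetical error for rate limits and network blips."""


@dataclass
class Result:
    text: str
    confidence: float


class StubAgent:
    """Stand-in agent that fails a fixed number of times, then answers."""

    def __init__(self, role, failures, confidence):
        self.role = role
        self.failures = failures
        self.confidence = confidence

    def execute(self, task):
        if self.failures > 0:
            self.failures -= 1
            raise TransientError(f"{self.role} hit a rate limit")
        return Result(text=f"{self.role} finished: {task}", confidence=self.confidence)


def execute_with_retry(agent, task, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries):
        try:
            return agent.execute(task)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


def request_human_review(task, result):
    # Placeholder: a real system would enqueue the item for a reviewer.
    return Result(text=f"NEEDS REVIEW: {result.text}", confidence=result.confidence)


def run_resilient(primary, fallback, task, confidence_threshold=0.7):
    # Retry the primary agent, then fall back, then escalate if unsure.
    try:
        result = execute_with_retry(primary, task)
    except AgentError:
        result = execute_with_retry(fallback, task)
    if confidence_threshold > result.confidence:
        return request_human_review(task, result)
    return result


flaky = StubAgent("primary", failures=5, confidence=0.9)
backup = StubAgent("fallback", failures=1, confidence=0.6)
outcome = run_resilient(flaky, backup, "summarize frameworks")
print(outcome.text)
```

&lt;p&gt;The primary agent exhausts its retries, the fallback succeeds on its second attempt, and its low confidence routes the result to human review. The ordering is a design choice: retries are cheapest, so they sit innermost, and human review is the most expensive, so it comes last.&lt;/p&gt;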






&lt;h2&gt;
  
  
  Practical Starting Point: CrewAI
&lt;/h2&gt;

&lt;p&gt;If you want to try multi-agent systems today, &lt;a href="https://docs.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; is a good starting point. It provides clear abstractions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agents:&lt;/strong&gt; Define role, goal, backstory, tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks:&lt;/strong&gt; Define what needs to be done, expected output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crews:&lt;/strong&gt; Group agents and tasks into workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processes:&lt;/strong&gt; Sequential or hierarchical execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what a minimal CrewAI setup looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;

&lt;span class="c1"&gt;# Define agents
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find accurate, comprehensive information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re an expert researcher with attention to detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scrape_tool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Technical Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create clear, engaging technical content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You explain complex topics in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define tasks
&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research multi-agent systems and their applications&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Structured research notes with sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writing_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a blog post based on the research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1500-word blog post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create and run the crew
&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;research_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writing_task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



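&lt;p&gt;This is real CrewAI code, not pseudocode, with one gap: &lt;code&gt;search_tool&lt;/code&gt; and &lt;code&gt;scrape_tool&lt;/code&gt; are not defined in the snippet, so you need to create or import them before it will run. CrewAI handles the orchestration, message passing, and execution order. You focus on defining agents and tasks.&lt;/p&gt;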




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;You've got the foundations: why single agents hit walls, how to carve up roles, and when centralized beats decentralized. Later posts in this series will go deeper.&lt;/p&gt;


&lt;p&gt;For now, pick a task that naturally splits into two phases (research + writing is a good one), build one agent for each, and run it. You'll hit a failure mode or an unexpected behavior within the first few runs. That's the real learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Concepts Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multi-Agent Systems&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collaboration Patterns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agents&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic AI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic Systems&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A2A Protocol&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Function Calling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a multi-agent system in AI?
&lt;/h3&gt;

&lt;p&gt;A multi-agent system is a setup where several specialized AI agents collaborate to solve problems that would overwhelm a single agent. Each agent has a specific role, set of tools, and responsibility, like a researcher, a writer, and an editor working as a team instead of one person trying to do all three jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use a multi-agent system instead of a single agent?
&lt;/h3&gt;

&lt;p&gt;Reach for multi-agent when you hit real walls with a single agent: context window limits on long workflows, the need for true specialization that one system prompt can't deliver, wanting a second agent to review for errors, or genuine parallelization. If a well-crafted prompt chain already solves your problem, stop there. Multi-agent adds real complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between centralized and decentralized agent orchestration?
&lt;/h3&gt;

&lt;p&gt;Centralized orchestration uses a manager agent to assign tasks and collect results, which gives you clear control flow, easy debugging, and a good fit for defined workflows. Decentralized lets agents communicate directly without a central coordinator, which scales better in dynamic environments but is harder to debug and more complex to build. Start centralized.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many agents should I use in a multi-agent system?
&lt;/h3&gt;

&lt;p&gt;Start with the minimum: usually 2-3 agents with clearly distinct roles. Every additional agent adds coordination overhead, more API calls, and more potential failure points. Add agents only when you've identified a clear capability gap. In practice, 3-5 agents handle most use cases. If you need more, you might be over-engineering or could restructure into sub-crews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can different agents use different LLM providers?
&lt;/h3&gt;

&lt;p&gt;Yes, and sometimes you should. A research agent might benefit from a model with strong web browsing capabilities, while a coding agent works better with a model optimized for code generation. Mix and match based on each agent's needs. Just watch out for increased complexity in error handling, cost tracking, and latency management.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between multi-agent systems and prompt chaining?
&lt;/h3&gt;

&lt;p&gt;Prompt chaining is sequential: output from prompt A becomes input for prompt B. It is linear and deterministic. Multi-agent systems add autonomy: agents can decide what to do, use tools, and communicate in non-linear ways. Prompt chains are simpler and sufficient for many use cases. Multi-agent systems handle more complex, dynamic workflows.&lt;/p&gt;
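
&lt;p&gt;The contrast is easier to see in code. This is a rough sketch with stubs, not a real implementation: &lt;code&gt;fake_llm&lt;/code&gt;, &lt;code&gt;run_chain&lt;/code&gt;, &lt;code&gt;run_agent&lt;/code&gt;, and &lt;code&gt;simple_policy&lt;/code&gt; are made-up names, and in a real system the policy would itself be a model call.&lt;/p&gt;

```python
def fake_llm(prompt):
    # Stub standing in for a model call so the sketch runs offline.
    return f"[{prompt}]"


def run_chain(topic):
    # Prompt chain: fixed steps, each output feeds the next prompt.
    notes = fake_llm(f"research {topic}")
    draft = fake_llm(f"write from {notes}")
    return fake_llm(f"edit {draft}")


def run_agent(topic, choose_action, max_steps=5):
    # Agent loop: a policy picks the next action from the state so far.
    tools = {
        "search": lambda state: state + [f"results for {topic}"],
        "write": lambda state: state + [f"draft using {len(state)} notes"],
    }
    state = []
    for _ in range(max_steps):
        action = choose_action(state)
        if action == "stop":
            break
        state = tools[action](state)
    return state


def simple_policy(state):
    # Hypothetical policy: search twice, write once, then stop.
    if len(state) == 0 or len(state) == 1:
        return "search"
    if len(state) == 2:
        return "write"
    return "stop"


chain_output = run_chain("agents")
agent_trace = run_agent("agents", simple_policy)
print(chain_output)
print(agent_trace)
```

&lt;p&gt;The chain's control flow is written by you and never changes. The agent's control flow is decided at run time, which is where both the flexibility and the debugging cost come from.&lt;/p&gt;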





</description>
      <category>ai</category>
      <category>llm</category>
      <category>multiagent</category>
      <category>agenticai</category>
    </item>
    <item>
      <title>The "Why": A Framework for AI Ethics</title>
      <dc:creator>Vaibhav Doddihal</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:45:53 +0000</pubDate>
      <link>https://dev.to/vibbsdod/the-why-a-framework-for-ai-ethics-53fe</link>
      <guid>https://dev.to/vibbsdod/the-why-a-framework-for-ai-ethics-53fe</guid>
      <description>&lt;h1&gt;
  
  
  The "Why": A Framework for AI Ethics
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blocksimplified.com/blog/framework-for-ai-ethics" rel="noopener noreferrer"&gt;BlockSimplified&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article is part of my &lt;strong&gt;AI Fluency Curriculum&lt;/strong&gt;, documenting my learnings around AI Fluency &amp;amp; Applied AI.&lt;br&gt;
This is the first post in &lt;strong&gt;Module 8: Ethics, Safety, and Governance&lt;/strong&gt;. We're starting with the foundational question: why do ethics matter for AI, and how do we actually practice them?&lt;/p&gt;

&lt;p&gt;Ethics in AI isn't a box to check at the end of a project. It's a way of thinking that shapes every decision, from data collection to deployment. Get it wrong, and your AI fails the people who use it. Real people. At scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current as of June 2026&lt;/strong&gt;&lt;br&gt;
This guide reflects the latest landscape: the EU AI Act's general-purpose AI obligations went live in August 2025, but in May 2026 EU negotiators provisionally agreed a "Digital Omnibus" that pushes the high-risk-system obligations back from August 2026 to December 2027; the 2025-26 wave of AI hiring-bias litigation (Mobley v. Workday's certified collective action, still escalating in 2026, and Harper v. Sirius XM); Fairlearn 0.14 (June 2026); and the growing role of ISO/IEC 42001 and the NIST Generative AI Profile as governance baselines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I want to start with a confession. When I first heard "AI ethics" years ago, I mentally filed it under "compliance stuff that slows down real work." I was wrong, and the cases in this post are what changed my mind.&lt;/p&gt;

&lt;p&gt;Look closely at how the famous failures actually happened and the same shape shows up every time. A well-intentioned team. No malice. No obvious bug. The system worked exactly as designed, but the design hadn't accounted for the ethical implications of the data patterns it learned. Amazon spent years building a recruiting tool it eventually threw away. A healthcare algorithm shaped care decisions for an estimated 100 million people before anyone caught what it was doing. None of these were cheap to fix, and none were caught early.&lt;/p&gt;

&lt;p&gt;Ethics isn't the enemy of shipping. It's the prerequisite for shipping something that doesn't blow up in your face.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The four pillars of AI ethics, in brief&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI ethics breaks down into four pillars (FATP): fairness, accountability, transparency, and privacy.&lt;/li&gt;
&lt;li&gt;AI amplifies bias in its training data: Amazon's recruiting tool, COMPAS risk scores, and a healthcare algorithm affecting 100 million patients all show how.&lt;/li&gt;
&lt;li&gt;Fairness definitions are mathematically incompatible when base rates differ, so you must pick one and document the trade-off.&lt;/li&gt;
&lt;li&gt;Treat ethics as a design constraint with concrete tools like Fairlearn, not a post-launch box to check.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;p&gt;By the end of this post, you'll be able to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand and apply&lt;/strong&gt; the four pillars of AI ethics: fairness, accountability, transparency, and privacy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recognize real-world examples&lt;/strong&gt; of AI systems causing ethical harm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use technical interventions&lt;/strong&gt; like Fairlearn for measuring and mitigating bias&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigate the trade-offs&lt;/strong&gt; between competing ethical principles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll cover three levels: Beginner (real-world case studies of AI bias), Intermediate (technical fairness interventions), and Advanced (societal trade-offs and formal mitigation strategies).&lt;/p&gt;




&lt;h2&gt;
  
  
  Beginner: Real-World Case Studies of AI Harm
&lt;/h2&gt;

&lt;p&gt;Let me start with stories. Not because I want to scare you, but because ethical failures aren't abstract. They happen to real people, and understanding what went wrong is the first step to building differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study 1: Amazon's Hiring Algorithm
&lt;/h3&gt;

&lt;p&gt;In 2018, Reuters reported that Amazon had scrapped an internal AI recruiting tool after discovering it was biased against women. The system was trained on resumes submitted over 10 years, a period when the tech industry was predominantly male. The AI learned to prefer male candidates: it penalized resumes that included the word "women's" (as in "women's chess club") and downgraded graduates of all-women's colleges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwa2emljj28ig5zyfyy6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwa2emljj28ig5zyfyy6d.png" alt="neural network visualization with resume documents flowing in, showing the system learning from historical hiring patterns" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AI wasn't malicious. It was doing exactly what it was trained to do: find patterns in historical data and replicate them. The problem was that historical data encoded historical discrimination. The algorithm automated and scaled what had been individual human bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; AI systems don't transcend their training data. They amplify it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study 2: COMPAS Criminal Risk Assessment
&lt;/h3&gt;

&lt;p&gt;ProPublica's 2016 investigation of COMPAS, a criminal risk assessment algorithm used across the US, found that Black defendants who did not go on to reoffend were nearly twice as likely as white defendants to be incorrectly flagged as high-risk. A white defendant and a Black defendant with similar criminal histories could receive different risk scores.&lt;/p&gt;

&lt;p&gt;The company that made COMPAS, Northpointe (now Equivant), disputed the methodology. It argued that the algorithm met a different fairness criterion: among defendants given the same risk score, reoffense rates were similar across racial groups. Both sides were technically correct; they were just using different definitions of fairness.&lt;/p&gt;

&lt;p&gt;The COMPAS case surfaces something the Amazon case didn't: algorithmic fairness isn't a single thing. There are multiple, mathematically incompatible definitions of what "fair" means. You literally cannot satisfy all of them simultaneously when different groups have different base rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; "Make it fair" isn't a specification. You have to choose which type of fairness matters most for your context, and be honest about the trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study 3: Healthcare Algorithm Racial Bias
&lt;/h3&gt;

&lt;p&gt;A 2019 study published in Science found that a widely used healthcare algorithm systematically underestimated the health needs of Black patients. The algorithm used healthcare costs as a proxy for health needs, but because Black patients historically had less access to healthcare (and thus lower costs), the algorithm concluded they were healthier than they actually were.&lt;/p&gt;

&lt;p&gt;The result? Sicker Black patients were deprioritized for care programs compared to healthier white patients. The algorithm affected an estimated 100 million patients annually.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;100 million&lt;/strong&gt;: patients affected annually by a healthcare algorithm that underestimated the health needs of Black patients. It used healthcare costs as a proxy for health, and Black patients historically had lower costs due to less access to care. (Obermeyer et al., Science, 2019)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy variables are dangerous&lt;/strong&gt;&lt;br&gt;
The healthcare algorithm never used race directly. It used healthcare costs, which correlated with race due to systemic inequities. This is called "bias by proxy," and it's one of the sneakiest ways discrimination enters AI systems. You can build a discriminatory system without ever touching protected attributes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Fairness through unawareness (not using sensitive attributes) doesn't work. Other variables carry the same signal.&lt;/p&gt;
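
&lt;p&gt;Here is a toy illustration of bias by proxy. The numbers are invented for this sketch and are not data from the study; &lt;code&gt;select_top&lt;/code&gt; and &lt;code&gt;count_by_group&lt;/code&gt; are hypothetical helpers. Both groups have identical health needs, but group B has lower recorded costs, and the ranking never looks at the group label.&lt;/p&gt;

```python
# Synthetic patients: (group, true_need, annual_cost). Numbers are invented
# for illustration; group B has the same needs but lower recorded costs.
patients = [
    ("A", 8, 8000), ("A", 6, 6000), ("A", 4, 4000), ("A", 2, 2000),
    ("B", 8, 5000), ("B", 6, 3500), ("B", 4, 2500), ("B", 2, 1000),
]


def select_top(rows, key_index, k):
    # Rank by one column and keep the k highest rows.
    return sorted(rows, key=lambda row: row[key_index], reverse=True)[:k]


def count_by_group(rows):
    counts = {"A": 0, "B": 0}
    for group, _need, _cost in rows:
        counts[group] += 1
    return counts


# "Group-blind" program: enroll the 4 patients with the highest cost.
by_cost = count_by_group(select_top(patients, key_index=2, k=4))
# Same program, ranking by actual need instead of the cost proxy.
by_need = count_by_group(select_top(patients, key_index=1, k=4))

print("ranked by cost:", by_cost)
print("ranked by need:", by_need)
```

&lt;p&gt;Ranking by cost enrolls three patients from group A and one from group B. Ranking by need enrolls two from each. The group column was never read, yet the outcome still splits along group lines.&lt;/p&gt;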

&lt;h3&gt;
  
  
  Case Study 4: The 2025 AI Hiring-Bias Lawsuits
&lt;/h3&gt;

&lt;p&gt;The earlier cases were investigations and academic studies. By 2025, these failures became courtroom liability. In May 2025, a federal court in California granted preliminary certification of a nationwide age-discrimination collective action in &lt;a href="https://fortune.com/2025/07/05/workday-amazon-alleged-ai-employment-bias-hiring-discrimination/" rel="noopener noreferrer"&gt;Mobley v. Workday&lt;/a&gt;, where the plaintiff alleges that an AI applicant-screening platform rejected him from over 100 jobs (his broader suit also claims race and disability discrimination). The case has only escalated since: in March 2026 the court rejected Workday's argument that age-discrimination law doesn't cover job applicants, keeping the collective action alive. The legal theory is &lt;em&gt;disparate impact&lt;/em&gt;: even with no intent to discriminate, a screening tool that disproportionately filters out a protected group can be unlawful.&lt;/p&gt;

&lt;p&gt;Then in August 2025, &lt;a href="https://www.workforcebulletin.com/artificial-intelligence-bias-harper-v-sirius-xm-challenges-algorithmic-discrimination-in-hiring" rel="noopener noreferrer"&gt;Harper v. Sirius XM&lt;/a&gt; echoed the healthcare case's proxy problem in a hiring context: the complaint alleges the AI screener used educational background and zip codes as stand-ins that correlated with race. Same "bias by proxy" mechanism, brand-new legal exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Ethical failures aren't just reputational anymore. If your AI screens, ranks, or scores people, "we didn't mean to discriminate" is not a defense: disparate impact looks at outcomes, not intent.&lt;/p&gt;
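
&lt;p&gt;Because disparate impact is about outcomes, a reasonable first check is to compare selection rates across groups. The sketch below applies the "four-fifths" rule of thumb from US employment-selection guidelines as a screening heuristic. The counts and group labels are invented, the function names are hypothetical, and a ratio above or below 0.8 is a signal to investigate, not a legal determination.&lt;/p&gt;

```python
def selection_rate(selected, applicants):
    return selected / applicants


def impact_ratio(rates):
    # Ratio of the lowest group selection rate to the highest.
    return min(rates.values()) / max(rates.values())


# Invented screening outcomes: group -> (advanced, applied).
outcomes = {"under_40": (120, 400), "40_and_over": (45, 300)}

rates = {g: selection_rate(s, n) for g, (s, n) in outcomes.items()}
ratio = impact_ratio(rates)
flagged = 0.8 > ratio  # four-fifths rule of thumb

print(rates, round(ratio, 2), flagged)
```

&lt;p&gt;With these made-up counts the older group advances at half the rate of the younger one, so the check flags it. Running this against every release of a screening model is cheap compared with finding out in discovery.&lt;/p&gt;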

&lt;h3&gt;
  
  
  Case Study 5: The 2023-2024 Wave (When Shipped Products Failed)
&lt;/h3&gt;

&lt;p&gt;The cases above are famous partly because they're old enough to have a verdict. But this isn't ancient history. The same failure modes keep shipping in mainstream AI products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google's Gemini (February 2024).&lt;/strong&gt; Google &lt;a href="https://www.washingtonpost.com/technology/2024/02/22/google-gemini-ai-image-generation-pause/" rel="noopener noreferrer"&gt;paused&lt;/a&gt; Gemini's ability to generate images of people after it produced historically inaccurate results: racially diverse "founding fathers," a female pope, and Black and Asian Nazi soldiers. This one is the opposite of the Amazon case. Google had tuned the model to force diversity, countering the well-documented tendency of image models to default to white faces, and the correction overshot into rewriting history. CEO Sundar Pichai called the responses "completely unacceptable." The lesson is uncomfortable: mitigating bias is a judgment call, not a switch you flip. Over-correct, and you trade one failure for another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SafeRent (settled November 2024).&lt;/strong&gt; A tenant-screening AI gave each applicant a score that landlords used to accept or reject them. A class action alleged the score disproportionately downgraded Black and Hispanic applicants and housing-voucher holders, because it leaned on credit history and ignored the vouchers that make rent affordable. SafeRent settled for &lt;a href="https://www.cohenmilstein.com/rental-applicants-using-housing-vouchers-settle-ground-breaking-discrimination-class-action-against-saferent-solutions/" rel="noopener noreferrer"&gt;$2.3 million&lt;/a&gt; and agreed to stop showing the score for voucher applicants. This is the healthcare case's proxy problem, in production, with a price tag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Diffusion (analyzed 2023).&lt;/strong&gt; When Bloomberg &lt;a href="https://www.bloomberg.com/news/newsletters/2023-06-09/how-generative-ai-can-amplify-racial-gender-stereotypes-big-take" rel="noopener noreferrer"&gt;generated more than 5,000 images&lt;/a&gt; across occupations, the model amplified real-world bias rather than just mirroring it: higher-paying jobs skewed lighter-skinned and male, lower-paying jobs skewed darker-skinned, and "a person" defaulted to a light-skinned man. A University of Washington study presented at EMNLP 2023 found the same pattern. Generative AI, the tech most of us now use daily, makes the "AI scales the bias in its data" problem worse, not better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The throughline:&lt;/strong&gt; 2016 to 2024, different companies, different domains, identical root cause. Anyone who tells you AI bias is a solved problem is selling something.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why These Cases Matter
&lt;/h3&gt;

&lt;p&gt;Across a decade of mainstream AI systems, built by smart people with good intentions, the same root causes show up every time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Training data encoded historical inequities&lt;/strong&gt; (Amazon's hiring algorithm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairness definitions conflict&lt;/strong&gt; and choices weren't made explicit (COMPAS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy variables carried discriminatory signal&lt;/strong&gt; (healthcare algorithm, SafeRent, and the 2025 hiring lawsuits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcomes, not intentions, create liability&lt;/strong&gt; (Mobley, Harper, and the rising wave of disparate-impact litigation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-correcting backfires too&lt;/strong&gt; (Gemini's forced diversity rewrote history)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your response to these cases is "my team would never do that," you're not paying attention. The path from "reasonable business metric" to "systematically disadvantaging vulnerable populations" is shorter than most engineers realize.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Pillars: A Framework You Can Actually Use
&lt;/h2&gt;

&lt;p&gt;I organize AI ethics around four pillars. I call it FATP: Fairness, Accountability, Transparency, and Privacy. Use them as design constraints, not a post-launch checklist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 1: Fairness
&lt;/h3&gt;

&lt;p&gt;Fairness is about ensuring AI systems don't create or reinforce unfair bias against individuals or groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions to ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who could be harmed by this system's errors?&lt;/li&gt;
&lt;li&gt;Are outcomes equitable across different demographic groups?&lt;/li&gt;
&lt;li&gt;What fairness metric are we optimizing for, and why that one?&lt;/li&gt;
&lt;li&gt;Have we tested for bias using real, representative data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The impossible trade-off:&lt;/strong&gt; Different fairness definitions (demographic parity, equalized odds, predictive parity) cannot all be satisfied simultaneously when base rates differ between groups. This is mathematically proven. You have to choose.&lt;/p&gt;

&lt;p&gt;For a hiring algorithm, you might prioritize equal selection rates across groups (demographic parity). For a medical diagnosis system, you might prioritize equal false negative rates across groups (equal opportunity, the true-positive half of equalized odds) because missing a disease is the critical error. The right choice depends on what harms you're most trying to prevent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 2: Accountability
&lt;/h3&gt;

&lt;p&gt;Accountability establishes who is responsible when AI systems cause harm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions to ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns this AI system's outcomes?&lt;/li&gt;
&lt;li&gt;What happens when it makes a mistake?&lt;/li&gt;
&lt;li&gt;Can affected individuals appeal or challenge AI decisions?&lt;/li&gt;
&lt;li&gt;Is there human oversight for high-stakes decisions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The accountability gap:&lt;/strong&gt; Traditional accountability frameworks assumed human decision-makers. AI breaks that. When an autonomous system denies your loan, who do you hold accountable? The developer who wrote the algorithm? The company that deployed it? The data provider whose dataset contained bias? The accountability gap is what you're left with when nobody clearly owns the outcome.&lt;/p&gt;

&lt;p&gt;Practical accountability requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear role assignments (who can stop a deployment, who reviews outcomes)&lt;/li&gt;
&lt;li&gt;Audit trails (records of what decisions were made and why)&lt;/li&gt;
&lt;li&gt;Redress mechanisms (how harmed individuals can seek remedy)&lt;/li&gt;
&lt;li&gt;Meaningful human oversight, where reviewers can actually change or stop a decision&lt;/li&gt;
&lt;/ul&gt;
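&lt;p&gt;To make "audit trail" concrete, here is a minimal sketch of what a single logged decision could look like, assuming a JSON-lines log. The &lt;code&gt;DecisionRecord&lt;/code&gt; class, the &lt;code&gt;audit_record&lt;/code&gt; helper, and every field name are hypothetical, not a standard schema; adapt them to whatever your system actually records.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DecisionRecord:
    """One audit-trail entry; every field name here is illustrative."""
    decision_id: str
    model_version: str
    inputs_digest: str               # a hash of the inputs, not raw personal data
    outcome: str
    top_reasons: list
    reviewer: Optional[str] = None   # filled in when a human reviews the case
    overridden: bool = False         # True if the reviewer changed the outcome
    timestamp: str = ""


def audit_record(record: DecisionRecord) -> str:
    """Serialize a decision as one JSON line, stamping the time if missing."""
    if not record.timestamp:
        record.timestamp = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(record), sort_keys=True)


line = audit_record(DecisionRecord(
    decision_id="loan-0001",
    model_version="credit-risk-2.3",
    inputs_digest="sha256:placeholder",
    outcome="denied",
    top_reasons=["debt_to_income", "late_payments"],
))
print(line)
```

&lt;p&gt;Recording the model version and the reasons next to the outcome is what lets a reviewer, or an affected individual, reconstruct what was decided and why.&lt;/p&gt;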

&lt;h3&gt;
  
  
  Pillar 3: Transparency
&lt;/h3&gt;

&lt;p&gt;Transparency means making AI systems understandable to stakeholders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions to ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can affected individuals understand why a decision was made about them?&lt;/li&gt;
&lt;li&gt;Is the AI system's involvement disclosed?&lt;/li&gt;
&lt;li&gt;Are the limitations documented and communicated?&lt;/li&gt;
&lt;li&gt;Can the system be audited by external parties?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Levels of transparency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical explainability:&lt;/strong&gt; What features drove this prediction? (SHAP values, attention weights)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User interpretability:&lt;/strong&gt; Why did I get this result, in plain language?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational disclosure:&lt;/strong&gt; Are people told when AI is making decisions about them?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different stakeholders need different types of transparency. An ML engineer debugging a model needs technical explainability. An end user denied a loan needs human-understandable reasoning. A regulator needs audit access.&lt;/p&gt;
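&lt;p&gt;Here is a rough sketch of the gap between technical explainability and user interpretability. It assumes a simple linear scoring model, where each feature's contribution is just weight times value; the weights, feature names, and wording are made up for illustration. For a non-linear model you would swap in an attribution method such as SHAP for the &lt;code&gt;contributions&lt;/code&gt; step.&lt;/p&gt;

```python
# Hypothetical linear credit model: score = sum(weight * value).
WEIGHTS = {"debt_to_income": -2.0, "years_of_history": 0.5, "late_payments": -1.5}

# Plain-language phrasing for features that can lower the score (illustrative).
REASONS = {
    "debt_to_income": "your debt is high relative to your income",
    "late_payments": "you have recent late payments",
}


def contributions(applicant):
    """Technical view: how much each feature added to or removed from the score."""
    return {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}


def explain(applicant, top_n=2):
    """User view: the features that pulled the score down the most."""
    contrib = contributions(applicant)
    negatives = sorted(
        (item for item in contrib.items() if 0 > item[1]),
        key=lambda item: item[1],  # most negative contribution first
    )
    return [REASONS[name] for name, _ in negatives[:top_n]]


applicant = {"debt_to_income": 0.6, "years_of_history": 1, "late_payments": 2}
print(contributions(applicant))  # what the ML engineer looks at
print(explain(applicant))        # what the applicant is told
```

&lt;p&gt;The same numbers feed both views: the engineer sees signed contributions, the applicant sees the top reasons in plain language.&lt;/p&gt;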

&lt;h3&gt;
  
  
  Pillar 4: Privacy
&lt;/h3&gt;

&lt;p&gt;Privacy protects individuals from unauthorized collection, use, and exposure of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions to ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What data does this system collect, and is collection minimized?&lt;/li&gt;
&lt;li&gt;How long is data retained?&lt;/li&gt;
&lt;li&gt;Can individuals access, correct, or delete their data?&lt;/li&gt;
&lt;li&gt;Are there protections against re-identification from "anonymized" data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AI-specific privacy concerns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training data privacy:&lt;/strong&gt; Models can memorize and regurgitate training data, including personal information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference attacks:&lt;/strong&gt; Attackers can infer whether a specific person's record was in the training set (membership inference) or reconstruct training examples from model outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation risks:&lt;/strong&gt; Combining multiple non-sensitive attributes can reveal sensitive information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Privacy by design means building protections into the system from the start, not bolting them on later.&lt;/p&gt;
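&lt;p&gt;The aggregation risk is easy to underestimate, so here is a small sketch of one way to check for it: count how many records share each combination of attributes (a k-anonymity style check). The toy records and the &lt;code&gt;smallest_group&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from collections import Counter

# Toy "anonymized" records: no names, but three quasi-identifiers each.
records = [
    {"zip": "98101", "birth_year": 1985, "gender": "F"},
    {"zip": "98101", "birth_year": 1985, "gender": "F"},
    {"zip": "98101", "birth_year": 1990, "gender": "M"},
    {"zip": "98052", "birth_year": 1972, "gender": "M"},
    {"zip": "98052", "birth_year": 1972, "gender": "F"},
]


def smallest_group(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing one attribute combination."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(combos.values())


print(smallest_group(records, ["gender"]))                       # 2
print(smallest_group(records, ["zip"]))                          # 2
print(smallest_group(records, ["zip", "birth_year", "gender"]))  # 1
```

&lt;p&gt;Each attribute on its own hides every person in a group of at least two. Combine all three and k drops to 1, meaning at least one person is uniquely identifiable even though no single column looks sensitive.&lt;/p&gt;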




&lt;h2&gt;
  
  
  Intermediate: Technical Fairness Interventions
&lt;/h2&gt;

&lt;p&gt;Theory is nice, but let's get practical. How do you actually measure and mitigate bias in a real AI system?&lt;/p&gt;

&lt;h3&gt;
  
  
  Measuring Bias with Fairlearn
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://fairlearn.org/" rel="noopener noreferrer"&gt;Fairlearn&lt;/a&gt; is an open-source toolkit (now a community-governed project, originally from Microsoft) that helps you assess and improve fairness. The current release is &lt;a href="https://github.com/fairlearn/fairlearn/releases" rel="noopener noreferrer"&gt;0.14.0 (June 2026)&lt;/a&gt;, which brought scikit-learn 1.6 compatibility and made the &lt;code&gt;CorrelationRemover&lt;/code&gt; (handy for stripping proxy signal from features) fully scikit-learn-compatible. Here's a practical walkthrough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Define your fairness metric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you measure anything, decide what "fair" means for your use case. Fairlearn supports multiple metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demographic parity:&lt;/strong&gt; Selection rates are equal across groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equalized odds:&lt;/strong&gt; True positive and false positive rates are equal across groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive parity:&lt;/strong&gt; Precision is equal across groups
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Measuring demographic parity difference
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairlearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;demographic_parity_difference&lt;/span&gt;

&lt;span class="c1"&gt;# y_true: actual outcomes, y_pred: predicted outcomes
# sensitive_features: group membership (e.g., gender, race)
&lt;/span&gt;&lt;span class="n"&gt;dp_diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;demographic_parity_difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensitive_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensitive_features&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Demographic Parity Difference: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dp_diff&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Closer to 0 = more fair (by this definition)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
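&lt;p&gt;If the metric feels like a black box, it helps to compute it by hand once. The sketch below uses only the standard library and mirrors my understanding of Fairlearn's default behavior (largest group selection rate minus smallest); the helper names &lt;code&gt;selection_rates&lt;/code&gt; and &lt;code&gt;parity_gap&lt;/code&gt; are mine, not Fairlearn's.&lt;/p&gt;

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions (1s) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(y_pred, groups):
    """Largest selection rate minus smallest selection rate across groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(selection_rates(y_pred, groups))  # {'A': 0.75, 'B': 0.25}
print(parity_gap(y_pred, groups))       # 0.5
```

&lt;p&gt;Notice that the true labels never appear: demographic parity only looks at who gets selected, not at whether the selection was correct. That is exactly why it can disagree with error-based metrics like equalized odds.&lt;/p&gt;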



&lt;p&gt;&lt;strong&gt;Step 2: Visualize disparities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fairlearn's &lt;code&gt;MetricFrame&lt;/code&gt; breaks any metric down by group, so you can compare how your model performs for each one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairlearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MetricFrame&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;

&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;metric_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MetricFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensitive_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensitive_features&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# See performance broken down by group
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;by_group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prints accuracy, precision, and recall side by side for each group, so a gap that the aggregate score hides becomes obvious at a glance.&lt;/p&gt;
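&lt;p&gt;Accuracy alone can still hide a gap, because two groups can have the same accuracy with very different kinds of errors. The standard-library sketch below computes true and false positive rates per group and takes the larger of the two gaps, which is how I understand the usual equalized odds difference to be defined. The helper names are hypothetical, and the code assumes every group has at least one actual positive and one actual negative.&lt;/p&gt;

```python
def rates_by_group(y_true, y_pred, groups):
    """True positive rate and false positive rate for each group."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        pos = [p for t, p in rows if t == 1]  # predictions for actual positives
        neg = [p for t, p in rows if t == 0]  # predictions for actual negatives
        out[g] = {"tpr": sum(pos) / len(pos), "fpr": sum(neg) / len(neg)}
    return out


def equalized_odds_gap(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap between groups."""
    rates = rates_by_group(y_true, y_pred, groups)
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(rates_by_group(y_true, y_pred, groups))
# {'A': {'tpr': 1.0, 'fpr': 0.5}, 'B': {'tpr': 0.5, 'fpr': 0.0}}
print(equalized_odds_gap(y_true, y_pred, groups))  # 0.5
```

&lt;p&gt;In this toy data both groups have 75% accuracy, yet group A absorbs the false positives and group B absorbs the false negatives. Which of those errors is worse depends entirely on the use case.&lt;/p&gt;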

&lt;p&gt;&lt;strong&gt;Step 3: Apply mitigation techniques&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fairlearn offers algorithms to reduce unfairness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing:&lt;/strong&gt; Transform the features before training, for example with &lt;code&gt;CorrelationRemover&lt;/code&gt; to strip proxy signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduction approaches:&lt;/strong&gt; Constrain the model during training to satisfy fairness constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing (threshold optimization):&lt;/strong&gt; Adjust predictions after the model is trained by finding different decision thresholds for different groups
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairlearn.postprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThresholdOptimizer&lt;/span&gt;

&lt;span class="c1"&gt;# Optimize thresholds to achieve equalized odds
&lt;/span&gt;&lt;span class="n"&gt;postprocess_est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ThresholdOptimizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equalized_odds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prefit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;postprocess_est&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensitive_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensitive_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred_fair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;postprocess_est&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensitive_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensitive_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
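&lt;p&gt;To build intuition for what group-specific thresholds do, here is a deliberately simplified sketch. It is not Fairlearn's algorithm: &lt;code&gt;ThresholdOptimizer&lt;/code&gt; solves an optimization problem for the constraint you pass it, while this toy version just picks, per group, the cutoff that selects the same share of each group (a demographic parity target rather than equalized odds). The scores and helper names are hypothetical.&lt;/p&gt;

```python
def threshold_for_rate(scores, target_rate):
    """Pick the score cutoff that selects roughly target_rate of this group."""
    ranked = sorted(scores, reverse=True)
    n_select = max(1, round(target_rate * len(ranked)))
    return ranked[n_select - 1]


def group_thresholds(scores, groups, target_rate):
    """One cutoff per group, each chosen to hit the same selection rate."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    return {g: threshold_for_rate(s, target_rate) for g, s in by_group.items()}


scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

cutoffs = group_thresholds(scores, groups, target_rate=0.5)
decisions = [int(s >= cutoffs[g]) for s, g in zip(scores, groups)]
print(cutoffs)    # {'A': 0.8, 'B': 0.5}
print(decisions)  # [1, 1, 0, 0, 1, 1, 0, 0]
```

&lt;p&gt;A single global cutoff of 0.65 would have selected three people from group A and nobody from group B. The per-group cutoffs select two from each, at the price of turning down a 0.7 in group A while accepting a 0.5 in group B.&lt;/p&gt;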



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mitigation has costs&lt;/strong&gt;&lt;br&gt;
Fairness mitigation almost always reduces some other metric, often accuracy. This is not a bug; it's a fundamental trade-off. You're explicitly trading overall predictive performance for more equitable outcomes across groups. This is a values decision, not a technical one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Fairness-Accuracy Trade-off in Practice
&lt;/h3&gt;

&lt;p&gt;Here's what the trade-off looks like in practice. Say you have a loan approval model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Overall Accuracy&lt;/th&gt;
&lt;th&gt;Approval Gap (Group A vs B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;25 percentage points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After mitigation&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;8 percentage points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You've reduced the approval gap from 25 points to 8 points, but accuracy dropped 3 points. Is that worth it? The answer depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much harm does the disparity cause?&lt;/li&gt;
&lt;li&gt;What are the consequences of reduced accuracy?&lt;/li&gt;
&lt;li&gt;What do stakeholders value?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no formula to answer these questions. They're ethical choices that require human judgment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced: Societal Trade-offs and Formal Mitigation
&lt;/h2&gt;

&lt;p&gt;Here's where it gets genuinely uncomfortable: the FATP pillars sometimes conflict with each other, and the fairness definitions from the previous section are mathematically incompatible. No clever engineering eliminates the trade-off. You have to choose.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Impossibility Theorem
&lt;/h3&gt;

&lt;p&gt;In 2016, Kleinberg, Mullainathan, and Raghavan proved that three common fairness definitions (calibration, balance for the positive class, and balance for the negative class) cannot all be satisfied simultaneously unless the base rates are equal across groups or the classifier is perfect.&lt;/p&gt;

&lt;p&gt;This is known as the impossibility theorem. In any real-world scenario with different base rates, you will violate at least one reasonable-sounding fairness criterion. That's not a model flaw. It's a mathematical limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; In criminal recidivism prediction, if one group actually has a higher base rate of re-offending (due to systemic factors like poverty, lack of opportunity, etc.), then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Equalizing prediction rates (demographic parity) will over-predict risk for the lower-base-rate group&lt;/li&gt;
&lt;li&gt;Equalizing false positive rates while keeping a positive prediction equally reliable in both groups forces the false negative rates apart&lt;/li&gt;
&lt;li&gt;Equalizing calibration will result in different prediction rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You literally cannot have all three. Which do you choose?&lt;/p&gt;
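&lt;p&gt;You can watch the conflict happen with a few lines of arithmetic. The sketch below uses a related confusion-matrix identity (usually credited to Chouldechova's work on recidivism scores, not the theorem above): once you fix a group's base rate, precision (PPV), and false negative rate, its false positive rate is no longer free. The numbers are invented for illustration.&lt;/p&gt;

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a base rate, precision (PPV), and FNR.

    Follows from the confusion matrix:
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)


# Hold precision and false negative rate equal for both groups...
ppv, fnr = 0.7, 0.3

# ...and let only the base rate differ (illustrative numbers).
for group, base_rate in {"A": 0.5, "B": 0.3}.items():
    print(group, round(implied_fpr(base_rate, ppv, fnr), 3))
# A 0.3
# B 0.129
```

&lt;p&gt;Same precision, same miss rate, and group A's false positive rate still comes out more than twice group B's. The only ways to close that gap are to give up one of the other two equalities or to have equal base rates.&lt;/p&gt;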

&lt;h3&gt;
  
  
  A Framework for Trade-off Decisions
&lt;/h3&gt;

&lt;p&gt;When facing ethical trade-offs, I use this framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify the stakeholders&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who benefits from the AI system?&lt;/li&gt;
&lt;li&gt;Who bears the risks?&lt;/li&gt;
&lt;li&gt;Who has no voice in the decision?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Map the harms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens if the system is unfair by metric A?&lt;/li&gt;
&lt;li&gt;What happens if it's unfair by metric B?&lt;/li&gt;
&lt;li&gt;Which harms are reversible? Which are permanent?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Consider power asymmetries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the system affect vulnerable populations disproportionately?&lt;/li&gt;
&lt;li&gt;Do affected individuals have recourse?&lt;/li&gt;
&lt;li&gt;Who profits from the system vs. who bears the risk?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Make and document the choice&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which fairness criterion are you prioritizing and why?&lt;/li&gt;
&lt;li&gt;What are you explicitly trading off?&lt;/li&gt;
&lt;li&gt;How will you monitor for unintended consequences?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The 2026 Regulatory Reality
&lt;/h3&gt;

&lt;p&gt;Ethics used to be mostly voluntary. As of mid-2026, the FATP pillars increasingly map to legal obligations, so documenting your choices is also how you stay compliant.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EU AI Act:&lt;/strong&gt; The bans on prohibited practices (since February 2025) and the obligations for general-purpose AI model providers (since &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;2 August 2025&lt;/a&gt;) are live and unchanged. But the headline 2 August 2026 deadline for high-risk systems has moved: under the &lt;strong&gt;"Digital Omnibus"&lt;/strong&gt; that the Council and Parliament &lt;a href="https://www.consilium.europa.eu/en/press/press-releases/2026/05/07/artificial-intelligence-council-and-parliament-agree-to-simplify-and-streamline-rules/" rel="noopener noreferrer"&gt;provisionally agreed on 7 May 2026&lt;/a&gt; (formal adoption expected before August), Annex III high-risk obligations are deferred to &lt;strong&gt;2 December 2027&lt;/strong&gt; and AI embedded in regulated products to 2 August 2028. The deferral is a pragmatic admission that the supporting standards and infrastructure weren't ready, not a softening of the substance. Penalties for prohibited practices still reach up to EUR 35 million or 7% of global turnover, and the voluntary &lt;a href="https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai" rel="noopener noreferrer"&gt;General-Purpose AI Code of Practice&lt;/a&gt; (published July 2025, signed by Anthropic, Google, Microsoft, OpenAI, and others) still operationalizes the transparency and safety expectations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;US state laws are in flux.&lt;/strong&gt; Colorado's pioneering AI Act (SB 24-205) targeted "algorithmic discrimination" in high-risk systems, but it never took effect: its start date was pushed to 30 June 2026, then the whole framework was &lt;a href="https://www.dwt.com/blogs/privacy--security-law-blog/2026/05/colorado-ai-act-repeal-new-transparency-law" rel="noopener noreferrer"&gt;repealed and replaced by SB 26-189&lt;/a&gt; (signed 14 May 2026), a narrower transparency-and-consumer-rights law that takes effect 1 January 2027. The lesson for builders: regulation is moving fast and unevenly, so design to the principles, not to a single statute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voluntary standards as a baseline.&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/compliance/regulatory/offering-iso-42001" rel="noopener noreferrer"&gt;ISO/IEC 42001&lt;/a&gt; (the first certifiable AI management-system standard) and the &lt;a href="https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence" rel="noopener noreferrer"&gt;NIST Generative AI Profile (AI 600-1)&lt;/a&gt; have become the de facto frameworks organizations adopt to demonstrate due care, increasingly as a procurement requirement for vendors.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compliance follows ethics, not the other way around&lt;/strong&gt;&lt;br&gt;
If you've already worked through the FATP checklist (fairness testing, accountability ownership, transparency documentation, and privacy controls), you're most of the way to satisfying the EU AI Act, ISO/IEC 42001, and the NIST profile. Teams that bolt compliance on at the end pay for it the expensive way: retraining, scrapped work, legal exposure, and lost trust, the kind of bill the case studies above all ran up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Writing a Formal Mitigation Report
&lt;/h3&gt;

&lt;p&gt;For high-stakes AI systems, document your ethical analysis formally. Here's a template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Ethical Mitigation Report: [System Name]&lt;/span&gt;

&lt;span class="gu"&gt;## 1. System Description&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Purpose and intended use
&lt;span class="p"&gt;-&lt;/span&gt; Affected populations
&lt;span class="p"&gt;-&lt;/span&gt; Decision types (recommendations, predictions, automatic actions)

&lt;span class="gu"&gt;## 2. Fairness Analysis&lt;/span&gt;

&lt;span class="gu"&gt;### Metrics Assessed&lt;/span&gt;
| Metric | Definition | Result |
|--------|------------|--------|
| Demographic Parity | Equal selection rates | 0.15 gap |
| Equalized Odds | Equal TPR/FPR | 0.08 gap |

&lt;span class="gu"&gt;### Groups Analyzed&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Group A vs Group B]
&lt;span class="p"&gt;-&lt;/span&gt; [Other relevant comparisons]

&lt;span class="gu"&gt;## 3. Trade-off Analysis&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Fairness metric A] vs [Fairness metric B]: We prioritized [A] because [reasoning]
&lt;span class="p"&gt;-&lt;/span&gt; Accuracy impact: [X]% reduction in overall accuracy

&lt;span class="gu"&gt;## 4. Mitigation Applied&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Technique: [ThresholdOptimizer / Reductions / etc.]
&lt;span class="p"&gt;-&lt;/span&gt; Parameters: [settings]
&lt;span class="p"&gt;-&lt;/span&gt; Resulting metrics: [post-mitigation numbers]

&lt;span class="gu"&gt;## 5. Residual Risks&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Remaining disparities]
&lt;span class="p"&gt;-&lt;/span&gt; [Known limitations]

&lt;span class="gu"&gt;## 6. Monitoring Plan&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Metrics tracked post-deployment
&lt;span class="p"&gt;-&lt;/span&gt; Alerting thresholds
&lt;span class="p"&gt;-&lt;/span&gt; Review frequency

&lt;span class="gu"&gt;## 7. Accountability&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; System owner: [name]
&lt;span class="p"&gt;-&lt;/span&gt; Ethics review: [approver]
&lt;span class="p"&gt;-&lt;/span&gt; Redress contact: [process for affected individuals]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Make it a living document&lt;/strong&gt;&lt;br&gt;
This isn't a one-time report. Update it as the system evolves, as you gather production data, and as your understanding deepens. Ethical analysis is iterative, not waterfall.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The FATP Checklist
&lt;/h2&gt;

&lt;p&gt;Before deploying any AI system, work through this checklist:&lt;/p&gt;

&lt;h3&gt;
  
  
  Fairness
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Identified protected groups relevant to the use case&lt;/li&gt;
&lt;li&gt;[ ] Measured outcomes across demographic groups&lt;/li&gt;
&lt;li&gt;[ ] Chosen and documented primary fairness metric&lt;/li&gt;
&lt;li&gt;[ ] Applied mitigation if disparities exceeded threshold&lt;/li&gt;
&lt;li&gt;[ ] Tested for bias using data representative of production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Accountability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Assigned clear ownership for system outcomes&lt;/li&gt;
&lt;li&gt;[ ] Defined human oversight requirements for high-stakes decisions&lt;/li&gt;
&lt;li&gt;[ ] Established redress mechanism for affected individuals&lt;/li&gt;
&lt;li&gt;[ ] Created audit trail for decisions&lt;/li&gt;
&lt;li&gt;[ ] Documented escalation path for ethical concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transparency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Documented model limitations and known failure modes&lt;/li&gt;
&lt;li&gt;[ ] Provided explanation capability appropriate to stakeholders&lt;/li&gt;
&lt;li&gt;[ ] Disclosed AI involvement to affected parties&lt;/li&gt;
&lt;li&gt;[ ] Made system auditable by authorized parties&lt;/li&gt;
&lt;li&gt;[ ] Published model card or equivalent documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Privacy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Minimized data collection to what's necessary&lt;/li&gt;
&lt;li&gt;[ ] Implemented appropriate data retention policies&lt;/li&gt;
&lt;li&gt;[ ] Provided individual access and deletion rights&lt;/li&gt;
&lt;li&gt;[ ] Assessed re-identification risks&lt;/li&gt;
&lt;li&gt;[ ] Protected against training data extraction&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Honest Summary
&lt;/h2&gt;

&lt;p&gt;AI ethics comes down to building systems you can defend when things go wrong. Systems that don't compound existing inequities at scale. Being a good person helps, but good intentions don't survive contact with bad design choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating ethics as a design constraint, not a post-launch audit&lt;/li&gt;
&lt;li&gt;Using concrete tools like Fairlearn to measure and mitigate bias&lt;/li&gt;
&lt;li&gt;Documenting trade-offs explicitly rather than hiding them&lt;/li&gt;
&lt;li&gt;Building accountability structures before you need them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's hard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigating mathematically incompatible fairness definitions&lt;/li&gt;
&lt;li&gt;Convincing stakeholders that ethical constraints are worth the accuracy trade-off&lt;/li&gt;
&lt;li&gt;Detecting bias from proxy variables you didn't know were proxies&lt;/li&gt;
&lt;li&gt;Maintaining ethical vigilance as systems evolve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that get AI ethics right share one trait: they ask "who could this hurt?" before they ask "how fast can we ship?" That question is the design constraint. Catch the answer early and you're fixing a decision. Catch it in production and you're fixing a lawsuit.&lt;/p&gt;

&lt;p&gt;Next up in Module 8: AI Safety and Security, where we tackle the technical vulnerabilities that make ethical AI possible or impossible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;Key Question&lt;/th&gt;
&lt;th&gt;Primary Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fairness&lt;/td&gt;
&lt;td&gt;Are outcomes equitable across groups?&lt;/td&gt;
&lt;td&gt;Fairlearn, AIF360&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accountability&lt;/td&gt;
&lt;td&gt;Who's responsible when things go wrong?&lt;/td&gt;
&lt;td&gt;RACI matrix, audit trails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transparency&lt;/td&gt;
&lt;td&gt;Can stakeholders understand decisions?&lt;/td&gt;
&lt;td&gt;SHAP, model cards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Is data collection minimized and protected?&lt;/td&gt;
&lt;td&gt;Privacy by design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Our company doesn't have an AI ethics team. How do we get started?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don't need a dedicated team to start practicing AI ethics. Begin with the FATP checklist for your next AI project. Assign an "ethics owner" (could be the tech lead or PM) who ensures the checklist gets attention. Run fairness metrics on your existing systems to establish baselines. The goal isn't perfection from day one; it's building the muscle of asking ethical questions consistently. As you mature, you might invest in dedicated roles, but many organizations practice effective AI ethics with distributed responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Doesn't focusing on fairness reduce model accuracy? How do I justify that to stakeholders?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, fairness interventions often reduce overall accuracy. Frame it this way: What's the cost of the current unfairness? If your hiring algorithm systematically excludes qualified candidates from certain groups, you're leaving talent on the table. If your loan algorithm denies creditworthy applicants unfairly, you're losing good business. Calculate the cost of false negatives across groups, not just aggregate accuracy. Often, "reducing accuracy" means "reducing accuracy for the majority group while improving it for minority groups." The aggregate number goes down, but the system becomes more useful for more people.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I know which fairness metric to prioritize?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with the harms you're trying to prevent. If the main concern is equal access (loan approvals, hiring), demographic parity matters more. If the concern is equal treatment of actual positives (medical diagnosis), equalized odds, or its true-positive-only variant equal opportunity, matters more. If you need the predictions to mean the same thing across groups (risk scores), calibration matters. There's no universal answer; it depends on the domain, the stakes, and the values of your organization. The key is to make the choice explicitly and document your reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Read the Full Curriculum
&lt;/h2&gt;

&lt;p&gt;This piece is one post in my AI Fluency Curriculum, where I document what I'm learning about building and shipping AI responsibly. The full version on BlockSimplified includes an interactive quiz, linked Learning Blocks for the key terms, and a curated resource list. If ethics-as-design-constraint resonated, &lt;a href="https://blocksimplified.com/blog/framework-for-ai-ethics" rel="noopener noreferrer"&gt;read the full article and the rest of the series&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ethics</category>
      <category>aifluency</category>
      <category>appliedai</category>
    </item>
    <item>
      <title>Evaluating LLM Systems: Metrics, Methods, and Scorecards</title>
      <dc:creator>Vaibhav Doddihal</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:45:13 +0000</pubDate>
      <link>https://dev.to/vibbsdod/evaluating-llm-systems-metrics-methods-and-scorecards-3a9m</link>
      <guid>https://dev.to/vibbsdod/evaluating-llm-systems-metrics-methods-and-scorecards-3a9m</guid>
      <description>&lt;h1&gt;
  
  
  Evaluating LLM Systems: Metrics, Methods, and Scorecards
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blocksimplified.com/blog/evaluating-llm-systems-metrics-methods-scorecards" rel="noopener noreferrer"&gt;BlockSimplified&lt;/a&gt; — 11 min read&lt;/em&gt;&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;This post is part of the &lt;strong&gt;AI Fluency&lt;/strong&gt; series, where I document my learnings around applied AI concepts. The goal is to help you build practical skills you can apply in real projects.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is the hard truth about LLM development: most teams ship without proper evaluation. They run a few manual tests, the outputs "look good," and they call it done. Then users start complaining about weird responses, and suddenly nobody knows if the problem is the prompt, the model, or the retrieval pipeline.&lt;/p&gt;

&lt;p&gt;I have been there. Early in my LLM projects, I would tweak a prompt, eyeball a few outputs, and deploy. It felt productive. But when something broke in production, I had no baseline to compare against. Did the new prompt actually help? Was the model always this bad at edge cases? No idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation &amp;amp; Testing&lt;/strong&gt; is not just about catching bugs. It is your compass for improvement. Without systematic evaluation, you are navigating by feel in a space where intuition often fails.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Evaluation is Hard (And Why Most Teams Skip It)
&lt;/h2&gt;

&lt;p&gt;LLMs are not like traditional software. When you test a function that adds two numbers, the expected output is clear. With LLMs, the "correct" answer is subjective, context-dependent, and often impossible to define precisely.&lt;/p&gt;

&lt;p&gt;Consider this: you ask an LLM to summarize an article. There are dozens of valid summaries. Some focus on the main argument, others on supporting details. Some are formal, others conversational. How do you score that?&lt;/p&gt;

&lt;p&gt;This ambiguity leads teams to skip evaluation entirely. It feels like too much work for uncertain benefit. But skipping evaluation means you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flying blind when making prompt changes&lt;/li&gt;
&lt;li&gt;Unable to compare models objectively&lt;/li&gt;
&lt;li&gt;Missing regressions that hurt users&lt;/li&gt;
&lt;li&gt;Building on a foundation you cannot trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The good news: evaluation does not have to be perfect to be useful. Even rough metrics beat no metrics. Let me show you how to build a practical evaluation system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evaluation Pyramid: Three Levels of Rigor
&lt;/h2&gt;

&lt;p&gt;I think about LLM evaluation as a pyramid with three levels. Each level strikes a different balance between accuracy and scalability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: Human Evaluation (The Gold Standard)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Human Evaluation&lt;/strong&gt; is the most accurate but least scalable. Real people assess real outputs against criteria like helpfulness, accuracy, and tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validating that your automated metrics correlate with actual quality&lt;/li&gt;
&lt;li&gt;Evaluating subjective criteria like "does this sound natural?"&lt;/li&gt;
&lt;li&gt;High-stakes applications where errors are costly&lt;/li&gt;
&lt;li&gt;Creating the initial labels for your golden dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to do it well:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define clear criteria.&lt;/strong&gt; Vague instructions like "rate quality" lead to inconsistent scores. Instead, specify: "Rate helpfulness from 1-5, where 1 means the response does not address the question at all, and 5 means it fully answers with actionable details."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use multiple annotators.&lt;/strong&gt; At minimum, have 3 people rate each response. Calculate inter-rater agreement using Fleiss' kappa or Krippendorff's alpha (Cohen's kappa only handles two raters). If agreement is low (below 0.6), your guidelines need work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Include calibration examples.&lt;/strong&gt; Show annotators examples of responses at each score level before they start. This anchors their judgments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
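&lt;p&gt;With three or more annotators, one common agreement statistic is Fleiss' kappa. The sketch below is a plain-Python version, assuming every response was rated by the same number of annotators; the function name and input layout are my own choices for illustration.&lt;/p&gt;

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per annotator."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters in (0, 1) or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (2 or more) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # ordered pairs of annotators that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    # agreement expected by chance, from the overall label frequencies
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:  # only one label was ever used
        return 1.0
    return (observed - expected) / (1 - expected)
```

&lt;p&gt;Each inner list holds the scores one response received, for example &lt;code&gt;[4, 4, 5]&lt;/code&gt;. A value near 1 means annotators mostly agree; a value near 0 means agreement is about what chance would produce.&lt;/p&gt;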

&lt;p&gt;&lt;strong&gt;The practical reality:&lt;/strong&gt; Human evaluation is expensive. You cannot have humans review every response in production. That is why we need automated methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faix0ktsuig73sutd5zlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faix0ktsuig73sutd5zlt.png" alt="LLM evaluation methods hierarchy showing the three tiers of AI system assessment" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Level 2: LLM-as-a-Judge (Scalable Quality Assessment)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt; uses a capable model to evaluate outputs from your system. It is faster and cheaper than humans while being more nuanced than simple metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The basic pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are an expert evaluator. Rate the following response on a scale of 1-5
for HELPFULNESS.

Scoring rubric:
1 - Does not address the question
2 - Partially addresses but missing key information
3 - Addresses the question but could be clearer
4 - Good answer with minor room for improvement
5 - Excellent, comprehensive answer

User question: {question}
Response to evaluate: {response}

Provide your score and a brief justification.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
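&lt;p&gt;One way to turn a judge prompt like this into a number you can track is to ask for a fixed output line and parse it. This is a sketch under assumptions: &lt;code&gt;call_model&lt;/code&gt; is a hypothetical function you supply that sends a prompt to your judge model and returns its text, and the "Score: N" line is a convention I added to make parsing reliable.&lt;/p&gt;

```python
import re

JUDGE_TEMPLATE = """You are an expert evaluator. Rate the following response on a scale of 1-5
for HELPFULNESS.

Scoring rubric:
1 - Does not address the question
2 - Partially addresses but missing key information
3 - Addresses the question but could be clearer
4 - Good answer with minor room for improvement
5 - Excellent, comprehensive answer

User question: {question}
Response to evaluate: {response}

Reply with a line of the form "Score: N", then a brief justification.
"""


def parse_score(judge_output):
    """Return the 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def judge(question, response, call_model):
    """call_model is a hypothetical callable: prompt text in, judge reply out."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    return parse_score(call_model(prompt))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for unparseable replies, instead of guessing, lets you count judge failures separately from low scores.&lt;/p&gt;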



&lt;p&gt;&lt;strong&gt;Key considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a stronger model as judge.&lt;/strong&gt; If you are evaluating GPT-3.5 outputs, use GPT-4 as the judge. The judge should be at least as capable as the model being evaluated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validate against human labels.&lt;/strong&gt; Run your judge on a set where you have human scores. If correlation is below 0.7, refine your rubric.&lt;/p&gt;&lt;/li&gt;
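&lt;li&gt;&lt;p&gt;&lt;strong&gt;Watch for biases.&lt;/strong&gt; LLM judges tend to prefer longer responses (verbosity bias), may favor outputs that resemble their own writing (self-preference bias), and in side-by-side comparisons can be swayed by which answer appears first (position bias). Check for these patterns in your evaluations.&lt;/p&gt;&lt;/li&gt;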
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use reference-guided judging when possible.&lt;/strong&gt; Providing the judge with a reference answer improves consistency.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
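&lt;p&gt;For the validation step, a rank correlation is a reasonable choice because 1-5 scores are ordinal. Here is a minimal Spearman correlation in plain Python, with tied scores given their average rank. The article does not say which correlation to use, so treat Spearman as my assumption.&lt;/p&gt;

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # extend the run while the next value ties with the current one
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = shared_rank
        start = end + 1
    return ranks


def spearman(human_scores, judge_scores):
    """Spearman rank correlation; raises ZeroDivisionError if either list is constant."""
    rx, ry = average_ranks(human_scores), average_ranks(judge_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

&lt;p&gt;Pass the human scores and judge scores for the same responses, in the same order, and compare the result to the 0.7 bar mentioned above.&lt;/p&gt;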




&lt;h3&gt;
  
  
  Level 3: Automated Metrics (Fast but Limited)
&lt;/h3&gt;

&lt;p&gt;Automated &lt;strong&gt;Evaluation Metrics&lt;/strong&gt; are the fastest and cheapest option. They compute scores algorithmically without any LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional NLP metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Good for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BLEU&lt;/td&gt;
&lt;td&gt;N-gram overlap with reference&lt;/td&gt;
&lt;td&gt;Translation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROUGE&lt;/td&gt;
&lt;td&gt;Recall of reference n-grams&lt;/td&gt;
&lt;td&gt;Summarization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERTScore&lt;/td&gt;
&lt;td&gt;Semantic similarity via embeddings&lt;/td&gt;
&lt;td&gt;General text comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact Match&lt;/td&gt;
&lt;td&gt;String equality&lt;/td&gt;
&lt;td&gt;Factoid QA with single correct answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

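&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; These metrics measure similarity to a reference, not actual quality. BLEU, ROUGE, and Exact Match compare surface wording, so a response can be helpful and accurate yet score poorly because it uses different words than the reference. BERTScore is more forgiving of paraphrase, but it still rewards resemblance to the reference rather than correctness.&lt;/p&gt;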
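&lt;p&gt;To see how a word-overlap metric behaves on a paraphrase, here is a stripped-down sketch of two of the metrics from the table: exact match and a unigram recall in the spirit of ROUGE-1. It is a simplification for illustration (whitespace tokenization, no stemming), not the official ROUGE implementation.&lt;/p&gt;

```python
from collections import Counter


def exact_match(prediction, reference):
    """True when the two strings are equal after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def unigram_recall(prediction, reference):
    """Fraction of reference words that also appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # each reference word is matched at most as often as the prediction has it
    overlap = sum(min(count, pred_counts[tok]) for tok, count in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

&lt;p&gt;Scoring "reboot the router" against the reference "restart your modem" gives a recall of zero, even though a human might accept both as the same advice. That is the gap the higher levels of the pyramid are meant to cover.&lt;/p&gt;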

&lt;p&gt;&lt;strong&gt;When automated metrics work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks with clearly correct answers (math, coding with test cases)&lt;/li&gt;
&lt;li&gt;Detecting obvious failures (empty responses, errors)&lt;/li&gt;
&lt;li&gt;Tracking trends over time (even imperfect metrics show direction)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  RAG-Specific Evaluation: The Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;If you are building a &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; system, generic evaluation is not enough. You need metrics that assess both retrieval quality and generation quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Metrics
&lt;/h3&gt;

&lt;p&gt;Before the LLM even sees the context, did you retrieve the right documents?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Precision:&lt;/strong&gt; Of the documents retrieved, how many were actually relevant?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Recall:&lt;/strong&gt; Of all relevant documents in your corpus, how many did you retrieve?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall@K / Precision@K:&lt;/strong&gt; Versions of the above limited to the top K results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; If retrieval fails, even the best LLM cannot give good answers. Always evaluate retrieval independently.&lt;/p&gt;
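&lt;p&gt;Evaluating retrieval on its own can be as simple as comparing retrieved document IDs against a hand-labeled set of relevant IDs. A minimal sketch, assuming you have such labels for each test query (the function names and ID format are illustrative):&lt;/p&gt;

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved IDs that are relevant.

    Divides by the number of results actually returned, which is one common
    convention; some tools divide by k even when fewer results come back.
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)
```

&lt;p&gt;Here &lt;code&gt;retrieved&lt;/code&gt; is the ranked list your retriever returned and &lt;code&gt;relevant&lt;/code&gt; is a set of IDs a person marked as relevant for that query. Averaging both numbers across your test queries gives you a retrieval baseline that does not depend on the LLM at all.&lt;/p&gt;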

&lt;h3&gt;
  
  
  Generation Metrics (RAG-specific)
&lt;/h3&gt;

&lt;p&gt;These metrics assess the LLM's output given the retrieved context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faithfulness:&lt;/strong&gt; Does the response stick to what is in the context? This catches &lt;strong&gt;hallucinations&lt;/strong&gt; where the model makes up facts not supported by the retrieved documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer Relevance:&lt;/strong&gt; Does the response actually answer the user's question? A response could be faithful to the context but still miss what the user asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer Correctness:&lt;/strong&gt; Is the response factually correct? This compares against a ground truth answer if available.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAGAS Framework
&lt;/h3&gt;

&lt;p&gt;The RAGAS (Retrieval Augmented Generation Assessment) framework provides a structured approach to these metrics. It uses LLM-as-a-Judge internally to score each dimension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual example - actual RAGAS API may differ
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_precision&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_test_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Agent Evaluation: Beyond Single-Turn Responses
&lt;/h2&gt;

&lt;p&gt;Evaluating &lt;strong&gt;AI Agents&lt;/strong&gt; is a different challenge. Agents take multiple steps, use tools, and their success depends on achieving a goal, not just producing good text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Goal Completion Rate
&lt;/h3&gt;

&lt;p&gt;Did the agent accomplish what the user asked? For a travel planning agent, did it actually book the flight? For a research agent, did it find the information?&lt;/p&gt;

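&lt;p&gt;Each task is scored as a binary outcome (success/failure), and the completion rate is the fraction of tasks that succeed. It is a blunt metric but an incredibly important one: an agent that produces fluent text but fails to complete tasks is useless.&lt;/p&gt;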

&lt;h3&gt;
  
  
  Tool-Use Accuracy
&lt;/h3&gt;

&lt;p&gt;When the agent decides to use a tool, does it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the right tool for the situation?&lt;/li&gt;
&lt;li&gt;Provide correct parameters?&lt;/li&gt;
&lt;li&gt;Use tool results appropriately?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track each of these separately. You might find your agent is good at choosing tools but bad at formatting parameters.&lt;/p&gt;
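&lt;p&gt;Tracking the three questions separately can be sketched as follows, assuming each tool call in your logged traces has been labeled (by a person or a judge) with three booleans. The key names are hypothetical ones I picked for the example.&lt;/p&gt;

```python
DIMENSIONS = ("right_tool", "right_params", "used_result")


def tool_use_accuracy(steps):
    """Accuracy per dimension over a list of labeled tool-call steps.

    steps: list of dicts, each with a boolean for every name in DIMENSIONS.
    Returns None per dimension when there are no steps to score.
    """
    if not steps:
        return {name: None for name in DIMENSIONS}
    return {
        name: sum(1 for step in steps if step[name]) / len(steps)
        for name in DIMENSIONS
    }
```

&lt;p&gt;A result like &lt;code&gt;right_tool&lt;/code&gt; at 1.0 with &lt;code&gt;right_params&lt;/code&gt; at 0.5 tells you to work on the tool schemas or parameter descriptions rather than on tool selection.&lt;/p&gt;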

&lt;h3&gt;
  
  
  Trajectory Analysis
&lt;/h3&gt;

&lt;p&gt;For multi-step tasks, examine the full trajectory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many steps did it take? (Efficiency)&lt;/li&gt;
&lt;li&gt;Did it recover from errors? (Robustness)&lt;/li&gt;
&lt;li&gt;Did it take unnecessary detours? (Planning quality)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Safety Violation Rate
&lt;/h3&gt;

&lt;p&gt;Especially important for agents with real-world actions. Did the agent ever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attempt unauthorized actions?&lt;/li&gt;
&lt;li&gt;Leak sensitive information?&lt;/li&gt;
&lt;li&gt;Violate explicit constraints?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even a 0.1% violation rate is too high for production agents with meaningful capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building Your Evaluation Scorecard
&lt;/h2&gt;

&lt;p&gt;A scorecard brings all your metrics together in one view. It tells you at a glance whether your system is healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Include
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Core metrics&lt;/strong&gt; (track always):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall quality score (LLM-as-Judge, 1-5)&lt;/li&gt;
&lt;li&gt;Faithfulness (for RAG)&lt;/li&gt;
&lt;li&gt;Goal completion rate (for agents)&lt;/li&gt;
&lt;li&gt;Safety violation rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic metrics&lt;/strong&gt; (dig in when core metrics drop):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context precision/recall (RAG retrieval health)&lt;/li&gt;
&lt;li&gt;Tool-use accuracy (agent capability)&lt;/li&gt;
&lt;li&gt;Latency and token usage (operational health)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Golden Dataset
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Golden Dataset&lt;/strong&gt; is your foundation for reliable evaluation. It is a curated set of inputs with verified expected outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to build one:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with real queries.&lt;/strong&gt; Pull from production logs (anonymized). These represent actual user needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Include edge cases.&lt;/strong&gt; Add queries that have caused failures. These are your regression tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Get expert verification.&lt;/strong&gt; Have domain experts validate or write reference answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep it manageable.&lt;/strong&gt; 50-100 high-quality examples beat 1,000 sloppy ones. Quality over quantity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;How to use it:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run your golden dataset after any change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New prompt version? Run the golden dataset.&lt;/li&gt;
&lt;li&gt;Model upgrade? Run the golden dataset.&lt;/li&gt;
&lt;li&gt;Retrieval pipeline change? Run the golden dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare scores to your baseline. If quality drops, investigate before deploying.&lt;/p&gt;
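&lt;p&gt;The baseline comparison can be a few lines of code. This sketch assumes every metric is "higher is better" and that a small tolerance absorbs run-to-run noise; the tolerance value and metric names are placeholders, not recommendations.&lt;/p&gt;

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped by more than tolerance, or went missing.

    baseline, current: dicts mapping metric name to score (higher is better).
    Result maps each regressed metric to a (baseline_value, current_value) pair.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions
```

&lt;p&gt;An empty result means the change is safe to ship by this check; anything else names the metric to investigate first. For metrics where lower is better, such as latency, you would flip the comparison.&lt;/p&gt;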

&lt;h3&gt;
  
  
  Automated Regression Testing
&lt;/h3&gt;

&lt;p&gt;Integrate golden dataset evaluation into your CI/CD pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual GitHub Actions workflow&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Evaluation&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;evaluate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run golden dataset evaluation&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python evaluate.py --dataset golden_set.json&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check quality thresholds&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;if ! jq -e '.faithfulness &amp;gt;= 0.85' results.json &amp;gt; /dev/null; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "Faithfulness dropped below threshold"&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Practical Implementation: Where to Start
&lt;/h2&gt;

&lt;p&gt;If you are just getting started with LLM evaluation, here is my recommended sequence:&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 1: Create Your Golden Dataset
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pull 30-50 representative queries from production or brainstorming&lt;/li&gt;
&lt;li&gt;Write reference answers for each&lt;/li&gt;
&lt;li&gt;Include 10+ edge cases or known failure scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 2: Set Up LLM-as-a-Judge
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a judge prompt for your primary quality criterion&lt;/li&gt;
&lt;li&gt;Run it on your golden dataset&lt;/li&gt;
&lt;li&gt;Manually review judge outputs to check reasonableness&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 3: Validate and Iterate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Have humans rate a subset of the same responses&lt;/li&gt;
&lt;li&gt;Compare human scores to judge scores&lt;/li&gt;
&lt;li&gt;Refine your judge prompt until correlation is decent (aim for 0.7+)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 4: Automate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Integrate evaluation into your deployment process&lt;/li&gt;
&lt;li&gt;Set quality thresholds that block bad deploys&lt;/li&gt;
&lt;li&gt;Create a dashboard to track metrics over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ongoing: Expand and Maintain
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add new examples to golden dataset as you find failures&lt;/li&gt;
&lt;li&gt;Add metrics for new dimensions (safety, latency, etc.)&lt;/li&gt;
&lt;li&gt;Review and update quarterly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Optimizing for the metric, not the goal.&lt;/strong&gt; Your metric is a proxy for quality, not quality itself. If you tune prompts to maximize your judge scores, you might overfit to the judge's preferences rather than actual user needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too few examples in golden dataset.&lt;/strong&gt; You need coverage of your use cases. Fifty examples is a minimum; one hundred is better. But focus on quality and diversity, not raw quantity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not validating your judge.&lt;/strong&gt; An LLM judge can have systematic biases. Always check correlation with human judgment before trusting it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluating in isolation.&lt;/strong&gt; A component might score well individually but fail in the full pipeline. Test end-to-end, not just pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static evaluation sets.&lt;/strong&gt; Your application evolves. Your evaluation set should too. Review and update regularly.&lt;/p&gt;








&lt;h2&gt;
  
  
  Key Concepts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Evaluation &amp;amp; Testing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human Evaluation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Golden Dataset&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agents&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hallucinations&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Continue Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enjoyed this article?&lt;/strong&gt; Put your knowledge to the test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://blocksimplified.com/blog/evaluating-llm-systems-metrics-methods-scorecards" rel="noopener noreferrer"&gt;Take the interactive quiz on BlockSimplified&lt;/a&gt;&lt;/strong&gt; to see how much you retained&lt;/li&gt;
&lt;li&gt;Explore 16 linked Learning Blocks and curated resources for deeper understanding&lt;/li&gt;
&lt;li&gt;Follow for more insights on AI, development, and tech&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>evaluation</category>
      <category>llmevaluation</category>
    </item>
    <item>
      <title>What is Generative AI? A Practical Introduction</title>
      <dc:creator>Vaibhav Doddihal</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:44:35 +0000</pubDate>
      <link>https://dev.to/vibbsdod/what-is-generative-ai-a-practical-introduction-573j</link>
      <guid>https://dev.to/vibbsdod/what-is-generative-ai-a-practical-introduction-573j</guid>
      <description>&lt;h1&gt;
  
  
  What is Generative AI? A Practical Introduction
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blocksimplified.com/blog/what-is-generative-ai-practical-introduction" rel="noopener noreferrer"&gt;BlockSimplified&lt;/a&gt; — 13 min read&lt;/em&gt;&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;Welcome to the &lt;strong&gt;AI Fluency Curriculum&lt;/strong&gt;, a series I'm building to help engineers and technical folks get genuinely comfortable with applied AI. Not the hype. The actual mechanics.&lt;br&gt;
This is the first post in &lt;strong&gt;Module 1: Foundations of Generative AI&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I remember the first time I got ChatGPT to write a bash script for me. I'd described what I needed in plain English, and out came working code. My first reaction: "How does it &lt;em&gt;know&lt;/em&gt; this?" My second reaction: "Wait, that variable name is wrong." That tension between impressive capability and subtle wrongness is what we're going to unpack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;p&gt;By the end of this post, you'll be able to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Explain GenAI&lt;/strong&gt; in simple terms (to your manager, your team, your confused relatives)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differentiate it&lt;/strong&gt; from traditional software and predictive AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify real capabilities and limitations&lt;/strong&gt;, not just the marketing version&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll cover three depth levels: Beginner, Intermediate, and Advanced. Skip around based on what you need.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beginner: GenAI as "Autocomplete on Steroids"
&lt;/h2&gt;

&lt;p&gt;Let me start with an analogy that helped me get it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Restaurant Analogy
&lt;/h3&gt;

&lt;p&gt;Imagine a restaurant kitchen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional Software&lt;/strong&gt; is like a recipe book. You give it inputs (ingredients), it follows exact steps, you get a predictable output. Same input = same dish. Every. Single. Time. A calculator works this way. Your banking app works this way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive AI&lt;/strong&gt; (the old kind) is like a sommelier who looks at your order and predicts: "Based on customers who ordered the lamb, you'll probably want the Malbec." It classifies, predicts, and recommends, but it doesn't &lt;em&gt;create&lt;/em&gt; anything new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generative AI&lt;/strong&gt; is like a chef who's eaten at thousands of restaurants, read millions of recipes, and watched countless cooking shows. Give them a prompt ("I want something spicy, Italian-inspired, but with Thai flavors") and they'll &lt;em&gt;generate&lt;/em&gt; something entirely new. Sometimes brilliant. Sometimes... experimental.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj8tunyu7fkqgb83z3ty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj8tunyu7fkqgb83z3ty.png" alt="split triptych showing three kitchen scenes: (1) A robotic arm following a printed recipe exactly, (2) A sommelier AI analyzing data charts to suggest" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key insight: GenAI doesn't look up answers. It &lt;em&gt;generates&lt;/em&gt; them by predicting what tokens (words, code, pixels) should come next, based on patterns learned from massive training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your First API Call
&lt;/h3&gt;

&lt;p&gt;Let's stop talking and actually run something. Here's a minimal Python example using OpenAI's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# genai_hello.py
# Your first Generative AI API call
# Requires: pip install openai
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Uses OPENAI_API_KEY env variable
&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain Generative AI in one sentence, like I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m a software engineer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You send a prompt (the "user" message)&lt;/li&gt;
&lt;li&gt;The model processes it through billions of parameters&lt;/li&gt;
&lt;li&gt;It generates a response, token by token&lt;/li&gt;
&lt;li&gt;You get back text that &lt;em&gt;didn't exist&lt;/em&gt; before your request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Run this a few times. Notice how the response varies slightly each time? That's not a bug. It's the core mechanic.&lt;/p&gt;

&lt;h3&gt;
  
  
  What GenAI Can (and Can't) Do
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Genuine capabilities I use daily:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drafting documentation, emails, technical specs&lt;/li&gt;
&lt;li&gt;Explaining unfamiliar code or concepts&lt;/li&gt;
&lt;li&gt;Generating boilerplate code (with review!)&lt;/li&gt;
&lt;li&gt;Brainstorming approaches to problems&lt;/li&gt;
&lt;li&gt;Summarizing long documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real limitations that have bitten me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: confidently wrong answers that sound perfect&lt;/li&gt;
&lt;li&gt;No actual reasoning: it's pattern matching, not thinking&lt;/li&gt;
&lt;li&gt;Knowledge cutoffs: models don't know recent events&lt;/li&gt;
&lt;li&gt;Inconsistency: same prompt can yield different quality outputs&lt;/li&gt;
&lt;li&gt;Context limits: can't read your entire codebase (yet)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Intermediate: The Paradigm Shift from Deterministic to Probabilistic
&lt;/h2&gt;

&lt;p&gt;Here's where things get interesting. If you've been writing software for a while, you've internalized a core assumption: &lt;strong&gt;same input → same output&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;GenAI breaks that contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It's Probabilistic
&lt;/h3&gt;

&lt;p&gt;Under the hood, &lt;strong&gt;LLMs&lt;/strong&gt; work by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizing&lt;/strong&gt; your input (breaking text into chunks)&lt;/li&gt;
&lt;li&gt;Processing tokens through neural network layers&lt;/li&gt;
&lt;li&gt;At each step, calculating a probability distribution over every token in the vocabulary&lt;/li&gt;
&lt;li&gt;Sampling from that distribution to pick the actual next token&lt;/li&gt;
&lt;li&gt;Repeating until done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output isn't retrieved from a database. It's &lt;em&gt;constructed&lt;/em&gt; on the fly, one token at a time.&lt;/p&gt;
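
&lt;p&gt;To make the loop above concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-written table of next-token probabilities (the &lt;code&gt;NEXT_TOKEN_PROBS&lt;/code&gt; dictionary and the &lt;code&gt;generate&lt;/code&gt; helper are made up for illustration). A real LLM computes these distributions with a neural network over a vocabulary of many thousands of tokens, but the sample-and-repeat shape is the same.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

# Hypothetical toy "model": for each token, a distribution over next tokens.
NEXT_TOKEN_PROBS = {
    "START": {"Python": 0.6, "Rust": 0.4},
    "Python": {"is": 1.0},
    "Rust": {"is": 1.0},
    "is": {"fast": 0.5, "readable": 0.3, "popular": 0.2},
    "fast": {"END": 1.0},
    "readable": {"END": 1.0},
    "popular": {"END": 1.0},
}

def generate(seed=None, max_tokens=10):
    """Sample one token at a time until END or the length limit."""
    rng = random.Random(seed)
    token, output = "START", []
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[token]
        # Sample from the distribution instead of always taking the top choice.
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "END":
            break
        output.append(token)
    return " ".join(output)

# Unseeded runs can differ from each other, just like repeated API calls.
for run in range(3):
    print(generate())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing a fixed &lt;code&gt;seed&lt;/code&gt; makes a run repeatable, which is handy when you want to compare two versions of the table.&lt;/p&gt;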

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnrco883y8441snslbh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnrco883y8441snslbh6.png" alt="visualization showing an LLM generating text token by token, with probability bars appearing above each word choice, showing the top 5 candidates with" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Temperature Parameter: Your Control Dial
&lt;/h3&gt;

&lt;p&gt;Here's the single most important parameter you should understand: &lt;strong&gt;temperature&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# temperature_demo.py
# See how temperature affects output variability
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a one-sentence description of what Python is.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Temperature: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Run 3 times to see variance
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you'll observe:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0:&lt;/strong&gt; Nearly identical outputs every run. The model picks the highest-probability token at each step (greedy decoding), though hosted APIs can still show small run-to-run differences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0.5:&lt;/strong&gt; Slight variations, still coherent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 1.0:&lt;/strong&gt; More creative, occasional surprises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 1.5:&lt;/strong&gt; Wild variations, sometimes off the rails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of temperature like the spice level at a restaurant. Zero is the safe, house recipe every time. Higher values let the chef improvise, sometimes inspired, sometimes questionable.&lt;/p&gt;
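
&lt;p&gt;For a rough feel of why the dial behaves this way, here is a sketch of temperature-scaled softmax in plain Python. The scores in &lt;code&gt;logits&lt;/code&gt; are made up, and treating temperature 0 as "put everything on the top score" is an assumption about how greedy decoding is usually described, not a statement about how any particular API implements it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature == 0:
        # Assumption: treat 0 as greedy decoding, all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for temp in [0, 0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 2) for p in probs])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As the temperature rises, the top candidate's share shrinks and the long shots get a bigger slice, which is where the "chef improvising" behavior comes from.&lt;/p&gt;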

&lt;h3&gt;
  
  
  When to Trust the Output
&lt;/h3&gt;

&lt;p&gt;This probabilistic nature means you can't treat GenAI outputs like database queries. Here's my mental model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Trust Level&lt;/th&gt;
&lt;th&gt;Verification Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Brainstorming&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drafting&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Human review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Tests + code review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Factual claims&lt;/td&gt;
&lt;td&gt;Very Low&lt;/td&gt;
&lt;td&gt;Always verify sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Critical decisions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Don't delegate these&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The chef analogy again: You'd happily let them experiment with appetizer specials, but you'd want to taste-test before serving to customers, and you'd never let them guess at food allergy information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced: Transformers and Emergent Abilities
&lt;/h2&gt;

&lt;p&gt;Alright, let's pop the hood. If you're comfortable with software architecture, this section explains &lt;em&gt;how&lt;/em&gt; these systems actually work.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Transformer Architecture (The Short Version)
&lt;/h3&gt;

&lt;p&gt;Before 2017, sequence models like &lt;strong&gt;RNNs&lt;/strong&gt; processed text token-by-token, like reading a book one word at a time while trying to remember everything. Slow, and information from early in the sequence got fuzzy.&lt;/p&gt;

&lt;p&gt;The transformer architecture (from the "Attention Is All You Need" paper) introduced a radical idea: &lt;strong&gt;process all tokens in parallel&lt;/strong&gt; using something called "attention."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention in plain terms:&lt;/strong&gt; Instead of reading sequentially, the model can directly look at relationships between ANY two tokens in the input. When processing "The cat sat on the mat because &lt;strong&gt;it&lt;/strong&gt; was tired," attention lets the model directly connect "it" to "cat" rather than hoping that connection survives through sequential processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7yf6577auo1z0cpad11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7yf6577auo1z0cpad11.png" alt="diagram showing the attention mechanism: a sentence with arrows connecting the word " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
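
&lt;p&gt;If you prefer code to diagrams, here is a minimal sketch of scaled dot-product attention in plain Python. The 2-d vectors are made up, and the same vectors are reused as queries, keys, and values to keep it short. Real transformers learn separate projections for each and run many attention heads in parallel.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors for three tokens; "it" is built to resemble "cat".
tokens = ["cat", "mat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
_, weights = attention(vectors, vectors, vectors)
for name, w in zip(tokens, weights[2]):
    print(f"'it' attends to '{name}' with weight {w:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every token is scored against every other token in one pass, with no dependence on how far apart they sit in the sentence.&lt;/p&gt;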

&lt;p&gt;&lt;strong&gt;Why this matters for you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallel processing → trainable on massive datasets&lt;/li&gt;
&lt;li&gt;Attention patterns → models can handle long-range dependencies&lt;/li&gt;
&lt;li&gt;Stacking transformer layers → each layer learns more abstract patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The models you're using (GPT-4, Claude, Gemini) are just &lt;em&gt;really big&lt;/em&gt; stacks of transformer blocks, trained on &lt;em&gt;really big&lt;/em&gt; datasets, with clever fine-tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Emergent Abilities: The Weird Part
&lt;/h3&gt;

&lt;p&gt;Here's something that still surprises me: abilities that "emerge" at scale without being explicitly trained.&lt;/p&gt;

&lt;p&gt;When you train small models, they get gradually better at their training task. But at certain scale thresholds, capabilities appear that weren't in the training objective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chain-of-thought reasoning&lt;/li&gt;
&lt;li&gt;Following complex multi-step instructions&lt;/li&gt;
&lt;li&gt;In-context learning (learning from examples in the prompt)&lt;/li&gt;
&lt;li&gt;Code debugging and generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no dedicated "how to debug Python code" objective in next-token pretraining. The ability emerges from training on enough text that contains code discussions, Stack Overflow answers, and technical documentation.&lt;/p&gt;

&lt;p&gt;This is both exciting and concerning. Exciting because we get useful capabilities "for free." Concerning because we don't fully understand &lt;em&gt;when&lt;/em&gt; or &lt;em&gt;why&lt;/em&gt; they emerge, or when they might fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stress Test: Long-Context Degradation
&lt;/h3&gt;

&lt;p&gt;Let's run an experiment that exposes real limitations. Models advertise large &lt;strong&gt;context windows&lt;/strong&gt; (100K+ tokens), but performance isn't uniform across that window.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# long_context_stress_test.py
# Test the "Lost in the Middle" phenomenon
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_retrieval_position&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needle_position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Hide a fact in different positions within a long context
    and test if the model can retrieve it.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The "needle" - a specific fact to retrieve
&lt;/span&gt;    &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The secret project code name is AURORA-7.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# "Haystack" - filler paragraphs about various topics
&lt;/span&gt;    &lt;span class="n"&gt;filler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Cloud computing has transformed how organizations deploy applications. 
    The shift from on-premise servers to managed cloud services has enabled 
    rapid scaling and reduced operational overhead. Major providers include 
    AWS, Azure, and Google Cloud Platform, each with distinct strengths.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="c1"&gt;# Repeat to create bulk
&lt;/span&gt;
    &lt;span class="c1"&gt;# Construct the context based on position
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;needle_position&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;filler&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;needle_position&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;middle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;half&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filler&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;filler&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# end
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filler&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Here is a document:

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: What is the secret project code name?
Answer with just the code name, nothing else.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Test all positions
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;middle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_retrieval_position&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Needle at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Retrieved &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



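&lt;p&gt;&lt;strong&gt;What you'll likely see:&lt;/strong&gt; With a haystack this small (20 short paragraphs, on the order of a thousand tokens), the model will probably retrieve the needle at every position. Raise the repeat count until the context runs to tens of thousands of tokens and the pattern tends to show up: models do better when the key information is at the start or end of the context, and worse when it's buried in the middle. This is the &lt;strong&gt;Lost in the Middle&lt;/strong&gt; phenomenon, and it has real implications for how you structure prompts and &lt;strong&gt;RAG&lt;/strong&gt; systems.&lt;/p&gt;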
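
&lt;p&gt;If you want to scale the experiment up, a small helper that builds the context from a list of paragraphs makes the haystack size and needle position easy to vary. This is a sketch only: &lt;code&gt;build_context&lt;/code&gt; and the numbered filler paragraphs are hypothetical, and the count of 2000 is an arbitrary choice that you should adjust to your model's context window and your budget.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_context(paragraphs, needle, position):
    """Insert the needle between whole paragraphs at start, middle, or end."""
    if position == "start":
        index = 0
    elif position == "middle":
        index = len(paragraphs) // 2
    elif position == "end":
        index = len(paragraphs)
    else:
        raise ValueError(f"unknown position: {position}")
    parts = paragraphs[:index] + [needle] + paragraphs[index:]
    return "\n\n".join(parts)

# Hypothetical haystack: numbered filler so the paragraphs are not identical.
paragraphs = [f"Filler paragraph {n} about cloud computing." for n in range(2000)]
needle = "The secret project code name is AURORA-7."

for pos in ["start", "middle", "end"]:
    context = build_context(paragraphs, needle, pos)
    offset = context.index(needle) / len(context)
    print(f"{pos}: needle starts {offset:.0%} of the way through {len(context)} characters")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned string can be dropped into the same prompt template as before. Running each position several times and counting correct answers gives a steadier picture than a single call.&lt;/p&gt;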




&lt;h2&gt;
  
  
  The Honest Summary
&lt;/h2&gt;

&lt;p&gt;Generative AI is a genuinely transformative technology, and it's also genuinely overhyped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's real:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These models can generate useful text, code, and creative content&lt;/li&gt;
&lt;li&gt;They can adapt to new tasks via prompting without retraining&lt;/li&gt;
&lt;li&gt;They're getting better fast: what fails today might work next quarter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's marketing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"It understands": No, it predicts based on patterns&lt;/li&gt;
&lt;li&gt;"It reasons": No, it mimics reasoning patterns from training data&lt;/li&gt;
&lt;li&gt;"It will replace X": It changes how X is done, rarely replaces it entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineers who thrive with GenAI are the ones who understand &lt;em&gt;both&lt;/em&gt;: who leverage the real capabilities while building guardrails around the limitations.&lt;/p&gt;

&lt;p&gt;Next up in this series: &lt;strong&gt;Prompt Engineering&lt;/strong&gt; foundations, covering how to actually communicate effectively with these systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GenAI&lt;/td&gt;
&lt;td&gt;AI that creates new content by predicting what comes next&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temperature&lt;/td&gt;
&lt;td&gt;Controls randomness (0 = near-deterministic, higher = more random)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token&lt;/td&gt;
&lt;td&gt;Basic unit of text processing (~4 chars in English)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformer&lt;/td&gt;
&lt;td&gt;Architecture that processes all tokens in parallel via attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emergence&lt;/td&gt;
&lt;td&gt;Capabilities that appear at scale without explicit training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination&lt;/td&gt;
&lt;td&gt;Confident generation of plausible but false information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Is GenAI just a more sophisticated search engine?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, and this confusion causes a lot of problems. Search engines retrieve existing information. GenAI &lt;em&gt;generates&lt;/em&gt; new text that may or may not reflect real information. When you ask ChatGPT a question without a search or browsing tool enabled, it's not looking anything up. It's constructing an answer based on patterns. That's why it can confidently state things that don't exist. Treat it like a creative collaborator who's well-read but occasionally makes stuff up, not like a factual reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Should I be worried about my job as a developer?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've been using GenAI heavily for about a year now. My honest take: it changes what I spend time on, not whether I'm needed. I write less boilerplate, but I spend more time on architecture, review, and verification. The developers who struggle are those who either refuse to use these tools OR blindly trust their output. The sweet spot is treating GenAI like a very fast junior developer who needs supervision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I know which model to use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with the cheapest one that works for your task. For most things, smaller models like GPT-4o-mini or Claude Haiku are fine. Graduate to larger models (GPT-4, Claude Opus) when you hit quality limits. I use &lt;strong&gt;Haiku&lt;/strong&gt; for simple tasks, &lt;strong&gt;Sonnet&lt;/strong&gt; for most coding, and &lt;strong&gt;Opus&lt;/strong&gt; for complex reasoning. Your token bill will thank you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continue Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enjoyed this article?&lt;/strong&gt; Put your knowledge to the test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://blocksimplified.com/blog/what-is-generative-ai-practical-introduction" rel="noopener noreferrer"&gt;Take the interactive quiz on BlockSimplified&lt;/a&gt;&lt;/strong&gt; to see how much you retained&lt;/li&gt;
&lt;li&gt;Explore 11 linked Learning Blocks, curated resources for deeper understanding&lt;/li&gt;
&lt;li&gt;Follow for more insights on AI, development, and tech&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>generativeai</category>
      <category>llm</category>
      <category>aifluency</category>
    </item>
    <item>
      <title>The Rise of Product Engineering</title>
      <dc:creator>Vaibhav Doddihal</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:24:55 +0000</pubDate>
      <link>https://dev.to/vibbsdod/the-rise-of-product-engineering-d04</link>
      <guid>https://dev.to/vibbsdod/the-rise-of-product-engineering-d04</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://blocksimplified.com/blog/the-rise-of-product-engineering" rel="noopener noreferrer"&gt;BlockSimplified&lt;/a&gt; — 4 min read&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Something I told every new engineer who joined my team: "You are not a frontend engineer. You are not a backend engineer. Those are just tags for the domain where you have depth. You are a product engineer."&lt;/p&gt;

&lt;p&gt;Most of them looked confused. Some pushed back. But the ones who got it became the strongest engineers I have worked with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/posts/marcinroszczyk_the-quiet-return-of-engineers-ai-is-exposing-share-7428848892328431616-gJCy" rel="noopener noreferrer"&gt;Marcin Roszczyk wrote something recently&lt;/a&gt; that put words to what I have been saying for years: AI is not killing software engineering. It is exposing what engineering actually is. For the past two decades, we called millions of people software engineers, but most of the work was implementation. Assembling systems, translating requirements into code. Now that AI can implement faster than any of us, the gap is visible.&lt;/p&gt;

&lt;p&gt;I agree. And I have seen this play out firsthand while leading teams of 10 to 14 engineers as a Tech Lead and Principal Engineer.&lt;/p&gt;

&lt;p&gt;For years, our industry rewarded implementation volume. Ship tickets. Close sprint points. Move fast inside your lane.&lt;/p&gt;

&lt;p&gt;When implementation stops being scarce, judgment becomes the scarce asset. And that is what is happening right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Product engineers, not label engineers
&lt;/h2&gt;

&lt;p&gt;Frontend and backend are domain tags, not identity. They are not excuses to ignore the rest of the user journey.&lt;/p&gt;

&lt;p&gt;If you are on frontend and the integration is painful, saying "the API is wrong" is not engineering. Explain why it is wrong. Show the contract problem. Propose a better shape for the response. Make it easier for both the client and the system.&lt;/p&gt;

&lt;p&gt;If you are on backend, payload size is your problem too. A heavy response hurts parse time, rendering performance, and perceived speed in the browser. Engineering means optimizing the whole path, not just your endpoint benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The role trap early-career engineers fall into
&lt;/h2&gt;

&lt;p&gt;I see this often: new engineers rush to lock identity around FE, BE, or DevOps.&lt;/p&gt;

&lt;p&gt;Specialization is good. Role tribalism is not.&lt;/p&gt;

&lt;p&gt;Real engineering comes from curiosity about how the full application behaves end to end. You do not need to be a jack of all trades. You do need enough context to make decisions that help the product, not just your layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example from my team
&lt;/h2&gt;

&lt;p&gt;We were building local-first software. On first launch, the app needed to sync a large dataset and configure the local environment. We had already stripped initialization to the bare minimum, but the setup still took long enough that users bounced before seeing any value.&lt;/p&gt;

&lt;p&gt;The obvious fix was a loading spinner or progress bar. But that just tells users "wait." It does not solve the actual problem: the user has no reason to stay.&lt;/p&gt;

&lt;p&gt;Two backend engineers proposed something different. Instead of an empty loading state, show interactive marketing slides that teach users how the product works and what they can do with it. At the bottom, display clear messaging about exactly what the system is doing and why it takes time. The user gets oriented and sees value before the app is even ready.&lt;/p&gt;

&lt;p&gt;The constraints were real: the slides had to load instantly from bundled assets (no network dependency during setup), the progress messaging had to be accurate (not a fake progress bar), and the transition to the live app had to feel seamless. They prototyped it, tested the timing, and shipped it.&lt;/p&gt;

&lt;p&gt;NPS went up. Bounce rate during onboarding dropped.&lt;/p&gt;

&lt;p&gt;The best part: two backend engineers drove this from idea through execution. They did not say "that is a frontend problem." They saw a product problem, understood the user experience constraint, and built a solution that worked across the stack.&lt;/p&gt;

&lt;p&gt;That is product engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  A necessary caveat
&lt;/h2&gt;

&lt;p&gt;None of this means specialization is dead. AI still makes junior engineers dramatically more productive at implementation. There are domains like performance-critical systems, security, and compliance where deep specialists are exactly what you need. Not every team needs product engineers. Some need someone who knows memory allocation patterns better than anyone alive.&lt;/p&gt;

&lt;p&gt;The point is not "everyone should do everything." The point is that curiosity beyond your lane is what separates engineers who grow from engineers who plateau.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final take
&lt;/h2&gt;

&lt;p&gt;AI is rewarding engineers who are curious beyond their lane. The next generation of strong teams will be built by people who can reason across boundaries, communicate trade-offs clearly, and care about outcomes more than labels.&lt;/p&gt;

&lt;p&gt;Implementation still matters. But engineering is the moat.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continue Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enjoyed this article?&lt;/strong&gt; Here's how to get more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://blocksimplified.com/blog/" rel="noopener noreferrer"&gt;Read on BlockSimplified&lt;/a&gt;&lt;/strong&gt; for curated resources and FAQs&lt;/li&gt;
&lt;li&gt;Follow for more insights on AI, development, and tech&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>engineering</category>
      <category>productengineering</category>
      <category>techleadership</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
