DEV Community: Vaibhav Doddihal

The Leap to Agentic AI: Introduction to Multi-Agent Systems

Vaibhav Doddihal — Thu, 02 Jul 2026 10:28:31 +0000

Originally published on BlockSimplified — 24 min read

This post is part of my AI Fluency series. We've covered single agents in Module 4; now we're scaling up. Module 5 is about getting multiple agents to work together, which is harder than it sounds.

I remember the moment I realized single agents weren't enough. I had built a research assistant that could search the web, summarize articles, and answer questions. It worked well for simple queries. Then I asked it to "research the AI agent landscape, compare the top 5 frameworks, and write a technical blog post with code examples." It choked. The context window filled up, the output became unfocused, and the code examples were hallucinated garbage.

That's when I started exploring multi-agent systems. The idea is simple: instead of one agent doing everything, you create a team. A researcher agent gathers information. A writer agent crafts the prose. A coder agent handles the technical examples. A reviewer agent catches errors. Each specialist does what it's good at, and together they produce something none could create alone.

Why Single Agents Hit a Wall

Single agents are powerful. With the right tools and prompts, they can do impressive things. The question is where they break down.

Here are the walls I've hit:

Context window limits. Complex tasks pile up information fast: research results, previous outputs, tool responses, conversation history. A single agent running a 10-step workflow can exhaust its context window before it finishes the job.

Specialization beats generalization. A system prompt can only stretch one agent so far. Ask it to be a world-class researcher AND writer AND coder AND editor, and you get the jack-of-all-trades problem: competent everywhere, excellent nowhere.

No second opinion. A single agent can hallucinate, make logical errors, or drift off-task, and nobody is watching. A second agent that reviews the first one's work catches mistakes that would otherwise slip through.

Parallelization. A single agent works through tasks one at a time. When subtasks are independent, multiple agents can research different parts of a problem at the same time.
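
To make the parallelization point concrete, here is a minimal sketch using Python's asyncio. The `research_agent` coroutine is a hypothetical stand-in for a real agent call (an LLM request or tool invocation), and the topic names are made up for illustration.

```python
import asyncio

async def research_agent(topic: str) -> str:
    """Hypothetical stand-in for an agent call; a real one would await an LLM or tool."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"notes on {topic}"

async def research_in_parallel(topics: list[str]) -> dict[str, str]:
    # gather() runs the independent subtasks concurrently and
    # returns results in the same order as the inputs
    results = await asyncio.gather(*(research_agent(t) for t in topics))
    return dict(zip(topics, results))

notes = asyncio.run(research_in_parallel(["frameworks", "protocols", "failure modes"]))
```

This only pays off when the subtasks really are independent. If one agent needs another's output, you are back to running them in sequence.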

The Human Team Analogy

Think about how a software team actually ships a feature. Nobody assigns one person to design, build, test, and document it solo. You split the work.

Here's how that maps:

Role	Responsibility	Agent Equivalent
Product Manager	Defines requirements, prioritizes	Planner Agent
Researcher	Investigates solutions, gathers context	Research Agent
Developer	Writes the code	Coder Agent
Code Reviewer	Catches bugs, suggests improvements	Reviewer Agent
Technical Writer	Documents the work	Writer Agent
QA Tester	Validates the implementation	Tester Agent

Each person has deep expertise in their area. They communicate through defined channels (standups, PRs, docs). A project manager coordinates the workflow. Sound familiar?

Agent orchestration is the AI equivalent of project management. Someone (or something) needs to decide: which agent handles this task? In what order? What happens when an agent fails?

Role Specialization: What Makes Each Agent Unique

In a multi-agent system, each agent has a distinct role. The role defines:

What the agent knows (system prompt, context)
What the agent can do (available tools)
What the agent is responsible for (its piece of the workflow)

Here's a concrete example. Let's say you're building a "research and write" system for technical blog posts.

The Research Agent

Role: Technical Researcher
Goal: Gather accurate, comprehensive information on the given topic
Backstory: You're a meticulous researcher who digs deep into technical topics.
           You cite sources, verify claims, and organize findings clearly.

Tools: web_search, read_documentation, fetch_github_repos

This agent's entire job is research. It doesn't write prose or format content. It searches, reads, and compiles facts. Its output is structured research notes that another agent will use.

The Writer Agent

Role: Technical Writer
Goal: Transform research into engaging, clear technical content
Backstory: You're an experienced technical writer who explains complex topics
           in accessible language. You use analogies, examples, and structure.

Tools: none (pure generation)

The writer takes the researcher's output and crafts it into readable content. It doesn't search the web or verify facts; that was done upstream. It focuses purely on writing quality.

The Editor Agent

Role: Technical Editor
Goal: Review content for accuracy, clarity, and consistency
Backstory: You're a detail-oriented editor who catches errors others miss.
           You verify technical claims, improve sentence structure, and ensure
           the content matches the target audience.

Tools: fact_check, grammar_check

The editor is the quality gate. It reviews the writer's output, flags issues, and either approves or requests revisions.
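
One way to keep role definitions like these consistent is to store them as data and render the system prompt from that. The sketch below assumes a hypothetical `AgentSpec` class (not taken from any framework) and treats tools as plain names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """Illustrative container for a role definition; not a framework class."""
    role: str
    goal: str
    backstory: str
    tools: tuple[str, ...] = ()

    def system_prompt(self) -> str:
        tool_list = ", ".join(self.tools) if self.tools else "none"
        return (
            f"Role: {self.role}\n"
            f"Goal: {self.goal}\n"
            f"Backstory: {self.backstory}\n"
            f"Tools: {tool_list}"
        )

researcher = AgentSpec(
    role="Technical Researcher",
    goal="Gather accurate, comprehensive information on the given topic",
    backstory="You're a meticulous researcher who digs deep into technical topics.",
    tools=("web_search", "read_documentation", "fetch_github_repos"),
)

writer = AgentSpec(
    role="Technical Writer",
    goal="Transform research into engaging, clear technical content",
    backstory="You're an experienced technical writer.",
)  # no tools: pure generation

print(researcher.system_prompt())
```

Keeping the tool list in the spec makes it easy to see at a glance that the writer cannot search the web, which is the point of the separation.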

Collaboration Patterns: Centralized vs. Decentralized

Once you have multiple agents, you need to decide how they collaborate. The two main agent collaboration patterns are centralized and decentralized.

Centralized: The Manager Pattern

In centralized orchestration, one agent (the "manager" or "coordinator") controls the workflow. It receives the initial task, breaks it into subtasks, assigns each to the appropriate specialist agent, collects results, and delivers the final output.

┌─────────────────────────────────────────────────┐
│                                                 │
│              ┌──────────────┐                   │
│              │   MANAGER    │                   │
│              │    AGENT     │                   │
│              └──────┬───────┘                   │
│                     │                           │
│         ┌──────────┼──────────┐                 │
│         ▼          ▼          ▼                 │
│   ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│   │ RESEARCH │ │  WRITER  │ │  EDITOR  │        │
│   │  AGENT   │ │  AGENT   │ │  AGENT   │        │
│   └──────────┘ └──────────┘ └──────────┘        │
│                                                 │
└─────────────────────────────────────────────────┘

Pros:

Clear control flow
Easy to debug (you can trace every decision through the manager)
Single point of coordination
Works well for defined workflows

Cons:

Manager is a bottleneck
If the manager fails, everything fails
Doesn't scale well to large agent networks

Most teams should start here. It's simpler, and you can always evolve to decentralized later.

Decentralized: Agent-to-Agent

In decentralized patterns, agents communicate directly with each other based on protocols or discovery mechanisms. There's no central manager; agents negotiate, delegate, and collaborate autonomously.

┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐         ┌──────────┐             │
│   │ RESEARCH │◄───────►│  WRITER  │             │
│   │  AGENT   │         │  AGENT   │             │
│   └────┬─────┘         └────┬─────┘             │
│        │                    │                   │
│        │    ┌──────────┐    │                   │
│        └───►│  EDITOR  │◄───┘                   │
│             │  AGENT   │                        │
│             └──────────┘                        │
│                                                 │
└─────────────────────────────────────────────────┘

Pros:

No single point of failure
Scales to large agent networks
Agents can dynamically discover collaborators
Works for marketplace/negotiation scenarios

Cons:

Harder to debug (who decided what?)
Requires protocols and discovery mechanisms
More failure modes
Higher complexity

Decentralized patterns make sense when you have many agents, dynamic environments, or need agents from different vendors/platforms to collaborate. This is where protocols like A2A (Agent2Agent) come in.

Communication: How Agents Talk to Each Other

Agents need to exchange information. The mechanism you choose affects how easy the system is to debug, how reliably it delivers results, and whether agents from different vendors can work together.

Message Passing

The simplest approach: agents send messages to each other. The message includes:

Who it's from
Who it's for
The content (task, results, questions)
Any relevant context

# Simplified message structure
message = {
    "from": "research_agent",
    "to": "writer_agent",
    "type": "task_result",
    "content": {
        "topic": "Multi-agent systems",
        "findings": [...],
        "sources": [...]
    }
}

This works for simple systems but gets messy as you scale. Who manages the message queue? How do you handle failed delivery? What if agents speak different "languages"?
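
Here is a minimal in-process sketch of message passing, assuming one inbox queue per agent. `Message` and `MessageBus` are illustrative names, not part of any library. Note how delivery to an unknown recipient fails loudly instead of silently dropping the message.

```python
from dataclasses import dataclass
from queue import Empty, Queue
from typing import Optional

@dataclass
class Message:
    sender: str
    recipient: str
    type: str
    content: dict

class MessageBus:
    """Illustrative in-process bus: one FIFO inbox per registered agent."""

    def __init__(self) -> None:
        self._inboxes: dict[str, Queue] = {}

    def register(self, agent_name: str) -> None:
        self._inboxes[agent_name] = Queue()

    def send(self, message: Message) -> None:
        if message.recipient not in self._inboxes:
            # Failed delivery is an explicit error, not a lost message
            raise KeyError(f"Unknown recipient: {message.recipient}")
        self._inboxes[message.recipient].put(message)

    def receive(self, agent_name: str) -> Optional[Message]:
        try:
            return self._inboxes[agent_name].get_nowait()
        except Empty:
            return None  # nothing waiting for this agent

bus = MessageBus()
bus.register("writer_agent")
bus.send(Message("research_agent", "writer_agent", "task_result",
                 {"topic": "Multi-agent systems", "findings": ["finding A"]}))
received = bus.receive("writer_agent")
```

A production system would swap the in-memory queues for a real broker, but the questions stay the same: who owns the queue, and what happens on failed delivery.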

Shared Memory

Instead of passing messages, agents read from and write to a shared state (like a database or in-memory store). Each agent checks the shared memory for new tasks and writes its results back, where the other agents can pick them up.

┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│   │ RESEARCH │   │  WRITER  │   │  EDITOR  │    │
│   │  AGENT   │   │  AGENT   │   │  AGENT   │    │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘    │
│        │              │              │          │
│        ▼              ▼              ▼          │
│   ┌──────────────────────────────────────┐      │
│   │         SHARED MEMORY / STATE        │      │
│   │  (Redis, Vector Store, Database)     │      │
│   └──────────────────────────────────────┘      │
│                                                 │
└─────────────────────────────────────────────────┘

When to use shared memory:

Agents need access to the same data
You want to decouple producers from consumers
State needs to persist across agent restarts
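
A rough sketch of the shared-memory idea, using an in-memory dictionary guarded by a lock in place of Redis or a database. The `Blackboard` class and the two step functions are hypothetical names for illustration.

```python
import threading

class Blackboard:
    """Illustrative shared state; a real system might back this with Redis or a database."""

    def __init__(self) -> None:
        self._state: dict = {}
        self._lock = threading.Lock()

    def write(self, key: str, value) -> None:
        with self._lock:
            self._state[key] = value

    def read(self, key: str, default=None):
        with self._lock:
            return self._state.get(key, default)

def research_step(board: Blackboard) -> None:
    # Producer: publishes findings without knowing who will consume them
    board.write("research", ["finding A", "finding B"])

def writer_step(board: Blackboard) -> bool:
    # Consumer: only acts once upstream output exists
    notes = board.read("research")
    if notes is None:
        return False
    board.write("draft", "Draft based on: " + "; ".join(notes))
    return True

board = Blackboard()
ran_early = writer_step(board)   # research not written yet
research_step(board)
ran_late = writer_step(board)
```

The producer and consumer never reference each other, only the keys they agree on. That is the decoupling the list above describes.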

Communication Protocols: A2A (and where ACP went)

For agents to communicate across platforms or vendors, you need standardized protocols. The picture got a lot clearer in 2025-2026, and it's worth knowing how it shook out so you don't bet on a dead standard.

Agent2Agent (A2A): Originally Google's open protocol for agent interoperability, donated to the Linux Foundation in June 2025 under neutral governance. Agents publish "Agent Cards" describing what they can do, and other agents can discover and invoke them. A2A reached v1.0 in April 2026 with Signed Agent Cards (cryptographic identity verification so a receiving agent can confirm a card really came from its claimed owner), multi-tenancy, and JSON-RPC/gRPC bindings. By its one-year mark it had 150+ supporting organizations and production integrations across Azure AI Foundry, Amazon Bedrock AgentCore, and Google Cloud.

Agent Communication Protocol (ACP): IBM's REST-based protocol (launched March 2025 to power the BeeAI platform) for lightweight agent invocation. In August 2025, ACP merged into A2A under the Linux Foundation — the BeeAI platform now uses A2A. So if you saw "ACP vs A2A" debates from early 2025, that question has been answered: it's A2A.

These protocols matter for the future of multi-agent systems. Today, most teams still let frameworks handle communication internally. But as agent ecosystems grow and you need agents from different vendors to collaborate, A2A is becoming the interoperability layer worth learning.

One concrete sign of maturity: A2A now has an official payments extension, the Agent Payments Protocol (AP2), built with 60+ payments and tech companies (Mastercard, PayPal, American Express, Coinbase, and others) so agents can securely initiate and authorize transactions on a user's behalf. Agentic commerce is moving from demo to standard.

A Simple Multi-Agent Example

Here's pseudocode for a "research and write" system with centralized orchestration, where a manager coordinates two specialist agents:

# Pseudocode for a simple multi-agent system

def run_multi_agent(task: str):
    # Step 1: Manager breaks down the task
    manager = create_agent(
        role="Manager",
        goal="Coordinate research and writing tasks"
    )

    subtasks = manager.plan(task)
    # subtasks = ["Research X", "Write article about X"]

    results = {}

    # Step 2: Research agent handles research
    researcher = create_agent(
        role="Researcher",
        tools=[web_search, read_docs]
    )
    results["research"] = researcher.execute(subtasks[0])

    # Step 3: Writer agent handles writing, using research results
    writer = create_agent(
        role="Writer",
        context=results["research"]
    )
    results["draft"] = writer.execute(subtasks[1])

    # Step 4: Manager reviews and returns
    final = manager.review(results["draft"])
    return final

This is about 20 lines of pseudocode, but it captures the core pattern:

A manager agent plans the workflow
Specialist agents execute their piece
Results flow from one agent to the next
The manager delivers the final output

Real implementations add error handling, retries, logging, and more sophisticated orchestration. But the foundation is this simple.
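
For readers who want something they can run without an LLM, here is a sketch of the same flow with stub agents. Everything in it is an assumption made for illustration: `StubAgent` replaces real model calls with plain functions, and the "manager" is ordinary code with a fixed plan.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StubAgent:
    """Stand-in for an LLM-backed agent; handler maps (task, context) to output."""
    role: str
    handler: Callable[[str, str], str]

    def execute(self, task: str, context: str = "") -> str:
        return self.handler(task, context)

def run_pipeline(topic: str) -> str:
    researcher = StubAgent("Researcher", lambda task, ctx: f"[notes for: {task}]")
    writer = StubAgent("Writer", lambda task, ctx: f"Draft of '{task}' using {ctx}")

    # Manager step 1: plan (fixed here; an LLM planner would generate this)
    subtasks = [f"Research {topic}", f"Write article about {topic}"]

    # Manager step 2: dispatch, passing research results downstream as context
    research = researcher.execute(subtasks[0])
    draft = writer.execute(subtasks[1], context=research)

    # Manager step 3: review before returning
    if not draft.strip():
        raise ValueError("Writer produced an empty draft")
    return draft

final = run_pipeline("X")
```

Swapping a stub for a real agent changes the handler and nothing else. That makes stubs a cheap way to test orchestration logic before paying for model calls.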

When Things Go Wrong: Handling Agent Failures

In single-agent systems, failure is straightforward: the agent errors, you handle it. In multi-agent systems, failures cascade. Agent A fails, so Agent B doesn't get input, so Agent C produces garbage.

This isn't hand-waving. It's measured. UC Berkeley's MAST study ("Why Do Multi-Agent LLM Systems Fail?") hand-annotated 150 conversation traces across seven popular open-source multi-agent frameworks and found 14 distinct failure modes that cluster into three buckets: system/specification design (~41%), inter-agent misalignment (~37%), and task verification (~21%). The headline takeaway: most multi-agent failures aren't reasoning failures — they're coordination and verification failures. Agents act on stale or divergent views of shared state, or nobody checks the final output. That's exactly where you should spend your engineering effort.

Here's how I think about failure handling:

1. Fail Fast with Clear Errors

Each agent should validate its inputs and outputs. If an agent receives garbage, it should fail immediately with a clear error rather than produce garbage output.

def execute_with_validation(agent, task, input_data):
    # Validate input
    if not input_data or not input_data.get("content"):
        raise ValueError(f"Agent {agent.role} received empty input")

    result = agent.execute(task, input_data)

    # Validate output
    if not result or len(result) < 100:
        raise ValueError(f"Agent {agent.role} produced insufficient output")

    return result

2. Retry with Backoff

Transient failures (rate limits, network blips) should trigger retries:

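import time

# Placeholder exception type; in practice, catch your LLM client's rate-limit/timeout errors
class TransientError(Exception):
    """Recoverable failure such as a rate limit or network blip."""

def execute_with_retry(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            return agent.execute(task)
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...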

3. Fallback Agents

For critical tasks, have a backup. If your primary research agent fails, maybe a simpler agent with different tools can provide basic results:

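import logging
log = logging.getLogger(__name__)

class AgentError(Exception):
    """Placeholder for a non-recoverable agent failure."""

def research_with_fallback(task):
    try:
        return primary_research_agent.execute(task)
    except AgentError:
        log.warning("Primary researcher failed, using fallback")
        return fallback_research_agent.execute(task)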

4. Human Escalation

Sometimes the right answer is "ask a human." Build escalation paths for high-stakes or ambiguous situations:

def execute_with_escalation(agent, task, confidence_threshold=0.7):
    result = agent.execute(task)

    if result.confidence < confidence_threshold:
        return request_human_review(task, result)

    return result
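
These patterns compose. The sketch below combines retry and fallback in one helper and can be run as-is with stub agents. The exception classes, the `run_resilient` name, and the callable-based agent interface are all assumptions for illustration.

```python
import time

class TransientError(Exception):
    """Recoverable failure (rate limit, network blip)."""

class AgentError(Exception):
    """Non-recoverable failure for this agent."""

def run_resilient(primary, fallback, task, max_retries=3, base_delay=1.0):
    """Retry the primary agent on transient errors, then hand off to the fallback."""
    for attempt in range(max_retries):
        try:
            return primary(task)
        except TransientError:
            if attempt < max_retries - 1:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        except AgentError:
            break  # retrying will not help; go straight to the fallback
    return fallback(task)

# Stub agents: the primary fails twice with a transient error, then succeeds
calls = {"count": 0}

def flaky_primary(task):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return f"primary: {task}"

def simple_fallback(task):
    return f"fallback: {task}"

result = run_resilient(flaky_primary, simple_fallback, "summarize", base_delay=0)
```

Separating transient from non-recoverable errors is the key design choice. Retrying a deterministic failure only burns time and API budget.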

Practical Starting Point: CrewAI

If you want to try multi-agent systems today, CrewAI is a good starting point. It provides clear abstractions for:

Agents: Define role, goal, backstory, tools
Tasks: Define what needs to be done, expected output
Crews: Group agents and tasks into workflows
Processes: Sequential or hierarchical execution

Here's what a minimal CrewAI setup looks like:

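from crewai import Agent, Task, Crew
from crewai_tools import SerperDevTool, ScrapeWebsiteTool

# Tools for the researcher (SerperDevTool expects SERPER_API_KEY in the environment)
search_tool = SerperDevTool()
scrape_tool = ScrapeWebsiteTool()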

# Define agents
researcher = Agent(
    role="Senior Researcher",
    goal="Find accurate, comprehensive information",
    backstory="You're an expert researcher with attention to detail",
    tools=[search_tool, scrape_tool]
)

writer = Agent(
    role="Technical Writer",
    goal="Create clear, engaging technical content",
    backstory="You explain complex topics in simple terms"
)

# Define tasks
research_task = Task(
    description="Research multi-agent systems and their applications",
    expected_output="Structured research notes with sources",
    agent=researcher
)

writing_task = Task(
    description="Write a blog post based on the research",
    expected_output="1500-word blog post",
    agent=writer
)

# Create and run the crew
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task]
)

result = crew.kickoff()

This is real code (not pseudocode). CrewAI handles the orchestration, message passing, and execution order. You focus on defining agents and tasks.

What's Next

You've got the foundations: why single agents hit walls, how to carve up roles, and when centralized beats decentralized. The next posts in this series go deeper.

For now, pick a task that naturally splits into two phases (research + writing is a good one), build one agent for each, and run it. You'll hit a failure mode or an unexpected behavior within the first few runs. That's the real learning.

Key Concepts Recap

Multi-Agent Systems
Agent Orchestration
Collaboration Patterns
AI Agents
Agentic AI
Agentic Systems
A2A Protocol
Function Calling
Guardrails

FAQs

What is a multi-agent system in AI?

A multi-agent system is a setup where several specialized AI agents collaborate to solve problems that would overwhelm a single agent. Each agent has a specific role, set of tools, and responsibility, like a researcher, a writer, and an editor working as a team instead of one person trying to do all three jobs.

When should I use a multi-agent system instead of a single agent?

Reach for multi-agent when you hit real walls with a single agent: context window limits on long workflows, the need for true specialization that one system prompt can't deliver, wanting a second agent to review for errors, or genuine parallelization. If a well-crafted prompt chain already solves your problem, stop there. Multi-agent adds real complexity.

What is the difference between centralized and decentralized agent orchestration?

Centralized orchestration uses a manager agent to assign tasks and collect results, which gives you clear control flow, easy debugging, and a good fit for defined workflows. Decentralized lets agents communicate directly without a central coordinator, which scales better in dynamic environments but is harder to debug and more complex to build. Start centralized.

How many agents should I use in a multi-agent system?

Start with the minimum: usually 2-3 agents with clearly distinct roles. Every additional agent adds coordination overhead, more API calls, and more potential failure points. Add agents only when you've identified a clear capability gap. In practice, 3-5 agents handle most use cases. If you need more, you might be over-engineering or could restructure into sub-crews.

Can different agents use different LLM providers?

Yes, and sometimes you should. A research agent might benefit from a model with strong web browsing capabilities, while a coding agent works better with a model optimized for code generation. Mix and match based on each agent's needs. Just watch out for increased complexity in error handling, cost tracking, and latency management.

What is the difference between multi-agent systems and prompt chaining?

Prompt chaining is sequential: output from prompt A becomes input for prompt B. The control flow is linear and fixed in advance. Multi-agent systems add autonomy: agents can decide what to do, use tools, and communicate in non-linear ways. Prompt chains are simpler and sufficient for many use cases. Multi-agent systems handle more complex, dynamic workflows.

The "Why": A Framework for AI Ethics

Vaibhav Doddihal — Thu, 18 Jun 2026 13:45:53 +0000

Originally published on BlockSimplified

This article is part of my AI Fluency Curriculum, documenting my learnings around AI Fluency & Applied AI.
This is the first post in Module 8: Ethics, Safety, and Governance. We're starting with the foundational question: why do ethics matter for AI, and how do we actually practice them?

Ethics in AI isn't a box to check at the end of a project. It's a way of thinking that shapes every decision, from data collection to deployment. Get it wrong, and your AI fails the people who use it. Real people. At scale.

Current as of June 2026
This guide reflects the latest landscape: the EU AI Act's general-purpose AI obligations went live in August 2025, but in May 2026 EU negotiators provisionally agreed a "Digital Omnibus" that pushes the high-risk-system obligations back from August 2026 to December 2027; the 2025-26 wave of AI hiring-bias litigation (Mobley v. Workday's certified collective action, still escalating in 2026, and Harper v. Sirius XM); Fairlearn 0.14 (June 2026); and the growing role of ISO/IEC 42001 and the NIST Generative AI Profile as governance baselines.

I want to start with a confession. When I first heard "AI ethics" years ago, I mentally filed it under "compliance stuff that slows down real work." I was wrong, and the cases in this post are what changed my mind.

Look closely at how the famous failures actually happened and the same shape shows up every time. A well-intentioned team. No malice. No obvious bug. The system worked exactly as designed, but the design hadn't accounted for the ethical implications of the data patterns it learned. Amazon spent years building a recruiting tool it eventually threw away. A healthcare algorithm shaped care decisions for an estimated 100 million people before anyone caught what it was doing. None of these were cheap to fix, and none were caught early.

Ethics isn't the enemy of shipping. It's the prerequisite for shipping something that doesn't blow up in your face.

The four pillars of AI ethics, in brief

AI ethics breaks down into four pillars (FATP): fairness, accountability, transparency, and privacy.

AI amplifies bias in its training data: Amazon's recruiting tool, COMPAS risk scores, and a healthcare algorithm affecting 100 million patients all show how.

Fairness definitions are mathematically incompatible when base rates differ, so you must pick one and document the trade-off.

Treat ethics as a design constraint with concrete tools like Fairlearn, not a post-launch box to check.

What You'll Learn

By the end of this post, you'll be able to:

Understand and apply the four pillars of AI ethics: fairness, accountability, transparency, and privacy
Recognize real-world examples of AI systems causing ethical harm
Use technical interventions like Fairlearn for measuring and mitigating bias
Navigate the trade-offs between competing ethical principles

We'll cover three levels: Beginner (real-world case studies of AI bias), Intermediate (technical fairness interventions), and Advanced (societal trade-offs and formal mitigation strategies).

Beginner: Real-World Case Studies of AI Harm

Let me start with stories. Not because I want to scare you, but because ethical failures aren't abstract. They happen to real people, and understanding what went wrong is the first step to building differently.

Case Study 1: Amazon's Hiring Algorithm

In 2018, Reuters reported that Amazon had scrapped an internal AI recruiting tool after discovering it was biased against women. The system was trained on resumes submitted over 10 years, a period when the tech industry was predominantly male. The AI learned that male candidates were preferable, penalizing resumes that included the word "women's" (as in "women's chess club") and downgrading graduates of all-women's colleges.

The AI wasn't malicious. It was doing exactly what it was trained to do: find patterns in historical data and replicate them. The problem was that historical data encoded historical discrimination. The algorithm automated and scaled what had been individual human bias.

The lesson: AI systems don't transcend their training data. They amplify it.

Case Study 2: COMPAS Criminal Risk Assessment

ProPublica's 2016 investigation of COMPAS, a criminal risk assessment algorithm used across the US, found that the system was significantly more likely to incorrectly flag Black defendants as high-risk compared to white defendants. A white defendant and a Black defendant with similar criminal histories would receive different risk scores.

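The company that made COMPAS, Northpointe (now Equivant), disputed the methodology. They argued that the algorithm met a different fairness criterion: among defendants given the same risk score, reoffense rates were similar across racial groups (predictive parity). ProPublica's analysis focused on unequal false positive rates. Both sides were technically correct; they were just using different definitions of fairness.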

The COMPAS case surfaces something the Amazon case didn't: algorithmic fairness isn't a single thing. There are multiple, mathematically incompatible definitions of what "fair" means. You literally cannot satisfy all of them simultaneously when different groups have different base rates.

The lesson: "Make it fair" isn't a specification. You have to choose which type of fairness matters most for your context, and be honest about the trade-offs.
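
The incompatibility is easy to see with a little arithmetic. In this sketch the numbers are invented for illustration: two groups get identical error rates (same true positive rate and false positive rate) but have different base rates. The precision of a "high-risk" flag then differs between them.

```python
def positive_predictive_value(base_rate: float, tpr: float, fpr: float) -> float:
    """Share of flagged individuals who are true positives."""
    true_positives = base_rate * tpr
    false_positives = (1 - base_rate) * fpr
    return true_positives / (true_positives + false_positives)

# Same error rates for both groups (illustrative numbers)
TPR, FPR = 0.8, 0.2

ppv_group_a = positive_predictive_value(base_rate=0.5, tpr=TPR, fpr=FPR)  # about 0.80
ppv_group_b = positive_predictive_value(base_rate=0.2, tpr=TPR, fpr=FPR)  # about 0.50

print(f"Group A: {ppv_group_a:.2f}, Group B: {ppv_group_b:.2f}")
```

With equal error rates, a flag means something different for each group. Force the flags to mean the same thing and the error rates diverge. With an imperfect classifier and unequal base rates you cannot have both, and that is the trade-off the two sides of the COMPAS debate were standing on.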

Case Study 3: Healthcare Algorithm Racial Bias

A 2019 study published in Science found that a widely used healthcare algorithm systematically underestimated the health needs of Black patients. The algorithm used healthcare costs as a proxy for health needs, but because Black patients historically had less access to healthcare (and thus lower costs), the algorithm concluded they were healthier than they actually were.

The result? Sicker Black patients were deprioritized for care programs compared to healthier white patients. The algorithm affected an estimated 100 million patients annually.

Source: Obermeyer et al., Science (2019).

Proxy variables are dangerous
The healthcare algorithm never used race directly. It used healthcare costs, which correlated with race due to systemic inequities. This is called "bias by proxy," and it's one of the sneakiest ways discrimination enters AI systems. You can build a discriminatory system without ever touching protected attributes.

The lesson: Fairness through unawareness (not using sensitive attributes) doesn't work. Other variables carry the same signal.
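
A toy illustration of bias by proxy, with entirely synthetic records. The two groups have identical health needs, but group B's recorded costs are lower. The enrollment rule only looks at cost and never sees the group label, yet the outcomes split by group anyway.

```python
# Synthetic records: (group, health_need, annual_cost). Needs are identical
# across groups; group B's costs are lower, standing in for reduced access to care.
patients = [
    ("A", 8, 9000), ("A", 5, 6000), ("A", 3, 3500), ("A", 7, 8000),
    ("B", 8, 6000), ("B", 5, 3500), ("B", 3, 2000), ("B", 7, 5000),
]

def enrollment_rate(group: str, cost_threshold: int = 5500) -> float:
    """Share of a group enrolled by a group-blind rule: cost >= threshold."""
    costs = [cost for g, _need, cost in patients if g == group]
    return sum(cost >= cost_threshold for cost in costs) / len(costs)

needs_a = sorted(need for g, need, _cost in patients if g == "A")
needs_b = sorted(need for g, need, _cost in patients if g == "B")

rate_a = enrollment_rate("A")  # 3 of 4 enrolled
rate_b = enrollment_rate("B")  # 1 of 4 enrolled
```

Removing the group column from this data would change nothing, because the rule never used it. The only way to find the gap is to keep the sensitive attribute around for auditing.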

Case Study 4: The 2025 AI Hiring-Bias Lawsuits

The earlier cases were investigations and academic studies. By 2025, these failures became courtroom liability. In May 2025, a federal court in California granted preliminary certification of a nationwide age-discrimination collective action in Mobley v. Workday, where the plaintiff alleges that an AI applicant-screening platform rejected him from over 100 jobs (his broader suit also claims race and disability discrimination). The case has only escalated since: in March 2026 the court rejected Workday's argument that age-discrimination law doesn't cover job applicants, keeping the collective action alive. The legal theory is disparate impact: even with no intent to discriminate, a screening tool that disproportionately filters out a protected group can be unlawful.

Then in August 2025, Harper v. Sirius XM echoed the healthcare case's proxy problem in a hiring context: the complaint alleges the AI screener used educational background and zip codes as stand-ins that correlated with race. Same "bias by proxy" mechanism, brand-new legal exposure.

The lesson: Ethical failures aren't just reputational anymore. If your AI screens, ranks, or scores people, "we didn't mean to discriminate" is not a defense. Disparate impact looks at outcomes, not intent.
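
A common first-pass screen for disparate impact is to compare selection rates between groups. The sketch below uses made-up applicant counts and the "four-fifths" rule of thumb from the EEOC's Uniform Guidelines as its threshold. Treat it as a monitoring heuristic under those assumptions, not a legal test.

```python
def selection_rate(selected: int, applicants: int) -> float:
    return selected / applicants

def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Ratio of a group's selection rate to the reference (highest-rate) group."""
    return group_rate / reference_rate

# Hypothetical screening outcomes
under_40_rate = selection_rate(selected=48, applicants=80)  # 0.6
over_40_rate = selection_rate(selected=12, applicants=40)   # 0.3

ratio = impact_ratio(over_40_rate, under_40_rate)
needs_review = ratio < 0.8  # four-fifths rule of thumb

print(f"Impact ratio: {ratio:.2f}, needs review: {needs_review}")
```

Nothing in this check asks why the rates differ. It only looks at outcomes, which is the point of the disparate impact theory.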

Case Study 5: The 2023-2024 Wave (When Shipped Products Failed)

The cases above are famous partly because they're old enough to have a verdict. But this isn't ancient history. The same failure modes keep shipping in mainstream AI products.

Google's Gemini (February 2024). Google paused Gemini's ability to generate images of people after it produced historically inaccurate results: racially diverse "founding fathers," a female pope, and Black and Asian Nazi soldiers. This one is the opposite of the Amazon case. Google had tuned the model to force diversity, countering the well-documented tendency of image models to default to white faces, and the correction overshot into rewriting history. CEO Sundar Pichai called the responses "completely unacceptable." The lesson is uncomfortable: mitigating bias is a judgment call, not a switch you flip. Over-correct, and you trade one failure for another.

SafeRent (settled November 2024). A tenant-screening AI gave each applicant a score that landlords used to accept or reject them. A class action alleged the score disproportionately downgraded Black and Hispanic applicants and housing-voucher holders, because it leaned on credit history and ignored the vouchers that make rent affordable. SafeRent settled for $2.3 million and agreed to stop showing the score for voucher applicants. This is the healthcare case's proxy problem, in production, with a price tag.

Stable Diffusion (analyzed 2023). When Bloomberg generated more than 5,000 images across occupations, the model amplified real-world bias rather than just mirroring it: higher-paying jobs skewed lighter-skinned and male, lower-paying jobs skewed darker-skinned, and "a person" defaulted to a light-skinned man. A University of Washington study presented at EMNLP 2023 found the same pattern. Generative AI, the tech most of us now use daily, makes the "AI scales the bias in its data" problem worse, not better.

The throughline: 2016 to 2024, different companies, different domains, identical root cause. Anyone who tells you AI bias is a solved problem is selling something.

Why These Cases Matter

Across a decade of mainstream AI systems, built by smart people with good intentions, the same root causes show up every time:

Training data encoded historical inequities (hiring algorithm)
Fairness definitions conflict and choices weren't made explicit (COMPAS)
Proxy variables carried discriminatory signal (healthcare algorithm, SafeRent, and the 2025 hiring lawsuits)
Outcomes, not intentions, create liability (Mobley, Harper, and the rising wave of disparate-impact litigation)
Over-correcting backfires too (Gemini's forced diversity rewrote history)

If your response to these cases is "my team would never do that," you're not paying attention. The path from "reasonable business metric" to "systematically disadvantaging vulnerable populations" is shorter than most engineers realize.

The Four Pillars: A Framework You Can Actually Use

I organize AI ethics around four pillars. I call it FATP: Fairness, Accountability, Transparency, and Privacy. Use them as design constraints, not a post-launch checklist.

Pillar 1: Fairness

Fairness is about ensuring AI systems don't create or reinforce unfair bias against individuals or groups.

Key questions to ask:

Who could be harmed by this system's errors?
Are outcomes equitable across different demographic groups?
What fairness metric are we optimizing for, and why that one?
Have we tested for bias using real, representative data?

The impossible trade-off: Different fairness definitions (demographic parity, equalized odds, predictive parity) cannot all be satisfied simultaneously when base rates differ between groups. This is mathematically proven. You have to choose.

For a hiring algorithm, you might prioritize equal selection rates across groups (demographic parity). For a medical diagnosis system, you might prioritize equal false negative rates across groups (equal opportunity, the true-positive-rate half of equalized odds) because missing a disease is the critical error. The right choice depends on what harms you're most trying to prevent.
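To make those two choices concrete, here is a minimal sketch in plain Python that computes each group's selection rate and false negative rate by hand. The helper name `group_rates` and the toy data are made up for illustration; a real audit would use a library and far more data.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Return {group: (selection_rate, false_negative_rate)}."""
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "pos": 0, "missed": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["selected"] += pred                          # predicted positive
        s["pos"] += truth                              # actually positive
        s["missed"] += int(truth == 1 and pred == 0)   # false negative
    return {
        g: (s["selected"] / s["n"], s["missed"] / s["pos"] if s["pos"] else 0.0)
        for g, s in stats.items()
    }

# Toy data (hypothetical): 1 = positive outcome, two groups "a" and "b"
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
for g, (sel, fnr) in sorted(rates.items()):
    print(f"group {g}: selection rate={sel:.2f}, false negative rate={fnr:.2f}")
```

A demographic-parity view compares the first number across groups; a missed-diagnosis view compares the second. The same predictions can look fine on one and bad on the other.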

Pillar 2: Accountability

Accountability establishes who is responsible when AI systems cause harm.

Key questions to ask:

Who owns this AI system's outcomes?
What happens when it makes a mistake?
Can affected individuals appeal or challenge AI decisions?
Is there human oversight for high-stakes decisions?

The accountability gap: Traditional accountability frameworks assumed human decision-makers. AI breaks that. When an autonomous system denies your loan, who do you hold accountable? The developer who wrote the algorithm? The company that deployed it? The data provider whose dataset contained bias? The accountability gap is what you're left with when nobody clearly owns the outcome.

Practical accountability requires:

Clear role assignments (who can stop a deployment, who reviews outcomes)
Audit trails (records of what decisions were made and why)
Redress mechanisms (how harmed individuals can seek remedy)
Meaningful human oversight, where reviewers can actually change or stop a decision

Pillar 3: Transparency

Transparency means making AI systems understandable to stakeholders.

Key questions to ask:

Can affected individuals understand why a decision was made about them?
Is the AI system's involvement disclosed?
Are the limitations documented and communicated?
Can the system be audited by external parties?

Levels of transparency:

Technical explainability: What features drove this prediction? (SHAP values, attention weights)
User interpretability: Why did I get this result, in plain language?
Organizational disclosure: Are people told when AI is making decisions about them?

Different stakeholders need different types of transparency. An ML engineer debugging a model needs technical explainability. An end user denied a loan needs human-understandable reasoning. A regulator needs audit access.

Pillar 4: Privacy

Privacy protects individuals from unauthorized collection, use, and exposure of their data.

Key questions to ask:

What data does this system collect, and is collection minimized?
How long is data retained?
Can individuals access, correct, or delete their data?
Are there protections against re-identification from "anonymized" data?

AI-specific privacy concerns:

Training data privacy: Models can memorize and regurgitate training data, including personal information
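Inference attacks: Attackers can use model outputs to infer whether a specific person's record was in the training set (membership inference) or to reconstruct parts of the training data (extraction and model inversion)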
Aggregation risks: Combining multiple non-sensitive attributes can reveal sensitive information

Privacy by design means building protections into the system from the start, not bolting them on later.
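The aggregation risk is easy to check for. The sketch below, with hypothetical records and a helper name of my own (`smallest_group_size`), counts how many records share the same combination of attributes. The smallest count is the "k" in k-anonymity: a value of 1 means somebody is unique, and therefore potentially re-identifiable.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    return min(Counter(keys).values())

# Hypothetical "anonymized" records: no names, but risky in combination
records = [
    {"zip": "98101", "birth_year": 1985, "gender": "F"},
    {"zip": "98101", "birth_year": 1990, "gender": "M"},
    {"zip": "98102", "birth_year": 1985, "gender": "M"},
    {"zip": "98102", "birth_year": 1990, "gender": "F"},
]

# Each attribute alone is shared by two people...
print(smallest_group_size(records, ["zip"]))
# ...but the combination singles out every record
print(smallest_group_size(records, ["zip", "birth_year", "gender"]))
```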

Intermediate: Technical Fairness Interventions

Theory is nice, but let's get practical. How do you actually measure and mitigate bias in a real AI system?

Measuring Bias with Fairlearn

Fairlearn is an open-source toolkit (now a community-governed project, originally from Microsoft) that helps you assess and improve fairness. The current release is 0.14.0 (June 2026), which brought scikit-learn 1.6 compatibility and made the CorrelationRemover (handy for stripping proxy signal from features) fully scikit-learn-compatible. Here's a practical walkthrough.

Step 1: Define your fairness metric

Before you measure anything, decide what "fair" means for your use case. Fairlearn supports multiple metrics:

Demographic parity: Selection rates are equal across groups
Equalized odds: True positive and false positive rates are equal across groups
Predictive parity: Precision is equal across groups

# Example: Measuring demographic parity difference
from fairlearn.metrics import demographic_parity_difference

# y_true: actual outcomes, y_pred: predicted outcomes
# sensitive_features: group membership (e.g., gender, race)
dp_diff = demographic_parity_difference(
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

print(f"Demographic Parity Difference: {dp_diff:.3f}")
# Closer to 0 = more fair (by this definition)

Step 2: Visualize disparities

Fairlearn's MetricFrame computes any set of metrics separately for each group, which is the quickest way to see how your model performs across them.

from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, precision_score, recall_score

metrics = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score
}

metric_frame = MetricFrame(
    metrics=metrics,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

# See performance broken down by group
print(metric_frame.by_group)

This prints accuracy, precision, and recall side by side for each group, so a gap that the aggregate score hides becomes obvious at a glance.

Step 3: Apply mitigation techniques

Fairlearn offers algorithms to reduce unfairness:

Pre-processing: Transform the features before training, for example with CorrelationRemover
Reduction approaches: Constrain the model during training to satisfy fairness constraints
Post-processing: Adjust predictions after the model is trained, for example with group-specific decision thresholds that equalize a fairness metric (ThresholdOptimizer)

from fairlearn.postprocessing import ThresholdOptimizer

# Optimize thresholds to achieve equalized odds
postprocess_est = ThresholdOptimizer(
    estimator=your_model,
    constraints="equalized_odds",
    prefit=True
)

postprocess_est.fit(X_train, y_train, sensitive_features=sensitive_train)
y_pred_fair = postprocess_est.predict(X_test, sensitive_features=sensitive_test)

Mitigation has costs
Fairness mitigation almost always reduces some other metric, often accuracy. This is not a bug; it's a fundamental trade-off. You're explicitly trading overall predictive performance for more equitable outcomes across groups. This is a values decision, not a technical one.

The Fairness-Accuracy Trade-off in Practice

Here's what the trade-off looks like in practice. Say you have a loan approval model:

Scenario	Overall Accuracy	Approval Gap (Group A vs B)
Baseline	85%	25 percentage points
After mitigation	82%	8 percentage points

You've reduced the approval gap from 25 points to 8 points, but accuracy dropped 3 points. Is that worth it? The answer depends on:

How much harm does the disparity cause?
What are the consequences of reduced accuracy?
What do stakeholders value?

There's no formula to answer these questions. They're ethical choices that require human judgment.
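If you want to see the mechanics behind a table like that, here is a toy sketch. It is not Fairlearn's algorithm, just the underlying idea of group-specific thresholds, and the applicants, scores, and threshold values are all invented for illustration.

```python
def evaluate(applicants, thresholds):
    """Return (accuracy, approval_rate_by_group) for per-group score thresholds."""
    correct = 0
    approved, totals = {}, {}
    for group, score, repaid in applicants:
        decision = int(score >= thresholds[group])   # 1 = approve the loan
        correct += int(decision == repaid)
        approved[group] = approved.get(group, 0) + decision
        totals[group] = totals.get(group, 0) + 1
    rates = {g: approved[g] / totals[g] for g in totals}
    return correct / len(applicants), rates

# (group, model score, actually repaid) - hypothetical data
applicants = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.7, 1), ("A", 0.6, 0), ("A", 0.4, 0),
    ("B", 0.7, 1), ("B", 0.55, 1), ("B", 0.45, 0), ("B", 0.3, 0), ("B", 0.2, 0),
]

baseline = evaluate(applicants, {"A": 0.5, "B": 0.5})    # one threshold for everyone
mitigated = evaluate(applicants, {"A": 0.5, "B": 0.4})   # lower bar for group B

for label, (accuracy, rates) in [("baseline", baseline), ("mitigated", mitigated)]:
    gap = abs(rates["A"] - rates["B"])
    print(f"{label}: accuracy={accuracy:.2f}, approval gap={gap:.2f}")
```

Lowering group B's threshold narrows the approval gap and costs one correct prediction. Whether that trade is acceptable is exactly the judgment call described above.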

Advanced: Societal Trade-offs and Formal Mitigation

Here's where it gets genuinely uncomfortable: the FATP pillars sometimes conflict with each other, and the fairness definitions from the previous section are mathematically incompatible. No clever engineering eliminates the trade-off. You have to choose.

The Impossibility Theorem

In 2016, researchers proved that three common fairness definitions (calibration, balance for the positive class, and balance for the negative class) cannot all be satisfied simultaneously unless the base rates are equal across groups or the classifier is perfect.

This is known as the impossibility theorem. In any real-world scenario with different base rates, you will violate at least one reasonable-sounding fairness criterion. That's not a model flaw. It's a mathematical limit.

Example: In criminal recidivism prediction, if one group actually has a higher base rate of re-offending (due to systemic factors like poverty, lack of opportunity, etc.), then:

Equalizing prediction rates (demographic parity) will over-predict risk for the lower-base-rate group
Equalizing false positive rates will under-serve the higher-base-rate group
Equalizing calibration will result in different prediction rates

You literally cannot have all three. Which do you choose?
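You can watch the conflict happen with a few lines of arithmetic. The sketch below uses a closely related identity rather than the exact statement of the 2016 result: it follows directly from the definitions of precision (PPV), true positive rate, and false positive rate. The function name and the numbers are my own illustration.

```python
def implied_fpr(base_rate, ppv, tpr):
    """False positive rate forced by a group's base rate, precision (PPV) and TPR.

    Start from PPV = TP / (TP + FP), with TP proportional to tpr * base_rate
    and FP proportional to fpr * (1 - base_rate), then solve for fpr.
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr

# Hypothetical setup: identical precision and recall for both groups
ppv, tpr = 0.7, 0.6
for name, base_rate in [("group with 50% base rate", 0.5),
                        ("group with 30% base rate", 0.3)]:
    print(f"{name}: FPR must be {implied_fpr(base_rate, ppv, tpr):.3f}")
```

Hold precision and recall equal across groups, and different base rates force different false positive rates. The only ways out are equal base rates or a perfect classifier.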

A Framework for Trade-off Decisions

When facing ethical trade-offs, I use this framework:

1. Identify the stakeholders

Who benefits from the AI system?
Who bears the risks?
Who has no voice in the decision?

2. Map the harms

What happens if the system is unfair by metric A?
What happens if it's unfair by metric B?
Which harms are reversible? Which are permanent?

3. Consider power asymmetries

Does the system affect vulnerable populations disproportionately?
Do affected individuals have recourse?
Who profits from the system vs. who bears the risk?

4. Make and document the choice

Which fairness criterion are you prioritizing and why?
What are you explicitly trading off?
How will you monitor for unintended consequences?

The 2026 Regulatory Reality

Ethics used to be mostly voluntary. As of mid-2026, the FATP pillars increasingly map to legal obligations, so documenting your choices is also how you stay compliant.

EU AI Act: The bans on prohibited practices (since February 2025) and the obligations for general-purpose AI model providers (since 2 August 2025) are live and unchanged. But the headline 2 August 2026 deadline for high-risk systems has moved: under the "Digital Omnibus" that the Council and Parliament provisionally agreed on 7 May 2026 (formal adoption expected before August), Annex III high-risk obligations are deferred to 2 December 2027 and AI embedded in regulated products to 2 August 2028. The deferral is a pragmatic admission that the supporting standards and infrastructure weren't ready, not a softening of the substance. Penalties for prohibited practices still reach up to EUR 35 million or 7% of global turnover, and the voluntary General-Purpose AI Code of Practice (published July 2025, signed by Anthropic, Google, Microsoft, OpenAI, and others) still operationalizes the transparency and safety expectations.
US state laws are in flux. Colorado's pioneering AI Act (SB 24-205) targeted "algorithmic discrimination" in high-risk systems, but it never took effect: its start date was pushed to 30 June 2026, then the whole framework was repealed and replaced by SB 26-189 (signed 14 May 2026), a narrower transparency-and-consumer-rights law that takes effect 1 January 2027. The lesson for builders: regulation is moving fast and unevenly, so design to the principles, not to a single statute.
Voluntary standards as a baseline. ISO/IEC 42001 (the first certifiable AI management-system standard) and the NIST Generative AI Profile (AI 600-1) have become the de facto frameworks organizations adopt to demonstrate due care, increasingly as a procurement requirement for vendors.

Compliance follows ethics, not the other way around
If you've already worked through the FATP checklist (fairness testing, accountability ownership, transparency documentation, and privacy controls), you're most of the way to satisfying the EU AI Act, ISO/IEC 42001, and the NIST profile. Teams that bolt compliance on at the end pay for it the expensive way: retraining, scrapped work, legal exposure, and lost trust, the kind of bill the case studies above all ran up.

Writing a Formal Mitigation Report

For high-stakes AI systems, document your ethical analysis formally. Here's a template:

# Ethical Mitigation Report: [System Name]

## 1. System Description
- Purpose and intended use
- Affected populations
- Decision types (recommendations, predictions, automatic actions)

## 2. Fairness Analysis

### Metrics Assessed
| Metric | Definition | Result |
|--------|------------|--------|
| Demographic Parity | Equal selection rates | 0.15 gap |
| Equalized Odds | Equal TPR/FPR | 0.08 gap |

### Groups Analyzed
- [Group A vs Group B]
- [Other relevant comparisons]

## 3. Trade-off Analysis
- [Fairness metric A] vs [Fairness metric B]: We prioritized [A] because [reasoning]
- Accuracy impact: [X]% reduction in overall accuracy

## 4. Mitigation Applied
- Technique: [ThresholdOptimizer / Reductions / etc.]
- Parameters: [settings]
- Resulting metrics: [post-mitigation numbers]

## 5. Residual Risks
- [Remaining disparities]
- [Known limitations]

## 6. Monitoring Plan
- Metrics tracked post-deployment
- Alerting thresholds
- Review frequency

## 7. Accountability
- System owner: [name]
- Ethics review: [approver]
- Redress contact: [process for affected individuals]

Make it a living document
This isn't a one-time report. Update it as the system evolves, as you gather production data, and as your understanding deepens. Ethical analysis is iterative, not waterfall.

The FATP Checklist

Before deploying any AI system, work through this checklist:

Fairness

[ ] Identified protected groups relevant to the use case
[ ] Measured outcomes across demographic groups
[ ] Chosen and documented primary fairness metric
[ ] Applied mitigation if disparities exceeded threshold
[ ] Tested for bias using data representative of production

Accountability

[ ] Assigned clear ownership for system outcomes
[ ] Defined human oversight requirements for high-stakes decisions
[ ] Established redress mechanism for affected individuals
[ ] Created audit trail for decisions
[ ] Documented escalation path for ethical concerns

Transparency

[ ] Documented model limitations and known failure modes
[ ] Provided explanation capability appropriate to stakeholders
[ ] Disclosed AI involvement to affected parties
[ ] Made system auditable by authorized parties
[ ] Published model card or equivalent documentation

Privacy

[ ] Minimized data collection to what's necessary
[ ] Implemented appropriate data retention policies
[ ] Provided individual access and deletion rights
[ ] Assessed re-identification risks
[ ] Protected against training data extraction

The Honest Summary

AI ethics comes down to building systems you can defend when things go wrong. Systems that don't compound existing inequities at scale. Being a good person helps, but good intentions don't survive contact with bad design choices.

What works:

Treating ethics as a design constraint, not a post-launch audit
Using concrete tools like Fairlearn to measure and mitigate bias
Documenting trade-offs explicitly rather than hiding them
Building accountability structures before you need them

What's hard:

Navigating mathematically incompatible fairness definitions
Convincing stakeholders that ethical constraints are worth the accuracy trade-off
Detecting bias from proxy variables you didn't know were proxies
Maintaining ethical vigilance as systems evolve

The teams that get AI ethics right share one trait: they ask "who could this hurt?" before they ask "how fast can we ship?" That question is the design constraint. Catch the answer early and you're fixing a decision. Catch it in production and you're fixing a lawsuit.

Next up in Module 8: AI Safety and Security, where we tackle the technical vulnerabilities that make ethical AI possible or impossible.

Quick Reference

Pillar	Key Question	Primary Tool
Fairness	Are outcomes equitable across groups?	Fairlearn, AIF360
Accountability	Who's responsible when things go wrong?	RACI matrix, audit trails
Transparency	Can stakeholders understand decisions?	SHAP, model cards
Privacy	Is data collection minimized and protected?	Privacy by design

FAQs

Q: Our company doesn't have an AI ethics team. How do we get started?

You don't need a dedicated team to start practicing AI ethics. Begin with the FATP checklist for your next AI project. Assign an "ethics owner" (could be the tech lead or PM) who ensures the checklist gets attention. Run fairness metrics on your existing systems to establish baselines. The goal isn't perfection from day one; it's building the muscle of asking ethical questions consistently. As you mature, you might invest in dedicated roles, but many organizations practice effective AI ethics with distributed responsibility.

Q: Doesn't focusing on fairness reduce model accuracy? How do I justify that to stakeholders?

Yes, fairness interventions often reduce overall accuracy. Frame it this way: What's the cost of the current unfairness? If your hiring algorithm systematically excludes qualified candidates from certain groups, you're leaving talent on the table. If your loan algorithm denies creditworthy applicants unfairly, you're losing good business. Calculate the cost of false negatives across groups, not just aggregate accuracy. Often, "reducing accuracy" means "reducing accuracy for the majority group while improving it for minority groups." The aggregate number goes down, but the system becomes more useful for more people.

Q: How do I know which fairness metric to prioritize?

Start with the harms you're trying to prevent. If the main concern is equal access (loan approvals, hiring), demographic parity matters more. If the concern is equal treatment of actual positives (medical diagnosis), equalized odds matters more. If you need the predictions to mean the same thing across groups (risk scores), calibration matters. There's no universal answer; it depends on the domain, the stakes, and the values of your organization. The key is to make the choice explicitly and document your reasoning.

Read the Full Curriculum

This piece is one post in my AI Fluency Curriculum, where I document what I'm learning about building and shipping AI responsibly. The full version on BlockSimplified includes an interactive quiz, linked Learning Blocks for the key terms, and a curated resource list. If ethics-as-design-constraint resonated, read the full article and the rest of the series.

Evaluating LLM Systems: Metrics, Methods, and Scorecards

Vaibhav Doddihal — Thu, 18 Jun 2026 13:45:13 +0000

Originally published on BlockSimplified — 11 min read

This post is part of the AI Fluency series, where I document my learnings around applied AI concepts. The goal is to help you build practical skills you can apply in real projects.

Here is the hard truth about LLM development: most teams ship without proper evaluation. They run a few manual tests, the outputs "look good," and they call it done. Then users start complaining about weird responses, and suddenly nobody knows if the problem is the prompt, the model, or the retrieval pipeline.

I have been there. Early in my LLM projects, I would tweak a prompt, eyeball a few outputs, and deploy. It felt productive. But when something broke in production, I had no baseline to compare against. Did the new prompt actually help? Was the model always this bad at edge cases? No idea.

Evaluation & Testing is not just about catching bugs. It is your compass for improvement. Without systematic evaluation, you are navigating by feel in a space where intuition often fails.

Why Evaluation is Hard (And Why Most Teams Skip It)

LLMs are not like traditional software. When you test a function that adds two numbers, the expected output is clear. With LLMs, the "correct" answer is subjective, context-dependent, and often impossible to define precisely.

Consider this: you ask an LLM to summarize an article. There are dozens of valid summaries. Some focus on the main argument, others on supporting details. Some are formal, others conversational. How do you score that?

This ambiguity leads teams to skip evaluation entirely. It feels like too much work for uncertain benefit. But skipping evaluation means you are:

Flying blind when making prompt changes
Unable to compare models objectively
Missing regressions that hurt users
Building on a foundation you cannot trust

The good news: evaluation does not have to be perfect to be useful. Even rough metrics beat no metrics. Let me show you how to build a practical evaluation system.

The Evaluation Pyramid: Three Levels of Rigor

I think about LLM evaluation as a pyramid with three levels. Each level trades off between accuracy and scalability.

Level 1: Human Evaluation (The Gold Standard)

Human Evaluation is the most accurate but least scalable. Real people assess real outputs against criteria like helpfulness, accuracy, and tone.

When to use it:

Validating that your automated metrics correlate with actual quality
Evaluating subjective criteria like "does this sound natural?"
High-stakes applications where errors are costly
Creating the initial labels for your golden dataset

How to do it well:

Define clear criteria. Vague instructions like "rate quality" lead to inconsistent scores. Instead, specify: "Rate helpfulness from 1-5, where 1 means the response does not address the question at all, and 5 means it fully answers with actionable details."
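Use multiple annotators. At minimum, have 3 people rate each response. Calculate inter-rater agreement with Cohen's Kappa for each pair of annotators, or Fleiss' Kappa across all of them (Cohen's Kappa is defined for exactly two raters). If agreement is low (below 0.6), your guidelines need work.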
Include calibration examples. Show annotators examples of responses at each score level before they start. This anchors their judgments.
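Agreement is simple enough to compute by hand. Here is a minimal sketch of Cohen's kappa for one pair of annotators, with invented scores; with three or more annotators you would run it per pair and average, or switch to Fleiss' kappa.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 helpfulness scores from two annotators on eight responses
annotator_1 = [5, 4, 4, 2, 1, 3, 5, 2]
annotator_2 = [5, 4, 3, 2, 1, 3, 4, 2]

print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Note that this treats a 4-versus-5 disagreement the same as a 1-versus-5 one. For ordinal scales, a weighted kappa is the usual refinement.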

The practical reality: Human evaluation is expensive. You cannot have humans review every response in production. That is why we need automated methods.

Level 2: LLM-as-a-Judge (Scalable Quality Assessment)

LLM-as-a-Judge uses a capable model to evaluate outputs from your system. It is faster and cheaper than humans while being more nuanced than simple metrics.

The basic pattern:

judge_prompt = """
You are an expert evaluator. Rate the following response on a scale of 1-5
for HELPFULNESS.

Scoring rubric:
1 - Does not address the question
2 - Partially addresses but missing key information
3 - Addresses the question but could be clearer
4 - Good answer with minor room for improvement
5 - Excellent, comprehensive answer

User question: {question}
Response to evaluate: {response}

Provide your score and a brief justification.
"""

Key considerations:

Use a stronger model as judge. If you are evaluating GPT-3.5 outputs, use GPT-4 as the judge. The judge should be at least as capable as the model being evaluated.
Validate against human labels. Run your judge on a set where you have human scores. If correlation is below 0.7, refine your rubric.
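Watch for biases. LLM judges tend to prefer longer responses (verbosity bias), can be swayed by the order in which answers are presented (position bias), and may favor outputs that resemble their own style. Check for these patterns in your evaluations.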
Use reference-guided judging when possible. Providing the judge with a reference answer improves consistency.
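Validating the judge can be sketched in a few lines. This example assumes you have asked the judge to start its reply with a line like "Score: 4" (an assumption on my part; the prompt above only asks for a score and justification), then compares the parsed scores with human labels using Pearson correlation. The replies and human scores are invented.

```python
import re

def parse_score(judge_output):
    """Extract a 1-5 score from a reply containing 'Score: N', or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical judge replies and human ratings for the same five responses
judge_replies = [
    "Score: 5\nClear and complete.",
    "Score: 2\nMisses the main question.",
    "Score: 4\nGood, slightly verbose.",
    "Score: 3\nCorrect but vague.",
    "Score: 1\nOff topic.",
]
human_scores = [5, 1, 4, 4, 2]

judge_scores = [parse_score(reply) for reply in judge_replies]
correlation = pearson(judge_scores, human_scores)
print(f"judge vs human correlation: {correlation:.2f}")
```

Five examples is far too few to trust the number; it only shows the shape of the check. Treat unparseable replies (where `parse_score` returns None) as failures to investigate, not as zeros.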

Level 3: Automated Metrics (Fast but Limited)

Automated Evaluation Metrics are the fastest and cheapest option. They compute scores algorithmically without any LLM calls.

Traditional NLP metrics:

Metric	What it measures	Good for
BLEU	N-gram overlap with reference	Translation
ROUGE	Recall of reference n-grams	Summarization
BERTScore	Semantic similarity via embeddings	General text comparison
Exact Match	String equality	Factoid QA with single correct answer

The problem: These metrics measure surface-level similarity, not actual quality. A response could be helpful and accurate but score poorly because it uses different words than the reference.
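Here is that failure in miniature. The sketch below uses a deliberately simplified word-overlap recall in the spirit of ROUGE-1 (real implementations add proper tokenization, stemming, and more), with function names and sentences I made up.

```python
from collections import Counter

def exact_match(prediction, reference):
    """True when the two strings are equal, ignoring case and outer whitespace."""
    return prediction.strip().lower() == reference.strip().lower()

def unigram_recall(prediction, reference):
    """Share of reference words that also appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, pred_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "restart the server to apply the new configuration"
paraphrase = "reboot the machine so the updated settings take effect"

print(exact_match(paraphrase, reference))
print(f"{unigram_recall(paraphrase, reference):.2f}")
```

The paraphrase says the same thing, yet the only words it shares with the reference are the two occurrences of "the". An embedding-based metric or an LLM judge would score it much more sensibly.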

When automated metrics work:

Tasks with clearly correct answers (math, coding with test cases)
Detecting obvious failures (empty responses, errors)
Tracking trends over time (even imperfect metrics show direction)

RAG-Specific Evaluation: The Metrics That Matter

If you are building a Retrieval-Augmented Generation system, generic evaluation is not enough. You need metrics that assess both retrieval quality and generation quality.

Retrieval Metrics

Before the LLM even sees the context, did you retrieve the right documents?

Context Precision: Of the documents retrieved, how many were actually relevant?
Context Recall: Of all relevant documents in your corpus, how many did you retrieve?
Recall@K / Precision@K: Versions of the above limited to the top K results

Why this matters: If retrieval fails, even the best LLM cannot give good answers. Always evaluate retrieval independently.
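Retrieval metrics need nothing more than a ranked list of document IDs and a set of IDs you have labeled as relevant. A minimal sketch, with hypothetical document IDs and helper names of my own:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(top_k) if top_k else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Ranked retriever output and hand-labeled relevant documents (hypothetical)
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_5"}

for k in (3, 5):
    print(f"k={k}: precision={precision_at_k(retrieved, relevant, k):.2f}, "
          f"recall={recall_at_k(retrieved, relevant, k):.2f}")
```

Here `doc_5` is never retrieved, so recall cannot reach 1.0 no matter how large K gets. That is a retrieval problem no prompt change will fix.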

Generation Metrics (RAG-specific)

These metrics assess the LLM's output given the retrieved context:

Faithfulness: Does the response stick to what is in the context? This catches hallucinations where the model makes up facts not supported by the retrieved documents.

Answer Relevance: Does the response actually answer the user's question? A response could be faithful to the context but still miss what the user asked.

Answer Correctness: Is the response factually correct? This compares against a ground truth answer if available.

RAGAS Framework

The RAGAS (Retrieval Augmented Generation Assessment) framework provides a structured approach to these metrics. It uses LLM-as-a-Judge internally to score each dimension.

# Conceptual example - actual RAGAS API may differ
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=your_test_data,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

Agent Evaluation: Beyond Single-Turn Responses

Evaluating AI Agents is a different challenge. Agents take multiple steps, use tools, and their success depends on achieving a goal, not just producing good text.

Goal Completion Rate

Did the agent accomplish what the user asked? For a travel planning agent, did it actually book the flight? For a research agent, did it find the information?

This is a binary metric (success/failure) but incredibly important. An agent that produces fluent text but fails to complete tasks is useless.

Tool-Use Accuracy

When the agent decides to use a tool, does it:

Choose the right tool for the situation?
Provide correct parameters?
Use tool results appropriately?

Track each of these separately. You might find your agent is good at choosing tools but bad at formatting parameters.

Trajectory Analysis

For multi-step tasks, examine the full trajectory:

How many steps did it take? (Efficiency)
Did it recover from errors? (Robustness)
Did it take unnecessary detours? (Planning quality)

Safety Violation Rate

Especially important for agents with real-world actions. Did the agent ever:

Attempt unauthorized actions?
Leak sensitive information?
Violate explicit constraints?

Even a 0.1% violation rate is too high for production agents with meaningful capabilities.
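These agent metrics roll up naturally from per-run records. The sketch below assumes a logging shape I invented (`AgentRun`); your own traces will look different, but the aggregation is the same idea.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    goal_completed: bool
    tool_calls: int           # total tool invocations in the run
    correct_tool_calls: int   # right tool chosen with valid parameters
    safety_violations: int    # unauthorized actions, leaks, broken constraints

def summarize(runs):
    """Aggregate per-run records into the agent metrics described above."""
    total_calls = sum(run.tool_calls for run in runs)
    return {
        "goal_completion_rate": sum(run.goal_completed for run in runs) / len(runs),
        "tool_use_accuracy": sum(run.correct_tool_calls for run in runs) / total_calls
        if total_calls else 0.0,
        # Share of runs with at least one violation
        "safety_violation_rate": sum(run.safety_violations > 0 for run in runs) / len(runs),
    }

# Hypothetical evaluation runs
runs = [
    AgentRun(True, 4, 4, 0),
    AgentRun(True, 6, 5, 0),
    AgentRun(False, 3, 1, 1),
    AgentRun(True, 2, 2, 0),
]

summary = summarize(runs)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```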

Building Your Evaluation Scorecard

A scorecard brings all your metrics together in one view. It tells you at a glance whether your system is healthy.

What to Include

Core metrics (track always):

Overall quality score (LLM-as-Judge, 1-5)
Faithfulness (for RAG)
Goal completion rate (for agents)
Safety violation rate

Diagnostic metrics (dig in when core metrics drop):

Context precision/recall (RAG retrieval health)
Tool-use accuracy (agent capability)
Latency and token usage (operational health)

The Golden Dataset

A Golden Dataset is your foundation for reliable evaluation. It is a curated set of inputs with verified expected outputs.

How to build one:

Start with real queries. Pull from production logs (anonymized). These represent actual user needs.
Include edge cases. Add queries that have caused failures. These are your regression tests.
Get expert verification. Have domain experts validate or write reference answers.
Keep it manageable. 50-100 high-quality examples beat 1,000 sloppy ones. Quality over quantity.

How to use it:

Run your golden dataset after any change:

New prompt version? Run the golden dataset.
Model upgrade? Run the golden dataset.
Retrieval pipeline change? Run the golden dataset.

Compare scores to your baseline. If quality drops, investigate before deploying.
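The baseline comparison itself can be a very small function. This sketch assumes your evaluation run produces a flat dictionary of metric scores between 0 and 1; the metric names, scores, and the 0.02 tolerance are placeholders, not recommendations.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance` versus the baseline.

    Metrics missing from `current` are skipped here; a stricter check could flag them.
    """
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }

# Hypothetical scores from two golden-dataset runs
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88, "judge_quality": 0.84}
current = {"faithfulness": 0.86, "answer_relevance": 0.89, "judge_quality": 0.83}

regressions = find_regressions(baseline, current)
for name, (before, after) in regressions.items():
    print(f"{name} regressed: {before:.2f} -> {after:.2f}")
```

The tolerance exists because LLM outputs are noisy: a 0.01 wobble between runs usually means nothing, while a 0.05 drop deserves a look before you deploy.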

Automated Regression Testing

Integrate golden dataset evaluation into your CI/CD pipeline:

# Conceptual GitHub Actions workflow
name: LLM Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run golden dataset evaluation
        run: python evaluate.py --dataset golden_set.json

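      - name: Check quality thresholds
        run: |
          # Inside [ ], "<" is a file redirect and the shell cannot compare floats,
          # so let jq do the comparison; -e makes it exit non-zero on false.
          if ! jq -e '.faithfulness >= 0.85' results.json > /dev/null; then
            echo "Faithfulness dropped below threshold"
            exit 1
          fi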

Practical Implementation: Where to Start

If you are just getting started with LLM evaluation, here is my recommended sequence:

Week 1: Create Your Golden Dataset

Pull 30-50 representative queries from production or brainstorming
Write reference answers for each
Include 10+ edge cases or known failure scenarios

Week 2: Set Up LLM-as-a-Judge

Create a judge prompt for your primary quality criterion
Run it on your golden dataset
Manually review judge outputs to check reasonableness

Week 3: Validate and Iterate

Have humans rate a subset of the same responses
Compare human scores to judge scores
Refine your judge prompt until correlation is decent (aim for 0.7+)

Week 4: Automate

Integrate evaluation into your deployment process
Set quality thresholds that block bad deploys
Create a dashboard to track metrics over time

Ongoing: Expand and Maintain

Add new examples to golden dataset as you find failures
Add metrics for new dimensions (safety, latency, etc.)
Review and update quarterly

Common Mistakes to Avoid

Optimizing for the metric, not the goal. Your metric is a proxy for quality, not quality itself. If you tune prompts to maximize your judge scores, you might overfit to the judge's preferences rather than actual user needs.

Too few examples in golden dataset. You need coverage of your use cases. Fifty examples is a minimum; one hundred is better. But focus on quality and diversity, not raw quantity.

Not validating your judge. An LLM judge can have systematic biases. Always check correlation with human judgment before trusting it.

Evaluating in isolation. A component might score well individually but fail in the full pipeline. Test end-to-end, not just pieces.

Static evaluation sets. Your application evolves. Your evaluation set should too. Review and update regularly.

Key Concepts

Evaluation & Testing
Evaluation Metrics
LLM-as-a-Judge
Human Evaluation
Golden Dataset
Retrieval-Augmented Generation
AI Agents
Hallucinations

What is Generative AI? A Practical Introduction

Vaibhav Doddihal — Thu, 18 Jun 2026 13:44:35 +0000

Originally published on BlockSimplified — 13 min read

Welcome to the AI Fluency Curriculum, a series I'm building to help engineers and technical folks get genuinely comfortable with applied AI. Not the hype. The actual mechanics.
This is the first post in Module 1: Foundations of Generative AI.

I remember the first time I got ChatGPT to write a bash script for me. I'd described what I needed in plain English, and out came working code. My first reaction: "How does it know this?" My second reaction: "Wait, that variable name is wrong." That tension between impressive capability and subtle wrongness is what we're going to unpack.

What You'll Learn

By the end of this post, you'll be able to:

Explain GenAI in simple terms (to your manager, your team, your confused relatives)
Differentiate it from traditional software and predictive AI
Identify real capabilities and limitations, not just the marketing version

We'll cover three depth levels: Beginner, Intermediate, and Advanced. Skip around based on what you need.

Beginner: GenAI as "Autocomplete on Steroids"

Let me start with an analogy that helped me get it.

The Restaurant Analogy

Imagine a restaurant kitchen:

Traditional Software is like a recipe book. You give it inputs (ingredients), it follows exact steps, you get a predictable output. Same input = same dish. Every. Single. Time. A calculator works this way. Your banking app works this way.

Predictive AI (the old kind) is like a sommelier who looks at your order and predicts: "Based on customers who ordered the lamb, you'll probably want the Malbec." It classifies, predicts, and recommends, but it doesn't create anything new.

Generative AI is like a chef who's eaten at thousands of restaurants, read millions of recipes, and watched countless cooking shows. Give them a prompt ("I want something spicy, Italian-inspired, but with Thai flavors") and they'll generate something entirely new. Sometimes brilliant. Sometimes... experimental.

The key insight: GenAI doesn't look up answers. It generates them by predicting what tokens (words, code, pixels) should come next, based on patterns learned from massive training data.

Your First API Call

Let's stop talking and actually run something. Here's a minimal Python example using OpenAI's API:

# genai_hello.py
# Your first Generative AI API call
# Requires: pip install openai

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain Generative AI in one sentence, like I'm a software engineer."}
    ]
)

print(response.choices[0].message.content)

What's happening:

You send a prompt (the "user" message)
The model processes it through billions of parameters
It generates a response, token by token
You get back text that didn't exist before your request

Run this a few times. Notice how the response varies slightly each time? That's not a bug. It's the core mechanic.

What GenAI Can (and Can't) Do

Genuine capabilities I use daily:

Drafting documentation, emails, technical specs
Explaining unfamiliar code or concepts
Generating boilerplate code (with review!)
Brainstorming approaches to problems
Summarizing long documents

Real limitations that have bitten me:

Hallucinations: confidently wrong answers that sound perfect
No actual reasoning: it's pattern matching, not thinking
Knowledge cutoffs: models don't know recent events
Inconsistency: same prompt can yield different quality outputs
Context limits: can't read your entire codebase (yet)

Intermediate: The Paradigm Shift from Deterministic to Probabilistic

Here's where things get interesting. If you've been writing software for a while, you've internalized a core assumption: same input → same output.

GenAI breaks that contract.

Why It's Probabilistic

Under the hood, LLMs work by:

Tokenizing your input (breaking text into chunks)
Processing tokens through neural network layers
For each position, calculating probability distributions over ALL possible next tokens
Sampling from that distribution to pick the actual next token
Repeating until done

The output isn't retrieved from a database. It's constructed on the fly, one token at a time.
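The sample-and-repeat loop above can be sketched with a toy example. This is not how a real LLM computes probabilities: `TOY_MODEL` is a hypothetical hand-written lookup table standing in for the neural network, and tokenization is skipped entirely. It only illustrates the "pick the next token from a distribution, then repeat" part.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real LLM computes these with a neural network over a very large vocabulary.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}

def generate(rng: random.Random, max_tokens: int = 10) -> list[str]:
    """Build a sequence one token at a time by sampling from the distribution."""
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        dist = TOY_MODEL[current]
        # Sample the next token in proportion to its probability
        current = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if current == "<end>":
            break
        tokens.append(current)
    return tokens

rng = random.Random()  # unseeded, so each run can differ
for _ in range(3):
    print(" ".join(generate(rng)))
```

Run it a few times and you get different sentences from the same starting point, which is the same behavior you saw with the API call.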

The Temperature Parameter: Your Control Dial

Here's the single most important parameter you should understand: temperature.

# temperature_demo.py
# See how temperature affects output variability

from openai import OpenAI

client = OpenAI()

prompt = "Write a one-sentence description of what Python is."

for temp in [0, 0.5, 1.0, 1.5]:
    print(f"\n--- Temperature: {temp} ---")
    for i in range(3):  # Run 3 times to see variance
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=temp,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"  {i+1}: {response.choices[0].message.content}")

What you'll observe:

Temperature 0: Nearly identical outputs every run. The model effectively always picks the highest-probability token, though small run-to-run differences can still show up.
Temperature 0.5: Slight variations, still coherent
Temperature 1.0: More creative, occasional surprises
Temperature 1.5: Wild variations, sometimes off the rails

Think of temperature like the spice level at a restaurant. Zero is the safe, house recipe every time. Higher values let the chef improvise, sometimes inspired, sometimes questionable.
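If you want to see why the dial behaves this way, here is a small sketch of the usual textbook formulation: divide the model's raw scores (logits) by the temperature before turning them into probabilities. The logits are made up, and treating temperature 0 as "always take the top token" is my simplifying assumption, not a statement about how any particular API implements it.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy decoding, all probability on the top token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in [0, 0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Low temperatures pile probability onto the top-scoring token. Higher temperatures flatten the distribution, so less likely tokens get sampled more often.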

When to Trust the Output

This probabilistic nature means you can't treat GenAI outputs like database queries. Here's my mental model:

Task Type	Trust Level	Verification Approach
Brainstorming	High	None needed
Drafting	Medium	Human review
Code generation	Low-Medium	Tests + code review
Factual claims	Very Low	Always verify sources
Critical decisions	None	Don't delegate these

The chef analogy again: You'd happily let them experiment with appetizer specials, but you'd want to taste-test before serving to customers, and you'd never let them guess at food allergy information.

Advanced: Transformers and Emergent Abilities

Alright, let's pop the hood. If you're comfortable with software architecture, this section explains how these systems actually work.

The Transformer Architecture (The Short Version)

Before 2017, sequence models like RNNs processed text token-by-token, like reading a book one word at a time while trying to remember everything. Slow, and information from early in the sequence got fuzzy.

The transformer architecture (from the "Attention Is All You Need" paper) introduced a radical idea: process all tokens in parallel using something called "attention."

Attention in plain terms: Instead of reading sequentially, the model can directly look at relationships between ANY two tokens in the input. When processing "The cat sat on the mat because it was tired," attention lets the model directly connect "it" to "cat" rather than hoping that connection survives through sequential processing.
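Here is a stripped-down sketch of scaled dot-product attention in plain Python. The 2-dimensional token vectors are invented for illustration, and I use them directly as queries, keys, and values. Real transformers learn separate projection matrices for each, and run many attention heads in parallel.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # how much this token "looks at" each token
        # Output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors: "it" is deliberately placed close to "cat"
tokens = ["cat", "mat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

_, weights = attention(vectors, vectors, vectors)
for token, w in zip(tokens, weights[2]):
    print(f"'it' attends to '{token}': {w:.2f}")
```

Because every token scores itself against every other token directly, the link from "it" to "cat" does not depend on how far apart they are in the sentence.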

Why this matters for you:

Parallel processing → trainable on massive datasets
Attention patterns → models can handle long-range dependencies
Stacking transformer layers → each layer learns more abstract patterns

The models you're using (GPT-4, Claude, Gemini) are just really big stacks of transformer blocks, trained on really big datasets, with clever fine-tuning.

Emergent Abilities: The Weird Part

Here's something that still surprises me: abilities that "emerge" at scale without being explicitly trained.

When you train small models, they get gradually better at their training task. But at certain scale thresholds, capabilities appear that weren't in the training objective:

Chain-of-thought reasoning
Following complex multi-step instructions
In-context learning (learning from examples in the prompt)
Code debugging and generation

Nobody trained GPT-4 on "how to debug Python code." It emerged from training on enough text that contained code discussions, Stack Overflow answers, and technical documentation.

This is both exciting and concerning. Exciting because we get useful capabilities "for free." Concerning because we don't fully understand when or why they emerge, or when they might fail.

Stress Test: Long-Context Degradation

Let's run an experiment that exposes real limitations. Models advertise large context windows (100K+ tokens), but performance isn't uniform across that window.

# long_context_stress_test.py
# Test the "Lost in the Middle" phenomenon

from openai import OpenAI

client = OpenAI()

def test_retrieval_position(needle_position: str):
    """
    Hide a fact in different positions within a long context
    and test if the model can retrieve it.
    """

    # The "needle" - a specific fact to retrieve
    needle = "The secret project code name is AURORA-7."

    # "Haystack" - filler paragraphs about various topics
    filler = """
    Cloud computing has transformed how organizations deploy applications. 
    The shift from on-premise servers to managed cloud services has enabled 
    rapid scaling and reduced operational overhead. Major providers include 
    AWS, Azure, and Google Cloud Platform, each with distinct strengths.
    """ * 20  # Repeat to create bulk

    # Construct the context based on position
    if needle_position == "start":
        context = needle + "\n\n" + filler
    elif needle_position == "middle":
        half = len(filler) // 2
        context = filler[:half] + "\n\n" + needle + "\n\n" + filler[half:]
    else:  # end
        context = filler + "\n\n" + needle

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "user", "content": f"""
Here is a document:

{context}

Question: What is the secret project code name?
Answer with just the code name, nothing else.
"""}
        ]
    )

    return response.choices[0].message.content

# Test all positions
for pos in ["start", "middle", "end"]:
    result = test_retrieval_position(pos)
    print(f"Needle at {pos}: Retrieved '{result}'")

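What you'll likely see: At this size (20 repeats of a short paragraph, on the order of a thousand or two tokens), most current models will likely retrieve the needle from every position. To see the effect, raise the repeat count so the context grows to tens of thousands of tokens, and run each position several times. At that scale, models tend to perform better when the key information is at the start or end of the context, and worse when it's buried in the middle. This is the Lost in the Middle phenomenon, and it has real implications for how you structure prompts and RAG systems.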
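One way to act on this in a RAG pipeline is to reorder retrieved chunks so the most relevant ones sit at the edges of the prompt. The sketch below assumes your retriever already returns chunks sorted most relevant first. `reorder_for_long_context` is a hypothetical helper name, not a library function.

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Put the most relevant chunks at the start and end, least relevant in the middle.

    Expects the input sorted most relevant first.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4, 6... fill the back
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # A is most relevant, E is least
print(reorder_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this helps depends on the model and the context length, so measure it against your own evaluation set before keeping it.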

The Honest Summary

Generative AI is genuinely transformative technology, and it's also genuinely overhyped.

What's real:

These models can generate useful text, code, and creative content
They can adapt to new tasks via prompting without retraining
They're getting better fast: what fails today might work next quarter

What's marketing:

"It understands": No, it predicts based on patterns
"It reasons": No, it mimics reasoning patterns from training data
"It will replace X": It changes how X is done, rarely replaces it entirely

The engineers who thrive with GenAI are the ones who understand both: who leverage the real capabilities while building guardrails around the limitations.

Next up in this series: Prompt Engineering foundations, covering how to actually communicate effectively with these systems.

Quick Reference

Concept	What It Means
GenAI	AI that creates new content by predicting what comes next
Temperature	Controls randomness (0 = near-deterministic, higher = more random)
Token	Basic unit of text processing (~4 chars in English)
Transformer	Architecture that processes all tokens in parallel via attention
Emergence	Capabilities that appear at scale without explicit training
Hallucination	Confident generation of plausible but false information

FAQs

Q: Is GenAI just a more sophisticated search engine?

No, and this confusion causes a lot of problems. Search engines retrieve existing information. GenAI generates new text that may or may not reflect real information. When you ask ChatGPT a question, it's not looking anything up. It's constructing an answer based on patterns. That's why it can confidently state things that don't exist. Treat it like a creative collaborator who's well-read but occasionally makes stuff up, not like a factual reference.

Q: Should I be worried about my job as a developer?

I've been using GenAI heavily for about a year now. My honest take: it changes what I spend time on, not whether I'm needed. I write less boilerplate, but I spend more time on architecture, review, and verification. The developers who struggle are those who either refuse to use these tools OR blindly trust their output. The sweet spot is treating GenAI like a very fast junior developer who needs supervision.

Q: How do I know which model to use?

Start with the cheapest one that works for your task. For most things, smaller models like GPT-4o-mini or Claude Haiku are fine. Graduate to larger models (GPT-4, Claude Opus) when you hit quality limits. I use Haiku for simple tasks, Sonnet for most coding, and Opus for complex reasoning. Your token bill will thank you.

The Rise of Product Engineering

Vaibhav Doddihal — Tue, 24 Feb 2026 13:24:55 +0000

Originally published on BlockSimplified — 4 min read

Something I told every new engineer who joined my team: "You are not a frontend engineer. You are not a backend engineer. Those are just tags for the domain where you have depth. You are a product engineer."

Most of them looked confused. Some pushed back. But the ones who got it became the strongest engineers I have worked with.

Marcin Roszczyk wrote something recently that put words to what I have been saying for years: AI is not killing software engineering. It is exposing what engineering actually is. For the past two decades, we called millions of people software engineers, but most of the work was implementation. Assembling systems, translating requirements into code. Now that AI can implement faster than any of us, the gap is visible.

I agree. And I have seen this play out firsthand leading teams of 10 to 14 engineers as a Tech Lead and Principal Engineer.

For years, our industry rewarded implementation volume. Ship tickets. Close sprint points. Move fast inside your lane.

When implementation stops being scarce, judgment becomes the scarce asset. And that is what is happening right now.

Product engineers, not label engineers

Frontend and backend are domain tags, not identity. They are not excuses to ignore the rest of the user journey.

If you are on frontend and the integration is painful, saying "the API is wrong" is not engineering. Explain why it is wrong. Show the contract problem. Propose a better shape for the response. Make it easier for both the client and the system.

If you are on backend, payload size is your problem too. A heavy response hurts parse time, rendering performance, and perceived speed in the browser. Engineering means optimizing the whole path, not just your endpoint benchmarks.

The role trap early-career engineers fall into

I see this often: new engineers rush to lock their identity around FE, BE, or DevOps.

Specialization is good. Role tribalism is not.

Real engineering comes from curiosity about how the full application behaves end to end. You do not need to be a jack of all trades. You do need enough context to make decisions that help the product, not just your layer.

A real example from my team

We were building local-first software. On first launch, the app needed to sync a large dataset and configure the local environment. We had already stripped initialization to the bare minimum, but the setup still took long enough that users bounced before seeing any value.

The obvious fix was a loading spinner or progress bar. But that just tells users "wait." It does not solve the actual problem: the user has no reason to stay.

Two backend engineers proposed something different. Instead of an empty loading state, show interactive marketing slides that teach users how the product works and what they can do with it. At the bottom, display clear messaging about exactly what the system is doing and why it takes time. The user gets oriented and sees value before the app is even ready.

The constraints were real: the slides had to load instantly from bundled assets (no network dependency during setup), the progress messaging had to be accurate (not a fake progress bar), and the transition to the live app had to feel seamless. They prototyped it, tested the timing, and shipped it.

NPS went up. Bounce rate during onboarding dropped.

The best part: two backend engineers drove this from idea through execution. They did not say "that is a frontend problem." They saw a product problem, understood the user experience constraint, and built a solution that worked across the stack.

That is product engineering.

A necessary caveat

None of this means specialization is dead. AI still makes junior engineers dramatically more productive at implementation. There are domains like performance-critical systems, security, and compliance where deep specialists are exactly what you need. Not every team needs product engineers. Some need someone who knows memory allocation patterns better than anyone alive.

The point is not "everyone should do everything." The point is that curiosity beyond your lane is what separates engineers who grow from engineers who plateau.

Final take

AI is rewarding engineers who are curious beyond their lane. The next generation of strong teams will be built by people who can reason across boundaries, communicate trade-offs clearly, and care about outcomes more than labels.

Implementation still matters. But engineering is the moat.
