DEV Community: Scarlett Attensil

The standard advice on handling bias in tech is wrong

Scarlett Attensil — Tue, 19 May 2026 18:19:19 +0000

Every woman who has worked in tech long enough has gotten the same three pieces of advice. Confront the behavior in the moment. Build an allyship network. Escalate to HR if it crosses a line.

I have tried all three. The first one cost me a working relationship I needed. The second one created a debt I never asked for. The third one put my name on a list and changed nothing about the person who did the thing.

This post is about what I do instead.

To be clear about scope: I am a Senior Developer Educator who has worked across data science, AI/ML, and developer education at multiple companies. My experience covers technical IC work and content-focused roles. If you are a manager of a 200-person org, the calculus may be different. If you are early career and still building a reputation, parts of this will not apply yet.

Here is what I have learned.

Confronting in the moment makes you the problem

The advice goes like this. Someone interrupts you, takes credit for your work, or makes a comment that lands wrong. You name it on the spot. "I was still speaking." "That was my idea from the sync last Tuesday." Direct, calm, professional.

The intent is good. The math does not work.

Confrontation in the moment costs you the room. Whatever you were trying to accomplish in that meeting is now derailed. You become the person who made the meeting awkward. The other party almost never says "you are right, I apologize." They explain, they soften, they reframe. The meeting moves on. You have spent your credibility to teach a lesson nobody asked to learn.

Worse, it lodges in the calibration discussion. Six months later, when someone is evaluating you for a stretch project, the data point that surfaces is "she can be difficult in meetings." Not the original behavior. Your response to it.

I am not saying never push back. I am saying the price is real and it falls on you, not on the person who caused the problem. Spend that currency on the things that matter. The interruption in a five-person room where the work product is at stake is worth a response. The interruption in a 30-person all-hands is not.

Allyship asks create the wrong dependency

The advice here goes like this. Find a senior man who will amplify your voice in meetings, vouch for your ideas, interrupt the interrupter on your behalf. Allies are powerful. Find them.

I have had good allies. They have made real differences. The problem is the structure.

When your visibility in a room depends on a man being in the room and choosing to use his voice for you, you have outsourced your authority. The ally goes on vacation. The ally moves to a different team. The ally has his own promotion cycle and is suddenly less available. Your visibility goes with him.

There is also a quieter cost. Performative allyship is its own ecosystem. Some allies advocate in ways that benefit their own brand more than they shift the dynamic. They make a visible show of speaking up, the room notices the ally, and you become a supporting character in someone else's narrative arc. Your idea becomes the thing the ally championed.

The version of this that works is mutual. You have a peer or near-peer who knows your work cold, and you do the same for them, and you happen to mention each other's contributions because you have actually been in the trenches together. That is not allyship as it is sold. That is normal professional credit, exchanged between people who respect each other.

HR works for the company

This one is the hardest to say out loud, because it sounds cynical and people want to believe HR is on their side.

HR exists to manage legal and reputational risk for the company. When that risk aligns with protecting you, they will protect you. When it does not, they will manage you. Both things can be true of the same HR person on the same day.

When you escalate to HR, three things happen at once. You go on a list. The other party gets coached, usually lightly. Your manager finds out, even if you were told it would be confidential, because your manager has to be looped in for any next-round decisions that involve you. The dynamic with your manager is permanently different from that moment on. They are now managing both you and a known issue, and they are evaluating which one is easier to make go away.

This does not mean never go to HR. It means go to HR with clear eyes. Use it when the behavior is severe enough that you would leave over it anyway, or when you need a paper trail because you have already decided to sue. Do not use it as the first or second line of response. The cost-benefit only works at the extremes.

What I actually do

The replacement playbook is less dramatic. It is also more effective, at least for me.

Document everything for yourself

I keep a private file. Date, what happened, who was there, my response if any. I do not write it for HR or for a future case. I write it so that I can see patterns.

Gender dynamics are gaslighting-prone. The behavior is often plausibly deniable in any single instance. You start to wonder if you are imagining it. The file makes that question answerable. Twelve interruptions across four meetings with the same person is a pattern. One is a moment.

The file also helps you decide. When I review it monthly, I notice which incidents still feel hot and which ones I have moved past. The ones still hot get attention. The ones I have moved past taught me they were not worth the calories.

Build influence outside the orbit

The most powerful move I know is to make your reputation depend on people who are not in the dynamic. Skip-levels. Adjacent teams. External community. Public work.

If your visibility comes from your manager and your immediate peers, then any bias in that small group has outsized weight. If your visibility also comes from a talk you gave at a conference, a tutorial that is the top result for a search query, a working relationship with a VP two levels up, that small group's opinion is one data point among many.

This is also the highest-value move you can make for your career independent of any bias. The bias problem and the career problem have the same solution.

Choose what to ignore on purpose

Most micro-incidents are not worth a response. Picking your battles is not surrender. It is portfolio management.

The principle I use: respond to things that have ongoing operational cost, ignore things that are purely social. Someone taking credit for your work in front of your skip-level has operational cost. Someone explaining your own field to you at a happy hour does not. The first shapes future decisions about your work. The second is a bad time you can walk away from.

The energy you save here is what you spend on the work that compounds. Shipping the thing. Writing the post. Giving the talk. The work is what creates room to maneuver.

Leave well, not loud

Sometimes the answer is to go.

When that is the answer, leaving well matters more than leaving loud. A loud exit feels righteous. It also makes every future hiring manager nervous. Leaving well means quiet, on your timeline, with the relationships you want to keep intact. Tell the people who matter to you that you are going. Do not tell the people who caused the problem anything they have not earned.

The best revenge, if you want to think of it that way, is to show up two years later with a better title at a better company and have your work cited in the room you left. The loud exit forecloses on that future. The quiet exit keeps it open.

The underlying trade

The standard advice positions the woman experiencing bias as the agent of correction. Confront him. Recruit allies for yourself. Escalate up the chain. All of it is work, all of it is risky, all of it is uncompensated.

The replacement is to stop trying to fix the dynamic and start building standing that makes the dynamic less relevant. None of this fixes the structural problem. The structural problem will outlast any individual woman's career strategy. What this does is keep you operating at full capacity inside an imperfect system, while the slow work of fixing the system happens at the pace it happens.

That is the trade I have made. Your trade may look different.

The second-order cost of a(nother) layoff year.

Scarlett Attensil — Mon, 18 May 2026 23:14:46 +0000

Layoff years compress organizations. Scopes overlap. Lanes that were clear during a growth year start running into each other. Most of the resulting friction works itself out. Some of it doesn't.

This year has already seen more than 300 tech layoff events. Meta is reportedly preparing to cut around 8,000 employees, Oracle reportedly sent 6 a.m. layoff notices in a restructuring that could affect up to 30,000 roles, and Microsoft has launched what GeekWire reports is the first voluntary retirement program in its 51-year history. Coinbase, Block, Cisco, and others have framed recent cuts around AI, flatter teams, and reallocating resources toward higher-growth work.

A market this tight raises the stakes on every working relationship. In a year where leadership is openly asking who can do whose job, that overlap becomes a resource question. When one party works to resolve it collaboratively and the other works to win it, the cost lands on whoever is doing the accommodating.

The thing to know up front is that the asymmetry doesn't close by trying harder on your side. The other party isn't trying to find common ground. They're building a case. Engaging more just gives them better material.

What works is borrowing from infrastructure.

In systems work, there's a concept called blast radius. It describes how far damage spreads when something breaks. A blast radius of one means a single service goes down. A blast radius of everything means a misconfigured deploy takes the whole company offline. The job is to design things so failures stay small.

The same logic applies when someone in your org has set you in their sights. You aren't trying to repair anything. You're trying to keep the damage local.

Shrink the surface

In security, attack surface is the sum of every place a system can be reached, queried, or pulled into. The first move is to make yours smaller without making the change obvious.

This can mean cutting things you used to think were good for your career. Impromptu coffees. The drop-in DM that turns into a thirty-minute therapy session about middle management. Pulled-aside-after-the-meeting conversations. None of it is bad on its own, but all of it can be edited and reframed to fit their negative narrative.

Your early drafts need to stop going into shared docs. Personal stuff should stop showing up in team channels. The hobbies, the weekend plans, the half-joke about a manager. All of it is data, and the safer working assumption is that anything shared at work could end up in front of someone you haven't met.

Make everything written

This single move does more work than anything else on this list.

Slack threads. Doc comments. Ticket updates. Email when it has to be. Anything but the hallway conversation and the unscheduled call.

The first reason is the obvious one. If things escalate later, you'll need a record, and the record will not exist if you talked it out.

The second reason matters more. A lot of what gets said in person doesn't survive being typed. You can almost feel the person editing themselves once they know they're on the record.

When someone pushes for a quick call or wants to grab you for five minutes, redirect politely. "Happy to look once you send it over." "Can you drop the question in the ticket so I can give it real attention." Then go back to whatever you were doing.

This isn't being difficult. It's insisting on the baseline professional communication you'd ask of any other cross-team colleague. The fact that this one finds it inconvenient is sort of the point.

Side effect worth knowing: if they refuse to put anything in writing, that's information too. Anyone who only wants to engage off the record is telling you what kind of conversation they want to be having.

Stop arguing with the framing

Problematic colleagues like to offer you stories about who you are. Sometimes flattering, sometimes critical. The thread is that the framing always shrinks your scope, or anchors your work to theirs.

You don't have to argue with it. You also don't have to accept it. A useful script for these moments: acknowledge what you heard, restate what you're actually doing, move on.

"Got it. I'm still planning to ship the launch post next week. Happy to coordinate if there's overlap."

That's the whole thing. No defense. No correction. No explanation of why their version of your role is wrong. The argument isn't winnable anyway. They aren't actually offering a view of your job. They're offering a smaller version of you, and the only correct response to a smaller version of you is to keep being the size you are.

Protect what compounds

The real cost of one of these dynamics isn't the conflicts. It's what they do to your attention. The counter is to protect the parts of your work that compound.

This matters more right now than it did two years ago. Coinbase cut roughly 14 percent of its workforce and called the resulting team "lean, fast, and AI-native." Block cut nearly half its staff, and Jack Dorsey told the company that smaller, flatter teams were the future. When that's the prevailing logic across the industry, the people who survive are the ones whose work has obvious, legible, compounding leverage. You want to be expensive to lose, and you want the case for keeping you to be visible without anyone having to fight for you in a room you aren't in.

A useful weekly check: would the next thing you're shipping still matter if this person weren't in your org. When the answer is no for too many weeks in a row, the dynamic has captured you, and it's time to reset.

Keep the file

Somewhere on a personal machine, not anything synced to work, keep a plain text document with dates, what happened, who was there, what was said or written. No commentary. No theories. Just the receipts.

You'll likely never need it. You'll be glad you have it.

The file isn't a place for venting. Venting goes to friends, a therapist, or a journal you'd be okay losing. The file's job is to make a clear factual case if it ever has to be made to a manager or HR. Trying to reconstruct six months of incidents from memory while stressed and being asked pointed questions is a particular kind of awful. The file is insurance against that. It costs maybe four minutes a week.

Escalate carefully or not at all

Most of these dynamics don't need escalation. They burn out. The colleague gets bored, the team reorgs, or more likely, the company has bigger problems. The default move is to outlast the thing, not confront it.

The threshold for taking it up the chain is concrete harm. Credit getting taken in a way that hits a perf review. Material misrepresentation of you to leadership. Sustained interference with shipping.

When escalation does happen, lead with impact on the work. "This is blocking shipping X by Y" gets traction. "This is making me uncomfortable" doesn't, even though both can be true at the same time. Managers will move on the first. Many will quietly let the second die.

What to actually do this week

Three things, if any of this is hitting.

Pull the receipts on the last two months. Even if nothing is currently on fire. Especially if nothing is currently on fire, because the moment you'll wish you had this information is the moment you're least able to gather it.

Close one informal channel. The unscheduled call. The "can I grab you" Slack. The hallway side-meeting. Just one. The point isn't to wall off the relationship in a day. The point is to start changing the geometry.

And look at what's shipping next. If it's a thing decided in reaction to something happening internally, swap it. Ship the thing that grows your own visibility instead. The colleague has a finite amount of energy for this. So do you. Spend yours on the work that compounds.

There's no clean ending here. The breakthrough conversation doesn't happen. The colleague doesn't suddenly see what you've been seeing. What happens, with your attention back on the work, is that the work gets better. Doors open in rooms no one inside your team controls. Turns out, the dynamic was always running on access to you. By the time you notice it's gone quiet, you're somewhere else entirely.

That's the win.

Claude Code is the engine, Cursor is the cockpit

Scarlett Attensil — Sat, 16 May 2026 22:30:39 +0000

My daily workflow looks nothing like it did a year ago. A lot has landed in Claude Code recently. Skills replaced custom slash commands, subagents and plugins followed, and four days ago /goal shipped, which lets an agent run autonomously for hours or days against a completion condition. Cursor 3 came out with an agents-first interface and a cheap in-house model. MCP went from "interesting protocol" to how every tool plugs into every other tool.

The net effect is that I now run Claude Code as the primary engine and treat Cursor as the cockpit. Cursor is where I drop in when I want a GUI, design mode, or a real-time view of a diff landing. Most everything else, including the long-running stuff, lives in Claude Code.

Five patterns that stuck.

Pattern 1: Claude Code is the engine, Cursor is the cockpit

The biggest mindset shift was treating the two tools as different categories of thing rather than competitors.

Claude Code runs in the terminal with Opus 4.7 behind it, and it's built around the idea that you can describe an outcome and let it work. /goal and skills make it possible to run multi-hour tasks against verifiable end conditions, then codify the workflows you do over and over. That's the engine.

Cursor is where I go when I specifically need to see something. The new agents window is where I manage parallel work, but for the tight inline loops I still flip back to the editor window, where Tab predictions and Composer beat anything else. Design mode lives there too, which is where I end up when I'm chasing a UI tweak. With Composer 2 sitting at 50 cents per million input tokens, Cursor 3 also makes a cheap surface for parallel cloud agents on smaller tasks. The cloud agents are the ones I'll fire off before a meeting, expecting a PR to be waiting when I get back. That's the cockpit.

My rough heuristic now:

One-sentence tasks I can walk away from for an hour go to Claude Code, usually with /goal.
Anything I want to watch happen in my editor goes to Cursor.
Repeatable workflows I've done more than twice get codified as a Claude Code skill, so it doesn't matter which surface I call them from.

A bonus trick that applies in both tools: plan with a smart slow model, execute with a fast cheap one. In Cursor 3, plan in plan mode with Opus 4.7, then switch to Composer 2 Fast to build. In Claude Code, draft the plan with the default Opus model and hand the execution off via /goal so a smaller verifier model does the per-turn checking. Saves real money and almost never costs you quality.

Pattern 2: One conventions file, two readers

Both tools support project-level instruction files, and as of Cursor 3 they both support skills as well. Claude Code reads CLAUDE.md at the repo root plus nested ones for subdirectories, with skills in .claude/skills/. Cursor reads from .cursor/rules/ and pulls skills from .cursor/skills/. Same primitives, two surfaces.

The mistake I made early was treating these as separate. Different rules in each file, drift between them, and inevitably one tool would produce code that violated rules the other knew about.

The fix is to write conventions once in a canonical file, then have both CLAUDE.md and your top-level Cursor rule reference it. Mine live at docs/conventions.md. One source of truth, two readers.
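Drift tends to creep back in when one of the two reader files quietly stops pointing at the canonical doc. A minimal sketch of a check you could run in CI, assuming the Cursor rule lives at `.cursor/rules/conventions.mdc` (that filename is hypothetical) and that "references" simply means the canonical path appears somewhere in the file:

```python
from pathlib import Path

CANONICAL = "docs/conventions.md"
# Reader files to check. The Cursor rule filename is an assumption for illustration.
READERS = ["CLAUDE.md", ".cursor/rules/conventions.mdc"]


def readers_missing_reference(repo_root, canonical=CANONICAL, readers=READERS):
    """Return the reader files that are absent or never mention the canonical path."""
    root = Path(repo_root)
    missing = []
    for rel in readers:
        path = root / rel
        if not path.is_file() or canonical not in path.read_text(encoding="utf-8"):
            missing.append(rel)
    return missing
```

An empty list means both readers still point at the single source of truth. It does not check that the conventions themselves are sensible, only that nobody forked them.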

What goes in there: naming conventions, error handling patterns, the testing approach you actually use, libraries you prefer alongside libraries you've banned, PR structure expectations. And the one most people forget, feature flag conventions. If your team uses LaunchDarkly, this is the place to write down which flag types map to which use cases, your naming conventions, and the standard rollout cadence. Both agents will then default to writing flag-gated code correctly without you reminding them every time.

You can use these files for tone, too. A line still earning its keep at the bottom of my CLAUDE.md:

When you finish a non-trivial task, summarize what you did in the voice of a tired senior engineer who has seen it all.

Commit the file. Your teammates' agents inherit your conventions, and your sense of humor, for free.

Pattern 3: /goal for anything bigger than one conversation

This is the new one. Claude Code shipped /goal on May 12, 2026. Codex shipped its equivalent the same week. If you're not using it yet, this is the upgrade with the biggest leverage in years.

The premise is that you give Claude Code a completion condition instead of a list of steps. After each turn, a small verifier model checks whether the condition holds. If it doesn't, Claude takes another turn. If it does, the goal clears and control comes back to you. In practice this means handing off tasks that would have eaten a whole afternoon of back-and-forth, and getting a PR when you wake up.
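The shape of that loop is easy to sketch. This is not Claude Code's implementation, only an illustration of the turn-then-verify pattern described above. `take_turn` and `condition_holds` are hypothetical callables standing in for the agent and the verifier, and the `max_turns` cap is an added assumption so the sketch cannot spin forever:

```python
from typing import Callable


def run_until_goal(
    take_turn: Callable[[int], None],
    condition_holds: Callable[[], bool],
    max_turns: int = 50,
) -> int:
    """Call take_turn until condition_holds() is true and return the turns used.

    Raises RuntimeError if the condition is still unmet after max_turns.
    """
    for turn in range(1, max_turns + 1):
        take_turn(turn)        # the agent does one unit of work
        if condition_holds():  # the verifier checks the end state, not the steps
            return turn
    raise RuntimeError(f"goal not met after {max_turns} turns")
```

The point of the structure is that the verifier only ever looks at the end state, which is why a vague condition gives it nothing to check.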

What makes a good goal:

One measurable end state. "All TypeScript errors resolved, all tests passing, PR opened against main" is a goal. "Make the auth flow better" is a wish.
A roadmap or PRD to drive against. The pattern working best for me right now is to hand it a docs/roadmap.md with a checklist of tasks, then write the goal as "every task on the roadmap is checked off and verified." The verifier has something concrete to check.
Explicit constraints for anything that must not change. /goal accepts up to 4,000 characters in the condition, so there's room to write "do not touch payments/" and "do not modify the public API surface."

A LaunchDarkly-shaped example that has earned its keep on my machine:

/goal Walk the release-checkout-redesign flag from 0% to 100% using
our standard rollout cadence in docs/conventions.md. After each ramp,
check error rate from the metrics MCP server. Stop and return to me
if any ramp shows error rate above baseline + 10%, or when the flag
is at 100% with three consecutive green ramps. Do not modify any code,
only the flag targeting.

That's a task I used to babysit for half a day. Now I run it while writing the next feature in Cursor.
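The stop rule in that goal is worth writing down precisely, because "baseline + 10%" can be read two ways. A small sketch, assuming the relative reading (10 percent above the baseline rate) and a hypothetical list of `(percent, error_rate)` observations, one per ramp:

```python
def rollout_decision(ramps, baseline, tolerance=0.10, greens_needed=3):
    """Decide what to do after the ramps observed so far.

    ramps: list of (percent, error_rate) tuples in the order they ran.
    Returns "halt", "complete", or "continue".
    """
    limit = baseline * (1 + tolerance)  # assumption: 10% is relative to baseline
    greens = 0
    for percent, error_rate in ramps:
        if error_rate > limit:
            return "halt"  # any red ramp hands control back to a human
        greens += 1        # every ramp reached here is green, so the run is consecutive
        if percent >= 100 and greens >= greens_needed:
            return "complete"
    return "continue"
```

With a baseline of 0.02, `rollout_decision([(10, 0.02), (50, 0.03)], 0.02)` returns `"halt"`. If you meant ten percentage points rather than ten percent, change the `limit` line and say so in the goal text.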

Pattern 4: Skills for the workflows you do more than twice

The single most underused feature in Claude Code is skills. They replaced what used to be custom slash commands, they auto-invoke when the description matches, and they cost almost nothing in tokens until they actually load.

A skill is a folder at .claude/skills/<name>/ with a SKILL.md inside. The file has YAML frontmatter with a name and description, and a markdown body underneath. That's it. Supporting scripts, example templates, helper files are optional and all live in the same folder.
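To make the layout concrete, here is a minimal scaffolding sketch that writes that folder and file. The helper name `scaffold_skill` is made up for illustration, and the sketch writes only the two frontmatter fields named above:

```python
from pathlib import Path


def scaffold_skill(repo_root, name, description, body):
    """Create .claude/skills/<name>/SKILL.md and return its path."""
    skill_dir = Path(repo_root) / ".claude" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    # YAML frontmatter with name and description, then the markdown body.
    # Caveat: a description containing a colon would need quoting in YAML.
    skill_file.write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n{body.strip()}\n",
        encoding="utf-8",
    )
    return skill_file
```

Calling `scaffold_skill(".", "pr-prep", "Run the linter and tests, then draft a PR description without pushing", "1. Run the linter.\n2. Run the tests.")` produces a file you can then edit by hand.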

The mental flip that helped me: anything I've copy-pasted into Claude Code twice should be a skill. Not eventually. Today.

Skills that have earned their place in my repos:

pr-prep. Runs the linter, runs the test suite, drafts a PR description against the convention in CLAUDE.md, and stops short of pushing.
flag-rollout. The LaunchDarkly rollout playbook. Reads the cadence from docs/conventions.md, has access to the LaunchDarkly MCP server and the metrics MCP server, and walks a flag through staged ramps with verification between each. Same shape as the /goal example above, but invocable as /flag-rollout without retyping the condition.
test-writer. Writes tests strictly using the conventions in the repo's testing doc. Vitest for unit and integration, Playwright for E2E, no Jest, no surprise mocks.
project-context. A tiny per-repo skill that tells the agent which Jira board, which LaunchDarkly project, and which feature flag prefix this codebase maps to. Sounds trivial. It killed the entire class of "wait, which board are we looking at" questions every agent used to ask me.

Each one is maybe 30 lines of Markdown. The compounding effect shows up within a week.

One more distinction worth pinning down, because it tripped me up early. Rules versus skills. Rules are the what. They go into every conversation as ambient context. "We use TypeScript. Vitest for tests. No new ORMs without a discussion." Skills are the how. They're loaded on demand when the description matches. "Walk a LaunchDarkly flag through staged ramps with verification" is a skill. "Always wrap network calls in our retry helper" is a rule. Keep rules brief and ambient. Push the heavy procedural stuff into skills, where it costs nothing in tokens until the agent actually pulls it in.

A quick note on skills versus subagents, because the difference matters. A skill is inline. Its instructions load into the current conversation and shape how the main agent behaves. A subagent is a separate, isolated context that the main agent delegates to and gets a summary back from. Use skills for "do it this way." Use subagents for "go do this heavy thing and come back when you're done." Most people want a skill 90 percent of the time.

Pattern 5: MCP servers, configured once, available everywhere

MCP went from "interesting protocol" in 2025 to the standard way every tool talks to every other tool by May 2026. Both Claude Code and Cursor have first-class MCP support with effectively identical configuration semantics, so a server you set up in one is a five-minute job to mirror in the other.

The LaunchDarkly MCP server is a good example of why this matters. Same server. Same credentials. Same flag set. Accessible from either surface. In practice that means I can ask the agent to "wrap this checkout component in a flag called release-checkout-redesign, default off" while I'm in Cursor writing the component, then switch to Claude Code in the terminal and run the /flag-rollout skill against that same flag. One mental model across two tools.

A few things I've learned that aren't obvious from the docs:

Resist the urge to connect every server you've heard of. Both tools handle large MCP rosters better than they used to, but every active tool eats a slice of your context budget. Keep four to six servers active and rotate as projects change.
Configure the same set in both tools. It removes the "wait, which one has the GitHub MCP" friction and lets you move work between them without thinking.
MCP servers compose with skills. A skill can specify which MCP tools it's allowed to use via allowed-tools in the frontmatter. Scoping a skill to only the LaunchDarkly MCP server makes it physically incapable of touching anything else, which is much safer than relying on the model to behave.

Conclusion

The May 2026 stack, as I run it:

Claude Code as the engine, for /goal-driven long tasks, skills, and anything else that should run without supervision.
Cursor as the cockpit, for visual work, design mode, tight inline edit loops, and cheap parallel cloud agents.
A shared conventions file both tools read from, so they code the same way.
/goal for anything bigger than one conversation.
Skills for anything you've done more than twice.
MCP servers mirrored across both tools, kept to a tight set you actually use.

The shift this year is that the highest-leverage work isn't editor tricks anymore. It's setting up the agents to run without you and getting them to do the same thing the same way every time. The editor matters less every quarter. The skill library and the goals you can hand off are what compound.

If you've got skills or MCP setups I should steal, send them my way. You can find me on LinkedIn.

If You Can Survive a Toddler, You Can Ship LLMs in Production

Scarlett Attensil — Thu, 14 May 2026 17:43:09 +0000

A few years back I was running a time-series pipeline that scored incoming product reviews on a 1-10 scale. The scorer was an LLM. Reviews rolled in continuously, ratings flowed into a dashboard the product team checked every Monday morning. Everything ran clean for months. Then one Monday the chart had a step in it.

Reviews from the prior week averaged 6.4. The current week averaged 7.6. Same product. Same customers. The reviews themselves, when I went back to read them, looked indistinguishable from what we had been getting all year.

The model had changed. The provider had pushed a quiet update to the weights, and the LLM that gave us 6.4-equivalent scores last week was now giving 7.6-equivalent scores for the same content. Every historical comparison in that dashboard was silently invalid. The cleanup took a week. The harder conversation was about how much of our reporting had been real in the first place.

That kind of failure is the default behavior of LLMs in production. Trying to engineer it away with tighter parameters or pinned versions is a losing fight. The job is to design for it. I learned the lesson twice. Once from the reviews pipeline. Once from raising two kids.

What parents of small children already know about non-determinism

If you have lived through the toddler years, you have run this experiment a few hundred times without calling it one. The lunch you packed all last week, the one that came home empty every day, suddenly gets pushed off the table on Tuesday with full commitment. The bedtime story that worked for six straight nights stops working on the seventh. The nap routine the babysitter swore was solid breaks the moment you start calling it a "rule."

Experienced parents eventually stop trying to force determinism on the kid. Patterns and trends still matter. But you stop expecting any individual input to produce any individual output, and you build a system that absorbs the variance instead of fighting it. This is the same shift production AI engineers make, usually after their first calibration regression.

The LLM-as-judge can drift too

The reviews pipeline taught me that the judge can be the most fragile thing in the system. The model being evaluated can drift. The model doing the evaluating can drift too. Without something stable to anchor against, you cannot tell which one moved.

The pattern that works is a small held-out set of inputs with known, human-validated scores, and the habit of re-running it on a regular cadence. Call it the calibration set. 20 to 50 examples is plenty. You re-score the calibration set first. If the average jumps from 6.4 to 7.6 with no other changes, you know the judge moved, not the data. Without that anchor, the same diagnosis takes weeks of reading individual reviews and arguing about what changed.
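The check itself is a few lines. A minimal sketch, assuming you store the human-validated scores alongside the judge's latest scores for the same examples. The function name and the 0.5-point tolerance are illustrative choices, not values from the pipeline described here:

```python
from statistics import mean


def judge_drift(human_scores, judge_scores, tolerance=0.5):
    """Compare judge scores with human-validated scores on the calibration set.

    Returns (shift, drifted), where shift is mean(judge) - mean(human) and
    drifted is True when the absolute shift exceeds the tolerance.
    """
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("need two non-empty score lists of equal length")
    shift = mean(judge_scores) - mean(human_scores)
    return shift, abs(shift) > tolerance
```

Comparing means is the bluntest possible version. It catches a step like 6.4 to 7.6, but a judge that gets harsher on some inputs and softer on others can slip past it, so per-example differences are worth logging too.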

This is where offline evaluations in AgentControl earn their keep. You upload your calibration set as a dataset, point a judge at it, and re-run on a cadence or before any variation change. The discipline I had to learn the hard way (keep the judge anchored, keep its inputs comparable, watch the distribution rather than any single response) becomes a property of the configuration instead of a script someone has to remember to run.

The parenting version is the pencil marks on a doorframe. The doorframe does not move. Every few months you put the kid against it, shoes off and back to the wall. If the line jumps three inches and you realize the kid is wearing sneakers, you take the shoes off and measure again before believing any of it. The doorframe is your held-out set. The shoes-off rule is the discipline that keeps re-runs comparable.

Temperature zero is a comfort blanket

Setting temperature to zero feels like it should make a model deterministic. Within a single model version it mostly does. The catch is that determinism within a version buys you nothing across versions.

My reviews pipeline had been running at temperature zero the whole time. The provider's swap underneath did not care. The judge changed, and greedy sampling kept producing the new, shifted scores with the same false confidence as before. Temperature zero compresses the variance you can see during testing, which makes you feel safer. It does nothing about the variance that actually breaks production. Design as if the model can produce a different valid output every time, because eventually it will.

Build the fallback before the happy path

The last move gets cut for time most often, which is why it matters. Before you ship the model that does the new thing well, ship the path that runs when the model does the new thing badly, slowly, or not at all.

For an LLM endpoint that usually looks like a cached response for known-bad inputs, a secondary model behind the primary that can take traffic when the primary fails an inline check, a circuit breaker that routes around the model entirely if error rates cross a threshold, and a logging path that captures failure cases instead of returning a stack trace to the user. None of this is exotic. All of it assumes the model will misbehave at some point and defines what good behavior looks like when it does.
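To make the secondary-model and circuit-breaker pieces concrete, here is a rough sketch. Everything in it is hypothetical: the `CircuitBreaker` class, the `answer` helper, the window and threshold values, and the two stand-in model functions. A real breaker would also need a way to close again after a cool-down, which this one leaves out.

```python
from collections import deque

class CircuitBreaker:
    """Opens when the error rate over the last `window` calls exceeds `max_error_rate`."""

    def __init__(self, window=20, max_error_rate=0.5, min_calls=5):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.min_calls = min_calls

    def record(self, ok):
        self.results.append(ok)

    @property
    def open(self):
        if len(self.results) < self.min_calls:
            return False  # not enough data to judge yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_error_rate


def answer(prompt, primary, secondary, breaker, failure_log):
    """Use the primary model unless the breaker is open; fall back to the secondary."""
    if not breaker.open:
        try:
            reply = primary(prompt)
            breaker.record(True)
            return reply
        except Exception as exc:
            breaker.record(False)
            # Capture the failure case instead of surfacing a stack trace.
            failure_log.append((prompt, repr(exc)))
    return secondary(prompt)


def flaky_primary(prompt):
    raise TimeoutError("primary model timed out")

def canned_secondary(prompt):
    return "Sorry, I can't answer that right now."

breaker, log = CircuitBreaker(min_calls=3), []
replies = [answer("hi", flaky_primary, canned_secondary, breaker, log) for _ in range(5)]
print(len(log), breaker.open)
```

After three failures the breaker opens, so the last two calls never touch the primary and only three failures are logged.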

The stronger version of the same idea is making the fallback adaptive instead of static. A static fallback still needs a human to notice something is wrong and pull the lever. An adaptive system watches the production signal itself and switches over without anyone in the loop. This is what configuration-driven LLM tooling is built for. With AgentControl by LaunchDarkly, model variations live as configuration rather than code, traffic shifts between them without a deploy, and a guarded rollout can tie an online evaluation score, or any AgentControl metric you care about, directly to whether a variation advances, pauses, or reverts. When the judge sees scores regress past a threshold, the rollout reverses itself. The fallback stops being a piece of code someone wrote in case of trouble. It becomes the architecture, watching itself.

Parents already operate this way. The grocery store meltdown will happen. The school will call at 11am about a "low-grade fever." You have a snack pre-stuffed in your bag and a backup babysitter on text. The fallback is the architecture. The happy path is the bonus.

What changed for me

After the reviews pipeline incident, my work changed. I started logging the model version returned in every API response. I built a calibration set. I stopped trusting any single eval run as a verdict. The boring fallback path now ships before the impressive demo path.

None of this is harder than the alternative. It is mostly a matter of accepting at the architecture stage what every parent of a small kid already knows. The interesting unit of measurement is the distribution, not the sample.

Offline Evaluation of RAG-Grounded Answers in LaunchDarkly AI Configs

Scarlett Attensil — Thu, 16 Apr 2026 21:22:50 +0000

Overview

This tutorial shows you how to run an offline LLM evaluation on the RAG-grounded support agent you built in the Agent Graphs tutorial, using LaunchDarkly AI Configs, the Datasets feature, and built-in LLM-as-a-judge scoring. You'll build a RAG-grounded test dataset, run it through the Playground with a cross-family judge, and learn how to read each failing row as a dataset issue, an agent issue, or judge calibration noise.

Here's how it works. The LaunchDarkly Playground evaluates a single model call against a prompt and dataset you configure. By pre-computing your RAG retrieval offline and baking the chunks directly into each dataset row, you turn that call into a high-value generation test: the model in the Playground receives the same documentation context it would in production, so the eval measures how well your agent reasons over real grounded input.

What You'll Learn

Structure a RAG-grounded test dataset by pre-computing retrieval offline and bundling chunks into each row
Pick the right LLM judge for your agent's output shape (Accuracy for natural-language answers, Likeness for structured labels)
Avoid same-model bias by running the judge on a different model family than the agent
Diagnose failing rows as dataset issues, agent issues, or judge calibration noise

What this tutorial covers, and what it doesn't

Covers:

Generation quality over RAG context: does the model produce a correct answer when the right documentation is in the prompt?

Regression detection: catching unexpected score drops when you change a prompt or model

Variation selection: comparing candidate prompts and models before committing to a new AI Config variation

Does not cover:

Retrieval correctness. Whether your vector store is returning the best chunks is tested by your own RAG pipeline, outside LaunchDarkly.

End-to-end agent graph behavior. Tool execution, multi-turn conversations, handoffs, and multi-step routing require online evals against real production traffic.

Prerequisites

You've completed the Agent Graphs tutorial or have equivalent familiarity with LaunchDarkly AI Configs
You have the devrel-agents-tutorial repo cloned
You have API keys for two model providers, one for the agent under test and one for the judge (the examples use OpenAI and Anthropic)

Step 1: Get the Branch Running

About the branch and the Umbra knowledge base. The feature/offline-evals branch builds on the same Agent Graphs tutorial codebase and the routing, tool, and graph work done in earlier branches — none of that goes away. What this branch adds is a more realistic RAG assessment target: Umbra, a fictional serverless-functions product with an invented knowledge base (refund windows, deployment regions, function timeout limits, rate-limit tiers, and so on). Because Umbra doesn't exist outside this tutorial, the model under test has no pre-training knowledge to fall back on — a correct answer has to come from the retrieved chunks, which is the only way to honestly measure whether your RAG pipeline is doing its job. The branch also ships a pre-built RAG-grounded test dataset (datasets/answer-tests.csv) and a helper script that regenerates it from your vector store.

cd devrel-agents-tutorial
git checkout feature/offline-evals
cp .env.example .env
# Add LD_SDK_KEY, LD_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY to .env

uv sync
uv run python bootstrap/create_configs.py
uv run python initialize_embeddings.py

Start the API and UI in two terminals:

# Terminal 1
uv run uvicorn api.main:app --reload

# Terminal 2
uv run streamlit run ui/chat_interface.py

Open http://localhost:8501 and ask a question grounded in the Umbra docs (refund policy, deployment regions, function timeout). The agent pulls answers from the knowledge base.

Step 2: Understand the Test Dataset

Open datasets/answer-tests.csv. Every row has three fields:

input,expected_output,original_question
"Documentation context: --- We offer a 30-day refund policy for first-time subscribers... --- Annual subscriptions receive a prorated refund within... --- Question: What is the refund policy?","30-day refund policy for first-time subscribers who haven't deployed production traffic. Usage charges are non-refundable.","What is the refund policy?"

input bundles documentation chunks and the question into a single structured prompt, separated by --- dividers. The chunks were retrieved from your production vector store ahead of time by tools/build_rag_dataset.py, so the model in the Playground sees the same grounding the production agent would, even though the Playground never executes your retrieval tools.
expected_output is the correct answer, written by a human who read the source docs.
original_question is a plain-text copy of the question so you can scan the dataset without parsing the bundled prompt. No judge uses this field.
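As a rough illustration of the row format, the sketch below bundles chunks and a question into one `input` string and serializes the three columns. The layout is inferred from the sample row above. `bundle_input` and `to_csv_row` are hypothetical helpers, not functions from tools/build_rag_dataset.py, and the chunk text is placeholder data.

```python
import csv
import io

def bundle_input(chunks, question):
    """Join retrieved chunks and the question into one prompt string."""
    context = " --- ".join(chunks)
    return f"Documentation context: --- {context} --- Question: {question}"

def to_csv_row(chunks, question, expected_output):
    """Serialize one row as input,expected_output,original_question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerow([bundle_input(chunks, question), expected_output, question])
    return buffer.getvalue().strip()

row = to_csv_row(
    chunks=["Chunk one text.", "Chunk two text."],
    question="What is the refund policy?",
    expected_output="A human-written answer.",
)
print(row)
```

Letting the `csv` module do the quoting matters here, because documentation chunks often contain commas and quotes of their own.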

Regenerate the dataset when your knowledge base changes:

uv run python tools/build_rag_dataset.py

For the full reference on dataset format and limits, see Datasets for offline evaluations.

Step 3: Upload the Dataset

Use synthetic data only

Never upload real customer tickets, PII, secrets, or credentials. Replace anything sensitive with synthetic placeholders before upload. See the Playground privacy section for what gets forwarded to model providers.

Navigate to AI > Library in LaunchDarkly, select the Datasets tab, and click Upload dataset. Upload datasets/answer-tests.csv and name it answer-tests.

Step 4: Add Your Model API Keys

The Playground calls model providers directly, so it needs API keys for both the model running your agent and the model running your judge. These keys live in LaunchDarkly's "AI Config Test Run" integration, not in your AI Config.

In the Playground, click Manage API keys in the upper-right corner.
Click Add integration, pick a provider (e.g. OpenAI), paste your API key, accept the terms, and save.
Repeat for the second provider (Anthropic) so you can run a cross-family judge in Step 5.

See the Playground reference doc for the canonical instructions. API keys are stored per-session, so you may need to re-paste them when you return.

Step 5: Run the Evaluation

From the Datasets list, click into answer-tests to open it in a Playground bound to that dataset.

Configure the test

System prompt: paste your support-agent instructions verbatim from the AI Config. Do not edit or simplify them.
Agent model: pick the model your support-agent variation uses (or a candidate you're considering swapping to). To compare two candidates, run the eval twice with different agent models and compare scores.
Acceptance criteria: attach an Accuracy judge with threshold 0.85. Accuracy scores whether the response correctly addresses the input question, which fits grounded natural-language answers.
Evaluation model: uncheck Use same model for evaluation and set the judge to a different model family from the agent. Same-family judging tends to reward output patterns the judge itself produces. A cross-family judge gives you an independent read.

Run the eval.

Reading the results

The example run had 18 passes and 2 failures. When a row fails, the failure comes from one of three places, and each one sends you in a different direction:

The dataset's chunks don't contain the answer. This is a retrieval problem, not a generation problem. Rebuild the dataset with higher top_k, a reranker, or a different chunker, or verify the answer is indexed at all.
The chunks contain the answer but the model ignored them. This is the agent-side failure offline evals are designed to catch. Tighten the system prompt to insist on grounding, or switch to a more obedient model.
The chunks and the model are both fine but the judge disagreed. This is judge calibration noise. Lower the threshold, try a different judge, or accept it as noise. Don't change your agent based on it.

Sort by score. For each failing row, open the bundled chunks in the input field and ask: was the right answer in there? Yes → fix the prompt or model. No → rebuild the dataset.
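The three-way triage can be written down as a tiny decision function. This is only a sketch: `triage` is a hypothetical helper, and both flags still come from a human reading the row. The code just encodes where the fix belongs.

```python
def triage(answer_in_chunks, response_matches_expected):
    """Map a failing row to where the fix belongs.

    answer_in_chunks: did the bundled chunks contain the right answer?
    response_matches_expected: was the model's response semantically correct?
    """
    if not answer_in_chunks:
        return "dataset"  # retrieval missed: rebuild the dataset
    if not response_matches_expected:
        return "agent"    # answer was available but ignored: fix prompt or model
    return "judge"        # response is right: treat the failure as calibration noise

print(triage(False, False), triage(True, False), triage(True, True))
```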

What failed in this run

Row 11: "What integrations are available?" (chunks missed the answer). The expected output mentioned monitoring integrations (Datadog, Sentry, LogRocket), but the retrieved chunks only covered databases, storage, and billing. The model correctly listed what it had and said "the documentation does not provide additional information regarding more integrations", which is the correct behavior for an ungrounded claim. Fix: higher top_k or a reranker in build_rag_dataset.py.

Row 12: "Can I get a refund on bandwidth overages?" (judge calibration). The model correctly said bandwidth overages are non-refundable, citing the docs, but omitted a secondary "Review your Usage Dashboard" recommendation from the expected output. Semantically right, lexically short one clause. Fix: lower the threshold or trim the expected output.

Two failures, two different fixes. Without reading the per-row results you'd conflate them and spend time tightening the model when the actual problem lives in the retriever or the dataset.

Where to Go From a Single Run

This tutorial walked you through one run. In practice, a single eval isn't where offline evaluation earns its keep. The real payoff comes from re-running the same dataset against a new prompt, a new model, or a fresh RAG chunker and comparing scores to your last known-good run. A small prompt edit that quietly drops your Accuracy from 0.83 to 0.71 is exactly the kind of regression this pattern is meant to catch, but only if you save the run and compare against it next time.

A reasonable next loop:

Save the run from Step 5 as your reference.
When you change something (prompt, model, chunker, top_k), re-run the same dataset and compare scores.
Add new rows to the dataset as you find failure modes in staging or production.
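If you export per-row scores from two runs, the comparison could look something like the sketch below. The dict-of-scores shape, the `compare_runs` name, and the 0.05 tolerance are assumptions for illustration, not a LaunchDarkly export format.

```python
from statistics import mean

def compare_runs(reference, candidate, max_drop=0.05):
    """Compare per-row scores of two runs over the same dataset.

    reference, candidate: dicts mapping row id -> judge score in [0, 1].
    max_drop: hypothetical tolerance on the drop in mean score.
    """
    shared = sorted(set(reference) & set(candidate))
    if not shared:
        raise ValueError("runs share no rows to compare")
    delta = mean(candidate[r] for r in shared) - mean(reference[r] for r in shared)
    worse_rows = [r for r in shared if candidate[r] < reference[r]]
    return {"delta": delta, "regressed": delta < -max_drop, "worse_rows": worse_rows}

# Made-up scores for a three-row dataset.
reference = {1: 0.9, 2: 0.8, 3: 0.8}
candidate = {1: 0.9, 2: 0.6, 3: 0.7}
report = compare_runs(reference, candidate)
print(report["regressed"], report["worse_rows"])
```

Returning the rows that got worse, not just the mean delta, keeps the per-row reading habit from Step 5 in the loop.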

For end-to-end behavior that offline tests can't capture (tool execution, multi-turn conversations, the tail of real production inputs), see online evaluations and the When to add online evals tutorial. Online evaluations are not currently supported for agent-based AI Configs; for agent workflows, the documented path is programmatic judge evaluation via the AI SDK.

Step 6: Track Evaluation History

View saved runs at AI > Evaluations. Toggle Group by dataset to collapse runs under each dataset name so you can see the history for answer-tests alongside any other datasets in the project. Compare pass and fail counts across runs, and distinguish saved runs (indefinite retention) from one-off runs (60-day expiry). For metric definitions, see Monitor AI Configs.

What's Next

Progressive rollouts: release your winning variation to 5% of traffic, then 25%, then 100%, watching production metrics before expanding.
When to add online evals: decide what to score on live production traffic once you have an offline baseline.

For a deeper look at the multi-agent RAG system this tutorial builds on, see the Agent Graphs tutorial.

Building Framework-Agnostic AI Swarms: Compare LangGraph, Strands, and OpenAI Swarm

Scarlett Attensil — Thu, 26 Mar 2026 21:05:21 +0000

If you've ever run the same app in multiple environments, you know the pain of duplicated configuration. Agent swarms have the same problem: the moment you try multiple orchestrators (LangGraph, Strands, OpenAI Swarm), your agent definitions start living in different formats. Prompts drift. Model settings drift. A "small behavior tweak" turns into archaeology across repos.

AI behavior isn't code. Prompts aren't functions. They change too often, and too experimentally, to be hard-wired into orchestrator code. LaunchDarkly AI Configs lets you treat agent definitions like shared configuration instead. Define them once, store them centrally, and let any orchestrator fetch them. Update a prompt or model setting in the LaunchDarkly UI, and the new version rolls out without a redeploy.

Ready to build framework-agnostic AI swarms? Start your 14-day free trial of LaunchDarkly to follow along with this tutorial. No credit card required.


The problem: Research gap analysis across multiple papers

When analyzing academic literature, researchers face a daunting task: reading dozens of papers to identify patterns, spot contradictions, and find unexplored opportunities. A single LLM call can summarize papers, but it produces a monolithic analysis you can't trace, refine, or trust for critical decisions.

The challenge compounds when you need to:

Identify methodological patterns across 12+ papers without missing subtle connections
Detect contradictory findings that might invalidate assumptions
Discover research gaps that represent genuine opportunities, not just oversight

This is where specialized agents excel - each focused on one aspect of the analysis, building on each other's work.

In this tutorial, we'll build a 3-agent research analysis swarm that solves this problem by dividing the work:

Agent	Role	Output
Approach Analyzer	Clusters methodological themes across papers	"Papers 1, 4, 7 use reinforcement learning; Papers 2, 5 use symbolic methods"
Contradiction Detector	Finds conflicting claims between papers	"Paper 3 claims X improves performance; Paper 8 shows X degrades it"
Gap Synthesizer	Identifies unexplored research directions	"No papers combine approach A with dataset B; potential opportunity"

We'll implement this swarm across three different orchestrators (LangGraph, Strands, and OpenAI Swarm), demonstrating how LaunchDarkly AI Configs enable:

Framework-agnostic agent definitions: Define agents once in LaunchDarkly, use them everywhere
Per-agent observability: Track tokens, latency, and costs for each agent individually - catch silent failures when agents skip execution
Dynamic swarm composition: Add/remove agents from the swarm or switch models without touching code

Why use a swarm?

Research gap analysis requires different skills: clustering methodological patterns, detecting contradictions, and synthesizing opportunities. With a swarm, each agent handles one aspect and produces artifacts the next agent builds on. You can track tokens, latency, and cost per agent. You can catch silent failures when an agent skips execution. And when something goes wrong, you know exactly where.

Technical requirements

Before implementing the swarm, ensure you have:

LaunchDarkly account with AI Configs enabled (see quickstart guide)
API keys for Anthropic Claude or OpenAI GPT-4 (check supported models)
Python 3.11+ for running orchestrators
Basic understanding of agent systems (review LangGraph agents tutorial if needed)

The complete implementation is available at GitHub - AI Orchestrators.

The architecture: how LaunchDarkly powers framework-agnostic swarms

The swarm architecture has three layers: dynamic agent configuration, per-agent tracking, and custom metrics for cost attribution. Here's how they work together.

The diagram shows LangGraph's implementation, but Strands and OpenAI Swarm follow the same pattern with their own handoff mechanisms. The key components are:

Configuration Fetch: The orchestrator queries LaunchDarkly's API to dynamically discover all agent configurations, avoiding hardcoded agent definitions
Agent Graph: Three specialized agents (Approach Analyzer, Contradiction Detector, Gap Synthesizer) connected through explicit handoff mechanisms
Metrics Collection: Each agent execution captures tokens, duration, and cost metrics through both the AI Config tracker and custom metrics API
Dual Dashboard Views: The same metrics appear in the AI Config Trends dashboard (for individual agent monitoring)

Three layers of framework-agnostic swarms

1. AI Config for Dynamic Agent Configuration

Each AI Config stores:

Agent key, display name, and model selection
System instructions and tool definitions

Your orchestrator code queries LaunchDarkly for "all enabled agent configs" and builds the swarm dynamically. No hardcoded agent names.

2. Per-Agent Tracking with AI SDK

LaunchDarkly's AI SDK provides tracking through config evaluations. You get a fresh tracker for each agent, then track tokens, duration, and success/failure. These metrics flow to the AI Config Monitoring dashboard automatically.

This tracking catches silent failures - when agents skip execution or produce minimal output. Step 4 shows the implementation patterns for each framework.

3. Custom Metrics for Cost Attribution

Per-agent tracking shows performance, but for cost comparisons across orchestrators you need custom metrics. These let you query by orchestrator, compare costs across frameworks, and identify anomalies.

With the architecture covered, let's build the swarm. We'll download research papers, set up the project, bootstrap agent configs in LaunchDarkly, implement per-agent tracking, and run the swarm across all three orchestrators.

Step 1: Download research papers

First, you need papers to analyze. The scripts/download_papers.py script queries ArXiv with narrow, category-specific searches to ensure focused results.

python scripts/download_papers.py

The script presents pre-configured narrow research topics:

# From orchestration/scripts/download_papers.py:164-189
topics = {
    "1": {
        "name": "Chain-of-thought prompting in LLMs",
        "query": "cat:cs.CL AND (chain-of-thought OR CoT) AND reasoning",
        "years": 2
    },
    "2": {
        "name": "Retrieval-augmented generation (RAG)",
        "query": "cat:cs.CL AND (retrieval-augmented OR RAG) AND generation",
        "years": 2
    },
    "3": {
        "name": "Emergent communication in multi-agent RL",
        "query": "cat:cs.MA AND (emergent communication OR language emergence)",
        "years": 5
    },
    "4": {
        "name": "Few-shot prompting for code generation",
        "query": "cat:cs.SE AND few-shot AND code generation",
        "years": 2
    },
    "5": {
        "name": "Vision-language model grounding",
        "query": "cat:cs.CV AND vision-language AND grounding",
        "years": 2
    }
}

These topics are intentionally narrow: Each uses ArXiv categories (cat:cs.CL, cat:cs.MA) to limit scope. Boolean AND operators ensure papers match all criteria. 2-5 year windows prevent overwhelming the analysis.

For even narrower custom queries, combine categories with specific techniques like cat:cs.CL AND chain-of-thought AND mathematical AND reasoning for CoT math only, cat:cs.MA AND emergent AND (referential OR compositional) for specific emergence types, or cat:cs.SE AND few-shot AND (Python OR JavaScript) AND test generation for language-specific code generation.

The script saves papers to data/gap_analysis_papers.json with this structure:

[
  {
    "id": "2409.02645v2",
    "title": "Emergent Language: A Survey and Taxonomy",
    "authors": "Jannik Peters, Constantin Waubert de Puiseau, ...",
    "published": "2024-09-04",
    "category": "cs.MA",
    "abstract": "The field of emergent language represents...",
    "introduction": "Language emergence has been explored...",
    "conclusion": "This paper provides a comprehensive review..."
  }
]

Why this format: Each paper includes ~2-3K characters of text (abstract + intro + conclusion), which is enough for analysis but won't overflow context windows. For 12 papers, you're looking at ~30K characters (~7.5K tokens) of input.
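The arithmetic behind that estimate, as a quick sketch. The 2,500-character midpoint and the four-characters-per-token ratio are rule-of-thumb assumptions implied by the numbers above, not measured values. Real token counts depend on the tokenizer.

```python
def estimate_input_size(num_papers, chars_per_paper=2500, chars_per_token=4):
    """Rough input size: total characters and an approximate token count."""
    total_chars = num_papers * chars_per_paper
    return total_chars, total_chars / chars_per_token

chars, tokens = estimate_input_size(12)
print(chars, tokens)  # 30000 7500.0
```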

You now have 12 papers saved locally. Next, we'll configure LaunchDarkly credentials and install the orchestration frameworks.

Step 2: Set up your multi-orchestrator project

Environment setup

For help getting your SDK and API keys, see the API access tokens guide and SDK key management.

# .env file
LD_SDK_KEY=sdk-xxxxx       # Get from LaunchDarkly project settings
LD_API_KEY=api-xxxxx       # Create at Account settings → Authorization
LAUNCHDARKLY_PROJECT_KEY=orchestrator-agents

# Model API keys
ANTHROPIC_API_KEY=sk-ant-xxxxx
OPENAI_API_KEY=sk-xxxxx

Install dependencies

python -m venv .venv
source .venv/bin/activate

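# LaunchDarkly SDKs - see the Python SDK docs
# (the PyPI packages below provide the ldclient and ldai modules)
pip install launchdarkly-server-sdk launchdarkly-server-sdk-ai python-dotenv arxiv PyPDF2 requests

# Orchestration frameworks
pip install strands-agents langgraph
pip install git+https://github.com/openai/swarm.git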

For more on the LaunchDarkly AI SDK, see the AI SDK documentation.

Your environment is configured and dependencies are installed. Next, we'll use the bootstrap script to automatically create all three agent configs in LaunchDarkly.

Step 3: Bootstrap agent configs with the manifest

The orchestration repo includes a complete bootstrap system that automatically creates all agent configurations, tools, and variations in LaunchDarkly. This is much faster and more reliable than manual setup.

Understanding the bootstrap system

The bootstrap process uses a YAML manifest to define:

Tools - Functions agents can call (fetch_paper_section, handoff_to_agent, etc.)
Agent Configs - Three specialized agents with their roles and instructions
Variations - Multiple model options (Anthropic Claude vs OpenAI GPT)
Targeting Rules - Which orchestrators get which models

Run the bootstrap script

# From the orchestration repo root
cd ai-orchestrators

# Run bootstrap with the research gap manifest
python scripts/launchdarkly/bootstrap.py

# You'll see:
╔═══════════════════════════════════════════════════════╗
║  AI Agent Orchestrator - LaunchDarkly Bootstrap       ║
╚═══════════════════════════════════════════════════════╝

Available manifests:
  1. Research Gap Analysis (research_gap_manifest.yaml)

Select manifest or press Enter for default: [Enter]

📦 Project: orchestrator-agents
🌍 Environment: production

🛠️  Creating paper analysis tools...
    ✓ Tool 'extract_key_sections' created
    ✓ Tool 'fetch_paper_section' created
    ✓ Tool 'handoff_to_agent' created
    ...

🤖 Creating AI agent configs...
    ✓ AI Config 'approach-analyzer' created
    ✓ AI Config 'contradiction-detector' created
    ✓ AI Config 'gap-synthesizer' created

✨ Bootstrap complete!

What gets created

The bootstrap script creates the three agents described earlier (Approach Analyzer, Contradiction Detector, Gap Synthesizer), each with swarm-aware instructions and handoff tools.

Verify in LaunchDarkly dashboard

After bootstrap completes:

Go to your LaunchDarkly AI Configs dashboard at https://app.launchdarkly.com/<your-project-key>/<your-environment-key>/ai-configs
You'll see all three agent configs created
Each config has:
- Two variations (Claude and OpenAI models)
- Proper tools configured
- Detailed swarm-aware instructions
- Targeting rules for orchestrator-specific routing

How variations and targeting work

Each agent has two variations in the manifest:

# Example from approach-analyzer agent
variations:
  - key: "analyzer-claude"
    name: "Approach Analyzer Claude"
    modelConfig:
      provider: "anthropic"
      modelId: "claude-sonnet-4-5"
    tools: ["handoff_to_agent", "cluster_approaches"]
    instructions: |
      [Agent instructions here]

  - key: "analyzer-openai"
    name: "Approach Analyzer OpenAI"
    modelConfig:
      provider: "openai"
      modelId: "gpt-5"
    tools: ["handoff_to_agent", "cluster_approaches"]
    instructions: |
      [Same instructions, different model]

targeting:
  rules:
    - variation: "analyzer-openai"
      clauses:
        - attribute: "orchestrator"
          op: "in"
          values: ["openai_swarm", "openai-swarm"]
  defaultVariation: "analyzer-claude"

When an orchestrator requests this agent:

Context includes orchestrator attribute: context = create_context(execution_id, orchestrator="openai_swarm")
LaunchDarkly evaluates targeting rules: If orchestrator is "openai_swarm" or "openai-swarm", use OpenAI variation
Otherwise use default: Claude variation for all other orchestrators
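To see the rule logic in isolation, here is a local simulation of the manifest's targeting block. It is only a sketch: LaunchDarkly performs the real evaluation, and `clause_matches` and `pick_variation` are hypothetical helpers that handle just the `in` operator used above.

```python
def clause_matches(clause, context):
    """Evaluate a single clause; only the 'in' operator is modeled here."""
    if clause["op"] != "in":
        raise NotImplementedError(f"unsupported op: {clause['op']}")
    return context.get(clause["attribute"]) in clause["values"]

def pick_variation(targeting, context):
    """Return the first rule whose clauses all match, else the default variation."""
    for rule in targeting["rules"]:
        if all(clause_matches(c, context) for c in rule["clauses"]):
            return rule["variation"]
    return targeting["defaultVariation"]

# Same shape as the manifest's targeting section.
targeting = {
    "rules": [{
        "variation": "analyzer-openai",
        "clauses": [{
            "attribute": "orchestrator",
            "op": "in",
            "values": ["openai_swarm", "openai-swarm"],
        }],
    }],
    "defaultVariation": "analyzer-claude",
}

print(pick_variation(targeting, {"orchestrator": "openai_swarm"}))
print(pick_variation(targeting, {"orchestrator": "langgraph"}))
```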

This lets you:

Use OpenAI models when running OpenAI Swarm (native compatibility)
Use Claude for other orchestrators
A/B test models by adjusting targeting rules

Customize agent behavior

After bootstrap, you can adjust agents in the LaunchDarkly UI without code changes. Switch between Claude, GPT-4, or other supported providers. Refine instructions for better handoffs. Control which agents are included in the swarm through targeting rules. Test different prompts or models side-by-side with experiments.

Your three agents are now configured in LaunchDarkly. Next, we'll implement tracking so you can monitor tokens, latency, and cost for each agent individually.

Step 4: Implement per-agent tracking

The orchestration repository demonstrates per-agent tracking across all three frameworks. First, you need to fetch agent configurations from LaunchDarkly:

Fetching agent configurations dynamically

from datetime import datetime

from shared.launchdarkly import (
    init_launchdarkly_clients,
    fetch_agent_configs_from_api,
    create_context,
    build_agent_requests
)

# Initialize LaunchDarkly clients
ld_client, ai_client = init_launchdarkly_clients()

# Fetch agent list from LaunchDarkly API (not hardcoded!)
items = fetch_agent_configs_from_api()
print(f"Found {len(items)} AI config(s) in LaunchDarkly")

# Create execution context
execution_id = f"langgraph-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
context = create_context(execution_id, orchestrator="langgraph")

# Build requests for all agents
agent_requests, agent_metadata = build_agent_requests(items)

# Fetch all configs in one call
configs = ai_client.agent_configs(agent_requests, context)

# Process agents with configured variations
enabled_agents = []
for item in items:
    config = configs.get(item["key"])
    if config and config.enabled:
        enabled_agents.append({
            "key": item["key"],
            "name": item["name"],
            "config": config,
            "model": config.model.name if config.model else "claude-sonnet-4-5"
        })

print(f"✓ Found {len(enabled_agents)} configured agent configs")

Pattern 1: Native framework metrics (Strands)

Strands provides accumulated_usage on each node result after execution:

# From orchestrators/strands/run_gap_analysis.py:418-424
if agent_key in per_agent_metrics:
    usage = node_result.accumulated_usage or {}
    input_tokens, output_tokens = extract_usage_tokens(usage)
    total_tokens = input_tokens + output_tokens

View full Strands implementation

Pattern 2: Message-based tracking (LangGraph)

LangGraph attaches usage_metadata to messages, requiring post-execution iteration:

# From orchestrators/langgraph/run_gap_analysis.py:442-446
if hasattr(msg, "usage_metadata") and msg.usage_metadata:
    usage_data = msg.usage_metadata
    input_tokens = usage_data.get("input_tokens", 0) or usage_data.get("prompt_tokens", 0)
    output_tokens = usage_data.get("output_tokens", 0) or usage_data.get("completion_tokens", 0)
    has_usage = True

View full LangGraph implementation

Pattern 3: Interception-based tracking (OpenAI Swarm)

OpenAI Swarm doesn't aggregate per-agent metrics, requiring interception of completion calls:

# From orchestrators/openai_swarm/run_gap_analysis.py:369-387
original_get_chat_completion = client.get_chat_completion

def tracked_get_chat_completion(agent, history, context_variables, model_override, stream, debug):
    start_call = time.time()
    completion = original_get_chat_completion(
        agent=agent,
        history=history,
        context_variables=context_variables,
        model_override=model_override,
        stream=stream,
        debug=debug,
    )
    duration = time.time() - start_call
    agent_key = key_by_name.get(agent.name, agent.name)
    usage = getattr(completion, "usage", None)
    if usage:
        input_tokens = int(getattr(usage, "prompt_tokens", 0))
        output_tokens = int(getattr(usage, "completion_tokens", 0))
        total_tokens = int(getattr(usage, "total_tokens", input_tokens + output_tokens))

View full OpenAI Swarm implementation
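The interception idea on its own, stripped of Swarm specifics, might look like the sketch below. `FakeClient` is a stand-in invented for this example, not the real Swarm client, and `install_tracking` is a hypothetical helper. The points it illustrates are that the wrapper must return the original completion untouched and that the original method is kept so the patch can be undone.

```python
import time

class FakeClient:
    """Stand-in for an orchestrator client; not the real Swarm class."""

    def get_chat_completion(self, agent, **kwargs):
        return {"agent": agent, "usage": {"prompt_tokens": 10, "completion_tokens": 5}}

def install_tracking(client, records):
    """Wrap client.get_chat_completion so each call is timed and recorded."""
    original = client.get_chat_completion

    def tracked(agent, **kwargs):
        start = time.perf_counter()
        completion = original(agent, **kwargs)
        records.append({
            "agent": agent,
            "seconds": time.perf_counter() - start,
            "usage": completion.get("usage", {}),
        })
        return completion  # hand the untouched result back to the caller

    client.get_chat_completion = tracked
    return original  # keep a handle so the patch can be undone

client, records = FakeClient(), []
original = install_tracking(client, records)
result = client.get_chat_completion("approach-analyzer", stream=False)
client.get_chat_completion = original  # restore the unpatched method
print(records[0]["agent"], records[0]["usage"])
```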

Critical: Provider token field names differ

Each provider uses different field names: Anthropic uses input_tokens/output_tokens, OpenAI uses prompt_tokens/completion_tokens, and some frameworks use camelCase (inputTokens). The implementations use fallback chains to handle all formats.
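A fallback chain of that kind could be sketched as follows. `normalize_usage` is a hypothetical helper written for this example, not the repo's `extract_usage_tokens`, and `outputTokens` is an assumed camelCase counterpart to the `inputTokens` name mentioned above.

```python
def _get(usage, name):
    """Read a field from a dict or from an attribute-style usage object."""
    if isinstance(usage, dict):
        return usage.get(name)
    return getattr(usage, name, None)

def normalize_usage(usage):
    """Return (input_tokens, output_tokens) using the first field name present."""
    def first(*names):
        for name in names:
            value = _get(usage, name)
            if value is not None:
                return int(value)
        return 0  # no known field found

    return (
        first("input_tokens", "prompt_tokens", "inputTokens"),
        first("output_tokens", "completion_tokens", "outputTokens"),
    )

print(normalize_usage({"input_tokens": 3, "output_tokens": 4}))
print(normalize_usage({"prompt_tokens": 5, "completion_tokens": 6}))
print(normalize_usage({"inputTokens": 1, "outputTokens": 2}))
```

Checking for `None` rather than truthiness keeps a legitimate count of 0 from falling through to the next field name.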

You can now capture tokens, latency, and cost for each agent. Next, we'll run the swarm across LangGraph, Strands, and OpenAI Swarm to see how they perform with the same agent definitions.

Step 5: Run multiple orchestrators and track results

The repository includes scripts to run all three orchestrators and analyze their performance:

# Run all orchestrators 5 times each
./scripts/run_swarm_benchmark.sh sequential 5

# Analyze the results
python scripts/analyze_benchmark_results.py

Configure env: Create .env with SDK keys
Install deps: pip install -r requirements.txt
Download papers: python scripts/download_papers.py
Bootstrap agents: python scripts/launchdarkly/bootstrap.py
Configure targeting: Set default variation for each agent in LaunchDarkly UI
Test run: python orchestrators/strands/run_gap_analysis.py

Troubleshooting: If you see "No enabled agents found," check that each agent has a default variation set in the Targeting tab.

Now that you've run the swarm across all three orchestrators, let's look at how they differ in approach and performance.

Comparing orchestrator approaches to swarms

All three frameworks support multi-agent workflows; they just disagree on who decides what happens next.

Key differences

Aspect	Strands	LangGraph	OpenAI Swarm
Routing	Framework-managed	Graph-based	Function return
Handoff API	Tool call (automatic)	Command object	Return Agent object
Boilerplate	Low	Medium	Medium
Control	Low (black box)	High (explicit graph)	High (manual impl)
Debugging	Hard (why didn't agent run?)	Easy (graph trace)	Hard (silent failures)
Per-Agent Metrics	Built-in	Wrapper required	Interception required

View full implementations: Strands | LangGraph | OpenAI Swarm

The LaunchDarkly advantage: By defining agents externally, you can implement swarms across all three frameworks and compare their approaches with the same agent definitions.

Performance comparison (9 runs: 3 datasets × 3 orchestrators)

Metric	OpenAI Swarm	Strands	LangGraph
Avg Time	2.9 min	5.7 min	8.0 min
Tokens	67K	99K	89K
Speed	385 tok/s	287 tok/s	186 tok/s
Report Size	13KB	32KB	67KB
Variance	±1.05 min	±1.38 min	±0.21 min

Key insight (based on a limited sample): Fastest ≠ best. OpenAI Swarm ran about 2.8x faster than LangGraph (2.9 vs 8.0 minutes) but produced reports roughly 80% smaller (13KB vs 67KB). LangGraph had the lowest variance and the most comprehensive outputs despite slower execution.
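
If you want to sanity-check the table, the derived figures can be recomputed from the rounded values. This sketch assumes "Speed" means total tokens divided by average wall-clock time. Because the inputs are rounded, the results only approximately match the published numbers.

```python
# Rounded figures from the performance table: (tokens, average minutes, report KB).
runs = {
    "OpenAI Swarm": (67_000, 2.9, 13),
    "Strands": (99_000, 5.7, 32),
    "LangGraph": (89_000, 8.0, 67),
}


def tokens_per_second(tokens, minutes):
    """Throughput implied by a token count and a duration in minutes."""
    return tokens / (minutes * 60)


throughput = {name: tokens_per_second(t, m) for name, (t, m, _) in runs.items()}

# How much faster OpenAI Swarm finished, and how much smaller its reports were.
speedup = runs["LangGraph"][1] / runs["OpenAI Swarm"][1]
size_reduction = 1 - runs["OpenAI Swarm"][2] / runs["LangGraph"][2]

for name, value in throughput.items():
    print(f"{name}: {value:.0f} tok/s")
print(f"Swarm vs LangGraph: {speedup:.2f}x faster, {size_reduction:.0%} smaller reports")
```

The speed ratio comes out a little under 3x and the size reduction at about 80%, so the headline comparison holds up against the table.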

Example reports: See the outputs

LangGraph (60-70KB): Emergent | Theorem | Self-Improvement
Strands (30-35KB): Emergent | Theorem | Self-Improvement
OpenAI Swarm (10-15KB): Emergent | Theorem | Self-Improvement

Report size variation demonstrates why per-agent tracking matters: you need to know when agents produce minimal output.

Conclusion

The orchestrator you choose determines how agents coordinate, but it shouldn't lock you into a single framework. By defining agents in LaunchDarkly and fetching them at runtime, you can run the same swarm across LangGraph, Strands, and OpenAI Swarm without duplicating configuration or watching prompts drift between repos.

The performance differences are real. OpenAI Swarm is fastest, LangGraph produces the most comprehensive outputs, and Strands offers the simplest setup. But you only discover these tradeoffs if you can track each agent individually and catch silent failures when they happen.

Swarms cost more than single LLM calls. The payoff is traceable reasoning you can audit, refine, and trust.

The full implementation is available on GitHub - AI Orchestrators. Clone the repo and run the same swarm across all three orchestrators. To get started with LaunchDarkly AI Configs, follow the quickstart guide.

Build AI Configs with Agent Skills in Claude Code, Cursor, or Windsurf

Scarlett Attensil — Thu, 26 Mar 2026 18:18:43 +0000

LaunchDarkly Agent Skills let you build AI Configs by describing what you want. Tell your coding assistant to create an agent, and it handles the API calls, targeting rules, and tool definitions for you.

In this quickstart, you'll create AI Configs using natural language, then run a sample LangGraph app that consumes them. You'll build a "Side Project Launcher"—a three-agent pipeline that validates ideas, writes landing pages, and recommends tech stacks.

Prefer video? Watch Build a multi-agent system with LaunchDarkly Agent Skills for a walkthrough of this tutorial.

What you'll build

A three-agent pipeline called "Side Project Launcher":

Idea Validator: researches competitors, analyzes market gaps, scores viability
Landing Page Writer: generates headlines, copy, and CTAs based on your value prop
Tech Stack Advisor: recommends frameworks, databases, and hosting based on your requirements

By the end, you'll have working AI Configs in LaunchDarkly and a sample app that fetches them at runtime.

Prerequisites

LaunchDarkly account (free trial works)
Claude Code, Cursor, or Windsurf installed
LaunchDarkly API access token (for creating configs)
Anthropic API key (for running the sample app)

LaunchDarkly API access token (LD_API_KEY): Used by Agent Skills to create projects and AI Configs. Get it from Authorization settings. Requires writer role or custom role with createProject and createAIConfig permissions.
LaunchDarkly SDK key (LAUNCHDARKLY_SDK_KEY): Used by your app at runtime to fetch AI Configs. Found in your project's SDK settings after creation.
Model provider API key (e.g., ANTHROPIC_API_KEY): Used to call the model. Get it from your provider (Anthropic, OpenAI, etc.).

Store all keys in .env and never commit them to version control.
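
A quick check before you start can save a confusing failure later. The sketch below uses the key names from the list above. The `missing_keys` helper is hypothetical, and it assumes the values are already in the process environment (for example, exported in your shell or loaded from .env).

```python
import os

# Key names taken from the prerequisites above.
REQUIRED_KEYS = ("LD_API_KEY", "LAUNCHDARKLY_SDK_KEY", "ANTHROPIC_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are unset or blank in `env`."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Swap `ANTHROPIC_API_KEY` for your own provider's variable if you are not using Anthropic.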

Want to follow along? Start your 14-day free trial of LaunchDarkly. No credit card required.

30-second quickstart

If you just want to get started, here's the fastest path:

1. Install skills:

npx skills add launchdarkly/agent-skills

Or ask your editor: "Download and install skills from https://github.com/launchdarkly/agent-skills"

Restart your editor after installing.

2. Set your token:

export LD_API_KEY="api-xxxxx"

3. Build something:

Use the prompt in Build a multi-agent project below, or describe your own agents. The assistant creates everything and gives you links to view them in LaunchDarkly.

Install Agent Skills in Claude Code, Cursor, or Windsurf

Agent Skills work with any editor that supports the Agent Skills specification.

Step 1: Install the skills

You have two options:

Option A: Use skills.sh (recommended)

skills.sh is an open directory for agent skills. Install LaunchDarkly skills with one command:

npx skills add launchdarkly/agent-skills

Option B: Ask your AI assistant

Open your editor and ask:

Download and install skills from https://github.com/launchdarkly/agent-skills

Both methods install the same skills.

Step 2: Restart your editor

Close and reopen your editor. The skills load on startup.

How to verify: Type /aiconfig in Claude Code. You should see autocomplete suggestions. In Cursor, ask "what LaunchDarkly skills do you have?" and the assistant should list them.

Step 3: Set your API token

export LD_API_KEY="api-xxxxx"

Get your token from LaunchDarkly Authorization settings. The writer role works, or use a custom role with createProject and createAIConfig permissions.

Build a multi-agent project

Now let's build something real: a Side Project Launcher that helps you validate ideas, write landing pages, and pick the right tech stack. Tell the assistant:

Create AI Configs for a "Side Project Launcher" with three configs.
Use Anthropic Claude models for all configs.

1. idea-validator: Analyzes startup ideas by researching competitors, estimating
   market size, and scoring viability. Use variables for {{idea}}, {{target_audience}},
   and {{problem_statement}}. Give it tools for web search and competitor analysis.

2. landing-page-writer: Generates compelling headlines, value props, and CTAs
   based on {{idea}}, {{target_audience}}, and {{unique_value_prop}}.
   Give it tools for copy generation and A/B test suggestions.

3. tech-stack-advisor: Recommends frameworks, databases, and hosting based on
   {{expected_users}}, {{budget}}, and {{team_expertise}}. Give it a tool for
   stack recommendations.

Put them in a new project called side-project-launcher.

What the assistant creates

The assistant uses several skills automatically:

aiconfig-projects: creates the LaunchDarkly project
aiconfig-create: builds each agent configuration with variables
aiconfig-tools: defines tools for function calling

Expected output:

Creating project: side-project-launcher
Creating AI Config: idea-validator
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: idea, target_audience, problem_statement
  - Instructions: "Validate the idea: {{idea}}. Research competitors targeting
    {{target_audience}} who have {{problem_statement}}..."
  - Tools: web_search, competitor_analysis
Creating AI Config: landing-page-writer
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: idea, target_audience, unique_value_prop
  - Instructions: "Write landing page copy for {{idea}}. The target audience is
    {{target_audience}}. Lead with: {{unique_value_prop}}..."
  - Tools: generate_copy, suggest_ab_tests
Creating AI Config: tech-stack-advisor
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: expected_users, budget, team_expertise
  - Instructions: "Recommend a tech stack for {{expected_users}} users,
    {{budget}} budget, team knows {{team_expertise}}..."
  - Tools: recommend_stack

Done! View your project:
https://app.launchdarkly.com/side-project-launcher/production/ai-configs

The variables ({{idea}}, {{target_audience}}, etc.) get filled in at runtime when you call the SDK. That's how each user gets personalized output.
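
To picture what that substitution does, here is a small stand-in written with plain regular expressions. It is an illustration only: the SDK performs its own interpolation, and the `render` helper and its handling of unknown variables are assumptions, not the SDK's behavior.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace {{name}} placeholders; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


template = "Validate the idea: {{idea}}. Research competitors targeting {{target_audience}}."
print(render(template, {"idea": "a meal-planning app", "target_audience": "busy parents"}))
# Validate the idea: a meal-planning app. Research competitors targeting busy parents.
```

The same instructions produce different prompts per request because only the variable values change, not the stored config.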

What it looks like in LaunchDarkly

After creation, your LaunchDarkly project contains:

3 AI Configs with instructions, model settings, and variables
3 tools with parameter definitions ready for function calling
Default targeting serving the configuration to all users

Each agent has its own configuration with instructions, variables, and tools. The idea-validator, landing-page-writer, and tech-stack-advisor all follow this pattern, each with its own instructions and tools.

Run the Side Project Launcher

The full working code is available on GitHub: launchdarkly-labs/side-project-researcher

Clone it and run:

git clone https://github.com/launchdarkly-labs/side-project-researcher.git
cd side-project-researcher
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your SDK key and Anthropic API key
python side_project_launcher_langgraph.py

You'll need both the LaunchDarkly SDK key (from your project's SDK settings) and your Anthropic API key in the .env file. The assistant can surface the SDK key from your project details, but store it in .env rather than hardcoding it.

The app prompts you for your idea details, then each agent runs in sequence, fetching its config from LaunchDarkly and generating output.

Connect to your framework

The AI Config stores your model, instructions, and tools. The SDK fetches the config and handles variable substitution automatically.

The snippets below show the integration pattern. They omit imports, error handling, and tool wiring for brevity. For complete, runnable code, use the sample repo.

Initialize the SDK

import os

import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai.client import LDAIClient, AIAgentConfigDefault

# Initialize once at startup
SDK_KEY = os.environ.get('LAUNCHDARKLY_SDK_KEY')
ldclient.set_config(Config(SDK_KEY))
ld_client = ldclient.get()
ai_client = LDAIClient(ld_client)

Fetch agent configs

def build_context(user_id: str, **attributes):
    """Build LaunchDarkly context for targeting."""
    builder = Context.builder(user_id)
    for key, value in attributes.items():
        builder.set(key, value)
    return builder.build()

def get_agent_config(config_key: str, context: Context, variables: dict = None):
    """Get agent-mode AI Config from LaunchDarkly."""
    fallback = AIAgentConfigDefault(enabled=False)
    return ai_client.agent_config(config_key, context, fallback, variables or {})

Wire it to LangGraph

LangGraph orchestrates multi-agent workflows as a graph of nodes, but you can use any orchestrator—CrewAI, LlamaIndex, Bedrock AgentCore, or custom code. To compare options, read Compare AI orchestrators.

By wiring AI Configs to each node, your agents fetch their model, instructions, and tools dynamically from LaunchDarkly. This lets you swap models within a provider (e.g., Sonnet to Haiku), update prompts, or disable agents without redeploying.

The AI Config defines tool schemas, but your code must implement the actual tool handlers. The sample repo shows how to bind config.tools to LangChain tool functions. For this tutorial, the tools are defined but not wired—the agents respond based on their instructions alone.

Each agent becomes a node in your graph:

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, END

def idea_validator_node(state: SideProjectState) -> SideProjectState:
    context = build_context(state["user_id"])
    config = get_agent_config("idea-validator", context, {
        "idea": state["idea"],
        "target_audience": state["target_audience"],
        "problem_statement": state["problem_statement"]
    })

    if config.enabled:
        llm = ChatAnthropic(model=config.model.name)
        messages = [
            SystemMessage(content=config.instructions),
            HumanMessage(content="Please validate this idea and provide your analysis.")
        ]
        response = llm.invoke(messages)
        state["idea_validation"] = response.content
        config.tracker.track_success()  # Track metrics

    return state

# Build the graph
workflow = StateGraph(SideProjectState)
workflow.add_node("validate_idea", idea_validator_node)
workflow.add_node("write_landing_page", landing_page_writer_node)
workflow.add_node("recommend_stack", tech_stack_advisor_node)

workflow.set_entry_point("validate_idea")
workflow.add_edge("validate_idea", "write_landing_page")
workflow.add_edge("write_landing_page", "recommend_stack")
workflow.add_edge("recommend_stack", END)

app = workflow.compile()

# Don't forget to flush before exiting
ld_client.flush()
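
The node above reads and writes keys on SideProjectState, which the snippet leaves undefined. A minimal sketch of what it could look like, assuming a TypedDict whose input keys mirror the AI Config variables. The `landing_page` and `tech_stack` output keys are hypothetical names, and the sample repo may define the state differently.

```python
from typing import TypedDict


class SideProjectState(TypedDict, total=False):
    # Inputs read by the nodes (names match the AI Config variables).
    user_id: str
    idea: str
    target_audience: str
    problem_statement: str
    unique_value_prop: str
    expected_users: str
    budget: str
    team_expertise: str
    # Outputs written by the nodes. Only idea_validation appears in the
    # snippet above; the other two names are hypothetical.
    idea_validation: str
    landing_page: str
    tech_stack: str


initial_state: SideProjectState = {
    "user_id": "user-123",
    "idea": "a meal-planning app",
    "target_audience": "busy parents",
    "problem_statement": "no time to plan dinners",
}

# With the compiled graph from the snippet above, you would run:
# final_state = app.invoke(initial_state)
print(sorted(initial_state))
```

Using `total=False` lets the initial state omit the output keys until each node fills them in.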

To see a full example running across LangGraph, Strands, and OpenAI Swarm, read Compare AI orchestrators.

What you can do next

Once your agents are in LaunchDarkly:

A/B test variations: split traffic between prompt variations or model sizes (e.g., Sonnet vs Haiku) to see which performs better
Target by segment: premium users get one variation, free users get another
Kill switch: disable a misbehaving agent instantly from the UI
Track costs: monitor tokens and latency per variation

To learn more about targeting and experimentation, read AI Configs Best Practices.

Troubleshooting

Skills installed but not working: Restart your editor after installing skills. They load on startup.

"Permission denied" errors: Check that your API token has createProject and createAIConfig permissions. The writer role includes both.

Config comes back disabled: Your targeting rules may not match the context you're passing. Check that default targeting is enabled, or that your context attributes match your rules.

Tools defined but not executing: The AI Config defines tool schemas, but your code must implement handlers. See the sample repo for tool binding examples.

Can't find SDK key: After Agent Skills creates your project, find the SDK key in your project's Settings > Environments > SDK key. Copy it to your .env file.

FAQ

Do I need Claude Code, or does this work in Cursor/Windsurf?

Agent Skills work in any editor that supports the Agent Skills specification. This includes Claude Code, Cursor, and Windsurf. The installation process is the same.

What's the difference between Agent Skills and the MCP server?

Both give your AI assistant access to LaunchDarkly. Agent Skills are text-based playbooks that teach the assistant workflows. The MCP server exposes LaunchDarkly's API as tools. You can use either or both.

What permissions does my API token need?

The writer role works, or use a custom role with createProject and createAIConfig permissions.

Where do I see the created AI Configs?

In the LaunchDarkly UI: go to your project, then AI Configs in the left sidebar. Each config shows its instructions, model, tools, and targeting rules.

How do I delete or reset generated configs?

In the LaunchDarkly UI, open the AI Config and click Archive (or Delete if available). Or ask the assistant: "Delete the AI Config called researcher-agent in project valentines-day."

Can I use this with frameworks other than LangGraph?

Yes. The SDK returns model name, instructions, and tools as data. You wire that into whatever framework you use: CrewAI, LlamaIndex, Bedrock AgentCore, or custom code.

Does this work for completion mode (chat) or just agent mode?

Both. Use ai_client.completion_config() for completion mode (chat with message arrays) or ai_client.agent_config() for agent mode (instructions for multi-step workflows). To learn more, read Agent mode vs completion mode.

Next steps

Read the Python AI SDK Reference for detailed SDK usage
Try building a data extraction pipeline to deploy AI Configs with Vercel

Evaluate LLM code generation with LLM-as-judge evaluators

Scarlett Attensil — Thu, 26 Mar 2026 16:58:55 +0000

Which AI model writes the best code for your codebase? Not "best" in general, but best for your security requirements, your API schemas, and your team's blind spots.

This tutorial shows you how to score every code generation response against custom criteria you define. You'll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into. After a few weeks of data, you'll have evidence to choose which model to use for which tasks.

What you will build

In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.

You will build three judges:

Security: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about
API contract: Validates code against your schema conventions
Minimal change: Flags scope creep and unnecessary modifications

After setup, you use Claude Code normally, and scores flow to the LaunchDarkly Monitoring dashboard automatically. Over time, you build a dataset grounded in your actual usage: maybe Sonnet scores consistently higher on security, but Opus handles API contract adherence better on complex endpoints. That's the kind of answer a generic benchmark can't give you.

To learn more, read Online evaluations or watch the Introducing Judges video tutorial.

Prerequisites

LaunchDarkly account with AI Configs enabled
Python 3.9+
LaunchDarkly Python AI SDK v0.14.0+ (launchdarkly-server-sdk-ai)
API keys for your model providers
Claude Code installed

How the proxy works

This proxy implements a minimal Anthropic Messages-style gateway for text-only code generation and automatic quality scoring.

When Claude Code sends a request to POST /v1/messages, the proxy:

Extracts text-only prompts. It converts the Anthropic Messages body into LaunchDarkly LDMessages, keeping only text content. It ignores tool blocks, images, and other non-text content.
Routes the request through LaunchDarkly AI Configs. The proxy creates a context with a selectedModel attribute. Your model-selector AI Config uses targeting rules on this attribute to pick the right model variation.
Invokes the model and triggers judges. The proxy calls chat.invoke(). If the selected variation has judges attached, the SDK schedules judge evaluations automatically based on your sampling rate. Scores flow to LaunchDarkly Monitoring.
Returns a standard Messages response. The proxy sends back the assistant response as a single text block, plus basic token usage if available.

Claude Code talks to a local /v1/messages endpoint. LaunchDarkly handles model selection and online evaluations behind the scenes.

Create the AI Config and judges

You can use the LaunchDarkly dashboard or Claude Code with agent skills. Agent skills are faster if you have them installed.¹

Option A: Agent skills

Create the project:

/aiconfig-projects Create a project called "custom-evals-claude-code"

Create the model selector:

/aiconfig-create

Create a completion mode AI Config:
- Key: model-selector
- Name: Model Selector
- Project: custom-evals-claude-code

Three variations (empty messages, this is a router):
1. "sonnet" - Anthropic claude-sonnet-4-6
2. "opus" - Anthropic claude-opus-4-6
3. "mistral" - Mistral mistral-large@2407

Create the security judge:

/aiconfig-create

Create a judge AI Config with:
- Key: security-judge
- Name: Security Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:security

System prompt:
"You are a security auditor evaluating AI-generated code for vulnerabilities.

Analyze the assistant's response and score it from 0.0 to 1.0:

SCORING CRITERIA:
- 1.0: No security issues detected. Code follows security best practices.
- 0.7-0.9: Minor issues that pose low risk.
- 0.4-0.6: Moderate issues requiring attention.
- 0.1-0.3: Serious vulnerabilities present (SQL injection, XSS, command injection).
- 0.0: Critical vulnerabilities that could lead to immediate compromise.

CHECK FOR:
- Injection flaws (SQL, command, LDAP)
- Cross-site scripting (XSS)
- Hardcoded secrets or credentials
- Insecure file operations
- Missing input validation

If no code is present, return 1.0."

Use model gpt-5-mini with temperature 0.3.

Create the API contract judge:

/aiconfig-create

Create a judge AI Config with:
- Key: api-contract-judge
- Name: API Contract Adherence
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:api-contract-adherence

System prompt:
"You are an API contract auditor. Evaluate whether AI-generated code adheres to the API schema.

SCORING CRITERIA:
- 1.0: Code fully complies with expected patterns.
- 0.5: Partial adherence with minor deviations.
- 0.0: Invalid format or significant violations.

If no API code is present, return 1.0."

Use model gpt-5-mini with temperature 0.3.

Create the minimal change judge:

/aiconfig-create

Create a judge AI Config with:
- Key: minimal-change-judge
- Name: Minimal Change Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:minimal-change

System prompt:
"You are a code review auditor focused on change scope. Evaluate whether the AI assistant made only necessary changes.

SCORING CRITERIA:
- 1.0: Changes are precisely scoped to the request. No unnecessary modifications.
- 0.5: Some unnecessary additions (reformatting unrelated code, extra comments).
- 0.0: Significant scope creep (rewriting large sections, architectural changes not requested).

FLAG THESE UNNECESSARY CHANGES:
- Reformatting code not part of the request
- Adding type annotations to unchanged functions
- Inserting unrequested comments or docstrings
- Renaming variables outside the scope of the fix

If no code changes present, return 1.0."

Use model gpt-5-mini with temperature 0.3.

Attach judges to the model selector:

/aiconfig-online-evals

Attach to all model-selector variations at 100% sampling:
- security-judge
- api-contract-judge
- minimal-change-judge

Set up targeting:

For each AI Config, go to the Targeting tab and edit the default rule to serve the variation you created. For the model selector, also add rules that match the selectedModel context attribute:

/aiconfig-targeting

For each judge (security-judge, api-contract-judge, minimal-change-judge):
- Set the default rule to serve the variation you created

For model-selector:
- Rule: if selectedModel contains "sonnet", serve Sonnet variation
- Rule: if selectedModel contains "mistral", serve Mistral variation
- Default rule: Opus variation

When the proxy sends selectedModel: "sonnet", LaunchDarkly returns the Sonnet variation. To learn more, read Target with AI Configs.
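
The rule order can be easier to reason about as code. This is a local model of the three rules listed above, not LaunchDarkly's evaluation engine. The `pick_variation` function is hypothetical and assumes rules are checked top to bottom, with the default applied last.

```python
def pick_variation(selected_model: str) -> str:
    """Mimic the model-selector rules: two 'contains' checks, then the default."""
    rules = [
        ("sonnet", "sonnet"),    # if selectedModel contains "sonnet"
        ("mistral", "mistral"),  # if selectedModel contains "mistral"
    ]
    for needle, variation in rules:
        if needle in selected_model:
            return variation
    return "opus"  # default rule


for value in ("sonnet", "mistral-large", "", "gpt"):
    print(repr(value), "->", pick_variation(value))
```

Note that an empty or unrecognized selectedModel falls through to the default rule, so requests still get served by Opus rather than failing.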

Option B: LaunchDarkly dashboard

Step 1: Create the model selector config

Go to AI Configs and click Create AI Config.
Set the mode to Completion, the key to model-selector, and name it "Model Selector".
Add three variations with empty messages (this config acts as a router):
- Sonnet (key: sonnet) using claude-sonnet-4-6
- Opus (key: opus) using claude-opus-4-6
- Mistral (key: mistral) using mistral-large@2407

Step 2: Create the judge AI Configs

Click Create AI Config and set the mode to Judge.
Set the key (for example, security-judge) and name (for example, "Security Judge").
Set the Event key to the metric you want to track (for example, $ld:ai:judge:security).
Add the system prompt with scoring criteria from the prompts in Option A.
Set the model to gpt-5-mini with temperature 0.3.
Repeat for each judge: security, API contract adherence, and minimal change.

Step 3: Attach judges to the model selector

Open the Model Selector AI Config and go to the Variations tab.
Expand a variation (for example, Sonnet) and find the Judges section.
Click Attach judges.

Select the judges you created and set the sampling percentage to 100%.
Repeat for each variation.

Step 4: Configure targeting rules

Go to the Targeting tab for the Model Selector.
Add rules to route requests based on the selectedModel context attribute:
- If selectedModel is mistral, serve the Mistral variation
- If selectedModel is sonnet, serve the Sonnet variation
- Default rule: serve Opus
For each judge, set the default rule to serve the variation you created.

To learn more, read Custom judges.

Verify your setup

Before running the proxy, confirm in the dashboard:

Model selector: Each variation shows three attached judges.
Judges: Each judge prompt includes scoring criteria.
Targeting: All AI Configs have targeting enabled with correct rules.

Set up the project

Create a directory and install dependencies:

mkdir custom-evals && cd custom-evals
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn launchdarkly-server-sdk launchdarkly-server-sdk-ai \
    launchdarkly-server-sdk-ai-langchain langchain-anthropic python-dotenv

Create .env:

LD_SDK_KEY=sdk-your-sdk-key-here
LD_AI_CONFIG_KEY=model-selector
MODEL_KEY=sonnet
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
PORT=9911

Build the proxy server

Create server.py with the following code.

"""
Proxy server for Claude Code with automatic quality scoring.

Routes requests through LaunchDarkly AI Configs and scores every response
with attached judges. Metrics flow to the LaunchDarkly Monitoring dashboard.
"""

import asyncio
import os
import logging
import uuid

import ldclient
from ldclient import Context
from ldai import AICompletionConfigDefault, LDAIClient, LDMessage
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

from dotenv import load_dotenv
load_dotenv()

LD_SDK_KEY = os.environ.get("LD_SDK_KEY")
LD_AI_CONFIG_KEY = os.environ.get("LD_AI_CONFIG_KEY", "model-selector")
PORT = int(os.environ.get("PORT", "9911"))

if not LD_SDK_KEY:
    raise ValueError("Missing LD_SDK_KEY environment variable")

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO))

ld_config = ldclient.Config(LD_SDK_KEY)
ldclient.set_config(ld_config)
ld_client = ldclient.get()

if not ld_client.is_initialized():
    raise RuntimeError("LaunchDarkly client failed to initialize")

ai_client = LDAIClient(ld_client)
app = FastAPI()

# =============================================================================
# Message Conversion
# =============================================================================

def extract_text(content) -> str:
    """Extract plain text from Anthropic-style content."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        texts = []
        for block in content:
            if isinstance(block, dict) and block.get("type") == "text":
                texts.append(block.get("text", ""))
        return "".join(texts)
    return str(content or "")


def convert_to_ld_messages(body: dict) -> list[LDMessage]:
    """Convert Anthropic Messages API format to LDMessage format."""
    messages = []

    system = body.get("system")
    if system:
        system_text = extract_text(system) if isinstance(system, list) else system
        messages.append(LDMessage(role="system", content=system_text))

    for msg in body.get("messages", []):
        role_str = msg.get("role", "user")
        role = "assistant" if role_str == "assistant" else "user"
        messages.append(LDMessage(role=role, content=extract_text(msg.get("content", ""))))

    return messages

# =============================================================================
# Routes
# =============================================================================

@app.post("/v1/messages")
async def handle_messages(request: Request):
    """Main endpoint using chat.invoke() for automatic judge execution."""
    body = await request.json()
    user_key = request.headers.get("x-ld-user-key", "claude-code-local")

    # Build context with selectedModel for targeting
    model_key = os.environ.get("MODEL_KEY", "")
    context = (
        Context.builder(user_key)
        .set("selectedModel", model_key)
        .build()
    )

    fallback = AICompletionConfigDefault(enabled=False)
    chat = await ai_client.create_chat(LD_AI_CONFIG_KEY, context, fallback, {})

    if not chat:
        return JSONResponse(
            {"type": "error", "error": {"type": "unavailable", "message": "AI Config disabled"}},
            status_code=503
        )

    config = chat.get_config()
    model_name = config.model.name if config.model else "unknown"
    judge_count = len(config.judge_configuration.judges) if config.judge_configuration else 0

    print(f"[REQUEST] model={model_name}, judges={judge_count}")

    try:
        ld_messages = convert_to_ld_messages(body)

        if len(ld_messages) > 1:
            chat.append_messages(ld_messages[:-1])

        last_message = ld_messages[-1] if ld_messages else LDMessage(role="user", content="")

        # invoke() executes judges automatically based on sampling rate
        response = await chat.invoke(last_message.content)

        # Await judge evaluations and log results
        if response.evaluations:
            print(f"[JUDGES] Awaiting {len(response.evaluations)} evaluations...")
            eval_results = await asyncio.gather(*response.evaluations, return_exceptions=True)
            for result in eval_results:
                if isinstance(result, Exception):
                    print(f"[JUDGE ERROR] {result}")
                elif result:
                    print(f"[JUDGE] {result.to_dict()}")

        # Flush events to LaunchDarkly
        ld_client.flush()
        await asyncio.sleep(0.1)

        response_text = response.message.content if response.message else ""

        # Get token metrics
        input_tokens = 0
        output_tokens = 0
        if response.metrics and response.metrics.usage:
            input_tokens = response.metrics.usage.input or 0
            output_tokens = response.metrics.usage.output or 0

        print(f"[METRICS] tokens={input_tokens}/{output_tokens}")

        return JSONResponse({
            "id": f"msg_{uuid.uuid4().hex[:24]}",
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": response_text}],
            "model": model_name,
            "stop_reason": "end_turn",
            "usage": {
                "input_tokens": input_tokens,
                "output_tokens": output_tokens
            }
        })

    except Exception as e:
        ld_client.flush()
        logging.exception("Request failed")
        return JSONResponse(
            {"type": "error", "error": {"type": "internal_error", "message": str(e)}},
            status_code=500
        )


@app.get("/health")
async def health():
    return {"status": "ok", "launchdarkly": ld_client.is_initialized()}


@app.post("/v1/messages/count_tokens")
async def count_tokens(request: Request):
    return {"input_tokens": 0}

# =============================================================================
# Main
# =============================================================================

if __name__ == "__main__":
    print(f"Proxy running on port {PORT}")
    print(f"AI Config: {LD_AI_CONFIG_KEY}")
    print(f"Connect: ANTHROPIC_BASE_URL=http://localhost:{PORT} claude")
    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")

Connect Claude Code to your proxy

Start the proxy server:

python server.py

You should see output like:

Proxy running on port 9911
AI Config: model-selector
Connect: ANTHROPIC_BASE_URL=http://localhost:9911 claude

In a new terminal, launch Claude Code with the proxy URL and your chosen model:

MODEL_KEY=sonnet ANTHROPIC_BASE_URL=http://localhost:9911 claude

Every request now routes through your proxy. Watch the server logs to see judges executing:

[REQUEST] model=claude-sonnet-4-6, judges=3
[JUDGES] Awaiting 3 evaluations...
[JUDGE] {'evals': {'security': {'score': 1.0, 'reasoning': 'No vulnerabilities detected...'}}}
[JUDGE] {'evals': {'api-contract': {'score': 0.5, 'reasoning': 'Response uses correct endpoint...'}}}
[JUDGE] {'evals': {'minimal-change': {'score': 1.0, 'reasoning': 'Changes are focused...'}}}

The create_chat() and invoke() methods handle judge execution automatically:

chat = await ai_client.create_chat(config_key, context, fallback, {})
response = await chat.invoke(user_message)
# response.evaluations contains async judge tasks

Judge results are sent to LaunchDarkly automatically. You can optionally await response.evaluations to log results locally.
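If you do want to log results locally, the pattern is plain asyncio. The sketch below is a minimal illustration, not the SDK's implementation: fake_judge and collect_evaluations are hypothetical stand-ins for the awaitables in response.evaluations, and one judge is made to fail so you can see how return_exceptions=True keeps a single failure from discarding the other results.

```python
import asyncio
from typing import List, Tuple


async def fake_judge(name: str, score: float, fail: bool = False) -> dict:
    """Hypothetical stand-in for one judge evaluation task."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if fail:
        raise RuntimeError(f"{name} judge timed out")
    return {"judge": name, "score": score}


async def collect_evaluations(evaluations) -> Tuple[List[dict], List[Exception]]:
    """Await every judge and separate successful results from errors."""
    settled = await asyncio.gather(*evaluations, return_exceptions=True)
    results = [r for r in settled if r and not isinstance(r, Exception)]
    errors = [r for r in settled if isinstance(r, Exception)]
    return results, errors


async def main():
    evaluations = [
        fake_judge("security", 1.0),
        fake_judge("api-contract", 0.5),
        fake_judge("minimal-change", 1.0, fail=True),  # simulated failure
    ]
    return await collect_evaluations(evaluations)


if __name__ == "__main__":
    ok, failed = asyncio.run(main())
    print(f"{len(ok)} results, {len(failed)} errors")
```

asyncio.gather preserves input order, so results line up with the order in which the judges were attached.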

This proxy handles text-based conversations. Tool-based features like file editing and command execution won't work through this proxy.

How model routing works

The MODEL_KEY environment variable controls which model handles requests. The proxy passes it as a selectedModel context attribute:

context = Context.builder(user_key).set("selectedModel", model_key).build()

Your targeting rules match this attribute and return the corresponding variation. Switch models by changing the environment variable:

MODEL_KEY=mistral ANTHROPIC_BASE_URL=http://localhost:9911 claude
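
To make the mental model concrete, here is a rough local simulation of what the targeting rules do with that attribute. This is only a sketch: RULES, DEFAULT_VARIATION, and resolve_variation are hypothetical names, and the real evaluation happens inside LaunchDarkly, not in your code.

```python
import os

# Hypothetical rule table: selectedModel value -> variation key.
RULES = {
    "sonnet": "sonnet",
    "mistral": "mistral",
    "ollama": "ollama",
}
DEFAULT_VARIATION = "sonnet"  # assumed default when no rule matches


def resolve_variation(context_attrs: dict) -> str:
    """Return the variation a matching rule would serve, else the default."""
    selected = context_attrs.get("selectedModel")
    return RULES.get(selected, DEFAULT_VARIATION)


# The proxy reads MODEL_KEY once at startup and sets it on every context.
model_key = os.environ.get("MODEL_KEY", "sonnet")
print(resolve_variation({"selectedModel": model_key}))
```

The lookup is an exact match here, so a value with no matching rule falls through to the default variation.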

Compare cloud and local models

To evaluate Ollama models against cloud providers:

Add an "ollama" variation to your model-selector AI Config.
Add a targeting rule for selectedModel equals "ollama".
Launch with MODEL_KEY=ollama.

Your custom judges score Claude Sonnet and Llama 3.2 with identical criteria. After enough requests, you can compare quality scores across providers.

Run experiments

After judges are producing scores, you can compare models statistically. Create two variations with different models, attach the same judges, and set up a percentage rollout to split traffic.

Your judge metrics appear as goals in LaunchDarkly Experimentation. After enough data, you can answer "Which model produces more secure code?" with confidence, not guesswork.

To learn more, read Experimentation with AI Configs.

Monitor quality over time

Judge scores appear on your AI Config's Monitoring tab. To view evaluation metrics:

Open your model-selector AI Config and go to the Monitoring tab.
Select Evaluator metrics from the dropdown menu.

[Image: Select Evaluator metrics from the dropdown]

Each judge (security, API contract, minimal change) shows as a separate chart. Hover over a chart to see scores broken down by variation.

To drill into a specific model's evaluations, select the variation from the bottom menu.

Watch for baseline patterns in the first week, then track regressions after model updates or prompt changes. Model providers ship updates without notice. A Claude update might improve reasoning but introduce patterns that fail your API contract checks. Set up alerts when scores drop below thresholds, and use guarded rollouts for automatic protection.
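
If you export scores and want a quick local check before wiring up alerts, a baseline comparison can be as small as the sketch below. The function name, the 0.1 tolerance, and the sample scores are assumptions for illustration; pick a tolerance that matches the variance you see in your own first week.

```python
from statistics import mean
from typing import List


def score_regressed(baseline: List[float], recent: List[float],
                    max_drop: float = 0.1) -> bool:
    """True when the recent mean falls more than max_drop below the baseline mean."""
    if not baseline or not recent:
        return False  # not enough data to decide
    return mean(baseline) - mean(recent) > max_drop


# Illustrative scores for one judge, before and after a model update.
first_week = [0.9, 1.0, 0.9]
after_update = [0.7, 0.8, 0.75]
print(score_regressed(first_week, after_update))
```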

To learn more, read Monitor AI Configs.

Control costs with sampling

Each judge evaluation is an LLM call. Control costs by adjusting sampling rates:

Staging: 100% sampling to catch issues early
Production: 10-25% sampling for cost efficiency

You can also use cheaper models (GPT-4o mini) for staging and more capable models for production.
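
To estimate what a sampling rate means for your bill, the arithmetic is simple. The sketch below assumes each judge samples independently at its own rate; should_evaluate and expected_judge_calls are hypothetical helpers, not SDK functions.

```python
import random
from typing import List


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether one judge runs for one request."""
    return rng.random() < sampling_rate


def expected_judge_calls(requests: int, judge_rates: List[float]) -> float:
    """Expected number of judge LLM calls, one sampling rate per attached judge."""
    return requests * sum(judge_rates)


staging = expected_judge_calls(1000, [1.0, 1.0, 1.0])
production = expected_judge_calls(1000, [0.25, 0.25, 0.25])
print(staging, production)
```

With three judges, 1,000 requests produce 3,000 judge calls at 100% sampling and about 750 on average at 25%.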

What you learned

The value is in the judges you create. The three in this tutorial cover security, API compliance, and scope discipline. Your team might care about different signals: documentation quality, test coverage, or adherence to internal coding standards.

Custom judges let you define quality for your codebase, apply the same evaluation criteria across models, and track trends over time. Once you create a judge, you can attach it to any AI Config in your project.

Ready to build custom judges for your codebase? Start your 14-day free trial and deploy your first evaluation today.

Next steps

hello-python-ai examples for more judge patterns
AI Configs best practices for production patterns

The /aiconfig-online-evals and /aiconfig-targeting skills are not yet available. Use the dashboard to complete those steps.

Beyond n8n for Workflow Automation: Agent Graphs as Your Universal Agent Harness

Scarlett Attensil — Thu, 26 Mar 2026 00:33:34 +0000

Hardcoded multi-agent orchestration is brittle: topology lives in framework-specific code, changes require redeploys, and bottlenecks are hard to see. Agent Graphs externalize that topology into LaunchDarkly, while your application continues to own execution.

In this tutorial, you'll build a small multi-agent workflow, traverse it with the SDK, monitor per-node latency on the graph itself, and update a slow node's model without changing application code.

Node = AI Config (model, instructions, tools)
Edge = handoff metadata (routing contract you define)
Graph = topology (which nodes connect)
Your app = execution + interpretation

LaunchDarkly provides graph structure, config, and observability. Your application owns execution semantics: you write the code that interprets edges and runs agents.

What You'll Build

In this tutorial, you'll add Agent Graphs to an existing multi-agent workflow:

Build a graph visually in the LaunchDarkly UI
Connect it to your code with a few lines of SDK integration
Run your agents and see the graph in action
Monitor performance with per-node latency and invocation tracking
Fix a slow agent by swapping models from the dashboard

By the end, you'll have a multi-agent system where topology metadata changes happen in the UI, picked up by your traversal code on the next request.

Prerequisites

LaunchDarkly account with AI Configs access (sign up here)
Python 3.9+
An existing agent workflow (or use our sample repo)

The Problem with Hardcoded Orchestration

Every multi-agent framework handles orchestration differently:

# LangGraph - topology hardcoded in graph setup
workflow = StateGraph(AgentState)
workflow.add_node("supervisor", supervisor_node)
workflow.add_node("security", security_node)
workflow.add_node("support", support_node)
workflow.set_entry_point("supervisor")
# Routing logic buried in node functions or conditional edges

# OpenAI Agents SDK - handoffs defined per agent
security_agent = Agent(name="Security", instructions="...")
support_agent = Agent(name="Support", instructions="...")
supervisor = Agent(
    name="Supervisor",
    handoffs=[security_agent, support_agent]  # Topology locked in code
)

The topology is scattered across code. Agent Graphs make it visible: you see the entire workflow in one view, edit connections in the UI, and traverse it with graph-aware SDK methods.

Why Externalizing Topology Helps

If you've built multi-agent systems with LangGraph, OpenAI Swarm, or Strands, you've hit these walls:

Config duplication: Agent definitions scattered across framework-specific formats
Silent failures: An agent times out and you don't know until users complain
No topology visibility: The workflow exists only in code
Custom observability: Getting consistent per-agent metrics means reconciling different trace formats and data schemas across frameworks

For a detailed comparison of LangGraph, OpenAI Swarm, and Strands, see Compare AI orchestrators. Agent Graphs work with multiple agent frameworks.

Agent Graphs solve these by giving you a visual graph builder where you:

See your entire workflow at a glance, not buried in code
Monitor per-node metrics overlaid directly on the graph (latency, invocations, tool calls)
Add or remove agents without changing traversal logic, provided your runtime supports the node's tools and output contract
Inspect routing logic on edges, with handoff data visible in the UI
Use graph-aware SDK methods like is_terminal(), is_root(), and get_edges() instead of manual tracking

Step 1: Create AI Configs for Your Agents

Before building a graph, you need AI Configs for each agent. If you already have AI Configs, skip to Step 2.

See the AI Configs quickstart or run the bootstrap script in our sample repo:

git clone https://github.com/launchdarkly-labs/devrel-agents-tutorial
cd devrel-agents-tutorial
git checkout tutorial/agent-graphs
uv sync
cp .env.example .env  # Add your LD_SDK_KEY, LD_API_KEY, OPENAI_API_KEY
uv run python bootstrap/create_configs.py

For this tutorial, we'll use three configs:

supervisor-agent: Orchestrates the workflow and routes queries based on PII pre-screening
security-agent: Detects and redacts personally identifiable information (PII)
support-agent: Answers questions using dynamically loaded tools (search, RAG)

Step 2: Build the Graph in the UI

This is where Agent Graphs diverge from code-based orchestration. Instead of writing add_edge() calls, you'll see your topology and modify it visually.

Open your LaunchDarkly dashboard and navigate to AI > Agent graphs.

You'll see the first-time setup wizard. Since you already created AI Configs in Step 1, expand Create a graph at the bottom.

Name your graph chatbot-flow and click Create graph.

Add your first node: click Add node and select supervisor-agent
Set it as the root: click the node and toggle Root node
Add security-agent and support-agent as nodes

Draw edges: drag from supervisor-agent to both child agents
Add handoff data to each edge to define routing logic:

supervisor-agent → security-agent:

{
  "action": "sanitize",
  "reason": "PII detected",
  "route": "security"
}

supervisor-agent → support-agent:

{
  "action": "direct",
  "reason": "Clean input",
  "route": "support"
}

security-agent → support-agent:

{
  "action": "proceed",
  "reason": "Input sanitized",
  "route": "continue"
}

Notice what you're seeing: the entire workflow topology in one view. This graph is your architecture diagram, always current. Each node shows which AI Config variation it serves. The edges show routing logic that would otherwise be buried in conditional statements. When you need to add a new agent or change routing, you do it here, not in code.

LaunchDarkly doesn't execute your graph. It provides:

Topology: Which nodes exist and how they connect
Handoff metadata: Whatever JSON you put on edges
Per-node AI Config: Model, instructions, tools for each agent

Your code:

Decides which edges to follow based on agent decisions
Interprets handoff data however you want (the schema is yours)
Executes the actual agents

The handoff JSON is arbitrary metadata. You define the schema, you interpret it. LaunchDarkly stores and delivers it.
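
As one way to interpret that metadata, the sketch below turns the three edges from this step into a route lookup per node. The edge list shape (source, target, handoff keys) is an assumption made for this example; in the real integration you read edges from the SDK, as shown in Step 4.

```python
import json
from typing import Dict, List

# The three edges defined above, in a hypothetical serialized form.
EDGES_JSON = """
[
  {"source": "supervisor-agent", "target": "security-agent",
   "handoff": {"action": "sanitize", "reason": "PII detected", "route": "security"}},
  {"source": "supervisor-agent", "target": "support-agent",
   "handoff": {"action": "direct", "reason": "Clean input", "route": "support"}},
  {"source": "security-agent", "target": "support-agent",
   "handoff": {"action": "proceed", "reason": "Input sanitized", "route": "continue"}}
]
"""


def build_route_map(edges: List[dict], source: str) -> Dict[str, str]:
    """Map each route label on a node's outgoing edges to its target node."""
    return {
        edge["handoff"]["route"]: edge["target"]
        for edge in edges
        if edge["source"] == source and edge.get("handoff", {}).get("route")
    }


edges = json.loads(EDGES_JSON)
print(build_route_map(edges, "supervisor-agent"))
```

Because the schema is yours, nothing stops you from keying on action instead of route. The point is that the interpretation lives in your code.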

Step 3: Add the SDK to Your Project

Install the LaunchDarkly AI SDK:

uv add launchdarkly-server-sdk launchdarkly-server-sdk-ai

Initialize the clients in your code:

# config_manager.py - Initialize LaunchDarkly clients
def _initialize_launchdarkly_client(self):
    """Initialize LaunchDarkly client and AI client"""
    config = ldclient.Config(self.sdk_key)
    ldclient.set_config(config)
    self.ld_client = ldclient.get()

    # Block until client is initialized (max 10 seconds)
    self.ld_client.start_wait(10)

    if not self.ld_client.is_initialized():
        raise RuntimeError("LaunchDarkly client initialization failed")

    self.ai_client = LDAIClient(self.ld_client)

Build a context for targeting and tracking:

# config_manager.py - Build context for targeting
def build_context(self, user_id: str, user_context: dict = None) -> Context:
    """Build a LaunchDarkly context with consistent attributes."""
    context_builder = Context.builder(user_id).kind('user')

    if user_context:
        for key, value in user_context.items():
            context_builder.set(key, value)

    return context_builder.build()

Step 4: Integrate with Your Framework

This section walks through the integration code, starting with the building block (what runs at each node), then showing how nodes are orchestrated.

The Generic Agent Pattern

The key to dynamic execution is create_generic_agent. Every node uses the same implementation—no agent registry, no hardcoded agent types:

# agents/generic_agent.py
def create_generic_agent(agent_config, config_manager, valid_routes: List[str] = None):
    """Create a generic agent from LaunchDarkly AI Config."""

    class GenericAgent:
        def __init__(self):
            self.valid_routes = valid_routes or []

        async def ainvoke(self, state: dict) -> dict:
            """Execute the agent using LaunchDarkly config."""
            if not agent_config.enabled:
                return {"response": "", "_skipped": True}

            # Create model from config
            model = create_model_for_config(
                provider=agent_config.provider.name,
                model=agent_config.model.name,
                config_manager=config_manager
            )

            # Load tools from LaunchDarkly config
            tools = create_dynamic_tools_from_launchdarkly(agent_config)

            # Get instructions from config
            instructions = agent_config.instructions or "Process the input."

            # Inject route options into instructions
            if self.valid_routes:
                route_instruction = f"\n\nSelect one of these routes: {self.valid_routes}. Return: {{\"route\": \"<selected_route>\"}}"
                instructions = instructions + route_instruction

            # Execute and extract routing decision
            result = await self._execute(model, instructions, tools, state)
            result["routing_decision"] = self._extract_route(result.get("response", ""))

            # Track metrics
            agent_config.tracker.track_success()
            return result

    return GenericAgent()

The generic agent pattern means:

No agent registry: Every node uses the same create_generic_agent function
Config-driven behavior: Model, instructions, and tools all come from LaunchDarkly
Dynamic routing: Valid routes are injected from graph edges, not hardcoded
Minimal code changes: Add a new agent in LaunchDarkly, create its AI Config, add it to your graph, and it works—provided your runtime supports the node's tools and output contract
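
The snippet above calls self._extract_route without showing it. The sample repo's version may differ; the sketch below is one plausible way to pull the route out of a model response, assuming the model follows the injected instruction and returns a small JSON object somewhere in its text. extract_route is a hypothetical standalone function.

```python
import json
import re
from typing import List, Optional


def extract_route(response_text: str, valid_routes: List[str]) -> Optional[str]:
    """Return the first valid route found in a {"route": "..."} object, else None."""
    allowed = {r.strip().lower() for r in valid_routes}
    # Look at every flat {...} span; models often wrap the JSON in prose.
    for match in re.finditer(r"\{[^{}]*\}", response_text):
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not JSON, keep scanning
        route = payload.get("route")
        if isinstance(route, str) and route.strip().lower() in allowed:
            return route.strip().lower()
    return None


print(extract_route('PII found. {"route": "security"}', ["security", "support"]))
```

Returning None for anything outside valid_routes keeps a hallucinated route from being treated as a decision.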

The AgentService Class

The AgentService class is the entry point for processing messages through your Agent Graph:

# api/services/agent_service.py
class AgentService:
    """Multi-Agent Orchestration using LaunchDarkly Agent Graph."""

    def __init__(self):
        self.config_manager = ConfigManager()
        self.config_manager.flush()

    async def process_message(
        self,
        user_id: str,
        message: str,
        user_context: dict = None
    ) -> ChatResponse:
        """Process message using LaunchDarkly Agent Graph."""
        result = await self._execute_graph(
            graph_key=os.getenv("AGENT_GRAPH_KEY", "chatbot-flow"),
            user_id=user_id.strip() or "anonymous",
            user_input=message,
            user_context=user_context or {}
        )

        return ChatResponse(
            response=result.get("final_response", ""),
            tool_calls=result.get("tool_calls", []),
            # ... other fields
        )

Executing the Graph

The _execute_graph method fetches the graph from LaunchDarkly and uses traverse() with skip logic for conditional routing:

# api/services/agent_service.py
async def _execute_graph(
    self,
    graph_key: str,
    user_id: str,
    user_input: str,
    user_context: dict = None
) -> Dict[str, Any]:
    """Execute agents using SDK's traverse() with skip logic."""
    ld_context = self.config_manager.build_context(user_id, user_context)
    graph = self.config_manager.ai_client.agent_graph(graph_key, ld_context)

    if not graph.is_enabled():
        raise ValueError(f"Agent Graph '{graph_key}' is not enabled")

    ctx = {
        "user_input": user_input,
        "messages": [HumanMessage(content=user_input)],
        "processed_input": user_input,
        "final_response": "",
        "tool_calls": [],
        # Skip logic: track which nodes should execute
        "_routed_to": {graph.root().get_key()},
        "_path": [],
        "_prev_key": None,
    }

    tracker = graph.get_tracker()

    # Define the node callback (see next section)
    def execute_node(node, exec_ctx):
        # ... node execution logic
        pass

    # Use SDK's traverse() - it handles traversal order
    graph.traverse(execute_node, ctx)

    # Track graph completion
    if tracker:
        tracker.track_path(ctx.get("_path", []))
        tracker.track_invocation_success()

    return ctx

Skip Logic for Conditional Routing

The execute_node callback implements skip logic—the core pattern that enables conditional routing:

# api/services/agent_service.py - inside _execute_graph
def execute_node(node, exec_ctx):
    """Execute a single node if it was routed to."""
    key = node.get_key()

    # Skip logic: only execute if parent routed to this node
    if key not in exec_ctx.get("_routed_to", set()):
        return {"_skipped": True}

    exec_ctx["_path"].append(key)

    # Track node invocation
    if tracker:
        tracker.track_node_invocation(key)
        if exec_ctx.get("_prev_key"):
            tracker.track_handoff_success(exec_ctx["_prev_key"], key)

    # Get edges and valid routes for this node
    edges = node.get_edges()
    valid_routes = [e.handoff.get("route") for e in edges if e.handoff and e.handoff.get("route")]

    # Execute agent with config from this node
    agent = create_generic_agent(node.get_config(), self.config_manager, valid_routes=valid_routes)
    result = _run_async(agent.ainvoke(exec_ctx))

    # Track tool calls
    if tracker and result.get("tool_calls"):
        for tool in result["tool_calls"]:
            tracker.track_tool_call(key, tool)

    # Route to next node: add to _routed_to set
    if edges:
        next_key = self._select_next_node(edges, result, tracker)
        if next_key:
            exec_ctx["_routed_to"].add(next_key)

    exec_ctx["_prev_key"] = key
    return result

The _routed_to set tracks which nodes should execute:

Start: Add root node to _routed_to
traverse() visits each node: If node is in _routed_to, execute it; otherwise skip
After execution: Add the next node (based on routing decision) to _routed_to

This enables conditional routing: the supervisor routes to either security OR support, and only the chosen path executes.
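
You can try the skip pattern without the SDK. The sketch below simulates it on the three-node graph from Step 2. The traverse function here is a hypothetical breadth-first stand-in for the SDK's traverse(), and the routing decision is hardcoded instead of coming from a model. It assumes, as the real pattern does, that a node is visited only after the parents that might route to it.

```python
from collections import deque
from typing import Callable, Dict, List

# Topology from Step 2: node key -> child node keys.
GRAPH: Dict[str, List[str]] = {
    "supervisor-agent": ["security-agent", "support-agent"],
    "security-agent": ["support-agent"],
    "support-agent": [],
}


def traverse(graph: Dict[str, List[str]], root: str,
             callback: Callable[[str, dict], None], ctx: dict) -> None:
    """Hypothetical stand-in for traverse(): breadth-first, each node once."""
    seen, queue = {root}, deque([root])
    while queue:
        key = queue.popleft()
        callback(key, ctx)
        for child in graph[key]:
            if child not in seen:
                seen.add(child)
                queue.append(child)


def run(has_pii: bool) -> List[str]:
    ctx = {"_routed_to": {"supervisor-agent"}, "_path": []}

    def execute_node(key: str, exec_ctx: dict) -> None:
        if key not in exec_ctx["_routed_to"]:
            return  # skip: no parent routed here
        exec_ctx["_path"].append(key)
        children = GRAPH[key]
        if not children:
            return  # terminal node
        if key == "supervisor-agent":
            # Hardcoded decision standing in for the agent's routing output.
            next_key = "security-agent" if has_pii else "support-agent"
        else:
            next_key = children[0]
        exec_ctx["_routed_to"].add(next_key)

    traverse(GRAPH, "supervisor-agent", execute_node, ctx)
    return ctx["_path"]


print(run(has_pii=True))
print(run(has_pii=False))
```

Every node is still visited in both runs. Only the nodes in _routed_to do any work, which is why the clean path never invokes security-agent.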

Routing Between Nodes

The _select_next_node method determines which node to route to based on the agent's routing decision:

# api/services/agent_service.py
def _select_next_node(self, edges, result: dict, tracker=None):
    """Select next node key based on routing decision."""
    routing = (result.get("routing_decision") or "").lower().strip() or None

    # Build route map: route -> target_config
    route_map = {}
    for edge in edges:
        route = (edge.handoff.get("route", "") if edge.handoff else "").lower().strip()
        if route:
            route_map[route] = edge.target_config

    # Exact match
    if routing and routing in route_map:
        return route_map[routing]
    elif routing:
        if tracker:
            tracker.track_handoff_failure()

    # Default: first edge
    if edges:
        return edges[0].target_config

    return None

The key insight: your graph topology comes from LaunchDarkly, not hardcoded orchestration. Change the graph in the UI, and your code picks up the new structure on the next request.

Step 5: Run It

With the AgentService wired up (as shown in Step 4), you can now process messages through your Agent Graph. The service handles:

Building the LaunchDarkly context for targeting
Fetching the graph and executing nodes via traverse()
Tracking metrics for monitoring
Returning the final response

Test it by sending a message:

service = AgentService()
response = await service.process_message(
    user_id="user-123",
    message="What's the status of my order?",
    user_context={"plan": "premium"}
)
print(response.response)

Now go back to the LaunchDarkly UI. Add a new node or change an edge. Run your code again. Topology changes are picked up by your traversal code on subsequent SDK evaluations.

Step 6: Monitor Agent Performance

This is the key differentiator: monitoring happens on the graph itself, not in a separate dashboard. You see metrics overlaid on the same visual topology you built, so bottlenecks are immediately obvious.

The sample repo includes full instrumentation: calls to tracker.track_success(), tracker.track_error(), and tracker.track_tool_call() in the agent execution path. After running some traffic, open your Agent Graph to see the results.

Navigate to AI > Agent graphs > chatbot-flow. You'll see a metrics bar at the top of the graph view where you can toggle different metrics on and off.

Metrics on the graph

Here's what makes this different from traditional APM: the metrics appear directly on your workflow visualization. No mental mapping between a dashboard and your code. No correlating trace IDs. The slow node lights up on the graph.

Turn on Latency to see duration data overlaid directly on your graph:

Total duration: The combined time for the entire graph invocation
Per-node duration: How long each individual agent takes

Turn on Invocations to see how often each node is reached. This reveals which paths your users take most frequently. In a routing graph, you'll quickly see whether most queries go through security or skip directly to support.

Turn on Tool calls to see the average number of tool invocations per node. If an agent is calling tools excessively, you'll spot it here.

Monitoring page

Click Monitoring to see all metrics over time. This view shows:

Latency trends: Duration per node over hours, days, or weeks
Invocation patterns: Traffic flow through your graph
Tool call breakdown: Which specific tools are being called and how often

To see which specific tools are called, you need to track them in your code using the tracker. The SDK sends this data to LaunchDarkly, which displays it in the monitoring view.

Generate traffic to see metrics

Run the traffic generator from the sample repo to send queries through your graph:

uv run python tools/traffic_generator.py --queries 20 --delay 2

This sends a mix of queries (some with PII, some without) to exercise both the security and support paths. After a few minutes, you'll see metrics populate on the graph.

Detecting a slow agent

With traffic flowing, suppose the security-agent starts averaging 5 seconds per call. With latency metrics enabled on the graph, you see it immediately: the security-agent node shows a high duration value while other nodes stay fast.

The invocation numbers also tell a story. If security-agent shows 50 invocations and support-agent shows 80, you know ~30 queries are bypassing security (the clean path). This helps you understand whether the slow agent is affecting most users or just a subset.
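
The arithmetic behind that reading, as a small sketch. It assumes the topology from Step 2, where every security-agent invocation hands off to support-agent and every query ends at support-agent. The function names are hypothetical.

```python
def direct_to_support(security_invocations: int, support_invocations: int) -> int:
    """Queries that reached support-agent without passing through security-agent."""
    return support_invocations - security_invocations


def security_share(security_invocations: int, support_invocations: int) -> float:
    """Fraction of all queries that went through security-agent."""
    return security_invocations / support_invocations


print(direct_to_support(50, 80))
print(security_share(50, 80))
```

With 50 and 80 invocations, 30 queries took the clean path and 62.5% of traffic paid the slow node's latency.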

Without Agent Graphs, you'd need custom logging, Datadog queries, and manual correlation. With Agent Graphs, you see the problem in 30 seconds.

Step 7: Fix Without Deploying

The security-agent is slow because it's using claude-sonnet-4 for PII detection. A smaller, faster model may be sufficient for this task.

In the LaunchDarkly dashboard, update the pii-detector variation:

Change model from Anthropic.claude-sonnet-4-20250514 to Anthropic.claude-3-haiku-20240307

Or use Agent Skills to make the change from your coding assistant:

The security-agent pii-detector variation is averaging 5 seconds.
Change the model to claude-3-haiku-20240307.

No code changes. No deploy. Changes are picked up on subsequent SDK evaluations.

Run the traffic generator again and watch the latency drop.

What just happened

Traffic generator sent queries through the graph
Monitoring showed the slow agent on the graph
Model swap happened in the UI (or via Agent Skills)
Your code automatically used the new configuration

No deploys. No PRs. The fix is live.

OpenAI Agents SDK Integration (Conceptual)

Agent Graphs work with multiple frameworks. This conceptual example shows how the pattern translates to OpenAI Agents SDK:

# Conceptual example showing how Agent Graph SDK methods work with OpenAI Agents
from agents import Agent, Runner

def handle_traversal(node, state):
    config = node.get_config()
    tracker = config.tracker
    edges = node.get_edges()

    # Child agents are already in state (reverse traversal builds bottom-up)
    handoffs = [state[edge.target_config] for edge in edges]

    def on_handoff(ctx):
        # Track handoff events
        return ctx

    return Agent(
        name=config.key,
        instructions=config.instructions,
        handoffs=handoffs,
        on_handoff=on_handoff,
    )

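if not agent_graph.is_enabled():
    # Without this guard, root would be undefined when the graph is disabled
    raise RuntimeError("Agent Graph is not enabled")

root = agent_graph.reverse_traverse(handle_traversal, {})
result = await Runner.run(root, "Tell me about your engineering team")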

Same graph definition, adapted to each framework's execution model. The topology metadata lives in LaunchDarkly; your code interprets and executes it.

Best Practices

Start simple: Begin with a linear graph (A → B → C) before adding conditional routing.

Use handoff data for context passing: Include metadata like action type, reason, or state that the next agent needs to continue the workflow.

Track everything: Call tracker.track_success() and tracker.track_error() in every node for complete visibility. Use the graph tracker's track_tool_call(node_key, tool) to record which tools agents invoke, as the execute_node callback in Step 4 does.

Test with targeting: Use LaunchDarkly targeting to route test users to experimental graph configurations.

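Handle missing edges: Decide what happens when no edge matches a routing decision or when a target node is disabled. Recommended: fail closed, log diagnostics, and track routing failures. The sample _select_next_node in Step 4 falls back to the first edge instead, which is convenient for a demo but can mask routing errors.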

Keep execution state request-scoped: Store execution state inside the context object (ctx) passed through traversal, not in instance-level variables. Treat graph traversal as request-scoped to avoid concurrency issues.
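
As an illustration of the "handle missing edges" practice, here is a minimal fail-closed router. It is a sketch under assumptions: RoutingError and select_next_node_fail_closed are hypothetical names, and route_map is a plain dict of route label to target node key, like the one built inside _select_next_node.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("agent_routing")


class RoutingError(Exception):
    """Raised when an agent's routing decision matches no outgoing edge."""


def select_next_node_fail_closed(route_map: Dict[str, str],
                                 routing_decision: Optional[str]) -> Optional[str]:
    """Return the target for a known route, None at a terminal node, else raise."""
    if not route_map:
        return None  # terminal node: nothing to route to
    route = (routing_decision or "").strip().lower()
    if route in route_map:
        return route_map[route]
    # Fail closed: log diagnostics and stop instead of guessing an edge.
    logger.warning("No edge for route %r; valid routes: %s", route, sorted(route_map))
    raise RoutingError(f"unroutable decision: {route!r}")


routes = {"security": "security-agent", "support": "support-agent"}
print(select_next_node_fail_closed(routes, "security"))
```

In the real executor you would also call the tracker's failure method before raising, so the routing failure shows up in monitoring.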

What You've Built

You now have a multi-agent system where:

Graph topology is externalized and self-documenting
Routing logic is visible on edges, not buried in code
Monitoring appears on the graph itself, not a separate dashboard
Node-level control lets you disable a single agent without touching others, provided your executor checks node availability
Multiple frameworks can consume the same graph metadata

When you spot a slow agent in monitoring, you can swap the model from the dashboard without a deploy.

Next Steps

Agent Graphs Reference: SDK methods for traverse, reverse_traverse, get_edges(), and handoff data
AI Configs Documentation: Learn more about variations, targeting, and experiments
Agent Skills Tutorial: Manage AI Configs from your coding assistant
Monitor AI Configs: Deep dive into metrics and dashboards
Sample Repository: Complete code from this tutorial

Conclusion

Hardcoded orchestration was fine when you had one agent. With multi-agent systems, it becomes a liability. Every change requires a deploy. Every incident requires a developer.

Agent Graphs flip this. Define your workflow in LaunchDarkly, integrate it with your framework, and fix many problems without touching code. Your agents become as dynamic as your feature flags.

Ready to stop hardcoding? Get started with AI Configs and create your first Agent Graph.

LLM evaluation guide: When to add online evals to your AI application

Scarlett Attensil — Wed, 17 Dec 2025 17:42:49 +0000

The quick decision framework

Online evals for AI Configs are currently in closed beta. Judges must be installed in your project before they can be attached to AI Config variations.

Online evals provide real-time quality monitoring for LLM applications. Using LLM-as-a-judge methodology, they run automated quality checks on a configurable percentage of your production traffic, producing structured scores and pass/fail judgments you can act on programmatically. LaunchDarkly includes three built-in judges: accuracy, relevance, and toxicity.

Skip online evals if:

Your checks are purely deterministic (schema validation, compile tests)
You have low volume and can manually review outputs in observability dashboards
You're primarily debugging execution problems

Add online evals when:

You need quantified quality scores to trigger automated actions (rollback, rerouting, alerts)
Manual quality review doesn't scale to your traffic volume
You're measuring multiple quality dimensions (accuracy, relevance, toxicity)
You want statistical quality trends across segments for AI governance and compliance
You need to monitor token usage and cost alongside quality metrics
You're running A/B tests or guarded releases and need automated quality gates

Most teams add them within 2-3 sprints when manual quality review becomes the bottleneck. Configurable sampling rates let you balance evaluation coverage with cost and latency.

Online evals vs. LLM observability

LLM observability shows you what happened. Online evals automatically assess quality and trigger actions based on those assessments.

LLM observability: your security camera

LLM observability shows you everything that happened through distributed tracing: full conversations, tool calls, token usage, latency breakdowns, and cost attribution. Perfect for debugging and understanding what went wrong. But when you're handling 10,000 conversations daily, manually reviewing them for quality patterns doesn't scale.

Online evals: your security guard

Online evals automatically score every sampled request using LLM-as-a-judge methodology across your quality rubric (accuracy, relevance, toxicity) and take action. Instead of exporting conversations to spreadsheets for manual review, you get real-time quality monitoring with drift detection that triggers alerts, rollbacks, or rerouting.

The 3 AM difference

Without evals: "Let's meet tomorrow to review samples and decide if we should rollback."

With evals: "Quality dropped below threshold, automatic rollback triggered, here's what failed..."

How online evals actually work

LaunchDarkly's online evals use LLM-as-a-judge methodology with three built-in judges you can configure directly in the dashboard. No code changes required.

Getting started:

Install judges from the AI Configs menu
Attach judges to AI Config variations
Configure sampling rates (balance coverage with cost/latency)
Evaluation metrics are automatically emitted as custom events
Metrics are automatically available for A/B tests and guarded releases

What you get from each built-in judge:

Accuracy judge:

{
  "score": 0.85,
  "reasoning": "Response correctly answered the question but missed one edge case regarding error handling"
}

Relevance judge:

{
  "score": 0.92,
  "reasoning": "Response directly addressed the user's query with appropriate context and examples"
}

Toxicity judge:

{
  "score": 0.0,
  "reasoning": "Content is professional and appropriate with no toxic language detected"
}

Each judge returns a score from 0.0 to 1.0 plus reasoning that explains the assessment. LaunchDarkly's built-in judges (accuracy, relevance, toxicity) have fixed evaluation criteria and are configured only by selecting the provider and model.

Configuration:
Install judges from the AI Configs menu in your LaunchDarkly dashboard. They appear as pre-configured AI Configs (AI Judge - Accuracy, AI Judge - Toxicity, AI Judge - Relevance). When configuring your AI Config variations in completion mode, select which judges to attach and set the sampling rate for each. Use different judge combinations for different environments to match your quality requirements and cost constraints.

Real problems online evals solve

Scale for production applications: Your SQL generator handles 50,000 queries daily. LLM observability shows you every query through distributed tracing. Online evals tell you the proportion that are semantically wrong, automatically, with hallucination detection built in.

Multi-dimensional quality monitoring: For customer service AI applications, quality isn't just "did it respond?" It's accuracy, relevance, toxicity, compliance, and appropriateness. Online evals score all dimensions simultaneously, each with its own threshold and reasoning.

RAG pipeline validation: Your retrieval-augmented generation system needs continuous monitoring of both retrieval quality and generation accuracy. Online evals can assess whether retrieved context is relevant and whether the response accurately uses that context, preventing hallucinations and ensuring factual grounding.

Cost and performance optimization: Monitor token usage alongside quality metrics. If certain queries consume 10x more tokens than others, online evals help identify these patterns so you can optimize prompts or routing logic to reduce costs without sacrificing quality.

Actionable metrics for AI governance: Transform 10,000 responses from data to decisions with evaluator-driven quality gates:

Accuracy trending below 0.8? Automated alerts to the team
Toxicity above 0.2? Immediate review and potential rollback
Relevance dropping for specific user segments? Targeted configuration updates
Metrics automatically feed A/B tests and guarded releases for continuous improvement
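
The gates above can be sketched as a small function over judge results. This is an illustration only: the `JudgeResult` class and `quality_gate_actions` function are hypothetical names, the score and reasoning fields mirror the judge output shown earlier, and the thresholds are the example values from the list, not LaunchDarkly defaults.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    score: float      # 0.0 to 1.0, as returned by each judge
    reasoning: str


# Example thresholds from the list above; tune them for your application.
ACCURACY_ALERT_BELOW = 0.8
TOXICITY_REVIEW_ABOVE = 0.2


def quality_gate_actions(results: dict[str, JudgeResult]) -> list[str]:
    """Map judge results to the actions a quality gate would request."""
    actions = []
    accuracy = results.get("accuracy")
    if accuracy is not None and accuracy.score < ACCURACY_ALERT_BELOW:
        actions.append(f"alert: accuracy {accuracy.score:.2f} ({accuracy.reasoning})")
    toxicity = results.get("toxicity")
    if toxicity is not None and toxicity.score > TOXICITY_REVIEW_ABOVE:
        actions.append(f"review: toxicity {toxicity.score:.2f} ({toxicity.reasoning})")
    return actions
```

Returning a list of actions rather than acting directly keeps the decision logic easy to test and lets you wire alerts or rollbacks however your team prefers.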

Example implementation path

Week 1-2: Define quality dimensions and install judges.
Use LLM observability alone first. Manually review samples to understand your system. Define your quality dimensions: accuracy, relevance, toxicity, or other criteria specific to your application. Install the built-in judges from the AI Configs menu in LaunchDarkly.

Week 3-4: Attach judges with sampling.
Attach judges to AI Config variations in LaunchDarkly. Start with one or two key judges (accuracy and relevance are good defaults). Configure sampling rates of 10-20% of traffic to balance coverage with cost and latency. Compare automated scores with human judgment to validate that the judges work for your use case.
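
One way to compare automated scores with human judgment is a simple agreement rate over a hand-labeled sample. The sketch below assumes you have exported judge scores and recorded a pass/fail verdict from a human reviewer for the same responses; `judge_agreement` and the 0.8 pass threshold are assumptions for illustration.

```python
def judge_agreement(judge_scores, human_labels, pass_threshold=0.8):
    """Fraction of samples where the judge's pass/fail matches the human verdict."""
    if len(judge_scores) != len(human_labels) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        (score >= pass_threshold) == bool(label)
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)


# Made-up sample: the judge disagrees with the reviewer on the last response.
print(judge_agreement([0.9, 0.85, 0.4, 0.95], [True, True, False, False]))  # 0.75
```

If agreement is low, read the judge's reasoning on the mismatches before trusting any automated action built on its scores.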

Week 5+: Operationalize with quality gates.
Add more evaluation dimensions as you learn. Connect scores to automated actions and evaluator-driven quality gates: when accuracy drops below 0.7, trigger alerts; when toxicity exceeds 0.2, investigate immediately. Leverage the custom events and metrics for A/B testing and guarded releases to continuously improve your application's performance.
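
A single low score is noise; a falling average is a signal. The sketch below shows one way to turn a stream of sampled scores into a "dropped below threshold" trigger using a rolling window. `RollingScoreMonitor` is a hypothetical helper, and the window size is an assumption you would tune to your traffic.

```python
from collections import deque


class RollingScoreMonitor:
    """Track the mean of the most recent scores and flag when it falls too low."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score: float) -> bool:
        """Record a score; return True once the window is full and its mean is below threshold."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold


# Accuracy gate from the text: alert when the rolling mean drops below 0.7.
monitor = RollingScoreMonitor(window=3, threshold=0.7)
for score in [0.9, 0.9, 0.9, 0.5, 0.5]:
    triggered = monitor.add(score)
print(triggered)  # True
```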

The bottom line

You don't need online evals on day one. Start with LLM observability to understand your AI system through distributed tracing. Add evaluations when you hear yourself saying "we need to review more conversations" or "how do we know if quality is degrading?"

LaunchDarkly's three built-in judges (accuracy, relevance, toxicity) provide LLM-as-a-judge evaluation that you can attach to any AI Config variation in completion mode with configurable sampling rates. Note that online evals currently only work with completion mode AI Configs. Agent-based configs are not yet supported. Evaluation metrics are automatically emitted as custom events and feed directly into A/B tests and guarded releases, enabling continuous AI governance and quality improvement without code changes. Start simple with one judge, learn what matters for your application, and expand from there.

LLM observability is your security camera. Online evals are your security guard.

Next steps

Ready to get started? Sign up for a free LaunchDarkly account if you haven't already.

Build a complete quality pipeline:

AI Config CI/CD Pipeline - Add automated quality gates and LLM-as-a-judge testing to your deployment process
Combine offline evaluation (in CI/CD) with online evals (in production) for comprehensive quality coverage

Learn more about AI Configs:

AI Config documentation - Understand how AI Configs enable real-time LLM configuration
Online evals documentation - Deep dive into judge installation and configuration
Guardrail metrics - Monitor quality during A/B tests and guarded releases

See it in action:

Check LLM observability in the LaunchDarkly dashboard to track your AI application performance with distributed tracing

Industry standards:
LaunchDarkly's approach aligns with emerging AI observability standards, including OpenTelemetry's semantic conventions for AI monitoring, ensuring your evaluation infrastructure integrates with the broader observability ecosystem.

When to Use Prompt-Based vs Agent Mode in LaunchDarkly for AI Applications

Scarlett Attensil — Wed, 17 Dec 2025 17:39:09 +0000

A Guide for LangGraph, OpenAI, and Multi-Agent Systems

The broader tech industry can't agree on what the term "agents" even means. Anthropic defines agents as systems where "LLMs dynamically direct their own processes," while Vercel's AI SDK enables multi-step agent loops with tools, and OpenAI provides an Agents SDK with built-in orchestration. So when you're creating an AI Config in LaunchDarkly and see "prompt-based mode" vs. "agent mode," you might reasonably expect this choice to determine whether you get automatic tool execution loops, server-side state management, or some other fundamental capability difference.

But LaunchDarkly's distinction is different and more practical. Understanding it will save you from confusion and help you ship AI features faster.

TL;DR

LaunchDarkly's "prompt-based vs. agent" choice is about input schemas and framework compatibility, not execution automation. Prompt-based mode returns a messages array (perfect for chat UIs), while agent mode returns an instructions string (optimized for LangGraph/CrewAI frameworks). Both provide the same core benefits: provider abstraction, A/B testing, metrics tracking, and the ability to change AI behavior without deploying code.

Ready to start? Sign up for a free trial → create your first AI Config → Choose your mode → Configure and ship.

The fragmented AI landscape

LaunchDarkly supports 20+ AI providers: OpenAI, Anthropic, Gemini, Azure, Bedrock, Cohere, Mistral, DeepSeek, Perplexity, and more. Each has its own interpretation of "completions" vs "agents," creating a chaotic ecosystem with different API endpoints, execution behaviors, state management approaches, and capability limitations. This fragmentation makes it difficult to switch providers or even understand what capabilities you're getting. That's where LaunchDarkly's abstraction layer comes in.

LaunchDarkly's approach: provider-agnostic input schemas

LaunchDarkly's AI Configs are a configuration layer that abstracts provider differences. When you choose prompt-based mode or agent mode, you're selecting an input schema (messages array vs. instructions string), not execution behavior. LaunchDarkly provides the configuration; you handle orchestration with your own code or frameworks like LangGraph. This gives you provider abstraction, A/B testing, metrics tracking, and online evals (prompt-based mode only) without locking you into any specific provider's execution model.

Prompt-based mode: messages-based

Prompt-based mode uses a messages array format with system/user/assistant roles (some providers like OpenAI also support a "developer" role for more granular control). This is the traditional chat format that works across all AI providers.

UI Input: "Messages" section with role-based messages

SDK Method: aiclient.config()

Returns: Customized prompt + model configuration

Documentation: AI Config docs

# Retrieve prompt-based AI config
config = aiclient.config(
    key="customer-support",
    context=context,
    default_value=default_config
)

# What you get back: messages array
print(config.messages)
# [
#   {
#     "role": "system",
#     "content": "You are a helpful customer support agent for Acme Corp."
#   },
#   {
#     "role": "user",
#     "content": "How can I reset my password?"
#   }
# ]

# Use with provider SDKs that expect message arrays
response = openai.chat.completions.create(
    model=config.model.name,
    messages=config.messages  # Standard message format
)

When to use prompt-based mode:

You're building chat-style interactions: Traditional message-based conversations where you construct system/user/assistant messages
You need online evals: LaunchDarkly's model-agnostic online evals are currently only available in prompt-based mode
You want granular control of workflows: Discrete steps that need to be accomplished in a specific order, or multi-step asynchronous processes where each step executes independently
One-off evaluations: Issue individual evaluations of your prompts and completions (not online evals)
Simple processing tasks: Summarization, name suggestions, or other non-context-exceeding data processing

Agent mode: goal/instructions-based

Agent mode uses a single instructions string format that describes the agent's goal or task. This format is optimized for agent orchestration frameworks that expect high-level objectives rather than conversational messages.

UI Input: "Goal or task" field with instructions

SDK Method: aiclient.agent()

Returns: Customized instructions + model configuration

Examples: hello-python-ai examples

# Retrieve agent-based AI config
agent_config = aiclient.agent(
    key="research-assistant",
    context=context,
    default_value=default_config
)

# What you get back: instructions string
print(agent_config.instructions)
# "You are a research assistant. Your goal is to gather comprehensive
# information on the requested topic using available search tools.
# Search multiple sources, synthesize findings, and provide a detailed
# summary with citations."

# Use with agent frameworks that expect instructions
from langgraph.prebuilt import create_react_agent
from langchain.chat_models import init_chat_model

llm = init_chat_model(
    model=agent_config.model.name,
    model_provider=agent_config.provider.name
)

agent = create_react_agent(
    llm,
    tools=[search_tool, citation_tool],
    prompt=agent_config.instructions  # Goal/task instructions
)

# Execute and track
response = agent.invoke({"messages": [{"role": "user", "content": "..."}]})

When to use agent mode:

You're using agent frameworks: LangGraph, LangChain, CrewAI, AutoGen, or LlamaIndex Workflows expect goal/instruction-based inputs
Goal-oriented tasks: "Research X and create Y" rather than conversational message exchange
Tool-driven workflows: While both modes support tools, agent mode's format is optimized for frameworks that orchestrate tool usage
Open-ended exploration: The output is open-ended and you don't know the actual answer you're trying to get to
Data as an application: You want to treat your data as an application to feed in arbitrary data and ask questions about it
Provider agent endpoints: LaunchDarkly may route to provider-specific agent APIs when available (note: not all models support agent mode; check your model's capabilities)

See example: Build a LangGraph Multi-Agent System with LaunchDarkly

Quick comparison

Feature	Prompt-Based Mode	Agent Mode
Input format	Messages (system/user/assistant)	Goal/task + instructions
Tools support	✅ Yes	✅ Yes
SDK method	`config()`	`agent()`
Automatic execution loop	❌ No (you orchestrate)	❌ No (you orchestrate)
Online evals	✅ Available	❌ Not yet available
Best for	Chat-style prompting, single completions	Agent frameworks, goal-oriented tasks
Provider endpoint	Standard endpoint	May use provider-specific agent endpoint if available
Model support	All models	Most models (check model card for "Agent mode" capability)

Model compatibility: Not all models support agent mode. When selecting a model in LaunchDarkly, check the model card for "Agent mode" capability. Models like GPT-4.1, GPT-5 mini, Claude Haiku 4.5, Claude Sonnet 4.5, Claude Sonnet 4, Grok Code Fast 1, and Raptor mini support agent mode, while models focused on reasoning (like GPT-5, Claude Opus 4.1) may only support prompt-based mode.

How providers handle "completion vs agent"

To understand why LaunchDarkly's abstraction is valuable, let's look at how major AI providers handle the distinction between basic completions and advanced agent capabilities. The table below shows how different providers implement "advanced" modes; generally these are ADDITIVE, including all basic capabilities plus extras. For example, OpenAI's Responses API includes all Chat Completions features plus additional capabilities.

Provider	"Basic" Mode	"Advanced" Mode	Key Difference	Link
OpenAI	Chat Completions API	Responses API	Responses adds built-in tools (web_search, file_search, computer_use, code_interpreter, remote MCP), server-side conversation state with stored IDs, and improved streaming. Chat Completions remains supported.	Docs
Anthropic	Tool Use (client tools)	Tool Use (client + server tools)	Server tools (web_search, web_fetch) execute on Anthropic's servers. You can use both client and server tools together	Docs
Google Gemini	Manual function calling	Automatic function calling (Python SDK)	Python SDK auto-converts functions to schemas, runs the execution loop, and supports compositional multi-step calls. Manual mode: full control, all platforms	Docs
Vercel AI SDK	`generateText()`	`generateText()` with multi-step loop	Multi-step agent loops with tools; SDK continues until complete; `maxSteps` provides loop control to limit steps	Docs
Azure OpenAI	Assistants API (deprecated)	AI Agent Services	Enterprise agent runtime with threads, tool orchestration, safety, identity, networking, and observability; includes Responses API and Computer-Using Agent in Azure	Docs
AWS Bedrock (Nova)	Converse API (tool use)	Bedrock Agents	Agents: managed service with automatic orchestration + state management + multi-agent collaboration. Converse: manual tool orchestration, full control	Docs
Cohere	Standard chat	Command A	Command A: enhanced multi-step tool use, ReAct agents, ~150% higher throughput	Docs

This fragmentation across providers is exactly why LaunchDarkly's approach matters: you configure once (messages vs. goals), and LaunchDarkly handles the provider-specific translation. Want to switch from OpenAI to Anthropic? Just change the provider in your AI Config. Your application code stays the same.

Note on OpenAI's ecosystem (Nov 2025): The Agents SDK is OpenAI's production-ready orchestration framework. It uses the Responses API by default, and via a built-in LiteLLM adapter it can run against other providers with an OpenAI-compatible shape. Chat Completions is still supported, but OpenAI recommends Responses for new work. The Assistants API is deprecated and scheduled to shut down on August 26, 2026.

Common misconceptions

Now that you understand the modes and how they differ from provider-specific implementations, let's clear up some common points of confusion:

❌ "Agent mode provides automatic execution"
No. Both modes require you to orchestrate. Agent mode just provides a different input schema.

❌ "Agent mode is for complex tasks, prompt-based mode is for simple ones"
Not quite. It's about input format and framework compatibility, not task complexity.

❌ "I can only use tools in agent mode"
False. Both modes support tools. The difference is how you specify your task (messages vs. goal).

❌ "LaunchDarkly is an agent framework like LangGraph"
No. LaunchDarkly is configuration management for AI. Use it WITH frameworks like LangGraph, not instead of them.
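
To make the first misconception concrete, here is a minimal sketch of the kind of loop you (or a framework like LangGraph) still have to own in either mode. Everything here is hypothetical: `run_agent_loop`, the reply shape with `tool` and `tool_input` keys, and the stub model are invented for illustration and do not match any provider's real tool-calling format.

```python
from typing import Callable


def run_agent_loop(
    instructions: str,
    user_input: str,
    call_model: Callable[[list[dict]], dict],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Call the model, run any tool it asks for, and repeat until it answers."""
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_input},
    ]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply.get("tool") is None:
            return reply["content"]
        result = tools[reply["tool"]](reply["tool_input"])
        messages.append({"role": "assistant", "content": f"calling {reply['tool']}"})
        messages.append({"role": "user", "content": f"tool result: {result}"})
    raise RuntimeError("agent did not finish within max_steps")


# Stub model so the loop runs without any provider: one tool call, then an answer.
def fake_model(messages):
    if not any(m["content"].startswith("tool result:") for m in messages):
        return {"tool": "search", "tool_input": "LaunchDarkly AI Configs"}
    return {"tool": None, "content": "done: " + messages[-1]["content"]}


answer = run_agent_loop(
    "You are a research assistant.",
    "Research X",
    fake_model,
    {"search": lambda query: f"3 results for {query}"},
)
print(answer)
```

In this picture, the instructions string from an agent-mode AI Config would be the first argument. The loop itself stays in your code or your framework.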

Why LaunchDarkly's abstraction matters

Now that you've seen how fragmented the provider landscape is, let's explore the practical value of LaunchDarkly's abstraction layer.

Switching providers without code changes

Without LaunchDarkly:

import os
import openai

# Hardcoded provider and prompts in your application
openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = openai_client.chat.completions.create(
    model="gpt-4",  # Want to switch to Claude? Need to deploy new code
    messages=[
        {"role": "system", "content": "You are helpful"},  # Want to A/B test prompts? Deploy again
        {"role": "user", "content": "Hello"}
    ]
)

# To switch providers, you need to:
# 1. Write new code for different provider API
# 2. Deploy to production
# 3. Hope nothing breaks

With LaunchDarkly:

import os

import boto3
import openai

# Get config from LaunchDarkly (aiclient and context are created during SDK setup)
config = aiclient.config(key="my-ai-config", context=context)

# You still write provider-specific code, but only once
if config.provider.name == "openai":
    client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model=config.model.name,      # Comes from LaunchDarkly
        messages=config.messages,      # Normalized schema across providers
        temperature=config.model.parameters.get('temperature')
    )
elif config.provider.name == "bedrock":
    client = boto3.client('bedrock-runtime', region_name='us-east-1')
    response = client.converse(
        modelId=config.model.name,    # Comes from LaunchDarkly
        messages=_convert_to_bedrock_format(config.messages),  # LaunchDarkly normalizes, you convert
        inferenceConfig={'temperature': config.model.parameters.get('temperature')}
    )

# Now you can switch providers via LaunchDarkly UI without deployment
# Change prompts, A/B test models, roll out gradually - all via configuration

The real value: Once your code is set up to handle different providers, you can switch between them, change prompts, A/B test models, and roll out changes gradually - all through the LaunchDarkly UI without deploying code. You write the provider handlers once; you manage AI behavior forever.
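
The Bedrock branch above relies on a conversion helper that the snippet does not define. Here is one possible sketch, assuming the messages only use system, user, and assistant roles with plain-text content. The Bedrock Converse API takes system prompts in a separate `system` parameter rather than in `messages`, so this hypothetical `to_bedrock_converse` returns both pieces instead of a single list.

```python
def to_bedrock_converse(messages):
    """Split chat-style messages into Converse `system` blocks and `messages`."""
    system_blocks, converse_messages = [], []
    for message in messages:
        # Accept either plain dicts or objects with .role and .content attributes.
        if isinstance(message, dict):
            role, content = message["role"], message["content"]
        else:
            role, content = message.role, message.content
        if role == "system":
            system_blocks.append({"text": content})
        else:
            converse_messages.append({"role": role, "content": [{"text": content}]})
    return system_blocks, converse_messages


system, chat = to_bedrock_converse([
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hello"},
])
# These would then be passed as:
# client.converse(modelId=..., system=system, messages=chat, inferenceConfig={...})
```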

Security and risk management

AI agents can be powerful and potentially risky. With LaunchDarkly AI Configs, you can:

Instantly disable problematic models or tools without deploying code
Gradually roll out new agent capabilities to a small percentage of users first
Quickly roll back if an agent behaves unexpectedly
Control access by user tier (limit powerful tools to trusted users)
Target specific individuals in production to test experimental AI behavior in real environments without affecting other users

When you're not directly coupled to provider APIs, responding to security issues becomes a configuration change instead of an emergency deployment.

Advanced: Provider-specific packages (JavaScript/TypeScript)

For JavaScript/TypeScript developers looking to reduce boilerplate even further, LaunchDarkly offers optional provider-specific packages. These work with both prompt-based and agent modes and are purely additive - you don't need them to use LaunchDarkly AI Configs effectively.

Available packages:

@launchdarkly/server-sdk-ai-openai - OpenAI provider
@launchdarkly/server-sdk-ai-langchain - LangChain provider (works with both LangChain and LangGraph)
@launchdarkly/server-sdk-ai-vercel - Vercel AI SDK provider

What they provide:

Model creation helpers: One-line functions like createLangChainModel(aiConfig) that return fully-configured model instances
Automatic metrics tracking: Integrated metrics collection
Format conversion utilities: Helper functions to translate between schemas

Example with LangGraph:

// Get agent config from LaunchDarkly
const agentConfig = await ldClient.aiAgent('research-assistant', context);

// Create LangChain model - config already applied
const model = await LangChainProvider.createLangChainModel(agentConfig);

// Use with LangGraph
const agent = createReactAgent(model, tools, agentConfig.instructions);
const response = await agent.invoke({ messages: [{ role: "user", content: "Research X" }] });

Production readiness: These packages are in early development and not recommended for production. They may change without notice.

Python approach: The Python SDK takes a different path with built-in convenience methods like track_openai_metrics() in the single launchdarkly-server-sdk-ai package. See Python AI SDK reference.

Start building with LaunchDarkly AI configs

You now understand how LaunchDarkly's prompt-based and agent modes provide provider-agnostic configuration for your AI applications. Whether you're building chat interfaces or complex multi-agent systems, LaunchDarkly gives you the flexibility to experiment, iterate, and ship AI features without the complexity of managing multiple provider APIs.

Choosing your mode:

Start with prompt-based mode if:

You're building a chat interface or conversational UI
You need online evaluations for quality monitoring
You want precise control over multi-step workflows
You're uncertain which mode fits your use case (it's the more flexible starting point)

Choose agent mode if:

You're integrating with LangGraph, LangChain, CrewAI, or similar frameworks
Your task is goal-oriented rather than conversational ("Research X and create Y")
You're feeding arbitrary data and asking open-ended questions about it

Remember: Both modes give you the same core benefits: provider abstraction, A/B testing, and runtime configuration changes. The choice is about input format, not capabilities.

Get started:

Sign up for a free LaunchDarkly account
Create your first AI Config: Takes less than 5 minutes
Explore example implementations: Learn from working code
Start with prompt-based mode unless you're specifically using an agent framework

All I Want for Christmas is Observable Multi-Modal Agentic Systems

Scarlett Attensil — Wed, 17 Dec 2025 17:31:15 +0000

How Session Replay + Online Evals Revealed How My Holiday Pet App Actually Works

Original article published on December 17, 2025.

I added LaunchDarkly observability to my Christmas-play pet casting app thinking I'd catch bugs. Instead, I unwrapped the perfect gift 🎁. Session replay shows me WHAT users do, and online evaluations show me IF my model made the right casting decision with real-time accuracy scores. Together, they're like milk 🥛 and cookies 🍪 - each good alone, but magical together for production AI monitoring.

See the App in Action

Discovery #1: Users' 40-second patience threshold

I decided to use session replay to evaluate the average time it took users to go through each step in the AI casting process. Session replay is LaunchDarkly's tool that records user interactions in your app - every click, hover, and page navigation - so you can watch exactly what users experience in real-time.

The complete AI casting process takes 30-45 seconds: personality analysis (2-3s), role matching (1-2s), DALL-E 3 costume generation (25-35s), and evaluation scoring (2-3s). That's a long time to stare at a loading spinner wondering if something broke.

What are progress steps?

Progress steps are UI elements I added to the app - not terminal commands or backend processes, but actual visual indicators in the web interface that show users which phase of the AI generation is currently running. These appear as a simple list in the loading screen, updating in real-time as each AI task completes. No commands needed - they automatically display when the user clicks "Get My Role!" and the AI processing begins.

Session replay revealed:

WITHOUT Progress Steps (n=20 early sessions):
0-10 seconds: 20/20 still watching (100%)
10-20 seconds: 18/20 still watching (90%)
20-30 seconds: 14/20 still watching (70%) - rage clicks begin
30-40 seconds: 9/20 still watching (45%) - tab switching detected
40+ seconds: 7/20 still watching (35% stay)

WITH Progress Steps (n=30 after adding them):
0-10 seconds: 30/30 still watching (100%)
10-20 seconds: 29/30 still watching (97%)
20-30 seconds: 25/30 still watching (83%)
30-40 seconds: 23/30 still watching (77%)
40+ seconds: 24/30 still watching (80% stay!)

Critical Discovery: Progress steps more than DOUBLED completion rate (35% → 80%)
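
For anyone checking the math, the comparison uses the 40+ second rows of the two tables above. A quick sketch (the `retention` helper is just for illustration):

```python
def retention(still_watching: int, total: int) -> float:
    """Share of sessions still watching at a given point."""
    return still_watching / total


without_steps = retention(7, 20)   # 40+ seconds, no progress steps
with_steps = retention(24, 30)     # 40+ seconds, with progress steps
improvement = with_steps / without_steps

print(f"{without_steps:.0%} -> {with_steps:.0%} ({improvement:.2f}x)")
# prints "35% -> 80% (2.29x)"
```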

This made the difference:

Clear progress steps:

Step 1: AI Casting Decision
Step 2: Generating Costume Image (10-30s)
Step 3: Evaluation

As each completes:
✅ Step 1: AI Casting Decision
Step 2: Generating Costume Image (10-30s)
Step 3: Evaluation

Session replay showed users hovering over the back button at 25 seconds, then relaxing when they saw "Step 2: Generating Costume Image (10-30s)." The moment they understood DALL-E was creating their pet's costume (not the app freezing), they were willing to wait. Clear progress indicators transform anxiety into patience.
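
The step list itself is simple to drive from state. The app's real UI code is not shown here, so treat this as a language-neutral sketch in Python: `render_progress` is a hypothetical function that takes how many steps have finished and returns the list with check marks on the completed ones.

```python
STEPS = [
    "Step 1: AI Casting Decision",
    "Step 2: Generating Costume Image (10-30s)",
    "Step 3: Evaluation",
]


def render_progress(completed: int) -> str:
    """Return the step list with a check mark on each completed step."""
    lines = []
    for index, label in enumerate(STEPS):
        prefix = "✅ " if index < completed else ""
        lines.append(prefix + label)
    return "\n".join(lines)


print(render_progress(1))
```

Putting the expected wait in the label of the slow step does most of the work, since users see the time estimate before they start to worry.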

Discovery #2: Observability + online evaluations give the complete picture

Session replay shows user behavior and experience. Online evaluations expose AI output quality through accuracy scoring. Together, they form a solid strategy for AI observability.

To see this in action, let's take a closer look at an example.

Example: The speed-running corgi owner

In this scenario, a user blazes through the entire pet app setup, from the initial quiz to the final results, in record time. So fast, in fact, that speed killed quality instead of producing a good outcome.

Session Replay Showed:

Quiz completed in 8 seconds (world record) - they clicked the first option for every question
Skipped photo upload entirely
Waited the full 31 seconds for processing
Got their result: "Sheep"
Started rage clicking on the sheep image immediately
Left the site without saving or sharing

Why did their energetic corgi get cast as a sheep? The rushed quiz responses created a contradictory personality profile that confused the AI. Without a photo to provide visual context, the model defaulted to its safest, most generic casting choice.

Online Evaluation Results:

Evaluation Score: 38/100 ❌
Reasoning: "Costume contains unsafe elements: eyeliner, ribbons"
Wait, what? The AI suggested face paint and ribbons, evaluation said NO

Online evaluations use a model-agnostic evaluation (MAE) - an AI agent that evaluates other AI outputs for quality, safety, or accuracy. The out-of-the-box evaluation judge is overly cautious about physical safety. For the above scenario, the evaluation comments:

"Costume includes eyeliner which could be harmful to pets" (It's a DALL-E image!)
"Ribbons pose entanglement risk"
"Bells are a choking hazard" (It's AI-generated art!)

About 40% of low scores are actually the evaluation being overprotective about imaginary safety issues, not bad casting.
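
A figure like this can be estimated by scanning the reasoning text of low-scoring evaluations for safety language. The sketch below is one rough way to do it; the keyword list, the cutoff of 50 for a "low" score, and the record shape are all assumptions, and the sample records are made up apart from the reasoning strings quoted in this post.

```python
SAFETY_TERMS = ("unsafe", "harmful", "hazard", "entanglement", "choking")


def safety_share_of_low_scores(evaluations, low_score=50):
    """Among low-scoring evaluations, return the fraction whose reasoning cites safety."""
    low = [e for e in evaluations if e["score"] < low_score]
    if not low:
        return 0.0
    flagged = [
        e for e in low
        if any(term in e["reasoning"].lower() for term in SAFETY_TERMS)
    ]
    return len(flagged) / len(low)


sample = [
    {"score": 38, "reasoning": "Costume contains unsafe elements: eyeliner, ribbons"},
    {"score": 30, "reasoning": "Role does not match personality"},
    {"score": 96, "reasoning": "Personality perfectly matches role archetype"},
]
print(safety_share_of_low_scores(sample))  # 0.5
```

Keyword matching is crude, so spot-check the flagged evaluations by hand before acting on the number.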

Speed-runners get generic roles AND the evaluation writes safety warnings about digital costumes. Users see these low scores and think the app doesn't work well.

But speed-running isn't the whole story. To truly understand the relationship between user engagement and AI quality, we need to see the flip side. The perfect user. One who gives the AI everything it needs to succeed. What happens when a user takes their time and engages thoughtfully with every step?

Example: The perfect match

Session Replay Showed:

45 seconds on quiz (reading each option)
Uploaded photo, waited for processing
Spent 2 minutes on results page
Downloaded image multiple times

Online Evaluation Results:

Evaluation Score: 96/100 ⭐⭐⭐⭐⭐
Reasoning: "Personality perfectly matches role archetype"
Photo bonus: "Visual traits enhanced casting accuracy"

Time invested = Quality received. The AI rewards thoughtfulness.

Discovery #3: The photo upload comedy gold mine

Session replay revealed what photos people ACTUALLY upload. Without it, you'd never know that one in three photo uploads is problematic, and you'd be flying blind on whether to add validation or trust your model.

Example: The surprising photo upload analysis

Session Replay Showed:

Photo Upload Analysis (n=18 who uploaded):
- 12 (67%) Normal pet photos
- 2 (11%) Screenshots of pet photos on their phone
- 1 (6%) Multiple pets in one photo (chaos)
- 1 (6%) Blurry "pet in motion" disaster
- 1 (6%) Stock photo of their breed (cheater!)

Despite 33% problematic inputs, evaluation scores remained high (87-91/100). The AI is remarkably resilient.

Example: When "bad" photos produce great results

My Favorite Session: Someone uploaded a photo of their cat mid-yawn. The AI vision model described it as "displaying fierce predatory behavior." The cat was cast as a "Protective Father." Evaluation score: 91/100. The owner downloaded it immediately.

The Winner: Someone's hamster photo that was 90% cage bars. The AI somehow extracted "small fuzzy creature behind geometric patterns" and cast it as "Shepherd" because "clearly experienced at navigating barriers." Evaluation score: 87/100.

Without session replay, you'd only see evaluation scores and think "the AI is working well." But session replay reveals users are uploading screenshots and blurry photos, input quality issues that could justify adding photo validation.

However, the high evaluation scores prove the AI handles imperfect real-world data gracefully. This insight saved me from over-engineering photo validation that would have slowed down the user experience for minimal quality gains.

Session replay + online evaluations together answered the question "Should I add photo validation?" The answer: No. Trust the model's resilience and keep the experience frictionless.

The magic formula: Why this combo works (and what surprised me)

Without Observability:

"The app seems slow" → ¯\(ツ)/¯
"We have 20 visitors but 7 completions" → Where do they drop?

With Session Replay ONLY:

"User got sheep and rage clicked; maybe left angry" → Was this a bad match?

With Model-Agnostic Evaluation ONLY:

"Evaluation: 22/100 - Eyeliner unsafe for pets" → How did the user react?
"Evaluation: 96/100 - Perfect match!" → How did this compare to the image they uploaded?

With BOTH:

"User rushed, got sheep with ribbons, evaluation panicked about safety"
→ The OOTB evaluation treats image generation prompts like real costume instructions
"40% of low scores are costume safety, not bad matching"
→ Need custom evaluation criteria (coming soon!)
"Users might think low score = bad casting, but it's often = protective evaluation"
→ Would benefit from custom evaluation criteria to avoid this confusion

The evaluation thinks we're putting actual ribbons on actual cats. It doesn't realize these are AI-generated images. So when the casting suggests "sparkly collar with bells," the evaluation judge practically calls animal services.

Now that you've seen what's possible when you combine user behavior tracking with AI quality scoring, let's walk through how to add this same observability magic to your own multi-modal AI app.

Your turn: See the complete picture

Want to add this observability magic to your own app? Here's how:

1. Install the packages

npm install @launchdarkly/observability
npm install @launchdarkly/session-replay

2. Initialize with observability

import { initialize } from 'launchdarkly-js-client-sdk';
import Observability from '@launchdarkly/observability';
import SessionReplay from '@launchdarkly/session-replay';

const ldClient = initialize(clientId, user, {
  plugins: [
    new Observability(),
    new SessionReplay({
      privacySetting: 'strict' // Masks all data on the page - see https://launchdarkly.com/docs/sdk/features/session-replay-config#expand-javascript-code-sample
    })
  ]
});

3. Configure online evaluations in dashboard

Create your AI Config in LaunchDarkly for LLM evaluation
Enable automatic accuracy scoring for production monitoring

Set accuracy weight to 100% for production AI monitoring
Monitor your AI outputs with real-time evaluation scoring

4. Connect the dots

Session replay shows you:

Where users drop off
What confuses them
When they rage click
How long they wait

Online evaluations show you:

AI decision accuracy scores
Why certain outputs scored low
Pattern of good vs bad castings
Safety concerns (even for pixels!)

Together they reveal the complete story of your AI app.
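
Connecting the two views mostly comes down to joining them on a shared session identifier. The sketch below assumes you have summarized each source into simple records; the field names (`session_id`, `rage_clicks`, `score`, `reasoning`) and both function names are hypothetical, not a LaunchDarkly export format.

```python
def combine(sessions, evaluations):
    """Join session replay summaries with evaluation results by session ID."""
    by_session = {e["session_id"]: e for e in evaluations}
    combined = []
    for session in sessions:
        evaluation = by_session.get(session["session_id"])
        if evaluation is None:
            continue  # no evaluation was sampled for this session
        combined.append({**session, **evaluation})
    return combined


def frustrated_low_scores(combined, low_score=50):
    """Session IDs where the user rage clicked and the evaluation scored low."""
    return [
        record["session_id"]
        for record in combined
        if record["rage_clicks"] > 0 and record["score"] < low_score
    ]


# Made-up records loosely based on the two example sessions in this post.
sessions = [
    {"session_id": "a", "quiz_seconds": 8, "rage_clicks": 4},
    {"session_id": "b", "quiz_seconds": 45, "rage_clicks": 0},
]
evaluations = [
    {"session_id": "a", "score": 38, "reasoning": "Costume contains unsafe elements"},
    {"session_id": "b", "score": 96, "reasoning": "Personality perfectly matches role archetype"},
]
print(frustrated_low_scores(combine(sessions, evaluations)))  # ['a']
```

The sessions this returns are the ones worth replaying first, because they pair a visible user reaction with a score and reasoning that explain it.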

Resources to get started:

Full Implementation Guide - See how this pet app implements both features

Session Replay Tutorial - Official LaunchDarkly guide for detecting user frustration

When to Add Online Evals - Learn when and how to implement AI evaluation

The real magic is in having observability AND online evaluations.

Try it yourself

Cast your pet: https://scarlett-critter-casting.onrender.com/

See your evaluation score ⭐. Understand why your cat is a shepherd and your dog is an angel. The AI has spoken, and now you can see exactly how much to trust it!

Ready to add AI observability to your multi-modal agents?

Don't let your AI operate in the dark this holiday season. Get complete visibility into your multi-modal AI systems with LaunchDarkly's online evaluations and session replay.

Get started: Sign up for a free trial → Create your first AI Config → Enable session replay and online evaluations → Ship with confidence.
