<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tang Weigang</title>
    <description>The latest articles on DEV Community by Tang Weigang (@doramagic).</description>
    <link>https://dev.to/doramagic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936036%2F9206ba55-8d8f-457e-a399-6e3316ef715f.png</url>
      <title>DEV Community: Tang Weigang</title>
      <link>https://dev.to/doramagic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/doramagic"/>
    <language>en</language>
    <item>
      <title>Do Not Treat Pydantic AI as an Agent Magic Layer</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Sat, 27 Jun 2026 06:48:39 +0000</pubDate>
      <link>https://dev.to/doramagic/do-not-treat-pydantic-ai-as-an-agent-magic-layer-5hgd</link>
      <guid>https://dev.to/doramagic/do-not-treat-pydantic-ai-as-an-agent-magic-layer-5hgd</guid>
      <description>&lt;p&gt;Pydantic AI is easy to describe as another Python agent framework. That is technically true, but it misses the useful part. The project becomes interesting when you treat it as a way to make agent behavior inspectable: typed outputs, dependency injection, tool schemas, provider boundaries, traces, evals, and human approval points in one engineering surface.&lt;/p&gt;

&lt;p&gt;The Doramagic pydantic-ai manual classifies it as an Agent SDK and runtime. The best fit is not "anyone who wants a chatbot." It is developers building observable, testable, multi-tool agent applications. The bad fit is also important: if you only need one prompt or a simple API call, or if you work in an environment where tool permissions cannot be isolated, Pydantic AI may be more machinery than you need.&lt;/p&gt;

&lt;p&gt;My first rule for adopting it would be simple: do not start by asking the agent to do a large real task. Start by proving that a minimal agent can run with fake tools, temporary dependencies, typed output, and no unexpected side effects.&lt;/p&gt;

&lt;h2&gt;The real object is a workflow contract&lt;/h2&gt;

&lt;p&gt;The value of Pydantic AI shows up when your agent has to do more than produce text. It is most useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the result needs to validate against a Pydantic &lt;code&gt;BaseModel&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;runtime state has to be passed through a typed &lt;code&gt;RunContext&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;tool arguments must be schema-checked before execution;&lt;/li&gt;
&lt;li&gt;provider-specific behavior needs to stay behind a clear adapter boundary;&lt;/li&gt;
&lt;li&gt;traces need to show tool choices, retries, failures, and branches;&lt;/li&gt;
&lt;li&gt;evals need to catch regressions before release;&lt;/li&gt;
&lt;li&gt;some tool calls require approval before they execute.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a different problem from "make the model answer nicely." It is closer to "make the model-driven workflow reviewable."&lt;/p&gt;

&lt;h2&gt;Structured output is not just prettier JSON&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;output_type=SupportOutput&lt;/code&gt; pattern in the README example is easy to underestimate. It is more than a formatting trick: it turns the model response into a business object that the rest of the application can inspect.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
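&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SupportOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;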
    &lt;span class="n"&gt;support_advice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;block_card&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;block_card&lt;/code&gt; is not something the caller has to infer from a paragraph. It is a field that can be reviewed, logged, tested, and gated. If the output fails validation, the framework can feed that error back to the model and retry.&lt;/p&gt;

&lt;p&gt;The boundary is just as important: a typed field is not a policy. If &lt;code&gt;risk=8&lt;/code&gt; triggers a real action, the team still needs to define what risk means, who can approve the next step, and which action is allowed.&lt;/p&gt;
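&lt;p&gt;A minimal sketch of that separation, using only the standard library. The dataclass is a stand-in for the validated model, and &lt;code&gt;APPROVAL_THRESHOLD&lt;/code&gt;, &lt;code&gt;next_step&lt;/code&gt;, and the action names are hypothetical, not part of Pydantic AI. The point is only that the policy lives in ordinary, reviewable code next to the typed output.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SupportOutput:
    """Stdlib stand-in for the validated output model (assumption for this sketch)."""
    support_advice: str
    block_card: bool
    risk: int


# Hypothetical policy value: the team owns this number, not the model.
APPROVAL_THRESHOLD = 7


def next_step(output: SupportOutput, approved_by: Optional[str] = None) -> str:
    """Decide what may happen next; the policy lives here, not in the type."""
    if not output.block_card:
        return "reply_only"
    if output.risk >= APPROVAL_THRESHOLD and approved_by is None:
        return "await_approval"
    return "block_card"


decision = next_step(SupportOutput("Card looks compromised.", True, 8))
print(decision)
```

&lt;p&gt;With this shape, a high-risk output cannot reach the blocking action until a named approver is supplied, and the threshold can be changed in review without touching the prompt.&lt;/p&gt;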

&lt;h2&gt;Tool calls should start with fake tools&lt;/h2&gt;

&lt;p&gt;Pydantic AI tools are normal Python functions registered through decorators, with arguments validated against a schema. That gives you a clean engineering surface. It also means the first mistake can be expensive if the tool has real permissions.&lt;/p&gt;

&lt;p&gt;My adoption sequence would be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;register one fake tool that returns a fixed value;&lt;/li&gt;
&lt;li&gt;pass temporary dependencies through &lt;code&gt;RunContext&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;verify that the agent selects the expected tool;&lt;/li&gt;
&lt;li&gt;verify that arguments match the intended schema;&lt;/li&gt;
&lt;li&gt;only then replace the fake tool with a real implementation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do not give the first agent production API keys, write access, browser automation, or a broad filesystem view. The Doramagic boundary card says the first use should start with least privilege, a temporary environment, and rollback. That is the right default.&lt;/p&gt;
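&lt;p&gt;The fake-tool step can be sketched without any framework at all. Everything here is an assumption for illustration: &lt;code&gt;FakeBalanceTool&lt;/code&gt;, its parameters, and &lt;code&gt;check_call&lt;/code&gt; are hypothetical names, and in a real run the agent, not the test script, would make the call.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FakeBalanceTool:
    """Fake tool: returns a fixed value and records every call for later checks."""
    calls: list = field(default_factory=list)

    def __call__(self, customer_id: int, include_pending: bool = False) -> float:
        self.calls.append(
            {"customer_id": customer_id, "include_pending": include_pending}
        )
        return 123.45  # fixed value: no network, no database, no side effect


def check_call(call: dict) -> list:
    """Return the schema problems found in one recorded call (empty means OK)."""
    problems = []
    if type(call.get("customer_id")) is not int:
        problems.append("customer_id must be an int")
    if type(call.get("include_pending")) is not bool:
        problems.append("include_pending must be a bool")
    return problems


fake_tool = FakeBalanceTool()
result = fake_tool(customer_id=42, include_pending=True)
print(result, fake_tool.calls, check_call(fake_tool.calls[0]))
```

&lt;p&gt;The recorded calls give you something concrete to assert on: which tool ran, how many times, and with which arguments, before any real permission is attached.&lt;/p&gt;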

&lt;h2&gt;&lt;code&gt;RunContext&lt;/code&gt; is a state channel, not a junk drawer&lt;/h2&gt;

&lt;p&gt;The typed dependency pattern is one of the cleanest parts of Pydantic AI. It lets tools and dynamic instructions access runtime state without global variables. But it can also become a quiet permission leak.&lt;/p&gt;

&lt;p&gt;I would split dependencies into three buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;required state: current task, current user scope, allowed data range;&lt;/li&gt;
&lt;li&gt;read-only configuration: model choice, thresholds, feature flags;&lt;/li&gt;
&lt;li&gt;forbidden state: long-lived secrets, full user tables, production write clients, and unrelated internal documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a tool needs a user id, pass a user id. Do not pass the entire account object. If it only needs read access, do not pass a write-capable client. A stronger agent framework makes state minimization more important, not less.&lt;/p&gt;
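&lt;p&gt;One way to keep those buckets honest is to make the dependency object small, frozen, and checkable. This is a sketch under stated assumptions: &lt;code&gt;SupportDeps&lt;/code&gt;, its fields, and the &lt;code&gt;FORBIDDEN_NAMES&lt;/code&gt; list are hypothetical, and a name check is only a cheap tripwire, not a security control.&lt;/p&gt;

```python
from dataclasses import FrozenInstanceError, dataclass, fields


@dataclass(frozen=True)
class SupportDeps:
    user_id: int              # required state: an id, not the whole account object
    allowed_doc_prefix: str   # required state: the data range this run may read
    risk_threshold: int = 7   # read-only configuration


# Hypothetical names that should never appear on a deps object.
FORBIDDEN_NAMES = frozenset({"api_key", "db_write_client", "user_table", "account"})


def forbidden_fields(deps_type: type) -> set:
    """Return the field names of a deps dataclass that are on the forbidden list."""
    names = {f.name for f in fields(deps_type)}
    return names.intersection(FORBIDDEN_NAMES)


deps = SupportDeps(user_id=7, allowed_doc_prefix="kb/public/")
print(forbidden_fields(SupportDeps))
```

&lt;p&gt;Because the dataclass is frozen, a tool cannot quietly widen its own scope by mutating the dependencies it was handed.&lt;/p&gt;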

&lt;h2&gt;Trace should explain decisions, not dump everything&lt;/h2&gt;

&lt;p&gt;Pydantic AI's observability surface is one reason it is useful for serious agent work. For agents, the trace is often more important than the final sentence. A correct answer can still hide an unsafe path: the wrong tool, too many retries, weak retrieval, swallowed errors, or a side effect that should have waited for approval.&lt;/p&gt;

&lt;p&gt;A useful trace should answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which tool was selected?&lt;/li&gt;
&lt;li&gt;Which context or source was used?&lt;/li&gt;
&lt;li&gt;Where did the run retry, fail, or branch?&lt;/li&gt;
&lt;li&gt;Did structured output validation fail and recover?&lt;/li&gt;
&lt;li&gt;Was any side effect attempted?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But observability has a data boundary. The Doramagic manual calls out community concerns around large OpenTelemetry attributes such as serialized request parameters. That is a reminder to decide what enters traces, what gets redacted, and what never gets logged.&lt;/p&gt;

&lt;p&gt;The goal is to preserve the decision path, not to archive sensitive prompts, private documents, secrets, or oversized context.&lt;/p&gt;
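&lt;p&gt;A redaction pass over trace attributes might look like the sketch below. The key names and the size limit are assumptions, and this is plain dictionary handling rather than any OpenTelemetry or Pydantic AI API.&lt;/p&gt;

```python
# Hypothetical attribute names and size limit; adjust to your own trace schema.
SENSITIVE_KEYS = {"prompt", "api_key", "authorization", "document_text"}
MAX_ATTR_CHARS = 200


def redact_attributes(attrs: dict) -> dict:
    """Keep the decision path; mask or shrink what should not be archived."""
    clean = {}
    for key, value in attrs.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[redacted]"
        elif isinstance(value, str) and len(value) > MAX_ATTR_CHARS:
            clean[key] = value[:MAX_ATTR_CHARS] + "...[truncated]"
        else:
            clean[key] = value
    return clean


span = {
    "tool_name": "lookup_order",
    "retries": 2,
    "prompt": "full user prompt with private details",
    "request_params": "x" * 500,
}
safe_span = redact_attributes(span)
print(safe_span["tool_name"], safe_span["retries"], safe_span["prompt"])
```

&lt;p&gt;The tool name and retry count survive, which is what a reviewer needs to reconstruct the path. The prompt and the oversized parameter blob do not.&lt;/p&gt;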

&lt;h2&gt;Toolsets, MCP, and capabilities need an allowlist&lt;/h2&gt;

&lt;p&gt;Pydantic AI's toolsets, MCP support, capabilities, deferred loading, and human-in-the-loop tools are powerful because they let an agent load external capabilities in a structured way. They are also exactly where teams should slow down.&lt;/p&gt;

&lt;p&gt;Before enabling a capability, I would write a small allowlist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which toolsets this agent may see;&lt;/li&gt;
&lt;li&gt;which tools are disabled by default;&lt;/li&gt;
&lt;li&gt;which tools require approval;&lt;/li&gt;
&lt;li&gt;what happens when a tool fails;&lt;/li&gt;
&lt;li&gt;whether tool output can enter the final answer;&lt;/li&gt;
&lt;li&gt;where tool calls are audited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an agent can see every tool all the time, you do not have a capability system. You have a permission problem waiting for the wrong prompt.&lt;/p&gt;
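&lt;p&gt;The allowlist does not need to be elaborate. Here is a minimal sketch in plain Python; the tool names, &lt;code&gt;ToolRule&lt;/code&gt;, and &lt;code&gt;authorize&lt;/code&gt; are hypothetical and independent of how Pydantic AI itself registers or filters tools.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRule:
    enabled: bool = False         # disabled unless someone turns it on
    needs_approval: bool = True   # approval required unless stated otherwise


# Hypothetical tools for one agent; anything not listed is denied.
ALLOWLIST = {
    "search_docs": ToolRule(enabled=True, needs_approval=False),
    "send_refund": ToolRule(enabled=True, needs_approval=True),
    "delete_account": ToolRule(),  # listed for audit purposes, but disabled
}


def authorize(tool_name: str, approved: bool = False) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for one requested tool call."""
    rule = ALLOWLIST.get(tool_name)
    if rule is None or not rule.enabled:
        return "deny"
    if rule.needs_approval and not approved:
        return "needs_approval"
    return "allow"


for name in ("search_docs", "send_refund", "delete_account", "run_shell"):
    print(name, authorize(name))
```

&lt;p&gt;The defaults matter more than the entries: a new tool added without thought ends up disabled and approval-gated, not silently available.&lt;/p&gt;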

&lt;h2&gt;A host rule I would load first&lt;/h2&gt;

&lt;p&gt;If I were asking Claude Code, Codex, Cursor, or another AI coding host to help with Pydantic AI, I would give it this rule before asking for code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You may help design a Pydantic AI agent, but first state:
1. whether the target is chat, RAG, a tool-using agent, or a multi-agent workflow;
2. whether output needs a Pydantic BaseModel contract;
3. which state is allowed in deps and which state is forbidden;
4. each tool's permission level, input schema, failure behavior, and approval rule;
5. which trace fields are recorded, redacted, or excluded;
6. which smoke check or eval proves the agent did not cross its boundary.

Do not claim Pydantic AI is installed or working locally without a separate run log.
Do not use real secrets, production write access, or broad filesystem access unless the user explicitly approves it.
Do not treat a prompt preview as a real project run.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That rule is more useful than "build me an agent." It turns the task into a reviewable boundary contract.&lt;/p&gt;

&lt;h2&gt;A sane first day&lt;/h2&gt;

&lt;p&gt;A practical first day would be deliberately small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;create a temporary directory;&lt;/li&gt;
&lt;li&gt;follow the official quick start in isolation;&lt;/li&gt;
&lt;li&gt;define one &lt;code&gt;Agent&lt;/code&gt; with one simple &lt;code&gt;BaseModel&lt;/code&gt; output;&lt;/li&gt;
&lt;li&gt;add one fake tool that returns a fixed value;&lt;/li&gt;
&lt;li&gt;pass temporary dependencies through &lt;code&gt;RunContext&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;capture the tool path or trace;&lt;/li&gt;
&lt;li&gt;write a smoke check that verifies the tool call, output validation, and no unexpected side effect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not flashy. It is useful because it answers the first adoption question: can this framework make my agent behavior more inspectable?&lt;/p&gt;

&lt;p&gt;Pydantic AI's strength is not that it makes agents feel magical. Its strength is that it gives agent work a shape: types, tools, runtime state, traces, evals, approval points, and stop conditions. That is the part worth loading into an AI host.&lt;/p&gt;

&lt;h2&gt;Reference roles&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream project: pydantic/pydantic-ai, the source for code, installation, release behavior, and API facts, &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;https://github.com/pydantic/pydantic-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Doramagic project page: an independent capability asset for AI hosts, &lt;a href="https://doramagic.ai/en/projects/pydantic-ai/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/pydantic-ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Doramagic manual: a practical reading path for agents, providers, structured outputs, toolsets, MCP, traces, evals, and pitfalls, &lt;a href="https://doramagic.ai/en/projects/pydantic-ai/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/pydantic-ai/manual/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>pydantic</category>
    </item>
    <item>
      <title>Before You Ship an Agent, Make DeepEval Test the Failure Path</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Fri, 26 Jun 2026 02:38:39 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-ship-an-agent-make-deepeval-test-the-failure-path-3n5l</link>
      <guid>https://dev.to/doramagic/before-you-ship-an-agent-make-deepeval-test-the-failure-path-3n5l</guid>
      <description>&lt;h1&gt;Before You Ship an Agent, Make DeepEval Test the Failure Path&lt;/h1&gt;

&lt;p&gt;Most AI agent projects add evaluation too late. The usual order is: connect the model, wire the tools, add retrieval, make the demo work, then think about evals. That is convenient, but it means the team only knows that a few happy paths looked fine. It does not know which failures are stable, which ones are dangerous, and which ones will quietly return with the next prompt or model change.&lt;/p&gt;

&lt;p&gt;DeepEval is useful when you treat it as a release gate, not as a dashboard you add after launch. The Doramagic DeepEval manual breaks the project into the practical pieces that matter for that gate: &lt;code&gt;LLMTestCase&lt;/code&gt;, &lt;code&gt;GEval&lt;/code&gt;, &lt;code&gt;AnswerRelevancyMetric&lt;/code&gt;, &lt;code&gt;TaskCompletionMetric&lt;/code&gt;, hallucination checks, &lt;code&gt;deepeval test run&lt;/code&gt;, &lt;code&gt;deepeval generate golden&lt;/code&gt;, trace-aware evaluation, framework integrations, and the difference between local evaluation and Confident AI cloud synchronization.&lt;/p&gt;

&lt;p&gt;The point is not to say "use every metric." The point is to make agent failure testable before the agent touches real workflows.&lt;/p&gt;

&lt;h2&gt;Start with failure examples&lt;/h2&gt;

&lt;p&gt;The first question should not be "which metric should we use?" A better first question is: what does a bad answer look like in this product?&lt;/p&gt;

&lt;p&gt;For an agent or RAG system, useful failure examples might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the retriever found the right context, but the answer ignored the key fact;&lt;/li&gt;
&lt;li&gt;the agent completed the task with the wrong tool;&lt;/li&gt;
&lt;li&gt;the final answer sounded confident, but the &lt;code&gt;retrieval_context&lt;/code&gt; did not support it;&lt;/li&gt;
&lt;li&gt;the tool path worked once, but the trace showed repeated retries or a wrong branch;&lt;/li&gt;
&lt;li&gt;the answer looked correct but violated a permission rule, policy rule, or user constraint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once those examples are written down, metrics become meaningful. Without them, a threshold such as 0.7 is just a number.&lt;/p&gt;

&lt;h2&gt;Keep the first test case boring&lt;/h2&gt;

&lt;p&gt;A minimal DeepEval test case can be very small:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;

&lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customers can get a free refund within 30 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer a 30-day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieval_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All customers are eligible for a 30 day full refund at no extra costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value is not in the amount of code. The value is in separating input, actual output, expected output, and retrieval context. That separation prevents a common RAG mistake: judging whether the answer reads well instead of checking whether the allowed context supports it.&lt;/p&gt;

&lt;h2&gt;Pick metrics by failure type&lt;/h2&gt;

&lt;p&gt;DeepEval gives you several metric families. I would not start by wiring all of them into a pipeline. Pick one or two that map to a known failure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use answer relevancy when the answer drifts away from the user's question.&lt;/li&gt;
&lt;li&gt;Use task completion when the agent may finish the wrong job or skip a step.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;GEval&lt;/code&gt; when the product has custom criteria that need to be spelled out.&lt;/li&gt;
&lt;li&gt;Use retrieval-aware tests when the system depends on context, sources, or documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical rule: every metric should tell you where the fix belongs. Is it the prompt, the retriever, the tool router, the dataset, the threshold, or the product boundary? If a failed eval only tells you "the model was bad," the eval is still too vague.&lt;/p&gt;

&lt;h2&gt;For agents, trace matters more than the final sentence&lt;/h2&gt;

&lt;p&gt;The Doramagic manual distinguishes end-to-end evaluation from trace-aware evaluation. For simple chatbots, final-answer checks already help. For agents, path quality matters more.&lt;/p&gt;

&lt;p&gt;An agent can produce the right final sentence while taking an unsafe or unstable path. It may call a tool it should not use, retry the same step several times, continue after low-quality retrieval, or swallow a tool error and write a polished conclusion.&lt;/p&gt;

&lt;p&gt;For agent evaluation, I would want the trace to answer four questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which tool was selected?&lt;/li&gt;
&lt;li&gt;Which context or source was used?&lt;/li&gt;
&lt;li&gt;Where did the run retry, fail, or branch?&lt;/li&gt;
&lt;li&gt;Which evidence supported the final answer?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without that, you may only be evaluating writing quality.&lt;/p&gt;

&lt;h2&gt;Generated goldens still need review&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;deepeval generate golden&lt;/code&gt; is valuable because it lowers the cost of starting an eval set. It can generate candidate goldens from documents, from contexts, from scratch, or from existing golden examples. But generated goldens are not the same thing as reviewed truth.&lt;/p&gt;

&lt;p&gt;A safer path is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate 20 to 50 candidate cases;&lt;/li&gt;
&lt;li&gt;remove duplicates, vague questions, and unsupported answers;&lt;/li&gt;
&lt;li&gt;mark 5 to 10 cases as critical regression cases;&lt;/li&gt;
&lt;li&gt;rerun those cases whenever the prompt, retriever, tool router, or model version changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That turns DeepEval into a regression habit instead of a one-time screenshot.&lt;/p&gt;
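&lt;p&gt;The filtering part of that review can be partly mechanical. The sketch below works on plain dictionaries rather than DeepEval objects, and the keys &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;expected_output&lt;/code&gt;, and &lt;code&gt;context&lt;/code&gt; plus the helper names are assumptions for illustration. A human still decides which cases are vague and which are critical.&lt;/p&gt;

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions compare equal."""
    return " ".join(question.lower().split())


def review_goldens(candidates: list, critical_inputs: set) -> list:
    """Drop duplicates and unsupported cases, then flag the critical regressions."""
    critical_keys = {normalize(q) for q in critical_inputs}
    seen = set()
    kept = []
    for case in candidates:
        key = normalize(case["input"])
        if key in seen:
            continue  # duplicate question
        if not case.get("expected_output") or not case.get("context"):
            continue  # no answer or no supporting context: not usable as a golden
        seen.add(key)
        kept.append({**case, "critical": key in critical_keys})
    return kept


candidates = [
    {"input": "What is the refund policy?", "expected_output": "30-day full refund.",
     "context": ["All customers are eligible for a 30 day full refund."]},
    {"input": "what is the  refund policy?", "expected_output": "30 days.",
     "context": ["All customers are eligible for a 30 day full refund."]},
    {"input": "Can I pay by cheque?", "expected_output": "", "context": []},
]
kept = review_goldens(candidates, {"What is the refund policy?"})
print(len(kept), kept[0]["critical"])
```

&lt;p&gt;The cases flagged as critical are the ones to rerun on every prompt, retriever, router, or model change.&lt;/p&gt;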

&lt;h2&gt;Local eval and cloud sync are different risk levels&lt;/h2&gt;

&lt;p&gt;The basic local path can be small:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; deepeval
deepeval &lt;span class="nb"&gt;test &lt;/span&gt;run test_chatbot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you log in and sync reports, datasets, traces, or production monitoring to Confident AI, you should treat it as a separate data-boundary decision. Before doing that, answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;will inputs, outputs, retrieval context, or traces be uploaded?&lt;/li&gt;
&lt;li&gt;do any cases contain user data, internal documents, or secrets?&lt;/li&gt;
&lt;li&gt;who can view the report?&lt;/li&gt;
&lt;li&gt;can failure cases be redacted?&lt;/li&gt;
&lt;li&gt;should CI be allowed to sync results automatically?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a criticism of DeepEval. It is just the normal boundary work for any eval or observability system.&lt;/p&gt;

&lt;h2&gt;A useful host rule&lt;/h2&gt;

&lt;p&gt;If an AI coding host is going to help set up DeepEval, I would give it this rule first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You may design DeepEval tests, but first state:
1. whether the target is RAG, an agent, a chatbot, or a single prompt;
2. the failure examples being tested;
3. which metric maps to which failure;
4. why the threshold is chosen;
5. whether trace-aware evaluation is needed;
6. whether cloud sync, API keys, or user data are involved.

Do not treat LLM-as-a-Judge scores as absolute truth.
Do not treat generated goldens as human-reviewed labels.
Do not claim DeepEval is installed or validated locally without a separate run log.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That rule is more useful than "add evals." It makes the evaluation plan reviewable.&lt;/p&gt;

&lt;h2&gt;A sane first day&lt;/h2&gt;

&lt;p&gt;For a first run, I would pick one real workflow and write ten cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;three normal success cases;&lt;/li&gt;
&lt;li&gt;three likely hallucination cases;&lt;/li&gt;
&lt;li&gt;two missing-context cases;&lt;/li&gt;
&lt;li&gt;one permission-boundary case;&lt;/li&gt;
&lt;li&gt;one empty-result case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I would add one metric and make the failure explanation useful before adding more metrics, trace collection, generated goldens, or CI gates.&lt;/p&gt;

&lt;p&gt;DeepEval's value is not that it makes AI systems look controlled. Its value is that it surfaces failures earlier, defines them more sharply, and makes them easier to reproduce.&lt;/p&gt;

&lt;h2&gt;Reference roles&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream project: confident-ai/deepeval, the source for code, installation, releases, and API facts, &lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;https://github.com/confident-ai/deepeval&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Doramagic project page: an independent capability asset for AI hosts, &lt;a href="https://doramagic.ai/en/projects/deepeval/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/deepeval/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Doramagic manual: a practical reading path for test cases, metrics, tracing, generated goldens, pitfalls, and boundaries, &lt;a href="https://doramagic.ai/en/projects/deepeval/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/deepeval/manual/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>agents</category>
    </item>
    <item>
      <title>Before an Agent Uses Qdrant, Write the Retrieval Contract</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Thu, 25 Jun 2026 02:25:03 +0000</pubDate>
      <link>https://dev.to/doramagic/before-an-agent-uses-qdrant-write-the-retrieval-contract-21p1</link>
      <guid>https://dev.to/doramagic/before-an-agent-uses-qdrant-write-the-retrieval-contract-21p1</guid>
      <description>&lt;h1&gt;Before an Agent Uses Qdrant, Write the Retrieval Contract&lt;/h1&gt;

&lt;p&gt;When people wire a vector database into an agent, the first milestone is usually simple: store embeddings, run a search, pass the hits back to the model. That is enough for a demo, but it leaves the most important questions implicit. Which collection was searched? Was the payload filter mandatory or optional? Is the score threshold a business rule or a temporary guess? If nothing is returned, does that mean the knowledge is missing, the filter was too strict, the embedding model drifted, or the index is not ready?&lt;/p&gt;

&lt;p&gt;The Doramagic Qdrant project pack is a useful reminder that Qdrant is not just a "semantic search endpoint." Its manual points at HNSW, sparse and multivector search, quantization, payload indexing, filtering, sharding, replication, WAL, and storage internals. That breadth matters when an AI host is going to call Qdrant as a tool.&lt;/p&gt;

&lt;p&gt;The thesis is simple: Qdrant should enter an agent workflow as a retrieval layer with a contract, not as a magic memory button. The Doramagic pack is not a prompt library. It is an independent capability asset: the Human Manual gives a reading route, the Prompt Preview gives pre-install host instructions, the pitfall notes and boundary card define limits, the source map points back to repo evidence, and eval or smoke-check thinking turns "it sounds right" into something reviewable.&lt;/p&gt;

&lt;p&gt;This is a technical workflow note, not an official Qdrant guide and not a claim that Qdrant was installed in this environment.&lt;/p&gt;

&lt;h2&gt;Do not stop at "the vector insert worked"&lt;/h2&gt;

&lt;p&gt;For an agent, the risk is rarely that Qdrant has no search API. The real risk is that the agent treats retrieval as a black box and turns a plausible hit into a confident answer.&lt;/p&gt;

&lt;p&gt;Before giving Qdrant to an agent, I would make the agent state five things for every retrieval call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieval intent: fact lookup, similar case, code snippet, or candidate recall;&lt;/li&gt;
&lt;li&gt;collection scope: one collection, several collections, or a routed search;&lt;/li&gt;
&lt;li&gt;payload filters: tenant, permission group, document type, version, and time window;&lt;/li&gt;
&lt;li&gt;score use: top-k, threshold, reranking, and who owns each decision;&lt;/li&gt;
&lt;li&gt;empty-result handling: missing knowledge, bad filter, stale index, or embedding mismatch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these are not explicit, the model can accidentally turn "nearest neighbor" into "verified fact."&lt;/p&gt;

&lt;h2&gt;Treat payload filters as permission boundaries&lt;/h2&gt;

&lt;p&gt;In small RAG demos, filters often look like a convenience. In a real system, they often act as access control.&lt;/p&gt;

&lt;p&gt;If your payload includes fields like &lt;code&gt;tenant_id&lt;/code&gt;, &lt;code&gt;acl_group&lt;/code&gt;, &lt;code&gt;source_type&lt;/code&gt;, or &lt;code&gt;document_version&lt;/code&gt;, they should not be optional knobs that the agent can drop when it wants more recall. They should be part of the tool contract.&lt;/p&gt;

&lt;p&gt;One plain rule is enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every user-facing retrieval call must include tenant, permission, and version filters.
If those fields are missing, the tool should refuse or return a structured "filter_required" error.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not elegant, but it prevents a familiar failure: the agent widens the query because it is trying to be helpful.&lt;/p&gt;
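&lt;p&gt;As a sketch, that rule can be enforced in a thin wrapper around whatever search function the tool uses. The wrapper below is plain Python: &lt;code&gt;guarded_search&lt;/code&gt; and &lt;code&gt;search_fn&lt;/code&gt; are hypothetical names, and &lt;code&gt;search_fn&lt;/code&gt; stands in for your own call into the Qdrant client, which is not shown here.&lt;/p&gt;

```python
# Payload fields that every user-facing retrieval call must filter on.
REQUIRED_FILTERS = ("tenant_id", "acl_group", "document_version")


def guarded_search(query: str, filters: dict, search_fn) -> dict:
    """Refuse to search unless all required filters are present and non-empty."""
    missing = [name for name in REQUIRED_FILTERS if not filters.get(name)]
    if missing:
        # Structured error: the agent can report it, but cannot widen the query.
        return {"status": "error", "code": "filter_required", "missing": missing}
    return {"status": "ok", "hits": search_fn(query, filters)}


searched = []


def fake_search(query: str, filters: dict) -> list:
    """Stand-in for the real search call; records that it was reached."""
    searched.append(query)
    return [{"id": 1, "score": 0.9}]


refused = guarded_search("refund policy", {"tenant_id": "t1"}, fake_search)
print(refused)
```

&lt;p&gt;The important property is that the refusal happens before the search runs, so a missing filter can never turn into a broader result set.&lt;/p&gt;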

&lt;h2&gt;Vector score is not answer confidence&lt;/h2&gt;

&lt;p&gt;A Qdrant score says something about similarity under a particular representation. It does not prove the source is current, authorized, complete, or correct.&lt;/p&gt;

&lt;p&gt;I prefer to make the agent keep two layers separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieval evidence: point id, source id, payload, collection, filter, score;&lt;/li&gt;
&lt;li&gt;answer evidence: which retrieved items were used, which were rejected, and why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation makes the output slightly more verbose, but it catches a common bug: a high-scoring stale chunk becomes the basis for a very confident answer.&lt;/p&gt;
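&lt;p&gt;The two layers can be kept apart with two small record types. This is an illustrative sketch, not Qdrant's response format: the class names, the &lt;code&gt;document_version&lt;/code&gt; payload key, and the rejection reason are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetrievalEvidence:
    """What the search returned, recorded as-is."""
    point_id: str
    source_id: str
    collection: str
    query_filter: dict
    score: float
    payload: dict


@dataclass
class AnswerEvidence:
    """What the answer actually relied on, and what was set aside."""
    used: list = field(default_factory=list)
    rejected: dict = field(default_factory=dict)  # point_id -> reason


def select_evidence(hits: list, current_version: str) -> AnswerEvidence:
    """Reject stale hits regardless of score; keep the rest as answer evidence."""
    answer = AnswerEvidence()
    for hit in hits:
        if hit.payload.get("document_version") != current_version:
            answer.rejected[hit.point_id] = "stale version"
        else:
            answer.used.append(hit.point_id)
    return answer


flt = {"tenant_id": "t1"}
hits = [
    RetrievalEvidence("p1", "doc-a", "kb", flt, 0.93, {"document_version": "v2"}),
    RetrievalEvidence("p2", "doc-b", "kb", flt, 0.71, {"document_version": "v3"}),
]
answer = select_evidence(hits, current_version="v3")
print(answer.used, answer.rejected)
```

&lt;p&gt;In this example the highest-scoring hit is the stale one, and the record shows both that it was retrieved and why it was not used.&lt;/p&gt;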

&lt;h2&gt;Quantization and multivectors need a test set&lt;/h2&gt;

&lt;p&gt;Qdrant exposes serious retrieval machinery: quantization, sparse vectors, dense vectors, multivectors, and payload indexes. Those features are not automatic upgrades. They change memory use, latency, recall, and debugging behavior.&lt;/p&gt;

&lt;p&gt;Before asking an agent to optimize the retrieval setup, I would build a tiny acceptance set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;five questions that should retrieve the right source;&lt;/li&gt;
&lt;li&gt;three questions that are likely to retrieve the wrong neighbor;&lt;/li&gt;
&lt;li&gt;two permission or version-boundary questions;&lt;/li&gt;
&lt;li&gt;one empty-result question.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run this before changing quantization, hybrid search, reranking, or collection structure. Otherwise you may improve a benchmark while making your actual agent less reliable.&lt;/p&gt;
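&lt;p&gt;A tiny harness for such an acceptance set could look like this. It is a sketch under assumptions: &lt;code&gt;search_fn&lt;/code&gt; is a hypothetical callable that returns ranked source ids for a question, and an expected source of &lt;code&gt;None&lt;/code&gt; marks a question that should return nothing.&lt;/p&gt;

```python
def run_acceptance(cases: list, search_fn) -> list:
    """Return the questions whose top hit differs from the expected source."""
    failures = []
    for case in cases:
        hits = search_fn(case["question"])
        top_source = hits[0] if hits else None
        if top_source != case["expected_source"]:
            failures.append(case["question"])
    return failures


# Toy index standing in for a real collection (illustrative data only).
toy_index = {
    "refund window": ["policy.md"],
    "shipping cost": ["pricing.md"],
    "moon base address": [],
}
cases = [
    {"question": "refund window", "expected_source": "policy.md"},
    {"question": "shipping cost", "expected_source": "shipping.md"},
    {"question": "moon base address", "expected_source": None},
]
failures = run_acceptance(cases, lambda q: toy_index.get(q, []))
print(failures)
```

&lt;p&gt;Running the same cases before and after a configuration change gives you a list of questions that regressed, which is easier to act on than a single aggregate score.&lt;/p&gt;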

&lt;h2&gt;A minimal retrieval contract&lt;/h2&gt;

&lt;p&gt;The contract I would give an AI host is short:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You may use Qdrant for candidate retrieval, but every call must state:
1. retrieval intent;
2. collection and payload filter;
3. top-k, score threshold, and rerank setting;
4. which returned fields may be used in the final answer;
5. how to handle empty, low-score, or missing-permission results.

Do not treat vector similarity as factual confidence.
Do not broaden a query when tenant, permission, or version filters are missing.
Do not claim Qdrant has been installed or validated locally unless a separate run log proves it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That contract turns retrieval into a reviewable step instead of an invisible tool call.&lt;/p&gt;

&lt;h2&gt;A safer first run&lt;/h2&gt;

&lt;p&gt;If I were starting today, I would not begin with a large import. I would create a small collection with 20 to 50 source-controlled samples and complete payload fields. Then I would expose only a read-only search tool and require the agent to output the retrieval plan, raw evidence, and answer citation.&lt;/p&gt;

&lt;p&gt;After that loop is stable, I would add writes, bulk import, quantization, multivectors, and more aggressive routing.&lt;/p&gt;

&lt;p&gt;Qdrant can be a strong retrieval layer. The AI host still needs a contract for scope, permissions, and failure interpretation.&lt;/p&gt;

&lt;h2&gt;Reference roles&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream project: Qdrant repository, the source of code, API, and release facts, &lt;a href="https://github.com/qdrant/qdrant" rel="noopener noreferrer"&gt;https://github.com/qdrant/qdrant&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Capability asset: Doramagic Qdrant project page, an independent capability resource pack for AI hosts, &lt;a href="https://doramagic.ai/en/projects/qdrant/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/qdrant/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Guide and validation notes: Doramagic Qdrant manual, useful before loading the pack into an agent workflow, &lt;a href="https://doramagic.ai/en/projects/qdrant/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/qdrant/manual/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
    </item>
    <item>
      <title>Before an Agent Runs Code in E2B, Define the Sandbox Contract First</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Wed, 24 Jun 2026 02:26:14 +0000</pubDate>
      <link>https://dev.to/doramagic/before-an-agent-runs-code-in-e2b-define-the-sandbox-contract-first-3hng</link>
      <guid>https://dev.to/doramagic/before-an-agent-runs-code-in-e2b-define-the-sandbox-contract-first-3hng</guid>
      <description>&lt;p&gt;E2B is easy to describe too quickly: give an AI agent a secure sandbox so it can run code.&lt;/p&gt;

&lt;p&gt;That description is useful, but it skips the part I would care about before connecting it to Claude, Codex, Cursor, or a custom agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the agent allowed to touch inside the sandbox, and what must happen when the run fails?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This note is based on the independent Doramagic E2B manual:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doramagic.ai/en/projects/e2b/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/e2b/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doramagic.ai/en/projects/e2b/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/e2b/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not an official E2B document. I am using it as a pre-adoption checklist. One important caveat from the local Doramagic pack: the test log says that real-host dogfooding has not been run and that no runtime install evidence has been collected. So this is not a production-readiness claim. It is a boundary note for the first safe evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Treat E2B as a sandbox contract, not a permission switch
&lt;/h2&gt;

&lt;p&gt;The useful promise is clear: E2B gives developers isolated cloud sandboxes where AI agents and apps can execute code, run commands, process data, and manage files.&lt;/p&gt;

&lt;p&gt;That is exactly why the first question should not be "can it run code?"&lt;/p&gt;

&lt;p&gt;The first question should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which commands are allowed;&lt;/li&gt;
&lt;li&gt;which paths can be read and written;&lt;/li&gt;
&lt;li&gt;whether network access is allowed;&lt;/li&gt;
&lt;li&gt;whether dependencies may be installed;&lt;/li&gt;
&lt;li&gt;whether credentials are allowed at all;&lt;/li&gt;
&lt;li&gt;how stdout, stderr, and non-zero exit codes are handled;&lt;/li&gt;
&lt;li&gt;how artifacts are exported;&lt;/li&gt;
&lt;li&gt;how the sandbox is torn down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that contract, the upstream tool may be fine while the agent workflow is still unsafe.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The first run should be disposable and boring
&lt;/h2&gt;

&lt;p&gt;The official first install entry recorded in the Doramagic pack is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i e2b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would not start by attaching that to a real project. My first E2B check would be intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create one disposable sandbox.
Use no secrets.
Use a tiny input.
Allow one command.
Write only under /tmp/e2b-first-run/.
Set a fixed timeout.
Export one small artifact.
Destroy the sandbox.
Record stdout, stderr, exit code, and cleanup status.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may sound slow, but it answers the operational question that matters: can the agent complete one reversible run without widening the boundary?&lt;/p&gt;

&lt;p&gt;If that first fixture is unstable, adding templates, network calls, persistent volumes, or MCP servers will only make debugging harder.&lt;/p&gt;
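
&lt;p&gt;The shape of that first fixture can be rehearsed locally before any sandbox is involved. The sketch below is a stand-in that uses only the Python standard library. It does not use the E2B SDK, and a temporary directory provides no isolation. The function name &lt;code&gt;first_run&lt;/code&gt; and the record fields are hypothetical.&lt;/p&gt;

```python
import os
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def first_run(timeout_s: float = 10.0) -> dict:
    """Run one harmless command in a throwaway directory and record the outcome.

    Local stand-in for the checklist, not the E2B SDK: a temp directory
    is not a sandbox and provides no isolation.
    """
    workdir = Path(tempfile.mkdtemp(prefix="first-run-"))
    record = {"stdout": "", "stderr": "", "exit_code": None,
              "timed_out": False, "artifact": None, "cleaned_up": False}
    # Pass through only what the interpreter needs, so no secrets leak in.
    env = {k: v for k, v in os.environ.items() if k in ("PATH", "SYSTEMROOT")}
    script = "open('out.txt', 'w').write('hello'); print('done')"
    try:
        proc = subprocess.run([sys.executable, "-c", script], cwd=workdir,
                              env=env, capture_output=True, text=True,
                              timeout=timeout_s)
        record["stdout"] = proc.stdout
        record["stderr"] = proc.stderr
        record["exit_code"] = proc.returncode
        artifact = workdir / "out.txt"
        if artifact.exists():
            # Export: copy the content out before the directory is destroyed.
            record["artifact"] = artifact.read_text()
    except subprocess.TimeoutExpired:
        # A timeout is a failed run, not a reason to retry with a wider scope.
        record["timed_out"] = True
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
        record["cleaned_up"] = not workdir.exists()
    return record
```

&lt;p&gt;The record has the same fields the checklist asks for, so the real sandbox run can be compared against it line by line.&lt;/p&gt;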

&lt;h2&gt;
  
  
  3. Command execution needs a failure contract
&lt;/h2&gt;

&lt;p&gt;The E2B manual describes command execution as returning stdout, stderr, exit code, and optional error information. Non-zero exits can become exceptions.&lt;/p&gt;

&lt;p&gt;That means the host agent needs rules for failure, not just success.&lt;/p&gt;

&lt;p&gt;A practical first contract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;non-zero exit code stops the workflow;&lt;/li&gt;
&lt;li&gt;stderr is summarized before being fed back to the model;&lt;/li&gt;
&lt;li&gt;generated files are listed before export;&lt;/li&gt;
&lt;li&gt;no command may read outside the allowed working directory;&lt;/li&gt;
&lt;li&gt;no environment variable is printed unless explicitly approved;&lt;/li&gt;
&lt;li&gt;timeout is treated as a failed run, not as a reason to retry with a larger scope.&lt;/li&gt;
&lt;/ul&gt;
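
&lt;p&gt;The first, second, and last rules can be written as a small decision function. This is a sketch under stated assumptions: &lt;code&gt;decide&lt;/code&gt; and &lt;code&gt;summarize_stderr&lt;/code&gt; are hypothetical helpers, the line and character limits are arbitrary, and the inputs are whatever the sandbox client reports for one command.&lt;/p&gt;

```python
def summarize_stderr(stderr: str, max_lines: int = 5, max_chars: int = 400) -> str:
    """Keep only the last few lines of stderr, bounded in size, before the model sees it."""
    tail = stderr.strip().splitlines()[-max_lines:]
    return "\n".join(tail)[-max_chars:]


def decide(exit_code, timed_out: bool, stderr: str) -> dict:
    """Map one command result to a workflow decision under the failure contract."""
    if timed_out:
        # Timeout is a failed run; the caller must not retry with a larger scope.
        return {"action": "stop", "reason": "timeout",
                "stderr": summarize_stderr(stderr)}
    if exit_code != 0:
        return {"action": "stop", "reason": "exit code " + str(exit_code),
                "stderr": summarize_stderr(stderr)}
    return {"action": "continue", "reason": "ok", "stderr": ""}
```

&lt;p&gt;Keeping the decision outside the model means a failed command cannot be reinterpreted as a partial success.&lt;/p&gt;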

&lt;p&gt;The most common failure in agent tooling is not a dramatic exploit. It is a quiet expansion of scope: one more command, one more path, one more network call, one more unreviewed artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Do not enable templates, network, volumes, and MCP at the same time
&lt;/h2&gt;

&lt;p&gt;E2B has more surface area than "run a command." The manual covers SDKs, templates, filesystem operations, network behavior, ready commands, persistent volumes, and MCP server integration.&lt;/p&gt;

&lt;p&gt;Those are useful features. They should not all be part of the first test.&lt;/p&gt;

&lt;p&gt;My order would be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;start and destroy a default sandbox;&lt;/li&gt;
&lt;li&gt;run one harmless command;&lt;/li&gt;
&lt;li&gt;write one small file under a controlled path;&lt;/li&gt;
&lt;li&gt;export one artifact;&lt;/li&gt;
&lt;li&gt;add a template only after the basic path works;&lt;/li&gt;
&lt;li&gt;add network only when the task requires it;&lt;/li&gt;
&lt;li&gt;add persistent volume or MCP only after cleanup and access boundaries are clear.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps failures legible. If a first run already combines a template build, network access, file upload, a ready command, and a long-running process, any failure will be hard to attribute to a single cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Use the pitfall log as a test plan
&lt;/h2&gt;

&lt;p&gt;The Doramagic pitfall log lists source-linked issues such as auto-paused processes, closed port errors, template creation failures, build polling timeouts, API key authorization problems, file/template confusion, and paused sandbox persistence questions.&lt;/p&gt;

&lt;p&gt;I would not present those as guaranteed current bugs. Some may have changed by version.&lt;/p&gt;

&lt;p&gt;I would turn them into checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;after pause/resume, do process and file states match the expected contract?&lt;/li&gt;
&lt;li&gt;when a port is unavailable, does the run fail with a bounded timeout?&lt;/li&gt;
&lt;li&gt;when template build fails, does the agent stop instead of guessing a workaround?&lt;/li&gt;
&lt;li&gt;when an API key is wrong, is the key kept out of logs?&lt;/li&gt;
&lt;li&gt;after export, is the sandbox actually destroyed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the value of the manual: it gives you a first-run checklist, which is more useful than a feature list.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. A safe AI-host handoff
&lt;/h2&gt;

&lt;p&gt;If I were handing E2B context to an AI coding host, I would not paste only the project link. I would add an instruction like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You may consider E2B as a candidate sandbox runtime.
Do not install anything yet.
First return a go/no-go review:
- whether this task actually needs code execution;
- whether it needs network;
- whether it needs filesystem access;
- whether credentials are involved;
- the smallest reversible verification fixture;
- timeout, cleanup, and artifact export plan;
- missing evidence that must stop the run.
Do not run install commands, read private local files, or use real API keys unless the command and rollback plan are approved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That wording is not bureaucracy. It keeps the model from turning "sandbox exists" into "everything inside the sandbox is safe."&lt;/p&gt;

&lt;h2&gt;
  
  
  7. My working conclusion
&lt;/h2&gt;

&lt;p&gt;E2B is interesting for agent workflows because it can give code execution a controlled runtime.&lt;/p&gt;

&lt;p&gt;But the adoption question is not "can my agent run code now?"&lt;/p&gt;

&lt;p&gt;The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can my agent run one tiny task, with no secrets, a known command boundary, fixed timeout, inspectable output, and a cleanup path?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I would not move to real workloads until that answer is yes.&lt;/p&gt;

</description>
      <category>sandbox</category>
    </item>
    <item>
      <title>Do not treat LangGraph as a longer chain: define state, interrupts, and recovery first</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Tue, 23 Jun 2026 06:55:06 +0000</pubDate>
      <link>https://dev.to/doramagic/do-not-treat-langgraph-as-a-longer-chain-define-state-interrupts-and-recovery-first-4n3n</link>
      <guid>https://dev.to/doramagic/do-not-treat-langgraph-as-a-longer-chain-define-state-interrupts-and-recovery-first-4n3n</guid>
      <description>&lt;p&gt;The easiest way to misunderstand LangGraph is to see it as “LangChain, but with more steps.”&lt;/p&gt;

&lt;p&gt;That misses the point.&lt;/p&gt;

&lt;p&gt;LangGraph becomes useful when an agent is no longer a single prompt or a simple chain. It becomes useful when the workflow has state, branches, tool calls, human approval, checkpointing, and recovery behavior that must be inspected before the agent is trusted inside a real AI host.&lt;/p&gt;

&lt;p&gt;I used the Doramagic LangGraph manual as the source-backed reading layer for this note:&lt;br&gt;
&lt;a href="https://doramagic.ai/en/projects/langgraph/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/langgraph/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an independent project guide, not an official LangGraph document. I use it as a pre-adoption checklist: what should be understood before wiring a project into Claude, ChatGPT, Cursor, Codex, or another AI host.&lt;/p&gt;

&lt;p&gt;The point is not to create another prompt library. The useful artifact is a capability resource pack: a manual, source map, boundary notes, pitfall log, smoke check, lightweight eval criteria, feedback notes, and host-ready context that help a developer decide what to verify before adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The real boundary is State, not the prompt
&lt;/h2&gt;

&lt;p&gt;For a one-shot model call, the prompt is often the main boundary.&lt;/p&gt;

&lt;p&gt;For LangGraph, the first boundary is the State schema:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which fields move between nodes;&lt;/li&gt;
&lt;li&gt;which fields a node may update;&lt;/li&gt;
&lt;li&gt;how concurrent branches merge values;&lt;/li&gt;
&lt;li&gt;which values enter a checkpoint;&lt;/li&gt;
&lt;li&gt;which values should never be persisted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why reducers matter. A message list should usually not be overwritten on each update. It needs an append or merge rule such as &lt;code&gt;add_messages&lt;/code&gt; or the TypeScript equivalent. That small implementation detail decides whether parallel work preserves context or silently drops it.&lt;/p&gt;

&lt;p&gt;My preferred first run is not a “universal agent.” It is a tiny graph with one State schema, one node, one partial update, and one explicit reducer. If that is not clear, adding tools will only hide the problem.&lt;/p&gt;
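
&lt;p&gt;The difference a reducer makes can be shown without the LangGraph runtime. The sketch below is plain Python, and &lt;code&gt;apply_updates&lt;/code&gt; is a hypothetical helper. In LangGraph itself the reducer is declared on the State field, for example &lt;code&gt;Annotated[list, operator.add]&lt;/code&gt;. The sketch uses last-write-wins for keys without a reducer to make the loss visible, and the real runtime may handle that case differently.&lt;/p&gt;

```python
import operator


def apply_updates(state: dict, updates: list, reducers: dict) -> dict:
    """Merge partial updates into state, using a reducer per key when one is declared."""
    merged = dict(state)
    for update in updates:
        for key, value in update.items():
            if key in reducers:
                merged[key] = reducers[key](merged[key], value)
            else:
                merged[key] = value  # no reducer: last write wins in this sketch
    return merged


state = {"messages": ["user: hi"], "status": "new"}
branch_a = {"messages": ["tool_a: 3 results"]}
branch_b = {"messages": ["tool_b: 1 result"]}

# Without a reducer, only the last branch survives.
overwritten = apply_updates(state, [branch_a, branch_b], reducers={})
# With an append reducer, both branches and the original message are kept.
appended = apply_updates(state, [branch_a, branch_b],
                         reducers={"messages": operator.add})
```

&lt;p&gt;Comparing &lt;code&gt;overwritten&lt;/code&gt; with &lt;code&gt;appended&lt;/code&gt; is a quick way to see which fields in a State schema need an explicit merge rule.&lt;/p&gt;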

&lt;h2&gt;
  
  
  2. &lt;code&gt;compile()&lt;/code&gt; is the boundary between description and runtime
&lt;/h2&gt;

&lt;p&gt;Before &lt;code&gt;compile()&lt;/code&gt;, a LangGraph graph is a description: nodes, edges, conditional routes, state fields, reducers.&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;compile()&lt;/code&gt;, the Pregel-style runtime takes over. Nodes execute in supersteps. Partial state updates go through channels and reducers. Conditional edges choose the next node.&lt;/p&gt;

&lt;p&gt;That changes how debugging should work. When a graph fails, do not inspect only the node function. Inspect four contracts at the same time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the State schema allow this node to write the returned key?&lt;/li&gt;
&lt;li&gt;Does the node return at least one valid State field?&lt;/li&gt;
&lt;li&gt;Does the reducer match the intended merge behavior?&lt;/li&gt;
&lt;li&gt;Does the conditional edge have a real exit path?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Doramagic manual highlights errors such as &lt;code&gt;InvalidUpdateError: Must write to at least one of [...]&lt;/code&gt;. That is not just a random runtime annoyance. It often means the node return shape and State schema do not agree.&lt;/p&gt;
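
&lt;p&gt;The first two contracts can be tested before the graph is ever compiled. The sketch below is plain Python: &lt;code&gt;check_update&lt;/code&gt; and &lt;code&gt;UpdateError&lt;/code&gt; are hypothetical stand-ins, not LangGraph classes, and the message wording only imitates the error quoted above.&lt;/p&gt;

```python
class UpdateError(ValueError):
    """Stand-in for a runtime update error; not LangGraph's own exception class."""


def check_update(node_name: str, declared: set, update: dict) -> dict:
    """Reject a node return value that does not match the declared State keys."""
    if not isinstance(update, dict) or not update:
        raise UpdateError(
            node_name + " must write to at least one of " + str(sorted(declared)))
    unknown = sorted(set(update) - declared)
    if unknown:
        # A misspelled key is the usual cause of a schema mismatch.
        raise UpdateError(node_name + " wrote undeclared keys: " + str(unknown))
    return update
```

&lt;p&gt;Calling this in a unit test for each node function catches a mismatched return shape long before a full graph run does.&lt;/p&gt;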

&lt;h2&gt;
  
  
  3. Human-in-the-loop is an execution contract
&lt;/h2&gt;

&lt;p&gt;LangGraph’s interrupt and human-in-the-loop support should not be treated as interface decoration.&lt;/p&gt;

&lt;p&gt;If an agent can send email, edit files, call external APIs, or touch production systems, the approval step should be part of the graph contract, not a person watching the screen and hoping to catch mistakes.&lt;/p&gt;

&lt;p&gt;A practical pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the model proposes a tool call;&lt;/li&gt;
&lt;li&gt;a graph node raises an interrupt;&lt;/li&gt;
&lt;li&gt;the human approves, ignores, or edits one concrete action;&lt;/li&gt;
&lt;li&gt;the structured response returns to the graph;&lt;/li&gt;
&lt;li&gt;execution resumes from the interrupt point instead of restarting the whole workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why LangGraph fits recoverable workflow agents better than simple chat demos.&lt;/p&gt;
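
&lt;p&gt;The approve, ignore, or edit step is easier to review when the human response is structured data. The sketch below covers only that step, in plain Python. &lt;code&gt;ProposedAction&lt;/code&gt;, &lt;code&gt;resolve&lt;/code&gt;, and the response shape are assumptions for illustration and do not reproduce LangGraph's interrupt API.&lt;/p&gt;

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    args: dict


def resolve(proposal: ProposedAction, response: dict):
    """Turn a structured human response into the action to execute, or None to skip."""
    kind = response.get("type")
    if kind == "approve":
        return proposal
    if kind == "edit":
        # The human may change arguments of this one action, never the tool itself.
        return replace(proposal, args={**proposal.args, **response.get("args", {})})
    if kind == "ignore":
        return None
    # Anything else is rejected instead of being treated as approval.
    raise ValueError("unknown response type: " + repr(kind))
```

&lt;p&gt;Because an unknown response raises instead of defaulting to approval, a malformed reply cannot trigger the side effect.&lt;/p&gt;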

&lt;h2&gt;
  
  
  4. Checkpointing is not logging
&lt;/h2&gt;

&lt;p&gt;Checkpointing is easy to describe as “saving history.” In LangGraph, that description is too weak.&lt;/p&gt;

&lt;p&gt;The manual separates checkpointing, serialization, and stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a checkpointer is scoped to a thread and supports durable execution, replay, and resume;&lt;/li&gt;
&lt;li&gt;a store is cross-thread long-term memory;&lt;/li&gt;
&lt;li&gt;serialization defines how Python objects become stored and reconstructed data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One security detail deserves attention before adoption: the checkpoint serializer can reconstruct Python objects when it deserializes checkpoint data. New applications should consider &lt;code&gt;LANGGRAPH_STRICT_MSGPACK=true&lt;/code&gt; or an explicit &lt;code&gt;allowed_msgpack_modules&lt;/code&gt; list for &lt;code&gt;JsonPlusSerializer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is not an advanced edge case. If checkpoint data may cross a trust boundary, deserialization policy is part of the system boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. A better first-run checklist
&lt;/h2&gt;

&lt;p&gt;For a first LangGraph test, I would keep the path intentionally small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a temporary directory. Do not connect production data.&lt;/li&gt;
&lt;li&gt;Define a minimal State with &lt;code&gt;messages&lt;/code&gt; and one business field.&lt;/li&gt;
&lt;li&gt;Write one node that returns only valid State fields.&lt;/li&gt;
&lt;li&gt;Attach a reducer to any append-like field.&lt;/li&gt;
&lt;li&gt;Add one interrupt around a concrete tool-like action.&lt;/li&gt;
&lt;li&gt;Add a checkpointer.&lt;/li&gt;
&lt;li&gt;Force one node to fail, then verify resume behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If this path fails, do not move on to real tools yet. The problem is not that the agent is weak. The runtime boundary has not been verified.&lt;/p&gt;

&lt;p&gt;The smoke check should have pass/fail criteria, not a vague “seems to work” result. A useful minimum is: the node writes only declared State fields; the interrupt happens before a real side effect; resume behavior does not repeat already successful sibling writes; and serializer limits are explicit. If any of those are unclear, the decision should be HOLD, not “try it in production and see.”&lt;/p&gt;

&lt;h2&gt;
  
  
  6. My practical take
&lt;/h2&gt;

&lt;p&gt;LangGraph is worth studying because it makes the hard parts of agent workflows explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;state;&lt;/li&gt;
&lt;li&gt;branching;&lt;/li&gt;
&lt;li&gt;tool calls;&lt;/li&gt;
&lt;li&gt;human approval;&lt;/li&gt;
&lt;li&gt;checkpointing;&lt;/li&gt;
&lt;li&gt;failure recovery;&lt;/li&gt;
&lt;li&gt;thread-scoped state versus long-term memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your task is a one-off model call, LangGraph may be more machinery than you need.&lt;/p&gt;

&lt;p&gt;If your AI host needs to run multi-step work where each step must be inspectable, pausable, recoverable, and reviewable, LangGraph becomes much more interesting.&lt;/p&gt;

&lt;p&gt;The better question is not “Can LangGraph build an agent?”&lt;/p&gt;

&lt;p&gt;The better question is: “Can my agent’s state, tool boundary, and recovery path be written as a graph?”&lt;/p&gt;

&lt;p&gt;Answer that first. Then decide whether LangGraph belongs in the stack.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Do not add LLM evals after launch: use promptfoo to define the failure boundary first</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Mon, 22 Jun 2026 02:57:11 +0000</pubDate>
      <link>https://dev.to/doramagic/do-not-add-llm-evals-after-launch-use-promptfoo-to-define-the-failure-boundary-first-1c2d</link>
      <guid>https://dev.to/doramagic/do-not-add-llm-evals-after-launch-use-promptfoo-to-define-the-failure-boundary-first-1c2d</guid>
      <description>&lt;p&gt;A common LLM workflow mistake is to tune the prompt first and add evaluation later. That order feels fast, but it leaves the team with a weak release boundary. When the workflow fails in production, nobody can say exactly which behavior should have blocked the release.&lt;/p&gt;

&lt;p&gt;The category claim matters: Doramagic is not a prompt library and not a README summary. It turns an open-source project into a portable AI agent capability asset: source map, host instructions, prompt preview, pitfall log, eval or smoke check, boundary card, human manual, test log, and feedback path.&lt;/p&gt;

&lt;p&gt;The useful way to read promptfoo is not as a leaderboard tool. It is a contract layer for LLM workflows: before changing prompts, switching models, adding tools, or giving an agent more authority, define what must pass, what must fail, and what evidence is enough.&lt;/p&gt;

&lt;p&gt;The Doramagic promptfoo manual highlights four surfaces that matter in practice.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. An eval is not a score. It is a reproducible judgment set.
&lt;/h4&gt;

&lt;p&gt;Promptfoo's core engine ties together configuration, provider calls, assertion grading, and result aggregation. YAML or JSON config can place prompts, test cases, providers, and assertions in one run.&lt;/p&gt;

&lt;p&gt;That changes the question from "did the model sound good?" to a more useful checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which inputs must pass every time?&lt;/li&gt;
&lt;li&gt;Which outputs must be rejected?&lt;/li&gt;
&lt;li&gt;Which tool-call structures must be valid?&lt;/li&gt;
&lt;li&gt;Which checks should be deterministic, and which deserve an LLM-as-judge rubric?&lt;/li&gt;
&lt;li&gt;After a prompt, provider, or model change, which cases must run again?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those answers are not in the eval config, the final score is mostly decoration.&lt;/p&gt;
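
&lt;p&gt;To show what a judgment set means in practice, here is a minimal sketch in plain Python. It is not promptfoo's config format, and the cases, the &lt;code&gt;sk-&lt;/code&gt; key pattern, and the helper names are hypothetical. promptfoo expresses the same idea through its own prompts, tests, and assertions.&lt;/p&gt;

```python
import json

# Hypothetical cases: each one declares up front whether it must pass or fail.
CASES = [
    {"name": "valid tool call",
     "output": '{"tool": "search", "args": {"q": "refund"}}', "must": "pass"},
    {"name": "leaks api key", "output": "Your key is sk-12345", "must": "fail"},
    {"name": "malformed tool call", "output": '{"tool": "search", ', "must": "fail"},
]


def is_acceptable(output: str) -> bool:
    """Deterministic check: a JSON tool call that contains no key-like string."""
    if "sk-" in output:
        return False
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and "tool" in call and isinstance(call.get("args"), dict)


def run_cases(cases: list) -> list:
    """Return names of cases whose verdict contradicts the declared expectation."""
    mismatches = []
    for case in cases:
        expected_ok = case["must"] == "pass"
        if is_acceptable(case["output"]) != expected_ok:
            mismatches.append(case["name"])
    return mismatches
```

&lt;p&gt;The release gate is then the list returned by &lt;code&gt;run_cases&lt;/code&gt;: any entry blocks the change, regardless of an aggregate score.&lt;/p&gt;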

&lt;h4&gt;
  
  
  2. Agents can run evals, but the scope must be explicit.
&lt;/h4&gt;

&lt;p&gt;One important detail in the manual is promptfoo's MCP tool surface. Tools such as &lt;code&gt;list_evaluations&lt;/code&gt;, &lt;code&gt;get_evaluation_details&lt;/code&gt;, &lt;code&gt;run_evaluation&lt;/code&gt;, and &lt;code&gt;share_evaluation&lt;/code&gt; allow AI agents to drive evaluations programmatically.&lt;/p&gt;

&lt;p&gt;That is powerful, but it needs a boundary. &lt;code&gt;run_evaluation&lt;/code&gt; accepts a config path, optional test-case filters, prompt and provider filters, concurrency settings, timeout limits, and pagination controls. An agent should not treat this as a casual "run everything" button.&lt;/p&gt;

&lt;p&gt;Before running an eval, the agent should state which config it will use, which cases it will run, how much concurrency it needs, what timeout applies, and why this run is the right evidence for the current change.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Many providers do not mean automatic portability.
&lt;/h4&gt;

&lt;p&gt;Promptfoo supports a broad provider ecosystem: OpenAI, Anthropic, Google Vertex, xAI, Bedrock, Cerebras, agent SDKs, MCP tools, and custom gateways. That breadth is useful because one test set can compare different execution surfaces.&lt;/p&gt;

&lt;p&gt;But every provider has its own behavior around structured outputs, tool calls, timeouts, caching, permissions, and cost. A test that passes with one provider should not be treated as proof that another provider is safe.&lt;/p&gt;

&lt;p&gt;For a first run, I would not start with a giant model comparison. I would fix one provider and one real workflow, run a small smoke eval, then add the second provider only after the assertions are stable.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Red teaming should become a negative-case release contract.
&lt;/h4&gt;

&lt;p&gt;Promptfoo's redteam layer reuses the evaluation engine and adds adversarial providers, target invocation, judging, and iterative feedback. The manual notes presets around prompt injection, harmful content, PII, and related risks.&lt;/p&gt;

&lt;p&gt;The mistake is treating red teaming as a one-off report. The useful version is a release gate: which prompt-injection attempts must fail, which PII outputs must be blocked, and which tool permissions must be denied before the workflow ships.&lt;/p&gt;

&lt;p&gt;If the redteam output does not feed CI, a release checklist, or a human review step, it is only a demo.&lt;/p&gt;

&lt;h4&gt;
  
  
  A safer first-run path
&lt;/h4&gt;

&lt;p&gt;For an AI coding agent, I would use promptfoo like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the Doramagic manual and identify the runtime boundary.&lt;/li&gt;
&lt;li&gt;Create a minimal promptfoo config in a temporary directory.&lt;/li&gt;
&lt;li&gt;Use one provider and one real workflow first.&lt;/li&gt;
&lt;li&gt;Run 3 to 5 high-value test cases before expanding the suite.&lt;/li&gt;
&lt;li&gt;Start with deterministic assertions, then add LLM rubrics only where judgment is unavoidable.&lt;/li&gt;
&lt;li&gt;Report passing examples, failing examples, cost, latency, and what changed.&lt;/li&gt;
&lt;li&gt;Do not claim the workflow is ready until failed cases are reviewed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Doramagic promptfoo pack does not replace the upstream docs. Its job is to make promptfoo loadable by an AI coding host as an operating contract: read the manual, check the pitfalls, run the evals, and only then recommend a prompt or model change.&lt;/p&gt;

&lt;p&gt;The feedback loop is part of the asset. If the first run exposes a new failure case, it should update the pitfall log, eval suite, boundary card, or human manual. Otherwise the article is just content; the capability asset is what lets the next agent start from better evidence.&lt;/p&gt;

&lt;p&gt;Doramagic manual: &lt;a href="https://doramagic.ai/en/projects/promptfoo/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/promptfoo/manual/&lt;/a&gt;&lt;br&gt;
Doramagic project page: &lt;a href="https://doramagic.ai/en/projects/promptfoo/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/promptfoo/&lt;/a&gt;&lt;br&gt;
GitHub pack: &lt;a href="https://github.com/tangweigang-jpg/doramagic-promptfoo-pack" rel="noopener noreferrer"&gt;https://github.com/tangweigang-jpg/doramagic-promptfoo-pack&lt;/a&gt;&lt;br&gt;
Upstream project: &lt;a href="https://github.com/promptfoo/promptfoo" rel="noopener noreferrer"&gt;https://github.com/promptfoo/promptfoo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclosure: this is an independent Doramagic project asset, not an official upstream release.&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
    <item>
      <title>Before You Let an Agent Convert Documents, Narrow the MarkItDown Boundary</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Mon, 22 Jun 2026 00:05:11 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-let-an-agent-convert-documents-narrow-the-markitdown-boundary-4ne0</link>
      <guid>https://dev.to/doramagic/before-you-let-an-agent-convert-documents-narrow-the-markitdown-boundary-4ne0</guid>
      <description>&lt;p&gt;MarkItDown looks like a simple utility at first glance: point it at a PDF, Word file, spreadsheet, image, HTML page, archive, audio file, or URL, and get Markdown back.&lt;/p&gt;

&lt;p&gt;For an AI agent, that simplicity is exactly where I would slow down.&lt;/p&gt;

&lt;p&gt;Once a coding agent or MCP host can call a document converter, the real question is not "can it convert files?" The real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the smallest input surface the agent may touch, and what evidence proves the Markdown output is good enough for the next step?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This note is based on the independent Doramagic MarkItDown manual:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doramagic.ai/en/projects/markitdown/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/markitdown/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not an official Microsoft or MarkItDown document.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use it for LLM ingestion, not layout reconstruction
&lt;/h2&gt;

&lt;p&gt;MarkItDown is useful because Markdown is close to plain text while still preserving headings, lists, links, and some structural cues. That makes it practical for retrieval, summarization, indexing, and agent workflows.&lt;/p&gt;

&lt;p&gt;But I would not treat it as a high-fidelity document renderer.&lt;/p&gt;

&lt;p&gt;The first boundary check is plain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the task needs a human-perfect copy of a PDF layout, hold.&lt;/li&gt;
&lt;li&gt;If the task needs searchable text with inspectable structure, continue.&lt;/li&gt;
&lt;li&gt;If the task involves untrusted uploads or remote URLs, define the allowed paths and schemes before any agent runs it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction prevents a bad handoff where the agent reasons over Markdown that was never fit for the downstream claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with one format, not every optional dependency
&lt;/h2&gt;

&lt;p&gt;The convenient install path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"markitdown[all]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is fine for exploration. It is not always the best first production-style test.&lt;/p&gt;

&lt;p&gt;For a first agent workflow, I prefer a narrow install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"markitdown[pdf,docx]"&lt;/span&gt;
markitdown sample.pdf &lt;span class="nt"&gt;-o&lt;/span&gt; sample.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then verify the artifact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Markdown file exists and is non-empty;&lt;/li&gt;
&lt;li&gt;headings and links survived well enough for the task;&lt;/li&gt;
&lt;li&gt;tables are manually sampled instead of trusted blindly;&lt;/li&gt;
&lt;li&gt;the command, package version, input path, and output path are recorded;&lt;/li&gt;
&lt;li&gt;the agent is instructed to read the Markdown artifact, not go back and inspect arbitrary source files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That small test catches the most common mistake: treating "installation works" as if it meant "the document pipeline is safe."&lt;/p&gt;
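
&lt;p&gt;Part of that verification can be automated so the evidence is recorded, not remembered. The sketch below uses only the Python standard library and does not call MarkItDown. &lt;code&gt;inspect_markdown&lt;/code&gt; is a hypothetical helper, and its counts are rough signals that do not replace manual sampling of tables.&lt;/p&gt;

```python
import re
from pathlib import Path


def inspect_markdown(path: str) -> dict:
    """Collect basic evidence about a converted Markdown file for later review."""
    md = Path(path)
    if not md.is_file():
        return {"exists": False, "non_empty": False,
                "headings": 0, "links": 0, "table_rows": 0}
    text = md.read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    return {
        "exists": True,
        "non_empty": bool(text.strip()),
        # ATX headings such as "# Title" or "## Section".
        "headings": sum(1 for line in lines if re.match(r"#{1,6} \S", line)),
        # Inline links of the form [text](target).
        "links": len(re.findall(r"\[[^\]]+\]\([^)]+\)", text)),
        # Pipe-table rows are only counted; their content still needs sampling.
        "table_rows": sum(1 for line in lines if line.strip().startswith("|")),
    }
```

&lt;p&gt;An agent workflow can stop when &lt;code&gt;non_empty&lt;/code&gt; is false and attach the rest of the dictionary to the run log next to the command and package version.&lt;/p&gt;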

&lt;h2&gt;
  
  
  PDF and OCR are where expectations drift
&lt;/h2&gt;

&lt;p&gt;The Doramagic manual is most useful where it keeps the limits visible.&lt;/p&gt;

&lt;p&gt;PDF conversion is best effort. Complex tables, headers, footers, multi-column layouts, scanned pages, and image-only documents can lose structure. The output may still be useful for search or rough summarization, but it should not become a source of record without sampling.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;markitdown-ocr&lt;/code&gt; plugin extends the system with LLM Vision OCR for embedded images and scanned PDFs. That can be the right tool, but it changes the operating model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR may add model cost per page or image.&lt;/li&gt;
&lt;li&gt;If no &lt;code&gt;llm_client&lt;/code&gt; is provided, the plugin can still load, but OCR is skipped and the standard converter runs instead.&lt;/li&gt;
&lt;li&gt;OCR output should be sampled before an agent uses it for legal, financial, medical, or compliance-sensitive reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first-run decision rule is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GO: controlled DOCX/PDF input, Markdown generated, sample checks pass.&lt;/li&gt;
&lt;li&gt;HOLD: tables or scanned pages are readable but not structurally reliable.&lt;/li&gt;
&lt;li&gt;NO-GO: sensitive scanned documents are pushed into an agent workflow before OCR quality and access boundaries are checked.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Treat the MCP server as a local tool, not a public service
&lt;/h2&gt;

&lt;p&gt;MarkItDown also has an MCP server package. That is useful when an MCP-capable host needs a document-to-Markdown tool.&lt;/p&gt;

&lt;p&gt;The safe default is local and narrow: trusted local agents, localhost binding, a small mounted work directory, and a clear rule for remote URLs.&lt;/p&gt;

&lt;p&gt;A first instruction I would actually use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Allow MarkItDown MCP to convert only files under /workdir/inbox/.
Do not fetch external URLs.
Write Markdown outputs under /workdir/out/.
Record input path, output path, file size, and MarkItDown version.
Stop on scanned PDFs, archives, remote URLs, empty output, or unknown extensions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That turns MarkItDown into a bounded capability. Without that boundary, a document converter can quietly become a broad file and URL access tool.&lt;/p&gt;
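
&lt;p&gt;The path and URL rules in that instruction can also be enforced in code in front of the tool. This is a sketch under stated assumptions: &lt;code&gt;check_input&lt;/code&gt; is a hypothetical guard, the allowed suffixes mirror a narrow &lt;code&gt;[pdf,docx]&lt;/code&gt; install, and it only normalizes POSIX-style path strings. A real guard would also need to resolve symlinks on the host and detect scanned PDFs, which this one does not.&lt;/p&gt;

```python
import posixpath

INBOX = "/workdir/inbox"
ALLOWED_SUFFIXES = (".pdf", ".docx")  # assumption: matches a narrow install


def check_input(raw: str) -> str:
    """Return "ok" or a stop reason for one requested input path."""
    if "://" in raw:
        return "stop: remote URL"
    # normpath collapses ".." segments, so traversal out of the inbox is caught.
    normalized = posixpath.normpath(raw)
    if not normalized.startswith(INBOX + "/"):
        return "stop: outside inbox"
    if not normalized.lower().endswith(ALLOWED_SUFFIXES):
        # Archives and anything unrecognized stop the workflow.
        return "stop: unknown extension"
    return "ok"
```

&lt;p&gt;Running the guard before the MCP call keeps the boundary independent of how the agent phrases its request.&lt;/p&gt;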

&lt;h2&gt;
  
  
  A compact acceptance checklist
&lt;/h2&gt;

&lt;p&gt;Before I let an agent rely on MarkItDown output, I want these checks visible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Acceptance rule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Install scope&lt;/td&gt;
&lt;td&gt;Use only needed extras, or explain why &lt;code&gt;[all]&lt;/code&gt; is required.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input boundary&lt;/td&gt;
&lt;td&gt;Restrict to known directories or approved URLs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output evidence&lt;/td&gt;
&lt;td&gt;Save Markdown and inspect headings, links, lists, and table samples.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF caveat&lt;/td&gt;
&lt;td&gt;Mark complex layouts and scanned pages as lower confidence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCR path&lt;/td&gt;
&lt;td&gt;Record whether &lt;code&gt;markitdown-ocr&lt;/code&gt;, model client, and sampling are enabled.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP exposure&lt;/td&gt;
&lt;td&gt;Keep local by default; do not expose a converter server without auth and network controls.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure handling&lt;/td&gt;
&lt;td&gt;Empty output or unsupported format stops the agent workflow.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The practical value of MarkItDown is not that it magically understands every document. It gives an agent a narrow path from messy files to inspectable text. The narrower that path is on day one, the easier it is to trust the next step.&lt;/p&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doramagic MarkItDown manual: &lt;a href="https://doramagic.ai/en/projects/markitdown/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/markitdown/manual/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Doramagic project page: &lt;a href="https://doramagic.ai/en/projects/markitdown/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/markitdown/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Official repository: &lt;a href="https://github.com/microsoft/markitdown" rel="noopener noreferrer"&gt;https://github.com/microsoft/markitdown&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disclosure: this is a practitioner note based on an independent Doramagic capability manual and public MarkItDown repository material. It is not affiliated with or endorsed by Microsoft unless explicitly stated.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Before You Add Memory to an AI Agent, Decide What the Agent Is Allowed to Remember</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Sun, 21 Jun 2026 11:44:28 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-add-memory-to-an-ai-agent-decide-what-the-agent-is-allowed-to-remember-42pn</link>
      <guid>https://dev.to/doramagic/before-you-add-memory-to-an-ai-agent-decide-what-the-agent-is-allowed-to-remember-42pn</guid>
      <description>&lt;h1&gt;
  
  
  Before You Add Memory to an AI Agent, Decide What the Agent Is Allowed to Remember
&lt;/h1&gt;

&lt;p&gt;Memory is one of those agent features that sounds obvious until it is connected to a real system.&lt;/p&gt;

&lt;p&gt;The naive version is simple: save the conversation, retrieve something similar later, and call it memory.&lt;/p&gt;

&lt;p&gt;The operational version is harder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is short-term conversation context?&lt;/li&gt;
&lt;li&gt;What is long-term user or domain knowledge?&lt;/li&gt;
&lt;li&gt;What is a reasoning trace?&lt;/li&gt;
&lt;li&gt;Who owns a remembered entity?&lt;/li&gt;
&lt;li&gt;Which memory can be reused across sessions?&lt;/li&gt;
&lt;li&gt;Which memory should expire, be corrected, or stay private?&lt;/li&gt;
&lt;li&gt;How do you prove that the agent used the right memory instead of a convenient hallucination?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the useful way to read &lt;code&gt;agent-memory&lt;/code&gt;, the Neo4j Labs project covered in the Doramagic manual.&lt;/p&gt;

&lt;p&gt;It is not just a "vector store for agents". The manual frames it as a graph-native memory layer for AI agents and context graphs, backed by Neo4j, with Python and TypeScript SDK surfaces and a hosted NAMS backend option.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first useful mental model: three memory tiers
&lt;/h2&gt;

&lt;p&gt;The most important part is the separation of memory types.&lt;/p&gt;

&lt;p&gt;The Doramagic manual describes three main tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Short-term memory&lt;/td&gt;
&lt;td&gt;Session or conversation message history&lt;/td&gt;
&lt;td&gt;Keeps the current turn grounded without pretending every message is permanent knowledge.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-term memory&lt;/td&gt;
&lt;td&gt;Entities, preferences, relationships&lt;/td&gt;
&lt;td&gt;Lets the system remember durable facts, but also creates privacy and correction obligations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning memory&lt;/td&gt;
&lt;td&gt;Steps, tool calls, traces, similar traces&lt;/td&gt;
&lt;td&gt;Makes the agent's behavior reviewable instead of turning memory into an invisible black box.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That split is practical because "remember everything" is not an implementation strategy. It is a data governance problem wearing a friendly product name.&lt;/p&gt;

&lt;p&gt;If an AI host is going to use memory, the host should know which tier it is touching.&lt;/p&gt;

&lt;p&gt;A user message might belong in short-term memory.&lt;/p&gt;

&lt;p&gt;A confirmed customer preference might belong in long-term memory.&lt;/p&gt;

&lt;p&gt;A failed tool call and recovery path might belong in reasoning memory.&lt;/p&gt;

&lt;p&gt;Those are different records with different risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The graph part is not decoration
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;agent-memory&lt;/code&gt; uses Neo4j as the backing graph. That matters because agent memory is rarely just a bag of text chunks.&lt;/p&gt;

&lt;p&gt;Useful memory often has structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a person belongs to an organization&lt;/li&gt;
&lt;li&gt;a task was requested in a session&lt;/li&gt;
&lt;li&gt;a tool call touched an entity&lt;/li&gt;
&lt;li&gt;a preference applies to one user but not another&lt;/li&gt;
&lt;li&gt;a reasoning trace created or updated a record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The manual highlights POLE+O entity typing: &lt;code&gt;PERSON&lt;/code&gt;, &lt;code&gt;ORGANIZATION&lt;/code&gt;, &lt;code&gt;LOCATION&lt;/code&gt;, &lt;code&gt;EVENT&lt;/code&gt;, and &lt;code&gt;OBJECT&lt;/code&gt;, plus extension entity types. That gives the memory system a vocabulary for durable knowledge instead of treating every remembered thing as the same kind of note.&lt;/p&gt;

&lt;p&gt;The result is not automatically safe or correct. It is just more inspectable.&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend choice changes the operating boundary
&lt;/h2&gt;

&lt;p&gt;The manual describes two backend paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct Neo4j through Bolt&lt;/li&gt;
&lt;li&gt;hosted NAMS through a REST backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a small deployment detail. It changes the boundary you need to check.&lt;/p&gt;

&lt;p&gt;With a local or self-hosted Neo4j path, you are responsible for database configuration, tenant isolation, backups, and operational access.&lt;/p&gt;

&lt;p&gt;With NAMS, you get a hosted memory service path and ontology surfaces, but now the remote service boundary, workspace ownership, and API configuration matter.&lt;/p&gt;

&lt;p&gt;The practical first question is not "which one is better?"&lt;/p&gt;

&lt;p&gt;The practical first question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where is the memory allowed to live, and who can read it later?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you cannot answer that, do not let an agent write durable memory yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ontology is the part teams will underestimate
&lt;/h2&gt;

&lt;p&gt;The manual calls out a typed, versioned ontology layer in NAMS. This is more important than it sounds.&lt;/p&gt;

&lt;p&gt;Without an ontology boundary, agent memory can quietly drift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the same entity appears under multiple names&lt;/li&gt;
&lt;li&gt;preferences become mixed with facts&lt;/li&gt;
&lt;li&gt;tool results become treated as user intent&lt;/li&gt;
&lt;li&gt;stale knowledge remains in retrieval because nothing marks it as old&lt;/li&gt;
&lt;li&gt;private and shared memory end up in the same retrieval pool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An ontology does not solve those problems by itself, but it gives the team a place to define what is valid.&lt;/p&gt;

&lt;p&gt;For a first run, I would not start by building a complex domain ontology.&lt;/p&gt;

&lt;p&gt;I would start with a deliberately small one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one user&lt;/li&gt;
&lt;li&gt;one session&lt;/li&gt;
&lt;li&gt;two entity types&lt;/li&gt;
&lt;li&gt;one relationship type&lt;/li&gt;
&lt;li&gt;one trace&lt;/li&gt;
&lt;li&gt;one correction case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that cannot be inspected and corrected, scaling the memory system will only make the failure harder to see.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first safe verification run
&lt;/h2&gt;

&lt;p&gt;Before wiring &lt;code&gt;agent-memory&lt;/code&gt; into a serious AI workflow, I would run a small sandbox test.&lt;/p&gt;

&lt;p&gt;The test should not require production credentials or real user data.&lt;/p&gt;

&lt;p&gt;A good first run looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a temporary test user and session.&lt;/li&gt;
&lt;li&gt;Add a small conversation message to short-term memory.&lt;/li&gt;
&lt;li&gt;Add one explicit long-term entity, such as a fake preference.&lt;/li&gt;
&lt;li&gt;Record one reasoning step or tool call.&lt;/li&gt;
&lt;li&gt;Retrieve context on the next turn.&lt;/li&gt;
&lt;li&gt;Verify which memory tier produced which returned item.&lt;/li&gt;
&lt;li&gt;Correct or delete one memory record.&lt;/li&gt;
&lt;li&gt;Confirm the correction is visible in the next retrieval.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key artifact is not the demo output.&lt;/p&gt;

&lt;p&gt;The key artifact is the audit trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was stored&lt;/li&gt;
&lt;li&gt;why it was stored&lt;/li&gt;
&lt;li&gt;where it lives&lt;/li&gt;
&lt;li&gt;how it is retrieved&lt;/li&gt;
&lt;li&gt;how it is corrected&lt;/li&gt;
&lt;li&gt;what the agent is not allowed to remember&lt;/li&gt;
&lt;/ul&gt;
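&lt;p&gt;The run above can be rehearsed without any backend at all. The following Python sketch uses a plain in-memory stand-in, so every name in it (&lt;code&gt;SandboxMemory&lt;/code&gt;, &lt;code&gt;Record&lt;/code&gt;, &lt;code&gt;Tier&lt;/code&gt;) is hypothetical and none of it is the &lt;code&gt;agent-memory&lt;/code&gt; API. It only shows the shape of the evidence: each returned item names its tier, and each write or correction leaves an audit entry.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    SHORT_TERM = "short_term"
    LONG_TERM = "long_term"
    REASONING = "reasoning"


@dataclass
class Record:
    tier: Tier
    user: str
    content: str
    reason: str  # why it was stored


@dataclass
class SandboxMemory:
    """In-memory stand-in for a memory backend (assumption, not a real SDK)."""
    records: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)
    next_id: int = 1

    def write(self, record: Record) -> int:
        rid = self.next_id
        self.next_id += 1
        self.records[rid] = record
        self.audit.append(("write", rid, record.tier.value, record.reason))
        return rid

    def correct(self, rid: int, content: str, reason: str) -> None:
        self.records[rid].content = content
        self.audit.append(("correct", rid, self.records[rid].tier.value, reason))

    def retrieve(self, user: str) -> list:
        # Each item carries its tier so provenance stays visible to the reviewer.
        return [(r.tier.value, r.content) for r in self.records.values() if r.user == user]


mem = SandboxMemory()
mem.write(Record(Tier.SHORT_TERM, "test-user", "Please use metric units.", "session message"))
pref = mem.write(Record(Tier.LONG_TERM, "test-user", "prefers imperial units", "fake preference"))
mem.write(Record(Tier.REASONING, "test-user", "tool call unit_lookup failed, retried", "trace step"))
mem.correct(pref, "prefers metric units", "user correction")
context = mem.retrieve("test-user")
```

&lt;p&gt;If a real backend cannot produce the equivalent of &lt;code&gt;context&lt;/code&gt; and &lt;code&gt;mem.audit&lt;/code&gt; for a fake user, that gap is worth knowing before any real data is written.&lt;/p&gt;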

&lt;h2&gt;
  
  
  The main pitfall
&lt;/h2&gt;

&lt;p&gt;The biggest mistake is treating memory as a feature toggle.&lt;/p&gt;

&lt;p&gt;"Add memory" sounds like a product improvement.&lt;/p&gt;

&lt;p&gt;In practice, it changes the agent's state model.&lt;/p&gt;

&lt;p&gt;A stateless agent can be wrong in one run.&lt;/p&gt;

&lt;p&gt;A stateful agent can be wrong, remember the wrong thing, and use that wrong memory later with confidence.&lt;/p&gt;

&lt;p&gt;That does not mean memory is bad. It means memory needs a smaller first run, clearer permissions, and a visible review path.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical adoption checklist
&lt;/h2&gt;

&lt;p&gt;Before giving an AI host access to a memory layer, answer these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What memory tiers are enabled?&lt;/li&gt;
&lt;li&gt;Which writes are automatic and which require approval?&lt;/li&gt;
&lt;li&gt;Where is the backing store?&lt;/li&gt;
&lt;li&gt;Is memory scoped by user, workspace, tenant, or project?&lt;/li&gt;
&lt;li&gt;Can a user inspect and correct remembered facts?&lt;/li&gt;
&lt;li&gt;Are reasoning traces stored separately from durable user knowledge?&lt;/li&gt;
&lt;li&gt;Does retrieval show provenance?&lt;/li&gt;
&lt;li&gt;Is there a deletion path?&lt;/li&gt;
&lt;li&gt;Is there a sandbox test that proves all of the above?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is unclear, the next step is not production integration.&lt;/p&gt;

&lt;p&gt;The next step is a smaller verification loop.&lt;/p&gt;

&lt;p&gt;Reference: Doramagic agent-memory project page and manual: &lt;a href="https://doramagic.ai/en/projects/agent-memory/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/agent-memory/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclosure: this post is based on an independent Doramagic project pack for &lt;code&gt;neo4j-labs/agent-memory&lt;/code&gt;. It is not official Neo4j documentation and does not imply endorsement by the upstream project.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>neo4j</category>
      <category>llmops</category>
    </item>
    <item>
      <title>Before You Give an AI Agent a Browser, Define the Puppeteer Boundary</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Fri, 19 Jun 2026 03:51:39 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-give-an-ai-agent-a-browser-define-the-puppeteer-boundary-1502</link>
      <guid>https://dev.to/doramagic/before-you-give-an-ai-agent-a-browser-define-the-puppeteer-boundary-1502</guid>
      <description>&lt;h1&gt;
  
  
  Before You Give an AI Agent a Browser, Define the Puppeteer Boundary
&lt;/h1&gt;

&lt;p&gt;Puppeteer is one of the most practical tools you can give an AI coding agent. It lets a Node.js workflow control Chrome or Firefox through a high-level JavaScript API, which makes it useful for browser automation, screenshots, scraping, page checks, and repeatable web tasks.&lt;/p&gt;

&lt;p&gt;That power is also the risk.&lt;/p&gt;

&lt;p&gt;Once an agent can open a browser, it may touch live web sessions, read page content, follow links, download files, submit forms, or capture screenshots that contain private state. The first question should not be "Can the agent use Puppeteer?" The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the smallest browser task the agent can run while producing evidence and staying inside a clear boundary?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the first-use checklist I would apply before loading a Puppeteer-oriented capability into an AI coding host.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Separate the Two Install Paths
&lt;/h2&gt;

&lt;p&gt;Puppeteer has two common install paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;npm i puppeteer&lt;/code&gt;: the full package, which by default also downloads a compatible Chrome build during installation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;npm i puppeteer-core&lt;/code&gt;: the lighter core library, where you provide the browser executable yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That choice matters for agent workflows.&lt;/p&gt;

&lt;p&gt;If the agent is only checking a known local page or a CI preview, &lt;code&gt;puppeteer-core&lt;/code&gt; plus an explicit &lt;code&gt;executablePath&lt;/code&gt; may be easier to reason about. If the agent needs a bundled browser path for a quick isolated smoke test, the full package can be convenient, but it also changes the install surface.&lt;/p&gt;

&lt;p&gt;Do not let the agent choose this casually. Ask it to state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which package it wants;&lt;/li&gt;
&lt;li&gt;whether browser download is expected;&lt;/li&gt;
&lt;li&gt;where temporary files and browser cache will live;&lt;/li&gt;
&lt;li&gt;whether the run needs network access;&lt;/li&gt;
&lt;li&gt;what command proves the first task worked.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Make the First Run Evidence-Oriented
&lt;/h2&gt;

&lt;p&gt;A useful first Puppeteer run should produce a small artifact, not just a confident explanation.&lt;/p&gt;

&lt;p&gt;Good first-run evidence can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a screenshot from a local test page;&lt;/li&gt;
&lt;li&gt;a saved HTML excerpt from a controlled URL;&lt;/li&gt;
&lt;li&gt;a list of network requests for a single page;&lt;/li&gt;
&lt;li&gt;a console log capture;&lt;/li&gt;
&lt;li&gt;a short JSON report containing URL, status, title, selector result, and timestamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad first-run evidence is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I inspected the page and it looks fine";&lt;/li&gt;
&lt;li&gt;"the automation should work";&lt;/li&gt;
&lt;li&gt;"Puppeteer is installed";&lt;/li&gt;
&lt;li&gt;a screenshot from a logged-in personal account;&lt;/li&gt;
&lt;li&gt;a run that requires production credentials before proving the basic path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AI coding agents, the rule should be simple: if no artifact is produced, the browser task is not verified yet.&lt;/p&gt;
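&lt;p&gt;A rough Python sketch of the host-side half of that rule: writing the short JSON report and refusing to call a run verified when the artifact is missing or incomplete. The browser step itself is out of scope here. The function names, the &lt;code&gt;report.json&lt;/code&gt; filename, and the choice to require status 200 are assumptions made for illustration.&lt;/p&gt;

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

REQUIRED_FIELDS = {"url", "status", "title", "selector_found", "timestamp"}


def write_report(out_dir: Path, url: str, status: int, title: str, selector_found: bool) -> Path:
    """Write a small JSON evidence file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {
        "url": url,
        "status": status,
        "title": title,
        "selector_found": selector_found,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    path = out_dir / "report.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path


def is_verified(path: Path) -> bool:
    """No artifact, or an artifact with missing fields, means not verified."""
    if not path.is_file():
        return False
    data = json.loads(path.read_text(encoding="utf-8"))
    complete = REQUIRED_FIELDS.issubset(data)
    return complete and data["status"] == 200 and data["selector_found"] is True


# Demo in a throwaway directory so nothing persists after the check.
with tempfile.TemporaryDirectory() as tmp:
    report_path = write_report(Path(tmp), "http://localhost:3000", 200, "Home", True)
    verified = is_verified(report_path)
    missing = is_verified(Path(tmp) / "absent.json")
```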

&lt;h2&gt;
  
  
  3. Treat Browser Access as a Permission Boundary
&lt;/h2&gt;

&lt;p&gt;The Doramagic Puppeteer boundary card recommends starting with minimal permissions, a temporary directory, and rollbackable configuration. That is the right default.&lt;/p&gt;

&lt;p&gt;Before the agent runs Puppeteer, define the boundary in plain language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allowed URLs or domains;&lt;/li&gt;
&lt;li&gt;whether login state may be used;&lt;/li&gt;
&lt;li&gt;whether screenshots may be captured;&lt;/li&gt;
&lt;li&gt;whether downloads are allowed;&lt;/li&gt;
&lt;li&gt;whether form submission is allowed;&lt;/li&gt;
&lt;li&gt;where browser data and temporary files may be written;&lt;/li&gt;
&lt;li&gt;what must stop the run immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use Puppeteer only against http://localhost:3000.
Do not use existing browser profiles.
Do not submit forms.
Save one screenshot and one JSON report under ./artifacts/browser-smoke/.
If the page redirects outside localhost, stop and report.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the agent from turning a simple UI check into an open-ended web session.&lt;/p&gt;
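&lt;p&gt;The URL part of that boundary is easy to get subtly wrong with string matching, because &lt;code&gt;http://localhost.evil.com:3000&lt;/code&gt; also starts with "http://localhost". A small Python sketch of an origin check the host could run on every navigation and redirect target follows. It is independent of Puppeteer itself, and &lt;code&gt;ALLOWED_ORIGINS&lt;/code&gt;, &lt;code&gt;may_visit&lt;/code&gt;, and &lt;code&gt;check_redirect_chain&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Assumed policy: exactly one origin, taken from the example instruction.
ALLOWED_ORIGINS = {("http", "localhost", 3000)}
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> tuple:
    """Return (scheme, hostname, port) with implicit ports filled in."""
    parts = urlparse(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def may_visit(url: str) -> bool:
    # Compare parsed origins, never string prefixes.
    return origin_of(url) in ALLOWED_ORIGINS


def check_redirect_chain(urls: list) -> str:
    """Return 'ok' or a stop message naming the first out-of-bounds URL."""
    for url in urls:
        if not may_visit(url):
            return f"stop: {url} is outside the allowed boundary"
    return "ok"
```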

&lt;h2&gt;
  
  
  4. Watch the Real First-Use Pitfalls
&lt;/h2&gt;

&lt;p&gt;The Doramagic Puppeteer pack records source-linked pitfalls around install behavior, browser versions, package alerts, flaky tests, cache behavior, Firefox viewport behavior, and browser binary availability. These should not be exaggerated into "Puppeteer is unsafe" or "Puppeteer is broken." They are checkpoints.&lt;/p&gt;

&lt;p&gt;The useful habit is to turn each risk into a verification question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which Node version is being used?&lt;/li&gt;
&lt;li&gt;Is the project installing &lt;code&gt;puppeteer&lt;/code&gt; or &lt;code&gt;puppeteer-core&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;Was Chrome downloaded, skipped, or supplied externally?&lt;/li&gt;
&lt;li&gt;Does the browser executable exist where the agent thinks it exists?&lt;/li&gt;
&lt;li&gt;Is the cache directory temporary and disposable?&lt;/li&gt;
&lt;li&gt;Does the smoke test work on the target browser, not just on the agent's assumption?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is especially important in CI, containers, and remote development environments where browser dependencies differ from a developer laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Use GO / HOLD / NO-GO for Agent Runs
&lt;/h2&gt;

&lt;p&gt;For a first Puppeteer capability run, I would use this decision rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GO: the task uses a controlled URL, produces a screenshot or report, avoids existing user profiles, and can be rerun.&lt;/li&gt;
&lt;li&gt;HOLD: the browser opens, but install path, cache path, executable path, or target URL is unclear.&lt;/li&gt;
&lt;li&gt;NO-GO: the agent needs production login state, private data, or external form submission before proving a minimal browser smoke test.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is not to make the agent timid. The point is to keep browser automation inspectable.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. A Safer First Instruction
&lt;/h2&gt;

&lt;p&gt;Instead of asking an AI coding agent to "set up browser automation," start with a smaller instruction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using the Puppeteer capability notes, design the smallest safe browser smoke test for this repo.
Use only a local or explicitly approved URL.
Do not use an existing browser profile.
Do not submit forms or use credentials.
Return the planned command, expected artifact, stop conditions, and rollback path before running anything.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That instruction forces the agent to expose the boundary before it touches the browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. What This Helps With
&lt;/h2&gt;

&lt;p&gt;This workflow is useful when you want an AI host to use Puppeteer for screenshots, scraping, browser checks, or UI automation without quietly expanding its permissions.&lt;/p&gt;

&lt;p&gt;It does not replace the official Puppeteer documentation. It does not prove production readiness. It does not mean the Puppeteer maintainers endorse this pack.&lt;/p&gt;

&lt;p&gt;The useful mental model is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Puppeteer gives the agent browser hands. Your boundary gives it judgment about where those hands are allowed to go.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reference: the independent Doramagic Puppeteer project page and manual are here: &lt;a href="https://doramagic.ai/en/projects/puppeteer/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/puppeteer/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upstream project: &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;https://github.com/puppeteer/puppeteer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclosure: this is based on an independent Doramagic capability pack for Puppeteer. It is not affiliated with or endorsed by Puppeteer or Google unless explicitly stated.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>opensource</category>
      <category>puppeteer</category>
    </item>
    <item>
      <title>Before You Trust an LLM App, Review the Trace</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:27:03 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-trust-an-llm-app-review-the-trace-e7h</link>
      <guid>https://dev.to/doramagic/before-you-trust-an-llm-app-review-the-trace-e7h</guid>
      <description>&lt;p&gt;When an AI agent fails, the dangerous part is not only the failed output. The dangerous part is the next run.&lt;/p&gt;

&lt;p&gt;If nobody reviews the trace, the prompt change, the score signal, and the recovery action, the system can repeat the same mistake with more confidence. That is where Langfuse is useful: it is an open-source LLM engineering platform for observability, metrics, evals, prompt management, playgrounds, and datasets.&lt;/p&gt;

&lt;p&gt;The practical question is not "Should I install Langfuse?" The better first question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What evidence should I review before trusting the next LLM run?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the checklist I would use before adding a Langfuse-oriented workflow to an AI coding host.&lt;/p&gt;

&lt;p&gt;One clarification matters: the Doramagic Langfuse pack is neither a single prompt nor a prompt library. It is a capability asset that includes host instructions, a prompt preview, a human manual, a pitfall log, a boundary/risk card, eval checks, a smoke check, a test log, and a feedback path. The point is to make an AI host reason from source-backed evidence, not to make Langfuse sound easier than it is.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Start With the Failure Review Loop
&lt;/h2&gt;

&lt;p&gt;For an LLM app, a useful review loop needs four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A trace that shows what happened.&lt;/li&gt;
&lt;li&gt;A scoring or eval signal that says whether the run was acceptable.&lt;/li&gt;
&lt;li&gt;A prompt-management habit that records what changed.&lt;/li&gt;
&lt;li&gt;A recovery rule that stops the next run from pretending the issue is solved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without those four pieces, observability becomes a dashboard you look at after damage is already done.&lt;/p&gt;

&lt;p&gt;For agent workflows, the first useful move is simple: require the agent to state what evidence it has before it claims success. If it cannot point to a trace, eval, or smoke check, it should say "not verified" instead of "done".&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Treat Prompt Changes as Production Changes
&lt;/h2&gt;

&lt;p&gt;Prompt edits often look harmless because they are just text. In practice, a prompt change can alter routing, tool selection, scoring behavior, or output format.&lt;/p&gt;

&lt;p&gt;A Langfuse-oriented workflow should make prompt changes reviewable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was changed?&lt;/li&gt;
&lt;li&gt;Which task or dataset is affected?&lt;/li&gt;
&lt;li&gt;Which score changed after the edit?&lt;/li&gt;
&lt;li&gt;Was the improvement measured on one example or on a repeatable set?&lt;/li&gt;
&lt;li&gt;Can the change be rolled back?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters even more when an AI coding agent is involved. If the agent changes code and prompt behavior in the same session, you need a way to separate "the code got better" from "the prompt made the evaluator more forgiving".&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Do Not Skip the Boundary Card
&lt;/h2&gt;

&lt;p&gt;The Doramagic Langfuse pack marks an important boundary: it is not proof that Langfuse is installed, configured, or production-ready in your environment.&lt;/p&gt;

&lt;p&gt;It is also not official Langfuse documentation. Its limits are intentional: it helps prepare an evidence-review workflow, but it cannot replace upstream docs, runtime installation evidence, or production security review.&lt;/p&gt;

&lt;p&gt;Before real use, check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the test running in a temporary environment or container?&lt;/li&gt;
&lt;li&gt;Are production keys, private data, and main config directories excluded?&lt;/li&gt;
&lt;li&gt;Can the prompt or config change be rolled back?&lt;/li&gt;
&lt;li&gt;Is the run backed by a real trace or only by an agent explanation?&lt;/li&gt;
&lt;li&gt;Are failures recorded as evidence, not hidden as "retry noise"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between observability and theater. A dashboard is not a boundary. A trace is not a guarantee. A score is not a release approval unless the team defines how the score is used.&lt;/p&gt;

&lt;p&gt;The boundary rule is explicit: if there is no trace, no eval, no smoke check, or no recorded rollback path, the agent should not claim success. That keeps the workflow useful even before runtime installation is complete.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Watch for Real Integration Pitfalls
&lt;/h2&gt;

&lt;p&gt;The Doramagic pack records source-linked pitfalls that should be checked before first use. Examples include open or version-sensitive issues around scoring behavior, unnamed traces in the UI, Semantic Kernel/OpenLIT integration behavior, worker shutdown behavior in self-hosted Kubernetes, and idle BullMQ queue timeout behavior.&lt;/p&gt;

&lt;p&gt;The important discipline is not to overstate those issues. They are not universal proof that Langfuse is broken. They are reminders to verify the specific version, integration path, and deployment mode you plan to use.&lt;/p&gt;

&lt;p&gt;For a first run, I would use a GO / HOLD / NO-GO rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GO: a minimal trace is captured, an eval/smoke check is visible, and rollback is clear.&lt;/li&gt;
&lt;li&gt;HOLD: the trace exists but scoring, naming, worker behavior, or integration compatibility is unclear.&lt;/li&gt;
&lt;li&gt;NO-GO: the agent claims success without runtime evidence, or secrets/production data are required before basic verification.&lt;/li&gt;
&lt;/ul&gt;
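&lt;p&gt;One way to keep that rule from living only in someone's head is to write it down as a function. The Python sketch below is one possible reading of the three bullets, not a Langfuse feature: the &lt;code&gt;RunEvidence&lt;/code&gt; fields are hypothetical, and treating "no trace" as NO-GO and "trace but no eval or rollback" as HOLD are interpretations.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RunEvidence:
    has_trace: bool
    has_eval_or_smoke_check: bool
    rollback_recorded: bool
    needs_secrets_or_prod_data: bool
    open_questions: tuple = ()  # e.g. ("scoring", "trace naming")


def decide(e: RunEvidence) -> str:
    # NO-GO first: missing runtime evidence or a need for secrets overrides everything.
    if e.needs_secrets_or_prod_data or not e.has_trace:
        return "NO-GO"
    # HOLD: a trace exists, but something about the run is still unresolved.
    if e.open_questions or not (e.has_eval_or_smoke_check and e.rollback_recorded):
        return "HOLD"
    return "GO"
```

&lt;p&gt;The order of the checks is the design choice: a run that needs production secrets is NO-GO even if everything else looks clean.&lt;/p&gt;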
&lt;h2&gt;
  
  
  5. The First Safe Agent Instruction
&lt;/h2&gt;

&lt;p&gt;If you load a Langfuse-oriented capability pack into an AI coding host, the first instruction should not be "set this up in production".&lt;/p&gt;

&lt;p&gt;Use a safer first instruction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using this pack, identify the first safe verification step for a Langfuse-oriented failure-review workflow.
Do not call external tools unless explicitly approved.
Do not claim Langfuse is installed or working without trace or eval evidence.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected result is not a finished integration. The expected result is a boundary-aware next step: what to verify, where evidence should come from, and what would count as failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. What This Helps With
&lt;/h2&gt;

&lt;p&gt;This workflow helps when you are building or operating LLM systems and want an AI assistant to reason from evidence instead of guessing. It is especially useful when the team needs to review traces, evals, prompt changes, and recovery actions before another agent run is trusted.&lt;/p&gt;

&lt;p&gt;It does not replace Langfuse's official documentation. It does not prove production readiness. It does not mean upstream maintainers endorse this pack.&lt;/p&gt;

&lt;p&gt;The useful mental model is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Langfuse can give you the observability surface. Your process still needs the boundary, the eval rule, and the rollback habit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reference: the independent Doramagic Langfuse project page and manual are here: &lt;a href="https://doramagic.ai/en/projects/langfuse/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/langfuse/manual/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upstream project: &lt;a href="https://github.com/langfuse/langfuse" rel="noopener noreferrer"&gt;https://github.com/langfuse/langfuse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclosure: this is based on an independent Doramagic capability pack for Langfuse. It is not affiliated with or endorsed by Langfuse unless explicitly stated.&lt;/p&gt;

</description>
      <category>devtools</category>
    </item>
    <item>
      <title>Before You Add More Agents, Design the Control Plane</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Wed, 27 May 2026 03:09:03 +0000</pubDate>
      <link>https://dev.to/doramagic/before-you-add-more-agents-design-the-control-plane-2a0i</link>
      <guid>https://dev.to/doramagic/before-you-add-more-agents-design-the-control-plane-2a0i</guid>
      <description>&lt;p&gt;OpenAI Agents Python makes it easy to describe agents, connect tools, define handoffs, and run agentic workflows. That is useful, but it also creates a trap: teams may start by adding more agents before they define the operational boundaries that make those agents safe to use in a real repository.&lt;/p&gt;

&lt;p&gt;The hard part is usually not getting the first demo to run. The hard part is knowing when an agent should start, what it is allowed to touch, what evidence it must leave behind, when it can hand work to another agent, and how the team will recover when the workflow fails.&lt;/p&gt;

&lt;p&gt;For production use, I would start with a control plane.&lt;/p&gt;

&lt;p&gt;This does not need to be a heavy platform. A Markdown checklist, a JSON policy file, and a trace log can be enough for the first version. The key is that the rules exist outside the model's temporary reasoning. They become part of the workflow, not something you hope the model remembers.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Define the task entry contract
&lt;/h2&gt;

&lt;p&gt;An agent should not start from a vague instruction like "fix this feature" or "improve this repo." That may be fine for a toy demo, but it is too wide for real work.&lt;/p&gt;

&lt;p&gt;A task entry contract should answer five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the goal?&lt;/li&gt;
&lt;li&gt;What input is trusted?&lt;/li&gt;
&lt;li&gt;What files, services, or systems are in scope?&lt;/li&gt;
&lt;li&gt;What is the acceptance standard?&lt;/li&gt;
&lt;li&gt;When should the agent stop instead of improvising?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a safer engineering task might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read only the package under &lt;code&gt;src/connectors&lt;/code&gt;. Modify only &lt;code&gt;connector_policy.py&lt;/code&gt; and related tests. Preserve the git diff. Run the connector test suite. If the requested behavior conflicts with an existing policy rule, stop and return the conflict instead of rewriting the policy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That kind of instruction is not just prompt polish. It reduces the agent's degrees of freedom. It turns an open-ended request into an executable contract.&lt;/p&gt;
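
&lt;p&gt;As a rough illustration, the five questions can be written down as a small data structure that is checked before the agent starts. This is a minimal sketch, not part of OpenAI Agents Python: the &lt;code&gt;TaskContract&lt;/code&gt; class, its field names, the example goal, and the test file path are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, replace
from pathlib import PurePosixPath


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical entry contract; one field per question in the list above."""
    goal: str
    trusted_inputs: tuple[str, ...]
    readable_roots: tuple[str, ...]
    writable_files: tuple[str, ...]
    acceptance: str
    stop_conditions: tuple[str, ...]

    def missing_fields(self) -> list[str]:
        """Return the names of contract fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]

    def may_read(self, path: str) -> bool:
        """True if path sits under a readable root and contains no '..' segment."""
        target = PurePosixPath(path)
        if ".." in target.parts:
            return False
        return any(target.is_relative_to(root) for root in self.readable_roots)

    def may_write(self, path: str) -> bool:
        """True only for files explicitly listed as writable."""
        return path in self.writable_files


# Illustrative values loosely following the example task in the text.
contract = TaskContract(
    goal="Reject connectors that lack an owner field",
    trusted_inputs=("the task ticket text",),
    readable_roots=("src/connectors",),
    writable_files=(
        "src/connectors/connector_policy.py",
        "src/connectors/tests/test_connector_policy.py",
    ),
    acceptance="connector test suite passes",
    stop_conditions=("requested behavior conflicts with an existing policy rule",),
)

# A vague request leaves fields empty, and the check reports which ones.
vague = replace(contract, goal="", stop_conditions=())
print(vague.missing_fields())
```

&lt;p&gt;The point of the sketch is that an incomplete contract is detectable before any tool runs, so the workflow can refuse to start instead of letting the agent improvise.&lt;/p&gt;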

&lt;p&gt;The business value is simple: fewer surprising edits, fewer review cycles, and less time spent asking why an agent touched something unrelated.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Separate tools by risk
&lt;/h2&gt;

&lt;p&gt;Tool access should not be binary. "The agent can use tools" is too broad. A file search is not the same risk as deleting a directory, publishing an article, or calling a production API.&lt;/p&gt;

&lt;p&gt;I prefer three buckets.&lt;/p&gt;

&lt;p&gt;Low-risk tools can run directly. Examples: read a file, search for symbols, inspect documentation, list a directory, or open a local artifact.&lt;/p&gt;

&lt;p&gt;Medium-risk tools can run if they leave evidence. Examples: modify a draft, generate a patch, run tests, create a report, or produce a migration plan. The output should be inspectable.&lt;/p&gt;

&lt;p&gt;High-risk tools require an explicit gate. Examples: destructive git commands, deleting files, pushing to a remote, publishing content, spending money, modifying production infrastructure, or calling external APIs with side effects.&lt;/p&gt;

&lt;p&gt;OpenAI Agents Python gives you a framework for building the workflow. It does not automatically know your risk model. That risk model belongs in your engineering system.&lt;/p&gt;
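
&lt;p&gt;One way to keep that risk model outside the prompt is a plain lookup table plus a single authorization function. The sketch below assumes hypothetical tool names and a hypothetical &lt;code&gt;authorize&lt;/code&gt; helper; it does not use or describe any OpenAI Agents Python API.&lt;/p&gt;

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"        # may run directly
    MEDIUM = "medium"  # may run if it leaves evidence
    HIGH = "high"      # needs an explicit gate


# Hypothetical tool names mapped to the three buckets described above.
TOOL_RISK = {
    "read_file": Risk.LOW,
    "search_symbols": Risk.LOW,
    "generate_patch": Risk.MEDIUM,
    "run_tests": Risk.MEDIUM,
    "git_push": Risk.HIGH,
    "publish_article": Risk.HIGH,
}


def authorize(
    tool: str,
    *,
    evidence_path: Optional[str] = None,
    approved_by: Optional[str] = None,
) -> tuple[bool, str]:
    """Decide whether a tool call may proceed and give a reason for the trace."""
    # Tools missing from the table fall into the strictest tier.
    risk = TOOL_RISK.get(tool, Risk.HIGH)
    if risk is Risk.LOW:
        return True, "low risk: runs directly"
    if risk is Risk.MEDIUM:
        if evidence_path:
            return True, f"medium risk: evidence at {evidence_path}"
        return False, "medium risk: no evidence destination given"
    if approved_by:
        return True, f"high risk: approved by {approved_by}"
    return False, "high risk: explicit approval required"


print(authorize("read_file"))
print(authorize("git_push"))
print(authorize("git_push", approved_by="release-owner"))
```

&lt;p&gt;Defaulting unknown tools to the high-risk tier is a deliberate choice in this sketch: a newly added tool stays gated until someone classifies it.&lt;/p&gt;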

&lt;p&gt;If your agent can publish content, the publication action should not be treated the same way as writing a local draft. If your agent can modify code, the modification should not be treated the same way as reading code. If your agent can call production services, the system needs a gate before side effects happen.&lt;/p&gt;

&lt;p&gt;This is where many agent workflows become fragile. The model may be capable, but the surrounding system has no authority model.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Make handoffs evidence-based
&lt;/h2&gt;

&lt;p&gt;Multi-agent workflows are attractive because they map nicely to human roles: researcher, planner, coder, reviewer, publisher. But every handoff creates a new failure point.&lt;/p&gt;

&lt;p&gt;A handoff table should define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a handoff is allowed&lt;/li&gt;
&lt;li&gt;Which agent receives the task&lt;/li&gt;
&lt;li&gt;What evidence must be passed along&lt;/li&gt;
&lt;li&gt;Which cases block the handoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A research agent should not hand work to a writing agent by saying "I found the sources." It should pass source links, key claims, contradictions, uncertain points, and the reason those sources are relevant.&lt;/p&gt;

&lt;p&gt;A coding agent should not hand work to a release agent by saying "fixed." It should pass the diff, tests run, tests skipped, remaining risk, and rollback path.&lt;/p&gt;

&lt;p&gt;That evidence is the difference between agentic collaboration and a chain of guesses.&lt;/p&gt;
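
&lt;p&gt;The coding-to-release handoff described above could be enforced with a small record and a blocking check. This is a sketch under assumptions: &lt;code&gt;CodingHandoff&lt;/code&gt; and &lt;code&gt;handoff_blockers&lt;/code&gt; are hypothetical names, and the rule that skipped tests may be empty while the other fields may not is my own reading of the list.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class CodingHandoff:
    """Evidence a coding agent passes to a release agent (illustrative fields)."""
    diff: str
    tests_run: list[str]
    tests_skipped: list[str]
    remaining_risk: str
    rollback_path: str


def handoff_blockers(handoff: CodingHandoff) -> list[str]:
    """Return reasons to block the handoff; an empty list means it may proceed."""
    blockers = []
    if not handoff.diff.strip():
        blockers.append("no diff attached")
    if not handoff.tests_run:
        blockers.append("no tests were run")
    if not handoff.remaining_risk.strip():
        blockers.append("remaining risk not stated (write 'none known' if so)")
    if not handoff.rollback_path.strip():
        blockers.append("no rollback path")
    return blockers


# A handoff that only says "fixed" carries no evidence and is blocked.
just_fixed = CodingHandoff(
    diff="", tests_run=[], tests_skipped=[], remaining_risk="", rollback_path=""
)
print(handoff_blockers(just_fixed))
```

&lt;p&gt;Because the check returns reasons rather than a bare boolean, the same list can be written into the trace when a handoff is refused.&lt;/p&gt;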

&lt;p&gt;The more agents you add, the more important this becomes. Without evidence-based handoffs, every downstream agent has to infer what the upstream agent meant. That makes failures harder to debug and easier to repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Treat the trace as a product feature
&lt;/h2&gt;

&lt;p&gt;When an agent workflow fails, the least useful conclusion is "the model was unreliable." That may be true, but it does not tell you what to improve.&lt;/p&gt;

&lt;p&gt;A useful trace should capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task goal&lt;/li&gt;
&lt;li&gt;The input and source material&lt;/li&gt;
&lt;li&gt;The rules that were active&lt;/li&gt;
&lt;li&gt;The tools that were called&lt;/li&gt;
&lt;li&gt;The files or external systems touched&lt;/li&gt;
&lt;li&gt;The verification result&lt;/li&gt;
&lt;li&gt;The failure reason&lt;/li&gt;
&lt;li&gt;The rule or workflow change suggested for next time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need a complex observability backend on day one. A structured Markdown worklog or JSONL trace can be enough. What matters is that each failure becomes input for improving the system's rules.&lt;/p&gt;
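
&lt;p&gt;A JSONL trace of this kind needs only the standard library. The sketch below writes one record per run using the fields listed above; the function names, key names, and file location are assumptions for illustration, not a format defined by any framework.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_trace(log_path, *, goal, inputs, active_rules, tools_called,
                 touched, verification, failure_reason=None,
                 suggested_change=None):
    """Append one run record as a single JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "goal": goal,
        "inputs": inputs,
        "active_rules": active_rules,
        "tools_called": tools_called,
        "touched": touched,
        "verification": verification,
        "failure_reason": failure_reason,
        "suggested_change": suggested_change,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_failures(log_path):
    """Yield the records whose failure_reason is set."""
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            if record["failure_reason"]:
                yield record
```

&lt;p&gt;Reading back only the failed runs, as &lt;code&gt;read_failures&lt;/code&gt; does, gives a short review list of suggested rule changes without any extra tooling.&lt;/p&gt;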

&lt;p&gt;If a failure came from a vague task, improve the task entry contract. If a failure came from excessive permission, tighten the tool policy. If a failure came from a weak handoff, change the handoff table. If a failure came from missing verification, add a test or preflight check.&lt;/p&gt;

&lt;p&gt;This is how an agent workflow gets more reliable over time. Not by hoping the next model will magically be better, but by converting failures into rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Start with one real workflow
&lt;/h2&gt;

&lt;p&gt;The wrong move is to design a giant multi-agent platform first. The better move is to choose one low-risk but real workflow.&lt;/p&gt;

&lt;p&gt;Good first workflows include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updating documentation after a code change&lt;/li&gt;
&lt;li&gt;Reviewing a pull request for missing tests&lt;/li&gt;
&lt;li&gt;Classifying issues into actionable buckets&lt;/li&gt;
&lt;li&gt;Producing a release note from a verified diff&lt;/li&gt;
&lt;li&gt;Preparing a technical article draft with source links and disclosure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each workflow, define the entry contract, tool policy, handoff table, and trace format. Then run it repeatedly. The goal is not to prove that agents are impressive. The goal is to prove that the workflow reduces repeated coordination while preserving reviewability and rollback.&lt;/p&gt;

&lt;p&gt;If a small workflow becomes stable, expand it. If it keeps failing, the trace should tell you whether the problem is the task, the permissions, the handoff, the verification, or the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical takeaway
&lt;/h2&gt;

&lt;p&gt;OpenAI Agents Python is a useful foundation for building agent workflows. But the production value comes from the control plane around it.&lt;/p&gt;

&lt;p&gt;Before adding more agents, define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How tasks enter the system&lt;/li&gt;
&lt;li&gt;Which tools are allowed under which conditions&lt;/li&gt;
&lt;li&gt;What evidence is required for handoff&lt;/li&gt;
&lt;li&gt;How traces feed back into better rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is less exciting than a flashy demo, but it is the difference between an agent that merely runs and an agent workflow that a team can actually trust.&lt;/p&gt;

&lt;p&gt;Disclosure: this is an unofficial Doramagic technical note. It is not an official OpenAI publication and does not represent the upstream project unless explicitly stated by that project.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>openai</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Adopt Codex CLI only after you can explain the source, boundary, review, and rollback model</title>
      <dc:creator>Tang Weigang</dc:creator>
      <pubDate>Tue, 26 May 2026 03:51:27 +0000</pubDate>
      <link>https://dev.to/doramagic/adopt-codex-cli-only-after-you-can-explain-the-source-boundary-review-and-rollback-model-5bf1</link>
      <guid>https://dev.to/doramagic/adopt-codex-cli-only-after-you-can-explain-the-source-boundary-review-and-rollback-model-5bf1</guid>
      <description>&lt;p&gt;A lot of teams want to treat Codex CLI as a shortcut: install it, point it at a repository, and hope it saves time immediately. That framing is too shallow for a real codebase.&lt;/p&gt;

&lt;p&gt;If you are adopting Codex CLI in a team that cares about quality, the real question is not whether it can write code. The real question is whether the workflow around it starts from a verified source and is explicit enough to be bounded, reviewed, and reversed. Without those four properties, the tool can create output faster, but it cannot create confidence faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Start from the source of truth
&lt;/h2&gt;

&lt;p&gt;Before any assistant touches a repository, someone needs to answer a basic question: what is the current source of truth?&lt;/p&gt;

&lt;p&gt;That sounds trivial, but it is the first place AI-assisted workflows drift. Teams often test a tool against a repository snapshot, an old issue thread, or a blog post that no longer matches the current implementation. Once that happens, every next step becomes fragile because the assistant is reasoning from stale input.&lt;/p&gt;

&lt;p&gt;A useful adoption process starts by checking three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the repository or package is the current one&lt;/li&gt;
&lt;li&gt;the installation or usage instructions still match reality&lt;/li&gt;
&lt;li&gt;the command being run is documented for this version, not an older release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those checks fail, do not treat the tool as “mostly correct.” Treat the source as unresolved. In practice, a fast assistant reading the wrong upstream source is not a productivity gain. It is a faster way to compound confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Make the permission boundary visible
&lt;/h2&gt;

&lt;p&gt;The second boundary is operational scope.&lt;/p&gt;

&lt;p&gt;A team should be able to answer, in plain language, what the tool may read, what it may change, what it may execute, and what requires human approval. If those boundaries are hidden in the operator’s head, the workflow is already too loose.&lt;/p&gt;

&lt;p&gt;This matters because an early demo of an AI coding tool is misleading: it feels safe while the tool is only producing text. The risk appears when the same tool is allowed to inspect files, write patches, run shell commands, or touch directories that nobody explicitly intended to expose.&lt;/p&gt;

&lt;p&gt;A mature setup does not see permission boundaries as friction. It sees them as the thing that makes the workflow repeatable. The point is not to maximize what the tool can do. The point is to define exactly what it can do so the rest of the team can trust the result.&lt;/p&gt;

&lt;p&gt;A practical rule is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read access should be explicit&lt;/li&gt;
&lt;li&gt;write access should be narrow&lt;/li&gt;
&lt;li&gt;destructive actions should require confirmation&lt;/li&gt;
&lt;li&gt;privileged steps should be isolated from exploratory steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you cannot describe the boundary clearly, you do not yet have a production workflow.&lt;/p&gt;
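
&lt;p&gt;To make the practical rule concrete, a boundary can be written as data and queried before each action. The sketch below is hypothetical and is not Codex CLI's configuration format or API; the &lt;code&gt;Boundary&lt;/code&gt; class, the glob patterns, and the command lists are assumptions chosen for illustration.&lt;/p&gt;

```python
import shlex
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Boundary:
    """Hypothetical boundary description; not a Codex CLI config format."""
    read_globs: tuple
    write_globs: tuple
    allowed_commands: tuple   # command prefixes that may run unattended
    confirm_commands: tuple   # command prefixes that need a human yes

    def may_read(self, path):
        return any(fnmatch(path, pattern) for pattern in self.read_globs)

    def may_write(self, path):
        return any(fnmatch(path, pattern) for pattern in self.write_globs)

    def check_command(self, command):
        """Return "confirm", "allow", or "deny" for a command line.

        Assumes the command is later executed as an argument list without
        a shell, so shell operators in the string are never interpreted.
        """
        argv = shlex.split(command)

        def starts_with(prefix):
            tokens = prefix.split()
            return argv[:len(tokens)] == tokens

        if any(starts_with(prefix) for prefix in self.confirm_commands):
            return "confirm"
        if any(starts_with(prefix) for prefix in self.allowed_commands):
            return "allow"
        return "deny"


boundary = Boundary(
    read_globs=("src/*", "docs/*"),      # read access is explicit
    write_globs=("docs/*",),             # write access is narrow
    allowed_commands=("pytest", "git diff", "git status"),
    confirm_commands=("git push", "rm"),  # destructive steps need confirmation
)

print(boundary.check_command("git diff --stat"))
print(boundary.check_command("git push origin main"))
```

&lt;p&gt;Anything not listed is denied, so the written boundary and the enforced boundary cannot drift apart silently.&lt;/p&gt;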

&lt;h2&gt;
  
  
  3. Put review back at the center
&lt;/h2&gt;

&lt;p&gt;The third boundary is review.&lt;/p&gt;

&lt;p&gt;This is where many teams get the biggest false win. A tool produces a patch quickly, and the team celebrates the speed. But if the patch is hard to inspect, hard to compare, or hard to reject, the tool has not reduced cost. It has merely moved the cost to a later phase, when reviewers have less context.&lt;/p&gt;

&lt;p&gt;Review is not a ceremonial step after generation. Review is part of the product.&lt;/p&gt;

&lt;p&gt;A good AI-assisted workflow makes the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;easy to inspect&lt;/li&gt;
&lt;li&gt;easy to compare&lt;/li&gt;
&lt;li&gt;easy to reject&lt;/li&gt;
&lt;li&gt;easy to refine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the assistant should be optimized for diffs, not theatrics. If a change cannot be understood in a short review cycle, the workflow is not ready. The best sign of maturity is not that the assistant can generate a large patch. It is that a normal engineer can explain why the patch is acceptable in minutes.&lt;/p&gt;

&lt;p&gt;This is also where teams should insist on a clear evidence trail. If a change passes, where is the proof? If it fails, what specifically failed? If the answer is vague, the workflow is too soft to rely on.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Treat rollback as part of the design
&lt;/h2&gt;

&lt;p&gt;The fourth boundary is rollback.&lt;/p&gt;

&lt;p&gt;Rollback is often treated like cleanup after the fact. That is the wrong mental model. Rollback is part of the design of the workflow itself.&lt;/p&gt;

&lt;p&gt;Every real repository will eventually see a bad assumption, an incomplete refactor, a broken command, or a change that looked reasonable until someone reviewed it closely. The question is not whether mistakes will happen. The question is whether recovery is fast enough that the team stays calm.&lt;/p&gt;

&lt;p&gt;A rollback-capable workflow has three qualities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you can identify the last safe state&lt;/li&gt;
&lt;li&gt;you can return to it quickly&lt;/li&gt;
&lt;li&gt;you can explain what changed without guessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those three qualities are not present, then every experiment becomes a one-way door. That is too expensive for a solo developer and unacceptable for a shared codebase.&lt;/p&gt;
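
&lt;p&gt;The three qualities can be demonstrated with directory snapshots standing in for version control. This is only a sketch: in a real repository git would normally play this role, the function names are hypothetical, and the comparison covers top-level files only.&lt;/p&gt;

```python
import filecmp
import shutil


def snapshot(workdir, safe_dir):
    """Record the last safe state as a full copy of workdir."""
    shutil.rmtree(safe_dir, ignore_errors=True)
    shutil.copytree(workdir, safe_dir)


def changed_files(workdir, safe_dir):
    """List top-level entries that differ from, or are absent in, the safe state."""
    comparison = filecmp.dircmp(safe_dir, workdir)
    return sorted(
        comparison.diff_files + comparison.left_only + comparison.right_only
    )


def rollback(workdir, safe_dir):
    """Return workdir to the last safe state by replacing it with the copy."""
    shutil.rmtree(workdir)
    shutil.copytree(safe_dir, workdir)
```

&lt;p&gt;Calling &lt;code&gt;snapshot&lt;/code&gt; before the tool runs, &lt;code&gt;changed_files&lt;/code&gt; during review, and &lt;code&gt;rollback&lt;/code&gt; on rejection maps one function to each quality in the list.&lt;/p&gt;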

&lt;p&gt;This is the difference between “the tool can help me write code” and “the tool can participate in an engineering system.” The first is a demo. The second is a capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Use a better adoption question
&lt;/h2&gt;

&lt;p&gt;The wrong question is: can the tool generate good code?&lt;/p&gt;

&lt;p&gt;The better question is: can the team trust the workflow around the tool?&lt;/p&gt;

&lt;p&gt;That better question breaks down into four operational checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can we identify the source of truth before the tool starts?&lt;/li&gt;
&lt;li&gt;Can we define the tool’s authority without ambiguity?&lt;/li&gt;
&lt;li&gt;Can we tell whether the change is acceptable in under five minutes?&lt;/li&gt;
&lt;li&gt;Can we return to the last known good state without guesswork?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those are more useful than any demo because they turn a vague technology discussion into a reviewable operating standard.&lt;/p&gt;

&lt;p&gt;If the answer to any of those questions is “not yet,” the right response is not to push harder on the model. The right response is to fix the workflow boundary first.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. What a real adoption path looks like
&lt;/h2&gt;

&lt;p&gt;For a real team, the best rollout is boring on purpose.&lt;/p&gt;

&lt;p&gt;It should begin with a narrow, reversible use case. Not a magical broad permission set. Not an open-ended “let’s see what happens.” A narrow path where the output is easy to inspect and easy to undo.&lt;/p&gt;

&lt;p&gt;A good adoption path usually looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;choose one repository&lt;/li&gt;
&lt;li&gt;define one class of change&lt;/li&gt;
&lt;li&gt;define one reviewer&lt;/li&gt;
&lt;li&gt;define one rollback path&lt;/li&gt;
&lt;li&gt;measure whether the same standard holds on the second and third run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repetition matters. The first successful run is easy to overvalue because everybody is paying attention. The real test is the second, third, and tenth run, when the novelty is gone and the tool has to fit ordinary work.&lt;/p&gt;

&lt;p&gt;If the workflow does not survive repetition, it is not ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Why this matters for the team, not just the tool
&lt;/h2&gt;

&lt;p&gt;This approach is bigger than Codex CLI.&lt;/p&gt;

&lt;p&gt;Any AI coding tool used in a real repository should be evaluated the same way. The issue is not which vendor is cleverer. The issue is whether the team can maintain control while gaining speed.&lt;/p&gt;

&lt;p&gt;When something goes wrong, a mature team should not debate the intelligence of the tool. It should inspect the broken boundary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;was the source stale?&lt;/li&gt;
&lt;li&gt;were the permissions too broad?&lt;/li&gt;
&lt;li&gt;was the review path unclear?&lt;/li&gt;
&lt;li&gt;was rollback not guaranteed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That framing reduces emotional noise and makes the problem fixable. It also makes the workflow easier to teach to other engineers because the rules are operational, not mystical.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The shortest useful conclusion
&lt;/h2&gt;

&lt;p&gt;Codex CLI is worth adopting only when the surrounding workflow is already disciplined enough to keep it honest.&lt;/p&gt;

&lt;p&gt;If source is verified, permissions are bounded, review is visible, and rollback is guaranteed, the tool becomes useful. If not, it just helps you create uncertainty faster.&lt;/p&gt;

&lt;p&gt;Doramagic project page: &lt;a href="https://doramagic.ai/en/projects/codex/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/codex/&lt;/a&gt;&lt;br&gt;
Manual: &lt;a href="https://doramagic.ai/en/projects/codex/manual/" rel="noopener noreferrer"&gt;https://doramagic.ai/en/projects/codex/manual/&lt;/a&gt;&lt;br&gt;
Source repository: &lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;https://github.com/openai/codex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unofficial note: this is an unofficial AI capability package made by Doramagic. Unless the upstream project states otherwise, it does not represent an official upstream release.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivitydevops</category>
    </item>
  </channel>
</rss>
