<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 眭林飞（Yabo.sui）</title>
    <description>The latest articles on DEV Community by 眭林飞（Yabo.sui） (@suilinfei001).</description>
    <link>https://dev.to/suilinfei001</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3983633%2Fd8f3c38a-4841-4f02-bb88-3660c07f0810.png</url>
      <title>DEV Community: 眭林飞（Yabo.sui）</title>
      <link>https://dev.to/suilinfei001</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/suilinfei001"/>
    <language>en</language>
    <item>
      <title>Stop AI Hallucinations: How to Make Natural Language Testing Real with "Harness Engineering"</title>
      <dc:creator>眭林飞（Yabo.sui）</dc:creator>
      <pubDate>Sun, 14 Jun 2026 09:23:54 +0000</pubDate>
      <link>https://dev.to/suilinfei001/stop-ai-hallucinations-how-to-make-natural-language-testing-real-with-harness-engineering-8k5</link>
      <guid>https://dev.to/suilinfei001/stop-ai-hallucinations-how-to-make-natural-language-testing-real-with-harness-engineering-8k5</guid>
      <description>&lt;h1&gt;
  
  
  Stop AI Hallucinations: How to Make Natural Language Testing Real with "Harness Engineering"
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When the system under test is a business-process-intensive software system (such as a configurable AI Agent platform), traditional automation testing hits a ceiling—API combinatorial explosion, business understanding gaps, and soaring maintenance costs. This article documents our complete journey from pytest hardcoded wrappers to building a CLI SDK and Skill system, ultimately achieving "describe a test scenario in natural language and it executes." While the Agent system serves as our example, this approach &lt;strong&gt;essentially applies to any software product exposing APIs&lt;/strong&gt;—e-commerce, fintech, SaaS, or internal tools. The core insight shifts from &lt;strong&gt;Context Engineering&lt;/strong&gt; (feeding code to AI for test generation) to &lt;strong&gt;Harness Engineering&lt;/strong&gt; (putting a "harness" on AI), fully leveraging AI's decision-making capabilities while constraining how it understands the business system—bringing testing back to its original form: describing behavior and expectations in natural language.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Background: When Business System Complexity Exceeds Traditional Automation Testing
&lt;/h2&gt;

&lt;p&gt;To illustrate concretely, consider a system we deeply tested: a user-facing Agent building platform where users can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an Agent and name it;&lt;/li&gt;
&lt;li&gt;Select and configure the underlying large language model;&lt;/li&gt;
&lt;li&gt;Bind knowledge sources (document libraries, databases, etc.);&lt;/li&gt;
&lt;li&gt;Configure skills (plugins, tool calls, workflows);&lt;/li&gt;
&lt;li&gt;Save and publish the Agent to designated users;&lt;/li&gt;
&lt;li&gt;Chat with the Agent as a specific user and receive structured responses with skill call records and citation sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each operation has an independent RESTful API. Creating an Agent is one API, binding knowledge sources is another, and initiating a chat requires yet another API with session context. This &lt;strong&gt;business-process-intensive, complex API orchestration&lt;/strong&gt; is equally common in e-commerce checkout, loan approval, and SaaS multi-tenant configuration. The Agent platform's testing challenges represent a whole class of software systems' common pain points.&lt;/p&gt;

&lt;p&gt;Initially, our automation testing strategy was "standard": write wrapper functions for each API in pytest, then call these functions sequentially in test cases with hardcoded parameters, capturing responses, extracting fields, and asserting. For example, an "end-to-end Q&amp;amp;A" test looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_qa_with_skill&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Create agent
&lt;/span&gt;    &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Bind knowledge base
&lt;/span&gt;    &lt;span class="nf"&gt;bind_knowledge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kb_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kb_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Configure skill
&lt;/span&gt;    &lt;span class="nf"&gt;add_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 4. Save and publish
&lt;/span&gt;    &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 5. Chat as user
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Shanghai today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 6. Assert skill call records and semantics
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sunny&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach worked when business logic was simple, but as the system rapidly expanded, problems multiplied:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skyrocketing maintenance costs&lt;/strong&gt;: A single business change could require modifying hardcoded parameters across dozens of test functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited expressiveness&lt;/strong&gt;: Testers spent most of their effort orchestrating API call sequences and data passing, rather than describing "how this feature should behave."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-developers excluded&lt;/strong&gt;: Product managers or QA engineers with clear behavioral expectations couldn't bypass Python code to participate in automated test design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed a new approach—one where test authoring and execution more closely mirror how humans describe software behavior: &lt;strong&gt;express test cases in natural language, and have those cases execute reliably.&lt;/strong&gt; And this approach shouldn't be limited to Agent platforms; it should generalize to any business system driven by APIs.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Inspiration: LLM as a Judge and AI-Driven Interactive Tools
&lt;/h2&gt;

&lt;p&gt;Inspired by the LLM as a Judge pattern, we began experimenting with large models to evaluate semantic consistency of test results rather than rigidly matching keywords. Simultaneously, the popularity of AI interaction tools like Claude Code and OpenCLAW showed us another possibility: if developers can use natural language to make AI operate terminals, read/write files, and call external tools, could we also enable AI to directly operate our business systems?&lt;/p&gt;

&lt;p&gt;A naive idea was: &lt;strong&gt;feed the entire API documentation, usage examples, and even source code to a large model, then ask it to generate test code and execute it based on natural language descriptions.&lt;/strong&gt; This is typical &lt;strong&gt;Context Engineering&lt;/strong&gt;—providing complete context and expecting AI to independently understand the business, orchestrate APIs, and produce correct tests.&lt;/p&gt;

&lt;p&gt;But reality taught us lessons quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Severe hallucinations&lt;/strong&gt;: Large models would "invent" non-existent API endpoints, incorrect parameter combinations, even fabricate business rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API coordination out of control&lt;/strong&gt;: Operations like creating resources, configuring properties, and triggering actions have strict sequential dependencies and state passing (e.g., you must use the ID returned from the previous step for the next step). Letting AI decide autonomously resulted in frequently incorrect call sequences, making trial-and-error costs prohibitively high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long feedback loops&lt;/strong&gt;: When the tests themselves depended on AI-generated code, it was hard to tell whether a failure came from a business defect or from a bug in the generated code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;We realized we couldn't let AI fumble in the dark.&lt;/strong&gt; We needed to give it a "map" and a set of "standard actions," enabling it to exercise decision-making capabilities within defined boundaries without deviating from the correct business path. This "map" shouldn't be customized for just one type of system—it should become a &lt;strong&gt;reusable pattern&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. From Context Engineering to Harness Engineering
&lt;/h2&gt;

&lt;p&gt;Our proposed solution is &lt;strong&gt;Harness Engineering&lt;/strong&gt;. The metaphor is straightforward: &lt;strong&gt;put a harness on AI&lt;/strong&gt;, so that it runs in a chosen direction rather than charging blindly across open terrain.&lt;/p&gt;

&lt;p&gt;The Harness has two core layers, both applicable to any system with APIs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer One: CLI SDK Wrapping for Business APIs
&lt;/h3&gt;

&lt;p&gt;We wrapped all core system APIs into a command-line interface (CLI), where each business operation becomes a command with explicit parameters and outputs. For our Agent platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-cli create &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"test_agent"&lt;/span&gt; &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"gpt-4o"&lt;/span&gt;
agent-cli bind-knowledge &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--kb-id&lt;/span&gt; &lt;span class="s2"&gt;"kb_123"&lt;/span&gt;
agent-cli add-skill &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--skill&lt;/span&gt; &lt;span class="s2"&gt;"weather_search"&lt;/span&gt;
agent-cli publish &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="s2"&gt;"user_001"&lt;/span&gt;
agent-cli chat &lt;span class="nt"&gt;--agent-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="s2"&gt;"user_001"&lt;/span&gt; &lt;span class="nt"&gt;--question&lt;/span&gt; &lt;span class="s2"&gt;"What's the weather in Shanghai?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an e-commerce system, this might become &lt;code&gt;order-cli create&lt;/code&gt;, &lt;code&gt;order-cli add-item&lt;/code&gt;, &lt;code&gt;order-cli checkout&lt;/code&gt;, etc. The CLI layer enforces correct invocation methods, with parameter validation completed before command execution. Any non-compliant invocation immediately receives clear error feedback, rather than failing several steps later.&lt;/p&gt;
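
&lt;p&gt;To make the "validate before execute" idea concrete, here is a minimal Python sketch of how such a CLI could be structured with the standard-library &lt;code&gt;argparse&lt;/code&gt; module. It is an illustration under assumptions, not our production SDK: the &lt;code&gt;ALLOWED_MODELS&lt;/code&gt; set and the &lt;code&gt;run&lt;/code&gt; helper are hypothetical, only two of the commands are shown, and the REST call is replaced by returning the validated request.&lt;/p&gt;

```python
import argparse

# Assumption: the set of models configured in the platform under test.
ALLOWED_MODELS = {"gpt-4o"}


def build_parser():
    """Declare each business operation as a subcommand with explicit parameters."""
    parser = argparse.ArgumentParser(prog="agent-cli")
    commands = parser.add_subparsers(dest="command", required=True)

    create = commands.add_parser("create", help="Create a new Agent")
    create.add_argument("--name", required=True)
    create.add_argument("--model", required=True, choices=sorted(ALLOWED_MODELS))

    bind = commands.add_parser("bind-knowledge", help="Bind a knowledge base")
    bind.add_argument("--agent-id", required=True)
    bind.add_argument("--kb-id", required=True)
    return parser


def run(argv):
    """Validate argv, then return the request a real CLI would send to the API."""
    args = vars(build_parser().parse_args(argv))
    command = args.pop("command")
    # A real implementation would call the REST API here.
    return {"command": command, "params": args}
```

&lt;p&gt;Calling &lt;code&gt;run(["create", "--name", "test_agent", "--model", "no-such-model"])&lt;/code&gt; makes &lt;code&gt;argparse&lt;/code&gt; print a usage error and exit with status 2 before any request is built, so a bad invocation fails at the first step instead of several steps later.&lt;/p&gt;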

&lt;h3&gt;
  
  
  Layer Two: Skills Development Based on CLI SDK
&lt;/h3&gt;

&lt;p&gt;With CLI commands in place, we further abstract them into AI-callable &lt;strong&gt;Skills&lt;/strong&gt;. In environments like Claude Code or OpenCLAW, a Skill is a "tool" with detailed descriptions, parameter definitions, and invocation examples. For instance, a &lt;code&gt;create_agent&lt;/code&gt; Skill definition includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: Creates a new Agent in the platform, requires specifying name, model, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: name (string), model (string), knowledge_bases (list), skills (list)...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invocation&lt;/strong&gt;: Internally executes &lt;code&gt;agent-cli create ...&lt;/code&gt; command.&lt;/li&gt;
&lt;/ul&gt;
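
&lt;p&gt;The three parts of a Skill definition listed above can be sketched as plain data plus one method. The concrete Skill file format depends on the AI environment, so this Python version only illustrates the information a Skill carries. The &lt;code&gt;Skill&lt;/code&gt; class and its &lt;code&gt;to_argv&lt;/code&gt; method are hypothetical names, and only the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; parameters are modeled to keep the example short.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    parameters: dict    # parameter name -> expected Python type
    cli_command: tuple  # fixed prefix of the CLI invocation

    def to_argv(self, **values):
        """Check values against the declared parameters and build the CLI argv."""
        unknown = set(values) - set(self.parameters)
        missing = set(self.parameters) - set(values)
        if unknown or missing:
            raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
        argv = list(self.cli_command)
        for key, expected in self.parameters.items():
            if not isinstance(values[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
            # Map snake_case parameter names to --kebab-case flags.
            argv += ["--" + key.replace("_", "-"), str(values[key])]
        return argv


create_agent = Skill(
    name="create_agent",
    description="Creates a new Agent in the platform; requires a name and a model.",
    parameters={"name": str, "model": str},
    cli_command=("agent-cli", "create"),
)
```

&lt;p&gt;With this shape, &lt;code&gt;create_agent.to_argv(name="test_agent", model="gpt-4o")&lt;/code&gt; yields the argument list for &lt;code&gt;agent-cli create&lt;/code&gt;, while a misspelled or missing parameter is rejected before anything is executed.&lt;/p&gt;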

&lt;p&gt;What the AI now "sees" is no longer hundreds of pages of API documentation but a clear menu of capabilities. &lt;strong&gt;It only needs to plan "which Skills to call, in what order, with what parameters" based on the human's natural language task.&lt;/strong&gt; Business rules get "compiled" into Skill constraints. For example, &lt;code&gt;publish&lt;/code&gt; can only be called after an Agent is saved. This dependency can be declared explicitly in the Skill description or enforced through internal state checks, which blocks illegal invocations directly.&lt;/p&gt;
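
&lt;p&gt;One way to enforce such a dependency through internal state checks is a small prerequisite table consulted before every invocation. The sketch below is hypothetical throughout: the &lt;code&gt;PREREQUISITES&lt;/code&gt; table, the &lt;code&gt;Harness&lt;/code&gt; class, the &lt;code&gt;SkillOrderError&lt;/code&gt; exception, and the separate &lt;code&gt;save&lt;/code&gt; Skill are illustrative names, and the real CLI call is left out.&lt;/p&gt;

```python
class SkillOrderError(Exception):
    """Raised when a Skill is invoked before its prerequisites."""


# Assumption: each Skill lists the Skills that must already have succeeded
# for the same Agent.
PREREQUISITES = {
    "create_agent": set(),
    "save": {"create_agent"},
    "publish": {"save"},
    "chat": {"publish"},
}


class Harness:
    def __init__(self):
        self.completed = {}  # agent_id -> set of Skill names already run

    def invoke(self, skill, agent_id):
        done = self.completed.setdefault(agent_id, set())
        missing = PREREQUISITES[skill] - done
        if missing:
            # The message names the missing step so the caller can re-plan.
            raise SkillOrderError(f"{skill} requires {sorted(missing)} first")
        # The real CLI call would happen here.
        done.add(skill)
        return f"{skill} ok"
```

&lt;p&gt;Invoking &lt;code&gt;publish&lt;/code&gt; on an Agent that has only been created raises &lt;code&gt;SkillOrderError&lt;/code&gt; immediately, and the error text tells the AI which step it skipped.&lt;/p&gt;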

&lt;p&gt;This is the core value of Harness Engineering: &lt;strong&gt;transferring the responsibility of "understanding how the business operates" from AI to pre-designed tools and constraints, while preserving AI's advantages in task planning and semantic judgment.&lt;/strong&gt; AI no longer needs to derive business processes from scratch—it operates within a safe sandbox, completing goals using standardized actions. &lt;strong&gt;Regardless of whether the system under test is an Agent platform, trading system, or approval workflow—as long as its behavior can be driven via APIs, this "harness" applies.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Practical Implementation: Completing Integration Tests Directly with Natural Language
&lt;/h2&gt;

&lt;p&gt;After this transformation, our test execution follows this process (still using the Agent platform as the example, but readers can replace Skills with their own business operations):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A tester (or product manager, developer) opens Claude Code and describes the test scenario in natural language.&lt;/li&gt;
&lt;li&gt;AI calls our pre-configured business Skills according to the description, operating the system in the correct sequence.&lt;/li&gt;
&lt;li&gt;Each step's response is parsed by AI; final results undergo semantic assertion via LLM as a Judge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A real test instruction might look like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Help me create an Agent named "Weather Assistant" that uses the existing gpt-4o model in the system, configure the weather_search skill and the travel_kb knowledge base, then save it and publish it to user lily. Then, as lily, ask this Agent: "Do I need an umbrella in Hangzhou tomorrow?" Expected: the response contains call records for the weather_search skill, the final reply semantically indicates whether it will rain in Hangzhou tomorrow, and TTFT does not exceed 2 seconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI independently plans the task sequence, calls Skills like &lt;code&gt;create_agent&lt;/code&gt;, &lt;code&gt;bind_knowledge&lt;/code&gt;, &lt;code&gt;add_skill&lt;/code&gt;, &lt;code&gt;publish&lt;/code&gt;, &lt;code&gt;chat&lt;/code&gt;, collects the final returned result, then gives a pass/fail conclusion based on our preset Judge rules (semantic similarity, keyword inclusion, TTFT metrics). The entire process &lt;strong&gt;requires zero hardcoded test functions&lt;/strong&gt;.&lt;/p&gt;
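
&lt;p&gt;The Judge step can be pictured as a function that combines the deterministic checks with a pluggable semantic check. This is a rough sketch rather than our actual Judge Skill. The response fields &lt;code&gt;skill_calls&lt;/code&gt;, &lt;code&gt;answer&lt;/code&gt;, and &lt;code&gt;ttft&lt;/code&gt; follow the pytest example earlier in the article, while &lt;code&gt;judge&lt;/code&gt;, &lt;code&gt;Verdict&lt;/code&gt;, and &lt;code&gt;stub_semantic_check&lt;/code&gt; are hypothetical names. The stub stands in for the LLM call so the example runs offline.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reasons: list


def judge(response, expected_skill, max_ttft, semantic_check: Callable[[str], bool]):
    """Collect every failed rule so the verdict explains why a test failed."""
    reasons = []
    if expected_skill not in response["skill_calls"]:
        reasons.append(f"missing skill call: {expected_skill}")
    if response["ttft"] > max_ttft:
        reasons.append(f"ttft {response['ttft']}s exceeds {max_ttft}s")
    if not semantic_check(response["answer"]):
        reasons.append("answer failed the semantic check")
    return Verdict(passed=not reasons, reasons=reasons)


def stub_semantic_check(answer):
    # Stand-in for an LLM call: a real Judge would prompt a model instead.
    return "rain" in answer.lower()


sample = {
    "skill_calls": ["weather_search"],
    "answer": "No rain is expected in Hangzhou tomorrow.",
    "ttft": 1.2,
}
verdict = judge(sample, "weather_search", 2.0, stub_semantic_check)
```

&lt;p&gt;Keeping the skill-call and TTFT rules outside the model means only the semantic judgment depends on an LLM, which narrows down where evaluation bias can enter.&lt;/p&gt;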

&lt;p&gt;For other systems, the scenario is equally natural. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the order system, help me create an order containing Product A and Product B for user Zhang San's account, using coupon code "SUMMER". The expected order total price should be 20% off the original price, and the order status should be "pending payment."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As long as the order system's APIs are wrapped into corresponding &lt;code&gt;order-cli&lt;/code&gt; commands and Skills, AI can execute this test in exactly the same way. What we've done is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrapped all RESTful APIs into CLI (Layer One of the Harness);&lt;/li&gt;
&lt;li&gt;Mapped CLI commands to Skills in the AI environment (Layer Two);&lt;/li&gt;
&lt;li&gt;Equipped AI with a Judge Skill for result assertion;&lt;/li&gt;
&lt;li&gt;Let humans define inputs and expectations in natural language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental form of testing has changed: &lt;strong&gt;test cases themselves revert to natural language descriptions of behavior and expectations, while execution and assertion are handed to constrained AI.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Why This Path Works: A Review of Automated Testing Evolution
&lt;/h2&gt;

&lt;p&gt;Looking back at the path automated testing has traveled, a clear evolutionary trajectory emerges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Past&lt;/strong&gt;: Natural language test cases guided testers' &lt;strong&gt;manual operations&lt;/strong&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Present&lt;/strong&gt;: Natural language test cases guide testers to &lt;strong&gt;write automation scripts&lt;/strong&gt; (pytest, Selenium, etc.), then CI pipelines execute them on schedule;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent attempts&lt;/strong&gt;: We tried feeding test cases and code together to AI and letting it generate and execute test scripts, but hallucinations and coordination costs ran high;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future (what we're doing now)&lt;/strong&gt;: Testers encapsulate business capabilities as Skills (Harness), then write test cases directly in natural language, letting Agents execute them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This final stage essentially returns testing to its original state: &lt;strong&gt;describing in natural language what the software should do and what we expect to see.&lt;/strong&gt; The difference is: now there's an AI Agent "trained with a harness" faithfully completing this verification work, rather than requiring humans to act as translators converting natural language to code. This evolution is universal because the ultimate goal of testing almost any software system can be framed as "trigger a business process under specific conditions, then verify the results"—and Harness Engineering happens to provide a standardized execution and verification framework decoupled from specific business logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Benefits and Reflections
&lt;/h2&gt;

&lt;p&gt;After deploying this testing system on our Agent platform and other business systems, we've seen significant benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dramatically improved test authoring efficiency&lt;/strong&gt;: In our experience, describing an end-to-end scenario in natural language is roughly 5x faster than writing the equivalent Python code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plummeting maintenance costs&lt;/strong&gt;: When business APIs change, only the corresponding CLI commands and Skill implementations need modification—all natural language test cases continue running without any changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dissolving team collaboration boundaries&lt;/strong&gt;: Product managers can write acceptance criteria directly as natural language test cases, which AI automatically executes before each release—true "living documentation."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests as documentation&lt;/strong&gt;: Natural language test cases are inherently highly readable, eliminating the burden of maintaining separate test documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, challenges exist. Harness Engineering requires upfront investment in designing sensible CLI interfaces and Skill abstractions, which in turn requires test developers to understand the business deeply. That investment is largely one-time, though, and the pattern can be reused across projects. In addition, LLM as a Judge can still produce biased evaluations in ambiguous scenarios, so the Judge prompts and evaluation criteria need continuous calibration.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Conclusion: Returning Testing to Its Essence
&lt;/h2&gt;

&lt;p&gt;The deepest lesson from this exploration: &lt;strong&gt;the key to successful AI deployment is often not giving it more freedom, but giving it precisely the right amount of constraints.&lt;/strong&gt; By building the "harness"—CLI SDK and Skill system—we give AI both direction and boundaries when understanding business systems. This method isn't limited to Agent systems; it equally applies to order, approval, configuration, data flow, and all other software scenarios. Ultimately, testing behavior returns to its original human communication form: describing in sentences what we want the system to do and how we determine it did it right.&lt;/p&gt;

&lt;p&gt;When the day comes that any team member can type natural language into a chat box and drive a full business-process automated test, only then can we truly say: testing doesn't impede delivery—it is delivery.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;This article distills our practical experience applying AI testing across multiple business systems, and it owes much to the conceptual leap from Context Engineering to Harness Engineering. If you're exploring AI-driven testing on any API-exposed product, let's connect.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
