眭林飞（Yabo.sui）

Posted on Jun 14 • Edited on Jun 23

Stop AI Hallucinations: How to Make Natural Language Testing Real with "Harness Engineering"

#ai #llm #automation #testing

Abstract

When the system under test is a business-process-intensive software system (such as a configurable AI Agent platform), traditional automation testing hits a ceiling—API combinatorial explosion, business understanding gaps, and soaring maintenance costs. This article documents our complete journey from pytest hardcoded wrappers to building a CLI SDK and Skill system, ultimately achieving "describe a test scenario in natural language and it executes." While the Agent system serves as our example, this approach essentially applies to any software product exposing APIs—e-commerce, fintech, SaaS, or internal tools. The core insight shifts from Context Engineering (feeding code to AI for test generation) to Harness Engineering (putting a "harness" on AI), fully leveraging AI's decision-making capabilities while constraining how it understands the business system—bringing testing back to its original form: describing behavior and expectations in natural language.

1. Background: When Business System Complexity Exceeds Traditional Automation Testing

To illustrate concretely, consider a system we deeply tested: a user-facing Agent building platform where users can:

Create an Agent and name it;
Select and configure the underlying large language model;
Bind knowledge sources (document libraries, databases, etc.);
Configure skills (plugins, tool calls, workflows);
Save and publish the Agent to designated users;
Chat with the Agent as a specific user and receive structured responses with skill call records and citation sources.

Each operation has an independent RESTful API. Creating an Agent is one API, binding knowledge sources is another, and initiating a chat requires yet another API with session context. This business-process-intensive, complex API orchestration is equally common in e-commerce checkout, loan approval, and SaaS multi-tenant configuration. The Agent platform's testing challenges represent a whole class of software systems' common pain points.

Initially, our automation testing strategy was "standard": write wrapper functions for each API in pytest, then call these functions sequentially in test cases with hardcoded parameters, capturing responses, extracting fields, and asserting. For example, an "end-to-end Q&A" test looked like this:

def test_agent_qa_with_skill():
    # 1. Create agent
    agent_id = create_agent(name="test_agent", model="gpt-4o")
    # 2. Bind knowledge base
    bind_knowledge(agent_id, kb_id="kb_123")
    # 3. Configure skill
    add_skill(agent_id, skill_name="weather_search")
    # 4. Save and publish
    publish(agent_id, user="user_001")
    # 5. Chat as user
    resp = chat(agent_id, user="user_001", question="What's the weather in Shanghai today?")
    # 6. Assert skill call records and semantics
    assert "weather_search" in resp["skill_calls"]
    assert "sunny" in resp["answer"]
    assert resp["ttft"] < 2.0

This approach worked when business logic was simple, but as the system rapidly expanded, problems multiplied:

Skyrocketing maintenance costs: A single business change could require modifying hardcoded parameters across dozens of test functions.
Limited expressiveness: Testers spent most of their effort orchestrating API call sequences and data passing, rather than describing "how this feature should behave."
Non-developers excluded: Product managers and QA engineers with clear behavioral expectations couldn't contribute to automated test design without writing Python code.

We needed a new approach—one where test authoring and execution more closely mirror how humans describe software behavior: express test cases in natural language, and have those cases execute reliably. And this approach shouldn't be limited to Agent platforms; it should generalize to any business system driven by APIs.

2. Inspiration: LLM as a Judge and AI-Driven Interactive Tools

Inspired by the LLM as a Judge pattern, we began experimenting with large models to evaluate semantic consistency of test results rather than rigidly matching keywords. Simultaneously, the popularity of AI interaction tools like Claude Code and OpenCLAW showed us another possibility: if developers can use natural language to make AI operate terminals, read/write files, and call external tools, could we also enable AI to directly operate our business systems?

A naive idea was: feed the entire API documentation, usage examples, and even source code to a large model, then ask it to generate test code and execute it based on natural language descriptions. This is typical Context Engineering—providing complete context and expecting AI to independently understand the business, orchestrate APIs, and produce correct tests.

But reality taught us lessons quickly:

Severe hallucinations: Large models would "invent" non-existent API endpoints, incorrect parameter combinations, even fabricate business rules.
API coordination out of control: Operations like creating resources, configuring properties, and triggering actions have strict sequential dependencies and state passing (e.g., you must use the ID returned from the previous step for the next step). Letting AI decide autonomously resulted in frequently incorrect call sequences, making trial-and-error costs prohibitively high.
Long feedback loops: When tests themselves depend on AI-generated code, failures made it difficult to distinguish between business defects and generated code problems.

We realized we couldn't let AI fumble in the dark. We needed to give it a "map" and a set of "standard actions," enabling it to exercise decision-making capabilities within defined boundaries without deviating from the correct business path. This "map" shouldn't be customized for just one type of system—it should become a reusable pattern.

3. From Context Engineering to Harness Engineering

Our proposed solution is Harness Engineering. The metaphor is straightforward: put a harness on AI, letting it run in a direction rather than charging blindly across open terrain.

Implementing the Harness has two core layers, applicable to any system with APIs:

Layer One: CLI SDK Wrapping for Business APIs

We wrapped all core system APIs into a command-line interface (CLI), where each business operation becomes a command with explicit parameters and outputs. For our Agent platform:

agent-cli create --name "test_agent" --model "gpt-4o"
agent-cli bind-knowledge --agent-id <id> --kb-id "kb_123"
agent-cli add-skill --agent-id <id> --skill "weather_search"
agent-cli publish --agent-id <id> --user "user_001"
agent-cli chat --agent-id <id> --user "user_001" --question "What's the weather in Shanghai?"

For an e-commerce system, this might become order-cli create, order-cli add-item, order-cli checkout, etc. The CLI layer enforces correct invocation methods, with parameter validation completed before command execution. Any non-compliant invocation immediately receives clear error feedback, rather than failing several steps later.
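The fail-fast validation described above can be sketched with Python's standard argparse module. This is only an illustration under assumptions: the ALLOWED_MODELS set and the choice of argparse are ours, and only two of the commands are shown. It is not the actual agent-cli implementation.

```python
import argparse
from typing import List

# Assumption: a fixed catalogue of models the platform accepts.
ALLOWED_MODELS = {"gpt-4o"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser in which every business operation is a subcommand."""
    parser = argparse.ArgumentParser(prog="agent-cli")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", help="Create a new Agent")
    create.add_argument("--name", required=True)
    create.add_argument("--model", required=True, choices=sorted(ALLOWED_MODELS))

    publish = sub.add_parser("publish", help="Publish an Agent to a user")
    publish.add_argument("--agent-id", required=True)
    publish.add_argument("--user", required=True)
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    """Validate argv before any API call is made.

    On invalid input argparse prints a usage message to stderr and
    exits with status 2, so the caller gets feedback immediately.
    """
    return build_parser().parse_args(argv)
```

A missing flag or an unknown model is rejected at parse time, before any request reaches the platform, which is the property the CLI layer is meant to provide.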

Layer Two: Skills Development Based on CLI SDK

With CLI commands in place, we further abstract them into AI-callable Skills. In environments like Claude Code or OpenCLAW, a Skill is a "tool" with detailed descriptions, parameter definitions, and invocation examples. For instance, a create_agent Skill definition includes:

Description: Creates a new Agent in the platform, requires specifying name, model, etc.
Parameters: name (string), model (string), knowledge_bases (list), skills (list)...
Invocation: Internally executes agent-cli create ... command.
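One way to picture such a definition is as a small data structure that knows how to turn validated parameters into a CLI invocation. The sketch below is an assumption about shape only: the Skill class, its fields, and the flag mapping are hypothetical, and it covers just the name and model parameters.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Skill:
    """Hypothetical Skill record: description plus a mapping onto one CLI command."""
    name: str
    description: str
    command: List[str]        # CLI prefix, e.g. ["agent-cli", "create"]
    required: Dict[str, str]  # parameter name -> CLI flag

    def to_argv(self, **params: str) -> List[str]:
        """Reject missing or unknown parameters, then build the argument vector."""
        missing = [p for p in self.required if p not in params]
        unknown = [p for p in params if p not in self.required]
        if missing or unknown:
            raise ValueError(f"{self.name}: missing={missing} unknown={unknown}")
        argv = list(self.command)
        for param, flag in self.required.items():
            argv += [flag, str(params[param])]
        return argv


CREATE_AGENT = Skill(
    name="create_agent",
    description="Creates a new Agent in the platform; requires a name and a model.",
    command=["agent-cli", "create"],
    required={"name": "--name", "model": "--model"},
)

# Render the invocation as a shell-safe string for logging.
print(shlex.join(CREATE_AGENT.to_argv(name="test_agent", model="gpt-4o")))
```

In a real harness the resulting list could be handed to subprocess.run; building the argument vector as a list rather than a formatted string avoids shell-quoting problems when the AI supplies free-text values such as questions.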

What the AI now "sees" is no longer hundreds of pages of API documentation but a clear menu of capabilities. It only needs to plan which Skills to call, in what order, and with what parameters, based on the human's natural language task. Business rules get "compiled" into Skill constraints. For example, publish can only be called after an Agent is saved; this dependency can be declared explicitly in the Skill description or enforced through internal state checks, so an illegal invocation is rejected before it reaches the system.
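A minimal sketch of such a state check follows. The PRECONDITIONS table is hypothetical: the skill names mirror the ones used in this article, save_agent is assumed to exist as a separate step, and the dependencies shown are illustrative rather than the platform's real rules.

```python
from typing import Dict, List, Set

# Hypothetical dependency table: skill -> skills that must already have succeeded.
PRECONDITIONS: Dict[str, Set[str]] = {
    "create_agent": set(),
    "bind_knowledge": {"create_agent"},
    "add_skill": {"create_agent"},
    "save_agent": {"create_agent"},
    "publish": {"save_agent"},
    "chat": {"publish"},
}


def validate_plan(plan: List[str]) -> List[str]:
    """Return human-readable violations; an empty list means the plan is legal."""
    errors: List[str] = []
    done: Set[str] = set()
    for step, skill in enumerate(plan, start=1):
        if skill not in PRECONDITIONS:
            # A skill outside the menu is treated as a hallucinated action.
            errors.append(f"step {step}: unknown skill '{skill}'")
            continue
        missing = PRECONDITIONS[skill] - done
        if missing:
            errors.append(f"step {step}: '{skill}' requires {sorted(missing)} first")
            continue  # a rejected step does not count as completed
        done.add(skill)
    return errors
```

Returning the violations as text, rather than only raising, gives the AI something concrete to re-plan against.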

This is the core value of Harness Engineering: transferring the responsibility of "understanding how the business operates" from AI to pre-designed tools and constraints, while preserving AI's advantages in task planning and semantic judgment. AI no longer needs to derive business processes from scratch—it operates within a safe sandbox, completing goals using standardized actions. Regardless of whether the system under test is an Agent platform, trading system, or approval workflow—as long as its behavior can be driven via APIs, this "harness" applies.

4. Practical Implementation: Completing Integration Tests Directly with Natural Language

After this transformation, our test execution follows this process (still using the Agent platform as the example, but readers can replace Skills with their own business operations):

A tester (or product manager, developer) opens Claude Code and describes the test scenario in natural language.
AI calls our pre-configured business Skills according to the description, operating the system in the correct sequence.
Each step's response is parsed by AI; final results undergo semantic assertion via LLM as a Judge.

A real test instruction might look like:

Help me create an Agent named "Weather Assistant," using the existing gpt-4o model in the system, configure the weather_search skill and travel_kb knowledge base, save it and publish to user lily. Then, as lily, ask this Agent: "Do I need an umbrella for Hangzhou tomorrow?" The expected answer should show the weather_search skill's call records in the response, the final reply should semantically indicate whether it will rain in Hangzhou tomorrow, and TTFT should not exceed 2 seconds.

AI independently plans the task sequence, calls Skills like create_agent, bind_knowledge, add_skill, publish, chat, collects the final returned result, then gives a pass/fail conclusion based on our preset Judge rules (semantic similarity, keyword inclusion, TTFT metrics). The entire process requires zero hardcoded test functions.
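The Judge step can be pictured as deterministic checks combined with one pluggable semantic check. The sketch below is an assumption about structure, not our actual Judge Skill: the response fields skill_calls, answer, and ttft follow the earlier pytest example, while the Verdict type, the semantic_score callable (which in practice would wrap an LLM call), and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    passed: bool
    reasons: List[str]


def judge(
    response: dict,
    expected_skill: str,
    expectation: str,
    max_ttft: float,
    semantic_score: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed cut-off; needs calibration per project
) -> Verdict:
    """Run the deterministic checks, then the semantic check, and collect failures."""
    reasons: List[str] = []
    if expected_skill not in response.get("skill_calls", []):
        reasons.append(f"skill '{expected_skill}' was not called")
    if response.get("ttft", float("inf")) > max_ttft:
        reasons.append(f"TTFT exceeded {max_ttft}s")
    if semantic_score(response.get("answer", ""), expectation) < threshold:
        reasons.append("answer does not match the expectation semantically")
    return Verdict(passed=not reasons, reasons=reasons)
```

Keeping the skill-call and TTFT checks outside the model means only the genuinely fuzzy part of the verdict depends on LLM judgment.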

For other systems, the scenario is equally natural. For example:

In the order system, help me create an order containing Product A and Product B for user Zhang San's account, using coupon code "SUMMER". The expected order total price should be 20% off the original price, and the order status should be "pending payment."

As long as the order system's APIs are wrapped into corresponding order-cli commands and Skills, AI can execute this test in exactly the same way. What we've done is:

Wrapped all RESTful APIs into CLI (Layer One of the Harness);
Mapped CLI commands to Skills in the AI environment (Layer Two);
Equipped AI with a Judge Skill for result assertion;
Let humans define inputs and expectations in natural language.
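Not every expectation needs semantic judgment. The order scenario's "20% off" and "pending payment" checks are exact, so they can be verified deterministically. The following is a sketch under assumptions: the order dictionary with total and status fields is a hypothetical response shape, and rounding to two decimal places is our choice.

```python
from decimal import Decimal
from typing import Iterable, List


def check_order(
    order: dict,
    item_prices: Iterable[Decimal],
    discount: Decimal = Decimal("0.20"),  # 20% off, as in the scenario above
) -> List[str]:
    """Compare an order response against the expected total and status."""
    original = sum(item_prices, Decimal("0"))
    expected_total = (original * (Decimal("1") - discount)).quantize(Decimal("0.01"))
    failures: List[str] = []
    if Decimal(str(order["total"])) != expected_total:
        failures.append(f"total {order['total']} != expected {expected_total}")
    if order["status"] != "pending payment":
        failures.append(f"status is '{order['status']}'")
    return failures
```

Decimal is used instead of float so that price arithmetic does not introduce binary rounding errors into the comparison.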

The fundamental form of testing has changed: test cases themselves revert to natural language descriptions of behavior and expectations, while execution and assertion are handed to constrained AI.

5. Why This Path Works: A Review of Automated Testing Evolution

Looking back at the path automated testing has traveled, a clear evolutionary trajectory emerges:

Past: Natural language test cases guided testers' manual operations;
Present: Natural language test cases guide testers to write automation scripts (pytest, Selenium, etc.), then CI pipelines execute them on schedule;
Recent attempts: Test cases and code were fed together into AI, which was asked to generate and execute test scripts, but hallucinations and coordination costs ran high;
Next (what we're doing now): Testers encapsulate business capabilities as Skills (the Harness), then write test cases directly in natural language and let Agents execute them.

This final stage essentially returns testing to its original state: describing in natural language what the software should do and what we expect to see. The difference is: now there's an AI Agent "trained with a harness" faithfully completing this verification work, rather than requiring humans to act as translators converting natural language to code. This evolution is universal because the ultimate goal of testing almost any software system can be framed as "trigger a business process under specific conditions, then verify the results"—and Harness Engineering happens to provide a standardized execution and verification framework decoupled from specific business logic.

6. Benefits and Reflections

After deploying this testing system on our Agent platform and other business systems, we've seen significant benefits:

Dramatically improved test authoring efficiency: In our experience, describing an end-to-end scenario in natural language is roughly 5x faster than writing the equivalent Python code.
Plummeting maintenance costs: When business APIs change, only the corresponding CLI commands and Skill implementations need modification. As long as the Skills keep the same interface, existing natural language test cases continue running unchanged.
Dissolving team collaboration boundaries: Product managers can write acceptance criteria directly as natural language test cases, which AI automatically executes before each release—true "living documentation."
Tests as documentation: Natural language test cases are inherently highly readable, eliminating the burden of maintaining separate test documentation.

Of course, challenges exist. Harness Engineering requires upfront investment in designing sensible CLI interfaces and Skill abstractions, which means test developers must understand the business deeply. That investment is largely one-time, though, and the pattern can be reused across projects. Additionally, LLM as a Judge can still produce evaluation bias in ambiguous scenarios, so Judge prompts and pass criteria need continuous calibration.

7. Conclusion: Returning Testing to Its Essence

The deepest lesson from this exploration: the key to successful AI deployment is often not giving it more freedom, but giving it precisely the right amount of constraints. By building the "harness"—CLI SDK and Skill system—we give AI both direction and boundaries when understanding business systems. This method isn't limited to Agent systems; it equally applies to order, approval, configuration, data flow, and all other software scenarios. Ultimately, testing behavior returns to its original human communication form: describing in sentences what we want the system to do and how we determine it did it right.

When the day comes that any team member can type natural language into a chat box and drive a full business-process automated test, only then can we truly say: testing doesn't impede delivery—it is delivery.

This article distills our practical experience applying AI testing across multiple business systems; the conceptual leap from Context Engineering to Harness Engineering is what made it work. If you're exploring AI-driven testing on any API-exposed product, let's connect.
GitHub: https://github.com/suilinfei001/nl2test

Top comments (1)

Mehmet Can Farsak • Jun 14

Solid breakdown of moving from hardcoded test wrappers to natural language testing. The "put a harness on the AI" framing is spot on — and the same problem shows up in agent workflows more broadly. I've seen agents start writing test code when they should still be brainstorming what to test.

Put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) which adds a PreToolUse hook layer that enforces ideation mode. Three modes (divergent, actionable, academic) so the agent stays in the right headspace for test design before jumping to implementation. Worth considering as part of the harness engineering toolkit.