jedrzejdocs

Kong's 7% AI Accuracy Gain Didn't Come From Better Models

Kong's AI chatbot went from an 84% to a 91% confident-answer rate.

Not from a model upgrade. Not from prompt engineering. From making their CLI guides testable.

Manny Silva shared this case study yesterday. The number is interesting. The mechanism behind it is more interesting.

The problem with AI-consumed documentation

Here's what happens when an AI chatbot answers questions from your docs:

  1. User asks a question
  2. RAG system retrieves relevant doc chunks
  3. LLM generates answer based on those chunks
  4. User gets confident-sounding response

Step 3 is where things break. If the retrieved chunk contains outdated information, the AI doesn't know it's outdated. It answers confidently. Wrongly.

A broken procedure in your docs propagates through every AI-generated response that draws from it.

This isn't theoretical. Air Canada's chatbot fabricated a refund policy because the actual policy wasn't properly documented. The AI filled the gap with plausible-sounding nonsense.

What Kong actually did

Kong's documentation team rebuilt their CLI how-to guides to be testable. Every procedure can be executed sequentially. Copy-paste commands straight down the page.

The key insight from Diana Breza and Fabian Rodriguez: tests derived directly from docs, not maintained separately. Same source = no drift.

Here's what a testable CLI procedure looks like with Doc Detective:

{
  "tests": [
    {
      "steps": [
        {
          "description": "Install Kong Gateway",
          "runShell": "curl -Ls https://get.konghq.com | bash"
        },
        {
          "description": "Verify installation",
          "runShell": "kong version",
          "exitCodes": [0]
        },
        {
          "description": "Start Kong",
          "runShell": "kong start"
        },
        {
          "description": "Check Kong is running",
          "httpRequest": {
            "url": "http://localhost:8001/status",
            "method": "get",
            "statusCodes": [200]
          }
        }
      ]
    }
  ]
}

Every step maps to a line in the documentation. When the test fails, you know exactly which instruction broke.
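
For illustration, the guide behind this test might read like the following hypothetical excerpt, with each command matching one runShell step above:

Install Kong Gateway:

curl -Ls https://get.konghq.com | bash

Verify the installation:

kong version

Start the gateway:

kong start

Kong's Admin API should now respond at http://localhost:8001/status.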

For API documentation, the same principle applies:

{
  "tests": [
    {
      "description": "Test the /services endpoint",
      "steps": [
        {
          "httpRequest": {
            "url": "http://localhost:8001/services",
            "method": "post",
            "requestData": {
              "name": "example-service",
              "url": "http://httpbin.org"
            },
            "responseData": {
              "name": "example-service"
            },
            "statusCodes": [201]
          }
        }
      ]
    }
  ]
}

The responseData field is the key. It validates that the API returns what your docs claim it returns. When the response schema changes, the test fails before users see stale documentation.
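
In the published docs, this test would sit behind an ordinary curl call. A hypothetical snippet mirroring the requestData above:

curl -X POST http://localhost:8001/services \
  --header "Content-Type: application/json" \
  --data '{"name": "example-service", "url": "http://httpbin.org"}'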

Inline tests: keep tests next to the content they validate

Doc Detective also supports inline tests in Markdown. Tests live in comments that don't render in the final output.

You write your normal documentation, then add a comment line like this:

[comment]: # (step {"httpRequest": {"url": "...", "statusCodes": [201]}})

The test definition sits right next to the curl command it validates. Same source file. No separate test suite to maintain. When someone updates the docs, the tests update too — or they break and you know immediately.
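
Putting the pieces together, a docs page might look like this hypothetical snippet, reusing the /services call from the previous section:

Create a service:

curl -X POST http://localhost:8001/services \
  --header "Content-Type: application/json" \
  --data '{"name": "example-service", "url": "http://httpbin.org"}'

[comment]: # (step {"httpRequest": {"url": "http://localhost:8001/services", "method": "post", "statusCodes": [201]}})

The comment never renders; the test travels with the command it checks.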

Run it with:

npx doc-detective --input your-docs-folder/

Why 7% matters more than it sounds

A 91% confident-answer rate means roughly 9 out of 10 user questions get resolved without human intervention.

84% means roughly 8 out of 10.

That's not a 7% improvement. That's a 44% reduction in unresolved queries.
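
The arithmetic: unresolved queries fall from 16 in 100 to 9 in 100, and (16 − 9) / 16 ≈ 44%.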

For context: AssemblyAI reduced first response time from 15 minutes to 23 seconds with AI-powered documentation routing. One API company cut support tickets by 45% simply by improving its error documentation.

The pattern is consistent. Documentation accuracy directly determines AI system effectiveness.

The RAG accuracy chain

AI21 Labs research found that structured RAG approaches improved accuracy by up to 60% compared to traditional methods. The mechanism: transforming unstructured documents into structured, query-aware representations.

Testable documentation does something similar. When every procedure is executable:

Conflicting information gets caught — you can't have two different procedures for the same outcome if both are tested

Outdated content fails tests — product changes break doc tests before users encounter stale instructions

Missing steps become obvious — if you can't execute the procedure sequentially, something's missing

The documentation becomes self-validating. And self-validating docs make better RAG sources.

What this means for your docs

Most documentation teams measure page views, time on page, maybe feedback scores. None of these metrics tell you if the content is correct.

Kong's approach adds a binary signal: does this procedure work or not?

Some practical implications:

For API docs: If your getting started guide requires users to mentally interpolate steps, an AI chatbot will do the same — and get it wrong. Write procedures that execute without human judgment.

For CLI docs: Every command sequence should be copy-pasteable. If users need to modify commands based on context, document that context explicitly (see the sketch below).

For error documentation: This is where most docs fail hardest. Document what each error means, why it happens, and how to fix it. AI chatbots field error questions constantly.
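
For the CLI point, one way to make context explicit while keeping commands copy-pasteable is to front-load it into a variable. A minimal sketch, assuming Kong's default Admin API address; KONG_ADMIN_URL is a hypothetical name, not something Kong or Doc Detective defines:

# Set this once for your environment. The default Admin API address is shown.
export KONG_ADMIN_URL=http://localhost:8001

# Every command below now copies cleanly, with no mental substitution required.
curl "$KONG_ADMIN_URL/status"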

The uncomfortable truth

GPT-4 and Claude frequently fabricate API calls when given summarized OpenAPI definitions. UC Berkeley's Gorilla research documented this. Parasoft observed responses with "two APIs that were real and one that was made up."

Your AI chatbot is only as good as your docs. If your docs are untested, your AI is guessing. Confidently.

Kong's 7% improvement came from removing the guesswork. Not from the AI getting smarter.


The Docs as Tests methodology was developed by Manny Silva, currently Head of Docs at Skyflow. His book "Docs as Tests: A Strategy for Resilient Technical Documentation" covers the framework in detail. Doc Detective, the open-source tool for documentation testing, hit v3.6.3 this month.
