DEV Community

Cover image for For the First Time, Zero Confabulation Is Reproducible on Any AI: Open Sourcing ConteX Law
Russel Hawkins
Russel Hawkins

Posted on

For the First Time, Zero Confabulation Is Reproducible on Any AI: Open Sourcing ConteX Law

I am a developer with 27 years of coding behind me. Experience taught me something easy to overlook in the current rush: if one human being can envisage an idea, another can take it apart. My introduction to AI quickly became an exercise in frustration, and I suspect everyone reading this has felt the same thing. AI models are confabulation engines. By that I do not mean they are broken. They produce an answer by predicting what should plausibly come next, not by checking what is true, so wherever they have no firm ground they fill the gap with something that simply reads right.

The Result

CLARA, ConteX Law, LINGO and AXIOM, demonstrated together in CON10X 4N6, is the result of three years solving that problem. For the first time, zero-confabulation output is reproducible on any AI model, in any domain, against a wildcard prompt, an uploaded PDF document, or an uploaded image. Not "fewer hallucinations." Zero, and reproducible, run after run, model after model.

That is the claim this letter exists to make, and I am publishing it in the form of the work itself: CON10X Web Domain is open source on GitHub today, and CON10X 4N6 is free for life for anyone to use. Everything below explains how it works and why it matters. Nothing below is required to verify the claim. Download CON10X 4N6, point it at anything, and watch what the model does.

Why Models Confabulate

The model makers admit it. OpenAI's own research traces the problem to how the models are trained: built to reproduce the patterns in their training data, not to check what is true, so when they are unsure they guess something plausible rather than say they do not know. Hallucination is a marketing word for that. The architecture is probabilistic: it predicts the next most likely token, so confabulation is a property of how the model works, not a surface bug a later version quietly removes. No amount of compute changes what the architecture is doing. OpenAI's own answer is that the model can abstain when unsure. Yet on independent testing by Artificial Analysis in April 2026, its newest and most capable model, GPT-5.5, confabulated instead of abstaining on 86% of the questions it got wrong. That is the worst of any frontier model, even as it posted the highest accuracy on record. The escape the makers point to is the one their own flagship will not take.

I decided to take the problem on. The question was not how to build a better model, but how to describe a domain so completely that the model has nothing left to invent. That meant finding the smallest set of dimensions a specification needs to close the gap a model would otherwise fill with probability. What survived that process was four non-overlapping pillars of truth: Structure, Behaviour, Influence and Objective. Together they are ConteX Law (SSRN abstract=6970199).

The Four Pillars

Structure defines the shape of the domain: the entities that exist, how they relate, and the form a valid answer must take. Code works because its structure is explicit. Structure does the same thing for any domain. It gives a skeleton to transcribe into, rather than a blank space to guess at.

Behaviour defines the rules that govern that structure: what is permitted, what is forbidden, what depends on what. This is the layer that lets a defect be a defect: a thing that breaks a stated rule, not a thing the model happens to dislike.

Influence defines the authority the answer must answer to: the sources, precedents, mandates and constraints that sit outside the model and outrank it. Where that authority is silent, the model must say so rather than fill the gap itself.

Objective defines what the answer is for: the mandate it is measured against. Without a stated objective there is no such thing as a wrong answer, only a plausible one.

Fill those four pillars accurately and the model is no longer predicting the next probable token across an open field. It is transcribing a domain that has already been pinned down. The probability space is collapsed at the input, before the model ever runs, which is why the result is reproducible and does not depend on which model you use.

ConteX Law alone is a specification, not an enforcement mechanism. Three more pieces make it real:

LINGO is the deterministic linguistic gated engine that holds generation to the four pillars at the point of writing, using the same linguistic capability the model already has. Confabulation is stopped at generation, not flagged afterward. That is why the four pillars can only be satisfied through LINGO, and ConteX Law cannot operate without it.

CLARA governs everything entering the AI: validates domain fingerprints, seals the prompt end to end with cryptographic integrity, and rejects tampering before any model ever sees the request (SSRN abstract=6652458).

AXIOM verifies every citation in the output against primary registrars: Crossref, arXiv, and OpenAlex. It validates book references by ISBN, then certifies the result. This is the layer that catches the exact failure mode making headlines: fabricated case citations, invented sources, references to documents that do not exist.

CLARA governs the input. ConteX Law specifies the domain. LINGO enforces it during generation. AXIOM certifies the citations in what comes out. Demonstrated together in CON10X 4N6, that is the full stack, and it is what makes zero confabulation reproducible rather than a one-off.

The First Proof

The first proof of concept was a web transformation engine, CON10X Web Domain. It takes any website or WebView2 mobile application and reproduces it identically across 11 web frameworks or 5 native mobile frameworks. No Figma, no Penpot, no design stage: straight to production-ready code that reproduces identically every time. As of today that engine is open source. Use it, improve it, do what you like with it.

CON10X Web Domain proved an AI model could be made to produce reliable, reproducible code. It left the harder question open: how do you do this for any domain, against a wildcard prompt? Code gives a model explicit structure to transcribe into. A wildcard prompt gives it none of that. It is unstructured general knowledge: any topic, with no sequencing or scaffolding done in advance.

RAG Is Not the Problem

RAG is not the problem, and I am not here to tell you it is useless. Retrieval is sound. The flaw is what RAG still depends on at the final step: a probabilistic AI model asked to turn retrieved material into an accurate answer. That is the one thing the architecture cannot guarantee, because predicting the next likely token is not the same as stating what is true, and no amount of retrieval quality changes that. The problem has become harder to see, not easier, because newer models confabulate convincingly enough to pass an entire ecosystem of skilled, qualified experts without detection.

This is an open invitation to developers building RAG systems. ConteX Law is not a competitor to your work. It moves the determination of truth to the input layer instead of leaving it to the model at the end. Understand the four pillars and you change two things at once: how you build a RAG system, and which AI model you are free to run it on.

CON10X 4N6: The Stack, Demonstrated

CON10X 4N6 is free for life as of today, and it is where CLARA, ConteX Law, LINGO and AXIOM run together. Wildcard prompt, upload a PDF, even a scanned one with no digital text layer, or upload a photo. The engine grounds the input, completes the four pillars at the input layer, and the output holds to what was actually given rather than inventing around it. Every citation in the resulting report is verified against a primary registrar by AXIOM before the report is certified.

This proves ConteX Law is model-agnostic. The most important disclosure: it is not about the capability of the AI model or the compute behind it. In my own testing, CON10X 4N6 running ConteX Law on an open-weights Qwen 3.6 27B model produced the same forensic findings as Claude Opus 4.8 running the same pipeline. Claude Opus 4.8 on its own, without ConteX Law, flagged under 20% of the defects and fabrications in that test.

You do not have to take my word for it. The work is documented across four papers on SSRN. Three are already published: the misdiagnosis paper on why AI confabulates and what it has cost (abstract 6609519), the dual-use disclosure and governance paper (abstract 6641679), and the CLARA self-governing architecture paper (abstract 6652458). The fourth, which states ConteX Law in full and sets out the completeness gate so anyone can run the falsification test themselves (abstract 6970199), is awaiting distribution. The test is reproducible. Run it yourself, on any of the AI models supported in CON10X 4N6, and you will get the same result.

Why This Matters

South Africa's draft National AI Policy passed through a full ecosystem of expert review: legal, academic, financial, technical. It was still approved by Cabinet. Not one stage caught that the document failed its own mandate. A CON10X 4N6 forensic audit found 131 substantive defects: provisions naming no accountable actor, no mechanism, no timeline, and claims with nothing behind them. The fabricated citations that later reached the news were the trivial part. They are the surface error even a simple checker catches. The substance failure, the part that actually mattered, was caught by no human at any stage. That is the measure of how convincing AI-generated output has become, and it is getting harder, not easier, to catch by reading. Detection has to move off the human reader and onto something deterministic. AXIOM and LINGO together are what surfaced the 131 defects the expert reviewers had passed.

It is not a one-off, and it is not only government. In April 2026 the Wall Street firm Sullivan & Cromwell admitted to a federal bankruptcy filing containing AI-generated errors, including fabricated citations and references to cases that did not exist. The firm's own internal review did not catch them. Opposing counsel did. Industry analysts now say it plainly: human oversight can no longer protect customers from these errors at enterprise scale, because a model delivers a wrong answer with the same fluency as a right one, and no team can audit a million interactions a day. This is a present and expensive enterprise problem, and it is the exact gap this work closes.

Why I Am Giving It Away

My work is better served by giving it away than by holding on to it hoping to commercialise it one day. There are people far smarter than me who can take this further than I can. My one request to anyone who builds on it: apply it responsibly. It is a powerful framework for making an AI model respond truthfully, and power of that kind deserves care.

There are two ways to put this to work. Any developer can take the open-sourced code and build on ConteX Law directly. An enterprise or RAG developer who wants a working solution now can shortcut that path by asking me directly. One distinction worth noting: ConteX Law is the framework, the specification of the four pillars, and that is what is open source, built for code generation specifically. LINGO is the engine that exposes the full power of ConteX Law: the linguistic engine that completes the four pillars across all domains and drives CON10X 4N6. LINGO is not open source.

The Door This Opens

CON10X 4N6 demonstrates that an enterprise can use a capable local, on-premises open-weights AI model to produce the same accurate response it would get from a cloud-based frontier model, when both run ConteX Law. A frontier model without ConteX Law still confabulates. You need ConteX Law either way.

A frontier deployment is expensive because you are paying the model to do the reasoning: search an open probability space, weigh options, generate an answer, often more than once before it settles. ConteX Law takes that work off the model. By the time a request reaches the model, the domain has already been specified at the input. The model is transcribing a result that was pinned down before it ever ran. That is the whole reason a 27B open-weights model matched the frontier model once ConteX Law sat in front of it.

Two things follow, and both cut cost. You run a far smaller model, far cheaper to host. And the model does far less per request, so a single modest machine serves a whole team. A multi-GPU workstation able to run Qwen 3.6 27B on-premises costs in the region of $20k today, falling quarter on quarter as inference-focused cards reach the market.

The arithmetic is straightforward. A 100-employee enterprise can stand up a complete local, on-premises deployment with ConteX Law, hardware included, for around $20k. The same 100 employees on a frontier AI subscription at $100/month each cost $120k/year. The on-premises solution pays for itself inside two months and saves on the order of $100k in the first year, with everything after that amounting to little more than the electricity bill.

Do Not Take My Word For It

The fastest way to settle any of this is not to argue with me, it is to run it. Download CON10X 4N6, point it at any document, any image, or any wildcard prompt you like, on any of the supported models, and watch what the model on its own misses or invents. The claim is testable on the spot, by anyone, for free.

The source code for CON10X Web Domain and SnapStak Mobile is on GitHub, along with SnapStak Studio, the open source VS Code extension for running and refining the generated code inside the editor. CON10X 4N6 is free on the Microsoft Store.

https://github.com/SnapStak-AI/SnapStak-Web-Domain

https://github.com/SnapStak-AI/SnapStak-Mobile

https://github.com/SnapStak-AI/SnapStak-Studio

https://apps.microsoft.com/detail/9PFDQ9Q09081

For help integrating ConteX Law, contact me, Russel, on the

Top comments (0)