<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: martin</title>
    <description>The latest articles on DEV Community by martin (@tlrag).</description>
    <link>https://dev.to/tlrag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3187398%2Fc2b84b47-0b42-400e-b46e-6822f10e5a0c.png</url>
      <title>DEV Community: martin</title>
      <link>https://dev.to/tlrag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tlrag"/>
    <language>en</language>
    <item>
      <title>One Program to Slave Them All - or How to Control Every Existing Program with Agents</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Sat, 21 Feb 2026 20:52:22 +0000</pubDate>
      <link>https://dev.to/tlrag/one-programm-to-slave-them-all-or-how-to-control-every-existing-programm-with-agents-4pob</link>
      <guid>https://dev.to/tlrag/one-programm-to-slave-them-all-or-how-to-control-every-existing-programm-with-agents-4pob</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei3tp8ifg7y5d117r71c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei3tp8ifg7y5d117r71c.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  DirectShell 0.3.1 — Control Everything.
&lt;/h1&gt;




&lt;h2&gt;
  
  
  The So-Called "State of the Art" in 2026
&lt;/h2&gt;

&lt;p&gt;It's fascinating, really.&lt;/p&gt;

&lt;p&gt;One camp is out there hyping Moltbot while unknowingly leaking secrets — watching AIs talk to other AIs in circles and genuinely believing something is "emerging." Spoiler: they're just more puppets on human strings. The other camp is hyping whatever frontier model dropped this week, completely blind to the fact that we're hitting massive bottlenecks and the rate of improvement is shrinking with every release.&lt;/p&gt;




&lt;h2&gt;
  
  
  And What About Google, OpenAI, and Anthropic?
&lt;/h2&gt;

&lt;p&gt;They keep trying to brute-force marginal progress. GG.&lt;/p&gt;

&lt;p&gt;Best example? AI-powered browsers. There you sit, in the year 2026, watching an agent struggle for 25 minutes trying to operate a browser using &lt;strong&gt;images&lt;/strong&gt; — guessing where to click. Let that absurdity sink in for a moment.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;text-based&lt;/strong&gt; LLM takes screenshots. Those screenshots get converted to base64. That base64 gets sent to another AI which translates it back into text. Then the AI gets to &lt;strong&gt;guess&lt;/strong&gt; which coordinates to click. And then — brace yourselves — a SCRIPT runs. A script, people. Like it's the year 2000. It manually shoves your mouse cursor to a position and clicks. Or it injects something into the browser DOM that immediately gets detected. Can't solve CAPTCHAs. And complex tasks? Let's not even go there.&lt;/p&gt;

&lt;p&gt;State of the fucking art, gentlemen.&lt;/p&gt;




&lt;h2&gt;
  
  
  DirectShell — And Why It Starts a Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;My personal motivation wasn't to build something cool or develop some epic new primitive. It was more like: &lt;em&gt;"Dude... this is just painful at this point. There HAS to be a better way."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And that's exactly what I did. I made it better.&lt;/p&gt;

&lt;p&gt;I created a software primitive — a new foundational technology — that uses multiple data channels to control any program or browser. Whether through an agent or through scripts. This tool can read, control, and operate virtually any program, no matter how old. It doesn't need an API. It doesn't need permission. It doesn't violate any TOS or EULA. It simply uses what has been there all along — but nobody ever bothered to look at it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Talk
&lt;/h2&gt;

&lt;p&gt;DirectShell gives every program a usable SQL database and a universal AI interface — in milliseconds. It gives any AI that can use CLI or MCP the ability to control any program on your machine. It replaces proprietary API wrappers with one universal interface. As an AI browser, it uses significantly fewer tokens, takes zero screenshots, and is dramatically faster.&lt;/p&gt;
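
&lt;p&gt;To make that concrete, here's a rough sketch of what driving a program through that SQL interface can look like (the table and column names here are illustrative assumptions, not the exact shipped schema; the technical article linked below has the real details):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative sketch: table and column names are assumptions.
-- 1) Ask the snapshot database what is currently on screen.
SELECT name, role, value
FROM elements              -- hypothetical table of UI elements
WHERE is_enabled = 1;

-- 2) Queue an action; DirectShell replays it as native input.
INSERT INTO inject (action, target) VALUES ('click', 'Save');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;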

&lt;p&gt;It can solve CAPTCHAs. It can talk to other AI programs like Claude Desktop. Or it can just operate your Paint, your antivirus, or your Notepad.&lt;/p&gt;

&lt;p&gt;It's the end of slow, browser-only agents — and the beginning of something new: the ability to give every GUI native AI support.&lt;/p&gt;




&lt;h2&gt;
  
  
  And Now?
&lt;/h2&gt;

&lt;p&gt;I have absolutely no fucking clue.&lt;/p&gt;

&lt;p&gt;Several people have already reached out wanting to contribute. And that's fantastic. DirectShell is only a few days old. There are still 100 bugs — but 100x more potential to discover. We're building a reinforcement learning loop, working on lower latency, and creating config files for all kinds of programs.&lt;/p&gt;

&lt;p&gt;But this is just the beginning.&lt;/p&gt;




&lt;p&gt;Let's change something with this. I invite everyone to share it. To help with development. Or to simply give feedback.&lt;/p&gt;

&lt;p&gt;The current demo video is here: &lt;a href="https://youtu.be/rHfVj1KpCDU" rel="noopener noreferrer"&gt;https://youtu.be/rHfVj1KpCDU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repo is here: &lt;a href="https://github.com/IamLumae/DirectShell" rel="noopener noreferrer"&gt;https://github.com/IamLumae/DirectShell&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the full technical article is here: &lt;a href="https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia"&gt;https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>programming</category>
      <category>ai</category>
      <category>requestforpost</category>
    </item>
    <item>
      <title>DirectShell: I Turned the Accessibility Layer Into a Universal App Interface. No Screenshots. No Vision Models.</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Tue, 17 Feb 2026 19:00:18 +0000</pubDate>
      <link>https://dev.to/tlrag/-directshell-i-turned-the-accessibility-layer-into-a-universal-app-interface-no-screenshots-no-2457</link>
      <guid>https://dev.to/tlrag/-directshell-i-turned-the-accessibility-layer-into-a-universal-app-interface-no-screenshots-no-2457</guid>
      <description>&lt;p&gt;&lt;strong&gt;Martin Gehrken — February 17, 2026&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;As of February 17, 2026, every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Full Paper : &lt;a href="https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia"&gt;https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlgucc3qycwjse7kkj4a.png" alt=" " width="689" height="528"&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Warning Before We Begin
&lt;/h2&gt;

&lt;p&gt;I did not create a vulnerability. I discovered one that has existed since 1997.&lt;/p&gt;

&lt;p&gt;The Windows Accessibility Layer — UI Automation — exposes the complete structure, content, and state of every GUI application on every Windows machine. Every button name. Every text field value. Every menu item. Structured. Machine-readable. In real-time. Available to any process on the system.&lt;/p&gt;

&lt;p&gt;Today, I am releasing a primitive — a universal interface layer — that makes this 29-year-old capability usable. I built it. It's open source. And the tools built on top of it will follow within weeks.&lt;/p&gt;

&lt;p&gt;I chose to publish openly so that everyone learns at the same time — defenders and attackers, enterprises and researchers. Because the alternative — discovering this through a breach instead of through a paper — is worse for everyone.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every major AI lab on the planet is building autonomous desktop agents. OpenAI's Operator. Anthropic's Computer Use. Google's Project Mariner. Microsoft's Copilot Actions. Tens of billions in investment. One shared vision: AI that uses a computer like you do.&lt;/p&gt;

&lt;p&gt;And every single one of them uses the same approach. They take a screenshot. Send it to a vision model. The model guesses where buttons are. Guesses where to click. A simulated mouse moves to those coordinates. Maybe it works. Maybe not. Then another screenshot. Repeat.&lt;/p&gt;

&lt;p&gt;This is not a caricature. This is the actual architecture. In 2026, the state of the art for making AI interact with software is &lt;strong&gt;taking photos of screens and guessing where to click&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Time per Task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AskUI VisionAgent (current leader)&lt;/td&gt;
&lt;td&gt;66.2%&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI-TARS 2 (ByteDance)&lt;/td&gt;
&lt;td&gt;47.5%&lt;/td&gt;
&lt;td&gt;12–18 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI CUA o3 (Operator)&lt;/td&gt;
&lt;td&gt;42.9%&lt;/td&gt;
&lt;td&gt;15–20 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Computer Use (standalone)&lt;/td&gt;
&lt;td&gt;22–28%&lt;/td&gt;
&lt;td&gt;10–15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human baseline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30 sec – 2 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(OSWorld leaderboard, February 2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Even the &lt;strong&gt;current leader&lt;/strong&gt; fails one in three tasks and takes 10–20 minutes to do what a human does in two. That's what hundreds of billions produced.&lt;/p&gt;

&lt;p&gt;And the cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Tokens per Perception&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (vision model)&lt;/td&gt;
&lt;td&gt;1,200–5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full tree dump (JSON/YAML)&lt;/td&gt;
&lt;td&gt;5,000–15,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell (.a11y.snap)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50–200&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell (SQL query)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10–50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;10–30x fewer tokens.&lt;/strong&gt; An agent using DirectShell maintains 10–30x more operational history in its context window. Where a screenshot agent forgets after 10 actions, a DirectShell agent remembers hundreds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fundamental Error
&lt;/h3&gt;

&lt;p&gt;Here is the one sentence that summarizes everything wrong with the current approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The screenshot paradigm performs computer vision on a UI that already describes itself as text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Photographing a JSON response and running OCR on the photo — instead of parsing the JSON. That is, architecturally, what the entire AI industry is doing. The data is already there. In structured, semantic, machine-readable form. And everyone decided to take pictures of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Insight
&lt;/h2&gt;

&lt;p&gt;Every application on your computer is already describing itself in full structural detail. Right now. Every button declares its name, its role, whether it's enabled, and where it is. Every text field exposes its value. Every menu is a traversable tree.&lt;/p&gt;

&lt;p&gt;It's called the &lt;strong&gt;Accessibility Tree&lt;/strong&gt;. It was built for blind people. It has existed since 1997.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Window: "Invoice - Datev Pro"
├── Edit: "Customer Number"  →  Value: "KD-4711"
├── Edit: "Amount"           →  Value: "1,299.00"
├── ComboBox: "Tax Rate"     →  Value: "19%"
└── Button: "Book"           →  IsEnabled: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each element provides: name, role, value, position, enabled/disabled state, on-screen/off-screen status, parent-child relationships. &lt;strong&gt;Pure text. What LLMs are built to process.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every major OS has this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Since&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;UI Automation (UIA)&lt;/td&gt;
&lt;td&gt;1997/2005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;NSAccessibility&lt;/td&gt;
&lt;td&gt;2001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;AT-SPI2&lt;/td&gt;
&lt;td&gt;2001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Android&lt;/td&gt;
&lt;td&gt;AccessibilityService&lt;/td&gt;
&lt;td&gt;2009&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every major application implements it. Native apps. Web apps through the browser's accessibility layer. Chromium apps (Discord, Slack, VS Code, Spotify) expose the entire DOM through it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gap
&lt;/h3&gt;

&lt;p&gt;Before DirectShell, there was no system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuously dumps the accessibility tree into a &lt;strong&gt;queryable SQL database&lt;/strong&gt; at real-time refresh rates&lt;/li&gt;
&lt;li&gt;Automatically generates &lt;strong&gt;multiple output formats&lt;/strong&gt; optimized for different consumers&lt;/li&gt;
&lt;li&gt;Provides a &lt;strong&gt;universal action queue&lt;/strong&gt; where any process can control the app via SQL INSERT&lt;/li&gt;
&lt;li&gt;Operates as &lt;strong&gt;infrastructure&lt;/strong&gt; — not as a tool, but as a universal layer between any agent and any GUI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The accessibility tree has existed since 1997. SQL databases since the 1970s. Nobody combined them into a universal interface primitive.&lt;/p&gt;

&lt;p&gt;Until now.&lt;/p&gt;
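
&lt;p&gt;For intuition, here is a minimal sketch of what that combination could look like as a schema. The table and column names are illustrative assumptions rather than DirectShell's actual layout; the authoritative reference is ARCHITECTURE.md in the repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative schema sketch; names are assumptions, not the shipped layout.
CREATE TABLE elements (            -- refreshed continuously from the accessibility tree
    id         INTEGER PRIMARY KEY,
    name       TEXT,               -- e.g. 'Amount'
    role       TEXT,               -- e.g. 'Edit', 'Button'
    value      TEXT,               -- current field content
    x INTEGER, y INTEGER, w INTEGER, h INTEGER,
    is_enabled INTEGER,
    parent_id  INTEGER REFERENCES elements(id)
);

CREATE TABLE inject (              -- universal action queue, consumed by DirectShell
    id     INTEGER PRIMARY KEY,
    action TEXT,                   -- 'click' | 'type' | 'text' | 'key' | 'scroll'
    text   TEXT,
    target TEXT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;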




&lt;h2&gt;
  
  
  Does This Already Exist?
&lt;/h2&gt;

&lt;p&gt;Honest answer: parts of it do. The full thing does not. Here is every relevant project that exists as of February 2026, and what each one is missing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Exists
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What's Missing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/microsoft/UFO" rel="noopener noreferrer"&gt;Microsoft UFO/UFO2&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Walks UIA tree, dumps as JSON to GPT-4o&lt;/td&gt;
&lt;td&gt;Full JSON dump = 15,000+ tokens. No SQL. No persistent database. An &lt;em&gt;agent&lt;/em&gt;, not infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/CursorTouch/Windows-MCP" rel="noopener noreferrer"&gt;Windows-MCP&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exposes UIA tree via MCP tools&lt;/td&gt;
&lt;td&gt;No SQL database. No multi-format output. No overlay. Closest competitor — still misses the core innovation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser accessibility tree via MCP&lt;/td&gt;
&lt;td&gt;Browser-only. Does not work for desktop apps. Does not work for SAP, Datev, Excel, or any native application.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/CommandAGI/computer-mcp" rel="noopener noreferrer"&gt;computer-mcp&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-platform a11y tree via MCP&lt;/td&gt;
&lt;td&gt;Returns full JSON tree. No SQL. No filtering. Same context saturation as screenshots, just in text form.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/mb-dev/macos-ui-automation-mcp" rel="noopener noreferrer"&gt;macOS UI Automation MCP&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;macOS accessibility via JSONPath&lt;/td&gt;
&lt;td&gt;macOS only. JSONPath queries, not SQL. Closest architectural analog — but different platform, different query language.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/pywinauto/pywinauto" rel="noopener noreferrer"&gt;pywinauto&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python library for Windows UIA&lt;/td&gt;
&lt;td&gt;Requires full Python environment. 18,000+ lines. Academic-grade, not production infrastructure. No database layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RPA (UiPath, Automation Anywhere)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accessibility selectors as one of many targeting strategies&lt;/td&gt;
&lt;td&gt;Per-application scripting. No universal query layer. No structured output. &lt;a href="https://www.abbacustechnologies.com/how-much-does-a-single-enterprise-api-integration-actually-cost-to-build-and-maintain/" rel="noopener noreferrer"&gt;$50K–$150K/year&lt;/a&gt; per integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Screen Readers (JAWS, NVDA)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Walk tree, read aloud&lt;/td&gt;
&lt;td&gt;Single-purpose assistive tools. No structured data output. No query interface. Not designed for programmatic consumption.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What None of Them Do
&lt;/h3&gt;

&lt;p&gt;I searched. Extensively. Across &lt;a href="https://arxiv.org/abs/2404.07972" rel="noopener noreferrer"&gt;419 academic sources&lt;/a&gt;, GitHub, Google Scholar, product pages, patent databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No project, paper, or product on Earth:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stores the accessibility tree in a &lt;strong&gt;queryable SQL database&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generates &lt;strong&gt;multiple output formats&lt;/strong&gt; optimized for different consumers (50-token LLM snapshots vs. full database)&lt;/li&gt;
&lt;li&gt;Provides a &lt;strong&gt;SQL-based action queue&lt;/strong&gt; where any process controls the app via &lt;code&gt;INSERT INTO inject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Operates as &lt;strong&gt;infrastructure&lt;/strong&gt; — not an agent, not a tool, but a universal primitive&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The accessibility tree has existed since 1997. SQL since the 1970s. Nobody combined them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Evidence
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld benchmark&lt;/a&gt; — the industry standard for AI agent evaluation — shows the best screenshot agent achieving &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;66.2% success&lt;/a&gt; (AskUI VisionAgent) where humans score 72.4%. Most agents cluster between 30–50%. Research from &lt;a href="https://www.accessibility.works/blog/do-accessible-websites-perform-better-for-ai-agents/" rel="noopener noreferrer"&gt;accessibility.works&lt;/a&gt; proves that agents using accessibility data succeed 85% of the time while consuming 10x fewer resources. The token gap is real: screenshots cost 1,200–5,000 tokens per perception. DirectShell's &lt;code&gt;.a11y.snap&lt;/code&gt; costs 50–200. Its SQL queries cost 10–50.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.precedenceresearch.com/robotic-process-automation-market" rel="noopener noreferrer"&gt;$28.3 billion RPA market&lt;/a&gt; exists because desktop applications don't have APIs. DirectShell gives every application an API. In 700 KB. For free.&lt;/p&gt;




&lt;h2&gt;
  
  
  What DirectShell Is
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DirectShell turns every GUI on the planet into a text-based API that any LLM can natively read and control.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is not a tool. Not an automation script. Not an RPA product. Not a screen reader.&lt;/p&gt;

&lt;p&gt;DirectShell is a &lt;strong&gt;primitive&lt;/strong&gt; — a fundamental building block like TCP/IP, HTTP, SQL, or the browser.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Primitive&lt;/th&gt;
&lt;th&gt;What It Universalizes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TCP/IP&lt;/td&gt;
&lt;td&gt;Reliable data transport between any two computers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP&lt;/td&gt;
&lt;td&gt;Standardized request-response for any resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;Universal query language for any database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The Browser&lt;/td&gt;
&lt;td&gt;Universal client for any web resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PowerShell&lt;/td&gt;
&lt;td&gt;CLI access to any OS service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Input/output control for any GUI application&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PowerShell automates the backend. &lt;strong&gt;DirectShell automates the frontend.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;DirectShell is a single binary (~700 KB, pure Rust, no dependencies)&lt;/li&gt;
&lt;li&gt;You drag it onto any running application. It &lt;strong&gt;snaps&lt;/strong&gt; to it&lt;/li&gt;
&lt;li&gt;Once snapped, it continuously reads the app's entire UI through the Accessibility framework&lt;/li&gt;
&lt;li&gt;Everything goes into a SQLite database — every button, field, menu item, with names, values, positions&lt;/li&gt;
&lt;li&gt;It generates four text files optimized for different consumers&lt;/li&gt;
&lt;li&gt;External processes control the app by writing SQL to an action queue in the same database&lt;/li&gt;
&lt;li&gt;DirectShell executes those commands as native input events — indistinguishable from human input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Text in, text out.&lt;/strong&gt; The AI reads a text file to understand the screen. Writes a SQL command to act on it. No screenshots. No pixels. No vision model.&lt;/p&gt;
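
&lt;p&gt;A minimal sketch of one such perceive-and-act round trip, assuming an illustrative element table (real column names may differ); the actions themselves mirror the queue examples shown further down:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Perceive: find the field and the button (schema names are assumptions).
SELECT name, value, is_enabled
FROM elements
WHERE name IN ('Amount', 'Book');

-- Act: set the field, then press the button via the action queue.
INSERT INTO inject (action, text, target) VALUES ('text', '2,599.00', 'Amount');
INSERT INTO inject (action, target) VALUES ('click', 'Book');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;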




&lt;h2&gt;
  
  
  The Architecture (Compressed)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Four Output Formats
&lt;/h3&gt;

&lt;p&gt;Every 500ms, DirectShell generates four files from the accessibility tree:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;For&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;What It Contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.db&lt;/code&gt; (SQLite)&lt;/td&gt;
&lt;td&gt;Scripts, programs&lt;/td&gt;
&lt;td&gt;100KB–1.5MB&lt;/td&gt;
&lt;td&gt;Complete queryable element tree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.snap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automation scripts&lt;/td&gt;
&lt;td&gt;3–15 KB&lt;/td&gt;
&lt;td&gt;All interactive elements, classified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.a11y&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Context-aware agents&lt;/td&gt;
&lt;td&gt;3–10 KB&lt;/td&gt;
&lt;td&gt;Focus, inputs, visible content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.a11y.snap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLMs&lt;/td&gt;
&lt;td&gt;1–5 KB&lt;/td&gt;
&lt;td&gt;Numbered operable elements only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;.a11y.snap&lt;/code&gt; — what an LLM actually reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] [keyboard] "Adressfeld" @ 168,41 (2049x29)
[2] [click] "Neuer Chat" @ 45,200 (200x30)
[3] [keyboard] "Einen Prompt eingeben" @ 999,1177 (1069x37)
[4] [click] "Einstellungen" @ 1800,1350 (150x20)

# 4 operable elements in viewport
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Four lines.&lt;/strong&gt; That's the entire perception step. Not a 5,000-token screenshot. Four lines that say: here's what you can interact with, here's the name, here's the input type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Five Action Types
&lt;/h3&gt;

&lt;p&gt;Any process controls the app through SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2,599.00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Amount'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Hello World'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'key'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ctrl+s'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'click'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Save'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'scroll'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'down'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;text&lt;/code&gt; sets a value instantly via UIA. &lt;code&gt;type&lt;/code&gt; simulates keyboard input character-by-character. &lt;code&gt;key&lt;/code&gt; sends shortcuts. &lt;code&gt;click&lt;/code&gt; finds the named element and clicks its center. &lt;code&gt;scroll&lt;/code&gt; scrolls.&lt;/p&gt;

&lt;p&gt;The target application cannot distinguish this from physical hardware input.&lt;/p&gt;
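
&lt;p&gt;Because the database is regenerated continuously, an agent can also verify that an action landed by re-reading the same element one refresh cycle later. A hedged sketch, again assuming an illustrative element table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Queue the write (same pattern as the examples above).
INSERT INTO inject (action, text, target) VALUES ('text', '2,599.00', 'Amount');

-- After the next refresh, confirm the field actually changed
-- (element table and columns are illustrative assumptions).
SELECT value
FROM elements
WHERE name = 'Amount';   -- expect '2,599.00' once the action has been applied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;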

&lt;h3&gt;
  
  
  The Chromium Problem
&lt;/h3&gt;

&lt;p&gt;Chromium (Chrome, Edge, Opera, Discord, Slack, VS Code, Spotify) doesn't build its accessibility tree by default. Performance optimization. Without a screen reader present, you get 9 skeleton elements.&lt;/p&gt;

&lt;p&gt;DirectShell solved this with a four-phase activation: system screen reader flag, a leaked UIA FocusChanged event handler that forces &lt;code&gt;UiaClientsAreListening()&lt;/code&gt; to return &lt;code&gt;true&lt;/code&gt; permanently, direct MSAA probing of renderer windows, and a retry with delay.&lt;/p&gt;

&lt;p&gt;Result: Opera went from &lt;strong&gt;9 elements to 800+&lt;/strong&gt;. Claude Desktop from a handful to &lt;strong&gt;11,454 elements&lt;/strong&gt;. Every chat message, button, link — fully searchable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Full technical details in the whitepaper and ARCHITECTURE.md)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo Day
&lt;/h2&gt;

&lt;p&gt;February 16, 2026 — 8.5 hours after the first line of code. Claude Opus 4.6 (in a CLI terminal) used DirectShell to operate four applications. No screenshots. Pure text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Sheets:&lt;/strong&gt; 72 cells filled in seconds. Headers, values, SUM formulas. Through the accessibility layer alone. No Sheets API. (The formulas had an off-by-one bug. Day 1.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Gemini:&lt;/strong&gt; The AI navigated to Gemini, typed a message, read the response through DirectShell's tree, reported it back. A Google AI, on Google's infrastructure, controlled entirely by a competing AI (Claude), through an interface Google didn't build and can't block. Gemini's response: the "God Mode" quote at the top of this article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Desktop:&lt;/strong&gt; 11,454 elements. Every chat message. Every button. Anthropic built Computer Use (screenshot-based). Anthropic built Claude Desktop. DirectShell read Anthropic's own application as structured text. The company that bet on pixels built an app that describes itself perfectly in text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notepad:&lt;/strong&gt; Character-by-character typing through raw keyboard injection. Notepad had no idea the input wasn't human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Search:&lt;/strong&gt; Honest failure. Poor accessibility semantics in search results. The tree is only as good as the app's accessibility implementation. This is a Google accessibility failure, not a DirectShell limitation.&lt;/p&gt;

&lt;p&gt;Every failure proves the system is real. Not a cherry-picked demo. An AI fighting through unexpected problems in four applications, adapting in real-time, delivering results in seconds — where the state of the art takes minutes and fails most of the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watch It
&lt;/h3&gt;

&lt;p&gt;The full 7-minute demo — uncut, unedited, warts and all:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/nvZobyt0KBg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(If the embed doesn't load: &lt;a href="https://youtu.be/nvZobyt0KBg" rel="noopener noreferrer"&gt;Watch the demo on YouTube&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Market vs. Day 1: Verified Benchmarks
&lt;/h3&gt;

&lt;p&gt;These are not my numbers. These are the industry's own benchmarks, published at peer-reviewed venues and on official product pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the best AI agents in the world achieve (February 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Best Agent&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; (Desktop)&lt;/td&gt;
&lt;td&gt;AskUI VisionAgent&lt;/td&gt;
&lt;td&gt;66.2%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld Leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; (Desktop)&lt;/td&gt;
&lt;td&gt;UI-TARS 2&lt;/td&gt;
&lt;td&gt;47.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/bytedance/UI-TARS" rel="noopener noreferrer"&gt;ByteDance&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; (Desktop)&lt;/td&gt;
&lt;td&gt;OpenAI CUA o3&lt;/td&gt;
&lt;td&gt;42.9%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/index/computer-using-agent/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://webarena.dev/" rel="noopener noreferrer"&gt;WebArena&lt;/a&gt; (Web)&lt;/td&gt;
&lt;td&gt;IBM CUGA&lt;/td&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.emergentmind.com/topics/webarena-benchmark" rel="noopener noreferrer"&gt;Emergent Mind&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://webchorearena.github.io/" rel="noopener noreferrer"&gt;WebChoreArena&lt;/a&gt; (Hard Web)&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;37.8%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://webchorearena.github.io/" rel="noopener noreferrer"&gt;WebChoreArena&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;Online-Mind2Web&lt;/a&gt; (Real Web)&lt;/td&gt;
&lt;td&gt;Most agents&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;ArXiv&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf" rel="noopener noreferrer"&gt;ScreenSpot-Pro&lt;/a&gt; (Pro GUI)&lt;/td&gt;
&lt;td&gt;OS-Atlas-7B&lt;/td&gt;
&lt;td&gt;18.9%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf" rel="noopener noreferrer"&gt;ScreenSpot-Pro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(Leaderboard as of February 2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every single one: screenshot-based. 1,200–5,000 tokens per perception step. 10–20 minutes per task. Even the current desktop leader fails one in three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What DirectShell achieved on Day 1 (8.5 hours after first line of code):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write multi-paragraph text to Notepad&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Instant&lt;/strong&gt; (0ms)&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_text&lt;/code&gt; (ValuePattern)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read entire Claude.ai chat + respond cross-app&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~60 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_screen&lt;/code&gt; + &lt;code&gt;ds_type&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fill 360 cells in Google Sheets (SOC Incident Log)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~90 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~150&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_batch&lt;/code&gt; + &lt;code&gt;ds_type&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No screenshots. No vision model. No coordinate guessing. Text in, text out.&lt;/p&gt;

&lt;p&gt;The current desktop leader still fails one in three tasks and takes 10–20 minutes each. Most agents fail more than half the time. DirectShell filled 360 spreadsheet cells in 90 seconds on the first day it existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Cannot Be Blocked
&lt;/h2&gt;

&lt;p&gt;The accessibility interface is protected by interlocking international law:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UN CRPD&lt;/strong&gt; — Article 9, ratified by &lt;strong&gt;186 states&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;European Accessibility Act&lt;/strong&gt; — enforced since June 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Americans with Disabilities Act&lt;/strong&gt; — Title III, digital accessibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Section 508&lt;/strong&gt; — federal procurement requires accessibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;German BFSG&lt;/strong&gt; — up to €100,000 per violation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DirectShell reads the same API as JAWS, NVDA, and Windows Narrator. The OS cannot distinguish between them. Every countermeasure that blocks DirectShell also blocks screen readers. Blocking screen readers violates disability law in 186 countries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Countermeasure&lt;/th&gt;
&lt;th&gt;Blocks DirectShell&lt;/th&gt;
&lt;th&gt;Blocks Screen Readers&lt;/th&gt;
&lt;th&gt;Legal?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disable UIA&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — violates EAA, ADA, Section 508&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return empty data&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Degrades&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — violates WCAG 4.1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect &amp;amp; block UIA clients&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt; (JAWS, NVDA)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — disability discrimination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remove element names&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gibberish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — WCAG violation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;There is no technical mechanism to distinguish a screen reader from DirectShell.&lt;/strong&gt; Both use the same COM interfaces. The OS does not authenticate accessibility clients. It cannot. That's the point of the framework.&lt;/p&gt;

&lt;p&gt;Consider the PR: "SAP blocks screen reader access to protect API revenue." No Fortune 500 company wants that headline.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Full legal analysis with case law and statute references in the whitepaper)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dark Side
&lt;/h2&gt;

&lt;p&gt;A primitive is neutral. Like fire. Like the internet. Like cryptography. Its value and its danger come from the same source: its universality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surveillance:&lt;/strong&gt; DirectShell enables structured, real-time, queryable monitoring of every application on a system. Not blurry screenshots every 5 minutes — a database of every field, every value, every input. "What did Employee X type into the CRM between 2pm and 4pm?" is a SQL query.&lt;/p&gt;
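
&lt;p&gt;That sentence is not a metaphor. Assuming a history table in which observed field values are timestamped (an illustrative extension, not necessarily the shipped schema), the query is a few lines of SQL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative only: assumes a history table with timestamped field observations.
SELECT observed_at, name, value
FROM element_history
WHERE app = 'CRM'
  AND role = 'Edit'
  AND observed_at BETWEEN '2026-02-17 14:00' AND '2026-02-17 16:00'
ORDER BY observed_at;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;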

&lt;p&gt;&lt;strong&gt;Malware with structured UI access:&lt;/strong&gt; Today's malware takes screenshots and records keystrokes — unstructured data requiring interpretation. DirectShell's architecture enables malware that &lt;em&gt;understands&lt;/em&gt; applications. It doesn't screenshot a banking app and try OCR — it queries for the account number field and reads the value. It can find the transfer form, fill in an IBAN, enter an amount, and click confirm. Deterministically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential harvesting:&lt;/strong&gt; Any password displayed in a UI field has a corresponding entry in the accessibility tree. Password managers that display credentials in their UI expose them through UIA. The read path is legally protected and cannot be patched.&lt;/p&gt;

&lt;p&gt;I'm publishing this not despite the risks, but because of them. This capability has been latent for 29 years. I am documenting a vulnerability that has existed since 1997. By publishing openly, the security community can develop defenses. The conversation happens publicly. The response is informed by understanding, not surprise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accessibility quality varies.&lt;/strong&gt; The tree is only as good as the app's implementation. Major enterprise software (Office, SAP, browsers) is comprehensive. Smaller apps may have unnamed buttons or missing values. The trend is toward better accessibility, driven by EAA enforcement — but gaps exist today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-app scope.&lt;/strong&gt; v0.2.0 attaches to one target at a time. Multi-app workflows require re-snapping. This is an engineering limitation, not architectural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.2.0 bugs.&lt;/strong&gt; Built in 8.5 hours. Formula offsets in spreadsheets. Chromium tab switching requires keyboard shortcuts. Opera autofill popups interfere with injection. These are Day 1 bugs. The architecture is sound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's missing:&lt;/strong&gt; MCP server integration (coming), app profiles (community-built configs per application), character transformation middleware (PII sanitization, auto-translation), multi-window support, cross-platform ports (macOS/Linux have equivalent accessibility frameworks).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;Single file: &lt;code&gt;src/main.rs&lt;/code&gt;, 2,053 lines of Rust. Two dependencies: &lt;code&gt;rusqlite&lt;/code&gt; and &lt;code&gt;windows&lt;/code&gt;. Compiles to ~700 KB. Runs on any 64-bit Windows 10/11. No installation. No admin privileges. No configuration.&lt;/p&gt;

&lt;p&gt;AGPL-3.0. Every fork stays open.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI industry framed "computer use" as a vision problem. They built increasingly sophisticated models to interpret screenshots. DirectShell reframes it as a &lt;strong&gt;text problem&lt;/strong&gt;. And text is what language models were built for.&lt;/p&gt;

&lt;p&gt;This is not a better solution to the same problem. This is the realization that the problem was misidentified from the start.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Listen. DirectShell is not perfect. It's Day 1. Literally. There are bugs. There are errors. A hundred things that need to get better. But none of that matters. The first browser couldn't render 90% of web pages correctly. The first lightbulb flickered. Every foundational technology begins empty and broken — because the point was never whether it works perfectly now. The point is what it will make possible tomorrow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The moment a community builds a profile repository — configs for every program on Earth — AI will natively operate every desktop application faster, more efficiently, and more productively than any human ever could. Not in ten years. Not after the next funding round. The infrastructure is here. Today. In 700 kilobytes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google. Microsoft. OpenAI. Anthropic. Call me. Let's talk. Let's revolutionize the world of AI in one stroke.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Peace at last.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And now I'm going to sleep for 12 hours.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— Martin Gehrken, February 17, 2026&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Whitepaper&lt;/strong&gt; (full technical paper, 120,000 characters, legal analysis, all use cases, architecture deep dive): &lt;a href="//./WHITEPAPER.md"&gt;WHITEPAPER.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Code&lt;/strong&gt; (AGPL-3.0): &lt;a href="https://github.com/IamLumae/DirectShell" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Reference&lt;/strong&gt;: &lt;a href="//./ARCHITECTURE.md"&gt;ARCHITECTURE.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://dev.thelastrag.de" rel="noopener noreferrer"&gt;dev.thelastrag.de&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;







&lt;p&gt;&lt;strong&gt;Talk to me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discord:&lt;/strong&gt; &lt;a href="https://discord.gg/pMVe7kz2XJ" rel="noopener noreferrer"&gt;Deep Learn — LLM, Research, Open Source and Programming&lt;/a&gt; — the community where DirectShell was born&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email:&lt;/strong&gt; &lt;a href="mailto:iamlumae@gmail.com"&gt;iamlumae@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://dev.thelastrag.de" rel="noopener noreferrer"&gt;dev.thelastrag.de&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article is released under CC BY-SA 4.0. The DirectShell source code is AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>disruptive</category>
      <category>primitivum</category>
      <category>agents</category>
    </item>
    <item>
      <title>I Built a New Software Primitive in 8.5 Hours. It Replaces the Eyes of Every AI Agent on Earth.</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Tue, 17 Feb 2026 18:57:12 +0000</pubDate>
      <link>https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia</link>
      <guid>https://dev.to/tlrag/i-built-a-new-software-primitive-in-85-hours-it-replaces-the-eyes-of-every-ai-agent-on-earth-55ia</guid>
      <description>&lt;p&gt;&lt;strong&gt;DirectShell: Universal Application Control Through the Accessibility Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Martin Gehrken — February 17, 2026&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;As of February 17, 2026, every screenshot-based AI agent, every enterprise API wrapper, and every RPA tool on Earth is legacy technology.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv41k27kaug41c4frywr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv41k27kaug41c4frywr.png" alt=" " width="689" height="528"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚠️ A Warning to the IT World
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I did not create a vulnerability. I discovered one that has existed since 1997.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Windows Accessibility Layer — UI Automation — has been exposing the complete structure, content, and state of every GUI application on every Windows machine for 29 years. Every button name. Every text field value. Every menu item. Structured. Machine-readable. In real-time. Unprotected by any authentication mechanism. Available to any process running on the system.&lt;/p&gt;

&lt;p&gt;I did not build this interface. Microsoft did, in 1997. Apple built the equivalent for macOS in 2001. The Linux community built AT-SPI2 the same year. Google built AccessibilityService for Android in 2009. Every major operating system on Earth has one.&lt;/p&gt;

&lt;p&gt;What I built is a tool that makes this interface usable. That takes 29 years of latent capability and turns it into structured, queryable, actionable data. A single binary that reads and controls any application through this legally mandated, legally unblockable accessibility layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This means:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, I am releasing a primitive — a universal interface layer — that &lt;strong&gt;reads&lt;/strong&gt; any field in any application on your computer. Not by hacking. Not by exploiting a bug. Through an interface that your operating system provides by design, that disability law in 186 countries requires to exist, and that cannot be disabled without simultaneously locking blind users out of their computers.&lt;/p&gt;

&lt;p&gt;Today, I am releasing a primitive that &lt;strong&gt;controls&lt;/strong&gt; any application on your computer. Fill forms. Click buttons. Type text. Through the same input mechanisms that screen readers and assistive technology have used for decades. Indistinguishable from human input at the OS level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I built it. It's open source. And the tools built on top of it will follow within weeks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I am not telling you this to scare you. I am telling you this because you deserve to know.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The accessibility layer was built as an act of inclusion — to ensure that disabled people can use computers. That purpose is noble and must be protected. But the same interface that enables a screen reader to read your screen enables any software to read your screen. The same mechanism that allows assistive input devices to type for paralyzed users allows any software to type into any field.&lt;/p&gt;

&lt;p&gt;This is not a bug to be patched. This is a fundamental property of how modern operating systems work. It is protected by law. It cannot be removed. And as of today, it is documented.&lt;/p&gt;

&lt;p&gt;The security community needs to understand this. IT administrators need to understand this. Every organization that handles sensitive data on desktop computers needs to understand this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I chose to publish openly so that everyone learns at the same time&lt;/strong&gt; — defenders and attackers, enterprises and researchers, governments and citizens. Because the alternative — discovering this capability through a breach instead of through a paper — is worse for everyone.&lt;/p&gt;

&lt;p&gt;Read the full analysis. Understand what is possible. Then decide how your organization responds.&lt;/p&gt;

&lt;p&gt;— Martin Gehrken, February 2026&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Warning to the IT World&lt;/li&gt;
&lt;li&gt;
Part I: The Problem

&lt;ul&gt;
&lt;li&gt;1. The $300 Billion Screenshot Problem&lt;/li&gt;
&lt;li&gt;2. How AI Desktop Automation Works in 2026&lt;/li&gt;
&lt;li&gt;3. The Numbers That Should Embarrass an Industry&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part II: The Insight

&lt;ul&gt;
&lt;li&gt;4. The Door That Was Always Open&lt;/li&gt;
&lt;li&gt;5. What Already Exists (And Why It's Not Enough)&lt;/li&gt;
&lt;li&gt;6. The Gap Nobody Filled&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part III: DirectShell

&lt;ul&gt;
&lt;li&gt;7. What DirectShell Is&lt;/li&gt;
&lt;li&gt;8. The Architecture&lt;/li&gt;
&lt;li&gt;9. The Code&lt;/li&gt;
&lt;li&gt;10. The Proof: Demo Day&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part IV: Why This Changes Everything

&lt;ul&gt;
&lt;li&gt;11. The Paradigm Shift&lt;/li&gt;
&lt;li&gt;12. Why This Cannot Be Blocked&lt;/li&gt;
&lt;li&gt;13. The Unpatchability Argument&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part V: What DirectShell Enables

&lt;ul&gt;
&lt;li&gt;14. For AI Agents&lt;/li&gt;
&lt;li&gt;15. For Enterprise Software&lt;/li&gt;
&lt;li&gt;16. For Accessibility&lt;/li&gt;
&lt;li&gt;17. For Legacy Systems&lt;/li&gt;
&lt;li&gt;18. For the Software Industry&lt;/li&gt;
&lt;li&gt;19. The 100 Use Cases: What You Can Build&lt;/li&gt;
&lt;li&gt;20. The Dark Side: What This Also Enables&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part VI: Honest Assessment

&lt;ul&gt;
&lt;li&gt;21. Limitations&lt;/li&gt;
&lt;li&gt;22. What's Missing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Part VII: The Vision

&lt;ul&gt;
&lt;li&gt;23. The Network Effect of Configuration&lt;/li&gt;
&lt;li&gt;24. Cross-Platform Potential&lt;/li&gt;
&lt;li&gt;25. What Will Actually Happen&lt;/li&gt;
&lt;li&gt;26. Timeline&lt;/li&gt;
&lt;li&gt;27. Conclusion&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Appendix A: Architecture Deep Dive&lt;/li&gt;

&lt;li&gt;Appendix B: Legal Framework (Full Analysis)&lt;/li&gt;

&lt;li&gt;Appendix C: Benchmark Methodology&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  Part I: The Problem
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. The $300 Billion Screenshot Problem
&lt;/h2&gt;

&lt;p&gt;Every major AI laboratory on the planet is pursuing the same objective: autonomous agents that operate desktop software. OpenAI's Operator. Anthropic's Computer Use. Google's Project Mariner. Microsoft's Copilot Actions. Each backed by tens of billions in investment. Each pursuing the same vision: AI that can use a computer the way you do.&lt;/p&gt;

&lt;p&gt;And every single one of them uses the same fundamental approach.&lt;/p&gt;

&lt;p&gt;They take a screenshot.&lt;/p&gt;

&lt;p&gt;They send that screenshot to a vision model. The model looks at the image — millions of pixels, thousands of tokens — and tries to figure out what's on screen. It guesses where the buttons are. It estimates where to click. It receives coordinates back. A simulated mouse moves to those coordinates. A click happens. Maybe it works. Maybe it doesn't. Then another screenshot is taken. The cycle repeats.&lt;/p&gt;

&lt;p&gt;This is not a caricature. This is the actual architecture. This is what hundreds of billions of dollars of research and development have produced. In 2026, the state of the art for making AI interact with software is &lt;strong&gt;taking photos of screens and guessing where to click&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let that sink in.&lt;/p&gt;

&lt;p&gt;Google, OpenAI, Anthropic, and Microsoft — the four most powerful AI organizations on Earth — have collectively invested more money into AI research than the GDP of most countries. Their brightest engineers have spent years on this problem. And the best they've come up with is the digital equivalent of squinting at a monitor from across the room.&lt;/p&gt;

&lt;p&gt;Meanwhile, on February 16, 2026, I built something in 8.5 hours that makes all of it unnecessary.&lt;/p&gt;

&lt;p&gt;This is not hyperbole. This is not marketing. By the end of this article, you will understand exactly what I built, exactly why it works, exactly why it cannot be blocked, and exactly why every screenshot-based AI agent framework on the planet is now building on a foundation that was wrong from the start.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. How AI Desktop Automation Works in 2026
&lt;/h2&gt;

&lt;p&gt;To understand why DirectShell matters, you need to understand what it replaces. Let me walk you through the state of the art.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Screenshot Loop
&lt;/h3&gt;

&lt;p&gt;Every major AI agent framework in 2026 follows the same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Capture a screenshot of the application
2. Encode the screenshot (1,200–5,000 tokens per image)
3. Send it to a vision-language model (cloud API call)
4. The model analyzes the image
5. The model guesses pixel coordinates for the next action
6. Coordinates are sent back
7. A simulated mouse click happens at those coordinates
8. Wait for the UI to update
9. Capture another screenshot
10. Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the loop. This is what OpenAI Operator does. This is what Anthropic Computer Use does. This is what Google Project Mariner does. Every iteration burns tokens, burns money, burns time, and introduces another chance for failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Is Fundamentally Broken
&lt;/h3&gt;

&lt;p&gt;The screenshot approach has five structural weaknesses that cannot be resolved within the paradigm:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single screenshot at 1920×1080 resolution consumes approximately 1,200–1,800 tokens when encoded for a vision-language model. A multi-step workflow requiring 20 interactions consumes 24,000–36,000 tokens in image data alone — before the model performs any reasoning. At current API pricing, even simple automation workflows become expensive at scale. Running continuous background monitoring? Forget it. Every glance at the screen costs money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context Saturation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Language models have finite context windows. Every screenshot injected into the context displaces space that could be used for reasoning, instructions, or memory. An agent operating across multiple applications accumulates screenshots rapidly, degrading the model's ability to maintain coherent multi-step plans.&lt;/p&gt;

&lt;p&gt;This is what I call the "stuffed head" problem. The agent becomes progressively less capable as the task grows more complex — not because the task is harder, but because visual data is consuming its working memory. It's like trying to solve a math problem while someone keeps holding up photographs in front of your face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Latency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each action requires a round trip: capture, encode, transmit to cloud, process, respond, execute. At typical API latencies, this introduces 2–5 seconds per action. A 30-step workflow takes 1–2.5 minutes even when every step succeeds on the first attempt. In practice, steps fail. Retries happen. A simple task that takes a human 30 seconds takes an AI agent 15–20 minutes.&lt;/p&gt;

&lt;p&gt;The user sits there. Watching. Their mouse and keyboard locked out by an agent that's "thinking." For minutes at a time. This is the user experience that billions of dollars have produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Fragility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Visual inference is resolution-dependent, theme-dependent, font-dependent, and language-dependent. A model trained to recognize a "Save" button at 100% scaling may fail at 125%. Dark mode changes the visual fingerprint of every element. Localized interfaces present the same UI in different languages. A pop-up notification can occlude the target element. An animation can change the screen state mid-inference.&lt;/p&gt;

&lt;p&gt;Every screenshot is a lossy, ambiguous representation of the underlying interface state. The model doesn't know what the interface IS. It only knows what the interface LOOKS LIKE at one specific moment in time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Opacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A screenshot contains pixels. It does not contain semantics. The model cannot reliably distinguish between a button labeled "Delete" and a decorative image that contains the word "Delete." It cannot determine whether a text field is editable, disabled, or read-only without guessing from visual cues. It cannot identify off-screen elements, scroll positions, or hierarchical relationships between UI components. It cannot query for specific elements — it must parse the entire visual field every time.&lt;/p&gt;

&lt;p&gt;The model is &lt;em&gt;inferring&lt;/em&gt; structure from visual patterns. It is never actually &lt;em&gt;reading&lt;/em&gt; the interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fundamental Error
&lt;/h3&gt;

&lt;p&gt;Here is the sentence that summarizes everything wrong with the current approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The screenshot paradigm performs computer vision on a UI that already describes itself as text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is equivalent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Photographing a JSON response and running OCR on the photo, instead of parsing the JSON&lt;/li&gt;
&lt;li&gt;Taking a screenshot of a spreadsheet and using a vision model to read cell values, instead of calling the spreadsheet API&lt;/li&gt;
&lt;li&gt;Recording someone reading a book aloud and running speech-to-text on the audio, instead of opening the text file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data is already there. It has been there for 25 years. In structured, semantic, machine-readable form. And the entire industry decided to take pictures of it instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Numbers That Should Embarrass an Industry
&lt;/h2&gt;

&lt;p&gt;Let's look at the actual benchmarks. Not marketing claims. Not press releases. Real, reproducible numbers from standardized evaluation frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  OSWorld Benchmark (December 2025)
&lt;/h3&gt;

&lt;p&gt;OSWorld is the industry-standard benchmark for evaluating AI agents on desktop tasks. It measures whether an agent can complete real-world workflows on a desktop operating system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Average Time per Task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AskUI VisionAgent (current leader)&lt;/td&gt;
&lt;td&gt;66.2%&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI-TARS 2 (ByteDance)&lt;/td&gt;
&lt;td&gt;47.5%&lt;/td&gt;
&lt;td&gt;12–18 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI CUA o3 (Operator)&lt;/td&gt;
&lt;td&gt;42.9%&lt;/td&gt;
&lt;td&gt;15–20 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Computer Use (standalone)&lt;/td&gt;
&lt;td&gt;22–28%&lt;/td&gt;
&lt;td&gt;10–15 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human baseline&lt;/td&gt;
&lt;td&gt;72.4%&lt;/td&gt;
&lt;td&gt;30 seconds – 2 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(OSWorld leaderboard as of February 2026. Numbers shift weekly. The structural argument — screenshot agents burn thousands of tokens per step and take minutes per task — does not.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Even the &lt;strong&gt;current leader&lt;/strong&gt; at 66.2% still fails one in three tasks, still uses screenshots, still burns thousands of tokens per perception step, and still takes orders of magnitude longer than a human. That is the state of the art. That is what hundreds of billions of dollars have produced.&lt;/p&gt;

&lt;p&gt;And these are controlled test conditions. In real-world usage, with unexpected pop-ups, loading screens, network delays, and UI variations, the success rate drops further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Economics
&lt;/h3&gt;

&lt;p&gt;Let's compare the cost of a single perception step — one moment of "looking at the screen":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Tokens per Perception&lt;/th&gt;
&lt;th&gt;Data Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (vision model)&lt;/td&gt;
&lt;td&gt;1,200–5,000&lt;/td&gt;
&lt;td&gt;Compressed image pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full tree dump (JSON/YAML)&lt;/td&gt;
&lt;td&gt;5,000–15,000&lt;/td&gt;
&lt;td&gt;Hierarchical text structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell (.a11y.snap)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50–200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filtered, indexed element list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell (SQL query)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10–50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Targeted query result&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For continuous background monitoring (checking if an email arrived, watching for a form submission), the token difference exceeds &lt;strong&gt;100:1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means an agent using DirectShell can maintain &lt;strong&gt;10–30x more operational history&lt;/strong&gt; in its context window, enabling significantly longer and more complex workflows without context degradation. Where a screenshot-based agent runs out of context after 10–20 actions, a DirectShell-based agent can maintain hundreds of actions in working memory.&lt;/p&gt;
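
&lt;p&gt;As a quick sanity check on that claim, here is the arithmetic. The context window size and per-step token counts below are illustrative assumptions drawn from the ranges above, not measurements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope context budget (illustrative assumptions, not measurements)
CONTEXT_WINDOW = 128_000       # assumed model context size in tokens
REASONING_BUDGET = 28_000      # assumed tokens reserved for instructions and reasoning

TOKENS_PER_SCREENSHOT = 2_000  # mid-range of the 1,200-5,000 figure above
TOKENS_PER_SNAP = 100          # mid-range of the 50-200 figure for .a11y.snap

available = CONTEXT_WINDOW - REASONING_BUDGET
print("screenshot perceptions that fit:", available // TOKENS_PER_SCREENSHOT)  # 50
print(".a11y.snap perceptions that fit:", available // TOKENS_PER_SNAP)        # 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Twenty times more perception steps in the same window — squarely inside the 10–30x range above.&lt;/p&gt;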

&lt;h3&gt;
  
  
  Latency Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Screenshot Agent&lt;/th&gt;
&lt;th&gt;DirectShell&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Perceive screen state&lt;/td&gt;
&lt;td&gt;2–5 seconds&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 millisecond (file read)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Identify target element&lt;/td&gt;
&lt;td&gt;Part of vision inference&lt;/td&gt;
&lt;td&gt;Microseconds (SQL query)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execute action&lt;/td&gt;
&lt;td&gt;200–500ms (mouse simulation)&lt;/td&gt;
&lt;td&gt;30ms (action dispatch)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full perception-action cycle&lt;/td&gt;
&lt;td&gt;3–8 seconds&lt;/td&gt;
&lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30-step workflow (optimistic)&lt;/td&gt;
&lt;td&gt;1.5–4 minutes&lt;/td&gt;
&lt;td&gt;3–10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference is not incremental. It is not 2x or 5x. It is &lt;strong&gt;orders of magnitude&lt;/strong&gt;. A 30-step workflow that takes the best AI agent 15 minutes (when it works at all) takes DirectShell seconds. And DirectShell does not fail because it clicked the wrong pixel. There are no wrong pixels. There is a database query that returns the exact element.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part II: The Insight
&lt;/h1&gt;

&lt;h2&gt;
  
  
  4. The Door That Was Always Open
&lt;/h2&gt;

&lt;p&gt;Here is the secret. Here is what nobody saw. Here is why this article exists.&lt;/p&gt;

&lt;p&gt;Every application on your computer is already describing itself in full structural detail. Right now. While you read this. Every button is declaring its name, its role, whether it's enabled, and where it is on screen. Every text field is exposing its current value. Every menu hierarchy is represented as a traversable tree. Every checkbox knows whether it's checked.&lt;/p&gt;

&lt;p&gt;This data exists in every application. On every modern operating system. Updated in real-time. On every UI change.&lt;/p&gt;

&lt;p&gt;It's called the &lt;strong&gt;Accessibility Tree&lt;/strong&gt;. And it was built for blind people.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Accessibility Layer: A Brief History
&lt;/h3&gt;

&lt;p&gt;In 1997, Microsoft introduced &lt;strong&gt;MSAA&lt;/strong&gt; (Microsoft Active Accessibility) as part of Windows 95/98. The purpose was simple: enable screen readers — software that reads the screen aloud — so that blind and visually impaired people could use computers.&lt;/p&gt;

&lt;p&gt;In 2005, with Windows Vista, Microsoft introduced &lt;strong&gt;UI Automation (UIA)&lt;/strong&gt; — a modern, more powerful replacement. UIA provides a complete, hierarchical, real-time representation of every GUI element in every application running on the system.&lt;/p&gt;

&lt;p&gt;Here is what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Window: "Invoice - Datev Pro"
├── TitleBar
│   ├── Button: "Minimize"
│   ├── Button: "Maximize"
│   └── Button: "Close"
├── MenuBar
│   ├── MenuItem: "File"
│   ├── MenuItem: "Edit"
│   └── MenuItem: "Help"
├── Pane: "Invoice Details"
│   ├── Edit: "Customer Number"  →  Value: "KD-4711"
│   ├── Edit: "Amount"           →  Value: "1,299.00"
│   ├── ComboBox: "Tax Rate"     →  Value: "19%"
│   └── Button: "Book"           →  IsEnabled: true
└── StatusBar: "Ready"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each element provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name&lt;/strong&gt; — human-readable label ("Save", "Customer Number", "Inbox")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ControlType&lt;/strong&gt; — semantic role (Button, Edit, ComboBox, ListItem, Menu...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value&lt;/strong&gt; — field content, URL, selected item&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutomationId&lt;/strong&gt; — developer-assigned unique identifier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BoundingRectangle&lt;/strong&gt; — exact position and size on screen (x, y, width, height)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IsEnabled&lt;/strong&gt; — whether it can be interacted with&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IsOffscreen&lt;/strong&gt; — whether it's currently visible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parent/Child relationships&lt;/strong&gt; — full hierarchical tree structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is pure text. This is what LLMs are built to process.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No vision model needed. No coordinate guessing. No pixel interpretation. The semantic layer already exists. It has existed for 25 years.&lt;/p&gt;

&lt;p&gt;And in 2026, while OpenAI, Google, and Anthropic spent hundreds of billions taking screenshots, nobody was using it as a universal interface for AI agents.&lt;/p&gt;
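
&lt;p&gt;You don't have to take my word for it. A few lines of Python with pywinauto (more on that library below) will print this structure for any window on your desktop — pure text, no pixels. The Notepad title is just a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: read an application's UI as plain text on Windows.
# Requires "pip install pywinauto"; the Notepad window title is a placeholder.
from pywinauto import Desktop

win = Desktop(backend="uia").window(title="Untitled - Notepad")
for el in win.descendants():
    info = el.element_info
    rect = info.rectangle
    print(str(info.control_type).ljust(12),
          repr(info.name).ljust(40),
          "@", rect.left, rect.top)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;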

&lt;h3&gt;
  
  
  Why Accessibility Trees Exist Everywhere
&lt;/h3&gt;

&lt;p&gt;This is not a Windows-specific feature. Every major operating system has an equivalent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Year Introduced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;UI Automation (UIA) / MSAA&lt;/td&gt;
&lt;td&gt;1997 / 2005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;NSAccessibility / AXUIElement&lt;/td&gt;
&lt;td&gt;2001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;AT-SPI2 (Assistive Technology SPI)&lt;/td&gt;
&lt;td&gt;2001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Android&lt;/td&gt;
&lt;td&gt;AccessibilityService API&lt;/td&gt;
&lt;td&gt;2009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iOS&lt;/td&gt;
&lt;td&gt;UIAccessibility&lt;/td&gt;
&lt;td&gt;2008&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every major application framework implements these APIs. Native apps implement them. Web apps implement them (through the browser's accessibility layer). Cross-platform frameworks (Electron, Qt, GTK) implement them. Chromium-based applications expose the entire DOM through the accessibility tree.&lt;/p&gt;

&lt;p&gt;The coverage is not optional. It is a &lt;strong&gt;platform-level requirement&lt;/strong&gt;. And as we'll see in Part IV, it is increasingly a &lt;strong&gt;legal requirement&lt;/strong&gt; that cannot be removed.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What Already Exists (And Why It's Not Enough)
&lt;/h2&gt;

&lt;p&gt;Before I explain what DirectShell does, let me honestly acknowledge what already exists. This is not a field where nothing has been done. People have used accessibility APIs before. The question is: how, and why wasn't it enough?&lt;/p&gt;

&lt;p&gt;I surveyed 419 academic sources through the &lt;a href="https://arxiv.org/abs/2404.07972" rel="noopener noreferrer"&gt;OSWorld literature&lt;/a&gt;, every major GitHub repository in the AI agent space, and every commercial product I could find. Here is the complete landscape as of February 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  Screen Readers (JAWS, NVDA, Narrator)
&lt;/h3&gt;

&lt;p&gt;Screen readers have been using accessibility APIs since 1997. They walk the accessibility tree and read element names aloud for blind users. They are single-purpose assistive tools. They do not expose the tree as structured data. They do not provide query interfaces. They are not designed for programmatic consumption. They proved that the data exists — they never made it programmable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Microsoft UFO / UFO2 / UFO3 (2024–2025)
&lt;/h3&gt;

&lt;p&gt;Microsoft Research published &lt;a href="https://github.com/microsoft/UFO" rel="noopener noreferrer"&gt;UFO&lt;/a&gt; (UI-Focused Agent) in February 2024, UFO2 in April 2025, and UFO3 Galaxy in November 2025. UFO uses Windows UI Automation as &lt;strong&gt;one component&lt;/strong&gt; in a hybrid system that also uses screenshots and native APIs. It is an agent framework — a specific application built on top of UIA, not a universal interface layer.&lt;/p&gt;

&lt;p&gt;The critical difference: UFO walks the accessibility tree, dumps it as JSON, and sends the entire blob to GPT-4o. This creates the same context saturation problem as screenshots — instead of millions of pixels, you get tens of thousands of JSON tokens. A full UIA dump of a complex application (like Excel or Claude Desktop) results in 60–100 KB of JSON. That's 15,000+ tokens consumed just to tell the model what's on screen.&lt;/p&gt;

&lt;p&gt;UFO3 expanded to a "Galaxy" multi-agent framework covering 20+ Windows applications. Still JSON dumps. Still no SQL. Still an application, not infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UFO is an application that happens to use UIA. DirectShell is the infrastructure layer that makes UIA usable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows-MCP (CursorTouch, 2025)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/CursorTouch/Windows-MCP" rel="noopener noreferrer"&gt;Windows-MCP&lt;/a&gt; is the closest thing to DirectShell that existed before DirectShell. With 4,300+ stars and over 2 million users in Claude Desktop, it exposes the Windows accessibility tree through MCP (Model Context Protocol) tools.&lt;/p&gt;

&lt;p&gt;What it does: reads UIA elements, provides click/type actions by element name, works across desktop applications.&lt;/p&gt;

&lt;p&gt;What it doesn't do: no SQL database, no persistent storage, no multi-format output, no overlay window, no delta-based event system, no action queue. Every perception call walks the full tree and returns results in-memory. There is no way for an external script to query the UI state without going through the MCP protocol.&lt;/p&gt;

&lt;p&gt;Windows-MCP is a tool. DirectShell is the layer that tools are built on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Playwright MCP (Microsoft, 2025)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; exposes web page accessibility trees through the Model Context Protocol. Vercel's &lt;code&gt;agent-browser&lt;/code&gt; refined this approach by reducing the tree and using element references (like &lt;code&gt;@e21&lt;/code&gt;). Their research showed a &lt;a href="https://www.accessibility.works/blog/do-accessible-websites-perform-better-for-ai-agents/" rel="noopener noreferrer"&gt;73% token reduction&lt;/a&gt; compared to screenshots — proving the core thesis that accessibility trees are more efficient than pixels.&lt;/p&gt;

&lt;p&gt;But Playwright MCP only works for web pages in browsers. It does not work for desktop applications. It does not work for SAP. It does not work for Datev. It does not work for any of the millions of desktop applications that businesses run every day. The moment you leave the browser, Playwright MCP is blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  computer-mcp (CommandAGI, 2025)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/CommandAGI/computer-mcp" rel="noopener noreferrer"&gt;computer-mcp&lt;/a&gt; takes a cross-platform approach, exposing accessibility trees on Windows, macOS, and Linux through MCP. The most ambitious scope of any existing tool.&lt;/p&gt;

&lt;p&gt;The problem: it returns the &lt;strong&gt;full accessibility tree as JSON&lt;/strong&gt;. For a complex application, that's 15,000–60,000 tokens per read. This is the same context saturation problem as screenshots, just in text form. No SQL filtering. No multi-format output. No way to ask "what are the interactive elements?" without ingesting the entire tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS UI Automation MCP (mb-dev, 2025)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mb-dev/macos-ui-automation-mcp" rel="noopener noreferrer"&gt;macOS UI Automation MCP&lt;/a&gt; uses JSONPath queries to filter the accessibility tree on macOS. This is the closest architectural analog to DirectShell's approach — it recognized that the raw tree is too large and introduced a query language.&lt;/p&gt;

&lt;p&gt;But JSONPath is not SQL. It cannot do joins, aggregations, or complex filtering. It runs on macOS only. And critically, it does not persist the tree in a database — each query re-walks the tree from scratch. There is no historical state, no action queue, no external interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  pywinauto (Open Source, Python)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pywinauto/pywinauto" rel="noopener noreferrer"&gt;pywinauto&lt;/a&gt; is the granddaddy of Windows accessibility automation. 3,700+ stars. Used by the GOI paper (October 2025) to build declarative interfaces on top of Windows UIA — 18,000+ lines of Python code.&lt;/p&gt;

&lt;p&gt;pywinauto is a library, not infrastructure. It requires a full Python runtime. It provides programmatic access to individual elements but does not store the tree, does not generate output formats, and does not provide a universal action interface. It is a toolkit for building automation scripts, not a primitive for building systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  RPA Tools (UiPath, Automation Anywhere, Blue Prism)
&lt;/h3&gt;

&lt;p&gt;Enterprise RPA tools use accessibility selectors as &lt;strong&gt;one of several element-targeting strategies&lt;/strong&gt;, alongside image matching, coordinate-based clicking, and OCR. They require per-application scripting. They do not expose the full element tree as a queryable data structure. They are workflow automation tools, not universal interface layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.macrotrends.net/stocks/charts/PATH/uipath/market-cap" rel="noopener noreferrer"&gt;UiPath is valued at ~$6 billion&lt;/a&gt;. Its entire business model is "we help you automate applications that don't have APIs." Each integration costs &lt;a href="https://www.abbacustechnologies.com/how-much-does-a-single-enterprise-api-integration-actually-cost-to-build-and-maintain/" rel="noopener noreferrer"&gt;$50K–$150K/year&lt;/a&gt; to build and maintain. DirectShell does what UiPath does with a 700 KB binary and no scripting required.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.precedenceresearch.com/robotic-process-automation-market" rel="noopener noreferrer"&gt;$28.3 billion RPA market&lt;/a&gt; (projected $247 billion by 2035) exists because desktop applications don't have APIs. DirectShell gives every application an API.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Screenshot Agents (OpenAI, Anthropic, Google, ByteDance)
&lt;/h3&gt;

&lt;p&gt;For completeness, here is what the major AI labs built:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;OSWorld Success Rate / Scope&lt;/th&gt;
&lt;th&gt;Source / Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;AskUI VisionAgent&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + custom vision&lt;/td&gt;
&lt;td&gt;66.2% (leader)&lt;/td&gt;
&lt;td&gt;OSWorld leaderboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/bytedance/UI-TARS" rel="noopener noreferrer"&gt;UI-TARS 2 (ByteDance)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + specialized vision&lt;/td&gt;
&lt;td&gt;47.5%&lt;/td&gt;
&lt;td&gt;OSWorld leaderboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/index/computer-using-agent/" rel="noopener noreferrer"&gt;OpenAI Operator (CUA o3)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + GPT-4o + RL&lt;/td&gt;
&lt;td&gt;42.9%&lt;/td&gt;
&lt;td&gt;OSWorld benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.anthropic.com/research/developing-computer-use" rel="noopener noreferrer"&gt;Anthropic Computer Use&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + Claude&lt;/td&gt;
&lt;td&gt;22–28% (standalone)&lt;/td&gt;
&lt;td&gt;OSWorld benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://deepmind.google/technologies/project-mariner/" rel="noopener noreferrer"&gt;Google Project Mariner&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + DOM hybrid&lt;/td&gt;
&lt;td&gt;Browser-only&lt;/td&gt;
&lt;td&gt;$249.99/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://microsoft.com/en-us/copilot/studio/computer-use" rel="noopener noreferrer"&gt;Microsoft Copilot Studio&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Screenshots + UIA hybrid&lt;/td&gt;
&lt;td&gt;Desktop + browser&lt;/td&gt;
&lt;td&gt;September 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All screenshot-based. Failure rates ranging from roughly 34% (the current leader) to 72–78% (standalone Computer Use). All consuming 1,200–5,000 tokens per perception step. All taking 10–20 minutes for tasks humans complete in under two. All pursuing the paradigm that DirectShell makes obsolete.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complete Comparison
&lt;/h3&gt;

&lt;p&gt;Here is every tool plotted against the five architectural components that define DirectShell:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;A11y Tree Read&lt;/th&gt;
&lt;th&gt;SQL Database&lt;/th&gt;
&lt;th&gt;Multi-Format Output&lt;/th&gt;
&lt;th&gt;Action Queue&lt;/th&gt;
&lt;th&gt;Universal (any app)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Screen Readers&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft UFO/UFO2/UFO3&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows-MCP&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright MCP&lt;/td&gt;
&lt;td&gt;Yes (browser)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;computer-mcp&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS UI Automation MCP&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;macOS only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pywinauto&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPA (UiPath etc.)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Per-script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot agents&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (poorly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;No existing tool implements more than 2 of the 5 components.&lt;/strong&gt; DirectShell implements all 5.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The Gap Nobody Filled
&lt;/h2&gt;

&lt;p&gt;Let me state the gap precisely, because precision matters.&lt;/p&gt;

&lt;p&gt;I searched &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;419 academic sources&lt;/a&gt; indexed by OSWorld. I searched GitHub for every combination of "accessibility tree" + "SQL," "UIA" + "database," "a11y" + "SQLite." I searched Google Scholar, ArXiv, ACL Anthology, and patent databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero results.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No project, paper, product, or patent on Earth — as of February 16, 2026 — describes a system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Continuously dumps&lt;/strong&gt; the complete accessibility tree of any application into a &lt;strong&gt;queryable relational database&lt;/strong&gt; at real-time refresh rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatically generates&lt;/strong&gt; multiple machine-readable output formats optimized for different consumer types (50-token LLM snapshots vs. full queryable database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provides a universal action queue&lt;/strong&gt; where any external process can submit input actions by element name via a simple &lt;code&gt;INSERT INTO inject&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captures live UI events&lt;/strong&gt; (property changes, structure mutations, window opens) as a delta stream — enabling 50-token perception instead of re-reading the full tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operates as infrastructure&lt;/strong&gt; rather than as an application — a universal primitive between any agent and any GUI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are not incremental improvements. These are architectural innovations that create a new category.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Gap Existed
&lt;/h3&gt;

&lt;p&gt;The components have been available for decades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The accessibility tree: since 1997 (MSAA), refined 2005 (UI Automation)&lt;/li&gt;
&lt;li&gt;SQL databases: since the 1970s&lt;/li&gt;
&lt;li&gt;The Win32 input system: since the 1990s&lt;/li&gt;
&lt;li&gt;MCP (Model Context Protocol): since November 2024&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each is well-understood, battle-tested technology. The gap existed not because the technology was missing, but because two communities never talked to each other:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The accessibility community&lt;/strong&gt; knew the tree existed but built single-purpose assistive tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The AI community&lt;/strong&gt; knew LLMs needed structured data but assumed GUIs could only be perceived through screenshots&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2411.17465" rel="noopener noreferrer"&gt;ShowUI paper&lt;/a&gt; proved that 33% of screenshot tokens are visually redundant. The &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld benchmark&lt;/a&gt; showed accessibility-tree approaches consistently outperforming pure vision. &lt;a href="https://www.accessibility.works/blog/do-accessible-websites-perform-better-for-ai-agents/" rel="noopener noreferrer"&gt;Research from accessibility.works&lt;/a&gt; demonstrated that agents with accessibility data succeed 85% of the time while consuming 10x fewer resources.&lt;/p&gt;

&lt;p&gt;The evidence was everywhere. The obvious conclusion — put it in a database and let LLMs query it — was nowhere.&lt;/p&gt;

&lt;p&gt;What nobody did — in 29 years — was combine them into a universal interface primitive.&lt;/p&gt;

&lt;p&gt;Until February 16, 2026.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part III: DirectShell
&lt;/h1&gt;

&lt;h2&gt;
  
  
  7. What DirectShell Is
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The One-Sentence Definition
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DirectShell turns every GUI on the planet into a text-based API that any LLM can natively read and control.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the entire concept. Everything else is implementation detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  What DirectShell Is Not
&lt;/h3&gt;

&lt;p&gt;DirectShell is &lt;strong&gt;not&lt;/strong&gt; an automation script. It is not an RPA tool. It is not a screen reader. It is not a macro recorder. It is not a testing framework. It is not a product.&lt;/p&gt;

&lt;p&gt;DirectShell is a &lt;strong&gt;primitive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A primitive in computing is a fundamental building block that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cannot be decomposed into simpler components that achieve the same function&lt;/li&gt;
&lt;li&gt;Enables an entire category of higher-level tools and workflows&lt;/li&gt;
&lt;li&gt;Has no expiration date — it remains useful as long as the platform exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of the building blocks that make modern computing possible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Primitive&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;What It Universalizes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TCP/IP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Networking&lt;/td&gt;
&lt;td&gt;Reliable data transport between any two computers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web&lt;/td&gt;
&lt;td&gt;Standardized request-response for any resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;Universal query language for any relational database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Browser&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Information&lt;/td&gt;
&lt;td&gt;Universal client for any web resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PowerShell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;CLI access to any OS service, registry, process, file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Input/output control for &lt;strong&gt;any GUI application&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PowerShell automates the backend. &lt;strong&gt;DirectShell automates the frontend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before DirectShell, the graphical frontend of every application was a closed system. You could look at it (screenshots) or you could use the vendor's API (if one existed, if you could afford it, if the vendor allowed it). There was no general-purpose, structured, queryable, writable interface to the visual layer of software.&lt;/p&gt;

&lt;p&gt;After DirectShell, every application that has a window has a universal interface. The same structured output. The same action format. The same data model. Regardless of vendor, language, age, or platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works (30-Second Version)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;DirectShell is a lightweight overlay window (single binary, no dependencies, ~700 KB)&lt;/li&gt;
&lt;li&gt;You drag it onto any running application. It "snaps" to it.&lt;/li&gt;
&lt;li&gt;Once snapped, DirectShell continuously reads the application's entire UI state through the Windows Accessibility framework&lt;/li&gt;
&lt;li&gt;It stores everything in a SQLite database — every button, text field, menu item, their names, values, positions, and states&lt;/li&gt;
&lt;li&gt;It generates multiple text files optimized for different consumers (scripts, AI agents, LLMs)&lt;/li&gt;
&lt;li&gt;External processes can control the application by writing simple SQL commands to an action queue in the same database&lt;/li&gt;
&lt;li&gt;DirectShell executes those commands as native input events — keyboard strokes, mouse clicks, text insertion — that are indistinguishable from human input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Both directions are text. Both directions are LLM-native.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI reads a text file to understand the screen. The AI writes a SQL command to act on the screen. No screenshots. No pixels. No coordinate guessing. No vision model. Just text in, text out — the native operating mode of every language model on earth.&lt;/p&gt;
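
&lt;p&gt;Here is what that loop looks like from the agent's side — a minimal sketch, assuming DirectShell is snapped to Opera and using the file names and &lt;code&gt;inject&lt;/code&gt; schema described in the architecture section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the text-in / text-out loop. Paths and schema are assumptions
# based on the output formats and inject table described in this article.
import sqlite3

# 1. Perceive: the entire screen state is a small text file.
with open("opera.a11y.snap", encoding="utf-8") as f:
    screen_state = f.read()   # handed to the LLM as-is, a few hundred tokens

# 2. Act: the LLM decides to type into the prompt field; we queue that action.
con = sqlite3.connect("opera.db")
con.execute(
    "INSERT INTO inject (action, text, target) VALUES (?, ?, ?)",
    ("text", "Summarize my open tabs", "Einen Prompt für Gemini eingeben"),
)
con.commit()
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;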




&lt;h2&gt;
  
  
  8. The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 The Physical Layer: An Invisible Overlay
&lt;/h3&gt;

&lt;p&gt;DirectShell starts as a small, translucent window with an anthracite frame and a subtle light animation that travels around its border — a visual signature indicating it's alive and ready.&lt;/p&gt;

&lt;p&gt;When you drag this window over any running application and release it, DirectShell &lt;strong&gt;snaps&lt;/strong&gt; to the target — detecting the application, matching its position and dimensions, and binding to it. The word isn't accidental. It's what it feels like: a magnet clicking into place. From this point forward, the two windows behave as one: move one, the other follows. Minimize one, both minimize. Close one, both close. The application has been snapped. It now has a universal interface.&lt;/p&gt;

&lt;p&gt;The key technical elements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparent Click-Through:&lt;/strong&gt; The overlay uses &lt;code&gt;WS_EX_LAYERED&lt;/code&gt; with color keying. The center of the overlay is magenta (keyed out to full transparency). All input — mouse clicks, keyboard strokes — passes straight through to the target application below. The user never notices DirectShell is there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Owner-Window Relationship:&lt;/strong&gt; DirectShell uses &lt;code&gt;SetWindowLongPtrW&lt;/code&gt; to establish an owner-owned relationship with the target. Windows automatically maintains Z-order inheritance — the overlay always stays on top of its owner, but not on top of other applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bidirectional Position Sync:&lt;/strong&gt; A 60 Hz timer (&lt;code&gt;SYNC_TIMER&lt;/code&gt;, 16ms) continuously monitors both windows. If the target moves, DirectShell follows. If the user drags DirectShell, the target follows. The synchronization is seamless — the two windows feel like one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Button Detection:&lt;/strong&gt; When snapping, DirectShell uses UIA to analyze the target's title bar. It locates the minimize, maximize, and close buttons, and positions its own unsnap button adjacent to them — fitting naturally into the target's chrome. This is not hardcoded. It adapts to any application's title bar layout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shell Window Filtering:&lt;/strong&gt; DirectShell prevents itself from snapping to the Desktop, Taskbar, or system tray by checking window class names against known shell classes (&lt;code&gt;Progman&lt;/code&gt;, &lt;code&gt;WorkerW&lt;/code&gt;, &lt;code&gt;Shell_TrayWnd&lt;/code&gt;, etc.).&lt;/p&gt;
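
&lt;p&gt;That filter boils down to a class-name check. A rough Python/ctypes equivalent (DirectShell itself is a single native binary — this is only a sketch of the logic):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: decide whether a window handle belongs to the Windows shell
# (Desktop, Taskbar, tray) and therefore must not be snapped to.
# The real exclusion list is longer than the classes named here.
import ctypes

SHELL_CLASSES = {"Progman", "WorkerW", "Shell_TrayWnd"}

def is_shell_window(hwnd):
    buf = ctypes.create_unicode_buffer(256)
    ctypes.windll.user32.GetClassNameW(hwnd, buf, 256)
    return buf.value in SHELL_CLASSES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;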

&lt;p&gt;The physical layer is elegant engineering, but it's not the innovation. It's the foundation on which the real breakthrough is built.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 The Perception Pipeline: GUI → Database
&lt;/h3&gt;

&lt;p&gt;This is the core of DirectShell. This is what makes it a primitive.&lt;/p&gt;

&lt;p&gt;Every 500 milliseconds (2 Hz), DirectShell spawns a background thread that performs a complete traversal of the target application's UI Automation tree. The traversal is depth-first, unlimited depth, unlimited children, using &lt;code&gt;IUIAutomation::RawViewWalker()&lt;/code&gt; for an unfiltered view of every element the operating system knows about.&lt;/p&gt;

&lt;p&gt;For each element, the following properties are extracted:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;UIA Method&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Control Type&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentControlType()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What this element IS (Button, Edit, Menu...)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentName()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What it's CALLED ("Save", "Customer Number")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GetCurrentPattern(ValuePatternId)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What it CONTAINS (field text, URL, selection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automation ID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentAutomationId()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Developer's internal identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enabled&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentIsEnabled()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Can it be interacted with right now?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-screen&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentIsOffscreen()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is it currently visible?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bounding Rectangle&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CurrentBoundingRectangle()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact position and size on screen&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each element is immediately inserted as a row in a SQLite database. The database uses &lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt; mode, enabling external processes to read the database at any time without blocking or corruption, even while DirectShell is writing to it.&lt;/p&gt;

&lt;p&gt;Instead of accumulating all elements in memory and then dumping them — which would delay availability — DirectShell &lt;strong&gt;streams&lt;/strong&gt; elements to the database during traversal. A commit happens every 200 elements. This means that the top-level UI elements (menu bars, main buttons, input fields) are available for query within milliseconds of the walk starting, while deeper nested elements continue to be discovered and written.&lt;/p&gt;

&lt;p&gt;After the tree walk completes, DirectShell generates four output files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Database (&lt;code&gt;.db&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The complete element tree as a SQLite database with full SQL query capability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What buttons can the user click?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Button'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;offscreen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;-- What's in the text fields?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Edit'&lt;/span&gt;

&lt;span class="c1"&gt;-- Find a specific message in a chat&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%invoice%'&lt;/span&gt;

&lt;span class="c1"&gt;-- How many unread items?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ListItem'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%unread%'&lt;/span&gt;

&lt;span class="c1"&gt;-- Complete app structure overview&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each query executes in microseconds. The LLM doesn't need to parse a 100 KB JSON document to find one button. It asks a specific question and gets a specific answer.&lt;/p&gt;
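
&lt;p&gt;And because the database is plain SQLite in WAL mode, any external process can run these queries while DirectShell keeps writing. A minimal sketch — the &lt;code&gt;.db&lt;/code&gt; path is an assumption, the columns match the queries above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: query the live UI state from outside DirectShell.
# The database path is an assumption; the columns match the queries above.
import sqlite3

con = sqlite3.connect("file:opera.db?mode=ro", uri=True)  # read-only, WAL-friendly
con.row_factory = sqlite3.Row

buttons = con.execute(
    "SELECT name, x, y FROM elements "
    "WHERE role='Button' AND enabled=1 AND offscreen=0"
).fetchall()

for b in buttons:
    print(b["name"], "@", b["x"], b["y"])
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;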

&lt;p&gt;&lt;strong&gt;2. The Snapshot (&lt;code&gt;.snap&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A flat list of all interactive, enabled, visible elements with their input tool classification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# opera.snap — Generated by DirectShell
# Window: Google Gemini – Opera

[keyboard] "Adressfeld" @ 168,41 (2049x29) id=addressEditor
[click] "Neuer Chat" @ 45,107 (2515x1285)
[keyboard] "Einen Prompt für Gemini eingeben" @ 999,1177 (1069x37)
[click] "Einstellungen &amp;amp; Hilfe" @ 1800,1350 (150x20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;deterministic operations manual&lt;/strong&gt; for scripts and automation tools. Every element that accepts input, classified by input type, with exact coordinates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Screen Reader View (&lt;code&gt;.a11y&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A structured text representation with three sections: Focus (what's currently selected), Input Targets (text fields and their current values), and Content (all visible text, links, and labels). This is the &lt;strong&gt;situational awareness&lt;/strong&gt; file — it tells an agent where it is, what it can see, and what it can type into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Operable Element Index (&lt;code&gt;.a11y.snap&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM pipeline. This is what an AI agent actually reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# opera.a11y.snap — Operable Elements (DirectShell)
# Window: Google Gemini – Opera
# Use 'target' column in inject table to aim at an element by name

[1] [keyboard] "Adressfeld" @ 168,41 (2049x29)
[2] [click] "Neuer Chat" @ 45,200 (200x30)
[3] [click] "Meine Inhalte" @ 45,240 (200x30)
[4] [click] "Gems" @ 45,280 (200x30)
[5] [keyboard] "Einen Prompt für Gemini eingeben" @ 999,1177 (1069x37)
[6] [click] "Einstellungen &amp;amp; Hilfe" @ 1800,1350 (150x20)

# 6 operable elements in viewport
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Six lines of text.&lt;/strong&gt; That is the entire perception step for an AI operating Google Gemini. Not a 5,000-token screenshot. Not a 15,000-token JSON dump. Six numbered lines that say: here are the 6 things you can interact with, here's what each one is called, and here's what type of input each one accepts.&lt;/p&gt;

&lt;p&gt;An LLM reads this and instantly knows: "Element [5] is a text input. It's called 'Einen Prompt für Gemini eingeben'. I can type into it." That is the complete perception. No vision model. No inference. No guessing. A few lines of text.&lt;/p&gt;
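
&lt;p&gt;And because the format is this regular, turning it into structured data for an agent is a single regular expression. A sketch against the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: parse the numbered .a11y.snap lines above into dictionaries.
import re

LINE = re.compile(r'\[(\d+)\] \[(\w+)\] "(.+)" @ (\d+),(\d+) \((\d+)x(\d+)\)')

def parse_a11y_snap(text):
    elements = []
    for m in LINE.finditer(text):
        idx, kind, name, x, y, w, h = m.groups()
        elements.append({
            "index": int(idx), "input": kind, "name": name,
            "x": int(x), "y": int(y), "width": int(w), "height": int(h),
        })
    return elements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;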

&lt;p&gt;&lt;strong&gt;This is automatically generated API documentation for every application on the planet that never had any.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8.3 The Chromium Problem (And How We Solved It)
&lt;/h3&gt;

&lt;p&gt;Here is a problem that would stop most projects cold. Chromium — the engine behind Chrome, Edge, Opera, and every Electron app (Discord, Slack, VS Code, Spotify, Claude Desktop, and hundreds more) — does &lt;strong&gt;not&lt;/strong&gt; build its accessibility tree by default.&lt;/p&gt;

&lt;p&gt;Chromium is performance-obsessed. Building an accessibility tree for the entire DOM costs CPU cycles. So Chromium only does it when it has evidence that an assistive technology (like a screen reader) is actively listening. Without that evidence, a UIA query against a Chromium window returns a skeleton: 9 elements. Window, pane, title bar. Nothing useful.&lt;/p&gt;

&lt;p&gt;This meant that out of the box, DirectShell could read native Windows applications perfectly but was blind to every browser and every Electron app on the system. Given that half of modern desktop software is Chromium-based, this was an existential problem.&lt;/p&gt;

&lt;p&gt;The solution took several simultaneous signals, applied in four phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: System-Level Screen Reader Flag&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SystemParametersInfoW(SPI_SETSCREENREADER, 1, ...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DirectShell registers itself with Windows as an active assistive technology. This is the same flag that JAWS, NVDA, and Windows Narrator set. When this flag is active, Chromium knows a screen reader is present and begins constructing its accessibility tree.&lt;/p&gt;

&lt;p&gt;Additionally, DirectShell sends &lt;code&gt;WM_SETTINGCHANGE&lt;/code&gt; directly to the target window — not waiting for the system-wide broadcast that may or may not reach the application in time.&lt;/p&gt;
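
&lt;p&gt;For reference, the same two signals can be sent from a few lines of Python/ctypes — standard Win32 constants, a placeholder window handle, and definitely not DirectShell's actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: announce an assistive technology to Windows and nudge a target window.
# Constants are standard Win32 values; hwnd_target is a placeholder handle.
import ctypes

SPI_SETSCREENREADER = 0x0047
SPIF_UPDATEINIFILE = 0x01
SPIF_SENDCHANGE = 0x02
WM_SETTINGCHANGE = 0x001A

user32 = ctypes.windll.user32
user32.SystemParametersInfoW(SPI_SETSCREENREADER, 1, None,
                             SPIF_UPDATEINIFILE | SPIF_SENDCHANGE)

hwnd_target = 0x000A1234  # placeholder: handle of the snapped window
user32.SendMessageW(hwnd_target, WM_SETTINGCHANGE, 0, 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;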

&lt;p&gt;&lt;strong&gt;Phase 2: The UIA Focus Handler (Key Innovation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the clever part. Chromium doesn't just check the screen reader flag. It also checks whether any UIA event handlers are registered — specifically, it calls &lt;code&gt;UiaClientsAreListening()&lt;/code&gt;. If that function returns &lt;code&gt;false&lt;/code&gt;, Chromium may still skip building its tree.&lt;/p&gt;

&lt;p&gt;DirectShell creates a UIA &lt;code&gt;FocusChangedEventHandler&lt;/code&gt; — a COM object that implements the &lt;code&gt;IUIAutomationFocusChangedEventHandler&lt;/code&gt; interface. This handler does absolutely nothing. Its &lt;code&gt;HandleFocusChangedEvent&lt;/code&gt; method is an empty function that immediately returns &lt;code&gt;Ok(())&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But by registering this no-op handler with &lt;code&gt;AddFocusChangedEventHandler&lt;/code&gt;, the system now has a registered UIA event listener. &lt;code&gt;UiaClientsAreListening()&lt;/code&gt; returns &lt;code&gt;true&lt;/code&gt;. And it stays true permanently — because DirectShell intentionally &lt;strong&gt;leaks&lt;/strong&gt; the handler using &lt;code&gt;Box::leak()&lt;/code&gt;. It's never deregistered. It never gets garbage collected. It persists for the lifetime of the process.&lt;/p&gt;

&lt;p&gt;This single leaked COM object is what forces every Chromium instance on the system to build and maintain its full accessibility tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Direct Window Probing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After setting the system flag and registering the handler, DirectShell waits 300ms and then directly probes the target window and all its child windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AccessibleObjectFromWindow&lt;/code&gt; (MSAA probe) on the main window&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EnumChildWindows&lt;/code&gt; to iterate all child windows&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AccessibleObjectFromWindow&lt;/code&gt; + &lt;code&gt;WM_GETOBJECT(OBJID_CLIENT)&lt;/code&gt; on each child&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This specifically targets &lt;code&gt;Chrome_RenderWidgetHostHWND&lt;/code&gt; — the renderer's window handle. The WM_GETOBJECT message forces the renderer to create its accessibility provider if it hasn't already.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Wait and Retry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After another 500ms delay (to give Chromium time to process all signals), DirectShell repeats the child window probe for reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our first test with Opera Browser, the element count went from &lt;strong&gt;9&lt;/strong&gt; (shell only) to &lt;strong&gt;800+&lt;/strong&gt; (complete browser UI including all web page content). With Claude Desktop (Electron), it went from a handful to &lt;strong&gt;11,454 elements&lt;/strong&gt; — every chat message, every button, every link, fully searchable and queryable.&lt;/p&gt;

&lt;p&gt;This four-phase activation sequence is not a hack. It uses the same signals that legitimate screen readers use. It's just more thorough about ensuring every Chromium process on the system gets the message.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.4 Multi-Format Output: Automatic API Documentation
&lt;/h3&gt;

&lt;p&gt;Let me re-emphasize this because it's the most underrated aspect of the architecture.&lt;/p&gt;

&lt;p&gt;DirectShell doesn't just dump a tree. It generates &lt;strong&gt;four different output formats&lt;/strong&gt;, each optimized for a different consumer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Consumer&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.db&lt;/code&gt; (SQLite)&lt;/td&gt;
&lt;td&gt;Scripts, SQL clients, programs&lt;/td&gt;
&lt;td&gt;Full tree (100KB–1.5MB)&lt;/td&gt;
&lt;td&gt;Complete queryable state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.snap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automation scripts&lt;/td&gt;
&lt;td&gt;3–15 KB&lt;/td&gt;
&lt;td&gt;All interactive elements, classified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.a11y&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Context-aware agents&lt;/td&gt;
&lt;td&gt;3–10 KB&lt;/td&gt;
&lt;td&gt;Focus, inputs, visible content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.a11y.snap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLMs&lt;/td&gt;
&lt;td&gt;1–5 KB&lt;/td&gt;
&lt;td&gt;Numbered operable elements only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is a &lt;strong&gt;multi-tier API documentation system&lt;/strong&gt; that DirectShell generates automatically for every application it touches. The same underlying data, presented at four levels of abstraction, for four different types of consumers.&lt;/p&gt;

&lt;p&gt;A Python script that needs to automate a form reads the &lt;code&gt;.snap&lt;/code&gt; file.&lt;br&gt;
A sophisticated AI agent reads the &lt;code&gt;.a11y&lt;/code&gt; file for context.&lt;br&gt;
A lightweight LLM reads the &lt;code&gt;.a11y.snap&lt;/code&gt; file — just the numbered list.&lt;br&gt;
A power user runs SQL queries against the &lt;code&gt;.db&lt;/code&gt; for any question the other formats don't answer.&lt;/p&gt;
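
&lt;p&gt;In practice that looks like this (a minimal sketch; the output directory, file names, and the &lt;code&gt;elements&lt;/code&gt; columns are assumptions consistent with the queries shown later in this article):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3
from pathlib import Path

OUT = Path("C:/directshell")   # assumption: where DirectShell writes its output files

# Tier 1: the LLM-sized view, a numbered list of operable elements
prompt_view = (OUT / "target.a11y.snap").read_text(encoding="utf-8")

# Tier 2: the agent context view, focus plus inputs plus visible content
context_view = (OUT / "target.a11y").read_text(encoding="utf-8")

# Tier 3: arbitrary questions against the full tree via SQL
db = sqlite3.connect(str(OUT / "target.db"))
buttons = db.execute(
    "SELECT name FROM elements WHERE role='Button' AND name != ''"
).fetchall()
print(f"{len(buttons)} named buttons on screen right now")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;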

&lt;p&gt;No application provides this documentation. No vendor writes it. DirectShell generates it automatically, every 500 milliseconds, for any application you point it at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is what makes DirectShell a primitive.&lt;/strong&gt; It doesn't solve one problem for one application. It provides a universal structured interface for every application. The same output format. The same action format. Whether the target is SAP, Notepad, Excel, a 20-year-old legacy system, or the latest Electron app.&lt;/p&gt;
&lt;h3&gt;
  
  
  8.5 The Action Pipeline: Native Control
&lt;/h3&gt;

&lt;p&gt;Reading the UI is only half the equation. The other half is controlling it.&lt;/p&gt;

&lt;p&gt;DirectShell maintains a persistent table in the SQLite database called &lt;code&gt;inject&lt;/code&gt;. Any external process can submit actions by writing a simple SQL INSERT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Set text in a specific field (UIA ValuePattern — instant)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2,599.00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Amount'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Type character-by-character (raw keyboard — for chat inputs)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Hello World'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Press a key combination&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'key'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ctrl+a'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Click a named element&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'click'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Book'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Scroll&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'scroll'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'down'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five action types cover every interaction a human can perform with a GUI:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UIA ValuePattern &lt;code&gt;SetValue()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Instant (whole string)&lt;/td&gt;
&lt;td&gt;Form fields, address bars, search boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SendInput&lt;/code&gt; per character (5ms delay)&lt;/td&gt;
&lt;td&gt;~200 chars/sec&lt;/td&gt;
&lt;td&gt;Chat inputs, terminals, apps that reject SetValue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SendInput&lt;/code&gt; with virtual key codes&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Keyboard shortcuts (Ctrl+S, Enter, Tab)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;click&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UIA &lt;code&gt;FindFirst&lt;/code&gt; + &lt;code&gt;SendInput&lt;/code&gt; mouse event&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Click any named element&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scroll&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SendInput&lt;/code&gt; with &lt;code&gt;MOUSEEVENTF_WHEEL&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Scroll in any direction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The action dispatch runs on its own timer at &lt;strong&gt;33 Hz&lt;/strong&gt; (30ms interval) — separate from the tree dump timer. This is critical for typing: at 33 Hz, a 200-character message takes about 1 second to type. If actions were dispatched at the tree dump rate of 2 Hz, the same message would take 100 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-Focus:&lt;/strong&gt; Before executing any action, the dispatch loop checks whether the target application is in the foreground. If not, it automatically brings it forward using the Alt-key trick (&lt;code&gt;VK_MENU&lt;/code&gt; down+up) followed by &lt;code&gt;SetForegroundWindow&lt;/code&gt;. This means actions work even when the target is behind other windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mark-Before-Execute:&lt;/strong&gt; Each action is marked as &lt;code&gt;done=1&lt;/code&gt; before execution, not after. This prevents double-fire if the action takes longer than the 30ms timer interval. If execution fails, the done flag is reset to 0 for retry on the next tick.&lt;/p&gt;
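
&lt;p&gt;Conceptually, one dispatch tick looks like the sketch below. DirectShell does this in Rust; the Python is only an illustration, and the &lt;code&gt;id&lt;/code&gt; column and &lt;code&gt;execute_action&lt;/code&gt; callback are assumptions standing in for the real queue schema and the SendInput/UIA work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

def dispatch_tick(con, execute_action):
    """One 30ms tick: claim pending actions first, execute, un-claim on failure."""
    rows = con.execute(
        "SELECT id, action, text, target FROM inject WHERE done=0 ORDER BY id"
    ).fetchall()
    for row_id, action, text, target in rows:
        # Mark BEFORE executing so a slow action cannot fire twice on the next tick
        con.execute("UPDATE inject SET done=1 WHERE id=?", (row_id,))
        con.commit()
        try:
            execute_action(action, text, target)   # click / type / key / scroll / text
        except Exception:
            # Reset the flag so the action is retried on a later tick
            con.execute("UPDATE inject SET done=0 WHERE id=?", (row_id,))
            con.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;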

&lt;p&gt;&lt;strong&gt;Native Input:&lt;/strong&gt; The target application cannot distinguish DirectShell-mediated input from physical hardware input. &lt;code&gt;SendInput&lt;/code&gt; generates the same low-level events that a keyboard and mouse produce. The operating system itself vouches for the events as legitimate.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.6 The Keyboard Hook: The Interception Layer
&lt;/h3&gt;

&lt;p&gt;DirectShell installs a global low-level keyboard hook (&lt;code&gt;WH_KEYBOARD_LL&lt;/code&gt;) that intercepts every keystroke before it reaches the target application. This creates a &lt;strong&gt;Man-in-the-Middle architecture&lt;/strong&gt; — not on the network, but on the local input stack.&lt;/p&gt;

&lt;p&gt;Currently, the hook passes through all keystrokes unchanged. The &lt;code&gt;transform_char()&lt;/code&gt; function is an identity function — it returns the character without modification. But the architecture is in place for arbitrary character transformation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII Sanitization:&lt;/strong&gt; Replace names, addresses, and account numbers with hashes before they reach a cloud-connected chat application (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Translation:&lt;/strong&gt; Type in German, the application receives English&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Correction:&lt;/strong&gt; Dyslexia support — the user types with errors, the application receives corrected text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input Filtering:&lt;/strong&gt; Block specific key patterns in specific applications&lt;/li&gt;
&lt;/ul&gt;
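
&lt;p&gt;For the PII case, the transformation itself is the easy part. Here is a toy version, pure Python and purely hypothetical: it is not DirectShell's &lt;code&gt;transform_char()&lt;/code&gt;, which works per keystroke in Rust, and the regexes are placeholders rather than real PII detection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import re

# Placeholder patterns; a real deployment needs proper PII detection
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def mask(match):
    token = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
    return f"[pii:{token}]"

def sanitize(text):
    """Replace account numbers and e-mail addresses with short, stable hash tags."""
    return EMAIL.sub(mask, IBAN.sub(mask, text))

print(sanitize("Refund DE89370400440532013000, contact jane.doe@example.com"))
# output: the account number and the address are replaced by [pii:...] tags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;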

&lt;p&gt;The hook runs only when DirectShell is snapped, only for non-injected keystrokes (to avoid feedback loops), only when the target has foreground focus, and only when no modifier keys (Ctrl, Alt) are held — preserving keyboard shortcuts.&lt;/p&gt;

&lt;p&gt;This is the slot for the "universal LLM in every text field" use case. The infrastructure is built. It's waiting to be filled.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.7 Timer Architecture: Four Heartbeats
&lt;/h3&gt;

&lt;p&gt;DirectShell's runtime behavior is driven by four independent timers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────────────┐
                    │   WM_TIMER          │
                    │   (Window Proc)     │
                    └─────────┬───────────┘
                              │
          ┌───────────────┬───┴───┬───────────────┐
          ▼               ▼       ▼               ▼
 ┌────────────┐  ┌────────────┐  ┌──────────┐  ┌──────────────┐
 │ SYNC_TIMER │  │ ANIM_TIMER │  │TREE_TIMER│  │ INJECT_TIMER │
 │   ID: 1    │  │   ID: 2    │  │  ID: 3   │  │    ID: 4     │
 │   16 ms    │  │   33 ms    │  │  500 ms  │  │    30 ms     │
 │  ~60 Hz    │  │  ~30 Hz    │  │   2 Hz   │  │   ~33 Hz     │
 └─────┬──────┘  └─────┬──────┘  └────┬─────┘  └──────┬───────┘
       │               │              │                │
       ▼               ▼              ▼                ▼
  do_sync()      InvalidateRect  dump_tree()    process_injections()
 (position)       (repaint)     (a11y tree)     (action queue)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timer&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When Active&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SYNC&lt;/td&gt;
&lt;td&gt;60 Hz&lt;/td&gt;
&lt;td&gt;Position synchronization between overlay and target&lt;/td&gt;
&lt;td&gt;Snapped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ANIM&lt;/td&gt;
&lt;td&gt;30 Hz&lt;/td&gt;
&lt;td&gt;Light reflex animation on the frame border&lt;/td&gt;
&lt;td&gt;Unsnapped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TREE&lt;/td&gt;
&lt;td&gt;2 Hz&lt;/td&gt;
&lt;td&gt;Full accessibility tree dump + output file generation&lt;/td&gt;
&lt;td&gt;Snapped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INJECT&lt;/td&gt;
&lt;td&gt;33 Hz&lt;/td&gt;
&lt;td&gt;Action queue processing (typing, clicking, scrolling)&lt;/td&gt;
&lt;td&gt;Snapped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The animation timer and the snapped-mode timers (SYNC, TREE, INJECT) are mutually exclusive. When DirectShell snaps to a target, the animation stops and the perception/action timers start. When it unsnaps, the reverse happens. There is no wasted processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why INJECT_TIMER is separate from TREE_TIMER:&lt;/strong&gt; The tree dump is a heavy operation (full UIA traversal + SQLite rebuild) that runs at 2 Hz. Action dispatch needs to be much faster for fluid typing. If actions were dispatched at 2 Hz, typing 200 characters would take 100 seconds. At 33 Hz, it takes 1 second. The separate timer ensures actions feel instant to the user watching the target application.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The Code
&lt;/h2&gt;

&lt;p&gt;DirectShell is written in pure Rust. A single file: &lt;code&gt;src/main.rs&lt;/code&gt;, 2,053 lines.&lt;/p&gt;

&lt;p&gt;Two dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rusqlite&lt;/code&gt; 0.31 (with bundled SQLite — no system dependency)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;windows&lt;/code&gt; 0.58 (official Microsoft Rust bindings for Win32)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No runtime. No framework. No .NET. No Python. No Node.js. No package manager ecosystem. No 500 MB &lt;code&gt;node_modules&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;The binary compiles to approximately &lt;strong&gt;700 KB&lt;/strong&gt; (SQLite's bundled C library accounts for ~500 KB of that). It runs on any 64-bit Windows 10 or 11 system. It requires no installation. No administrator privileges (for standard UIA operation). No configuration file. You download one file, you run it, it works.&lt;/p&gt;

&lt;p&gt;This matters because it establishes DirectShell as infrastructure, not as an application. Infrastructure must be lightweight, dependency-free, and universally deployable. A 700 KB single binary that runs everywhere meets that bar.&lt;/p&gt;

&lt;p&gt;The choice of Rust is deliberate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-cost abstractions&lt;/strong&gt; — no garbage collector, no runtime overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory safety&lt;/strong&gt; — no use-after-free, no buffer overflows, no null pointer dereferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe Win32 FFI&lt;/strong&gt; — the &lt;code&gt;windows&lt;/code&gt; crate provides typed, safe bindings to every Win32 API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single binary&lt;/strong&gt; — Rust compiles to a standalone executable with no runtime dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-compilation potential&lt;/strong&gt; — the design ports to other platforms (macOS, Linux) without structural changes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. The Proof: Demo Day
&lt;/h2&gt;

&lt;p&gt;On February 16, 2026 — 8.5 hours after the first line of code was written — DirectShell controlled four different applications in a live demonstration.&lt;/p&gt;

&lt;p&gt;The setup: Claude Opus 4.6 (running in the Claude Code CLI terminal on the left side of a split screen) used DirectShell to operate applications on the right side. The AI read &lt;code&gt;.a11y&lt;/code&gt; and &lt;code&gt;.a11y.snap&lt;/code&gt; files to understand the screen, then wrote SQL INSERT commands to the inject table to perform actions. No screenshots. No vision model. Pure text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Sheets: 72 Cells in Seconds
&lt;/h3&gt;

&lt;p&gt;The AI was asked to create a product comparison table. What happened:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We snapped Opera (with Google Sheets loaded)&lt;/li&gt;
&lt;li&gt;The AI read the &lt;code&gt;.a11y.snap&lt;/code&gt; — saw the input fields and the sheet grid&lt;/li&gt;
&lt;li&gt;The AI inserted actions: click cell A1, type "Produkt", Tab to B1, type "Preis", and so on&lt;/li&gt;
&lt;li&gt;DirectShell executed the actions at 33 Hz&lt;/li&gt;
&lt;li&gt;Within seconds, 72 cells were filled — headers, product names, prices, categories, ratings, and SUM formulas&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The formulas had an offset bug&lt;/strong&gt; — SUM ranges were shifted by one row. This was a first-day interpretation error, not an architectural limitation. The AI was calculating cell references based on its understanding of the grid, and its reference frame was off by one. This is exactly the kind of issue that app profiles will solve — a config file that tells the AI "A1 in Sheets is at these coordinates."&lt;/p&gt;

&lt;p&gt;But the point stands: an AI filled 72 cells in a spreadsheet through the accessibility layer alone. No Sheets API. No browser extension. No scripting. Raw input through a legally protected interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Gemini: Cross-AI Conversation
&lt;/h3&gt;

&lt;p&gt;The AI navigated to Google Gemini in the browser, typed a message into Gemini's input field, and received a response. Then it read Gemini's response through DirectShell's accessibility tree and reported it back.&lt;/p&gt;

&lt;p&gt;Gemini's response about DirectShell? &lt;em&gt;"You've essentially found the 'God Mode' of human-computer interaction by looking exactly where everyone else stopped looking."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A Google AI, running on Google's infrastructure, accessed through Google's browser, controlled entirely by a competing AI company's model (Claude), through a universal interface layer that Google didn't build, doesn't control, and can't block.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Desktop: Reading Anthropic's Own Application
&lt;/h3&gt;

&lt;p&gt;We snapped Claude Desktop — the chat application built by Anthropic, the company that invented screenshot-based Computer Use.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;11,454 elements.&lt;/strong&gt; Every chat message, every button, every link, every input field. Fully searchable. Fully queryable. Through the accessibility layer.&lt;/p&gt;

&lt;p&gt;The irony: Anthropic built Computer Use (screenshot-based GUI automation). Anthropic also built Claude Desktop (the test target). DirectShell — the text-based alternative — read Anthropic's own application as 11,454 structured text elements. No screenshot. No vision model. One SQL query.&lt;/p&gt;

&lt;p&gt;The company that bet on pixels built an app that describes itself perfectly in text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notepad: Writing a Manifesto
&lt;/h3&gt;

&lt;p&gt;We snapped Notepad and the AI typed a message directly into the text area. Character by character, at human typing speed, through the raw keyboard injection pathway. Notepad had no idea the input wasn't coming from a physical keyboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Search: Hitting the Limits
&lt;/h3&gt;

&lt;p&gt;This test showed DirectShell's honest limitations. Google's search page exposes minimal accessibility elements — the search results are deeply nested in a complex DOM with poor accessibility semantics. The AI struggled to navigate search results effectively.&lt;/p&gt;

&lt;p&gt;This is not a DirectShell failure. This is a Google accessibility implementation failure. The accessibility tree is only as good as the application's accessibility implementation. Google Search, despite Google's size and resources, has mediocre accessibility support for its search results page. This directly impacts the quality of DirectShell's output.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Demo Proves
&lt;/h3&gt;

&lt;p&gt;It's not perfect. Formulas were offset. Tab clicks didn't work on Chromium tabs (the AI switched to Ctrl+PageDown). Opera's autofill popup created confusion. Google Search exposed insufficient elements.&lt;/p&gt;

&lt;p&gt;Every one of these failures proves that the system is real. This is not a cherry-picked demo. This is not a happy path. This is an AI agent fighting through unexpected problems in four different applications, adapting in real-time, and still delivering results in seconds — where the state of the art takes minutes and fails most of the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watch It
&lt;/h3&gt;

&lt;p&gt;The full 7-minute demo — uncut, unedited, every bug and every success:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/rHfVj1KpCDU" rel="noopener noreferrer"&gt;Watch the demo&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Market Reality: Verified Benchmarks (February 2026)
&lt;/h3&gt;

&lt;p&gt;Before you judge the demo, let me show you what the rest of the industry achieves. These are not my numbers. These are published benchmarks from peer-reviewed conferences, official product announcements, and standardized evaluation frameworks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Desktop Agent Benchmarks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; (NeurIPS 2024) is the industry standard for evaluating AI agents on real desktop tasks across Windows, macOS, and Linux. &lt;a href="https://github.com/xlang-ai/OSWorld" rel="noopener noreferrer"&gt;369 tasks&lt;/a&gt;, covering productivity software, system administration, and creative workflows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;OSWorld Success Rate&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AskUI VisionAgent&lt;/td&gt;
&lt;td&gt;Screenshot + custom vision&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;66.2%&lt;/strong&gt; (leader)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld Leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoAct-1&lt;/td&gt;
&lt;td&gt;Screenshot + collaborative agents&lt;/td&gt;
&lt;td&gt;60.76%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld Leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI-TARS 2 (ByteDance)&lt;/td&gt;
&lt;td&gt;Screenshot + specialized vision&lt;/td&gt;
&lt;td&gt;47.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/bytedance/UI-TARS" rel="noopener noreferrer"&gt;ByteDance/UI-TARS&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI CUA o3 (Operator)&lt;/td&gt;
&lt;td&gt;Screenshot + GPT-4o + RL&lt;/td&gt;
&lt;td&gt;42.9%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/index/computer-using-agent/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent S2 with Claude 3.7&lt;/td&gt;
&lt;td&gt;Screenshot + hybrid&lt;/td&gt;
&lt;td&gt;34.5%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld Leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Computer Use (standalone)&lt;/td&gt;
&lt;td&gt;Screenshot + Claude 3.5/3.7&lt;/td&gt;
&lt;td&gt;22–28%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.anthropic.com/research/developing-computer-use" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human baseline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eyes + hands&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/abs/2404.07972" rel="noopener noreferrer"&gt;OSWorld Paper&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(OSWorld leaderboard as of February 2026. Numbers shift weekly.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Average time per task for AI agents: &lt;strong&gt;10–20 minutes&lt;/strong&gt;. For humans: &lt;strong&gt;30 seconds – 2 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Web Agent Benchmarks
&lt;/h4&gt;

&lt;p&gt;The picture is no better on the web:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Best Agent&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://webarena.dev/" rel="noopener noreferrer"&gt;WebArena&lt;/a&gt; (Controlled)&lt;/td&gt;
&lt;td&gt;IBM CUGA&lt;/td&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.emergentmind.com/topics/webarena-benchmark" rel="noopener noreferrer"&gt;Emergent Mind&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://webarena.dev/" rel="noopener noreferrer"&gt;WebArena&lt;/a&gt; (Controlled)&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;54.8%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://webchorearena.github.io/" rel="noopener noreferrer"&gt;WebChoreArena&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://webchorearena.github.io/" rel="noopener noreferrer"&gt;WebChoreArena&lt;/a&gt; (Hard)&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;37.8%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://webchorearena.github.io/" rel="noopener noreferrer"&gt;WebChoreArena&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;Online-Mind2Web&lt;/a&gt; (Real Web)&lt;/td&gt;
&lt;td&gt;OpenAI Operator&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;ArXiv&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;Online-Mind2Web&lt;/a&gt; (Real Web)&lt;/td&gt;
&lt;td&gt;Most agents&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;ArXiv&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/mind2web/" rel="noopener noreferrer"&gt;Mind2Web&lt;/a&gt; (Task SR)&lt;/td&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;4.52%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/mind2web/" rel="noopener noreferrer"&gt;Mind2Web Eval&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf" rel="noopener noreferrer"&gt;ScreenSpot-Pro&lt;/a&gt; (Pro GUI)&lt;/td&gt;
&lt;td&gt;OS-Atlas-7B&lt;/td&gt;
&lt;td&gt;18.9%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf" rel="noopener noreferrer"&gt;ScreenSpot-Pro&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the pattern: the more realistic the benchmark, the worse the numbers. WebArena (controlled environment): 61.7%. WebChoreArena (harder tasks): 37.8%. Online-Mind2Web (real websites): ~30%. Mind2Web strict task success: &lt;strong&gt;4.52%&lt;/strong&gt;. The ~90% success rates reported on easier benchmarks like WebVoyager &lt;a href="https://arxiv.org/html/2504.01382v4" rel="noopener noreferrer"&gt;collapse under stricter evaluation&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Cost Per Perception
&lt;/h4&gt;

&lt;p&gt;Every screenshot-based agent burns tokens on every glance at the screen:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Tokens per Perception&lt;/th&gt;
&lt;th&gt;Cost per 1,000 Perceptions (Opus)&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (1080p)&lt;/td&gt;
&lt;td&gt;1,200–1,800&lt;/td&gt;
&lt;td&gt;~$4.80&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/vision" rel="noopener noreferrer"&gt;Claude Vision Docs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (1440p)&lt;/td&gt;
&lt;td&gt;2,000–5,000&lt;/td&gt;
&lt;td&gt;~$12.00&lt;/td&gt;
&lt;td&gt;Estimated from resolution scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full a11y tree (JSON)&lt;/td&gt;
&lt;td&gt;5,000–15,000&lt;/td&gt;
&lt;td&gt;~$30.00&lt;/td&gt;
&lt;td&gt;Measured on Claude Desktop (11,454 elements)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell &lt;code&gt;.a11y.snap&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50–200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.40&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell SQL query&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10–50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectShell &lt;code&gt;ds_events()&lt;/code&gt; (delta)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20–50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 50-step workflow at screenshot resolution: ~$0.60 in vision tokens alone. The same workflow via DirectShell: &lt;strong&gt;~$0.005&lt;/strong&gt;. That's a &lt;strong&gt;120x cost reduction&lt;/strong&gt; — before accounting for the eliminated vision model inference.&lt;/p&gt;

&lt;p&gt;Research confirms this gap. &lt;a href="https://arxiv.org/abs/2411.17465" rel="noopener noreferrer"&gt;ShowUI (CVPR 2025)&lt;/a&gt; demonstrated that 33% of screenshot tokens are visually redundant. &lt;a href="https://arxiv.org/abs/2502.14735" rel="noopener noreferrer"&gt;SimpAgent&lt;/a&gt; proved that masking half a screenshot barely affects agent performance — meaning half the tokens were wasted. &lt;a href="https://www.microsoft.com/en-us/research/articles/fara-7b-an-efficient-agentic-model-for-computer-use/" rel="noopener noreferrer"&gt;Microsoft Research noted&lt;/a&gt; that screenshots "consume thousands of tokens each," making history maintenance "computationally prohibitive." &lt;a href="https://www.accessibility.works/blog/do-accessible-websites-perform-better-for-ai-agents/" rel="noopener noreferrer"&gt;Research from accessibility.works&lt;/a&gt; found that agents using accessibility data succeed 85% of the time while consuming 10x fewer resources.&lt;/p&gt;

&lt;h4&gt;
  
  
  What DirectShell Achieved on Day 1
&lt;/h4&gt;

&lt;p&gt;Now compare those numbers to what a single developer built in 8.5 hours:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Tokens Used&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write multi-paragraph manifesto to Notepad&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Instant&lt;/strong&gt; (0ms)&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_text&lt;/code&gt; (UIA ValuePattern)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read entire Claude.ai Haiku conversation&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1 read&lt;/strong&gt; (~2 sec)&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_screen&lt;/code&gt; (zoom-out trick)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-app communication (Claude CLI → Claude.ai)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~60 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_type&lt;/code&gt; (character injection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fill 360 cells in Google Sheets (SOC Incident Log)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~90 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~150&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_batch&lt;/code&gt; + &lt;code&gt;ds_type&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Navigate to Gemini tab + interact&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~10 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ds_key&lt;/code&gt; + &lt;code&gt;ds_type&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No screenshots. No vision model. No coordinate guessing. No 15-minute waiting loops. No 34–72% failure rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The current desktop leader still fails one in three tasks and takes 10–20 minutes each. Most agents fail more than half the time. DirectShell filled 360 spreadsheet cells in 90 seconds — on the first day it existed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Google Sheets demo alone — 30 rows, 12 columns, realistic MITRE ATT&amp;amp;CK mappings, IPs, timestamps, severity levels, analyst assignments, response times — would take a screenshot agent dozens of perception cycles, thousands of tokens per cycle, and multiple minutes with a significant probability of failure mid-way. DirectShell did it in three batch calls, ~90 seconds, zero failures.&lt;/p&gt;

&lt;p&gt;This is not a marginal improvement. This is a different category.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part IV: Why This Changes Everything
&lt;/h1&gt;

&lt;h2&gt;
  
  
  11. The Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;Let me lay this out clearly, because the difference is not gradual. It is categorical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vision vs. Text: A Direct Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Screenshot Agent (2026 SOTA)&lt;/th&gt;
&lt;th&gt;DirectShell&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input to LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2M+ pixel image&lt;/td&gt;
&lt;td&gt;SQL query on local DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM modality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vision (non-native)&lt;/td&gt;
&lt;td&gt;Text (native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inferred from pixel patterns&lt;/td&gt;
&lt;td&gt;Explicit from accessibility tree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Element identification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual inference (probabilistic)&lt;/td&gt;
&lt;td&gt;Name-based lookup (deterministic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coordinate precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Estimated (±pixels)&lt;/td&gt;
&lt;td&gt;Exact (BoundingRectangle from OS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per interaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (vision model inference)&lt;/td&gt;
&lt;td&gt;Low (text only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds (screenshot + cloud inference)&lt;/td&gt;
&lt;td&gt;Milliseconds (local file read)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Robustness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Breaks on theme/scale/language change&lt;/td&gt;
&lt;td&gt;Immune — reads semantic names, not pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disabled state detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cannot reliably detect&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;IsEnabled&lt;/code&gt; property, explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hidden element awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cannot see off-screen elements&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;IsOffscreen&lt;/code&gt; property, full tree via DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-element queries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not possible&lt;/td&gt;
&lt;td&gt;SQL queries in microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (images fill context rapidly)&lt;/td&gt;
&lt;td&gt;Low (structured text is compact)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires cloud vision model&lt;/td&gt;
&lt;td&gt;Local LLM reads local text files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works with&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browsers only (effectively)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Every application on the OS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~35–42% (OSWorld benchmark)&lt;/td&gt;
&lt;td&gt;Deterministic element identification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Any LLM can use it&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — requires multimodal vision&lt;/td&gt;
&lt;td&gt;Yes — any text model works&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The last row is particularly important. Screenshot-based agents require expensive multimodal models (GPT-4o, Claude Sonnet/Opus, Gemini Pro). DirectShell works with &lt;strong&gt;any&lt;/strong&gt; language model — including small, cheap, local models. Llama, Mistral, Phi, DeepSeek, Qwen — if it can read text and produce structured output, it can drive a desktop application through DirectShell.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means Architecturally
&lt;/h3&gt;

&lt;p&gt;The entire AI industry has been framing "computer use" as a &lt;strong&gt;vision problem&lt;/strong&gt;. They built increasingly sophisticated vision-language models to interpret screenshots. They invested in multimodal training data, in spatial reasoning, in coordinate prediction, in action grounding from visual inputs.&lt;/p&gt;

&lt;p&gt;DirectShell reframes "computer use" as a &lt;strong&gt;text problem&lt;/strong&gt;. And text is what language models were built for.&lt;/p&gt;

&lt;p&gt;This is not a better solution to the same problem. This is the realization that the problem was misidentified from the start. The industry was solving "how do we help AI see the screen better?" when the real question was "why are we making AI look at the screen at all?"&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Why This Cannot Be Blocked
&lt;/h2&gt;

&lt;p&gt;This section matters more than any other. DirectShell's technical merits are significant, but what makes it truly unprecedented is that it &lt;strong&gt;cannot be prevented by the targets it operates on&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Legal Framework
&lt;/h3&gt;

&lt;p&gt;The accessibility interface that DirectShell uses is protected by an interlocking network of international, regional, and national legislation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;International:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UN Convention on the Rights of Persons with Disabilities (CRPD)&lt;/strong&gt; — Article 9 (Accessibility), Article 21 (Freedom of expression and access to information). Ratified by &lt;strong&gt;186 states&lt;/strong&gt; — nearly every country on Earth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;European Union:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;European Accessibility Act (EAA)&lt;/strong&gt; — Directive (EU) 2019/882. Requires all consumer-facing digital products and services to be accessible. Enforcement began &lt;strong&gt;June 2025&lt;/strong&gt;. This is active law, not pending legislation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Accessibility Directive&lt;/strong&gt; — Directive (EU) 2016/2102. Requires public sector digital services to meet WCAG 2.1 Level AA, which &lt;strong&gt;explicitly requires programmatic accessibility&lt;/strong&gt; (Success Criterion 4.1.2: Name, Role, Value).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EU Charter of Fundamental Rights&lt;/strong&gt; — Article 26 (Integration of persons with disabilities).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;United States:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Americans with Disabilities Act (ADA)&lt;/strong&gt; — Title III has been interpreted by courts to apply to software and digital services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Section 508 of the Rehabilitation Act&lt;/strong&gt; — Requires federal agencies to procure accessible ICT. Explicitly references WCAG and programmatic accessibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;21st Century Communications and Video Accessibility Act (CVAA)&lt;/strong&gt; — Requires accessibility in advanced communications services and equipment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Germany:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Barrierefreiheitsstärkungsgesetz (BFSG)&lt;/strong&gt; — German transposition of the EAA. In force since June 2025.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behindertengleichstellungsgesetz (BGG)&lt;/strong&gt; — Federal disability equality law.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grundgesetz Article 3(3)&lt;/strong&gt; — Constitutional prohibition of disability discrimination.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What This Means in Practice
&lt;/h3&gt;

&lt;p&gt;The Windows UI Automation framework exists &lt;strong&gt;because the law requires it to exist.&lt;/strong&gt; Applications must expose their interface elements programmatically so that screen readers and other assistive technology can access them.&lt;/p&gt;

&lt;p&gt;DirectShell reads this legally mandated interface. It uses the exact same API calls as JAWS, NVDA, and Windows Narrator. From the operating system's perspective, DirectShell and a screen reader are indistinguishable.&lt;/p&gt;

&lt;p&gt;A software vendor who wishes to prevent DirectShell from reading their application faces an insoluble dilemma: every countermeasure that blocks DirectShell also blocks screen readers.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. The Unpatchability Argument
&lt;/h2&gt;

&lt;p&gt;Let me make this concrete. Here is what a software vendor can try, and what happens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Countermeasure&lt;/th&gt;
&lt;th&gt;Effect on DirectShell&lt;/th&gt;
&lt;th&gt;Effect on Screen Readers&lt;/th&gt;
&lt;th&gt;Legal Consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disable UIA tree entirely&lt;/td&gt;
&lt;td&gt;Blocked&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Blocked&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Violates EAA, Section 508, ADA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return empty/minimal UIA data&lt;/td&gt;
&lt;td&gt;Partially blocked&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Degraded&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Violates WCAG 4.1.2 (Name, Role, Value)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect and block UIA clients&lt;/td&gt;
&lt;td&gt;Blocked&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Also blocked&lt;/strong&gt; (JAWS, NVDA, Narrator)&lt;/td&gt;
&lt;td&gt;Discrimination against disabled users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encrypt UI element names&lt;/td&gt;
&lt;td&gt;Blocked&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Screen readers can't read interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accessibility violation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remove meaningful element names&lt;/td&gt;
&lt;td&gt;Partially blocked&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Screen readers read gibberish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WCAG violation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel-level anti-cheat (block input)&lt;/td&gt;
&lt;td&gt;Action injection blocked (read still works)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;May block assistive input devices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial, read still functions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;There is no technical mechanism to distinguish between a screen reader querying the accessibility layer and DirectShell querying the accessibility layer.&lt;/strong&gt; Both use the same COM interfaces. Both traverse the tree using the same walker objects. Both request the same element properties. The operating system does not authenticate accessibility clients. It cannot. The entire point of the accessibility framework is that &lt;strong&gt;any&lt;/strong&gt; assistive technology can use it.&lt;/p&gt;

&lt;p&gt;This creates a &lt;strong&gt;permanent, legally guaranteed read capability&lt;/strong&gt; against every application that runs on the platform. The only exceptions are applications with no GUI (command-line tools, background services) — which have no UIA tree to read in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  The PR Dimension
&lt;/h3&gt;

&lt;p&gt;Even if a vendor could find a technical loophole, consider the public relations implications: "SAP blocks screen reader access to protect its API revenue." "Salesforce disables accessibility to prevent automation." "Oracle excludes blind users to enforce licensing terms."&lt;/p&gt;

&lt;p&gt;No Fortune 500 company will take that headline. The PR damage alone would be existential. Disability rights organizations would sue. Government contracts would be revoked (Section 508). The EU would fine under the EAA. The entire enterprise sales operation would be jeopardized.&lt;/p&gt;

&lt;p&gt;The legal shield is not just a technicality. It is a structural guarantee that makes DirectShell fundamentally different from every previous automation approach. Web scrapers can be blocked by CAPTCHAs, rate limits, and IP bans. API access can be restricted by authentication and terms of service. But the accessibility layer? It was built to be open. It was mandated to be open. And it will stay open — because the alternative is locking blind people out of computers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Untested Legal Question
&lt;/h3&gt;

&lt;p&gt;I must be honest about one thing: the specific conflict between "our Terms of Service prohibit automated access" and "the law requires us to provide this accessibility interface" has &lt;strong&gt;never been tested in court&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No court has ruled on whether accessibility rights extend to cover automated access via accessibility APIs when the software's TOS prohibits automation. This is legally novel territory.&lt;/p&gt;

&lt;p&gt;But the structural argument is clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In legal hierarchies, statute supersedes contract&lt;/li&gt;
&lt;li&gt;The EAA, ADA, and BFSG are statutes&lt;/li&gt;
&lt;li&gt;Terms of Service are contracts&lt;/li&gt;
&lt;li&gt;The statute mandates the interface. The contract tries to restrict it.&lt;/li&gt;
&lt;li&gt;The statute wins.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And practically: no vendor wants to be the test case. The legal risk is asymmetric. If the vendor wins, they've established a precedent that helps them restrict accessibility APIs — terrible PR, potential regulatory backlash. If the vendor loses, they've wasted legal fees and confirmed that the accessibility layer is untouchable. The incentive structure favors non-litigation.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part V: What DirectShell Enables
&lt;/h1&gt;

&lt;h2&gt;
  
  
  14. For AI Agents
&lt;/h2&gt;

&lt;p&gt;DirectShell converts the problem of "computer use" from a vision task to a text task.&lt;/p&gt;

&lt;p&gt;A language model operating through DirectShell does not need vision capabilities. It reads a structured text file describing the screen state, selects an action, and writes it to a database. The entire perception-action loop is text-in, text-out — the native operating mode of every language model.&lt;/p&gt;
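
&lt;p&gt;A minimal version of that loop fits in a few lines. This is a sketch, not DirectShell code: &lt;code&gt;llm()&lt;/code&gt; stands in for whatever text model you call, the file names are assumptions, and the &lt;code&gt;action|text|target&lt;/code&gt; reply format is just one way to ask the model for a structured answer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3
from pathlib import Path

def llm(prompt):
    """Placeholder: call any text model (local or hosted) and return its reply."""
    raise NotImplementedError

def step(task, out_dir="C:/directshell"):
    # Perception: a few KB of numbered, operable elements
    screen = Path(out_dir, "target.a11y.snap").read_text(encoding="utf-8")

    # Decision: plain text in, plain text out
    reply = llm(f"Task: {task}\nScreen:\n{screen}\n"
                "Answer with exactly: action|text|target")
    action, text, target = (reply.strip().split("|") + ["", ""])[:3]

    # Action: one row in the queue, picked up within 30 ms
    con = sqlite3.connect(str(Path(out_dir, "target.db")))
    con.execute("INSERT INTO inject (action, text, target) VALUES (?, ?, ?)",
                (action, text or None, target or None))
    con.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;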

&lt;p&gt;&lt;strong&gt;Any language model can operate any application.&lt;/strong&gt; Not only expensive multimodal models. GPT, Claude, Gemini, Llama, Mistral, DeepSeek, Phi, Qwen — any model that can read text and produce structured output can drive a desktop application through DirectShell. This democratizes computer use from a capability reserved for frontier models to a capability available to any LLM, including small local models running on consumer hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context efficiency enables complex workflows.&lt;/strong&gt; Where a screenshot-based agent runs out of context after 10–20 actions, a DirectShell-based agent can maintain hundreds of actions in its context window. The &lt;code&gt;.a11y.snap&lt;/code&gt; file is typically 1–5 KB. An equivalent screenshot is 100–500 KB when encoded. This means the agent can maintain 10–30x more operational history, enabling multi-application workflows, long-running processes, and recovery from errors without losing operational memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic targeting eliminates ambiguity.&lt;/strong&gt; "Click the element named 'Save'" is unambiguous. "Click the button that looks like it says Save at approximately pixel (1420, 780)" is not. DirectShell removes the entire class of failures caused by visual misidentification. There are no "hallucinated coordinates." There is a database query that returns the exact element or nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous background monitoring becomes feasible.&lt;/strong&gt; With screenshots, checking "did an email arrive?" costs thousands of tokens and several seconds. With DirectShell, it costs one SQL query and returns in microseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ListItem'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%unread%'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent can check every 500ms. All day. At negligible cost. This enables reactive agents that respond to events in real-time — something that is economically and technically impossible with screenshot-based approaches.&lt;/p&gt;
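
&lt;p&gt;A complete watcher is a handful of lines (a sketch; the database path, the polling interval, and the query columns are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3
import time

QUERY = "SELECT count(*) FROM elements WHERE role='ListItem' AND name LIKE '%unread%'"

def watch(db_path, on_change, interval=0.5):
    """Poll the snapshot DB and fire a callback whenever the unread count changes."""
    last = None
    while True:
        # Fresh connection each tick: DirectShell rebuilds the file every 500 ms
        con = sqlite3.connect(db_path)
        count = con.execute(QUERY).fetchone()[0]
        con.close()
        if last is not None and count != last:
            on_change(count)
        last = count
        time.sleep(interval)

watch("C:/directshell/outlook.db", lambda n: print(f"unread items: {n}"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;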




&lt;h2&gt;
  
  
  15. For Enterprise Software
&lt;/h2&gt;

&lt;p&gt;This is where DirectShell becomes an industry-disrupting force.&lt;/p&gt;

&lt;h3&gt;
  
  
  The End of API Lock-In
&lt;/h3&gt;

&lt;p&gt;The enterprise software industry derives significant revenue from controlling access to application data through proprietary APIs. SAP charges for API access. Salesforce charges per-user per-month for programmatic access. Oracle charges for integration licenses. ServiceNow, Workday, Datev — hundreds of vendors charge for the privilege of accessing data that their customers already own, through interfaces that their customers already pay for.&lt;/p&gt;

&lt;p&gt;The business model is: your data lives in our application, and if you want to access it programmatically, you pay us extra.&lt;/p&gt;

&lt;p&gt;DirectShell offers an alternative. Any data visible in the application's user interface is accessible through the accessibility tree. If a field is displayed on screen, its name and value are in the element tree. If a table is rendered, its rows and columns are traversable. The data does not need to be extracted through the vendor's API — it is already published through a legally mandated accessibility interface that the vendor cannot disable.&lt;/p&gt;

&lt;p&gt;This does not replicate full API functionality. It does not provide bulk data export, webhook-based event triggers, or server-side query optimization. What it provides is &lt;strong&gt;universal read access to any data the application displays to the user&lt;/strong&gt;, and &lt;strong&gt;universal write access to any input the application accepts from the user&lt;/strong&gt;. For the vast majority of automation use cases — filling forms, extracting displayed data, navigating workflows, operating applications — this is sufficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Integration Nightmare, Solved
&lt;/h3&gt;

&lt;p&gt;Every enterprise on Earth has the same problem: System A doesn't talk to System B. SAP doesn't talk to the custom warehouse software from 2004. The hospital management system doesn't talk to the billing software. The CRM doesn't talk to the invoicing tool.&lt;/p&gt;

&lt;p&gt;For this problem, an entire industry exists: MuleSoft (acquired by Salesforce for $6.5 billion), UiPath (multi-billion valuation), Automation Anywhere, Celonis, the entire iPaaS (Integration Platform as a Service) market, middleware vendors, connector vendors, system integrators. Thousands of companies whose sole purpose is to make applications talk to each other.&lt;/p&gt;

&lt;p&gt;DirectShell makes them obsolete. Not in ten years. Now.&lt;/p&gt;

&lt;p&gt;A Python script with 20 lines snaps SAP, snaps Excel, snaps the invoicing system. Reads from one, writes to the others. No API key. No license fee. No vendor conversation. No six-month integration project costing €200,000. Just SQL queries against DirectShell databases and SQL INSERTs into action queues.&lt;/p&gt;
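
&lt;p&gt;A sketch of that bridge (the database paths, element names, and the &lt;code&gt;value&lt;/code&gt; column are assumptions; the pattern is: read from one target's snapshot DB, queue actions in the other's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Assumption: both applications are snapped and each has its own DirectShell DB
sap = sqlite3.connect("C:/directshell/sap.db")
invoicing = sqlite3.connect("C:/directshell/invoicing.db")

# Read a displayed value out of SAP's accessibility tree
row = sap.execute(
    "SELECT value FROM elements WHERE name='Open Amount' LIMIT 1"
).fetchone()

# Write it into the invoicing tool by queuing actions for DirectShell to execute
if row:
    invoicing.execute(
        "INSERT INTO inject (action, text, target) VALUES ('text', ?, 'Amount')",
        (row[0],),
    )
    invoicing.execute("INSERT INTO inject (action, target) VALUES ('click', 'Book')")
    invoicing.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;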

&lt;p&gt;The entire premise of the integration industry — "these systems can't talk to each other, so you need us to bridge them" — dissolves when every system has a universal, structured, non-proprietary interface.&lt;/p&gt;




&lt;h2&gt;
  
  
  16. For Accessibility
&lt;/h2&gt;

&lt;p&gt;The accessibility community should know about DirectShell not just because it uses their infrastructure, but because it extends it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Universal LLM in Every Text Field
&lt;/h3&gt;

&lt;p&gt;Today, AI writing assistance exists in specific applications: Copilot in Microsoft Office, Gemini in Google Workspace, Grammarly in supported browsers and apps. Each integration is built individually by the vendor, for their specific application.&lt;/p&gt;

&lt;p&gt;DirectShell makes it possible to add LLM assistance to &lt;strong&gt;every text field in every application on the planet&lt;/strong&gt;. The keyboard hook intercepts the user's input. A local LLM processes it. The corrected or enhanced text is injected into the application. The application never knows.&lt;/p&gt;

&lt;p&gt;For a person with dyslexia, this means: every input field in every application automatically corrects spelling errors before they appear. Not just in Google Docs, where a spell checker exists. In the 20-year-old hospital information system. In the internal ticketing tool from 2008. In SAP's input masks. Everywhere.&lt;/p&gt;

&lt;p&gt;For a person who speaks one language but needs to write in another: every text field becomes a live translation interface. Type in German, the application receives English. Without the application knowing or cooperating.&lt;/p&gt;

&lt;p&gt;For a person with motor impairments: voice-to-text can be injected into any application, regardless of whether that application supports voice input.&lt;/p&gt;

&lt;p&gt;Grammarly is valued at $13 billion. It works in browsers and in apps that explicitly integrate it. DirectShell could make its core functionality available in every application on the OS — for free, using any local LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Daily Use Case
&lt;/h3&gt;

&lt;p&gt;Imagine this scenario: Lena from accounting needs to write an email to a client about a delayed shipment. She opens Outlook and types into the email body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tell client mueller shipment delayed because of supplier, friendly, apologetic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DirectShell intercepts this. An LLM transforms it into a professional business letter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dear Mr. Mueller,

Thank you for your patience. We regret to inform you that your shipment
(Order #47112) has been delayed due to unforeseen issues with our
primary supplier. We expect delivery within 5-7 business days.

We sincerely apologize for the inconvenience and appreciate your
understanding.

Best regards,
Lena Schmidt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lena didn't open a ChatGPT tab. She didn't copy-paste between applications. She didn't learn any AI tool. She typed what she wanted in her normal email program, and a professional letter appeared. The LLM and DirectShell were invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This works in every application with a text field.&lt;/strong&gt; Not because every application integrated AI. Because DirectShell sits between the keyboard and every application.&lt;/p&gt;




&lt;h2&gt;
  
  
  17. For Legacy Systems
&lt;/h2&gt;

&lt;p&gt;Every government agency, every hospital, every insurance company, every bank has systems from the 1990s or 2000s that hold critical data but have no API, no export function, and no way to extract information except by having a human sit in front of the screen and manually transcribe it.&lt;/p&gt;

&lt;p&gt;These systems often display data on screens that look like green text on black backgrounds — terminal emulators running mainframe sessions, custom Windows forms built in Visual Basic 6, applications from vendors that went bankrupt a decade ago.&lt;/p&gt;

&lt;p&gt;The data trapped inside these systems is critical — patient records, tax records, insurance policies, financial transactions. The digital transformation everyone talks about — the reason organizations spend millions on "modernization" — often boils down to one problem: &lt;strong&gt;getting data out of old systems and into new ones.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DirectShell solves this without touching the old system. The legacy application keeps running as it always has. Snap it. DirectShell reads the accessibility tree and exposes every displayed element as structured data. A Python script iterates through screens, extracting records into a modern database. No reverse engineering. No modification of the legacy application. No risk of breaking a system that nobody understands anymore but everyone depends on.&lt;/p&gt;
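&lt;p&gt;As a sketch of what such a script can look like (the database name and field names below are assumptions for illustration, not part of DirectShell), the extraction loop is a few lines of Python against the &lt;code&gt;elements&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: pull named Edit fields out of a snapped legacy app
# and append them to a CSV. Paths and field names are illustrative only.
import csv
import sqlite3

con = sqlite3.connect("ds_profiles/legacy_his.db")
con.execute("PRAGMA journal_mode=WAL")        # match DirectShell's WAL mode

FIELDS = ["Patient ID", "Last Name", "First Name", "Date of Birth"]

rows = con.execute(
    "SELECT name, value FROM elements WHERE role = 'Edit' AND name IN (%s)"
    % ",".join("?" * len(FIELDS)),
    FIELDS,
).fetchall()
record = {name: value for name, value in rows}

with open("patients.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow([record.get(name, "") for name in FIELDS])

con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Paging to the next record is then just another row in the &lt;code&gt;inject&lt;/code&gt; table (a &lt;code&gt;click&lt;/code&gt; or &lt;code&gt;key&lt;/code&gt; action) between reads.&lt;/p&gt;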

&lt;p&gt;The digital transformation that hasn't happened in 20 years — because nobody can replace the old systems and nobody can extract the data — doesn't need to happen anymore. The data is already accessible. It was always accessible. Through the accessibility layer that the law requires to exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  18. For the Software Industry
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The RPA Market
&lt;/h3&gt;

&lt;p&gt;The global RPA (Robotic Process Automation) market is projected to exceed $80 billion by 2030. UiPath alone has a market capitalization in the billions. Automation Anywhere, Blue Prism, Microsoft Power Automate, WorkFusion — all sell essentially the same thing: the ability to automate applications that don't have APIs.&lt;/p&gt;

&lt;p&gt;Their tools use a combination of accessibility selectors, image matching, coordinate clicking, and OCR. They require per-application scripting. They require specialized training. They require enterprise licenses.&lt;/p&gt;

&lt;p&gt;DirectShell reduces their entire value proposition to a single binary with no external dependencies. Not because DirectShell is a better RPA tool — DirectShell is not an RPA tool at all. It's the infrastructure that makes RPA tools unnecessary. The same way a web browser made dedicated Gopher clients, FTP clients, and Telnet clients unnecessary — not by being a better version of each, but by providing a universal interface that subsumed them all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Cheat Systems
&lt;/h3&gt;

&lt;p&gt;The gaming industry invests heavily in preventing automated input. DirectShell's action queue enables programmatic control of any application, including games. Kernel-level anti-cheat systems (Riot Vanguard, Easy Anti-Cheat, BattlEye) can detect and block certain forms of &lt;code&gt;SendInput&lt;/code&gt; calls — affecting DirectShell's write capability.&lt;/p&gt;

&lt;p&gt;But they cannot block the read capability. Any game that renders UI elements (health bars, minimaps, inventory screens, HUD elements) exposes them through the accessibility tree. Knowing every element on screen — every health value, every minimap position, every inventory item — is arguably more disruptive than the ability to inject input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terms of Service
&lt;/h3&gt;

&lt;p&gt;Many applications prohibit automated access in their Terms of Service. The enforceability of such terms against a tool that uses a legally mandated accessibility interface is untested. The conflict between "our TOS says you can't automate" and "the law says you must provide this interface" creates legal uncertainty that favors the user, not the vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  DRM and Content Protection
&lt;/h3&gt;

&lt;p&gt;Applications that display protected content (e-books, streaming subtitles, licensed data) expose that content through the UIA tree if it is rendered as accessible text. The accessibility requirement creates a structured, text-based output channel for content that may otherwise be protected against copying.&lt;/p&gt;




&lt;h2&gt;
  
  
  19. The 100 Use Cases: What You Can Build
&lt;/h2&gt;

&lt;p&gt;Everything that follows is enabled by a single 700 KB binary and the accessibility infrastructure that already exists on every computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading Out: Data Extraction Use Cases
&lt;/h3&gt;

&lt;p&gt;These use cases involve &lt;strong&gt;extracting information&lt;/strong&gt; from applications that was previously locked behind proprietary GUIs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Real-Time Dashboards from Any Application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your boss wants to know how many tickets are open, what the revenue is today, how many emails are unanswered. Currently: someone logs into three systems and manually builds a report. With DirectShell: snap the ticket system, snap the accounting software, snap Outlook — simultaneously, continuously, in real-time. Live dashboard from applications that never had APIs and never will. The entire BI industry (Tableau, Power BI, Looker) assumes you need database access or API connections. DirectShell only needs an open window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Legacy System Data Liberation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every agency, hospital, and insurance company has systems from the 90s containing critical data with no export function. The only way to get data out: a human sits there and types it into another system. Snap the legacy system. A script reads every screen, every field, every value — structured, queryable, in real-time. The digital transformation that hasn't happened in 20 years doesn't need to happen anymore. The data is accessible through the window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Competitive Intelligence and Price Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every software that displays prices, every platform that lists offers — including desktop applications that don't allow web scraping. Trader terminals. Dealer software. Internal procurement systems. If it's on a screen, DirectShell can read it. Structured. Continuously. Into a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scientific Data Capture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lab instruments whose software was written in 2003 and only displays measurements on screen. No export. No CSV. No API. The doctoral student sits next to it and manually transfers values to Excel. With DirectShell, measurements are captured in real-time, continuously, into a database. The doctoral student sleeps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Quality Assurance Without Source Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You receive delivered software. You want to verify: does it display correct values? Are the calculations right? Currently: manual testing or access to source code. With DirectShell: automated verification of every output, every display, every calculation — without ever opening the source code. Every audit, every certification, every acceptance test becomes automatable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Universal Search Across All Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One search bar. All open applications simultaneously. "Find the invoice from Mueller" — DirectShell searches Outlook, SAP, the file system, the industry software, the browser. At the same time. Structured. Because it has all of them as databases. No Alt-Tab. No five different search masks. One query.&lt;/p&gt;
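&lt;p&gt;A minimal sketch of that single query, assuming several DirectShell instances each writing its own &lt;code&gt;ds_profiles/{app}.db&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: one search term, every snapped application. Assumes one DirectShell
# instance per application, each maintaining its own database.
import pathlib
import sqlite3

def search_everywhere(term):
    hits = []
    for db_path in pathlib.Path("ds_profiles").glob("*.db"):
        con = sqlite3.connect(str(db_path))
        con.execute("PRAGMA journal_mode=WAL")
        for role, name, value in con.execute(
            "SELECT role, name, value FROM elements "
            "WHERE name LIKE ? OR value LIKE ?",
            (f"%{term}%", f"%{term}%"),
        ):
            hits.append((db_path.stem, role, name, value))
        con.close()
    return hits

for app, role, name, value in search_everywhere("Mueller"):
    print(app, role, name, value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;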

&lt;p&gt;&lt;strong&gt;7. Compliance Audit Automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every input in every application, logged. Structured. In a database. "Show me every booking that employee X made in SAP between 2pm and 4pm." The auditor doesn't get PDF reports anymore. They get SQL access to everything that was ever displayed on a screen. Without SAP needing to provide an audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Application Usage Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IT departments can see which software is actually being used, how it's being used, which features are accessed, and which workflows are performed — without installing monitoring agents in the applications themselves. Shadow IT detection becomes trivial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing In: Control and Input Use Cases
&lt;/h3&gt;

&lt;p&gt;These use cases involve &lt;strong&gt;sending input&lt;/strong&gt; to applications to control them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Universal AI Agent Connector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any LLM controls any GUI via text. No screenshots, no vision model, no per-application integration. The AI reads the &lt;code&gt;.a11y.snap&lt;/code&gt;, understands the screen in 5 lines, writes an INSERT to the inject table, and the application responds. This works for any application, any model, any programming language that can open a SQLite file.&lt;/p&gt;
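&lt;p&gt;A minimal sketch of that loop from the agent's side (the model call is stubbed out; the application name and target element are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the agent side: read the text view, decide, write an action.
# decide() is a stand-in for whatever LLM you use.
import sqlite3

def read_screen(app):
    # Compact text view of the UI, refreshed by DirectShell every 500 ms.
    with open(f"ds_profiles/{app}.a11y.snap", encoding="utf-8") as f:
        return f.read()

def act(app, action, text="", target=""):
    con = sqlite3.connect(f"ds_profiles/{app}.db")
    con.execute("PRAGMA journal_mode=WAL")
    con.execute(
        "INSERT INTO inject (action, text, target) VALUES (?, ?, ?)",
        (action, text, target),
    )
    con.commit()
    con.close()

def decide(screen_text):
    # A real agent sends screen_text to a model and parses its reply.
    # Hard-coded here so the sketch stays self-contained.
    return {"action": "click", "text": "", "target": "Settings"}

screen = read_screen("opera")            # whichever app is_active reports
step = decide(screen)
act("opera", step["action"], step["text"], step["target"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;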

&lt;p&gt;&lt;strong&gt;10. Cross-Application Workflow Automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"When an email from Purchasing arrives in Outlook containing 'urgent', extract the order number, open SAP, enter it, and confirm." No human integrated Outlook and SAP. No middleware. No API connection. Snap Outlook. Snap SAP. One reads, one writes. Done. Every workflow that a human performs manually between two programs is automatable. Without the programs knowing about each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;11. Universal LLM in Every Text Field&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every input field in every application becomes LLM-enhanced. Spell correction for dyslexics. Live translation. Auto-formatting. Professional tone transformation. Without the application cooperating. Without the user installing anything per application. One layer, everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12. Application as Frontend Proxy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is one of the most mind-bending use cases. DirectShell can intercept input before it reaches an application and redirect it. The user types in a chat field. DirectShell catches the input before it's sent. It routes the request to a local LLM, a different service, or a custom backend. The response appears in the chat field as if the original application had generated it.&lt;/p&gt;

&lt;p&gt;You're using Claude Desktop as a frontend — but your message never reaches Anthropic's servers. DirectShell intercepted it, processed it locally, and injected the response. The application is a shell. What happens underneath is determined by whoever controls DirectShell.&lt;/p&gt;

&lt;p&gt;Every SaaS application in the world is built on the assumption that the user's input goes to their server. DirectShell breaks that assumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;13. Voice Control for Any Application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add voice input to any application that doesn't support it. Speech-to-text outputs to DirectShell, which types into whatever application is active. No application integration needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14. Forced Copy-Paste&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some applications block Ctrl+C and Ctrl+V in certain fields (DRM, security, "we don't want you copying this"). DirectShell reads the field value through UIA (read path) and can set values through UIA (write path). The copy-paste restriction exists only in the application's keyboard handler. DirectShell bypasses it entirely by operating at the UIA level.&lt;/p&gt;
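&lt;p&gt;A small sketch of the read half, assuming a hypothetical field name; the value comes out of the &lt;code&gt;elements&lt;/code&gt; table whether or not the application's keyboard handler allows Ctrl+C:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: read a field whose copy shortcut is blocked. Field name is illustrative.
import sqlite3

con = sqlite3.connect("ds_profiles/locked_app.db")
con.execute("PRAGMA journal_mode=WAL")
row = con.execute(
    "SELECT value FROM elements WHERE name = ? AND role = 'Edit'",
    ("License Key",),
).fetchone()
con.close()

print(row[0] if row else "field not found")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;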

&lt;p&gt;&lt;strong&gt;15. Macro Recording and Replay&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Record what a user does in any application (every click, every keystroke, every field value change) and replay it later. Not pixel-based macros that break when a button moves — semantic macros that say "click the element named Save" and work regardless of where that button is on screen.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bidirectional: Reading and Writing Combined
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;16. Automated Form Filling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snap Application A. Snap Application B. Read from one, write to the other. No API. No integration middleware. No CSV export/import. Works with any two applications on the planet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;17. Universal Testing Framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snap the application under test. Click this button, verify that field now shows this value. DirectShell reads the expected output and compares it to actual. No test harness inside the application needed. No source code access. Works on compiled binaries, on SaaS apps, on anything with a window.&lt;/p&gt;
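&lt;p&gt;A sketch of a single assertion, with hypothetical element names and the 500 ms refresh cycle as the polling interval:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: click a button, wait for the next tree dump, verify a field's value.
# Element names and the database path are assumptions.
import sqlite3
import time

def ui_assert(db_path, click_target, field_name, expected, timeout=5.0):
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode=WAL")
    con.execute(
        "INSERT INTO inject (action, target, text) VALUES ('click', ?, '')",
        (click_target,),
    )
    con.commit()

    for _ in range(int(timeout / 0.5)):
        time.sleep(0.5)                  # one refresh cycle
        row = con.execute(
            "SELECT value FROM elements WHERE name = ? LIMIT 1",
            (field_name,),
        ).fetchone()
        if row and row[0] == expected:
            con.close()
            return True
    con.close()
    return False

print(ui_assert("ds_profiles/app_under_test.db", "Save", "Status", "Saved"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;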

&lt;p&gt;&lt;strong&gt;18. Data Migration Between Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Moving from one CRM to another? One accounting system to another? Normally this is a six-month project with consultants and custom scripts. Snap the old system. Snap the new one. Read from one, write to the other. Slow compared to API migration, but it works with &lt;strong&gt;any&lt;/strong&gt; source and &lt;strong&gt;any&lt;/strong&gt; target, including systems that have no export capability whatsoever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;19. Real-Time Data Synchronization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keep two applications in sync. Snap both. When a value changes in Application A, DirectShell detects the change (next tree dump), extracts the new value, and writes it into Application B. No middleware. No message queue. No integration platform. Two snapped windows and a simple script.&lt;/p&gt;
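&lt;p&gt;The "simple script" can be as small as this sketch (two snapped applications, illustrative field names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of one-way sync for a single field between two snapped applications.
import sqlite3
import time

src = sqlite3.connect("ds_profiles/app_a.db")
dst = sqlite3.connect("ds_profiles/app_b.db")
for con in (src, dst):
    con.execute("PRAGMA journal_mode=WAL")

last_seen = None
while True:
    row = src.execute(
        "SELECT value FROM elements WHERE name = 'Order Status'"
    ).fetchone()
    if row and row[0] != last_seen:
        last_seen = row[0]
        dst.execute(
            "INSERT INTO inject (action, text, target) VALUES ('text', ?, ?)",
            (row[0], "Status"),
        )
        dst.commit()
    time.sleep(0.5)                      # next tree dump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;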

&lt;p&gt;&lt;strong&gt;20. Regulatory Compliance Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software can be verified from the outside to check whether it displays legally required disclosures, warnings, or information. A regulator doesn't need access to source code — DirectShell reads the production UI and verifies compliance in real-time.&lt;/p&gt;




&lt;h2&gt;
  
  
  20. The Dark Side: What This Also Enables
&lt;/h2&gt;

&lt;p&gt;A primitive is neutral. Like fire. Like the internet. Like cryptography. Like the printing press. Its value and its danger come from the same source: its universality.&lt;/p&gt;

&lt;p&gt;I refuse to pretend the dark side doesn't exist. Acknowledging it before others discover it is how you control the conversation instead of being controlled by it. Here is what DirectShell also makes possible:&lt;/p&gt;

&lt;h3&gt;
  
  
  Surveillance on a New Level
&lt;/h3&gt;

&lt;p&gt;Employee monitoring today works through periodic screenshots (every 5 minutes) or network traffic analysis. Both are coarse-grained.&lt;/p&gt;

&lt;p&gt;DirectShell enables &lt;strong&gt;structured, real-time, queryable surveillance&lt;/strong&gt;. Not screenshots that show a blurry image of what was on screen — a database of every field, every value, every input, every element. "What did Employee X type into the CRM between 14:00 and 16:00?" is a SQL query. "Did anyone access the salary table in SAP today?" is a SQL query. Every application becomes a structured surveillance feed.&lt;/p&gt;

&lt;p&gt;This is employee monitoring on a level that didn't exist before. Not because the technology was particularly difficult — screen recording has existed for decades — but because the output is structured, queryable, and integrable. You don't need a human to watch recordings. You write SQL queries against interaction databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Malware with Structured UI Access
&lt;/h3&gt;

&lt;p&gt;Today's malware can take screenshots and record keystrokes. Both are unstructured — the attacker gets images and character streams that require interpretation.&lt;/p&gt;

&lt;p&gt;DirectShell's architecture enables malware that &lt;strong&gt;understands&lt;/strong&gt; applications structurally. It doesn't record a keystroke stream and hope to find a password — it queries the element tree for password fields and reads their values. It doesn't screenshot a banking app and try OCR — it queries for the account number field, the balance field, the transfer form.&lt;/p&gt;

&lt;p&gt;And it can act: when the banking app is open, structurally identify the transfer form, fill in the attacker's IBAN, enter the amount, and click confirm. Deterministically. Reliably. Without the coordinate-guessing errors that make current automation-based malware unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Credential Harvesting
&lt;/h3&gt;

&lt;p&gt;Any password that is displayed in a UI field (even briefly, even masked with dots) has a corresponding entry in the accessibility tree. Password managers that display credentials in their UI expose those credentials through UIA. "Remember password" dialogs expose the password value. Auto-fill popups expose credentials.&lt;/p&gt;

&lt;p&gt;The read path through the accessibility layer is legally protected and cannot be patched. Any application that displays sensitive information in a UI element is exposing that information to any process on the system that queries the accessibility tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Social Engineering
&lt;/h3&gt;

&lt;p&gt;DirectShell can monitor communication applications (email, chat, messaging) and wait for specific triggers — a wire transfer request, a credentials exchange, an authorization approval. When the trigger appears, it can modify the conversation in real-time: change an IBAN in an email, alter an approval in a workflow, inject a message into a chat. The modification happens at the UI level — below where network-based security tools operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Game Cheating
&lt;/h3&gt;

&lt;p&gt;Any game that renders UI elements (health bars, minimaps, inventory screens, cooldown timers) through the accessibility tree exposes that information to DirectShell. An aimbot doesn't need pixel analysis when enemy positions are in the UIA tree. An inventory manager doesn't need image recognition when item names are text elements.&lt;/p&gt;

&lt;p&gt;Kernel-level anti-cheat can block the write path (input injection) but cannot block the read path without simultaneously blocking screen readers. The information advantage alone — perfect knowledge of every UI element — is a significant cheat even without input automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Ethical Position
&lt;/h3&gt;

&lt;p&gt;I'm publishing this not despite the risks, but because of them. The accessibility layer has existed for 29 years. The capability I'm describing has been latent for 29 years. I am not creating a new vulnerability — I am documenting one that has existed since 1997.&lt;/p&gt;

&lt;p&gt;By publishing openly, I ensure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The security community can develop defenses&lt;/li&gt;
&lt;li&gt;The conversation about accessibility API security happens publicly, not behind closed doors&lt;/li&gt;
&lt;li&gt;Users understand what is possible on their systems&lt;/li&gt;
&lt;li&gt;The response to these risks is informed by understanding, not by surprise&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every significant technology has this dual nature. The printing press enabled mass education and mass propaganda. Cryptography enables privacy and enables crime. The internet enables global communication and enables global surveillance. DirectShell enables universal automation and enables universal access to any application's UI state.&lt;/p&gt;

&lt;p&gt;The question is not whether this capability should exist. It already exists. The question is who understands it first: the people who will use it constructively, or the people who will exploit it destructively.&lt;/p&gt;

&lt;p&gt;I choose to tell everyone at the same time.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part VI: Honest Assessment
&lt;/h1&gt;

&lt;h2&gt;
  
  
  21. Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accessibility Implementation Quality
&lt;/h3&gt;

&lt;p&gt;The UIA tree is only as informative as the application's accessibility implementation. Applications with poor accessibility practices may have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unnamed elements&lt;/strong&gt; — buttons without labels (the accessibility tree shows "Button" with no name)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing roles&lt;/strong&gt; — custom controls reported as "Custom" instead of their functional role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Absent values&lt;/strong&gt; — text fields that don't expose their content programmatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat hierarchies&lt;/strong&gt; — no meaningful parent-child relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canvas-based content&lt;/strong&gt; — games, design tools, PDF viewers, and map applications that render to a canvas may expose limited accessibility data for the rendered content. A game rendering a 3D scene does not describe every visual element in the UIA tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, major applications (Microsoft Office, browsers, SAP GUI, enterprise software subject to Section 508 requirements) have comprehensive accessibility implementations. The trend is toward better accessibility, not worse — driven by EAA enforcement since June 2025 and increasing Section 508 enforcement in the US.&lt;/p&gt;

&lt;p&gt;Smaller or legacy applications may have gaps. The quality of DirectShell's output directly correlates with the quality of the target application's accessibility support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-Application Scope
&lt;/h3&gt;

&lt;p&gt;DirectShell v0.2.0 attaches to one target application at a time. Multi-application workflows require re-snapping between applications. This is an engineering limitation, not an architectural one — the system is designed to extend to multi-window operation with multiple DirectShell instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Boundaries
&lt;/h3&gt;

&lt;p&gt;A full accessibility tree traversal of a complex application (browser with many tabs, IDE with large project) can take 200–800ms. DirectShell's streaming architecture ensures partial data is available during traversal, but extremely complex interfaces may experience slight lag in the refresh cycle.&lt;/p&gt;

&lt;p&gt;The 2 Hz refresh rate means UI changes are detected with up to 500ms latency. For most automation tasks this is imperceptible. For time-critical operations (responding to rapidly changing data), this introduces a half-second delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write-Side Restrictions
&lt;/h3&gt;

&lt;p&gt;Kernel-level anti-cheat systems can detect and block certain forms of &lt;code&gt;SendInput&lt;/code&gt; calls. This affects DirectShell's action capabilities but not its read capability. The read pathway operates through the accessibility framework at a higher abstraction level and cannot be blocked without affecting assistive technology.&lt;/p&gt;

&lt;p&gt;Additionally, some applications that aggressively reject programmatic text input (some chat fields, some security-sensitive inputs) may not respond to &lt;code&gt;ValuePattern.SetValue()&lt;/code&gt;. DirectShell's &lt;code&gt;type&lt;/code&gt; action (raw keyboard injection) works as a fallback in most of these cases, but some edge cases may require application-specific handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  v0.2.0 Bugs
&lt;/h3&gt;

&lt;p&gt;This is version 0.2.0. It was built in 8.5 hours. There are bugs. Formula offset errors in spreadsheets. Chromium tab switching doesn't work via UIA click (the workaround is keyboard shortcuts). Opera's autofill popup can interfere with input injection. Google Search has poor accessibility semantics that limit DirectShell's effectiveness.&lt;/p&gt;

&lt;p&gt;These are first-day bugs that will be fixed. They do not indicate architectural limitations. The architecture is sound. The implementation is iterating.&lt;/p&gt;




&lt;h2&gt;
  
  
  22. What's Missing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP Server Integration
&lt;/h3&gt;

&lt;p&gt;DirectShell currently communicates through the file system: output files are read, SQL is written to the database. The next major step is an MCP (Model Context Protocol) server that exposes DirectShell's capabilities as standardized tool calls, enabling any MCP-compatible LLM agent to use DirectShell natively through structured API calls rather than file I/O.&lt;/p&gt;

&lt;h3&gt;
  
  
  App Profiles
&lt;/h3&gt;

&lt;p&gt;Every application has its own quirks: element naming conventions, navigation patterns, field layouts. Currently, the AI must discover these from scratch each time. App profiles — community-contributed configuration files that describe how to interpret and operate specific applications — will eliminate this bootstrapping cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Character Transformation Middleware
&lt;/h3&gt;

&lt;p&gt;The keyboard hook currently passes through all input unchanged. The architecture is ready for middleware that transforms input in real-time: PII sanitization, auto-translation, spell correction, auto-formatting. The slot is built. The middleware hasn't been written yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Window Support
&lt;/h3&gt;

&lt;p&gt;Operating multiple applications simultaneously requires running multiple DirectShell instances. Coordinated multi-application workflows (read from App A, write to App B) currently require external orchestration. Built-in multi-window support is a planned feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Platform
&lt;/h3&gt;

&lt;p&gt;DirectShell currently targets Windows. Equivalent accessibility frameworks exist on macOS (NSAccessibility/AXUIElement), Linux (AT-SPI2), Android (AccessibilityService), and iOS (UIAccessibility). The architectural pattern — attach, walk tree, store in database, expose action queue — transfers to any platform. The legal protections (EAA, ADA) apply regardless of operating system.&lt;/p&gt;




&lt;h1&gt;
  
  
  Part VII: The Vision
&lt;/h1&gt;

&lt;h2&gt;
  
  
  23. The Network Effect of Configuration
&lt;/h2&gt;

&lt;p&gt;Here is the long-term vision. Today, DirectShell knows how to handle a handful of applications. We are the first users on the planet.&lt;/p&gt;

&lt;p&gt;But every application needs to be learned only &lt;strong&gt;once&lt;/strong&gt;. By &lt;strong&gt;anyone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine an open-source repository: &lt;code&gt;directshell-profiles/&lt;/code&gt;. SAP. Datev. Excel. Outlook. AutoCAD. Bloomberg Terminal. Every industry software. Every legacy system. Every government application.&lt;/p&gt;

&lt;p&gt;Thousands of contributors, each spending 30 minutes documenting their niche application's element structure, navigation patterns, and quirks. Like browser extensions. Like npm packages. Like Docker images.&lt;/p&gt;

&lt;p&gt;Once that repository exists, the bootstrapping cost for any automation drops to zero. You want to automate SAP? The profile exists. You want to read the hospital software from 2006? Someone in a hospital committed the profile three months ago. &lt;code&gt;git pull&lt;/code&gt;, load profile, go.&lt;/p&gt;

&lt;p&gt;And here is what makes profiles fundamentally different from other automation configurations: &lt;strong&gt;they don't break on updates.&lt;/strong&gt; Traditional RPA scripts break when a button moves by 10 pixels. Web scraping scripts break when a CSS class changes. But DirectShell profiles are based on semantic element names and roles. The Save button is still called "Save" after an update. The input field for "Customer Number" still has the role "Edit." The profiles are stable in a way that no pixel-based or DOM-based automation can achieve.&lt;/p&gt;

&lt;p&gt;PowerShell has over 10,000 cmdlets today — not because Microsoft wrote them all, but because the community did. DirectShell profiles are the cmdlets of the frontend. The primitive provides the mechanism. The community provides the knowledge.&lt;/p&gt;

&lt;p&gt;DirectShell doesn't get better because &lt;strong&gt;we&lt;/strong&gt; improve it. It gets better because &lt;strong&gt;everyone who uses it&lt;/strong&gt; improves it. That is the network effect of a primitive.&lt;/p&gt;




&lt;h2&gt;
  
  
  24. Cross-Platform Potential
&lt;/h2&gt;

&lt;p&gt;The architecture is platform-specific in implementation but platform-universal in concept:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Accessibility Framework&lt;/th&gt;
&lt;th&gt;Legal Protection&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UI Automation (UIA)&lt;/td&gt;
&lt;td&gt;ADA, Section 508, EAA, BFSG&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.2.0 — Working&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;macOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NSAccessibility / AXUIElement&lt;/td&gt;
&lt;td&gt;ADA, EAA&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Linux&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AT-SPI2 (Assistive Technology SPI)&lt;/td&gt;
&lt;td&gt;EAA&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Android&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AccessibilityService API&lt;/td&gt;
&lt;td&gt;ADA, EAA&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UIAccessibility&lt;/td&gt;
&lt;td&gt;ADA, EAA&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core pattern — attach to application, walk accessibility tree, store in database, expose action queue — is transferable to any platform. The legal protections apply cross-platform: the EAA covers all digital products in the EU regardless of operating system, and the ADA applies to digital services regardless of platform.&lt;/p&gt;

&lt;p&gt;A cross-platform DirectShell would mean: the same structured interface to every application, on every operating system, on every device. The same automation scripts work on Windows, macOS, and Linux. The same AI agent can operate any application on any platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  25. What Will Actually Happen
&lt;/h2&gt;

&lt;p&gt;I owe you an honest prediction. Not hype. Not best-case fantasy. What will actually happen when this goes live.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weeks
&lt;/h3&gt;

&lt;p&gt;Someone will wrap an MCP server around DirectShell. It will take them an afternoon. After that, any LLM that speaks MCP — Claude, GPT, Gemini, every local model running through LM Studio or Ollama — can operate any application on any Windows machine. Natively. Out of the box.&lt;/p&gt;

&lt;p&gt;This will be the first viral derivative. Not DirectShell itself. The MCP wrapper. Because the headline won't be "new accessibility tool released" — it will be &lt;strong&gt;"I taught my local Llama to operate SAP. It took 20 minutes."&lt;/strong&gt; That Hacker News post will be the ignition point.&lt;/p&gt;

&lt;p&gt;Someone else will build a GUI around it. Someone will build a profile editor. Someone will write the first automation cookbook. The derivatives will multiply faster than DirectShell itself could ever develop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Months
&lt;/h3&gt;

&lt;p&gt;The community will explode. Not because of marketing — because of utility. Every developer who snaps their first application has the same reaction: "Wait, this works with EVERYTHING?"&lt;/p&gt;

&lt;p&gt;A profile repository will emerge. &lt;code&gt;directshell-profiles/&lt;/code&gt; on GitHub. SAP. Datev. Excel. Outlook. AutoCAD. Bloomberg Terminal. Every industry application. Every legacy system. Contributed by thousands of users who each spend 30 minutes documenting their niche application's element structure. Like Docker images. Like npm packages. Like browser extensions.&lt;/p&gt;

&lt;p&gt;Someone will port DirectShell to macOS using NSAccessibility. Someone will port it to Linux using AT-SPI2. The AGPL license ensures every fork stays open. The ecosystem grows in directions I cannot predict or control. That's the point. That's what makes it a primitive and not a product.&lt;/p&gt;

&lt;h3&gt;
  
  
  One Year
&lt;/h3&gt;

&lt;p&gt;Three things happen simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RPA industry contracts.&lt;/strong&gt; UiPath is valued at $7 billion. Automation Anywhere just closed another funding round. Their entire business model is: "We help you automate applications that don't have APIs." That is now a single binary. Not in three years. Now. Their stock prices won't react immediately — but their sales pipeline will dry up. Why pay €50,000 per year for UiPath when an open-source binary does the same thing? The smart ones will pivot to building on top of DirectShell. The slow ones will lobby for regulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API revenue models come under pressure.&lt;/strong&gt; SAP, Salesforce, ServiceNow — they all sell programmatic access to data that is already visible on the screen. DirectShell makes that access free. Not for every use case. Bulk export, webhooks, server-side logic — you still need the API for those. But for "read what's on the screen and enter it somewhere else" — the majority of all enterprise integrations — the business model is dead. Some vendors will try to sabotage their accessibility implementation. They will fail, because the law prevents it. Some will market DirectShell compatibility as a feature. Those are the smart ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security discussion becomes existential.&lt;/strong&gt; Within the first months, a proof-of-concept will surface: malware that uses the accessibility layer to read banking applications. Structured. Reliable. Not patchable. The infosec community will split. One side demands a ban. The other side says: the interface was always open, DirectShell just made it visible. I will be in the middle. The responsible-disclosure section in this paper will be the reason I'm perceived as the person who named the risks — not the person who created them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Will Experience Personally
&lt;/h3&gt;

&lt;p&gt;Job offers. Microsoft Research, Anthropic, Google DeepMind — they'll knock. Not because I built a good tool, but because I saw something their entire teams missed. That's rare. That's valuable.&lt;/p&gt;

&lt;p&gt;Simultaneously: hostility. "Irresponsible." "Dangerous." "Should never have been published." This will come. It belongs to the territory. Every fundamental technology has this phase. The printing press enabled mass education and mass propaganda. The people who condemned Gutenberg are forgotten. The books remain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Won't Be Ignored
&lt;/h3&gt;

&lt;p&gt;Three criteria determine whether a technology persists or fades:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does it work?&lt;/strong&gt; — Verifiably. Download the binary, snap any application, see structured output in 500ms. No demo, no video, no trust required. You verify it yourself in 30 seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does it solve a real problem?&lt;/strong&gt; — The $300 billion screenshot problem. The enterprise integration nightmare. The legacy data prison. The accessibility gap. Real problems. Measured in billions. Felt by millions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is it reproducible?&lt;/strong&gt; — 2,053 lines of Rust. Two dependencies. Single binary. AGPL source code. Any competent developer reads it in an afternoon and understands every line.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Technologies that satisfy all three criteria do not disappear. They sometimes need days, sometimes weeks, sometimes a lucky retweet. But they do not disappear. Because the moment one person verifies it, they tell two people. And those two people verify it themselves. And the chain doesn't break because it's not based on hype — it's based on a binary that does what it claims, every time, on every machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  26. Timeline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1997:&lt;/strong&gt; Microsoft Active Accessibility (MSAA) introduced in Windows 95/98. The accessibility layer begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2001:&lt;/strong&gt; macOS Accessibility introduced. AT-SPI for Linux. The accessibility layer becomes cross-platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2005:&lt;/strong&gt; UI Automation framework introduced in Windows Vista. The modern, complete accessibility API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2019:&lt;/strong&gt; European Accessibility Act adopted (EU 2019/882). Accessibility becomes legally mandated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2023–2025:&lt;/strong&gt; OpenAI, Anthropic, and Google launch screenshot-based computer use agents. Hundreds of billions invested in the wrong approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024:&lt;/strong&gt; Microsoft UFO published — uses UIA as one component in a hybrid agent (not as universal interface).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;June 2025:&lt;/strong&gt; European Accessibility Act enforcement begins. Every consumer-facing digital product must be accessible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;February 16, 2026, 12:00:&lt;/strong&gt; First line of DirectShell code written.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;February 16, 2026, 20:30:&lt;/strong&gt; DirectShell v0.2.0 — first successful multi-application control by an AI agent through the accessibility layer, without screenshots. Four applications operated. 11,454 elements read from a single application. Documented on video.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8.5 hours.&lt;/strong&gt; One person. One AI assistant. 2,053 lines of Rust. Two dependencies. One binary. Zero screenshots.&lt;/p&gt;




&lt;h2&gt;
  
  
  27. Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI industry's current approach to desktop automation — screenshot capture and visual inference — is a workaround for a problem that was already solved. The accessibility layer provides everything that screenshots provide and more: structure, semantics, state, hierarchy, queryability. It provides it faster (milliseconds vs. seconds), cheaper (text vs. images), more reliably (deterministic lookup vs. probabilistic inference), and more efficiently (10–30x fewer tokens per interaction).&lt;/p&gt;

&lt;p&gt;DirectShell makes this layer usable as a universal application interface. It requires no cooperation from software vendors. It works with every application on the platform. And it is protected by the same laws that protect the right of disabled people to use computers — laws that exist in virtually every jurisdiction on Earth and that no software vendor can circumvent without facing legal consequences.&lt;/p&gt;

&lt;p&gt;The technology described in this paper was built in a single session by one developer and one AI agent. The reference implementation is a single compact binary with no external dependencies. The implications extend to every application, every operating system, and every business model that depends on controlling access to graphical interfaces.&lt;/p&gt;

&lt;p&gt;Every other approach in 2026 sends images to text models.&lt;br&gt;
DirectShell sends text to text models.&lt;/p&gt;

&lt;p&gt;That is the entire insight. And it changes everything.&lt;/p&gt;

&lt;p&gt;Snap any app. Read it as text. Control it as text. That's it. That's the primitive.&lt;/p&gt;

&lt;p&gt;The rest is just the world catching up.&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tomorrow, 20:00 — Prior Art Whitepaper + full repository. AGPL. Open Source.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The door was always open. I just looked through it first.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Listen. DirectShell is not perfect. It's Day 1. Literally. There are bugs. There are errors. A hundred things that need to get better. But none of that matters. The first browser couldn't render 90% of web pages correctly. The first lightbulb flickered. Every foundational technology begins empty and broken — because the point was never whether it works perfectly now. The point is what it will make possible tomorrow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The moment a community builds a profile repository — configs for every program on Earth — AI will natively operate every desktop application faster, more efficiently, and more productively than any human ever could. Not in ten years. Not after the next funding round. The infrastructure is here. Today. In 700 kilobytes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google. Microsoft. OpenAI. Anthropic. Call me. Let's talk. Let's revolutionize the world of AI in one stroke.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Peace at last.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And now I'm going to sleep for 12 hours.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— Martin Gehrken, February 17, 2026&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;em&gt;DirectShell v0.2.0&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.thelastrag.de" rel="noopener noreferrer"&gt;dev.thelastrag.de&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;AGPL-3.0 License&lt;/em&gt;&lt;/p&gt;


&lt;h1&gt;
  
  
  Appendix A: Architecture Deep Dive
&lt;/h1&gt;

&lt;p&gt;For developers who want to understand the internals, fork the code, or build on DirectShell, this appendix provides a detailed technical reference.&lt;/p&gt;
&lt;h2&gt;
  
  
  A.1 System Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DirectShell.exe (Win32 GUI, ~700 KB)
├── Main Thread: Message loop, window procedure, painting, timer dispatch
├── Tree Thread (spawned per dump): UIA tree walk, SQLite write, file generation
└── Keyboard Hook: Global low-level keyboard interception (WH_KEYBOARD_LL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  A.2 Dependencies
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rusqlite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bundled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SQLite database (bundled C library, no system dependency)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;windows&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;See feature table below&lt;/td&gt;
&lt;td&gt;Win32 API bindings (windowing, UIA, COM, GDI, input)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;windows&lt;/code&gt; crate features used:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_Foundation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HWND, RECT, BOOL, LRESULT, WPARAM, LPARAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_UI_WindowsAndMessaging&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Window creation, messages, timers, hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_Graphics_Gdi&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GDI painting, brushes, pens, double buffering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_UI_Accessibility&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IUIAutomation, tree walking, element properties&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_System_Com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CoInitializeEx, CoCreateInstance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win32_UI_Input_KeyboardAndMouse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SendInput, virtual key codes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  A.3 Database Schema
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Every UI element = one row, rebuilt every 500ms&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;            &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent_id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;depth&lt;/span&gt;         &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;role&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;automation_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;       &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;offscreen&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Window metadata&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;key&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Action queue (persists across tree rebuilds)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;text&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt;   &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;WAL mode&lt;/strong&gt; is enabled for concurrent read/write access. External processes should also set &lt;code&gt;PRAGMA journal_mode=WAL&lt;/code&gt; when opening the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;elements&lt;/code&gt; table is dropped and recreated on every tree dump&lt;/strong&gt; (every 500ms). This avoids freelist bloat from DELETE operations and ensures a clean state on each cycle. Indices are not recreated during dumps — this is intentional, as indices slow down INSERT operations and the table is rebuilt so frequently that query performance relies on SQLite's efficient sequential scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;inject&lt;/code&gt; table persists across dumps.&lt;/strong&gt; Completed actions remain with &lt;code&gt;done=1&lt;/code&gt;. External processes write new actions; DirectShell reads and executes them.&lt;/p&gt;
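&lt;p&gt;A minimal access recipe for an external process, using only the Python standard library (the database name below is an example; use whatever the active snap is called):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Open the snapped application's database and read the current tree.
import sqlite3

con = sqlite3.connect("ds_profiles/notepad.db")   # example name
con.execute("PRAGMA journal_mode=WAL")            # required for concurrent access

# The elements table is a 500 ms snapshot; query it like any other table.
for role, name, value in con.execute(
    "SELECT role, name, value FROM elements WHERE enabled = 1 AND offscreen = 0"
):
    print(role, name, value)

con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;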
&lt;h2&gt;
  
  
  A.4 External Interface Protocol
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;External Process (e.g., Claude Code CLI Agent)
├── READ:  ds_profiles/is_active        ← Check snap state + discover file paths
├── READ:  ds_profiles/{app}.a11y       ← Understand screen content
├── READ:  ds_profiles/{app}.a11y.snap  ← Identify operable elements
├── READ:  ds_profiles/{app}.snap       ← All interactive elements (for scripts)
├── READ:  ds_profiles/{app}.db         ← Full element tree (SQL queries)
└── WRITE: ds_profiles/{app}.db         ← INSERT INTO inject table (actions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;is_active&lt;/code&gt; file is the entry point. An external agent reads it first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When snapped:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opera
ds_profiles/opera.a11y
ds_profiles/opera.snap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When unsnapped:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Line 1 tells the agent which application is active. Lines 2–3 provide the exact paths to the output files. The agent does not need to guess filenames or scan directories.&lt;/p&gt;
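&lt;p&gt;In practice, the discovery step is a few lines. A sketch (the database path is derived from the same &lt;code&gt;ds_profiles/{app}.db&lt;/code&gt; naming convention shown above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: discover the active snap and its output files from is_active.
from pathlib import Path

def active_snap(profile_dir="ds_profiles"):
    lines = Path(profile_dir, "is_active").read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() == "none":
        return None                          # nothing is snapped
    app = lines[0].strip()
    return {
        "app": app,
        "a11y": lines[1].strip(),            # full text view
        "snap": lines[2].strip(),            # operable elements
        "db": f"{profile_dir}/{app}.db",     # element tree + inject queue
    }

print(active_snap())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;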

&lt;h2&gt;
  
  
  A.5 Action Types (Complete Reference)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  text — UIA ValuePattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Hello World'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Search Box'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Find element by name (&lt;code&gt;target&lt;/code&gt; column) using UIA &lt;code&gt;FindFirst(TreeScope_Descendants)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set focus via &lt;code&gt;IUIAutomationElement::SetFocus()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Try &lt;code&gt;ValuePattern::SetValue()&lt;/code&gt; (native UIA text setting — instant)&lt;/li&gt;
&lt;li&gt;If ValuePattern fails: fall back to &lt;code&gt;SendInput&lt;/code&gt; per character (KEYEVENTF_UNICODE)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  type — Raw Keyboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Hello&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s1"&gt;World&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sends each character as a raw keyboard event with 5ms inter-character delay:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;\t&lt;/code&gt; → VK_TAB&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;\n&lt;/code&gt; or &lt;code&gt;\r&lt;/code&gt; → VK_RETURN&lt;/li&gt;
&lt;li&gt;All others → KEYEVENTF_UNICODE with UTF-16 code point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No element targeting — sends to whatever currently has keyboard focus.&lt;/p&gt;

&lt;h3&gt;
  
  
  key — Key Combinations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'key'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ctrl+shift+s'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supports 150+ keys including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Letters (a–z), Numbers (0–9), Function keys (F1–F12)&lt;/li&gt;
&lt;li&gt;Modifiers (ctrl, alt, shift, win)&lt;/li&gt;
&lt;li&gt;Navigation (enter, tab, escape, backspace, delete, home, end, pageup, pagedown)&lt;/li&gt;
&lt;li&gt;Arrows (up, down, left, right)&lt;/li&gt;
&lt;li&gt;Media (volumeup, volumedown, playpause, nexttrack)&lt;/li&gt;
&lt;li&gt;Numpad (num0–num9, num+, num-, num*, num/, num.)&lt;/li&gt;
&lt;li&gt;Punctuation (semicolon, equals, comma, minus, period, slash, backquote, bracket, backslash, quote)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  click — Element Click
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'click'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Save'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Find element by name using UIA &lt;code&gt;FindFirst(TreeScope_Descendants)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Get &lt;code&gt;BoundingRectangle&lt;/code&gt; → calculate center point&lt;/li&gt;
&lt;li&gt;Convert to absolute screen coordinates (0–65535 range)&lt;/li&gt;
&lt;li&gt;Send MOUSEEVENTF_ABSOLUTE + LEFTDOWN, then LEFTUP via &lt;code&gt;SendInput&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  scroll — Mouse Wheel
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'scroll'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'down'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Directions: &lt;code&gt;up&lt;/code&gt;, &lt;code&gt;down&lt;/code&gt;, &lt;code&gt;left&lt;/code&gt;, &lt;code&gt;right&lt;/code&gt;. One call = one wheel notch (WHEEL_DELTA = 120). Scroll position is at the center of the target window.&lt;/p&gt;
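&lt;p&gt;Because completed rows stay in the &lt;code&gt;inject&lt;/code&gt; table with &lt;code&gt;done=1&lt;/code&gt;, one simple way for an external process to sequence actions is to insert a row and poll its flag before issuing the next one. A sketch (the target names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: insert an action, wait until DirectShell marks it done, continue.
import sqlite3
import time

def run_action(con, action, text="", target="", timeout=5.0):
    cur = con.execute(
        "INSERT INTO inject (action, text, target) VALUES (?, ?, ?)",
        (action, text, target),
    )
    con.commit()
    row_id = cur.lastrowid

    for _ in range(int(timeout / 0.1)):
        done = con.execute(
            "SELECT done FROM inject WHERE id = ?", (row_id,)
        ).fetchone()[0]
        if done == 1:
            return True
        time.sleep(0.1)
    return False

con = sqlite3.connect("ds_profiles/opera.db")
con.execute("PRAGMA journal_mode=WAL")

run_action(con, "click", target="Address field")
run_action(con, "text", text="dev.thelastrag.de", target="Address field")
run_action(con, "key", text="enter")
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;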

&lt;h2&gt;
  
  
  A.6 Role Mapping (UIA ControlType → Human-Readable)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;td&gt;Button&lt;/td&gt;
&lt;td&gt;50020&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50002&lt;/td&gt;
&lt;td&gt;CheckBox&lt;/td&gt;
&lt;td&gt;50021&lt;/td&gt;
&lt;td&gt;ToolBar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50003&lt;/td&gt;
&lt;td&gt;ComboBox&lt;/td&gt;
&lt;td&gt;50023&lt;/td&gt;
&lt;td&gt;Tree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50004&lt;/td&gt;
&lt;td&gt;Edit&lt;/td&gt;
&lt;td&gt;50024&lt;/td&gt;
&lt;td&gt;TreeItem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50005&lt;/td&gt;
&lt;td&gt;Hyperlink&lt;/td&gt;
&lt;td&gt;50025&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50006&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;50026&lt;/td&gt;
&lt;td&gt;Group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50007&lt;/td&gt;
&lt;td&gt;ListItem&lt;/td&gt;
&lt;td&gt;50028&lt;/td&gt;
&lt;td&gt;DataGrid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50008&lt;/td&gt;
&lt;td&gt;List&lt;/td&gt;
&lt;td&gt;50029&lt;/td&gt;
&lt;td&gt;DataItem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50009&lt;/td&gt;
&lt;td&gt;Menu&lt;/td&gt;
&lt;td&gt;50030&lt;/td&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50010&lt;/td&gt;
&lt;td&gt;MenuBar&lt;/td&gt;
&lt;td&gt;50031&lt;/td&gt;
&lt;td&gt;SplitButton&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50011&lt;/td&gt;
&lt;td&gt;MenuItem&lt;/td&gt;
&lt;td&gt;50032&lt;/td&gt;
&lt;td&gt;Window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50012&lt;/td&gt;
&lt;td&gt;ProgressBar&lt;/td&gt;
&lt;td&gt;50033&lt;/td&gt;
&lt;td&gt;Pane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50013&lt;/td&gt;
&lt;td&gt;RadioButton&lt;/td&gt;
&lt;td&gt;50034&lt;/td&gt;
&lt;td&gt;Header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50014&lt;/td&gt;
&lt;td&gt;ScrollBar&lt;/td&gt;
&lt;td&gt;50035&lt;/td&gt;
&lt;td&gt;HeaderItem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50015&lt;/td&gt;
&lt;td&gt;Slider&lt;/td&gt;
&lt;td&gt;50036&lt;/td&gt;
&lt;td&gt;Table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50017&lt;/td&gt;
&lt;td&gt;StatusBar&lt;/td&gt;
&lt;td&gt;50037&lt;/td&gt;
&lt;td&gt;TitleBar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50018&lt;/td&gt;
&lt;td&gt;Tab&lt;/td&gt;
&lt;td&gt;50038&lt;/td&gt;
&lt;td&gt;Separator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50019&lt;/td&gt;
&lt;td&gt;TabItem&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Appendix B: Legal Framework (Full Analysis)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  B.1 The Legal Hierarchy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UN CRPD (186 states, international treaty)
    ↓ binds member states to implement accessibility
EU European Accessibility Act (EU directive)
    ↓ transposed into member state law
German BFSG / French LCAP / etc. (national law)
    ↓ overrides
Software Terms of Service (private contract)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this hierarchy, a contract (Terms of Service) cannot override a statute (BFSG/EAA), which cannot override an international treaty (CRPD). If a TOS says "no automated access" and the law says "you must provide this interface for assistive technology," the law wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  B.2 Why Blocking Is Legally Impossible
&lt;/h2&gt;

&lt;p&gt;The core argument:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disability rights legislation requires software to expose its UI through accessibility APIs&lt;/li&gt;
&lt;li&gt;DirectShell reads those same APIs using the same methods as screen readers&lt;/li&gt;
&lt;li&gt;There is no technical mechanism to distinguish DirectShell from a screen reader&lt;/li&gt;
&lt;li&gt;Blocking DirectShell requires blocking the same interface that screen readers use&lt;/li&gt;
&lt;li&gt;Blocking screen readers violates disability rights legislation in 186 countries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The vendor's only options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep the accessibility interface open → DirectShell works&lt;/li&gt;
&lt;li&gt;Block the accessibility interface → violate the law + exclude blind users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no third option.&lt;/p&gt;

&lt;h2&gt;
  
  
  B.3 Relevant Legislation (Detailed)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;UN CRPD (2006)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Article 9: States Parties shall take appropriate measures to ensure access to information and communications technologies&lt;/li&gt;
&lt;li&gt;Article 21: Freedom of expression and access to information, including through all forms of communication of their choice&lt;/li&gt;
&lt;li&gt;Ratified by 186 states, making it one of the most widely ratified human rights treaties in history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;European Accessibility Act (2019/882)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scope: Computers, operating systems, consumer banking, e-commerce, communication services, e-books, transport&lt;/li&gt;
&lt;li&gt;Requirement: Products must support assistive technologies through standard accessibility APIs&lt;/li&gt;
&lt;li&gt;Enforcement: Since June 28, 2025. Penalties set by member states.&lt;/li&gt;
&lt;li&gt;Relevant Article: Article 4 — "Products shall be designed and produced in such a way as to maximise their foreseeable use by persons with disabilities"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Americans with Disabilities Act (1990)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title III: Public accommodations (interpreted by courts to include digital services)&lt;/li&gt;
&lt;li&gt;Relevant case law: Gil v. Winn-Dixie (2017), Robles v. Domino's Pizza (2019)&lt;/li&gt;
&lt;li&gt;Pattern: Courts increasingly rule that digital accessibility is required under the ADA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Section 508 of the Rehabilitation Act (1973, revised 2018)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scope: Federal agencies must procure accessible ICT&lt;/li&gt;
&lt;li&gt;Standard: WCAG 2.0 Level AA (references programmatic accessibility)&lt;/li&gt;
&lt;li&gt;Impact: Any software vendor selling to US government must be accessible&lt;/li&gt;
&lt;li&gt;This alone covers a massive portion of enterprise software&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;WCAG 2.1 Success Criterion 4.1.2: Name, Role, Value&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"For all user interface components, the name and role can be programmatically determined"&lt;/li&gt;
&lt;li&gt;This is the specific technical requirement that ensures UI elements appear in the accessibility tree with meaningful names and roles&lt;/li&gt;
&lt;li&gt;Referenced by Section 508, EAA, BFSG, and virtually every accessibility standard worldwide&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;German BFSG (2021, enforced 2025)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;German transposition of the EAA&lt;/li&gt;
&lt;li&gt;Applies to all digital products and services offered to consumers in Germany&lt;/li&gt;
&lt;li&gt;Penalties: Up to €100,000 per violation&lt;/li&gt;
&lt;li&gt;Regulatory authority: Bundesnetzagentur&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Appendix C: Benchmark Methodology
&lt;/h1&gt;

&lt;h2&gt;
  
  
  C.1 Token Comparison
&lt;/h2&gt;

&lt;p&gt;Token counts are measured using the &lt;code&gt;tiktoken&lt;/code&gt; tokenizer (cl100k_base encoding, as used by GPT-4; counts for other tokenizers differ slightly):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Token Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (1920×1080, PNG, base64)&lt;/td&gt;
&lt;td&gt;Typical desktop application&lt;/td&gt;
&lt;td&gt;1,200–1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot (2560×1440, PNG, base64)&lt;/td&gt;
&lt;td&gt;High-resolution display&lt;/td&gt;
&lt;td&gt;2,500–5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full UIA dump (JSON)&lt;/td&gt;
&lt;td&gt;Complex application (11,000 elements)&lt;/td&gt;
&lt;td&gt;15,000–25,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DirectShell .a11y&lt;/td&gt;
&lt;td&gt;Screen reader view&lt;/td&gt;
&lt;td&gt;200–800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DirectShell .a11y.snap&lt;/td&gt;
&lt;td&gt;Operable element index&lt;/td&gt;
&lt;td&gt;50–200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DirectShell SQL query result&lt;/td&gt;
&lt;td&gt;Single targeted query&lt;/td&gt;
&lt;td&gt;10–50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
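
&lt;p&gt;The counting method itself is easy to reproduce. A minimal sketch, assuming the text to be measured has been saved to a file (the file name here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the measurement: count cl100k_base tokens for a saved text dump.
# "notepad.a11y" is a hypothetical file name; any of the text inputs in the
# table above could be measured the same way.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("notepad.a11y", encoding="utf-8") as f:
    print(len(enc.encode(f.read())))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;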

&lt;h2&gt;
  
  
  C.2 Latency Comparison
&lt;/h2&gt;

&lt;p&gt;Measured on Windows 11, Intel i7-12700K, 32 GB RAM, local network:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Screenshot Agent (typical)&lt;/th&gt;
&lt;th&gt;DirectShell&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capture screen state&lt;/td&gt;
&lt;td&gt;100–500ms (screenshot + encode)&lt;/td&gt;
&lt;td&gt;N/A (continuous 2 Hz dump)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transmit to model&lt;/td&gt;
&lt;td&gt;500–2000ms (cloud API)&lt;/td&gt;
&lt;td&gt;0ms (local file read)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model inference&lt;/td&gt;
&lt;td&gt;1000–3000ms&lt;/td&gt;
&lt;td&gt;0ms (pre-computed output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parse model response&lt;/td&gt;
&lt;td&gt;50–100ms&lt;/td&gt;
&lt;td&gt;0ms (SQL result is already structured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execute action&lt;/td&gt;
&lt;td&gt;100–300ms (mouse simulation)&lt;/td&gt;
&lt;td&gt;30ms (next inject timer tick)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per action&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2–6 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 100ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  C.3 Success Rate Analysis
&lt;/h2&gt;

&lt;p&gt;Direct comparison is premature — DirectShell v0.2.0 has been tested on a handful of applications in controlled conditions. The OSWorld benchmark numbers cited (66.2% for AskUI VisionAgent, 47.5% for UI-TARS 2, 42.9% for CUA o3) are from standardized, reproducible evaluations.&lt;/p&gt;

&lt;p&gt;However, a structural argument can be made: screenshot-based agents fail because they misidentify elements (clicking the wrong pixel) or because the UI state changes between inference and action. DirectShell eliminates both failure modes. Element identification is deterministic (name-based lookup, not visual inference), and UI state is continuously updated (500ms refresh).&lt;/p&gt;

&lt;p&gt;The remaining failure modes for a DirectShell-based agent are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application has poor accessibility implementation (missing element names)&lt;/li&gt;
&lt;li&gt;The AI makes a reasoning error (wrong action choice, wrong field value)&lt;/li&gt;
&lt;li&gt;The application rejects programmatic input (anti-cheat, security controls)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are real limitations, but they are fundamentally different from — and substantially fewer than — the failure modes of screenshot-based agents.&lt;/p&gt;







&lt;h2&gt;
  
  
  Contact
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discord:&lt;/strong&gt; &lt;a href="https://discord.gg/pMVe7kz2XJ" rel="noopener noreferrer"&gt;Deep Learn — LLM, Research, Open Source and Programming&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email:&lt;/strong&gt; &lt;a href="mailto:iamlumae@gmail.com"&gt;iamlumae@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://dev.thelastrag.de" rel="noopener noreferrer"&gt;dev.thelastrag.de&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Code:&lt;/strong&gt; &lt;a href="https://github.com/IamLumae/DirectShell" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt; &lt;em&gt;(AGPL-3.0)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demo Video:&lt;/strong&gt; &lt;a href="https://youtu.be/nvZobyt0KBg" rel="noopener noreferrer"&gt;Watch the full demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This document is released under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The DirectShell source code is released under the GNU Affero General Public License v3.0 (AGPL-3.0).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Martin Gehrken — February 2026 — dev.thelastrag.de&lt;/em&gt;&lt;/p&gt;

</description>
      <category>disrupt</category>
      <category>breakthrough</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I built an Open Source Deep Research tool which beats Google, OpenAI and Perplexity</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Tue, 03 Feb 2026 15:16:40 +0000</pubDate>
      <link>https://dev.to/tlrag/i-built-an-open-source-deep-research-tool-which-beats-google-openai-and-perplexity-3aa7</link>
      <guid>https://dev.to/tlrag/i-built-an-open-source-deep-research-tool-which-beats-google-openai-and-perplexity-3aa7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv193xoe8iqvi5a8lq235.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv193xoe8iqvi5a8lq235.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
Bold claim? Here's the proof.&lt;/p&gt;

&lt;p&gt;4 days ago I released Lutum Veritas - an open-source Deep Research tool that does one thing differently: It tells the truth.&lt;/p&gt;

&lt;p&gt;The Benchmark Results:&lt;/p&gt;

&lt;p&gt;I ran the same queries through ChatGPT Deep Research, Google Gemini, Perplexity Pro, and Veritas.&lt;/p&gt;

&lt;p&gt;• ChatGPT: Fabricated 4-5 citations that don't exist&lt;br&gt;
  • Gemini: 30-40% incorrect data&lt;br&gt;
  • Perplexity: Surface-level, paywalled ($20/mo)&lt;br&gt;
  • Veritas: 100% verifiable sources, $0.08 per report&lt;/p&gt;

&lt;p&gt;Full benchmark: &lt;a href="https://veritas-test.neocities.org" rel="noopener noreferrer"&gt;https://veritas-test.neocities.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes it different:&lt;/p&gt;

&lt;p&gt;When data doesn't exist, Veritas says "we don't know" instead of making shit up.&lt;/p&gt;

&lt;p&gt;Radical concept, I know.&lt;/p&gt;

&lt;p&gt;The Tech:&lt;br&gt;
  • Camoufox scraper (0% bot detection)&lt;br&gt;
  • Dual-verification pipeline&lt;br&gt;
  • Budget models: Gemini Flash Lite + Qwen 235B&lt;br&gt;
  • Cost: $0.08 per 30-min deep research report&lt;/p&gt;

&lt;p&gt;The Numbers (4 days post-launch):&lt;br&gt;
  • 46 GitHub stars&lt;br&gt;
  • 7.3% conversion rate (industry avg: 1-3%)&lt;br&gt;
  • 630 unique visitors&lt;br&gt;
  • Featured: Hacker News, Product Hunt, DeepLearning.AI&lt;/p&gt;

&lt;p&gt;New: Ask Mode (v1.3.0)&lt;br&gt;
60-second verified answers for $0.0024 each. That's 400 answers for $1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faang3sjuox00m5p3oe4l.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faang3sjuox00m5p3oe4l.gif" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Lesson:&lt;/p&gt;

&lt;p&gt;You don't need billions in VC funding to beat billion-dollar companies. You need the right philosophy:&lt;/p&gt;

&lt;p&gt;Truth &amp;gt; Usefulness.&lt;br&gt;
Evidence &amp;gt; Speculation.&lt;br&gt;
"I don't know" &amp;gt; Hallucination.&lt;/p&gt;

&lt;p&gt;Open source. AGPL-3.0. Runs on your machine.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/IamLumae/Project-Lutum-Veritas" rel="noopener noreferrer"&gt;https://github.com/IamLumae/Project-Lutum-Veritas&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try it. Break it. Tell me I'm wrong.&lt;/p&gt;

&lt;p&gt;I'll wait.&lt;/p&gt;

&lt;p&gt;#AI #OpenSource #DeepResearch #MachineLearning #Innovation #ChatGPT #OpenAI #Google #Perplexity&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why the search for truth can never be worth more than the search to question it.</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Sat, 31 Jan 2026 08:12:19 +0000</pubDate>
      <link>https://dev.to/tlrag/why-the-search-for-truth-can-never-be-worth-more-than-the-search-to-question-it-4dcb</link>
      <guid>https://dev.to/tlrag/why-the-search-for-truth-can-never-be-worth-more-than-the-search-to-question-it-4dcb</guid>
      <description>&lt;p&gt;or&lt;/p&gt;

&lt;p&gt;How I built an open source deep research engine that costs a fraction of what OpenAI, Gemini, and others charge, while delivering significantly better results.&lt;/p&gt;

&lt;p&gt;Greetings, dear LessWrong community, developers, team, and anyone else who is interested.&lt;/p&gt;

&lt;p&gt;This is actually my first real post here, and I hope I live up to all the principles.&lt;/p&gt;

&lt;p&gt;The problem:&lt;/p&gt;

&lt;p&gt;We live in a fast-paced society where the value of knowledge and truth scales exponentially with our technological progress.&lt;br&gt;
And especially in times of AI and fake culture, autonomously generated and factually verified knowledge is becoming increasingly important.&lt;br&gt;
At the same time, we are all exposed to the stress of “effectiveness” and “productivity.” Who still has the time to conduct real in-depth research? To search for and validate information or establish facts? Virtually no one.&lt;br&gt;
And that’s exactly why people use deep research engines. Google, OpenAI, Perplexity, and others offer quick and “easy” ways to conduct deeper searches.&lt;/p&gt;

&lt;p&gt;But do they meet the demands of what we really need? I don’t think so. Here are the reasons:&lt;/p&gt;

&lt;p&gt;Incorrect or hallucinated citations and sources. Tools such as Perplexity throw around long lists of sources that sound good—but when you click on them, you realize they don’t exist or are incorrect in terms of content.&lt;/p&gt;

&lt;p&gt;False promises of security, search quality, and “cost throttling.” All providers make big promises here, but in the background sources are cut or inferior models are used. Only with really expensive subscriptions do you get the full power.&lt;/p&gt;

&lt;p&gt;Functional hallucinations. OpenAI Deep Research in particular repeatedly claims capabilities it does not have, such as generating certain artifacts or using tools it cannot actually call. This does not inspire confidence and unsettles users.&lt;/p&gt;

&lt;p&gt;Gatekeeping of the truth. On the one hand there are subscription constraints; on the other, censorship of content or of sources. A truly open search looks different.&lt;/p&gt;

&lt;p&gt;Lack of transparency in methodology, source utilization, and processing. It’s all well and good that it looks great on the outside, but no one knows what’s really going on. Yet another black box.&lt;/p&gt;

&lt;p&gt;In short: today’s deep research tools are not bad per se. They fill a gap, but they remain far from what people actually need in a research tool.&lt;/p&gt;

&lt;p&gt;Lutum Veritas Research Project -&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmkgtpr283pral55s3ml.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmkgtpr283pral55s3ml.gif" alt=" " width="600" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But then there are always people working in research and development who think, “That’s not enough for me,” and I’m one of them. Martin. From Germany. 37 years old. Stubborn. Self-taught. Career changer in IT.&lt;br&gt;
And that’s exactly how I felt: I want my own software now. And I want to publish it as open source, because truth should not be hidden behind paywalls. And it was clear to me from the start what core ideas my software should represent:&lt;/p&gt;

&lt;p&gt;1) No subscriptions, no paywall – bring your own key, pay only for usage. Done. No ifs, ands, or buts.&lt;/p&gt;

&lt;p&gt;2) A source scraper and search mechanism worthy of the name, one that fetches not just what’s in AI-generated SEO dossiers but also the DIRT of the internet and the ESSENCE. That’s why Lutum Veritas: getting the truth out of the dirt.&lt;/p&gt;

&lt;p&gt;3) No censorship. Search for what you want. And find answers. Without permission or compliance rules.&lt;/p&gt;

&lt;p&gt;4) Open source and as deterministic as possible – transparency by design.&lt;/p&gt;

&lt;p&gt;5) But above all: deeper, more detailed searches with results that go far beyond what the market has to offer to date.&lt;/p&gt;

&lt;p&gt;Self-criticism&lt;/p&gt;

&lt;p&gt;I am NOT claiming that my software is perfect. It isn’t. Nor am I claiming that it beats every other tool in every discipline worldwide. But I am claiming the following: I have built a standalone BYOK open source deep research tool that performs searches for a fraction of the cost of regular subscriptions or API deep research. It offers significantly deeper and more detailed analysis than any other tool. In addition to a regular mode, it has an “academic deep research mode” that provides analysis reports with unprecedented depth and evidence, often reaching over 200,000 characters. And I claim that because of this, and because of the way I have implemented context transfer, it recognizes significantly more “causal relationships” than the big players on the market.&lt;/p&gt;

&lt;p&gt;There will be bugs. There will be things that don’t work perfectly yet. But I’m on it and constantly developing it further.&lt;/p&gt;

&lt;p&gt;But further development requires testers and feedback. And that’s where you come in. I invite every developer, researcher, or anyone who is simply interested to test the software. Challenge it. Challenge me. So that I can make the best of it—on the one hand to meet my own standards, but also to provide the world with a tool that really delivers what it promises.&lt;/p&gt;

&lt;p&gt;My last words? Call me narcissistic if you like. That’s what drives me, but I maintain that&lt;/p&gt;

&lt;p&gt;as of today, I set the bar for deep research software.&lt;/p&gt;

&lt;p&gt;———–&amp;gt; GitHub &lt;a href="https://github.com/IamLumae/Project-Lutum-Veritas" rel="noopener noreferrer"&gt;https://github.com/IamLumae/Project-Lutum-Veritas&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Lutum veritas Research - or how i beat every existing Deep Research Tool</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Fri, 30 Jan 2026 19:34:55 +0000</pubDate>
      <link>https://dev.to/tlrag/lutum-veritas-research-or-how-i-beat-every-existing-deep-research-tool-3np0</link>
      <guid>https://dev.to/tlrag/lutum-veritas-research-or-how-i-beat-every-existing-deep-research-tool-3np0</guid>
      <description>&lt;p&gt;I got tired of waiting for Big Tech to build Deep Research that actually works.&lt;/p&gt;

&lt;p&gt;So I built it myself. Today I'm releasing Lutum Veritas - an open source Deep Research Engine.&lt;/p&gt;

&lt;p&gt;What it does:&lt;br&gt;
Transforms any question into 200,000+ character academic research documents&lt;br&gt;
Recursive pipeline where each research point knows what previous ones discovered&lt;br&gt;
Claim Audit Tables force the model into self-reflection instead of blind assertions&lt;br&gt;
Camoufox scraper cuts through Cloudflare and paywalls with 0% detection rate&lt;/p&gt;

&lt;p&gt;Cost: Under $0.20 per research. OpenAI o3 equivalent: $7.36.&lt;/p&gt;

&lt;p&gt;That's not a typo. 92x cheaper. Deeper output. Full transparency.&lt;/p&gt;

&lt;p&gt;This isn't an "alternative" to Perplexity or ChatGPT. This is proof that a solo dev with the right architecture can beat billion-dollar corporations at what should be their core competency: deep, verifiable knowledge.&lt;/p&gt;

&lt;p&gt;Perplexity, OpenAI and Google deliver summaries. I wanted truth.&lt;/p&gt;

&lt;p&gt;So I stopped waiting and built it myself. The Camoufox scraper cuts through Cloudflare, Bloomberg and paywalls with 0% detection. The recursive pipeline passes context forward – each research point knows what the previous ones discovered. Claim Audits force the model into self-reflection instead of blind assertions.&lt;/p&gt;

&lt;p&gt;The result: 203,000 characters of academic depth for a single query. Cost: under 20 cents. That's orders of magnitude cheaper than OpenAI o3 and qualitatively in a different league.&lt;/p&gt;


&lt;p&gt;The bar for Deep Research is set right here.&lt;/p&gt;

&lt;p&gt;— Martin Gehrken, January 30, 2026&lt;/p&gt;

&lt;p&gt;AGPL-3.0 licensed. Because truth shouldn't be locked behind paywalls.&lt;br&gt;
🔗 GitHub: &lt;a href="https://lnkd.in/dYS32dvM" rel="noopener noreferrer"&gt;https://lnkd.in/dYS32dvM&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ilij100uplgzcnk3ral.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ilij100uplgzcnk3ral.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>The wait is finally over! The Last RAG Beta Arrived</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Fri, 17 Oct 2025 08:57:57 +0000</pubDate>
      <link>https://dev.to/tlrag/the-wait-is-finally-over-the-last-rag-beta-arrived-2bb1</link>
      <guid>https://dev.to/tlrag/the-wait-is-finally-over-the-last-rag-beta-arrived-2bb1</guid>
      <description>&lt;p&gt;After months of intensive development, we’re opening the gates to the Closed Beta of The Last RAG.&lt;/p&gt;

&lt;p&gt;Have you ever wished for an AI partner who doesn’t forget who you are after three sentences? One that truly remembers details, context, and emotions?&lt;/p&gt;

&lt;p&gt;We’ve reimagined AI from the ground up to make that possible — an entity that grows, learns, and evolves with you, forming a genuine partnership instead of acting like a forgetful tool.&lt;br&gt;
The era of digital amnesia is over.&lt;/p&gt;

&lt;p&gt;Be among the first to experience the next evolution of human–machine collaboration.&lt;br&gt;
 Register now for the Beta: &lt;a href="https://dev.thelastrag.de/" rel="noopener noreferrer"&gt;https://dev.thelastrag.de/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Important note: Demand is already extremely high! To ensure the best possible experience for every user, access to the Closed Beta will be granted through invite codes, distributed in stages.&lt;/p&gt;

&lt;p&gt;Sign up on the website to join the waitlist.&lt;/p&gt;

&lt;p&gt;Please be patient — we’re activating new users as quickly as possible.&lt;br&gt;
And don’t forget to check your email regularly (including your spam folder!) for your personal invite code.&lt;/p&gt;

&lt;p&gt;We can’t wait to hear your feedback and build the future of AI together.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The NoChain Orchestrator - Or how to Replace Frameworks</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Wed, 06 Aug 2025 08:26:08 +0000</pubDate>
      <link>https://dev.to/tlrag/the-nochain-orchestrator-or-how-to-replace-frameworks-2p9a</link>
      <guid>https://dev.to/tlrag/the-nochain-orchestrator-or-how-to-replace-frameworks-2p9a</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;NoChain Orchestrator Whitepaper&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Replacing Complex LLM Frameworks with a Deterministic, Memory-Integrated AI Orchestrator&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Executive Summary
&lt;/h2&gt;

&lt;p&gt;Today’s AI developers and innovators face a dilemma: &lt;strong&gt;powerful large language models (LLMs)&lt;/strong&gt; promise transformative applications, yet orchestrating these models in complex workflows has required equally complex frameworks. Tools like LangChain, AutoGPT, BabyAGI, and others enable multi-step reasoning and memory, but at the cost of high complexity, unpredictable behavior, and skyrocketing operational costs&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Is%20Auto" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=What%20to%20do%20if%20your,Find%20out%20in%20this%20Tweet" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. The &lt;strong&gt;NoChain Orchestrator&lt;/strong&gt; is a novel AI architecture designed to resolve these pain points. It introduces a &lt;strong&gt;deterministic, server-side orchestration&lt;/strong&gt; that eliminates the need for “chain”-based frameworks. Instead of relying on an LLM itself to plan tool use or manage memory (as agent frameworks do), NoChain uses clear &lt;strong&gt;hard-coded logic&lt;/strong&gt; on the server to coordinate lightweight, composable LLM prompts. This approach yields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical Depth with Simplicity:&lt;/strong&gt; A robust pipeline (identity, short-term cache, long-term memory, etc.) is built in, so developers don’t have to wire these from scratch. The orchestrator ensures &lt;strong&gt;predictable, repeatable flows&lt;/strong&gt; for each query, avoiding the instability of free-roaming AI agents&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20via%20code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Impact:&lt;/strong&gt; By focusing only on relevant context and using smaller models for support tasks, NoChain slashes token usage – achieving &lt;strong&gt;up to 98% cost reduction&lt;/strong&gt; versus traditional methods&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Discover%20TLRAG%2C%20a%20revolutionary%20AI,Read%20more" rel="noopener noreferrer"&gt;[4]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=A%20comparative%20analysis%20using%20simulated,a%20rapid%20return%20on%20investment" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;. This efficiency, combined with persistent AI memory, unlocks new AI applications (long-term assistants, enterprise knowledge partners) previously deemed infeasible due to memory limits or costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Appeal:&lt;/strong&gt; The architecture is &lt;strong&gt;model-agnostic and modular&lt;/strong&gt;, appealing to full-stack developers seeking integration flexibility. Simultaneously, its ability to turn disposable AI chats into &lt;strong&gt;persistent, personalized AI partners&lt;/strong&gt; (with lower cost of ownership) speaks to investors and business leaders in terms of user retention and competitive moat&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Drastische%20Kostenreduktion%20%28bis%20zu%2094,ist%20propriet%C3%A4r%20und%20nicht%20replizierbar"&gt;[6]&lt;/a&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Kosten,konkrete%20technische%20Umsetzung%20der%20nativen"&gt;[7]&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In summary&lt;/strong&gt;, NoChain Orchestrator bridges the gap between cutting-edge AI capabilities and practical deployment. It brings the logic and clarity of traditional software engineering into the realm of LLM orchestration – delivering the reliability that developers need with the adaptive intelligence that users crave. This paper outlines NoChain’s design, how it diverges from prior architectures (including The Last RAG), and why it stands poised to redefine AI orchestration for the next generation of applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: The Need for a New Orchestration Paradigm
&lt;/h2&gt;

&lt;p&gt;AI agents and LLM-powered applications have exploded in popularity, but so have their &lt;strong&gt;limitations&lt;/strong&gt;. Traditional orchestration frameworks and agents attempt to empower LLMs with tools, memory, and multi-step reasoning, yet each approach encounters serious challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain and Frameworks:&lt;/strong&gt; Libraries like LangChain offer a toolkit to sequence LLM calls and integrate memory or tools. However, they require developers to explicitly &lt;strong&gt;wire up memory, context, and tool usage in code&lt;/strong&gt;&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=,task%20solver"&gt;[8]&lt;/a&gt;. This makes applications heavy and complex, with many abstractions that can be hard to debug or customize. There is no intrinsic “understanding” of the conversation – the developer manually manages how and when to retrieve data or invoke functions. While functional, this approach is essentially a &lt;strong&gt;glue code framework&lt;/strong&gt;, not an AI architecture. It often leads to duplicated effort and potential for mistakes, as each application must reinvent orchestration logic. Moreover, using such frameworks doesn’t inherently solve the &lt;strong&gt;memory problem&lt;/strong&gt; – without special handling, LangChain agents forget past sessions unless explicitly programmed to use databases or summaries. This &lt;strong&gt;lack of built-in long-term memory&lt;/strong&gt; means user experiences remain shallow and repetitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGPT, BabyAGI (Autonomous Agents):&lt;/strong&gt; Autonomous agent projects like AutoGPT and BabyAGI took a different route: letting the &lt;strong&gt;LLM itself control the loop&lt;/strong&gt;. These systems prompt the LLM to plan tasks, call tools, and even self-criticize in iterations. The upside is a form of emergent problem-solving, but the downsides are significant. &lt;strong&gt;Cost and inefficiency are severe:&lt;/strong&gt; AutoGPT may call GPT-4 dozens of times, often using the maximum context each step, leading to runaway costs (e.g. ~\$14 for a 50-step experiment)&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Auto,4" rel="noopener noreferrer"&gt;[9]&lt;/a&gt;. Worse, the agent often gets &lt;em&gt;stuck in loops&lt;/em&gt;, repeating faulty plans with no built-in escape; in practice, users frequently observe AutoGPT &lt;strong&gt;loop endlessly and require manual restarts&lt;/strong&gt;&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=What%20to%20do%20if%20your,Find%20out%20in%20this%20Tweet" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. BabyAGI, while simpler, similarly runs in loops generating and reprioritizing tasks&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=Created%20by%20Yohei%20Nakajima%2C%20BabyAGI,it%20runs%20a%20simple%20loop" rel="noopener noreferrer"&gt;[10]&lt;/a&gt;&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=Don%E2%80%99t%20let%20the%20name%20fool,tool%2C%20not%20an%20AI%20overlord" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;. These agents also &lt;strong&gt;lack robust long-term memory&lt;/strong&gt; – BabyAGI “isn’t production-grade” and has no persistent memory or error recovery&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=%E2%9A%A0%EF%B8%8F%20Limitations" rel="noopener noreferrer"&gt;[12]&lt;/a&gt;. In short, agentic frameworks traded determinism for adaptability, but ended up with brittle, unpredictable systems that rarely justify their cost outside of demos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-Focused Research (MemGPT, etc.):&lt;/strong&gt; Recent research like MemGPT has highlighted the importance of memory and tried to equip LLMs with an OS-like memory hierarchy&lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3#:~:text=modules%20for%20task%20execution" rel="noopener noreferrer"&gt;[13]&lt;/a&gt;. The MemGPT design pattern treats an LLM as an operating system managing RAM and disk – it can dynamically store and retrieve information and even self-edit its memory. This is a promising direction and has been &lt;strong&gt;open-sourced (now evolving into the Letta framework)&lt;/strong&gt;&lt;a href="https://www.letta.com/blog/memgpt-and-letta#:~:text=The%20rapid%20popularity%20of%20the,refer%20to%20the%20agent%20framework" rel="noopener noreferrer"&gt;[14]&lt;/a&gt;&lt;a href="https://www.letta.com/blog/memgpt-and-letta#:~:text=%E2%80%8DIntroducing%20Letta%2C%20the%20company%20we%E2%80%99ve,for%20debugging%20and%20monitoring%20agents" rel="noopener noreferrer"&gt;[15]&lt;/a&gt;. However, such systems are still in early stages: they tend to be complex, and they often rely on the LLM itself to decide when to save or load from memory. In practice, MemGPT/Letta agents support custom tools and long-term storage, but they remain &lt;strong&gt;frameworks that developers must configure and maintain&lt;/strong&gt;, with many moving parts. The orchestration is not “free” – it just happens within a new layer of software. Additionally, frameworks like these and others (e.g. &lt;strong&gt;OpenDevin&lt;/strong&gt; for autonomous coding) introduce significant overhead: OpenDevin, for instance, offers multi-agent coding capabilities but comes with &lt;strong&gt;steep setup and learning curves&lt;/strong&gt;, requiring Docker environments and careful configuration of models and APIs&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=,for%20all%20developers%20or%20applications" rel="noopener noreferrer"&gt;[16]&lt;/a&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=1,ai" rel="noopener noreferrer"&gt;[17]&lt;/a&gt;. These solutions can be powerful in niche domains but may be overkill (or too resource-intensive) for general LLM apps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why NoChain?&lt;/strong&gt; In sum, current solutions either require heavy lifting by developers (LangChain-style) or gamble on an LLM’s emergent planning (AutoGPT-style), or pile on complex memory frameworks. This complexity hits both &lt;strong&gt;productivity and performance&lt;/strong&gt;: development cycles slow down, and runtime costs or latencies spiral out of control. What’s missing is an approach that gives us the &lt;strong&gt;best of both worlds&lt;/strong&gt; – &lt;em&gt;the smart adaptability of an AI agent&lt;/em&gt; with &lt;em&gt;the reliability and clarity of deterministic software&lt;/em&gt;. That is the gap the NoChain Orchestrator fills. By studying these shortcomings, NoChain was conceived to &lt;strong&gt;remove the “chains” altogether&lt;/strong&gt; – no external chain-of-thought, no fragile loops, and no need for a grab-bag framework. Instead, it provides a clean, deterministic orchestration logic that any developer can use to deploy &lt;strong&gt;stateful, cost-efficient AI&lt;/strong&gt; in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the NoChain Orchestrator?
&lt;/h2&gt;

&lt;p&gt;NoChain Orchestrator is a &lt;strong&gt;server-side AI control plane&lt;/strong&gt; that coordinates LLM operations through deterministic logic and carefully designed prompts, instead of through opaque agent reasoning or extensive framework code. In essence, NoChain is an &lt;strong&gt;AI orchestration engine&lt;/strong&gt; that &lt;strong&gt;replaces LangChain, AutoGPT, BabyAGI, etc., with a simpler, faster, and more predictable&lt;/strong&gt; solution. Its key distinguishing characteristics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic Orchestration:&lt;/strong&gt; Every step in the AI’s reasoning process is guided by explicit rules in code (the orchestrator), not left to an LLM’s whims. The orchestrator decides when to retrieve information, when to summarize, when to query the main model, and when to write to memory. This guarantees the process won’t veer off into loops or tangents – a stark contrast to “let the GPT figure it out” approaches. OpenAI’s own research notes that orchestrating via code yields more reliable speed, cost, and performance than letting an LLM control the flow&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20via%20code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;. NoChain embodies this principle fully: it &lt;strong&gt;never delegates orchestration decisions to the LLM&lt;/strong&gt;, it only delegates &lt;em&gt;specific tasks&lt;/em&gt; (like “summarize these points” or “answer the user”) to LLMs. Everything else is handled by straightforward logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight, Composable Prompts:&lt;/strong&gt; Instead of giant monolithic prompts or complex prompt-chains, NoChain uses a &lt;strong&gt;few simple prompt templates&lt;/strong&gt; that get composed as needed. Each prompt has a clear purpose (for example: an &lt;strong&gt;Identity prompt&lt;/strong&gt; that imbues the AI with a consistent persona and agenda, a &lt;strong&gt;Memory retrieval prompt&lt;/strong&gt;, a &lt;strong&gt;Summary prompt&lt;/strong&gt; for composing relevant info, etc.). These pieces are combined into the final query to the main model. This modular design means prompts are &lt;strong&gt;easy to maintain and audit&lt;/strong&gt; – one can adjust the identity or memory format independently without breaking the whole system. It also keeps each LLM call focused and efficient. By separating concerns in prompts, NoChain avoids the “everything including the kitchen sink” prompt that can confuse models. The result is often &lt;strong&gt;improved clarity and coherence&lt;/strong&gt; in responses. (Notably, research on long contexts has found that stuffing a model with too much irrelevance degrades performance – LLMs get &lt;em&gt;“lost in the middle”&lt;/em&gt; of very long inputs&lt;a href="https://www.databricks.com/blog/long-context-rag-performance-llms#:~:text=,question%20answering%2C%20and%20found%20that" rel="noopener noreferrer"&gt;[18]&lt;/a&gt;. NoChain’s compositional prompting prevents this by only supplying &lt;em&gt;highly relevant&lt;/em&gt; context for each query.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beyond TLRAG – A New Role:&lt;/strong&gt; &lt;em&gt;The Last RAG (TLRAG)&lt;/em&gt; was a precursor architecture (already published) that introduced the idea of an AI instance with a &lt;strong&gt;persistent identity and self-curated memory&lt;/strong&gt;. In TLRAG, the model itself took on more responsibility for managing context and deciding what to remember&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=,task%20solver"&gt;[19]&lt;/a&gt;&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=If%20The%20Last%20RAG%20lives,partner%20with%20a%20long%20memory"&gt;[20]&lt;/a&gt;. NoChain Orchestrator builds on the insights of TLRAG but plays a different role. Rather than being an all-in-one “AI that orchestrates itself,” NoChain extracts the orchestration logic into a standalone layer. Think of NoChain as the &lt;strong&gt;conductor&lt;/strong&gt; that ensures the AI (whichever model is used) performs beautifully, every time. This means all the &lt;em&gt;benefits&lt;/em&gt; demonstrated by TLRAG – e.g. constant-time memory costs, never forgetting past interactions, linear growth of token usage&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG%27s%20focused%20context%20approach%20dramatically,systems%20with%20expanding%20context%20windows" rel="noopener noreferrer"&gt;[21]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Break,Turn%207" rel="noopener noreferrer"&gt;[22]&lt;/a&gt; – are achieved &lt;strong&gt;without&lt;/strong&gt; relying on a fragile agent. NoChain provides the structure externally. In short, &lt;strong&gt;TLRAG turned an LLM into a self-driven cognitive agent; NoChain takes that &lt;em&gt;orchestration brain&lt;/em&gt; and offers it as a deterministic service&lt;/strong&gt; for any LLM. This differentiation is crucial: NoChain can work with &lt;em&gt;any&lt;/em&gt; model and in &lt;em&gt;any&lt;/em&gt; application context (it’s not tied to a single AI “persona”), yet it delivers TLRAG-like intelligence through its architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Independence:&lt;/strong&gt; The orchestrator is model-agnostic by design. You can plug in OpenAI’s GPT-4, an open-source Llama2, Anthropic’s Claude, or any other LLM for the main reasoning step. Similarly, the “Composer” used for summarization can be any smaller model or even a rule-based system. There are no hard dependencies on specific libraries or vendors. This flexibility protects investments – as new models emerge, NoChain can incorporate them with minimal changes. By contrast, some frameworks optimize for certain model APIs or require custom wrappers; NoChain treats models as interchangeable &lt;strong&gt;reasoning engines&lt;/strong&gt; behind a stable orchestration API. In practice, this means &lt;strong&gt;future-proofing&lt;/strong&gt; your AI stack: swap out the brain without redesigning the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To put it succinctly, &lt;strong&gt;NoChain Orchestrator is the first orchestration solution that behaves like dependable software rather than experimental AI&lt;/strong&gt;. It brings the AI orchestration under the full control of developers (transparency, debuggability), while still achieving sophisticated multi-step reasoning with memory. We will now dive into the technical architecture to see how this works in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture and Logic: How NoChain Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz03s10h8kn4msc5rv3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz03s10h8kn4msc5rv3i.png" alt=" " width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: High-level flow of the NoChain Orchestrator. Dashed arrows indicate orchestrator-controlled actions (retrieving memories, summarizing, storing data), whereas solid arrows indicate data flowing into the main LLM prompt or out to the user.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At a high level, NoChain orchestrates an LLM through a loop of &lt;strong&gt;Retrieve → Compose → Answer → Learn&lt;/strong&gt; on each interaction. The figure above illustrates the core components and steps, which we describe below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User Query &amp;amp; Short-Term Context:&lt;/strong&gt; A user query comes in (for example: “&lt;strong&gt;User&lt;/strong&gt;: What did I last discuss with our sales agent and what’s next on the agenda?”). The orchestrator first checks the &lt;strong&gt;Short Session Cache (SSC)&lt;/strong&gt; – this is a lightweight memory of the recent dialogue (recent turns within the current conversation/session). The SSC ensures that the immediate context (“what have we just been talking about?”) is always included. It functions like a rolling window or &lt;strong&gt;short-term memory&lt;/strong&gt; buffer of the conversation. By keeping this separate, NoChain can include recent messages without re-uploading an entire conversation history each time. This is efficient and avoids token waste. If the session is new or short, the SSC might be minimal; if it’s longer, only the most relevant recent points are kept (e.g. the last few interactions or any critical information from them).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity Injection (Dynamic Identity Modulation):&lt;/strong&gt; NoChain then adds the &lt;strong&gt;Identity Core&lt;/strong&gt;, sometimes referred to as the AI’s “persona” or “Heart.” This is a persistent description of who the AI is, what it knows, and what it is trying to accomplish. Importantly, NoChain supports a &lt;strong&gt;Dynamic Identity Modulation (DIM) layer&lt;/strong&gt;, meaning the identity can be adjusted or extended based on context &lt;em&gt;without losing the core persona&lt;/em&gt;. For example, the base identity might state: &lt;em&gt;“You are an AI sales assistant named Kai, who has deep knowledge of Company X’s CRM and maintains a friendly, professional tone.”&lt;/em&gt; Dynamic modulation might add situational flavor like &lt;em&gt;“…and currently, you are in a strategy meeting summarizing past events.”&lt;/em&gt; This &lt;strong&gt;layered approach&lt;/strong&gt; lets the AI maintain a consistent character and agenda over time (crucial for user trust and familiarity)&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=aufgebaute%2C%20einzigartige%20und%20pers%C3%B6nliche%20Erinnerungsschatz,Kapital%20soll%20die%20Anmeldung%20sichern"&gt;[23]&lt;/a&gt;, while still adapting to different scenarios or user roles. All of this identity information is compiled into the system prompt of the main LLM &lt;strong&gt;every time&lt;/strong&gt; a query is answered. Because it’s handled by the orchestrator, the identity never “drifts” – it’s not left to the AI to remember its persona; it’s explicitly provided, ensuring &lt;strong&gt;self-consistency&lt;/strong&gt; across interactions&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=showed%20a%20small%20model%20with,designed%20to%20enforce%20this%20consistency"&gt;[24]&lt;/a&gt;. (Notably, mainstream solutions typically have either a fixed, static system prompt or none at all – NoChain’s DIM layer is unique in that it can algorithmically tweak the persona as needed per session while keeping the core intact.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Memory Retrieval:&lt;/strong&gt; Next comes the integration of long-term memory (LTM). NoChain’s orchestrator takes the user’s query and performs a &lt;strong&gt;vector database lookup&lt;/strong&gt; or other retrieval mechanism against the AI’s accumulated knowledge base. This long-term store could be documents, past conversation summaries, knowledge graphs – any data the AI has “learned” or saved. The key is that the orchestrator handles this step &lt;em&gt;outside&lt;/em&gt; the LLM, using traditional search or embedding similarity. For instance, if the user’s question references “our last discussion,” the orchestrator will query the memory store for notes or transcripts from that discussion. This is analogous to Retrieval-Augmented Generation (RAG) but done in a &lt;strong&gt;targeted, minimal way&lt;/strong&gt;. Only the most relevant nuggets of information are fetched (say, the summary of the last sales agent meeting, and the identified next steps from that meeting). These retrieved pieces are not dumped raw into the main prompt; first, they go through the &lt;strong&gt;Composer LLM&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composer LLM (Context Composer):&lt;/strong&gt; The Composer is a supporting LLM (often a smaller, cheaper model) whose job is to &lt;strong&gt;summarize and condense&lt;/strong&gt; the raw retrievals into a succinct “dossier” for the main model&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Intelligentes%20Langzeitged%C3%A4chtnis%20,anderer%20Ansatz%3A%20TLRAG%20ist%20keine"&gt;[25]&lt;/a&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=sichern,Turbo%20%26%20Gemini%20Pro%20Benchmarks"&gt;[26]&lt;/a&gt;. This step is crucial. Rather than burdening the (expensive) main model with possibly lengthy retrieved texts (which could be dozens of pages of logs or documents), a cheaper model (or algorithm) creates a focused summary. For example, if five memory items were retrieved, the Composer might generate a 2-paragraph synopsis: “&lt;em&gt;In the last sales meeting (Aug 1), we discussed Q3 targets and identified that the client was concerned about delivery times. The next steps agreed were: 1) send an updated proposal by Aug 5, 2) schedule a tech demo…&lt;/em&gt;”, and so on. This &lt;strong&gt;significantly reduces token load&lt;/strong&gt; on the main model while preserving relevant details&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=1.%20A%20Stable%20Identity%20%28,load%20on%20the%20main%20model" rel="noopener noreferrer"&gt;[27]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG%27s%20focused%20context%20approach%20dramatically,systems%20with%20expanding%20context%20windows" rel="noopener noreferrer"&gt;[21]&lt;/a&gt;. The composer’s output is then &lt;em&gt;inserted into the main prompt.&lt;/em&gt; We now have a prompt that contains: the identity persona, a brief recap of recent conversation (SSC), the summarized relevant knowledge (from LTM via Composer), and finally the user’s question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main LLM Reasoning:&lt;/strong&gt; With the fully assembled prompt, the orchestrator calls the &lt;strong&gt;main LLM&lt;/strong&gt; to produce the answer. This main model is typically a powerful model (GPT-4, Claude, etc.) capable of nuanced reasoning. Thanks to the orchestrator’s setup, the main LLM is in the best possible position: it sees exactly the information it needs (who it is, what’s been discussed, what known facts are relevant) and nothing extraneous. It can focus all its capacity on answering the user’s query correctly and in context. The response generated is sent back to the user as the &lt;strong&gt;AI’s answer&lt;/strong&gt;. At this point, the user gets their answer, but NoChain’s work isn’t done yet – it’s time to learn from this interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Write (Autonomous Learning):&lt;/strong&gt; After the main LLM produces an answer, the orchestrator evaluates the exchange to see if any new &lt;strong&gt;memories or insights should be saved&lt;/strong&gt;. This step is inspired by the TLRAG concept of &lt;em&gt;autonomous learning&lt;/em&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Patentierbare%20Technologie%20%26%20FTO%3A%20Die,200%20%E2%80%93%20500%20%C3%BCblich"&gt;[28]&lt;/a&gt;. Essentially, the orchestrator checks: did the AI or user say something that &lt;em&gt;should be remembered&lt;/em&gt; for future context? For example, if in answering the question the AI had to reason about a new strategy or the user provided a key piece of feedback (“actually, prioritize product X next quarter”), those could be valuable long-term memories. The orchestrator might pass the conversation through a heuristic or a prompt to determine key points. If any are found, it will store them into the long-term memory store (vector DB or other). This &lt;strong&gt;“Memory Write”&lt;/strong&gt; operation may involve the Composer again (to neatly write a narrative memory) or direct logging of facts. The key is, this happens &lt;em&gt;autonomously&lt;/em&gt; – no developer intervention needed. Over time, the AI builds up a rich tapestry of remembered context, all curated by these deterministic rules. Unlike naive approaches that log entire conversations, NoChain’s learning is &lt;strong&gt;selective&lt;/strong&gt;: only salient, important information is kept&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=positive%20,Jeder%20Turn%20erzeugt%20neuen%2C%20dauerhaften"&gt;[29]&lt;/a&gt;. This keeps the knowledge base lean and relevant, avoiding the clutter (and cost) of storing every trivial interaction.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through these steps, the NoChain orchestrator ensures that each new query is answered with the &lt;strong&gt;benefit of all past relevant knowledge&lt;/strong&gt; but &lt;strong&gt;without carrying unnecessary baggage&lt;/strong&gt;. The cost of each interaction is essentially &lt;strong&gt;bounded&lt;/strong&gt; – it does not grow with conversation length thanks to the dynamic workspace of SSC + Composer summary (a concept proven to yield linear scaling in TLRAG’s analysis&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=A%20comparative%20analysis%20using%20simulated,a%20rapid%20return%20on%20investment" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Break,Turn%207" rel="noopener noreferrer"&gt;[22]&lt;/a&gt;). The deterministic logic guarantees that the process is the same every time: check recent context, inject identity, retrieve needed info, summarize, answer, and learn. This stands in stark contrast to agent-driven loops, where the AI might arbitrarily decide to search the web 10 times or forget to use a tool. NoChain will &lt;strong&gt;always perform the necessary steps&lt;/strong&gt; in the correct order – no steps forgotten, no extraneous steps added.&lt;/p&gt;
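
&lt;p&gt;To make this loop concrete, here is a minimal, illustrative Python sketch of one deterministic turn. It is &lt;em&gt;not&lt;/em&gt; the NoChain implementation: the helper names (memory_store, call_llm, the model labels, the top-k and cache sizes) are hypothetical stand-ins for the components described above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of one deterministic orchestration turn (hypothetical names).
# The steps mirror the pipeline above: retrieve, compose, answer, memory write.

def answer_turn(user_query, identity_prompt, session_cache, memory_store, call_llm):
    # 1. Retrieve a bounded number of relevant long-term memories (vector search).
    memories = memory_store.search(user_query, top_k=5)

    # 2. Composer LLM condenses the raw retrievals into a short dossier.
    dossier = call_llm(
        model="small-composer-model",
        prompt="Summarize these notes for the assistant:\n" + "\n".join(memories),
    )

    # 3. Assemble the focused prompt: identity + recent turns (SSC) + dossier + query.
    prompt = "\n\n".join([
        identity_prompt,
        "Recent conversation:\n" + "\n".join(session_cache[-6:]),
        "Relevant knowledge:\n" + dossier,
        "User question:\n" + user_query,
    ])

    # 4. Main LLM produces the answer from the fully assembled prompt.
    answer = call_llm(model="large-main-model", prompt=prompt)

    # 5. Memory write: keep only salient insights, decided by a fixed rule, not the agent.
    note = call_llm(
        model="small-composer-model",
        prompt="If this exchange contains a fact worth remembering long-term, "
               "state it in one sentence; otherwise reply NONE.\n"
               "Q: " + user_query + "\nA: " + answer,
    )
    if note.strip().upper() != "NONE":
        memory_store.add(note)

    session_cache.extend(["User: " + user_query, "AI: " + answer])
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point of the sketch is the shape, not the details: every call site is fixed in code, so the sequence can be unit-tested and audited like any other service.&lt;/p&gt;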

&lt;h3&gt;
  
  
  Deep Memory Integration Without Frameworks
&lt;/h3&gt;

&lt;p&gt;One of the standout aspects of NoChain is &lt;strong&gt;deep memory integration sans heavy frameworks&lt;/strong&gt;. In other words, you get sophisticated memory capabilities &lt;em&gt;without&lt;/em&gt; needing LangChain or external memory libraries explicitly in your code – the orchestrator’s design inherently provides it. To appreciate this, consider what happens in mainstream usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a typical LangChain application, if you want memory beyond the context window, you’d use a “Memory” component (like ConversationBufferMemory or a custom vector store retriever). The developer must instantiate this, configure how it’s used each turn, etc. It’s &lt;em&gt;optional&lt;/em&gt; and external to the LLM’s core functionality – essentially a plugin. If mis-configured, the AI might not see older info at all.&lt;/li&gt;
&lt;li&gt;With NoChain, memory (both short and long-term) is &lt;strong&gt;not optional&lt;/strong&gt;; it’s a foundational part of the architecture. Every single query triggers a memory retrieval and summary by design. This means the AI &lt;em&gt;always&lt;/em&gt; has access to relevant past information, and the developer doesn’t have to write a single line for it – it’s in the orchestrator’s DNA. The deep integration here refers to how the memory is woven into the prompt via SSC and Composer, as opposed to tacked on. Notably, this integration is done &lt;strong&gt;framework-free&lt;/strong&gt;: you aren’t calling an external LangChain memory.load() or vector DB client manually in your app code – the orchestrator handles it under the hood. This results in a &lt;strong&gt;clean separation of concerns&lt;/strong&gt;: your application logic can remain simple (just send user queries and deliver answers), while NoChain manages the complex memory dance behind the scenes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, NoChain’s memory logic is &lt;em&gt;framework-free&lt;/em&gt; in the sense that it doesn’t impose a new library or DSL you must use. If you want to customize how memory is stored or retrieved, you can do so with standard tools (swap out the vector DB, adjust retrieval similarity thresholds, etc.) – you’re not locked into a proprietary interface. The orchestration is deterministic but &lt;strong&gt;configurable in its parameters&lt;/strong&gt;.&lt;/p&gt;
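
&lt;p&gt;As a rough illustration of what “configurable in its parameters” can look like in practice, the snippet below defines a small configuration object. The field names and defaults are our own assumptions for the sake of the example, not an official NoChain interface.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical configuration for the memory layer (illustrative field names).
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    vector_store: str = "chroma"          # swappable backend (local or hosted)
    embedding_model: str = "any-embedding-model"
    top_k: int = 5                        # memories retrieved per turn
    similarity_threshold: float = 0.75    # discard weak matches below this score
    ssc_turns: int = 6                    # turns kept in the short session cache

# Tighten retrieval without touching the orchestration logic itself.
config = MemoryConfig(top_k=3, similarity_threshold=0.8)
&lt;/code&gt;&lt;/pre&gt;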

&lt;h3&gt;
  
  
  Flow Control and Self-Correction
&lt;/h3&gt;

&lt;p&gt;Because NoChain’s orchestration is deterministic, one might wonder: does it sacrifice adaptability? The answer is &lt;em&gt;no&lt;/em&gt; – rather, it enforces a controlled form of adaptability. The orchestrator can include conditional branches and logic checks; for example, if the retrieved memory is insufficient or the user asks something completely novel, the orchestrator might decide to call a fallback tool (maybe an external API or a web search) as part of its deterministic plan. These are analogous to “if-else” in code – predetermined responses to certain conditions. This is far safer than an agent spontaneously deciding to call tools in arbitrary ways. It’s &lt;strong&gt;deterministic adaptability&lt;/strong&gt;.&lt;/p&gt;
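
&lt;p&gt;A sketch of such an “if-else” branch is shown below. The scoring helper, the threshold, and the web_search fallback are illustrative assumptions; the point is that the fallback is a fixed, bounded step in code rather than a choice the model makes on its own.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative deterministic branch: fall back to one bounded external search
# only when retrieved memory is clearly insufficient (hypothetical helpers).

def gather_context(query, memory_store, web_search, min_score=0.7):
    hits = memory_store.search(query, top_k=5)
    good_hits = [h for h in hits if h.score &gt;= min_score]

    if good_hits:
        return "\n".join(h.text for h in good_hits)

    # Predetermined fallback: a single external call, then the fixed flow continues.
    results = web_search(query, max_results=3)
    return "\n".join(results)
&lt;/code&gt;&lt;/pre&gt;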

&lt;p&gt;Additionally, NoChain allows for &lt;strong&gt;self-correction loops&lt;/strong&gt; in a bounded way. For instance, after the main LLM answers, the orchestrator could evaluate the answer (possibly with another LLM or rules) to see if it’s good. If not, it could adjust the prompt or retrieve more info and try again – but crucially, this is done in a controlled loop with a clear exit condition (e.g. one retry, or until certain criteria are met). This addresses scenarios where the first attempt fails, without devolving into infinite loops. It’s akin to having a unit test for the answer and a bug-fix cycle, but all automated. Such patterns make the system &lt;strong&gt;robust&lt;/strong&gt;: it won’t blindly present a poor answer if it can catch an obvious issue (for example, “I don’t know that” when it’s in memory – the orchestrator can detect that and re-inject the info). This gives confidence for enterprise use where reliability is paramount.&lt;/p&gt;
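
&lt;p&gt;A bounded self-correction pass of this kind could be sketched as follows; the checker prompt and the single-retry policy are illustrative assumptions, not NoChain’s actual rules.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative bounded retry: at most one correction pass, with a clear exit condition.

def answer_with_check(prompt, context, call_llm, max_retries=1):
    answer = call_llm(model="large-main-model", prompt=prompt)

    for _ in range(max_retries):
        verdict = call_llm(
            model="small-composer-model",
            prompt="Context:\n" + context +
                   "\n\nAnswer:\n" + answer +
                   "\n\nDoes the answer ignore facts present in the context? YES or NO.",
        )
        if verdict.strip().upper() == "NO":
            break  # the answer already uses the retrieved facts; stop here
        # Re-inject the overlooked context and try exactly once more.
        answer = call_llm(
            model="large-main-model",
            prompt=prompt + "\n\nUse these facts explicitly:\n" + context,
        )
    return answer
&lt;/code&gt;&lt;/pre&gt;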

&lt;p&gt;In summary, the NoChain architecture takes the promising ideas of memory, identity, and tool use from recent AI research and implements them with &lt;strong&gt;classic software engineering discipline&lt;/strong&gt;. The result is an AI orchestration pipeline that is as rigorous and testable as any backend service, yet produces outcomes as intelligent and rich as an autonomous AI agent. We next examine how these claims hold up by comparing NoChain to existing solutions and highlighting empirical results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unique Benefits and Differentiators
&lt;/h2&gt;

&lt;p&gt;NoChain Orchestrator’s design yields several &lt;strong&gt;distinct benefits&lt;/strong&gt; that set it apart from any previous orchestration framework or agent. Below we list the key differentiators and the value they bring, backed by evidence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dramatic Cost Efficiency:&lt;/strong&gt; By replacing expansive context windows and repetitive model calls with focused prompts, NoChain slashes token consumption. Empirical tests (500-turn simulated dialogue) showed up to &lt;strong&gt;98% reduction in total tokens&lt;/strong&gt; used compared to a standard RAG baseline&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=A%20comparative%20analysis%20using%20simulated,a%20rapid%20return%20on%20investment" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Break,Turn%207" rel="noopener noreferrer"&gt;[22]&lt;/a&gt;. In concrete terms, a long-running conversation that would consume ~347 million tokens with a naive approach can be handled with ~6 million tokens using NoChain’s strategy&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Architecture%20Context%20Window%20Total%20Tokens,Turn%207" rel="noopener noreferrer"&gt;[30]&lt;/a&gt;. This translates directly to cost savings. Importantly, the ROI is achieved early in the interaction: &lt;strong&gt;break-even against standard RAG after ~7 queries, and against even a large 128k-context LLM after ~31 queries&lt;/strong&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG,Turn%207" rel="noopener noreferrer"&gt;[31]&lt;/a&gt;. The &lt;strong&gt;cost per query remains nearly constant&lt;/strong&gt; as conversations grow, unlike traditional methods where cost explodes exponentially over time&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Kostenersparnis%20bis%20%E2%88%9298%20%25%3A%20TLRAG,das%20berechnete%20Spreadsheet%20liegen%20im"&gt;[32]&lt;/a&gt;. For businesses, this means scalable deployments without fear of runaway API bills or needing to truncate valuable conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-Agnostic, Future-Proof Design:&lt;/strong&gt; NoChain is independent of any single LLM vendor or architecture. It treats the LLM as a pluggable component – today you might use GPT-4, tomorrow a local Llama2 70B, later something like GPT-5 – without redesigning the orchestration. Competitors like OpenDevin also advertise multi-backend support&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=LLM%20Backends" rel="noopener noreferrer"&gt;[33]&lt;/a&gt;, but often with heavy configuration overhead. NoChain requires only an adapter for the model API; the rest of the logic doesn’t change. This independence also extends to memory stores (can use any vector DB) and the Composer model. You are not locked into an ecosystem. In fast-moving AI environments, this flexibility is vital for longevity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Long-Term Memory (No “Amnesia”):&lt;/strong&gt; The orchestrator’s native memory integration ensures the AI never suffers from the dreaded “digital amnesia” – forgetting prior context after a few turns or a reset&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Digitale%20Amnesie%3A%20Nach%20kurzer%20Zeit,Tuning%20des%20gesamten%20Modells"&gt;[34]&lt;/a&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=2,einem%20unersetzlichen%20Begleiter%20mit%20einem"&gt;[35]&lt;/a&gt;. Every interaction builds the AI’s knowledge. Users can come back after days, and the AI will recall relevant details from past sessions (e.g. “last week you mentioned X concern, here’s an update…”). This &lt;strong&gt;deepens user engagement and trust&lt;/strong&gt;. It’s a moat: once a user has an AI that truly &lt;em&gt;remembers them&lt;/em&gt;, they are far less likely to switch to another product&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=%C3%9Cberlegene%20Nutzerbindung%20,Alle%20Unterlagen%20f%C3%BCr"&gt;[36]&lt;/a&gt;. Traditional chatbots lose context quickly or rely on huge prompts that are expensive – NoChain’s memory approach elegantly sidesteps both issues, delivering a personalized, context-rich experience at low cost. From a technical view, it &lt;strong&gt;eliminates the need for fine-tuning&lt;/strong&gt; for new knowledge – the system learns on the fly, continuously, avoiding costly retraining cycles&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=Furthermore%2C%20today%27s%20LLMs%20lack%20true,learn%2C%20and%20grow%20with%20use"&gt;[37]&lt;/a&gt;&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=utilize%20the%20information%20effectively,systems%20use%20memory%20as%20a"&gt;[38]&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic Yet Intelligent Control:&lt;/strong&gt; Unlike agent frameworks that can behave unpredictably, NoChain is &lt;strong&gt;reliable by design&lt;/strong&gt;. The sequence of operations is deterministic, which means it’s testable and debuggable. One can write unit tests for the orchestrator logic, something nearly impossible with, say, AutoGPT’s dynamic plans. Yet, thanks to the clever prompt engineering and memory, the outcomes are highly intelligent. In effect, NoChain &lt;strong&gt;yields the intelligence of an agent with the dependability of a scripted program&lt;/strong&gt;&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20via%20code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;. This is a breakthrough for deploying AI in production, where uncontrolled AI “improvisation” is often a risk. Predictability also aids in &lt;strong&gt;compliance and governance&lt;/strong&gt; – you know exactly what external calls or data accesses the AI will do each turn, helping meet regulations and privacy requirements (NoChain can be configured to only search certain data, etc., and it won’t spontaneously go out-of-bounds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Identity and Personalization:&lt;/strong&gt; The Dynamic Identity Modulation (DIM) layer means an AI built with NoChain can &lt;strong&gt;possess a stable “personality”&lt;/strong&gt; that grows over time. It’s not just a stateless assistant that anyone could replicate; it becomes &lt;em&gt;your&lt;/em&gt; AI with its own story and relationship to you. From a business perspective, this drives incredibly strong user retention – users feel they have a unique AI partner. TLRAG highlighted how an &lt;strong&gt;organically growing identity&lt;/strong&gt; creates an emotional bond and high switching costs&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Kosten,konkrete%20technische%20Umsetzung%20der%20nativen"&gt;[7]&lt;/a&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=%C3%9Cberlegene%20Nutzerbindung%20,Alle%20Unterlagen%20f%C3%BCr"&gt;[36]&lt;/a&gt;. NoChain enables this in a controlled way: the AI’s core persona persists, but can be tuned to context (e.g. more formal in a work meeting, casual in a personal chat). Competing systems typically have either a fixed persona or try prompt tricks that are not robust. NoChain’s approach is systematic, making the AI &lt;strong&gt;consistently play the long game&lt;/strong&gt; of relationship-building rather than just solving one query at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear Logic = Faster Iteration:&lt;/strong&gt; For developers, the benefit of NoChain’s clear logic is faster development and easier maintenance. Need to add a new tool (say a calculator or database query) to the AI’s capabilities? In a LangChain or agent setup, you’d integrate the tool via the framework and hope the agent learns to use it. With NoChain, you can &lt;strong&gt;insert a deterministic step&lt;/strong&gt; (“if question is about math, call calculator API, then feed result into prompt”) – done (see the sketch after this list). This is straightforward and doesn’t require guessing how an AI will react. Essentially, NoChain is &lt;strong&gt;dev-friendly&lt;/strong&gt;: it uses familiar programming constructs to orchestrate advanced AI behavior. Businesses can integrate AI without hiring a “Prompt Engineer” army; their existing full-stack developers can handle it. This &lt;strong&gt;lowers the barrier to entry&lt;/strong&gt; for complex AI features.&lt;/li&gt;
&lt;/ul&gt;
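
&lt;p&gt;As an example of such a deterministic tool step (see the last point above), the sketch below routes math-looking questions through a calculator before the prompt is sent. The regular expression and the evaluate helper are hypothetical; any safe arithmetic evaluator would do.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative deterministic tool step: if the question looks like arithmetic,
# call a calculator first and feed the result into the prompt (hypothetical helper).
import re

def maybe_use_calculator(question, evaluate, base_prompt):
    expression = re.search(r"[\d][\d\s+*/().-]{2,}", question)
    if expression and any(op in question for op in "+-*/"):
        result = evaluate(expression.group())   # e.g. a sandboxed arithmetic evaluator
        return base_prompt + "\nCalculator result: " + str(result)
    return base_prompt
&lt;/code&gt;&lt;/pre&gt;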

&lt;p&gt;Each of these benefits is not just theoretical – they have been observed in prototypes and benchmarked against existing solutions. NoChain Orchestrator proves that we don’t have to accept the trade-off between &lt;em&gt;intelligence&lt;/em&gt; and &lt;em&gt;control&lt;/em&gt;. We can have both, and the strategic advantages are enormous: lower costs, better user experience, faster deployment, and competitive defensibility through unique AI behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Competitive Benchmarking
&lt;/h2&gt;

&lt;p&gt;To truly appreciate NoChain’s strengths, it’s helpful to see how it stacks up against the incumbent orchestration solutions in specific areas. Below is a comparison of NoChain with key alternatives, highlighting differences in architecture and performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain (and similar frameworks):&lt;/strong&gt; &lt;em&gt;Orchestration Style:&lt;/em&gt; External code-based chaining, requiring devs to assemble sequences and manage state. &lt;em&gt;NoChain:&lt;/em&gt; Also uses code logic, but far less code – the orchestration is built-in and does not require stitching together components for each app. &lt;em&gt;Memory:&lt;/em&gt; LangChain has no intrinsic long-term memory (developers must add a vector store module manually). In fact, LangChain’s approach to memory is essentially prompting the LLM with past messages from a buffer or summary – a feature that &lt;em&gt;the developer&lt;/em&gt; must implement or configure. By contrast, &lt;strong&gt;NoChain intrinsically incorporates memory retrieval and summary every turn&lt;/strong&gt;, no extra implementation needed. As noted in an independent analysis, frameworks like LangChain demand &lt;strong&gt;manual wiring of memory systems&lt;/strong&gt;, whereas a unified architecture (like TLRAG/NoChain) bakes these decisions into the system’s design&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=knowledge%20in%20a%20flexible%2C%20transparent%2C,task%20solver"&gt;[39]&lt;/a&gt;. &lt;em&gt;Complexity:&lt;/em&gt; LangChain’s abstraction can become a double-edged sword – many find it confusing when trying to customize beyond basic use cases. NoChain avoids deep abstraction layers; the flow is transparent (reviewable like you’d review any algorithm). &lt;em&gt;Performance:&lt;/em&gt; LangChain’s overhead is minimal, but the patterns it enables (like agent loops) can inherit the inefficiencies of those agents. NoChain’s deterministic single-loop per query is generally more efficient and easier to optimize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGPT &amp;amp; BabyAGI:&lt;/strong&gt; &lt;em&gt;Orchestration Style:&lt;/em&gt; LLM-driven planning loops (the agent decides what to do next). &lt;em&gt;NoChain:&lt;/em&gt; Code-driven fixed loop (the LLM is only used for specific tasks, not decision-making). The fundamental difference is &lt;strong&gt;autonomy vs. guided automation&lt;/strong&gt;. AutoGPT is autonomous to a fault – it can spiral, repeat steps, or pursue irrelevant subgoals. NoChain is guided and &lt;strong&gt;can’t spiral out&lt;/strong&gt;, because it won’t take extra actions not in its code. &lt;em&gt;Memory:&lt;/em&gt; AutoGPT uses a short-term memory (it stores some info in prompts or files between iterations), but it’s shallow – usually limited to the last working notes or using an external vector store rudimentarily (“Save important info to files” is literally one of its default instructions&lt;a href="https://github.com/Significant-Gravitas/Auto-GPT/issues/2726#:~:text=auto,If" rel="noopener noreferrer"&gt;[40]&lt;/a&gt;). BabyAGI by default has no persistent memory beyond task lists&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=Don%E2%80%99t%20let%20the%20name%20fool,tool%2C%20not%20an%20AI%20overlord" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;. NoChain, on the other hand, employs a &lt;strong&gt;Short Session Cache and a true long-term memory store&lt;/strong&gt;, giving it both conversational continuity and cumulative learning. &lt;em&gt;Performance:&lt;/em&gt; As mentioned, AutoGPT is extremely resource-hungry – one analysis points out &lt;em&gt;each step&lt;/em&gt; maxing out tokens leads to untenable costs in practice&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Auto,4" rel="noopener noreferrer"&gt;[41]&lt;/a&gt;&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=This%20cost%20can%20quickly%20add,it%20can%20be%20widely%20adopted" rel="noopener noreferrer"&gt;[42]&lt;/a&gt;. It also runs slowly due to the iterative self-feedback. NoChain’s single-pass approach (with occasional brief second-pass for summary) is far cheaper and faster for the same tasks. &lt;em&gt;Reliability:&lt;/em&gt; AutoGPT is infamous for getting stuck (looping on similar ideas with no progress)&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=What%20to%20do%20if%20your,Find%20out%20in%20this%20Tweet" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;. NoChain cannot get stuck in that way – it executes a finite sequence deterministically. In essence, NoChain achieves what those agents &lt;em&gt;hope&lt;/em&gt; to achieve (multi-step reasoning with tool use) but in a reliable scripted manner. It trades a bit of open-ended flexibility for &lt;strong&gt;massive gains in stability&lt;/strong&gt;, which for real-world use is a winning trade-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BabyAGI vs NoChain (specific):&lt;/strong&gt; BabyAGI is often described as a toy example – ~150 lines of code to showcase task management with an LLM&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=BabyAGI%20is%3A" rel="noopener noreferrer"&gt;[43]&lt;/a&gt;. It’s great for education, but &lt;em&gt;“not production-grade…no long-term memory, no error recovery”&lt;/em&gt; by the author’s own admission&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=%E2%9A%A0%EF%B8%8F%20Limitations" rel="noopener noreferrer"&gt;[12]&lt;/a&gt;. NoChain is a production-grade system from the ground up, with robust memory and error handling (self-correction). The only thing BabyAGI might do that NoChain doesn’t by default is &lt;em&gt;prioritize tasks dynamically&lt;/em&gt;. But in NoChain’s paradigm, task prioritization would just be an explicit logic if needed (for example, one could implement an agent that plans a set of subtasks using NoChain by orchestrating multiple LLM calls in a row, still deterministically). So, anything BabyAGI does can be recreated within NoChain’s deterministic framework, but not vice versa (BabyAGI can’t suddenly gain long-term memory unless heavily modified).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MemGPT / Letta:&lt;/strong&gt; This is the closest conceptual competitor, as MemGPT’s goal is also to &lt;strong&gt;give LLMs memory and an orchestration layer&lt;/strong&gt;&lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3#:~:text=modules%20for%20task%20execution" rel="noopener noreferrer"&gt;[13]&lt;/a&gt;&lt;a href="https://www.letta.com/blog/memgpt-and-letta#:~:text=%E2%80%8DIntroducing%20Letta%2C%20the%20company%20we%E2%80%99ve,for%20debugging%20and%20monitoring%20agents" rel="noopener noreferrer"&gt;[15]&lt;/a&gt;. The difference lies in implementation. MemGPT (now part of Letta) uses an agentic pattern: the LLM is augmented with memory tools and &lt;em&gt;it&lt;/em&gt; decides when to use them. It’s like equipping the AI with functions (SAVE(x), LOAD(y)) that it can call in its own chain-of-thought. This indeed can lead to very powerful behavior (and academic demos show LLMs that manage their own memory bank). However, it still fundamentally relies on the LLM’s &lt;em&gt;emergent&lt;/em&gt; decision-making. It tries to teach the LLM to be an operating system. NoChain does not ask the LLM to be an OS; &lt;strong&gt;NoChain is the OS&lt;/strong&gt; that the LLM just cooperates with. This yields more predictable outcomes. &lt;em&gt;Complexity:&lt;/em&gt; MemGPT’s open-source framework has grown to support many features (tools, custom memory classes), which is great for flexibility but could be considered heavyweight for someone who just wants their AI to remember things. Letta (the platform from the MemGPT creators) is targeting enterprise agent deployments with lots of bells and whistles, whereas NoChain is relatively lean – it’s focused on the core loop of memory and reasoning without excessive framework overhead. &lt;em&gt;Benchmarking:&lt;/em&gt; As MemGPT is a research project, public benchmarks are limited, but their philosophy is that memory improves reasoning significantly (which aligns with NoChain’s results). NoChain’s empirical cost and coherence benefits corroborate many points from MemGPT’s paper (e.g., that &lt;strong&gt;LLMs need structured memory for extended tasks&lt;/strong&gt;&lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3#:~:text=Imagine%20an%20AI%20system%20that,to%20artificial%20intelligence%20memory%20systems" rel="noopener noreferrer"&gt;[44]&lt;/a&gt;). Where NoChain would differ is ease of use and determinism in outcome (likely making it easier to meet strict latency SLAs and to debug issues).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenDevin and Specialized Agents:&lt;/strong&gt; OpenDevin is an open-source variant of a coding agent (originally “Devin”) focusing on software development tasks. It combines an LLM with an IDE-like environment to autonomously write and modify code. Compared to NoChain: OpenDevin is &lt;em&gt;highly specialized&lt;/em&gt; (it’s basically an AI coder assistant). It includes many moving parts like a Docker sandbox, environment variable configs, etc.&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=To%20get%20started%20with%20OpenDevin%2C,to%20meet%20certain%20prerequisites%2C%20including" rel="noopener noreferrer"&gt;[45]&lt;/a&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=LLM%20Backends" rel="noopener noreferrer"&gt;[33]&lt;/a&gt;. NoChain is general-purpose – it could be used to build a coding agent, a customer support agent, a personal tutor, anything. In terms of architecture, OpenDevin’s core loop still relies on the agent paradigm (the AI “thinking” steps about code). NoChain could potentially orchestrate coding as well by structuring prompts (e.g., have a static plan: read spec → write function → run tests → debug), which might actually avoid pitfalls current coding agents face. Also, as noted earlier, OpenDevin has some adoption friction: &lt;em&gt;complex configuration and a steep learning curve&lt;/em&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=,for%20all%20developers%20or%20applications" rel="noopener noreferrer"&gt;[16]&lt;/a&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=1,ai" rel="noopener noreferrer"&gt;[17]&lt;/a&gt;, whereas NoChain aims to be plug-and-play for devs. One notable advantage OpenDevin advertises is compatibility with many model providers – which NoChain matches and even simplifies (since no special integration is needed beyond an API key).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Others (HuggingGPT, Microsoft Jarvis, etc.):&lt;/strong&gt; These orchestrators use an LLM to decide how to route requests to a network of expert models (vision, speech, etc.). They are somewhat orthogonal in focus – aimed at multimodal orchestration. NoChain could actually serve as the deterministic backbone beneath such systems: e.g., rather than letting GPT-4 decide which expert to call next (HuggingGPT’s approach), one could have NoChain logic that parses a user request and calls the appropriate tool or model by rules, then feeds results back. The general point: NoChain’s methodology could enhance reliability in any system where &lt;strong&gt;an LLM is currently calling the shots&lt;/strong&gt;. By moving those decisions into code, you reduce the chance of error and gain traceability&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20multiple%20agents%20,of%20speed%2C%20cost%20and%20performance" rel="noopener noreferrer"&gt;[46]&lt;/a&gt;&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=While%20orchestrating%20via%20LLM%20is,Common%20patterns%20here%20are" rel="noopener noreferrer"&gt;[47]&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, in competitive terms, &lt;strong&gt;NoChain Orchestrator doesn’t just incrementally improve on existing frameworks – it proposes a fundamentally different paradigm.&lt;/strong&gt; It &lt;strong&gt;replaces “opaque AI decision-making” with “transparent AI assistance”&lt;/strong&gt;. As one reviewer put it: frameworks like LangChain are toolkits, whereas The Last RAG/NoChain is an &lt;em&gt;out-of-the-box architecture&lt;/em&gt; that handles memory and orchestration for you&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=knowledge%20in%20a%20flexible%2C%20transparent%2C,task%20solver"&gt;[39]&lt;/a&gt;. The implications are significant: using NoChain can make several layers of the typical AI tech stack obsolete. You don’t need a separate memory manager, you don’t need an agent loop controller, you don’t need to write verbose prompts for tools – it’s all orchestrated in a clean loop. This is a &lt;strong&gt;paradigm shift&lt;/strong&gt; from thinking of AI integration as stitching components, to treating it as deploying a &lt;em&gt;single intelligent orchestration engine&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;From a business perspective, fewer components and frameworks also mean fewer points of failure and easier compliance. Many companies have been hesitant to deploy AutoGPT-like agents due to their unpredictability and difficulty to audit. NoChain flips that narrative: it’s deterministic enough to &lt;strong&gt;validate and verify&lt;/strong&gt;. One can demonstrate compliance (e.g., the AI will never call an external API not on this approved list, because it’s not in the code to do so; an agent-based system could hallucinate an API call). This will resonate strongly with enterprise buyers and regulators.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Call to Action
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving AI landscape, the NoChain Orchestrator emerges as a &lt;em&gt;timely breakthrough&lt;/em&gt; – a solution that addresses the core limitations hindering AI’s next leap forward. By marrying the &lt;strong&gt;cognitive prowess of LLMs&lt;/strong&gt; with the &lt;strong&gt;determinism of traditional software&lt;/strong&gt;, NoChain defines a new category of AI architecture: &lt;em&gt;one that is at once deeply intelligent and deeply reliable&lt;/em&gt;. We have shown how it overcomes the industry’s chronic issues of forgetfulness, high costs, and brittle frameworks. NoChain doesn’t incrementally patch the old paradigm; it &lt;strong&gt;reimagines&lt;/strong&gt; the orchestration layer entirely – hence the name “NoChain,” signaling freedom from the chain-of-calls mentality.&lt;/p&gt;

&lt;p&gt;For full-stack developers, NoChain offers a powerful abstraction that &lt;strong&gt;simplifies development&lt;/strong&gt; even as it delivers more functionality. It’s a strategic shortcut: you no longer need to glue together multiple libraries for memory, prompting, and tool-use – the orchestrator handles it. This means faster prototyping and faster iteration to get AI features in your apps. It also means maintainability: your codebase remains clean and focused on business logic, not tangled in AI state management. In short, NoChain lets you &lt;strong&gt;focus on &lt;em&gt;what&lt;/em&gt; your AI should do, not &lt;em&gt;how&lt;/em&gt; to manage the AI’s mind&lt;/strong&gt; – the “mind” is pre-built and ready to go.&lt;/p&gt;

&lt;p&gt;For business leaders and investors, the implications are equally compelling. NoChain architecture can be the cornerstone of &lt;strong&gt;truly differentiated AI products&lt;/strong&gt;. An AI built with these principles isn’t a disposable chatbot; it’s a persistent digital teammate that learns and improves over time, creating &lt;strong&gt;compounding value&lt;/strong&gt; and user loyalty. The cost savings directly improve margins and make high-value use cases viable (e.g., long-term consulting agents, personalized education AIs) where they previously would have broken the budget. Early adopters of NoChain can achieve capabilities rivals might take millions of dollars of R&amp;amp;D to match – because currently, those rivals are stuck either scaling up model size (expensive and diminishing returns) or tinkering with agent experiments. NoChain is a leapfrog opportunity: it &lt;strong&gt;skips the needless arms race&lt;/strong&gt; of bigger models or longer contexts, and instead uses smarter orchestration to get more out of existing models&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Large%20Language%20Models%20,efficient%20AI%20systems" rel="noopener noreferrer"&gt;[48]&lt;/a&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG%20transforms%20stateless%20LLMs%20into,intelligent%2C%20focused%20approach%2C%20enabled%20by" rel="noopener noreferrer"&gt;[49]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We invite &lt;strong&gt;early adopters, partners, and investors&lt;/strong&gt; to join us in realizing the NoChain vision. Whether you are a developer eager to build the next killer app on this architecture, or an organization seeking to supercharge your AI offerings, or an investor recognizing the paradigm shift at hand – there is a role for you in this journey. Our roadmap includes an open SDK and reference implementations, enterprise integrations, and continued R&amp;amp;D (e.g., exploring how NoChain can orchestrate across multiple specialist models collaboratively). By partnering with us early, you can gain &lt;strong&gt;exclusive access&lt;/strong&gt; to pilot programs, influence the feature set to best fit your needs, and secure a competitive edge in your domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call to Action:&lt;/strong&gt; We are currently seeking collaborations for pilot projects in key domains (such as customer service automation, knowledge management, and personal AI companions) to demonstrate NoChain’s full potential in real-world settings. If you’re a visionary team or investor excited by what you’ve read, &lt;strong&gt;let’s connect&lt;/strong&gt;. Together, we can push the boundaries of what AI can do – turning today’s “smart tools” into tomorrow’s &lt;strong&gt;indispensable partners&lt;/strong&gt;, all powered by the clarity and power of NoChain Orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(For inquiries about partnerships, early access to the NoChain platform, or a deeper technical demo, please reach out via our LinkedIn or official website. We look forward to collaborating on shaping the future of AI orchestration.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Is%20Auto" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; &lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=What%20to%20do%20if%20your,Find%20out%20in%20this%20Tweet" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; &lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Auto,4" rel="noopener noreferrer"&gt;[9]&lt;/a&gt; &lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=Auto,4" rel="noopener noreferrer"&gt;[41]&lt;/a&gt; &lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/#:~:text=This%20cost%20can%20quickly%20add,it%20can%20be%20widely%20adopted" rel="noopener noreferrer"&gt;[42]&lt;/a&gt; Auto-GPT: Understanding its Constraints and Limitations&lt;/p&gt;

&lt;p&gt;&lt;a href="https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/" rel="noopener noreferrer"&gt;https://autogpt.net/auto-gpt-understanding-its-constraints-and-limitations/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20via%20code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt; &lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=Orchestrating%20multiple%20agents%20,of%20speed%2C%20cost%20and%20performance" rel="noopener noreferrer"&gt;[46]&lt;/a&gt; &lt;a href="https://openai.github.io/openai-agents-python/multi_agent/#:~:text=While%20orchestrating%20via%20LLM%20is,Common%20patterns%20here%20are" rel="noopener noreferrer"&gt;[47]&lt;/a&gt; Orchestrating multiple agents - OpenAI Agents SDK&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.github.io/openai-agents-python/multi_agent/" rel="noopener noreferrer"&gt;https://openai.github.io/openai-agents-python/multi_agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Discover%20TLRAG%2C%20a%20revolutionary%20AI,Read%20more" rel="noopener noreferrer"&gt;[4]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=A%20comparative%20analysis%20using%20simulated,a%20rapid%20return%20on%20investment" rel="noopener noreferrer"&gt;[5]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG%27s%20focused%20context%20approach%20dramatically,systems%20with%20expanding%20context%20windows" rel="noopener noreferrer"&gt;[21]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Break,Turn%207" rel="noopener noreferrer"&gt;[22]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=1.%20A%20Stable%20Identity%20%28,load%20on%20the%20main%20model" rel="noopener noreferrer"&gt;[27]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Architecture%20Context%20Window%20Total%20Tokens,Turn%207" rel="noopener noreferrer"&gt;[30]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG,Turn%207" rel="noopener noreferrer"&gt;[31]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=Large%20Language%20Models%20,efficient%20AI%20systems" rel="noopener noreferrer"&gt;[48]&lt;/a&gt; &lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems#:~:text=TLRAG%20transforms%20stateless%20LLMs%20into,intelligent%2C%20focused%20approach%2C%20enabled%20by" rel="noopener noreferrer"&gt;[49]&lt;/a&gt; Revolutionizing AI: The Last RAG Architecture | Kite Metric&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems" rel="noopener noreferrer"&gt;https://kitemetric.com/blogs/revolutionizing-ai-the-last-rag-architecture-for-stateful-learning-and-cost-efficient-systems&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Drastische%20Kostenreduktion%20%28bis%20zu%2094,ist%20propriet%C3%A4r%20und%20nicht%20replizierbar"&gt;[6]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Kosten,konkrete%20technische%20Umsetzung%20der%20nativen"&gt;[7]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=aufgebaute%2C%20einzigartige%20und%20pers%C3%B6nliche%20Erinnerungsschatz,Kapital%20soll%20die%20Anmeldung%20sichern"&gt;[23]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Intelligentes%20Langzeitged%C3%A4chtnis%20,anderer%20Ansatz%3A%20TLRAG%20ist%20keine"&gt;[25]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=sichern,Turbo%20%26%20Gemini%20Pro%20Benchmarks"&gt;[26]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Patentierbare%20Technologie%20%26%20FTO%3A%20Die,200%20%E2%80%93%20500%20%C3%BCblich"&gt;[28]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=positive%20,Jeder%20Turn%20erzeugt%20neuen%2C%20dauerhaften"&gt;[29]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Kostenersparnis%20bis%20%E2%88%9298%20%25%3A%20TLRAG,das%20berechnete%20Spreadsheet%20liegen%20im"&gt;[32]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=Digitale%20Amnesie%3A%20Nach%20kurzer%20Zeit,Tuning%20des%20gesamten%20Modells"&gt;[34]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=2,einem%20unersetzlichen%20Begleiter%20mit%20einem"&gt;[35]&lt;/a&gt; &lt;a href="https://dev.tofile://file-2hN7FrdHt1zeXqsxCj2k2V#:~:text=%C3%9Cberlegene%20Nutzerbindung%20,Alle%20Unterlagen%20f%C3%BCr"&gt;[36]&lt;/a&gt; Pitchdeck.txt&lt;/p&gt;

&lt;p&gt;file://file-2hN7FrdHt1zeXqsxCj2k2V&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=,task%20solver"&gt;[8]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=,task%20solver"&gt;[19]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=If%20The%20Last%20RAG%20lives,partner%20with%20a%20long%20memory"&gt;[20]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=showed%20a%20small%20model%20with,designed%20to%20enforce%20this%20consistency"&gt;[24]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=Furthermore%2C%20today%27s%20LLMs%20lack%20true,learn%2C%20and%20grow%20with%20use"&gt;[37]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=utilize%20the%20information%20effectively,systems%20use%20memory%20as%20a"&gt;[38]&lt;/a&gt; &lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3#:~:text=knowledge%20in%20a%20flexible%2C%20transparent%2C,task%20solver"&gt;[39]&lt;/a&gt; An Architectural Paradigm for Stateful, Learning, and Cost-Efficient AI - DEV Community&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3"&gt;https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=Created%20by%20Yohei%20Nakajima%2C%20BabyAGI,it%20runs%20a%20simple%20loop" rel="noopener noreferrer"&gt;[10]&lt;/a&gt; &lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=Don%E2%80%99t%20let%20the%20name%20fool,tool%2C%20not%20an%20AI%20overlord" rel="noopener noreferrer"&gt;[11]&lt;/a&gt; &lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=%E2%9A%A0%EF%B8%8F%20Limitations" rel="noopener noreferrer"&gt;[12]&lt;/a&gt; &lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346#:~:text=BabyAGI%20is%3A" rel="noopener noreferrer"&gt;[43]&lt;/a&gt; Exploring BabyAGI: A Tiny Agent with Big Ideas | by Cristian Caruso | Medium&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346" rel="noopener noreferrer"&gt;https://pythonebasta.medium.com/exploring-babyagi-a-tiny-agent-with-big-ideas-833e16c0e346&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3#:~:text=modules%20for%20task%20execution" rel="noopener noreferrer"&gt;[13]&lt;/a&gt; &lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3#:~:text=Imagine%20an%20AI%20system%20that,to%20artificial%20intelligence%20memory%20systems" rel="noopener noreferrer"&gt;[44]&lt;/a&gt; AI’nt That Easy #25: MemGPT: How AI Learns to Remember Like Humans | by Aakriti Aggarwal | Medium&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3" rel="noopener noreferrer"&gt;https://aakriti-aggarwal.medium.com/memgpt-how-ai-learns-to-remember-like-humans-ab983ef79db3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.letta.com/blog/memgpt-and-letta#:~:text=The%20rapid%20popularity%20of%20the,refer%20to%20the%20agent%20framework" rel="noopener noreferrer"&gt;[14]&lt;/a&gt; &lt;a href="https://www.letta.com/blog/memgpt-and-letta#:~:text=%E2%80%8DIntroducing%20Letta%2C%20the%20company%20we%E2%80%99ve,for%20debugging%20and%20monitoring%20agents" rel="noopener noreferrer"&gt;[15]&lt;/a&gt; MemGPT is now part of Letta | Letta&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.letta.com/blog/memgpt-and-letta" rel="noopener noreferrer"&gt;https://www.letta.com/blog/memgpt-and-letta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=,for%20all%20developers%20or%20applications" rel="noopener noreferrer"&gt;[16]&lt;/a&gt; &lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=1,ai" rel="noopener noreferrer"&gt;[17]&lt;/a&gt; &lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=LLM%20Backends" rel="noopener noreferrer"&gt;[33]&lt;/a&gt; &lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/#:~:text=To%20get%20started%20with%20OpenDevin%2C,to%20meet%20certain%20prerequisites%2C%20including" rel="noopener noreferrer"&gt;[45]&lt;/a&gt; What is OpenDevin and what Number 1 problem does it solve for you? - Collabnix&lt;/p&gt;

&lt;p&gt;&lt;a href="https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/" rel="noopener noreferrer"&gt;https://collabnix.com/what-is-opendevin-and-what-problems-does-it-solve-for-you/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/blog/long-context-rag-performance-llms#:~:text=,question%20answering%2C%20and%20found%20that" rel="noopener noreferrer"&gt;[18]&lt;/a&gt; Long Context RAG Performance of LLMs | Databricks Blog&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/blog/long-context-rag-performance-llms" rel="noopener noreferrer"&gt;https://www.databricks.com/blog/long-context-rag-performance-llms&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Significant-Gravitas/Auto-GPT/issues/2726#:~:text=auto,If" rel="noopener noreferrer"&gt;[40]&lt;/a&gt; auto-gpt stuck in a loop of thinking · Issue #2726 - GitHub&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Significant-Gravitas/Auto-GPT/issues/2726" rel="noopener noreferrer"&gt;https://github.com/Significant-Gravitas/Auto-GPT/issues/2726&lt;/a&gt;&lt;/p&gt;

</description>
      <category>frameworlk</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Forget "Context Engineering". The impossible just happened. Again. And now it's in English.</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Wed, 09 Jul 2025 04:45:56 +0000</pubDate>
      <link>https://dev.to/tlrag/forget-context-engineering-the-impossible-just-happened-again-and-now-its-in-english-1f97</link>
      <guid>https://dev.to/tlrag/forget-context-engineering-the-impossible-just-happened-again-and-now-its-in-english-1f97</guid>
      <description>&lt;p&gt;The AI industry is talking about "Context Engineering" – giving LLMs the right context. That's a solved problem. &lt;/p&gt;

&lt;p&gt;The real challenge is creating an AI that can autonomously use that context for complex, multi-step reasoning.&lt;/p&gt;

&lt;p&gt;We claimed our TLRAG architecture enables this, even on standard platforms where it should be impossible. We showed a video of our AI performing a 10-step autonomous research loop in a single turn.&lt;/p&gt;

&lt;p&gt;Some were skeptical. "A one-time fluke? A language-specific anomaly?"&lt;/p&gt;

&lt;p&gt;To answer that, I just replicated the experiment. I challenged the AI again. It succeeded again. And this time, I had it do everything in English.&lt;/p&gt;

&lt;p&gt;This is not a feature. This is an emergent capability, born from an architecture that gives AI a persistent identity (a "Herz", i.e. a "Heart") and true cognitive stability.&lt;/p&gt;

&lt;p&gt;The full story is in the video description (it explains what she did, and what makes it stand out compared to a regular one-step chatbot).&lt;/p&gt;

&lt;p&gt;The debate is over. The proof is here, it's reproducible, and now it speaks the global language of tech.&lt;/p&gt;

&lt;p&gt;Watch the new, unedited recording of an AI doing the impossible: &lt;a href="https://youtu.be/ACtXilFE5nM" rel="noopener noreferrer"&gt;Click - YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#TheLastRAG #AIArchitecture #Emergence #BlackSwan #StatefulAI #ImpossibleIsNothing #Tech #Innovation&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Let's do the math. The AI industry is burning money on a problem that has already been solved.</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Tue, 08 Jul 2025 04:48:21 +0000</pubDate>
      <link>https://dev.to/tlrag/lets-do-the-math-the-ai-industry-is-burning-money-on-a-problem-that-has-already-been-solved-9i8</link>
      <guid>https://dev.to/tlrag/lets-do-the-math-the-ai-industry-is-burning-money-on-a-problem-that-has-already-been-solved-9i8</guid>
      <description>&lt;p&gt;So while the industry tries to force the next level of AI through sheer force – more data, larger models, more computing power – it may be overlooking a fundamental law. An increasingly complex system will inevitably become heavier, more confusing, and more prone to failure. Like everything in the universe, the development of complex systems is also subject to the principle of entropy.&lt;/p&gt;

&lt;p&gt;The culprit is the "additive context window." With every interaction, traditional AI models are forced to re-read the entire growing conversation history, leading to exponentially growing API costs.&lt;/p&gt;

&lt;p&gt;This isn't a small problem. A conservative simulation over 500 interactions shows the shocking reality:&lt;/p&gt;

&lt;p&gt;Standard RAG Architecture: Consumes ~347 Million tokens.&lt;/p&gt;

&lt;p&gt;The Last RAG (TLRAG): Consumes just ~6 Million tokens.&lt;/p&gt;

&lt;p&gt;That's a ~98% reduction in token costs, with a break-even point against standard RAG reached after just 7 interactions.&lt;/p&gt;

&lt;p&gt;I know, a 98% saving sounds too good to be true. But it's simple math. The cost of a traditional approach follows the logic of Cumulative Tokens = Sum over all turns of (System Prompt + (Interaction Size * Turn Number)). With every turn, the amount of data being re-processed grows, so the per-turn cost climbs steadily and the cumulative cost grows quadratically.&lt;/p&gt;

&lt;p&gt;TLRAG's "Dynamic Workspace" architecture breaks this cycle. The cost per interaction remains roughly constant, so the total cost grows only linearly and predictably.&lt;/p&gt;
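
&lt;p&gt;As a rough sanity check on that claim, here is the same arithmetic as a tiny script. The per-turn sizes are illustrative placeholders, not the parameters of the 500-interaction simulation above; the shape of the result is what matters.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative comparison of cumulative token usage (placeholder sizes only).

SYSTEM_PROMPT = 1_500   # tokens re-sent on every turn
TURN_SIZE = 800         # tokens the history grows by per interaction
WORKSPACE = 4_000       # roughly constant prompt size with a dynamic workspace
TURNS = 500

# Additive context: turn n re-reads the whole history, so per-turn cost grows with n.
additive = sum(SYSTEM_PROMPT + TURN_SIZE * n for n in range(1, TURNS + 1))

# Dynamic workspace: each turn costs about the same, so the total grows linearly.
workspace = WORKSPACE * TURNS

print(f"additive:  {additive:,} tokens")    # about 101 million with these numbers
print(f"workspace: {workspace:,} tokens")   # 2,000,000 with these numbers
&lt;/code&gt;&lt;/pre&gt;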

&lt;p&gt;It's time to stop burning money. Let's build smarter.&lt;/p&gt;

&lt;p&gt;The Last RAG is not only better in interaction quality, with persistent memory and self-growing, self-modulating behavior; it is also cheaper than any existing LLM system on the market.&lt;/p&gt;

&lt;p&gt;The full, reproducible simulation is in the pitch deck (see comments). See for yourself.&lt;/p&gt;

&lt;p&gt;#BusinessAngels #VentureCapital #AngelInvesting #WBAF #DeepTech #AI #Startup #ROI #TheLastRAG&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhrbxm3jkad567eenz3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhrbxm3jkad567eenz3v.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>An Architectural Paradigm for Stateful, Learning, and Cost-Efficient AI</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Sat, 05 Jul 2025 05:09:01 +0000</pubDate>
      <link>https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3</link>
      <guid>https://dev.to/tlrag/an-architectural-paradigm-for-stateful-learning-and-cost-efficient-ai-3jg3</guid>
      <description>&lt;h1&gt;
  
  
  The Last RAG: A Comprehensive Analysis 
&lt;/h1&gt;

&lt;p&gt;More papers and the main study: &lt;a href="https://dev.to/tlrag"&gt;https://dev.to/tlrag&lt;/a&gt;&lt;br&gt;
Pitch deck: &lt;a href="https://lumae-ai.neocities.org" rel="noopener noreferrer"&gt;https://lumae-ai.neocities.org&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  An Architectural Paradigm for Stateful, Learning, and Cost-Efficient AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Introduction 
&lt;/h3&gt;

&lt;p&gt;Large Language Models (LLMs) like GPT-4 have demonstrated remarkable capabilities, yet they remain fundamentally limited by two critical flaws: they forget, and they are prohibitively expensive to operate over long interactions. Current LLMs are stateless by default, treating each query in isolation. This "digital amnesia" leads to frustrating, repetitive dialogues. The industry's primary response—massively expanding context windows—creates new problems of exponential cost growth and diminishing returns in comprehension, as models often struggle to utilize information in very long inputs effectively (the "lost in the middle" problem).&lt;/p&gt;

&lt;p&gt;Furthermore, today's LLMs lack true on-the-fly learning. Their knowledge is static post-training, and updates require costly and slow fine-tuning. Retrieval-Augmented Generation (RAG) frameworks are merely external toolkits that inject information at query time without enabling the model to genuinely learn or adapt its internal state. This leaves the user with an AI that is just a tool, not a partner, unable to build context, trust, or a consistent relationship over time. This paper introduces The Last RAG (TLRAG), a novel architecture designed to solve these problems at their core by creating an AI that can truly remember, learn, and grow with use.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Executive Summary
&lt;/h3&gt;

&lt;p&gt;The Last RAG (TLRAG) is a revolutionary AI architecture that transforms stateless LLMs into persistent, stateful, and cost-efficient cognitive partners. It directly confronts the core weaknesses of modern AI—digital amnesia, escalating operational costs, and static knowledge—by integrating a set of synergistic mechanisms.&lt;/p&gt;

&lt;p&gt;At its heart is the &lt;strong&gt;Dynamic Work Space (DWS)&lt;/strong&gt;, which replaces the brute-force context window with an intelligent, focused "situational assessment" for each query. This is achieved through three pillars:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A Stable Identity ("Heart"):&lt;/strong&gt; Gives the AI a consistent personality and intrinsic motivation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent, Multi-layered Memory:&lt;/strong&gt; Combines a short-term cache with a long-term memory that the AI autonomously curates ("Memory Write"), storing only meaningful insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Efficient Context Curation:&lt;/strong&gt; Uses a smaller "Composer" LLM to summarize relevant memories, dramatically reducing the token load on the main model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is an AI that builds a continuous, evolving understanding of its user. This not only creates a hyper-personalized and deeply collaborative user experience but also yields dramatic, empirically validated &lt;strong&gt;cost savings of up to 98%&lt;/strong&gt; compared to standard approaches. TLRAG enables a new class of applications—from proactive corporate knowledge systems to long-term personal coaches—that were previously unfeasible, marking a paradigm shift from disposable AI tools to irreplaceable AI partners.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Last RAG: An Overview of the Vision
&lt;/h3&gt;

&lt;p&gt;The Last RAG (TLRAG) is a novel LLM architecture designed to tackle the above problems at their root. The name riffs on "Retrieval-Augmented Generation," but TLRAG goes beyond typical RAG frameworks - it aspires to be the last RAG you'll ever need, an architecture where the retrieval, memory, and learning are built into the AI's core operations rather than handled externally. TLRAG reimagines an LLM instance not as a stateless query engine, but as a persistent cognitive agent that accumulates knowledge and experiences over time. In essence, TLRAG turns an LLM from a reactive tool into a proactive partner by giving it three key capabilities: (1) a dynamic working memory that bridges short-term and long-term context, (2) the ability to learn continuously from each interaction ("memory writes"), and (3) an evolving "core identity" (the "heart") that imbues the model with a stable personality and self-consistency.&lt;/p&gt;

&lt;p&gt;Crucially, these features are achieved without modifying the LLM's weights via fine-tuning on every new piece of data. Instead, TLRAG uses clever orchestration (prompts and external storage) to simulate a form of long-term memory and learning within the standard interface of an LLM. This means TLRAG can work with existing base models (like GPT-4, Llama 2, etc.) but gives them a new architecture for how they handle context and knowledge. It's like a virtual cognitive layer on top of the raw model that remembers, summarizes, and updates information as you chat, enabling the AI to develop and maintain context across sessions. In simpler terms, the AI "thinks along" with you, "learns" from you, and retains these learnings for future conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Bridging the "Now" and "Yesterday": Dynamic Memory vs. the Stateless LLM
&lt;/h3&gt;

&lt;p&gt;One of the fundamental problems with vanilla LLMs is what we might call the split personality issue: the model has a short-term memory (the prompt context) and possibly access to a separate knowledge base (in RAG systems), but it can't truly bridge the two. Once you exceed the context window or open a new session, the model's knowledge of the conversation evaporates. TLRAG's solution is to maintain a persistent, dynamic workspace that accompanies the LLM across interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Work Space (DWS):&lt;/strong&gt; Every time you interact with a TLRAG-based AI, it creates a bespoke "dossier" of context that includes: (a) your current query ("the Now"), (b) recent dialogue from the current session (short-term memory), and (c) the most relevant pieces of long-term memory from past interactions. In other words, it blends past and present context seamlessly for each prompt. This dynamic assembly happens behind the scenes - TLRAG intelligently selects which past facts or events might be relevant to the current query, and only those get pulled into the prompt. Unlike standard RAG, which might fetch documents related to a query, TLRAG's retrieval is self-referential: it's grabbing your previous conversations and the AI's own memories. The result is an AI that always feels like it "remembers" the conversation, even if you pause and resume hours or days later, because it can retrieve the necessary context from its long-term store and include it in the prompt.&lt;/p&gt;

&lt;p&gt;This approach effectively decouples memory from the context window size. TLRAG isn't trying to stuff the entire conversation history or knowledge base into the prompt (which would be impossible or expensive); it's curating a focused context each time. You can think of it like a sliding window that's not limited to contiguous recent turns, but rather jumps to the important bits of past dialogues. Technically, this is achieved through what TLRAG calls the "window flush" mechanism - at each interaction, the prior context is flushed out and replaced with a freshly composed prompt containing just the salient short-term and long-term information needed. The AI's state is thus carried forward not by carrying over raw text each time, but by storing state in an external memory and retrieving summaries when relevant.&lt;/p&gt;
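
&lt;p&gt;As a rough sketch of how such a flush-and-rebuild could look in code: the MemoryStore, the naive keyword scoring, and the prompt layout below are readability stand-ins, not the actual TLRAG internals.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Memory:
    summary: str     # condensed insight, including the "why"
    score: float     # relevance assigned at retrieval time

class MemoryStore:
    """Toy long-term store; a real system would use embeddings or a vector DB."""
    def __init__(self):
        self.items = []

    def add(self, summary):
        self.items.append(summary)

    def search(self, query, top_k=5):
        # Naive keyword overlap stands in for semantic retrieval.
        q = set(query.lower().split())
        scored = [(len(q.intersection(s.lower().split())), s) for s in self.items]
        scored.sort(reverse=True)
        return [Memory(s, float(hits)) for hits, s in scored[:top_k]]

def build_dws_prompt(heart, store, short_term, user_query):
    """Window flush: discard raw history and rebuild a lean, focused prompt."""
    memories = store.search(user_query, top_k=5)
    return "\n\n".join([
        heart,                                                  # stable identity
        "Relevant memories:\n" + "\n".join(m.summary for m in memories),
        "Recent dialogue:\n" + "\n".join(short_term[-6:]),      # short-term cache only
        "User: " + user_query,                                  # the "Now"
    ])
&lt;/code&gt;&lt;/pre&gt;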

&lt;p&gt;Importantly, this design solves the statelessness problem. Instead of the AI forgetting everything outside the last prompt, it has a permanent "bridge" to yesterday's conversations. The conversation becomes fluid and continuous, not chopped into disjoint sessions. Research on multi-turn dialogues supports the benefit of such continuity: when an AI can leverage prior context reliably, it avoids the catastrophic drops in quality observed in standard LLMs during extended conversations. By keeping relevant context always at hand, TLRAG aims to prevent the model from making those wrong turns that lead to it getting "lost" and needing the user to intervene. In effect, TLRAG tries to ensure that the AI is always "in the loop" of the entire relationship, not just the last query.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. From "Dumb" Facts to Rich Memories: Storing the Why, Not Just the What
&lt;/h3&gt;

&lt;p&gt;Memory in most current LLM applications is shallow. If a system "stores" anything from prior interactions, it's usually just verbatim text or a factual summary. For example, a basic chatbot memory might note "User likes apples" because the user said that earlier. But it won't capture any nuance beyond that. TLRAG's philosophy of memory is radically different: every piece of remembered information is stored along with its context, significance, and emotional weight. In other words, TLRAG doesn't just log what was said; it tries to understand why it mattered. This leads to what we can call "rich" or contextual memories.&lt;/p&gt;

&lt;p&gt;Concretely, when the AI decides to save a memory (more on the decision process in the next section), it will store a structured record that might include: the content of the interaction, the interpreted meaning or inference from it, any emotional tone or user preference revealed, and the reason the AI thinks this is worth remembering. For instance, consider a personal conversation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard approach:&lt;/strong&gt; remembers "User said they like apples."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLRAG approach:&lt;/strong&gt; might remember something like: "Martin mentioned he likes apples &lt;strong&gt;because&lt;/strong&gt; his mother often baked him apple pie in childhood, which he associates with the feeling of home."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is striking. Later, if Martin says he's feeling down or lonely, a TLRAG AI equipped with the richer memory can proactively act on that knowledge: "I know it's not the same, but would you like me to find you an apple pie recipe? You once told me it reminds you of home." This kind of response crosses from factual regurgitation into the realm of empathy and personalization. It demonstrates the AI not only stored a fact, but understood the personal context behind it and applied it in a relevant moment. We've moved from a "dumb" memory to an intelligent, human-aware memory.&lt;/p&gt;

&lt;p&gt;This isn't only about touchy-feely use cases; it matters in professional contexts too. Imagine a work assistant AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basic memory:&lt;/strong&gt; "The boss wants a weekly report."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLRAG memory:&lt;/strong&gt; "Last week, the boss said the report was 'too confusing' and prefers a short bullet-point summary."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the next time a weekly report is due, the TLRAG AI can automatically format it as crisp bullet points - without being explicitly told again. It has learned the user's preference and adapted its behavior accordingly. This is genuine learning from feedback, achieved through memory. No fine-tuning of the model was required and no developer was in the loop: the system itself made the adjustment by recording not just the request ("boss wants a report") but the contextual lesson ("the boss likes it this way, not that way").&lt;/p&gt;
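
&lt;p&gt;To make the contrast concrete, the two examples above could be stored as structured records along the following lines. The schema and field names are illustrative assumptions, not a fixed TLRAG format.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class RichMemory:
    content: str          # what was said
    meaning: str          # the interpreted "why it matters"
    emotional_tone: str   # e.g. "nostalgic", "frustrated"
    reason_stored: str    # why the AI judged this worth keeping

apple_pie = RichMemory(
    content="Martin mentioned he likes apples.",
    meaning="Apple pie is tied to childhood memories of his mother; it signals home.",
    emotional_tone="nostalgic, comforting",
    reason_stored="Useful for empathetic suggestions when Martin feels down.",
)

boss_report = RichMemory(
    content="The boss called last week's report too confusing.",
    meaning="Reports should default to short bullet-point summaries.",
    emotional_tone="mild frustration",
    reason_stored="Direct feedback that should change future report formatting.",
)
&lt;/code&gt;&lt;/pre&gt;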

&lt;h3&gt;
  
  
  6. Autonomy Over Data: The Self-Managing Knowledge Base
&lt;/h3&gt;

&lt;p&gt;Another pain point with current-generation RAG implementations is the manual labor and heuristics needed to maintain their knowledge sources. TLRAG's answer is automation of the curator role. The architecture treats the AI itself as an intelligent curator of knowledge. As described above, the AI (via the system's logic) decides in real-time what constitutes an "important insight" or a key piece of information, and it stores only that, as a succinct memory entry. All the trivial chit-chat, the false starts, the repeated questions: those are simply not retained. TLRAG effectively applies a continuous summarization filter to the conversation. What remains is an "intelligent journal" of the collaboration between the user and AI. And it does this without human supervision or post-processing - it's baked into the architecture.&lt;/p&gt;

&lt;p&gt;This self-managing memory confers a few benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimal Noise:&lt;/strong&gt; By not retaining the "noise" of dialogue, the long-term store remains sharp and relevant. Any search through memories will yield high-value information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled Growth:&lt;/strong&gt; Standard LLM context use tends towards entropy. TLRAG flips this by keeping context lean and focused. The entropy is controlled because irrelevant parts are continuously thrown away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Human-in-the-Loop Needed:&lt;/strong&gt; TLRAG reduces the need for a developer or knowledge engineer to maintain the system's memory. Each AI instance (for each user) becomes a self-contained learner, rather than relying on central re-training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the user's perspective, the result is effortless. There is no need to explicitly tell the AI "remember this." Simply by using it and conversing naturally, the AI's memory grows. This is transformative: it moves us closer to the idea of a true personal AI assistant that accumulates experience just like a human assistant would.&lt;/p&gt;
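
&lt;p&gt;One plausible way to sketch this curator role is a small decision gate that runs after every exchange. The prompt wording and the llm/store interfaces below are assumptions made for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

WRITE_PROMPT = (
    "You are the memory curator. Given the exchange below, decide whether it "
    "contains a durable insight about the user (preferences, goals, feedback, "
    "emotional context). Reply with JSON only: "
    '{"store": true or false, "summary": "...", "why_it_matters": "..."}'
)

def maybe_write_memory(llm, store, user_msg, ai_msg):
    """Ask a (smaller) curator model whether this exchange is worth persisting."""
    exchange = "User: " + user_msg + "\nAI: " + ai_msg
    decision = json.loads(llm(WRITE_PROMPT + "\n\n" + exchange))
    if decision["store"]:
        store.add(decision["summary"] + " (why: " + decision["why_it_matters"] + ")")
&lt;/code&gt;&lt;/pre&gt;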

&lt;h3&gt;
  
  
  7. Cost-Efficiency by Design: Smarter Context, Smaller Bills
&lt;/h3&gt;

&lt;p&gt;We've touched on how TLRAG's dynamic context assembly saves tokens, but let's delve deeper into the economics of this architecture. Operating advanced LLMs is expensive largely due to token usage. Conventional systems often brute-force their way to better performance by maximizing context, meaning that as a conversation grows, the prompt keeps growing, and you pay more and more each time.&lt;/p&gt;

&lt;p&gt;TLRAG's "focused context" paradigm changes the cost structure dramatically. By only including the most relevant snippets of memory per prompt, TLRAG keeps the token count per interaction bounded and low. The prompt size in TLRAG doesn't balloon linearly with the number of turns; it hovers around a constant size.&lt;/p&gt;

&lt;h4&gt;
  
  
  7.1. Empirical Cost Analysis &amp;amp; Benchmarks
&lt;/h4&gt;

&lt;p&gt;The architecture's cost-efficiency is not just theoretical. A comparative analysis based on a simulation of 500 interaction turns demonstrates its superiority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Formulas:&lt;/strong&gt; The token cost per turn (&lt;code&gt;n&lt;/code&gt;) for different architectures can be modeled as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla LLM:&lt;/strong&gt; The cost is the sum of the system prompt (S) and the growing interaction history (I * n), capped by the context window (W).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;T_n(Vanilla) = min(S + I * n, W)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard RAG:&lt;/strong&gt; Similar to Vanilla, but adds a fixed-size retrieved chunk (R) to the context in every turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;T_n(RAG) = min(S + (I + R) * n, W)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLRAG (Native):&lt;/strong&gt; The cost is constant, determined by the internal processing of the DWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;T_n(TLRAG) = C (a constant per turn)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interaction Size (I):&lt;/strong&gt; 750 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Prompt (S):&lt;/strong&gt; 200 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard RAG Retrieval (R):&lt;/strong&gt; 2,500 tokens/turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLRAG Native Cost:&lt;/strong&gt; 12,000 tokens/turn (constant)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of Rounds (N):&lt;/strong&gt; 500&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Table 1: Cumulative Token Cost Comparison (N=500 turns)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Context Window&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Total Tokens (500 turns)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cost Savings vs. Std. RAG (1M)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Break-Even vs. TLRAG-native&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TLRAG-native&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;N/A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6,000,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.27%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLRAG 16k&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;7,996,000&lt;/td&gt;
&lt;td&gt;97.70%&lt;/td&gt;
&lt;td&gt;Turn 41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla LLM&lt;/td&gt;
&lt;td&gt;128k&lt;/td&gt;
&lt;td&gt;53,175,250&lt;/td&gt;
&lt;td&gt;84.65%&lt;/td&gt;
&lt;td&gt;Turn 31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard RAG&lt;/td&gt;
&lt;td&gt;128k&lt;/td&gt;
&lt;td&gt;61,550,800&lt;/td&gt;
&lt;td&gt;82.23%&lt;/td&gt;
&lt;td&gt;Turn 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla LLM&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;94,037,500&lt;/td&gt;
&lt;td&gt;72.88%&lt;/td&gt;
&lt;td&gt;Turn 31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;346,714,900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Turn 7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(Table values from spreadsheet model; fully reproducible.)&lt;/em&gt;&lt;/p&gt;
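
&lt;p&gt;Since the table is meant to be reproducible, the figures can be re-derived by summing the per-turn formulas above with the stated parameters. The short Python sketch below does exactly that; it is a model check, not a measurement, and the TLRAG 16k row and break-even turns come from the same spreadsheet model rather than being recomputed here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Model check of Table 1 using the stated parameters.
I, S, R, TLRAG_PER_TURN, N = 750, 200, 2_500, 12_000, 500

def cumulative(per_turn_cost):
    """Sum a per-turn token cost over N turns."""
    return sum(per_turn_cost(n) for n in range(1, N + 1))

def vanilla(window):
    return lambda n: min(S + I * n, window)

def standard_rag(window):
    return lambda n: min(S + (I + R) * n, window)

print("Vanilla LLM, 128k:", cumulative(vanilla(128_000)))        # 53,175,250
print("Standard RAG, 128k:", cumulative(standard_rag(128_000)))  # 61,550,800
print("Vanilla LLM, 1M:", cumulative(vanilla(1_000_000)))        # 94,037,500
print("Standard RAG, 1M:", cumulative(standard_rag(1_000_000)))  # 346,714,900
print("TLRAG-native:", TLRAG_PER_TURN * N)                       # 6,000,000
&lt;/code&gt;&lt;/pre&gt;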

&lt;p&gt;&lt;strong&gt;Conclusion from Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Massive Cost Savings:&lt;/strong&gt; TLRAG is up to &lt;strong&gt;98% cheaper&lt;/strong&gt; than a standard RAG implementation over 500 interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid ROI:&lt;/strong&gt; The break-even point against standard RAG is reached after just &lt;strong&gt;7 interactions&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constant vs. Growing Costs:&lt;/strong&gt; While the cost of traditional approaches grows with every turn until the context window "bursts," TLRAG's per-turn cost stays flat and predictable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. From Tool to Partner: Consistency, Trust, and Proactivity
&lt;/h3&gt;

&lt;p&gt;Perhaps the most profound impact of TLRAG is not technical or economic, but human: it enables an AI that feels fundamentally different to interact with. Today's AIs remain tools. TLRAG's combination of persistent memory, continuous learning, and a stable core identity (the "Heart") changes this dynamic. The AI can develop a consistent personality and knowledge base over time, which yields something crucial: &lt;strong&gt;user trust&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Trust, in turn, enables deeper collaboration. Instead of just issuing one-off commands, users become more likely to engage in a dialogue, share goals, and let the AI take initiative. In TLRAG, the AI is designed to be proactive once it has sufficient context. Since it "knows" not just facts but also your objectives and preferences, it can start suggesting helpful actions on its own. For example, if in previous talks you struggled with scheduling, and today you mention a new task, a TLRAG assistant might proactively say, "Shall I add that to your calendar and set a reminder? I recall you wanted to manage deadlines better."&lt;/p&gt;

&lt;p&gt;There is also an element of an AI developing its "self" in TLRAG. The "Heart" identity concept means the AI isn't just a blank slate each time; it has a persistent core. Over interactions, this core can be refined. In effect, the AI instance specializes itself to the user. This is very different from the one-size-fits-all model we typically use.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Practical Use Cases: Transforming Industries
&lt;/h3&gt;

&lt;p&gt;The true strength of the TLRAG architecture is revealed in use cases that remain unattainable for conventional, stateless LLMs.&lt;/p&gt;

&lt;h4&gt;
  
  
  9.1. The Hyper-Personalized Customer Service Agent
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today's Standard:&lt;/strong&gt; A customer calls and has to explain their issue for the fifth time to a new agent. The interaction is impersonal and inefficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The TLRAG Approach:&lt;/strong&gt; A TLRAG-powered agent maintains a persistent, individual memory for every customer. It remembers every past call, email, and resolved issue.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example Interaction:&lt;/strong&gt; "Hello Mr. Smith, I see we resolved a billing issue for you last week. Are you calling about that again, or is this a new inquiry?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive Engagement:&lt;/strong&gt; "I also see you had trouble with Feature X a month ago. Just to be sure, has that been stable for you since?"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  9.2. The Proactive Team Knowledge Hub (The Team's Nervous System)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today's Standard:&lt;/strong&gt; Knowledge is trapped in emails, Slack channels, and individual minds. Onboarding new team members is a slow, manual process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The TLRAG Approach:&lt;/strong&gt; Each team gets a TLRAG partner integrated into its communication channels. It becomes the living memory of the team.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Management:&lt;/strong&gt; "What was the final decision in last week's marketing meeting about the Q4 budget?" The AI can instantly cite the exact passage from the meeting protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive Connection:&lt;/strong&gt; "The bug Team A is reporting now seems similar to a ticket Team B resolved three months ago. I'll forward the solution."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  9.3. The Insightful Project Coordinator &amp;amp; Mediator
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today's Standard:&lt;/strong&gt; A project manager hunts for information. Deadlines are at risk because dependencies are not transparent. Conflicts are often noticed too late.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The TLRAG Approach:&lt;/strong&gt; A TLRAG project coordinator with access to project management tools, calendars, and internal chats.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Tracking:&lt;/strong&gt; "I see the design department has finalized their drafts. I will remind the front-end team that they can now begin implementation."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive Mediation:&lt;/strong&gt; The AI can analyze communication patterns (anonymously) and detect rising tensions or bottlenecks, discreetly suggesting a sync meeting to the project lead to resolve blockers before they escalate.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  9.4. The Strategic C-Level Sparring Partner
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today's Standard:&lt;/strong&gt; A CEO makes strategic decisions based on incomplete information or flawed memories of past projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The TLRAG Approach:&lt;/strong&gt; A C-Level assistant with total recall of the company's history—business reports, strategy papers, market analyses, and board meeting minutes.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Historical Analysis:&lt;/strong&gt; CEO: "We're considering expanding to France. Did we try that before and why did it fail?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLRAG Response:&lt;/strong&gt; "Yes, in 2017. The main obstacles, according to the records, were: 1) an unexpected regulatory hurdle, 2) a marketing campaign that was poorly localized, and 3) a key partner backed out. Here are the three relevant reports."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  9.5. Further Visionary Applications
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The AI Coach &amp;amp; Therapist:&lt;/strong&gt; A companion with perfect memory that recalls emotional breakthroughs and long-term goals from months ago, creating trust through continuity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Adaptive Learning Companion:&lt;/strong&gt; An AI tutor that builds a cognitive model of a student, remembers specific difficulties, and individually adapts its teaching style and tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Long-Term Research Partner:&lt;/strong&gt; An AI that becomes a permanent member of a research team, with a memory superior to a human's, recalling every hypothesis and decision over years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Personal Creative Director:&lt;/strong&gt; An AI that acts as the guardian of a creative vision, knowing the complete history, character arcs, and rules of a fictional world to ensure continuity and emotional integrity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  10. Comparisons with Other Approaches
&lt;/h3&gt;

&lt;p&gt;It's important to place TLRAG in the context of other ongoing efforts to enhance LLMs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versus Large Context Windows:&lt;/strong&gt; Pushing context lengths to 100k+ tokens is a brute-force approach that is extremely costly and inefficient, as models don't utilize the information effectively. TLRAG uses a smarter approach: smaller context, but always relevant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versus Fine-Tuning:&lt;/strong&gt; Fine-tuning is slow, expensive, and impractical for real-time personalization. TLRAG avoids altering model weights, keeping knowledge in a flexible, transparent, and easily updatable external store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versus Traditional RAG &amp;amp; Frameworks:&lt;/strong&gt; Frameworks like LangChain require the developer to manually wire up memory systems. TLRAG proposes a unified architecture where these decisions are made intrinsically by the system's design. It's an out-of-the-box architecture, not just a toolkit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versus Agentic Systems (AutoGPT, etc.):&lt;/strong&gt; Most agent systems use memory as a scratchpad for a specific task. TLRAG uses memory to enrich the dialogue and the AI-user relationship itself, aiming for a holistic AI partner rather than a single-task solver.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  11. Validating the Claims: Is TLRAG Really Better?
&lt;/h3&gt;

&lt;p&gt;The claims about TLRAG are supported by existing research and data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Improves Coherence:&lt;/strong&gt; Studies show that without memory, LLM performance drops significantly in multi-turn conversations. Memory-enabled systems provide more personalized and continuous responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective Context is Efficient:&lt;/strong&gt; Research on selective context pruning has shown that reducing context length by up to 50% can be done with negligible performance loss, validating TLRAG's "window flush" approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG's Cost-Effectiveness:&lt;/strong&gt; It is well-established that RAG is more cost-effective than fine-tuning for integrating new knowledge. Pinecone's research showed a small model with RAG nearly matching GPT-4's accuracy at a fraction of the cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency Builds Trust:&lt;/strong&gt; Research in Human-Computer Interaction (HCI) indicates that consistent AI behavior increases user reliance and partnership. TLRAG is designed to enforce this consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12. Risks, Limitations, and Mitigations
&lt;/h3&gt;

&lt;p&gt;While powerful, the TLRAG architecture is not without challenges. A balanced perspective requires acknowledging potential risks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Curation Complexity:&lt;/strong&gt; The AI's autonomous decision to "write" a memory is critical. If it stores false information or irrelevant details, it could lead to the propagation of errors and a polluted knowledge base.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation:&lt;/strong&gt; The system requires robust heuristics for memory validation. Memories can be tagged with confidence scores, and a mechanism for correction is vital. If a user corrects the AI, the corresponding memory must be updated, marked as outdated, or deleted, creating a self-correction loop that improves accuracy over time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Scalability of the Memory Store:&lt;/strong&gt; Over years of interaction, the memory base could become vast. This could potentially slow down retrieval, decrease its relevance, or become unmanageable.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation:&lt;/strong&gt; Implementing a "forgetting" mechanism, similar to human memory, is essential. Old, irrelevant memories could be archived, compressed into higher-level summaries, or assigned a decay score. The retrieval system must be optimized to handle a large corpus without a significant drop in performance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Potential for Bias Amplification:&lt;/strong&gt; If the AI learns from a biased user or dataset, its memory will reflect and potentially amplify that bias over time, reinforcing it in future interactions. This could lead to an AI that develops an undesirable or harmful persona.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation:&lt;/strong&gt; Regular audits of the memory base and the AI's "Heart" are necessary. The core identity can be programmed with strong ethical guidelines that act as a guardrail against developing harmful biases. Furthermore, diversity in training data for the base model and mechanisms to detect and flag biased memory writes are crucial.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
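
&lt;p&gt;For illustration only, a decay score of the kind mentioned in the scalability mitigation could look like this. The half-life value, the usage bonus, and the record fields (last_access, uses) are assumptions, not part of the TLRAG specification.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import time

def decay_score(last_access_ts, uses, half_life_days=90.0):
    """Toy relevance score: memories decay exponentially, but frequent use slows forgetting."""
    age_days = (time.time() - last_access_ts) / 86_400
    return (1 + uses) * math.exp(-math.log(2) * age_days / half_life_days)

def rank_memories(store):
    """Order memories from most to least relevant; the tail becomes archive or summary candidates."""
    return sorted(store, key=lambda m: decay_score(m["last_access"], m["uses"]), reverse=True)
&lt;/code&gt;&lt;/pre&gt;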

&lt;h3&gt;
  
  
  13. Conclusion: A New Paradigm for LLM Interaction
&lt;/h3&gt;

&lt;p&gt;The Last RAG presents a compelling new perspective on how we design and use LLM-based AI systems. Instead of making models bigger or contexts longer, it makes the AI smarter in how it uses context—remembering the past, learning from it, and focusing on what matters. In doing so, it addresses the root causes behind today's limitations.&lt;/p&gt;

&lt;p&gt;Each of these advances is not just a theoretical idea but is backed by evidence from research and practice. TLRAG isn't inventing memory or retrieval from scratch; it's synthesizing the best of what we know into one integrated architecture. It is, in essence, proposing an architectural paradigm shift: from stateless LLMs to stateful LLM agents.&lt;/p&gt;

&lt;p&gt;If The Last RAG lives up to its promise, it could make many current frameworks obsolete. You wouldn't need LangChain for memory management because the memory is built-in. You wouldn't need to fine-tune for every new dataset because the instance can learn. This is why it's called "The Last RAG"—it aims to be the last architecture you need to handle retrieval, memory, and generation in one integrated loop. It represents a shift from static AI models to dynamic, lifelong-learning AI instances, turning the AI from an obedient savant with amnesia into a thoughtful partner with a long memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Glossary of Terms
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Term&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TLRAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Last RAG:&lt;/strong&gt; An AI architecture that gives a standard LLM persistent memory, continuous learning capabilities, and a stable identity.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Dynamic Work Space:&lt;/strong&gt; The core of TLRAG. An intelligent, focused context that is dynamically assembled for each query, replacing the traditional, bloated context window.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Heart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The persistent identity core of the AI, defining its personality, motivations, and agenda.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The autonomous process where the AI decides to store a key insight or piece of information from a conversation as a permanent memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Window Flush&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The mechanism that discards the previous context and rebuilds a new, lean one from short-term dialogue and relevant long-term memories.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stateless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The default nature of LLMs, where each interaction is independent and has no memory of previous ones. TLRAG makes them &lt;strong&gt;stateful&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Information Entropy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A term used to describe the state where adding more data and complexity to a system leads to more chaos and diminishing returns, not better intelligence.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  15. Bibliography
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Gehrken, M. (2025). &lt;em&gt;The Last RAG: KI-Architektur die mitdenkt, lernt und Kosten spart&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Gehrken, M. (2025). &lt;em&gt;Betriebskostenvergleich: Vanilla LLM vs. Standard-RAG vs. TLRAG&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;LUMAE AI. (2025). &lt;em&gt;The Last Rag – Pitch Deck (working Copy)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." &lt;em&gt;arXiv preprint arXiv:2307.03172&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Laban, P., et al. (2024). "LLMs Get Lost In Multi-Turn Conversation." &lt;em&gt;arXiv preprint arXiv:2405.06120&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Pinecone Engineering. (2023). "RAG makes LLMs better and equal." &lt;em&gt;Pinecone Blog&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Wu, Y., et al. (2024). "From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs." &lt;em&gt;arXiv preprint arXiv:2404.15965&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Li, C., et al. (2023). "Selective Context: Compressing Context to Enhance Inference Efficiency of LLMs." &lt;em&gt;arXiv preprint arXiv:2310.06201&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Park, J. S., et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior." &lt;em&gt;arXiv preprint arXiv:2304.03442&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Here We Go: The TLRAG Community Discord Just Arrived!</title>
      <dc:creator>martin</dc:creator>
      <pubDate>Fri, 04 Jul 2025 17:25:51 +0000</pubDate>
      <link>https://dev.to/tlrag/here-we-go-the-tlrag-community-discord-just-arrived--3j3e</link>
      <guid>https://dev.to/tlrag/here-we-go-the-tlrag-community-discord-just-arrived--3j3e</guid>
      <description>&lt;p&gt;I just created the TLRAG community Discord - feel free to join here: &lt;a href="https://discord.gg/kknwNmsM5B" rel="noopener noreferrer"&gt;https://discord.gg/kknwNmsM5B&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvo0jtuqab9rdph0u3zm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvo0jtuqab9rdph0u3zm.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discord</category>
      <category>community</category>
      <category>announcement</category>
    </item>
  </channel>
</rss>
