<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alfino Hatta</title>
    <description>The latest articles on DEV Community by Alfino Hatta (@alfinohatta).</description>
    <link>https://dev.to/alfinohatta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4015611%2F06c0614c-40f0-491a-b71d-d9a3d81cb788.jpg</url>
      <title>DEV Community: Alfino Hatta</title>
      <link>https://dev.to/alfinohatta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alfinohatta"/>
    <language>en</language>
    <item>
      <title>Building Corroborate.ai: An Auditable Way to Decide What an AI Actually "Knows"</title>
      <dc:creator>Alfino Hatta</dc:creator>
      <pubDate>Sat, 04 Jul 2026 23:32:16 +0000</pubDate>
      <link>https://dev.to/alfinohatta/building-corroborateai-an-auditable-way-to-decide-what-an-ai-actually-knows-1n9m</link>
      <guid>https://dev.to/alfinohatta/building-corroborateai-an-auditable-way-to-decide-what-an-ai-actually-knows-1n9m</guid>
      <description>&lt;h2&gt;
  
  
  Why I built a knowledge arbitration engine instead of just another memory layer
&lt;/h2&gt;

&lt;p&gt;If you've spent any real time building with LLMs, you've probably run into the same wall I did. Memory systems today are very good at storing things and surprisingly bad at explaining why they believe what they believe. Most of the popular options, things like Mem0, Zep, and similar tools, tend to boil the whole problem down to a single model call. The model looks at a handful of facts, picks a winner, and the system moves on with its life. There's no audit trail, no way to reason about why one claim beat another, and honestly, no real concept that a fact could be true in one place and false in another.&lt;/p&gt;

&lt;p&gt;That last part is what actually bothered me enough to start building something new.&lt;/p&gt;

&lt;p&gt;Think about it for a second. A pricing rule can be perfectly legal in Germany and illegal in the United States. A regulation can be accurate today and obsolete in six months. Two sources can both look credible on paper and still flatly contradict each other. Collapsing all of that nuance into a single black-box LLM judgment felt like exactly the wrong foundation for anything used in insurance, legal, or banking contexts, domains where you eventually have to explain your reasoning to a regulator, not just satisfy a user with a plausible-sounding answer.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Corroborate.ai&lt;/strong&gt;, an Android reference client for a knowledge arbitration engine that treats truth not as a single yes or no output, but as a function of context. The mechanism at the center of it all is something I call a &lt;strong&gt;Confidence Auction&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with letting one model call decide what's true
&lt;/h2&gt;

&lt;p&gt;Before I get into how Corroborate.ai works, I want to spend a little more time on why I think the "single model call" approach is such a fragile pattern, because it's genuinely everywhere right now.&lt;/p&gt;

&lt;p&gt;When you ask an LLM something like "is this claim true," you're really asking it to do three jobs at once. First, it has to retrieve or recall relevant information. Second, it has to weigh the credibility of that information against competing information. Third, it has to make a final judgment call and phrase that judgment confidently, because that's how these models are trained to communicate. The problem is that all three of those jobs happen inside a single opaque model call. You get an answer, but you don't get the reasoning that produced it, and you definitely don't get anything you could hand to a compliance officer or an auditor and say "here's why the system believed this."&lt;/p&gt;

&lt;p&gt;For a lot of consumer use cases, that's a perfectly acceptable trade-off. Nobody needs an audit trail for a chatbot recommending a recipe. But the moment you're dealing with regulated industries, that opacity turns into liability. If your system tells an insurance agent that a policy exclusion applies, and it turns out to be wrong, "the LLM said so" is not an answer anyone wants to give a regulator, a client, or a court.&lt;/p&gt;

&lt;p&gt;I wanted a system where the reasoning was visible by construction, not bolted on after the fact as an explanation generated by yet another LLM call trying to rationalize a decision it already made.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core idea: deterministic scoring instead of a single vibe check
&lt;/h2&gt;

&lt;p&gt;Instead of asking an LLM to render a verdict, Corroborate.ai runs every candidate claim through a &lt;strong&gt;deterministic scorer ensemble&lt;/strong&gt;. Each claim is evaluated along several independent dimensions, and each of those dimensions is computed by its own dedicated scorer rather than a single model guessing at all of them simultaneously.&lt;/p&gt;

&lt;p&gt;The dimensions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Reliability (Sr).&lt;/strong&gt; How trustworthy is the origin of this claim? A claim from a verified regulatory filing should not carry the same weight as one scraped from an anonymous forum post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recency Decay (St).&lt;/strong&gt; How stale is this information? Facts age, and some age faster than others. A claim about a tax rate from three years ago should decay differently than a claim about a scientific constant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corroboration (Sc).&lt;/strong&gt; How many independent sources back this claim up? A single unverified assertion should never be treated the same as a claim that multiple unrelated sources agree on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional Authority (Sa).&lt;/strong&gt; Does this claim actually hold in the jurisdiction relevant to the user? This is the dimension that captures the Germany versus United States pricing example I mentioned earlier.&lt;/li&gt;
&lt;/ul&gt;
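
&lt;p&gt;To make two of these dimensions concrete, here is a minimal Python sketch of what a recency scorer and a corroboration scorer could look like. It is an illustration, not code from the Corroborate.ai repository: the exponential half-life decay, the saturating corroboration curve, and the names &lt;code&gt;recency_score&lt;/code&gt;, &lt;code&gt;corroboration_score&lt;/code&gt;, &lt;code&gt;half_life_days&lt;/code&gt;, and &lt;code&gt;saturation&lt;/code&gt; are all assumptions chosen for the example.&lt;/p&gt;

```python
import math


def recency_score(age_days: float, half_life_days: float) -> float:
    """Exponential decay: 1.0 for a brand-new claim, 0.5 after one half-life.

    The half-life model is an assumption made for this sketch.
    half_life_days is expected to be positive.
    """
    age_days = max(age_days, 0.0)  # treat future-dated claims as brand new
    return 0.5 ** (age_days / half_life_days)


def corroboration_score(independent_sources: int, saturation: float = 3.0) -> float:
    """Saturating curve: 0 sources gives 0.0, and the score approaches 1.0
    as the number of independent sources grows. Hypothetical shape."""
    independent_sources = max(independent_sources, 0)
    return 1.0 - math.exp(-independent_sources / saturation)


# A three-year-old tax rate with a one-year half-life decays sharply ...
tax_rate = recency_score(age_days=3 * 365, half_life_days=365)
# ... while a scientific constant with a 100-year half-life barely moves.
constant = recency_score(age_days=3 * 365, half_life_days=100 * 365)
print(tax_rate, constant)
```

&lt;p&gt;Giving each category of fact its own half-life is one simple way to express the idea that some facts age faster than others.&lt;/p&gt;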

&lt;p&gt;Once each of these scores is computed, they get combined using a &lt;strong&gt;geometric mean&lt;/strong&gt; rather than a simple arithmetic average. This was a deliberate and, honestly, somewhat contentious design decision when I was sketching it out on paper.&lt;/p&gt;

&lt;p&gt;Here's why it matters. With an arithmetic mean, a claim can compensate for a terrible score on one dimension by having a great score on another. A claim that's wildly out of date but happens to come from a hundred corroborating sources could still average out to looking trustworthy. A geometric mean does not let you get away with that. If any single dimension collapses toward zero, the entire combined score collapses with it. In practice, that means a claim that's extremely recent and well corroborated but legally invalid in the user's region gets correctly punished instead of sneaking through because its other scores were strong. One bad dimension cannot be quietly averaged away.&lt;/p&gt;
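
&lt;p&gt;The difference is easy to see with numbers. The Python sketch below compares the two means for a claim that scores well on source reliability, recency, and corroboration but is nearly invalid in the user's region. The unweighted geometric mean, the assumption that every score lies in [0, 1], and the example values are illustrative only; whether the real engine applies per-dimension weights is not stated here.&lt;/p&gt;

```python
import math


def arithmetic_mean(scores: list) -> float:
    return sum(scores) / len(scores)


def geometric_mean(scores: list) -> float:
    """Unweighted geometric mean of scores assumed to lie in [0, 1]."""
    if not scores:
        raise ValueError("at least one score is required")
    if 0.0 > min(scores) or max(scores) > 1.0:
        raise ValueError("scores must lie in [0, 1]")
    return math.prod(scores) ** (1.0 / len(scores))


# Hypothetical claim: strong on Sr, St, Sc, but almost invalid regionally (Sa).
scores = [0.9, 0.9, 0.9, 0.01]

print(round(arithmetic_mean(scores), 4))  # about 0.68: still looks trustworthy
print(round(geometric_mean(scores), 4))   # about 0.29: the weak dimension dominates
```

&lt;p&gt;If the regional score were exactly 0, the geometric mean would be exactly 0 no matter how strong the other three dimensions were.&lt;/p&gt;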

&lt;h2&gt;
  
  
  The Contradiction Guard: knowing when to say "I'm not sure"
&lt;/h2&gt;

&lt;p&gt;Here's the part of the system I'm probably proudest of. When two competing claims land within a 0.10 confidence delta of each other after scoring, the system does not just pick whichever one is marginally higher and move on. Instead, it triggers something called the &lt;strong&gt;Contradiction Guard&lt;/strong&gt;, which returns both candidates to the caller as genuinely ambiguous, along with their individual scores and provenance.&lt;/p&gt;

&lt;p&gt;This might sound like a small implementation detail, but I think it's actually the philosophical center of the entire project. Most AI systems are optimized to always produce a confident-sounding answer, because that's what feels useful in a demo. But a confident wrong answer is worse than an honest "these two things are in tension and here's why." In regulated domains especially, an honest admission of ambiguity, backed by transparent scoring, is a feature, not a failure. It gives a human reviewer exactly what they need to make the final call themselves, instead of quietly inheriting a hidden coin flip from the model.&lt;/p&gt;
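
&lt;p&gt;A minimal Python sketch of the Contradiction Guard's decision rule follows. Only the 0.10 delta comes from the description above. The &lt;code&gt;ScoredClaim&lt;/code&gt; type, the &lt;code&gt;resolve&lt;/code&gt; function, the response shape, and the choice to treat a gap of exactly 0.10 as resolved rather than ambiguous are assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass

AMBIGUITY_DELTA = 0.10  # threshold quoted in the article


@dataclass(frozen=True)
class ScoredClaim:  # hypothetical type for this sketch
    text: str
    confidence: float
    source: str


def resolve(candidates: list, delta: float = AMBIGUITY_DELTA) -> dict:
    """Return one resolved claim, or the top two when they are too close to call."""
    if not candidates:
        raise ValueError("no candidates to resolve")
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if len(ranked) >= 2:
        gap = ranked[0].confidence - ranked[1].confidence
        if delta > gap:
            # Too close: hand both back with their scores and provenance.
            return {"status": "ambiguous", "candidates": ranked[:2]}
    return {"status": "resolved", "claim": ranked[0]}


close_call = resolve([
    ScoredClaim("Exclusion applies", 0.82, "regulatory filing"),
    ScoredClaim("Exclusion does not apply", 0.78, "insurer bulletin"),
])
clear_win = resolve([
    ScoredClaim("Exclusion applies", 0.90, "regulatory filing"),
    ScoredClaim("Exclusion does not apply", 0.40, "forum post"),
])
print(close_call["status"], clear_win["status"])
```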

&lt;h2&gt;
  
  
  How a claim actually gets resolved, end to end
&lt;/h2&gt;

&lt;p&gt;It's worth walking through the full request lifecycle, because the architecture reflects the same philosophy as the scoring logic. Nothing gets to happen silently.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Android client sends a query to an API Gateway, along with metadata about the requesting tenant and the relevant region.&lt;/li&gt;
&lt;li&gt;The gateway routes the request to the correct regional partition. Data residency is not an afterthought bolted onto the routing layer later. It's baked into the design from the very first hop, since a claim resolved under the wrong jurisdiction's data isn't just inconvenient, it can be actively wrong.&lt;/li&gt;
&lt;li&gt;A Resolution Engine retrieves the most relevant semantic candidates for the query and hands them off to the Scorer Ensemble.&lt;/li&gt;
&lt;li&gt;Each scorer computes its dimension independently, and the results are fused together using the geometric mean described above.&lt;/li&gt;
&lt;li&gt;If the confidence delta between the top two candidates is too small, the Contradiction Guard kicks in and the client receives an ambiguous response with both candidates and their full scoring breakdown attached. Otherwise, the client receives a single resolved claim, complete with provenance information and a reference into a signed, append-only audit log.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That audit log deserves its own mention. It's a &lt;strong&gt;Merkle-anchored, append-only log&lt;/strong&gt;, meaning every resolution event produces a signed record that cannot be quietly edited or deleted after the fact. If you ever need to demonstrate provenance to satisfy something like the EU AI Act, or simply to answer an internal question about why the system behaved a certain way six months ago, that log is designed to give you a real, tamper-evident answer instead of a best guess reconstructed from memory.&lt;/p&gt;
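
&lt;p&gt;To show why a Merkle root makes tampering evident, here is a small Python sketch using only the standard library. It is a simplification under stated assumptions: records are hashed as canonical JSON, odd levels duplicate their last node, signing is left out entirely, and the record fields and function names (&lt;code&gt;leaf_hash&lt;/code&gt;, &lt;code&gt;merkle_root&lt;/code&gt;) are hypothetical. The actual log format used by Corroborate.ai may differ.&lt;/p&gt;

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """Hash one audit record. The 0x00 prefix separates leaves from inner nodes."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: list) -> bytes:
    """Fold leaf hashes pairwise until a single root hash remains."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: duplicate the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical resolution events.
records = [
    {"event": "resolved", "claim_id": "c-1", "confidence": 0.91},
    {"event": "ambiguous", "claim_id": "c-2", "confidence": 0.78},
    {"event": "resolved", "claim_id": "c-3", "confidence": 0.88},
]
root = merkle_root([leaf_hash(r) for r in records])

# Quietly editing one old record changes the root, so the edit is detectable
# by anyone who kept (or was sent) the original root.
tampered = [dict(records[0], confidence=0.99)] + records[1:]
tampered_root = merkle_root([leaf_hash(r) for r in tampered])
print(root.hex() == tampered_root.hex())
```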

&lt;h2&gt;
  
  
  The stack behind the client
&lt;/h2&gt;

&lt;p&gt;On the Android side, the client itself is built with a fairly modern and, I think, pretty clean set of tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kotlin&lt;/strong&gt; as the primary language throughout the app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jetpack Compose&lt;/strong&gt; paired with &lt;strong&gt;Material 3&lt;/strong&gt; for the entire UI layer, which made it much easier to represent the scoring breakdowns and ambiguous claim states visually instead of burying them in plain text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrofit&lt;/strong&gt;, &lt;strong&gt;OkHttp&lt;/strong&gt;, and &lt;strong&gt;Kotlinx Serialization&lt;/strong&gt; handling networking and data mapping between the client and the backend services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the backend integration side, the architecture assumes a &lt;strong&gt;Neo4j&lt;/strong&gt; graph database for modeling relationships between claims, since so much of what makes a claim credible or not depends on its relationships to other claims and sources. Semantic search over candidate claims is handled through &lt;strong&gt;Qdrant&lt;/strong&gt; as a vector store, and raw encrypted episodes, meaning the underlying source material a claim was derived from, live in S3-style object storage. All of it ultimately gets anchored by the Merkle audit log so that nothing in the pipeline is invisible after the fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compliance was not an afterthought
&lt;/h2&gt;

&lt;p&gt;Given that the intended use cases sit in insurance, legal, and banking, I made the decision early on to build regulatory behavior directly into the system rather than treating it as something to bolt on right before a launch. There are two pieces of that I'm especially glad I got right from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDPR Article 17 erasure cascades.&lt;/strong&gt; When a user requests deletion, the system does not simply delete a row in a database and call it done. Linked episodes tied to that user are hard deleted. Claims that still have independent corroboration from other, unrelated sources get PII stripped rather than destroyed outright, since the underlying fact might still be legitimately known and referenced from elsewhere even after this particular user's data is gone. Claims that have no independent corroboration left after the user's data is removed get hard deleted entirely, since keeping them around would mean keeping information that only existed because of the deleted user in the first place.&lt;/p&gt;
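
&lt;p&gt;The three branches of that cascade can be sketched in a few lines of Python. This is an in-memory illustration, not the project's storage code: the &lt;code&gt;Claim&lt;/code&gt; shape, the &lt;code&gt;erase_user&lt;/code&gt; function, and the simplification that "independent corroboration" means at least one supporting episode owned by someone else are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Claim:  # hypothetical shape for this sketch
    claim_id: str
    text: str
    episode_ids: set                         # episodes that support the claim
    pii: dict = field(default_factory=dict)  # user id -> personal data on the claim


def erase_user(user_id: str, episode_owner: dict, claims: list) -> tuple:
    """Apply the erasure cascade and return (surviving episodes, surviving claims)."""
    # 1. Hard delete every episode owned by the user.
    doomed = {eid for eid, owner in episode_owner.items() if owner == user_id}
    episodes = {eid: o for eid, o in episode_owner.items() if eid not in doomed}

    kept = []
    for claim in claims:
        remaining = claim.episode_ids - doomed
        if not remaining:
            continue  # 3. No independent corroboration left: hard delete.
        # 2. Still corroborated elsewhere: keep the fact, strip the user's PII.
        pii = {uid: v for uid, v in claim.pii.items() if uid != user_id}
        kept.append(Claim(claim.claim_id, claim.text, remaining, pii))
    return episodes, kept


episode_owner = {"e1": "alice", "e2": "bob", "e3": "alice"}
claims = [
    Claim("c1", "Policy X excludes flood damage", {"e1", "e2"},
          {"alice": "alice@example.com"}),
    Claim("c2", "Alice renewed her policy in March", {"e3"},
          {"alice": "alice@example.com"}),
]
episodes, kept = erase_user("alice", episode_owner, claims)
print(sorted(episodes), [c.claim_id for c in kept])
```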

&lt;p&gt;&lt;strong&gt;Role-based access control.&lt;/strong&gt; The system defines three roles. AGENT covers basic resolve and ingest actions, the kind of everyday operations most users of the system will perform. VERIFIER is meant for human-in-the-loop review, letting a qualified person step in and confirm or override an ambiguous resolution. ADMIN is reserved for configuration changes and erasure operations, keeping the most sensitive capabilities gated behind the appropriate permission level rather than leaving them open to anyone with basic access.&lt;/p&gt;
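
&lt;p&gt;As a rough Python sketch, the role model could be expressed as a lookup from role to permitted actions. The three role names come from the description above. The action strings and the &lt;code&gt;is_allowed&lt;/code&gt; helper are hypothetical, and the sketch assumes roles are not cumulative, since whether ADMIN or VERIFIER inherit AGENT permissions is not stated.&lt;/p&gt;

```python
from enum import Enum


class Role(Enum):
    AGENT = "AGENT"
    VERIFIER = "VERIFIER"
    ADMIN = "ADMIN"


# Hypothetical action names; each role gets only what the article lists for it.
PERMISSIONS = {
    Role.AGENT: frozenset({"resolve", "ingest"}),
    Role.VERIFIER: frozenset({"review_ambiguous"}),
    Role.ADMIN: frozenset({"configure", "erase"}),
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny by default: an action is allowed only if it is listed for the role."""
    return action in PERMISSIONS.get(role, frozenset())


print(is_allowed(Role.AGENT, "resolve"))  # allowed for AGENT
print(is_allowed(Role.AGENT, "erase"))    # erasure is reserved for ADMIN
```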

&lt;h2&gt;
  
  
  Some of the harder design decisions along the way
&lt;/h2&gt;

&lt;p&gt;Building this wasn't a straight line, and a few decisions took longer to settle on than I expected going in.&lt;/p&gt;

&lt;p&gt;Choosing the geometric mean over a simpler weighted average was one of them. It's mathematically stricter, and stricter math means more claims end up flagged as ambiguous rather than confidently resolved. Early on, that felt like it might make the system less useful, since users generally want answers, not more questions. But the more I thought about the target use cases, the clearer it became that a system used in insurance or legal contexts should be biased toward honest uncertainty rather than false confidence. A slightly less "decisive" system that's honest about its limits is more trustworthy, and ultimately more useful, than one that always has an answer ready.&lt;/p&gt;

&lt;p&gt;Deciding how aggressive to make the erasure cascade was another one. It would have been simpler to just hard delete everything tied to a user on request and call it compliant. But that approach ignores the reality that facts can be independently corroborated by other sources that have nothing to do with the user requesting deletion. Building the PII stripping path instead of a blanket deletion took more engineering effort, but it respects both the user's right to be forgotten and the integrity of facts that other, unrelated sources still legitimately support.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell someone building something similar
&lt;/h2&gt;

&lt;p&gt;If you're building any kind of AI memory or retrieval system that has to survive real contact with a compliance team, or honestly even one that just needs to earn genuine user trust, my biggest takeaway is this: resist the temptation to let a single LLM call be your source of truth. It's fast, it's easy to prototype, and it demos beautifully. But it gives you nothing to audit and nothing to explain when someone eventually asks why the system believed something. Building a deterministic, inspectable scoring layer took a lot more upfront design work than just wiring up a prompt, but it's the difference between a system you can defend with evidence and one you can only apologize for after the fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The project is still evolving, and there's plenty left on the roadmap. I want to expand the Neo4j and Qdrant integration further, refine the heuristics behind the Contradiction Guard so its ambiguity threshold can adapt a bit more intelligently to context, and generally keep hardening the compliance tooling as I learn more about what regulated industries actually need in practice. If any of this resonates with a problem you're facing, or you just want to dig into the internals, the full source and architecture details are on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the repo here: &lt;a href="https://github.com/alfinohatta/Corroborate.ai" rel="noopener noreferrer"&gt;github.com/alfinohatta/Corroborate.ai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Corroborate.ai is licensed under AGPL-3.0. Contributions and feedback are welcome, especially on the Confidence Auction mechanics.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kotlin</category>
      <category>nlp</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
