DEV Community: Alfino Hatta

Building Recourse.ai: A Parametric Insurance Prototype for the Age of Agentic AI

Alfino Hatta — Sat, 18 Jul 2026 09:49:19 +0000

Every enterprise I've talked to in the last year has the same story: AI agents are being deployed faster than risk and compliance teams can review them. Procurement moves in weeks, legal review moves in quarters, and the gap in between is where the real exposure lives.

Traditional Errors & Omissions (E&O) insurance was never designed for this pace. It assumes a world where risk changes slowly enough that an annual form and a broker phone call are sufficient to underwrite it. That assumption breaks down completely with autonomous systems. An agent can make thousands of decisions an hour, and a subtle failure mode (what people in this space call "silent AI" risk) can run for weeks before a human notices anything is wrong. By the time a claim gets filed under the old model, the damage has already compounded, and the insurer is reconstructing what happened from logs and depositions instead of live data.

I wanted to sit with a narrower, more concrete question: what would insurance look like if it were driven by live operational signals instead of static paperwork? Not as a thought experiment, but as something I could actually build and click through. That question turned into Recourse.ai, an Android prototype built with the Lloyd's Lab ecosystem in mind. It bridges AI operational telemetry with parametric insurance workflows, converting real-time agent behavior into fast, verifiable payouts whenever a pre-agreed safety threshold gets crossed.

This post walks through what the app does, why I made the architectural choices I made, where I deliberately cut corners as a prototype, and what I'd change before anyone put real money behind it.

What "parametric" actually means here

Parametric insurance isn't a new idea. It's used heavily in areas like crop insurance and catastrophe bonds, where a payout is triggered automatically once a measurable condition is met (rainfall below X millimeters, wind speed above Y knots), rather than requiring a lengthy claims investigation to establish fault and damages.

Recourse applies that same logic to AI operations. Instead of rainfall or wind speed, the triggering condition is something like "agent error rate exceeds a defined threshold for a defined window" or "an agent takes an action outside its approved guardrails." Once that condition is met and verified against the policy's rules, the payout process starts automatically. No adjuster has to reconstruct what happened weeks later. The telemetry data that triggered the payout is the evidence.

This matters because it changes the incentive structure on both sides. Insurers can price risk based on live signals rather than guesswork, and the organizations being insured have a direct incentive to invest in better guardrails, because better guardrails lower their premiums and reduce their exposure to slow claims.
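To make a trigger such as "agent error rate exceeds a defined threshold for a defined window" concrete, here is a minimal Kotlin sketch. The type names, fields, and the choice to require an uninterrupted breach are all assumptions made for illustration; none of them come from the Recourse codebase.

```kotlin
// Hypothetical types for illustration; not taken from the Recourse codebase.
data class ErrorRateSample(val epochSeconds: Long, val errorRate: Double)

data class TriggerRule(val maxErrorRate: Double, val windowSeconds: Long)

/**
 * Returns true when the error rate stays above the threshold for at least
 * rule.windowSeconds without interruption. Samples must be sorted by time.
 */
fun isTriggered(samples: List<ErrorRateSample>, rule: TriggerRule): Boolean {
    var breachStart: Long? = null
    for (sample in samples) {
        if (sample.errorRate > rule.maxErrorRate) {
            val start = breachStart ?: sample.epochSeconds
            breachStart = start
            if (sample.epochSeconds - start >= rule.windowSeconds) return true
        } else {
            breachStart = null // the breach was interrupted, so the window restarts
        }
    }
    return false
}
```

Because the function depends only on its inputs, the same samples and the same rule always give the same answer, which is what lets the telemetry double as the evidence for the payout.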

What Recourse actually does

At its core, the app turns AI operational risk into something measurable and insurable. It does this through three connected capabilities:

Quantifying risk. Every monitored AI agent gets a "Guardrail Maturity Score," calculated from its live behavior rather than a self-reported checklist filled out once and forgotten. The score reflects things like how often an agent operates within its defined boundaries, how it responds to anomalous inputs, and how frequently human intervention is required.

Automating payouts. When a defined threshold is breached (say, an agent's error rate crosses a contractually agreed limit, or a kill-switch is triggered), the parametric rules engine verifies the condition against the policy and initiates a payout. In the prototype, this is modeled to clear in under 48 hours, a dramatic contrast to the 90-plus days typical of traditional E&O claims.

Benchmarking performance. For finance leaders, the app surfaces real-time ROI and efficiency metrics, showing how the cost of automation compares against traditional benchmarks and how much has been recovered through the insurance layer itself.

Designing around real personas

Rather than build a generic dashboard and hope it was useful to everyone, I anchored the whole app around four personas who would actually interact with a product like this:

Persona	Primary objective	Key feature used
CFO / COO	ROI and financial stability	Executive metric grid (recovered funds, efficiency gains)
Head of AI	Operational reliability	Agent telemetry and automatic safety kill switches
General Counsel	Governance and audit	Region-specific compliance audits (EU AI Act, DIFC)
Broker	Capacity distribution	Capacity monitoring and parametric policy management

Every screen in the app maps back to the actual job of one of these people. The login flow is persona-driven, so a CFO signing in sees financial recovery numbers front and center, while a Head of AI signing in lands on agent-level telemetry. The incident deep-dive view is written for General Counsel, with estimated financial exposure and jurisdictional legal context laid out clearly enough to drop into a board memo. And the policy triggers screen exists mainly for brokers, showing the rules engine logic in a way that's transparent rather than a black box.

This persona-first approach shaped a lot of small decisions that might otherwise seem arbitrary, like why the executive entry screen is built for instant boardroom demonstrations rather than a typical consumer onboarding flow. The app isn't trying to be sticky or habit-forming. It's trying to be legible to a specific decision maker in under thirty seconds.

Architecture decisions

I went with a decoupled MVVM (Model-View-ViewModel) architecture on Android. The main reason was separation of concerns: I wanted the UI layer to stay dumb and testable, while all the payout logic and business rules lived somewhere I could reason about independently of any particular screen.

In practice, that means Activities and their Adapters handle rendering and user input, ViewModels hold UI state as StateFlow, and Repositories own the job of fetching and shaping data, whether that data comes from a live database or a fallback source. Business logic, including the payout engine itself, sits behind the Repository layer so it isn't entangled with any specific view.

The resilient data layer

The more interesting design problem was the data layer. Live demos are unforgiving. A dropped database connection mid-pitch, in front of a room full of underwriters or CFOs, kills credibility instantly, regardless of how good the actual product is. So instead of assuming a database connection would always be available, I built the repository layer to support two paths:

A live JDBC connection to MySQL 8.0, giving realistic query behavior across a nine-table insurance schema covering agents, policies, incidents, and thresholds.
A "Local Mode" fallback that transparently swaps in high-fidelity mock data if the database isn't reachable, so the app never shows a blank screen or a crash dialog in front of an audience.

That fallback logic lives in a custom DatabaseHelper class. Critically, the ViewModels above it don't need to know or care which path actually served the data. They just receive a domain object and update state accordingly. This kept the resiliency logic contained in one place instead of scattered as defensive try/catch blocks across every screen.

Here's the request lifecycle when a user opens an incident, which illustrates how the fallback fits into the normal flow rather than being a special case bolted on afterward:

User selects an incident
  -> View requests detail(incidentId) from ViewModel
    -> ViewModel asks Repository for telemetry and thresholds
      -> Repository queries MySQL (JOIN across agents/policies tables)
        -> if connected: map the ResultSet into a domain object
        -> if not connected: fall back to mock strategic data
    -> Repository returns a single domain object either way
  -> ViewModel updates UI state
-> View renders the risk report and claim status
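
The fallback step in that lifecycle can be sketched in a few lines of Kotlin. The post names a DatabaseHelper class but does not show its API, so every name below (IncidentSource, IncidentRepository, IncidentReport) is hypothetical, and the sketch simply treats any exception from the live source as "database not reachable".

```kotlin
// All names here are hypothetical; the post does not show DatabaseHelper's API.
data class IncidentReport(val incidentId: String, val errorRate: Double, val threshold: Double)

fun interface IncidentSource {
    fun load(incidentId: String): IncidentReport
}

class IncidentRepository(
    private val live: IncidentSource, // for example, a JDBC-backed query
    private val mock: IncidentSource  // bundled "Local Mode" data
) {
    /** Callers get one domain object and never learn which source served it. */
    fun detail(incidentId: String): IncidentReport =
        try {
            live.load(incidentId)
        } catch (e: Exception) {
            mock.load(incidentId) // live path failed, so serve the mock data instead
        }
}
```

Keeping the try/catch in one place is the design point: the ViewModel calls detail(...) and handles a single return type either way.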

Kotlin Coroutines and StateFlow handle the asynchronous plumbing end to end. That combination kept the UI layer free of nested callbacks and made loading, success, and error states straightforward to represent and test, since each state is just a value the ViewModel emits rather than a chain of listener callbacks.
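
As an illustration of "each state is just a value," here is a minimal sketch of a UI state type in Kotlin. The class names are assumptions, and the coroutine plumbing is left out so the example depends only on the standard library; in the app, a value like this would presumably be held in a MutableStateFlow and collected by the View.

```kotlin
// Hypothetical state model; the post does not show the app's actual classes.
sealed interface IncidentUiState {
    object Loading : IncidentUiState
    data class Success(val summary: String) : IncidentUiState
    data class Error(val message: String) : IncidentUiState
}

// Exhaustive `when`: the compiler rejects this function if a state is left unhandled.
fun render(state: IncidentUiState): String = when (state) {
    IncidentUiState.Loading -> "Loading incident..."
    is IncidentUiState.Success -> state.summary
    is IncidentUiState.Error -> "Could not load incident: ${state.message}"
}
```

Testing then reduces to constructing a state and asserting on what render returns, with no listeners or callbacks to fake.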

Why "Local Mode" isn't just a demo hack

It would be easy to dismiss the mock data fallback as a shortcut for pitching investors, and honestly, part of its purpose is exactly that. But it also reflects something real about how these systems need to behave in production. An insurance workflow tool that goes dark the moment a database connection blips is not a tool anyone will trust with time-sensitive payout logic. Building the resiliency pattern in from day one, even in a prototype, forced me to think about failure modes early rather than treating them as an afterthought once something breaks in front of a real user.

The technology stack, and why each piece was chosen

Kotlin as the primary language, leaning on Jetpack KTX extensions (things like isVisible) to keep the codebase concise and idiomatic rather than fighting Java-style verbosity.
Material 3 for the UI layer, styled with a navy and teal palette. The goal was to make the app read as serious enterprise fintech rather than a generic consumer mobile app, since the audience is CFOs and General Counsel, not retail users.
Coroutines and StateFlow for asynchronous data handling throughout, which kept state management consistent across every screen instead of mixing patterns.
MySQL 8.0 with JDBC on the backend. I chose direct JDBC over building a full REST API layer because speed of prototyping mattered more than production correctness at this stage, with the explicit understanding that a real deployment would need an authenticated API in between.
Hardcoded compliance engines covering EU AI Act Article 14, DIFC Regulation 10, and the MAS Sandbox Plus. These aren't placeholder text. The audit trail in the incident detail view reflects the actual regulatory language for each jurisdiction, so the compliance story holds up under scrutiny from someone who actually knows these frameworks.

Data flow and the sequence behind a claim

To make the parametric trigger verification concrete, it helps to walk through the full sequence rather than just describing it in prose. When a user selects an incident from the dashboard, the View asks the ViewModel for detail on that specific incident ID. The ViewModel, in turn, asks the Repository to fetch both the telemetry data and the relevant policy thresholds. The Repository issues a SQL query that joins across the agents and policies tables in MySQL.

If the database connection is healthy, the query returns a ResultSet that gets mapped into a domain object. If the connection isn't available, the Repository silently falls back to loading the equivalent mock data, constructed to be realistic enough that a viewer can't tell the difference during a demo. Either way, the ViewModel receives a single, consistent domain object, updates the UI state, and the View renders the risk report along with the current claim status.

The point of walking through this in detail is that the fallback isn't a separate code path that diverges from the "real" one. It's integrated at the point where data enters the system, which means every downstream consumer of that data behaves identically regardless of where the data actually came from.

Security and compliance considerations

A few principles guided the security and compliance design, even at prototype stage:

Data minimization. The telemetry layer only captures agent metadata and financial amounts, deliberately avoiding personally identifiable information wherever possible. This wasn't an afterthought bolted on before a compliance review. It was a constraint from the first schema design, because the whole pitch of the product depends on being trustworthy with sensitive operational data.

Deterministic, inspectable rules. The parametric verification logic is intentionally deterministic. Given the same telemetry and the same policy thresholds, it produces the same payout decision every time. This matters because the entire value proposition of parametric insurance collapses if the trigger logic feels like a black box. If a General Counsel can't explain to a board why a payout did or didn't happen, the product hasn't actually solved the trust problem it set out to solve.

Direct database access, with an asterisk. In a production environment, direct JDBC access from the client app would be replaced with an authenticated REST API. Using JDBC directly in the prototype was a deliberate trade-off for demo performance and development speed, not a decision I'd defend for a real deployment handling real policyholder data.

Licensing choice: why AGPL-3.0

I licensed the project under the GNU Affero General Public License v3.0 rather than a more permissive option like MIT or Apache 2.0, and this wasn't a default choice made without thinking about it.

The core argument for parametric insurance in this space is that claims shouldn't be a black box. If that's the actual thesis of the product, then the rules engine deciding who gets paid and when should be inspectable by the people affected by it, not locked away inside a proprietary hosted service that nobody outside the company can audit. AGPL specifically closes the loophole that plain GPL leaves open: if someone takes this code, modifies it, and runs it as a hosted service without ever distributing the software, standard GPL wouldn't require them to share those changes. AGPL does, by requiring that users who interact with the service over a network be offered the modified source. That felt directly aligned with what the project is trying to prove, rather than an incidental legal detail.

What I'd change before this went anywhere near production

I want to be upfront about the parts of this that are deliberately prototype-grade, because a technical write-up that only talks about what worked isn't very useful to anyone else building something similar.

The direct JDBC connection from the client is the biggest one. It's fine for a controlled demo running against a database I control, but it would be a serious liability in production. Any real version of this needs an authenticated REST or GraphQL API sitting between the client and the database, with proper access control and rate limiting.

The compliance engine is currently rules-based and hardcoded per jurisdiction. That works for three regions in a demo, but it doesn't scale. A production version would need to externalize that logic into a proper policy engine, something more like a rules-as-configuration system, so that adding a new jurisdiction doesn't mean shipping a new app build.
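
One possible shape for that rules-as-configuration idea is sketched below in Kotlin. The line format, the field names, and the notion of "required evidence" are all assumptions made for illustration; the sketch says nothing about what any regulation actually requires.

```kotlin
// Hypothetical rules-as-configuration model; field names and line format are assumptions.
data class ComplianceRule(
    val jurisdiction: String,
    val reference: String,
    val requiredEvidence: Set<String>
)

/** Parses one "jurisdiction|reference|evidence1,evidence2" line from a config file. */
fun parseRule(line: String): ComplianceRule {
    val parts = line.split("|").map { it.trim() }
    require(parts.size == 3) { "Expected 3 fields but got ${parts.size}: $line" }
    val evidence = parts[2].split(",").map { it.trim() }.filter { it.isNotEmpty() }.toSet()
    return ComplianceRule(parts[0], parts[1], evidence)
}

/** Returns the evidence items a rule requires that the incident record lacks. */
fun missingEvidence(rule: ComplianceRule, provided: Set<String>): Set<String> =
    rule.requiredEvidence - provided
```

With rules stored as data, adding a jurisdiction would mean shipping a new config line rather than a new app build. The evidence keys in any such line are placeholders to be filled in by someone who knows the framework.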

The mock data, while high fidelity, is still fabricated. Before any real underwriting decisions get made on top of this, the telemetry ingestion pipeline needs to be validated against actual agent logging formats from real deployments, which vary a lot more than a clean mock dataset suggests.

And finally, the "Guardrail Maturity Score" as implemented is a reasonable first pass at quantifying agent behavior, but it hasn't been validated against actual loss data. Turning it into something an actuary would sign off on requires a much longer feedback loop between the score and real claims history, which by definition doesn't exist yet for a category this new.

Lessons from building it

A few things stood out to me while working on this that I think generalize beyond this specific project.

Building resiliency in early, even for a prototype, pays off. The Local Mode fallback started as a way to survive flaky demo Wi-Fi, but it ended up shaping how I thought about the whole data layer, and it's a pattern I'd default to again for anything that needs to be presented live.

Designing around named personas rather than generic user stories made a surprising number of small UI decisions obvious that would otherwise have required guesswork. Once I knew a CFO was the one looking at a given screen, questions like "what number goes at the top" answered themselves.

And choosing a license is a design decision, not an administrative one. AGPL wasn't the "safe" default choice for a project hoping to attract commercial interest, but it was the choice that actually matched what the product claims to stand for.

Where it's headed

Recourse is explicitly positioned as a first-mover bet in the agentic AI insurance space, and the roadmap reflects that ambition in stages. The near-term plan is a beachhead in D2C retail, operating inside regulatory sandboxes like those run by MAS and ADGM, where the compliance burden of a full launch is lower. From there, the path runs toward securing A-rated capacity and formalizing a broker channel, since parametric insurance at scale needs real underwriting capacity behind it, not just clever software. The longer-term pivot is toward white-labeled, embedded platform distribution, so the parametric rules engine could eventually sit inside other companies' products rather than only existing as a standalone app.

None of that is guaranteed, and a lot of it depends on things outside the codebase, like regulatory appetite and actual underwriting partnerships. But having a working prototype that a CFO, a Head of AI, and a General Counsel can each click through and immediately understand is a meaningfully different starting point than a slide deck.

If you want to look at the actual implementation, including the mock data layer, the compliance engine logic, and the full MVVM structure, the source is on GitHub: github.com/alfinohatta/recourse

Building Northbridge Analytics: Turning Gut Feelings Into Governed Data

Alfino Hatta — Wed, 08 Jul 2026 07:24:38 +0000

Every large organization has the same quiet problem: the best information about risk doesn't live in a database. It lives in Slack threads, hallway conversations, and the "gut feeling" of a regional expert who's watched a particular market or regulatory environment for years. That judgment is valuable, but it's almost never quantified, almost never checked against reality, and almost never connected to an auditable trail of action.

I've watched this play out from the outside, too. A risk committee gets a slide that says "our regional lead is concerned about currency exposure in this market," and that's it. No number, no history of whether that lead has been right before, no comparison to what the actual market is pricing in. The concern might be completely justified. It might also be recency bias from one bad quarter. Nobody in the room has a good way to tell the difference, so the decision either gets made on vibes or gets deferred until it's too late to act cheaply.

I wanted to build something that closes that gap: a tool that forces expert intuition into a number, benchmarks that number against real market data, and keeps a governed record of what happened next. That became Northbridge Analytics, a decision-support platform aimed at CFOs and Chief Risk Officers who need more than anecdotes when they're deciding whether to hedge exposure.

The core idea

Northbridge is built around one simple tension: internal belief vs. external reality.

ICP (Internal Consensus Probability), a weighted probability derived from what your own regional experts believe will happen.
EMP (External Market Probability), the real-time signal coming from prediction markets and financial data.

When those two numbers drift apart, that's a blind spot. Someone inside the organization either knows something the market doesn't, or the organization is dangerously out of touch. Either way, it's worth surfacing, and Northbridge's Divergence Engine does exactly that, flagging significant gaps so they can be reviewed before they become expensive.

I spent a lot of time thinking about what "significant" should mean here. A naive version of this feature would just subtract ICP from EMP and flag anything over some fixed threshold, like ten percentage points. But a ten-point gap on a coin-flip event isn't the same as a ten-point gap on something the market already prices as a near-certainty. So the divergence calculation has to account for where the two numbers sit on the probability curve, not just the raw distance between them. Getting that math right, so that the flags actually correspond to decisions worth making instead of noise, took several iterations.
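
One common way to account for position on the probability curve is to compare the two numbers in log-odds space rather than as raw percentages. The Kotlin sketch below shows that approach; it is an assumption chosen for illustration, not necessarily the formula Northbridge uses, and the threshold parameter is hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.ln

/** Log-odds (logit) of a probability; p must be strictly between 0 and 1. */
fun logit(p: Double): Double {
    require(p > 0.0 && p < 1.0) { "probability must be in (0, 1), got $p" }
    return ln(p / (1.0 - p))
}

/** Absolute gap between internal (ICP) and external (EMP) probabilities in log-odds space. */
fun divergence(icp: Double, emp: Double): Double = abs(logit(icp) - logit(emp))

/** The threshold is a hypothetical tuning parameter, not a value from the post. */
fun isFlagged(icp: Double, emp: Double, threshold: Double): Boolean =
    divergence(icp, emp) >= threshold
```

Under this measure, a ten-point gap between 45% and 55% comes out at roughly 0.40, while the same ten points between 89% and 99% come out at roughly 2.5, so a single threshold treats the near-certainty case as the far larger disagreement.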

To make the internal side trustworthy over time, I added reputation weighting based on Brier Skill Scores, so experts who are consistently accurate carry more influence in the consensus than those who are just loud or senior. Calibration, not politics, should decide whose forecast matters most. This part mattered a lot to me personally. Anyone who has sat in a room where the most confident person automatically wins the argument knows how badly that can go. A system that quietly tracks who has actually been right, and lets that history shape influence over time, is a small structural nudge toward better decisions. It doesn't eliminate politics, but it makes the politics have to argue with a track record.
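
For readers unfamiliar with the scoring, here is a minimal Kotlin sketch of Brier-based weighting. The Brier score and skill score follow their standard definitions; the reference score of 0.25 (always forecasting 50%), the rule that non-positive skill earns zero weight, and all type names are assumptions for illustration rather than Northbridge's actual logic.

```kotlin
data class ResolvedForecast(val probability: Double, val occurred: Boolean)
data class ExpertEstimate(val probability: Double, val skill: Double)

/** Mean squared error between forecast probabilities and outcomes (0 is perfect). */
fun brierScore(history: List<ResolvedForecast>): Double {
    require(history.isNotEmpty()) { "need at least one resolved forecast" }
    return history.sumOf { f ->
        val outcome = if (f.occurred) 1.0 else 0.0
        (f.probability - outcome) * (f.probability - outcome)
    } / history.size
}

/** Skill relative to a reference score; 0.25 is the Brier score of always forecasting 50%. */
fun brierSkillScore(history: List<ResolvedForecast>, referenceScore: Double = 0.25): Double =
    1.0 - brierScore(history) / referenceScore

/** Consensus weighted by skill; experts with no positive skill get zero weight (an assumption). */
fun weightedConsensus(estimates: List<ExpertEstimate>): Double {
    require(estimates.isNotEmpty()) { "need at least one estimate" }
    val weights = estimates.map { maxOf(it.skill, 0.0) }
    val total = weights.sum()
    if (total == 0.0) return estimates.map { it.probability }.average() // nobody has skill yet
    return estimates.zip(weights).sumOf { (e, w) -> e.probability * w } / total
}
```

An expert whose history scores worse than the reference forecast ends up with a negative skill score, which this sketch clamps to zero so that a poor track record removes influence instead of inverting it.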

Architecture decisions

I built the client as a native Kotlin Android app, following an MVVM pattern with StateFlow for reactive, lifecycle-aware state management. A few decisions I made early on, and why:

Offline-first with Room. Regional experts don't always have reliable connectivity, whether that's someone traveling, working from a site with poor infrastructure, or just being on a flight when a market moves. Local persistence via Room, which sits on top of SQLite, had to be the source of truth on-device, syncing to the backend when possible. This meant designing the local schema first and treating the network as an enhancement rather than a dependency, which is the opposite of how a lot of apps are built by default.

Retrofit and MySQL as the "Global Ledger." The backend needed to act as a single, centralized record. Every forecast, review, and hedge approval writes to an audit log that can't quietly disappear. I used Retrofit to handle the HTTP layer cleanly and kept the API contracts strict, since a governance tool is only as trustworthy as its ability to prove that nothing has been altered after the fact.

Repository pattern. A RiskRepository sits between the UI and both the local Room database and the remote API, handling sync logic and running the divergence calculations before pushing results back into state. Keeping this logic out of the ViewModels made the sync behavior much easier to reason about and test in isolation, and it meant the UI layer never had to know or care whether a given probability came from the cache or from a fresh network call.

Governed workflows over free-for-all inputs. Once a probability estimate is submitted, it's immutable. If a hedge gets approved, that's logged too. The whole point is an audit trail a risk committee can actually trust. This constraint shaped the database design from day one. Instead of allowing updates to existing rows, corrections and revisions are stored as new entries that reference the original, so the full history of how a belief evolved is always visible rather than overwritten.
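
The append-only idea can be sketched in Kotlin as an in-memory ledger. The post describes the constraint but not the schema, so the class and field names below (Estimate, EstimateLedger, revisesId) are hypothetical stand-ins for whatever the Room and MySQL tables actually look like.

```kotlin
// Hypothetical append-only ledger; the post describes the idea but not its schema.
data class Estimate(
    val id: Int,
    val expertId: String,
    val probability: Double,
    val revisesId: Int? = null // points at the entry this one corrects, if any
)

class EstimateLedger {
    private val entries = mutableListOf<Estimate>()

    /** Appends a new entry; existing entries are never updated or removed. */
    fun submit(expertId: String, probability: Double, revisesId: Int? = null): Estimate {
        require(probability in 0.0..1.0) { "probability must be in [0, 1]" }
        require(revisesId == null || entries.any { it.id == revisesId }) { "unknown entry $revisesId" }
        val entry = Estimate(entries.size + 1, expertId, probability, revisesId)
        entries += entry
        return entry
    }

    /** Walks revision links back to the original entry, newest first. */
    fun history(id: Int): List<Estimate> {
        val chain = mutableListOf<Estimate>()
        var current = entries.firstOrNull { it.id == id }
        while (current != null) {
            chain += current
            val previousId = current.revisesId
            current = entries.firstOrNull { it.id == previousId }
        }
        return chain
    }
}
```

The class exposes no update or delete method at all, so the only way to change a belief is to add a new entry that points at the old one.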

Putting it together, the pipeline looks roughly like this: an expert submits a quantified estimate, the app recalculates the weighted internal consensus, the Divergence Engine compares it to the external market signal, and if the gap crosses a threshold, it flags the CFO or CRO for review. An approved hedge gets committed to the ledger, and the whole thing can be exported as a PDF for the board. Every step in that chain is designed to leave a trace, so that six months later someone can reconstruct exactly what was known, by whom, and when.

What I learned

The hardest part wasn't the Kotlin or the sync logic. It was designing for governance rather than just functionality. It's easy to build a form that captures a probability. It's harder to build a system where that probability can never be quietly edited after the fact, where every state transition is logged, and where the audit trail is actually useful to a risk committee months later. That constraint shaped almost every technical decision, from making estimates immutable at the database layer to structuring the role-based access control around who's even allowed to resolve an event.

I also learned how much thought has to go into role design before a single screen gets built. Deciding who can submit a forecast, who can review a divergence flag, and who has the authority to approve a hedge is not a UI problem; it's an organizational modeling problem. I ended up sketching out the actual approval chain of a hypothetical CFO's office on paper before writing any permission logic, because getting that model wrong would mean rebuilding a lot of the backend later.

Another lesson was around how to present uncertainty without letting the interface lie by omission. It would have been easy to just show a single blended probability number on the main screen and call it a day. But a single number hides disagreement. If three experts are clustered tightly around 30% and two are convinced it's 70%, that's a very different situation from five experts who all independently landed on 45%, even though the weighted average might look similar. So the UI needed to expose the spread of opinion, not just the summary statistic, which pushed me to think more carefully about how to visualize distributions on a small mobile screen without overwhelming the user.
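
The example in that paragraph is easy to check numerically. The Kotlin sketch below computes an unweighted mean and a population standard deviation; using standard deviation as the measure of spread is an assumption here, since the post does not say which statistic the UI shows.

```kotlin
import kotlin.math.sqrt

data class OpinionSummary(val mean: Double, val spread: Double)

/** Unweighted mean and population standard deviation of expert estimates. */
fun summarize(estimates: List<Double>): OpinionSummary {
    require(estimates.isNotEmpty()) { "need at least one estimate" }
    val mean = estimates.average()
    val variance = estimates.sumOf { (it - mean) * (it - mean) } / estimates.size
    return OpinionSummary(mean, sqrt(variance))
}
```

Three experts at 30% and two at 70% give a mean of 46% with a spread of about 20 points, while five experts at 45% give a mean of 45% with essentially no spread. The two summaries are one point apart, and only the spread shows that the first group is split.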

What's next

There's plenty on the roadmap. I want to build deeper reporting around "value protected" metrics, so a CFO can see not just that a hedge was approved but what it likely saved the organization in expected terms. I'm also interested in more sophisticated calibration across multiple event cycles, since Brier scores get more meaningful the more forecasting history an expert accumulates, and right now the reputation system is still working with a fairly thin data set. Beyond that, I'd like to extend the client past Android, since a lot of the value of Northbridge only compounds if more of an organization's experts can easily contribute a forecast from whatever device they happen to be using.

But even in its current state, Northbridge does the one thing I set out to do. It turns scattered judgment into a number, and it turns that number into something governed.

If you're curious about the code, architecture diagrams, or want to run it locally, the full project is open source here:

GitHub: github.com/alfinohatta/northbridge

Building Quorra: An Android Decision-Intelligence Platform That Actually Explains Itself

Alfino Hatta — Mon, 06 Jul 2026 16:51:07 +0000

Why I built a "reasoning layer" instead of another dashboard

If you've spent any time in operations, retail, or supply chain, you've seen this pattern before: a company invests heavily in dashboards, everyone's staring at the same charts, and yet decisions still get made on gut feel, or worse, they don't get made at all until it's too late. That gap between having data and acting on it is what pushed me to build Quorra, a Decision-Intelligence (DI) platform for Android.

Quorra isn't trying to be a prettier BI tool. It's trying to solve a much harder problem: turning fragmented operational data from systems like ERP, CRM, and WMS into recommendations that people can actually trust and act on, with a human still firmly in control.

The more I dug into this space, the more I realized that "more data" was never the actual bottleneck for most operations teams. Most facilities I looked at were already drowning in dashboards, KPI widgets, and nightly reports. What they lacked was a layer that could take all of that raw signal and turn it into something closer to a recommendation: not just "here's what happened," but "here's what you should probably do next, and here's why."

That distinction, between reporting and reasoning, became the entire premise of the project.

The problem I kept running into

Every enterprise I looked at had some version of the same three issues:

Decision paralysis. Managers simply can't process data fast enough to act on it in time. By the time a stock-out or a margin problem shows up clearly on a dashboard, the window to act cheaply has usually already closed.
A trust deficit. Black-box AI recommendations get ignored because nobody can see the reasoning behind them. If a system tells an operator to reroute inventory but can't explain why, the natural response is to ignore it, especially when the stakes are financial.
Expensive reversals. Misallocated inventory or a mispriced contract is painful and costly to undo once it's already happened. Once a decision is executed, walking it back is rarely as simple as clicking undo.

Dashboards show you the "what." They almost never show you the "so what do I do about it," and they definitely don't tell you why a recommendation makes sense. I kept coming back to the idea that the real product wasn't a dashboard at all. It was a decision, packaged with enough context that a human could sign off on it confidently and quickly.

That reframing changed almost everything about how I approached the build, from the data model up to the UI.

The approach: Human-on-the-Loop, not full automation

Instead of building something that auto-approves decisions, I designed Quorra around Human-on-the-Loop (HOTL), a design philosophy that sits deliberately between full automation and manual analysis. The system proposes, ranks, and explains, but a person always makes the final call on anything high-stakes. I wanted the app to behave less like an autopilot and more like a very well-prepared analyst standing behind the decision-maker's shoulder.

Every recommendation ships with three things baked in:

A plain-language justification for why it's being suggested, written so a non-technical facility lead can understand it without needing a data science background.
A dual-confidence score, one for data quality and one for model confidence, so the source of uncertainty is never hidden behind a single vague percentage.
A sandbox "what-if" environment to simulate the outcome before committing to it, so the cost of testing an idea is close to zero.

That last piece became one of my favorite parts of the build: a simulation laboratory where a manager can test a scenario, such as "what if this shipment is two days late," before touching anything real. Instead of asking people to trust a black box, the app lets them poke at it, run a few scenarios, and build their own intuition for when the system's recommendations are solid and when they should be second-guessed.

I think this is the part of the project I'd defend most strongly if someone pushed back on it. Full automation is tempting because it looks more impressive in a demo, but in high-stakes operational environments, an unexplainable automated decision is a liability, not a feature. HOTL is slower to demo but far more likely to actually get adopted and trusted by real teams.
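
To give a feel for the "what if this shipment is two days late" scenario mentioned above, here is a deliberately simple Kotlin sketch. It assumes constant daily demand and a single inbound shipment, and every name in it is hypothetical; Quorra's actual simulation model is not described in the post.

```kotlin
// Hypothetical what-if model: constant daily demand and one inbound shipment.
data class Scenario(
    val onHand: Int,
    val dailyDemand: Int,
    val shipmentDay: Int,
    val shipmentUnits: Int
)

/** Returns the first day (1-based) on which demand cannot be met, or null if stock lasts. */
fun firstStockOutDay(s: Scenario, horizonDays: Int): Int? {
    var stock = s.onHand
    for (day in 1..horizonDays) {
        if (day == s.shipmentDay) stock += s.shipmentUnits // shipment arrives at start of day
        if (stock < s.dailyDemand) return day
        stock -= s.dailyDemand
    }
    return null
}
```

A manager could compare a baseline against base.copy(shipmentDay = base.shipmentDay + 2) and see whether the delay introduces a stock-out, all without touching real inventory records.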

What's under the hood

Quorra is a native Android app, and I made a conscious decision early on to keep the stack fully native rather than reaching for a cross-platform framework. Given how much of the UI depends on custom visualizations and role-based rendering, native Kotlin gave me more control over performance and layout precision than I would have had otherwise.

Frontend and UI

Kotlin 2.1.0 as the core language for the entire app.
Jetpack Compose for a declarative, "command center" style interface that could adapt fluidly based on user role and context.
Material 3 for the enterprise-grade visual system, giving the app a consistent, professional feel without having to design a component library from scratch.
Custom Canvas API charts for trendlines and gauges. Off-the-shelf chart libraries weren't going to give me the visual density and customization the recommendation views needed, so I built these by hand.

Backend and data

Room for offline-first local persistence, since facility environments don't always have reliable connectivity and the app needs to keep functioning when the network drops.
Retrofit 2.11 for type-safe API sync with enterprise backends, handling the messier job of reconciling remote system data with what's stored locally.
Coroutines and StateFlow for reactive state management throughout the app, which turned out to be essential once the UI needed to update in near real time as new recommendations came in.
Gson for serializing the more complex data models, particularly the nested reasoning objects attached to each recommendation.

Tooling

KSP for annotation processing with Room, which cut build times noticeably compared to the older KAPT approach.
Gradle Kotlin DSL for the build configuration, mostly for the stronger typing and better IDE support.
MockK and JUnit for testing, especially around the reconciliation logic between local and remote data.

The architecture follows a fairly classic MVVM-plus-repository pattern, but the repository layer does more work than usual. It functions as the single source of truth reconciling local Room data with remote sync, which matters a lot when the app needs to keep functioning in low-connectivity facilities. In practice, this meant spending a disproportionate amount of time on conflict resolution logic: what happens when a recommendation gets updated on the server while a facility lead is mid-review on a phone with no signal. Getting that right was far less glamorous than building the simulation lab, but it mattered just as much for the app to be trustworthy in the field.
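To make the mid-review conflict case concrete, here is one possible merge policy written as a pure function. It is a sketch under assumptions: the version counter, the `pendingReviewNote` field, and the "ask for a re-review" outcome are all hypothetical, and the actual repository logic may resolve conflicts differently.

```kotlin
// Hypothetical entity; not the real Room schema.
data class Recommendation(
    val id: String,
    val serverVersion: Int,                // version last seen from the server
    val payload: String,                   // recommendation content
    val pendingReviewNote: String? = null  // local reviewer input that has not synced yet
)

sealed interface MergeResult {
    data class Accept(val merged: Recommendation) : MergeResult
    data class NeedsReReview(val merged: Recommendation, val reason: String) : MergeResult
}

fun reconcile(local: Recommendation, remote: Recommendation): MergeResult {
    require(local.id == remote.id) { "Cannot reconcile different recommendations" }
    // Server has nothing newer: keep the local copy, including any pending note.
    if (remote.serverVersion <= local.serverVersion) return MergeResult.Accept(local)
    // Server moved on and nobody was mid-review: take the remote copy as-is.
    val note = local.pendingReviewNote ?: return MergeResult.Accept(remote)
    // Server moved on during an offline review: adopt the new content,
    // keep the reviewer's note, and ask them to look again.
    return MergeResult.NeedsReReview(
        merged = remote.copy(pendingReviewNote = note),
        reason = "Changed from v${local.serverVersion} to v${remote.serverVersion} during offline review"
    )
}

fun main() {
    val local = Recommendation("r1", 3, "Reorder 200 units", pendingReviewNote = "Looks fine")
    val remote = Recommendation("r1", 4, "Reorder 350 units")
    println(reconcile(local, remote))
}
```

Keeping the merge as a pure function makes it straightforward to unit test with JUnit, without touching Room or the network.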

Designing for four very different users

One thing that shaped a lot of early decisions was realizing Quorra needed to serve genuinely different personas, not just "the user" as a single abstract concept. Early prototypes tried to build one interface that served everyone, and it was a mess: too cluttered for an operator who just needed to glance and act, too shallow for an approver who needed full context before signing off on a six-figure decision.

Persona	Role	What they need
Operator	Facility lead	Monitor the feed, catch stock-outs or risk, simulate recovery paths
Approver	Audit or finance lead	Review high-stakes escalations, sign off on capital commitments
Configurer	Data or IT manager	Manage the underlying "playbook" logic, watch data source health
Admin	System administrator	Set governance rules, manage cross-facility permissions

Splitting the app by persona meant building UI that reactively adapts based on role. An Approver sees sign-off controls that are never even rendered for an Operator, and vice versa, rather than the same screen with certain buttons simply hidden or disabled. It also meant baking governance into the architecture from day one: automated escalation rules trigger based on financial thresholds or confidence deficits, rather than being bolted on later as an afterthought.

This ended up paying off in a way I didn't fully anticipate. Because each persona's needs were modeled explicitly, adding new governance rules later (for example, a new escalation threshold for a specific facility type) was a matter of adjusting configuration rather than rewriting UI logic. Treating personas as first-class citizens in the data model, not just in the design mockups, turned out to be one of the better architectural decisions I made.
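The persona split and the escalation rules described above could be modeled roughly like this. The action names, the persona-to-action mapping, and the `EscalationPolicy` fields are assumptions based on the table, not the actual Quorra model.

```kotlin
enum class Persona { OPERATOR, APPROVER, CONFIGURER, ADMIN }

enum class Action { MONITOR_FEED, RUN_SIMULATION, SIGN_OFF, EDIT_PLAYBOOK, MANAGE_PERMISSIONS }

// Decides which controls get composed at all for a persona,
// as opposed to rendering everything and hiding buttons.
fun actionsFor(persona: Persona): Set<Action> = when (persona) {
    Persona.OPERATOR -> setOf(Action.MONITOR_FEED, Action.RUN_SIMULATION)
    Persona.APPROVER -> setOf(Action.SIGN_OFF)
    Persona.CONFIGURER -> setOf(Action.EDIT_PLAYBOOK)
    Persona.ADMIN -> setOf(Action.MANAGE_PERMISSIONS)
}

// Governance as data: a new threshold for a facility type is a new policy value,
// not a UI change.
data class EscalationPolicy(val maxAutoAmount: Double, val minConfidence: Double)

data class Decision(val amount: Double, val confidence: Double)

// Escalate on either a financial threshold or a confidence deficit.
fun needsEscalation(decision: Decision, policy: EscalationPolicy): Boolean =
    decision.amount > policy.maxAutoAmount || decision.confidence < policy.minConfidence

fun main() {
    val policy = EscalationPolicy(maxAutoAmount = 50_000.0, minConfidence = 0.8)
    println(actionsFor(Persona.OPERATOR))
    println(needsEscalation(Decision(amount = 120_000.0, confidence = 0.95), policy))
    println(needsEscalation(Decision(amount = 1_000.0, confidence = 0.6), policy))
}
```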

The audit ledger: making trust auditable, not just claimed

A recommendation engine is only as trustworthy as its paper trail. Quorra keeps an immutable audit ledger that records every decision, which persona made it, and the reasoning behind it at the time it was made. This wasn't a compliance checkbox I added late in the process. It was part of the initial data model, because I wanted trust in the system to be something that could be independently verified rather than something the app simply asserted about itself.

I also baked an "Operational Best Practices" rule directly into the logic: never auto-approve anything flagged as high risk, and always require a written reasoning note for overrides. Together with the ledger, the goal was to make the system's trustworthiness something you can verify, not something you have to take on faith. If an approver overrides a recommendation, that override and its justification live permanently in the ledger, visible to anyone reviewing that decision later.
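Here is a small sketch of how those two rules and an append-only ledger could fit together. The class names and the in-memory list are assumptions for illustration; a real ledger would also need storage-level guarantees, which this example does not attempt.

```kotlin
import java.time.Instant

enum class Risk { LOW, HIGH }

data class LedgerEntry(
    val recommendationId: String,
    val persona: String,
    val outcome: String,
    val reasoning: String,
    val recordedAt: Instant
)

class AuditLedger {
    private val entries = mutableListOf<LedgerEntry>()

    // Append is the only write operation; there is deliberately no update or delete.
    fun append(entry: LedgerEntry) { entries += entry }

    fun all(): List<LedgerEntry> = entries.toList() // defensive copy
}

class DecisionService(private val ledger: AuditLedger) {
    // Rule 1: anything flagged high risk is never auto-approved.
    fun autoApprove(id: String, risk: Risk): Boolean {
        if (risk == Risk.HIGH) return false
        ledger.append(LedgerEntry(id, "SYSTEM", "AUTO_APPROVED", "Low risk, within policy", Instant.now()))
        return true
    }

    // Rule 2: an override without a written reasoning note is rejected.
    fun override(id: String, persona: String, note: String) {
        require(note.isNotBlank()) { "Overrides require a written reasoning note" }
        ledger.append(LedgerEntry(id, persona, "OVERRIDDEN", note, Instant.now()))
    }
}

fun main() {
    val ledger = AuditLedger()
    val service = DecisionService(ledger)
    println(service.autoApprove("rec-1", Risk.HIGH)) // false: must go to a human
    service.override("rec-1", "APPROVER", "Supplier confirmed new delivery date by phone")
    ledger.all().forEach(::println)
}
```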

In hindsight, this is probably the single feature that would matter most in a real enterprise deployment. Dashboards and recommendation engines are common. A genuinely auditable decision trail, one that survives scrutiny from finance or compliance teams months after the fact, is much rarer, and it only works if it's designed in from the start rather than retrofitted.

What I'd tell someone starting a similar project

If you're building something in this decision-support space, a few things I learned along the way:

Explainability isn't a UI feature, it's an architecture decision. You can't bolt "why" onto a recommendation after the fact. The reasoning needs to be a first-class object in your data model from the start, stored, versioned, and retrievable, not just rendered as a tooltip in the interface.
Offline-first is not optional for operational tools. Facilities and warehouses don't always have great connectivity, and a DI tool that stops working when the network drops isn't a DI tool, it's a liability that people will stop trusting the first time it fails them at a critical moment.
Simulation is more valuable than prediction. People trust "let me test this" far more than they trust "trust me, this will work." Giving users a low-stakes way to explore a recommendation before committing to it does more for adoption than any amount of model accuracy tuning.
Design for the least technical persona first. If the interface makes sense to the facility operator glancing at it between tasks, it will almost certainly still make sense to the more technical personas. The reverse is rarely true.
Governance is cheaper to build early than to retrofit. Escalation thresholds, audit trails, and override rules are much easier to weave into the core data model at the start than to graft onto a system that was never designed to be scrutinized.

Try it out

Quorra is open source and built entirely in Kotlin. If you want to dig into the architecture, the persona-based UI logic, or the simulation lab implementation, the full code is on GitHub:

👉 github.com/alfinohatta/Quorra

Feedback, issues, and pull requests are all welcome, especially if you've tackled similar explainability or offline-sync challenges in your own projects. If you end up poking around the repo, I'd genuinely love to hear what you'd have built differently.

Building Corroborate.ai: An Auditable Way to Decide What an AI Actually "Knows"

Alfino Hatta — Sat, 04 Jul 2026 23:32:16 +0000

Why I built a knowledge arbitration engine instead of just another memory layer

If you've spent any real time building with LLMs, you've probably run into the same wall I did. Memory systems today are very good at storing things and surprisingly bad at explaining why they believe what they believe. Most of the popular options, things like Mem0, Zep, and similar tools, tend to boil the whole problem down to a single model call. The model looks at a handful of facts, picks a winner, and the system moves on with its life. There's no audit trail, no way to reason about why one claim beat another, and honestly, no real concept that a fact could be true in one place and false in another.

That last part is what actually bothered me enough to start building something new.

Think about it for a second. A pricing rule can be perfectly legal in Germany and illegal in the United States. A regulation can be accurate today and obsolete in six months. Two sources can both look credible on paper and still flatly contradict each other. Collapsing all of that nuance into a single black box LLM judgment felt like exactly the wrong foundation for anything used in insurance, legal, or banking contexts, domains where you eventually have to explain your reasoning to a regulator, not just satisfy a user with a plausible sounding answer.

So I built Corroborate.ai, an Android reference client for a knowledge arbitration engine that treats truth not as a single yes or no output, but as a function of context. The mechanism at the center of it all is something I call a Confidence Auction.

The problem with letting one model call decide what's true

Before I get into how Corroborate.ai works, I want to spend a little more time on why I think the "single model call" approach is such a fragile pattern, because it's genuinely everywhere right now.

When you ask an LLM something like "is this claim true," you're really asking it to do three jobs at once. First, it has to retrieve or recall relevant information. Second, it has to weigh the credibility of that information against competing information. Third, it has to make a final judgment call and phrase that judgment confidently, because that's how these models are trained to communicate. The problem is that all three of those jobs happen inside a single opaque forward pass. You get an answer, but you don't get the reasoning that produced it, and you definitely don't get anything you could hand to a compliance officer or an auditor and say "here's why the system believed this."

For a lot of consumer use cases, that's a perfectly acceptable trade off. Nobody needs an audit trail for a chatbot recommending a recipe. But the moment you're dealing with regulated industries, that opacity turns into liability. If your system tells an insurance agent that a policy exclusion applies, and it turns out to be wrong, "the LLM said so" is not an answer anyone wants to give a regulator, a client, or a court.

I wanted a system where the reasoning was visible by construction, not bolted on after the fact as an explanation generated by yet another LLM call trying to rationalize a decision it already made.

The core idea: deterministic scoring instead of a single vibe check

Instead of asking an LLM to render a verdict, Corroborate.ai runs every candidate claim through a deterministic scorer ensemble. Each claim is evaluated along several independent dimensions, and each of those dimensions is computed by its own dedicated scorer rather than a single model guessing at all of them simultaneously.

The dimensions are:

Source Reliability (Sr). How trustworthy is the origin of this claim? A claim from a verified regulatory filing should not carry the same weight as one scraped from an anonymous forum post.
Recency Decay (St). How stale is this information? Facts age, and some age faster than others. A claim about a tax rate from three years ago should decay differently than a claim about a scientific constant.
Corroboration (Sc). How many independent sources back this claim up? A single unverified assertion should never be treated the same as a claim that multiple unrelated sources agree on.
Regional Authority (Sa). Does this claim actually hold in the jurisdiction relevant to the user? This is the dimension that captures the Germany versus United States pricing example I mentioned earlier.
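To give a feel for what a single deterministic scorer can look like, here is a sketch of two of them. The formulas are assumptions chosen for illustration (exponential half-life decay for St, a saturating curve for Sc); the source does not specify how the real scorers compute their values.

```kotlin
import kotlin.math.pow

// Assumed model for Recency Decay (St): the score halves every `halfLifeDays`.
// A short half-life suits fast-aging facts like tax rates; a very long one
// suits near-permanent facts like scientific constants.
fun recencyScore(ageDays: Double, halfLifeDays: Double): Double {
    require(ageDays >= 0 && halfLifeDays > 0) { "Age must be >= 0 and half-life > 0" }
    return 0.5.pow(ageDays / halfLifeDays)
}

// Assumed model for Corroboration (Sc): grows with each independent source
// but never reaches 1.0, so extra sources have diminishing returns.
fun corroborationScore(independentSources: Int): Double {
    require(independentSources >= 0) { "Source count cannot be negative" }
    return independentSources.toDouble() / (independentSources + 1)
}

fun main() {
    // A three-year-old claim with a one-year half-life has decayed through three half-lives.
    println(recencyScore(ageDays = 1095.0, halfLifeDays = 365.0)) // 0.125
    // The same age barely matters when the half-life is a century.
    println(recencyScore(ageDays = 1095.0, halfLifeDays = 36_500.0))
    println(corroborationScore(independentSources = 3)) // 0.75
}
```

Each function depends only on its inputs, so the same claim always receives the same score and the result can be logged alongside the parameters that produced it.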

Once each of these scores is computed, they get combined using a geometric mean rather than a simple arithmetic average. This was a deliberate and, honestly, a somewhat contentious design decision when I was sketching it out on paper.

Here's why it matters. With an arithmetic mean, a claim can compensate for a terrible score on one dimension by having a great score on another. A claim that's wildly out of date but happens to come from a hundred corroborating sources could still average out to looking trustworthy. A geometric mean does not let you get away with that. If any single dimension collapses toward zero, the entire combined score collapses with it. In practice, that means a claim that's extremely recent and well corroborated but legally invalid in the user's region gets correctly punished instead of sneaking through because its other scores were strong. One bad dimension cannot be quietly averaged away.
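The difference is easy to see in a few lines of code. This sketch assumes each score is normalized to the range 0 to 1 and that all four dimensions are weighted equally; the `Scores` type is hypothetical.

```kotlin
import kotlin.math.pow

// Field names follow the Sr, St, Sc, Sa dimensions listed earlier.
data class Scores(val sr: Double, val st: Double, val sc: Double, val sa: Double) {
    private val all: List<Double> get() = listOf(sr, st, sc, sa)

    fun arithmeticMean(): Double = all.average()

    // n-th root of the product: one zero anywhere forces the result to zero.
    fun geometricMean(): Double = all.reduce { acc, s -> acc * s }.pow(1.0 / all.size)
}

fun main() {
    // Reliable, recent, well corroborated, but invalid in the user's region.
    val claim = Scores(sr = 0.9, st = 0.9, sc = 0.9, sa = 0.0)
    println("Arithmetic mean: ${claim.arithmeticMean()}") // about 0.675, still looks plausible
    println("Geometric mean:  ${claim.geometricMean()}")  // 0.0, the claim is rejected
}
```

The same holds for a score that is merely close to zero: the geometric mean is pulled down sharply, while the arithmetic mean only drops by a quarter of the difference.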

The Contradiction Guard: knowing when to say "I'm not sure"

Here's the part of the system I'm probably proudest of. When two competing claims land within a 0.10 confidence delta of each other after scoring, the system does not just pick whichever one is marginally higher and move on. Instead, it triggers something called the Contradiction Guard, which returns both candidates to the caller as genuinely ambiguous, along with their individual scores and provenance.

This might sound like a small implementation detail, but I think it's actually the philosophical center of the entire project. Most AI systems are optimized to always produce a confident sounding answer, because that's what feels useful in a demo. But a confident wrong answer is worse than an honest "these two things are in tension and here's why." In regulated domains especially, an honest admission of ambiguity, backed by transparent scoring, is a feature, not a failure. It gives a human reviewer exactly what they need to make the final call themselves, instead of quietly inheriting a hidden coin flip from the model.
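In code, the guard can be expressed as a return type that makes ambiguity impossible to ignore. This is a sketch: the type names are hypothetical, and treating a delta of exactly 0.10 as ambiguous is an assumption.

```kotlin
data class ScoredClaim(val text: String, val confidence: Double, val source: String)

// The caller must handle both cases; there is no way to get "just a winner"
// out of an ambiguous result by accident.
sealed interface Resolution {
    data class Resolved(val winner: ScoredClaim) : Resolution
    data class Ambiguous(val first: ScoredClaim, val second: ScoredClaim) : Resolution
}

const val AMBIGUITY_DELTA = 0.10

// Returns null when there are no candidates at all.
fun resolve(candidates: List<ScoredClaim>): Resolution? {
    val ranked = candidates.sortedByDescending { it.confidence }
    val top = ranked.getOrNull(0) ?: return null
    val runnerUp = ranked.getOrNull(1) ?: return Resolution.Resolved(top)
    return if (top.confidence - runnerUp.confidence <= AMBIGUITY_DELTA) {
        Resolution.Ambiguous(top, runnerUp) // too close to call: hand both back
    } else {
        Resolution.Resolved(top)
    }
}

fun main() {
    val close = listOf(
        ScoredClaim("Rule applies", 0.80, "regulator filing"),
        ScoredClaim("Rule was repealed", 0.75, "law firm bulletin")
    )
    when (val result = resolve(close)) {
        is Resolution.Ambiguous -> println("Needs human review: ${result.first.text} vs ${result.second.text}")
        is Resolution.Resolved -> println("Resolved: ${result.winner.text}")
        null -> println("No candidates")
    }
}
```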

How a claim actually gets resolved, end to end

It's worth walking through the full request lifecycle, because the architecture reflects the same philosophy as the scoring logic. Nothing gets to happen silently.

The Android client sends a query to an API Gateway, along with metadata about the requesting tenant and the relevant region.
The gateway routes the request to the correct regional partition. Data residency is not an afterthought bolted onto the routing layer later. It's baked into the design from the very first hop, since a claim resolved under the wrong jurisdiction's data isn't just inconvenient, it can be actively wrong.
A Resolution Engine retrieves the most relevant semantic candidates for the query and hands them off to the Scorer Ensemble.
Each scorer computes its dimension independently, and the results are fused together using the geometric mean described above.
If the confidence delta between the top two candidates is too small, the Contradiction Guard kicks in and the client receives an ambiguous response with both candidates and their full scoring breakdown attached. Otherwise, the client receives a single resolved claim, complete with provenance information and a reference into a signed, append only audit log.

That audit log deserves its own mention. It's a Merkle anchored, append only log, meaning every resolution event produces a signed record that cannot be quietly edited or deleted after the fact. If you ever need to demonstrate provenance to satisfy something like the EU AI Act, or simply to answer an internal question about why the system behaved a certain way six months ago, that log is designed to give you a real, tamper evident answer instead of a best guess reconstructed from memory.
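For readers unfamiliar with Merkle anchoring, here is a minimal sketch of computing a Merkle root over log records with SHA-256. The tree construction (duplicating the last node on odd levels) is one common convention and an assumption here; the source does not describe how Corroborate.ai builds or signs its tree, and signing is left out entirely.

```kotlin
import java.security.MessageDigest

fun sha256(bytes: ByteArray): ByteArray = MessageDigest.getInstance("SHA-256").digest(bytes)

fun ByteArray.toHex(): String = joinToString("") { "%02x".format(it) }

// Hashes each record, then repeatedly hashes pairs until one root remains.
// An unpaired node is combined with itself (assumed convention).
fun merkleRoot(records: List<String>): String {
    require(records.isNotEmpty()) { "Cannot anchor an empty log" }
    var level = records.map { sha256(it.toByteArray()) }
    while (level.size > 1) {
        level = level.chunked(2).map { pair ->
            sha256(pair[0] + pair.getOrElse(1) { pair[0] })
        }
    }
    return level.single().toHex()
}

fun main() {
    val log = listOf("resolve:claim-1", "resolve:claim-2", "erase:user-9")
    val anchored = merkleRoot(log)
    // Any later edit to an earlier record produces a different root.
    val tampered = merkleRoot(listOf("resolve:claim-1", "resolve:claim-X", "erase:user-9"))
    println("Root:     $anchored")
    println("Tampered: $tampered")
    println("Tamper evident: ${anchored != tampered}")
}
```

Publishing or signing the root is what makes the log tamper evident: anyone holding an earlier root can detect that a record was changed after the fact.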

The stack behind the client

On the Android side, the client itself is built with a fairly modern and, I think, pretty clean set of tools:

Kotlin as the primary language throughout the app.
Jetpack Compose paired with Material 3 for the entire UI layer, which made it much easier to represent the scoring breakdowns and ambiguous claim states visually instead of burying them in plain text.
Retrofit, OkHttp, and Kotlinx Serialization handling networking and data mapping between the client and the backend services.

On the backend integration side, the architecture assumes a Neo4j graph database for modeling relationships between claims, since so much of what makes a claim credible or not depends on its relationships to other claims and sources. Semantic search over candidate claims is handled through Qdrant as a vector store, and raw encrypted episodes, meaning the underlying source material a claim was derived from, live in S3 style object storage. All of it ultimately gets anchored by the Merkle audit log so that nothing in the pipeline is invisible after the fact.

Compliance was not an afterthought

Given that the intended use cases sit in insurance, legal, and banking, I made the decision early on to build regulatory behavior directly into the system rather than treating it as something to bolt on right before a launch. There are two pieces of that I'm especially glad I got right from the start.

GDPR Article 17 erasure cascades. When a user requests deletion, the system does not simply delete a row in a database and call it done. Linked episodes tied to that user are hard deleted. Claims that still have independent corroboration from other, unrelated sources get PII stripped rather than destroyed outright, since the underlying fact might still be legitimately known and referenced from elsewhere even after this particular user's data is gone. Claims that have no independent corroboration left after the user's data is removed get hard deleted entirely, since keeping them around would mean keeping information that only existed because of the deleted user in the first place.
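The three outcomes of the cascade can be sketched as a single function over an in-memory store. This is a simplification under stated assumptions: the types are hypothetical, PII is modeled as one nullable field rather than free text that needs redaction, and "independent corroboration" is approximated as "at least one source episode not owned by the erased user".

```kotlin
data class Episode(val id: String, val ownerUserId: String)

data class Claim(
    val id: String,
    val text: String,
    val piiUserId: String?,            // user whose personal data is attached, if any
    val sourceEpisodeIds: Set<String>  // episodes this claim was derived from
)

data class Store(val episodes: List<Episode>, val claims: List<Claim>)

fun eraseUser(store: Store, userId: String): Store {
    // Hard delete every episode linked to the user.
    val erased = store.episodes.filter { it.ownerUserId == userId }.map { it.id }.toSet()
    val keptEpisodes = store.episodes.filterNot { it.id in erased }

    val keptClaims = store.claims.mapNotNull { claim ->
        val remaining = claim.sourceEpisodeIds - erased
        when {
            // Claim is untouched by this erasure.
            remaining.size == claim.sourceEpisodeIds.size && claim.piiUserId != userId -> claim
            // No independent corroboration left: hard delete the claim.
            remaining.isEmpty() -> null
            // Still corroborated elsewhere: keep the fact, strip the PII.
            else -> claim.copy(
                piiUserId = if (claim.piiUserId == userId) null else claim.piiUserId,
                sourceEpisodeIds = remaining
            )
        }
    }
    return Store(keptEpisodes, keptClaims)
}

fun main() {
    val store = Store(
        episodes = listOf(Episode("e1", "alice"), Episode("e2", "bob")),
        claims = listOf(
            Claim("c1", "Policy X excludes flood damage", "alice", setOf("e1", "e2")),
            Claim("c2", "Alice filed a claim in March", "alice", setOf("e1"))
        )
    )
    println(eraseUser(store, "alice"))
}
```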

Role based access control. The system defines three roles. AGENT covers basic resolve and ingest actions, the kind of everyday operations most users of the system will perform. VERIFIER is meant for human in the loop review, letting a qualified person step in and confirm or override an ambiguous resolution. ADMIN is reserved for configuration changes and erasure operations, keeping the most sensitive capabilities gated behind the appropriate permission level rather than leaving them open to anyone with basic access.
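A compact way to express those three roles is an enum that carries its own permissions. The permission names are assumptions derived from the description above, and this sketch assumes roles do not inherit from each other, which the source does not specify.

```kotlin
enum class Permission { RESOLVE, INGEST, REVIEW, CONFIGURE, ERASE }

enum class Role(val permissions: Set<Permission>) {
    AGENT(setOf(Permission.RESOLVE, Permission.INGEST)),
    VERIFIER(setOf(Permission.REVIEW)),
    ADMIN(setOf(Permission.CONFIGURE, Permission.ERASE));

    fun can(permission: Permission): Boolean = permission in permissions
}

// Guard to call at the top of any sensitive operation.
fun requirePermission(role: Role, permission: Permission) {
    if (!role.can(permission)) {
        throw SecurityException("$role may not perform $permission")
    }
}

fun main() {
    requirePermission(Role.ADMIN, Permission.ERASE) // passes silently
    try {
        requirePermission(Role.AGENT, Permission.ERASE)
    } catch (e: SecurityException) {
        println(e.message) // AGENT may not perform ERASE
    }
}
```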

Some of the harder design decisions along the way

Building this wasn't a straight line, and a few decisions took longer to settle on than I expected going in.

Choosing the geometric mean over a simpler weighted average was one of them. It's mathematically stricter, and stricter math means more claims end up flagged as ambiguous rather than confidently resolved. Early on, that felt like it might make the system less useful, since users generally want answers, not more questions. But the more I thought about the target use cases, the clearer it became that a system used in insurance or legal contexts should be biased toward honest uncertainty rather than false confidence. A slightly less "decisive" system that's honest about its limits is more trustworthy, and ultimately more useful, than one that always has an answer ready.

Deciding how aggressive to make the erasure cascade was another one. It would have been simpler to just hard delete everything tied to a user on request and call it compliant. But that approach ignores the reality that facts can be independently corroborated by other sources that have nothing to do with the user requesting deletion. Building the PII stripping path instead of a blanket deletion took more engineering effort, but it respects both the user's right to be forgotten and the integrity of facts that other, unrelated sources still legitimately support.

What I'd tell someone building something similar

If you're building any kind of AI memory or retrieval system that has to survive real contact with a compliance team, or honestly even one that just needs to earn genuine user trust, my biggest takeaway is this: resist the temptation to let a single LLM call be your source of truth. It's fast, it's easy to prototype, and it demos beautifully. But it gives you nothing to audit and nothing to explain when someone eventually asks why the system believed something. Building a deterministic, inspectable scoring layer took a lot more upfront design work than just wiring up a prompt, but it's the difference between a system you can defend with evidence and one you can only apologize for after the fact.

What's next

The project is still evolving, and there's plenty left on the roadmap. I want to expand the Neo4j and Qdrant integration further, refine the heuristics behind the Contradiction Guard so its ambiguity threshold can adapt a bit more intelligently to context, and generally keep hardening the compliance tooling as I learn more about what regulated industries actually need in practice. If any of this resonates with a problem you're facing, or you just want to dig into the internals, the full source and architecture details are on GitHub.

Check out the repo here: github.com/alfinohatta/Corroborate.ai

Corroborate.ai is licensed under AGPL-3.0. Contributions and feedback are welcome, especially on the Confidence Auction mechanics.