DEV Community

Cover image for AI-Assisted Data Reconciliation at Scale: Patterns for Distributed Systems
Alok Ranjan Daftuar
Alok Ranjan Daftuar

Posted on • Originally published at aloknecessary.github.io

AI-Assisted Data Reconciliation at Scale: Patterns for Distributed Systems

In any sufficiently large distributed system, data reconciliation is the dark matter of engineering — invisible, pervasive, and holding everything together through mechanisms nobody fully understands.

Rule-based reconciliation works until it doesn't. Rule engines break on ambiguity, cannot handle semantic equivalence across schema versions, and generate false positives at scale that overwhelm operations teams. AI — specifically embedding-based similarity and LLM classification — fills the gap. Not as a replacement, but as a layer that handles what rules cannot.


Where Traditional Reconciliation Breaks Down

Eventual consistency windows — a naive reconciliation job that diffs at a point in time generates thousands of false positives that are self-healing within seconds. The rule engine cannot distinguish transient inconsistency from legitimate divergence.

Cross-service schema drift — Service A stores an address as { street, city, state, zip }. Service B stores it as { addressLine1, municipality, postalCode }. Semantically equivalent. A field-level comparator flags every record as mismatched.

Semantic equivalence in free-text"Acme Corporation" vs "ACME Corp." vs "Acme Corp (formerly Roadrunner Supplies)". Rule-based systems cannot reason about semantic identity at scale.

Volume-driven false positive fatigue — at millions of records per day, even 0.1% false positives generate thousands of alerts. Real issues get buried. The reconciliation system becomes theater.


The Architecture: Rules First, AI at the Boundary

The pattern is not AI-first. It is rules-first, AI at the boundary:

  1. Deterministic mismatch detection — checksums, field comparisons, primary key matching
  2. High-confidence matches/mismatches — auto-resolve or route to correction (no AI needed)
  3. Ambiguous cases → AI classification layer:
    • Embedding similarity — detect semantic equivalence across schema variations
    • LLM classification — reason about why a mismatch exists

Embedding-Based Similarity

Serialize records into schema-agnostic text representations before embedding. Compute cosine similarity. Calibrate thresholds against labeled data:

  • ≥ 0.95 → auto-resolve as equivalent
  • 0.80 – 0.95 → route to LLM classification
  • < 0.80 → high-confidence mismatch, route to correction

The thresholds are not universal — calibrate against 500–1000 manually classified record pairs from your actual data.


LLM Classification for the Ambiguous Band

For the 5–15% of mismatches that fall in the ambiguous range, an LLM reasons about context that vector distance cannot capture. Classifications: equivalent, stale_copy, legitimate_divergence, data_corruption.

Cost management: route only the ambiguous band to the LLM. Batch where latency allows. Cache results for record pairs re-evaluated in subsequent cycles.


Where AI Should Never Be Trusted

  • Financial and compliance records — dollar amount disagreements are correctness errors, not semantic questions
  • Primary key and identity resolution — AI suggestions acceptable; auto-resolution without human sign-off is not
  • Any decision that must be explainable to a regulator — "87% confidence" is not an audit-compliant explanation

The Key Insight

AI in reconciliation is a judgment layer, not a trust layer. It handles ambiguous cases that rules cannot, reduces volume reaching human review, and provides structured reasoning. The deterministic foundation must remain intact.

A reconciliation system you cannot audit is worse than one that generates false positives. Build the observability before you build the AI.


Read the Full Article

This is a summary of my deep dive into AI-assisted data reconciliation. The full article covers the complete architecture with implementation examples:

👉 AI-Assisted Data Reconciliation at Scale — Full Article

The full article includes:

  • Where traditional reconciliation breaks down (4 failure modes)
  • Full architecture diagram with rules-first, AI-at-boundary pattern
  • Embedding-based similarity implementation (Python, OpenAI embeddings)
  • LLM classification prompt pattern with structured JSON output
  • Observation window pattern for filtering eventual consistency false positives
  • Hard boundaries where AI should never auto-resolve
  • Observability patterns with structured logging
  • Production deployment checklist

Top comments (0)