David Au Yeung

Lightning-Fast Address Matching …Now with a Smart Hybrid (Deterministic + LLM) Engine: A Fun Experiment

Introduction

Data cleaning is never sexy. Every CRM export, vendor feed, or merger spreadsheet comes with "ABC COMPANY LTD." vs "ABC Co. Limited", or "Flat 12B 8/F 25 Main St., HK" vs "25 Main Street, 8th Floor, Central, Hong Kong".
Yet accounting still needs a unique customer list, and marketing still wants a de-duplicated address book.

"Can we dedupe 1 million addresses with millisecond latency, yet still be right when 'Flat 12B, 8/F' ≈ '8th Floor Flat 12B'?"

Yes. Do it deterministically first; ask an LLM only when you are unsure.

If you’re new to LLM chatbots, check out my earlier post:
Step-by-Step Guide: Write Your First AI Storyteller with Ollama (llama3.2) and Semantic Kernel in C#
(Have your Chatbot helper ready before continuing.)

1. Why a Hybrid?

| Requirement | Pure RegEx / Jaccard | Pure LLM | Hybrid (RegEx + Jaccard ⇒ LLM) |
| --- | --- | --- | --- |
| CPU / RAM | ✔ tiny | ✘ tens of GB | ✔ tiny → medium (only on the 3 % "hard" cases) |
| Latency per pair | < 2 ms | ≥ 400 ms (token stream) | < 5 ms P95 |
| Deterministic | ✔ | ✘ (sampling) | Could be ✔ for 90 % up |
| Handles gnarly cases ("Tower III Central Plaza ↔ Tower 3 Central Plaza") | | | |

if  score ≥ 0.85            ⇒  accept
else if score ≤ 0.30        ⇒  reject
else                        ⇒  ask LLM ("YES" / "NO")

Deterministic logic solves almost everything; the LLM is the tie-breaker.
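
As a rough illustration of the three-way rule above, here is a minimal sketch. The function name `Route` and its string labels are made up for this example; only the 0.85 and 0.30 cut-offs come from the rule itself. The full SmartAddressValidator class further down also looks at the IsMatch flag before it accepts or rejects, so treat this as a simplification.

```csharp
using System;

// Hypothetical helper: maps a deterministic score to one of three routes.
// The 0.85 / 0.30 thresholds come from the rule above; the labels are illustrative.
static string Route(double score, double accept = 0.85, double reject = 0.30)
{
    if (score >= accept) return "accept";   // confident match, no LLM needed
    if (score <= reject) return "reject";   // confident mismatch, no LLM needed
    return "ask-llm";                       // grey zone: let the model break the tie
}

foreach (var score in new[] { 0.92, 0.10, 0.55 })
    Console.WriteLine($"{score:F2} -> {Route(score)}");
```

Keeping the thresholds as parameters makes it easy to widen or narrow the grey zone, and with it the share of pairs sent to the LLM.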

2. Deterministic Layer

Regular-Expressions + Jaccard

Normalise

  • Lower-case, strip punctuation & accents
  • Expand abbreviations (“st.” → “street”, “15F” → “15 floor”)
  • Canonicalise ordinals (“1st” → “first”, “15th” → “15”)

Extract critical tokens with RegEx

  • buildingNumber, unitNumber, floor

Early exits

  • Any critical token mismatches? ⇒ return DIFF (0.1)

Jaccard on the remaining token sets

  • |A ∩ B| / |A ∪ B|
  • Add a small bonus (+0.2 / +0.1) when critical tokens match

Return (isMatch, confidence)

  • The whole pass is O(n + m) in the lengths of the two addresses, makes no network or model call, and costs nothing beyond a few short-lived string and set allocations.
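
To make the scoring step concrete, here is a small worked example. It is a sketch only: the token sets are invented, the helper name `Jaccard` is hypothetical, and returning 1.0 for two empty sets is a choice made for this example (the full class below handles that case separately).

```csharp
using System;
using System.Collections.Generic;

// Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
// Returning 1.0 for two empty sets is an assumption of this sketch.
static double Jaccard(ISet<string> a, ISet<string> b)
{
    var intersection = new HashSet<string>(a);
    intersection.IntersectWith(b);
    var union = new HashSet<string>(a);
    union.UnionWith(b);
    return union.Count == 0 ? 1.0 : (double)intersection.Count / union.Count;
}

// Invented tokens, as they might look once building number and floor are pulled out.
var left  = new HashSet<string> { "main", "street", "central", "hong", "kong" };
var right = new HashSet<string> { "main", "street", "hong", "kong" };

double jaccard = Jaccard(left, right);          // 4 shared tokens / 5 distinct tokens = 0.8
double bonus = 0.2 + 0.1;                        // building number (+0.2) and floor (+0.1) both match
double score = Math.Min(1.0, jaccard + bonus);   // 1.1 is clamped to 1.0

Console.WriteLine($"jaccard={jaccard}, score={score}");
```

The missing "central" token costs 0.2 on its own, but the matching building number and floor push the pair back over the accept threshold.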

3. Probabilistic Layer

Tiny Local LLM (Ollama + qwen3:4b)

We tried:

  • llama3.2-3B – struggled with address-specific rules.
  • qwen3:4B – still small, instruction-tuned, and much crisper at answering “YES/NO”.

Prompts:

System message:

You are a postal-address expert …
Answer ONLY with YES or NO.

User prompt:

You are an expert in address comparison. Analyze these two addresses carefully:

Address A: {sourceAddress}
Address B: {targetAddress}

Consider:
- They may use different formatting or abbreviations
- Unit/Flat/Apartment numbers must match exactly
- Building numbers must match exactly
- Street names should be the same (accounting for abbreviations)
- Floor numbers must match if specified

Do these addresses refer to the SAME physical location?
Answer with exactly ONE WORD: "YES" or "NO"

Running the model locally in an Ollama container keeps data on-prem, with latency of roughly 300 ms per call and no per-call API cost.

4. SmartAddressValidator

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace MyPlaygroundApp.Utils
{
    public class SmartAddressValidator
    {
        private readonly IChatbot? _intelligenceProvider;

        // Store validation results for retrieval
        private readonly List<ValidationResult> _results = new List<ValidationResult>();

        public SmartAddressValidator(IChatbot? intelligenceProvider = null) => _intelligenceProvider = intelligenceProvider;

        // Get all validation results
        public IReadOnlyList<ValidationResult> GetAllResults() => _results.AsReadOnly();

        // Clear results history
        public void ClearResults() => _results.Clear();

        // Main validation method
        public async Task<bool> AreSameLocationAsync(string sourceAddress, string targetAddress)
        {
            // 1. deterministic validation
            var deterministicResult = ValidateWithConfidence(sourceAddress, targetAddress);

            Console.WriteLine($"Deterministic validation: {deterministicResult.IsMatch} (confidence: {deterministicResult.Confidence:F2})");

            bool finalMatch;
            bool usedAI = false;

            // If high confidence match, return immediately
            if (deterministicResult.IsMatch && deterministicResult.Confidence >= 0.85)
            {
                finalMatch = true;
            }
            // If very low confidence and clearly different, return false
            else if (!deterministicResult.IsMatch && deterministicResult.Confidence <= 0.3)
            {
                finalMatch = false;
            }
            // For medium confidence (0.3-0.85), use deterministic if no AI available
            else if (_intelligenceProvider is null)
            {
                finalMatch = deterministicResult.Confidence > 0.5;
            }
            // Use AI for uncertain cases
            else
            {
                usedAI = true;
                string prompt =
                    $"""
                    You are an expert in address comparison. Analyze these two addresses carefully:

                    Address A: {sourceAddress}
                    Address B: {targetAddress}

                    Consider:
                    - They may use different formatting or abbreviations
                    - Unit/Flat/Apartment numbers must match exactly
                    - Building numbers must match exactly
                    - Street names should be the same (accounting for abbreviations)
                    - Floor numbers must match if specified

                    Do these addresses refer to the SAME physical location?
                    Answer with exactly ONE WORD: "YES" or "NO"
                    """;

                string answer = await _intelligenceProvider.AskQuestion(prompt);
                finalMatch = answer.Contains("YES", StringComparison.OrdinalIgnoreCase);
            }

            // Create validation result
            var result = new ValidationResult(sourceAddress, targetAddress)
            {
                IsMatch = finalMatch,
                Confidence = deterministicResult.Confidence,
                SourceNormalized = Normalize(sourceAddress),
                TargetNormalized = Normalize(targetAddress),
                UsedAI = usedAI
            };

            // Add to results list
            _results.Add(result);

            return result.IsMatch;
        }

        // Alternative method name for convenience
        public async Task<bool> ValidateMatchAsync(string sourceAddress, string targetAddress)
            => await AreSameLocationAsync(sourceAddress, targetAddress);

        // Get detailed validation result - retrieves the most recent validation for these addresses
        public async Task<ValidationResult?> GetValidationDetailsAsync(string sourceAddress, string targetAddress)
        {
            // Check if we have a cached result for these addresses
            var cachedResult = _results
                .Where(r => r.SourceAddress == sourceAddress && r.TargetAddress == targetAddress)
                .LastOrDefault();

            if (cachedResult != null)
            {
                return cachedResult;
            }

            // If not cached, perform validation and return the result
            await AreSameLocationAsync(sourceAddress, targetAddress);

            // Return the result that was just added
            return _results.Last();
        }

        // Get validation history for specific addresses
        public IEnumerable<ValidationResult> GetValidationHistory(string sourceAddress, string targetAddress)
        {
            return _results.Where(r =>
                (r.SourceAddress == sourceAddress && r.TargetAddress == targetAddress) ||
                (r.SourceAddress == targetAddress && r.TargetAddress == sourceAddress));
        }

        private static (bool IsMatch, double Confidence) ValidateWithConfidence(string a, string b)
        {
            string nA = Normalize(a);
            string nB = Normalize(b);

            Console.WriteLine($"Normalized Source: {nA}");
            Console.WriteLine($"Normalized Target: {nB}");

            // Exact match after normalization
            if (nA == nB) return (true, 1.0);

            // Extract and compare components
            var compA = ExtractComponents(nA);
            var compB = ExtractComponents(nB);

            Console.WriteLine($"Components Source: Building={compA.BuildingNumber}, Unit={compA.UnitNumber}, Floor={compA.Floor}");
            Console.WriteLine($"Components Target: Building={compB.BuildingNumber}, Unit={compB.UnitNumber}, Floor={compB.Floor}");

            // Special handling for building number ranges
            bool buildingNumbersMatch = AreBuildingNumbersEquivalent(compA.BuildingNumber, compB.BuildingNumber);

            // Critical components must match
            if (!string.IsNullOrEmpty(compA.BuildingNumber) && !string.IsNullOrEmpty(compB.BuildingNumber))
            {
                if (!buildingNumbersMatch)
                    return (false, 0.1);
            }

            if (!string.IsNullOrEmpty(compA.UnitNumber) && !string.IsNullOrEmpty(compB.UnitNumber))
            {
                if (compA.UnitNumber != compB.UnitNumber)
                    return (false, 0.1);
            }

            if (!string.IsNullOrEmpty(compA.Floor) && !string.IsNullOrEmpty(compB.Floor))
            {
                if (compA.Floor != compB.Floor)
                    return (false, 0.1);
            }

            // Token-based comparison for the rest
            var setA = TokenSet(compA.RemainingText);
            var setB = TokenSet(compB.RemainingText);

            var intersection = new HashSet<string>(setA);
            intersection.IntersectWith(setB);

            var union = new HashSet<string>(setA);
            union.UnionWith(setB);

            if (union.Count == 0) return (true, 0.9); // Both empty

            double jaccard = (double)intersection.Count / union.Count;

            // Add bonus for matching critical components
            double bonus = 0;
            if (!string.IsNullOrEmpty(compA.BuildingNumber) && buildingNumbersMatch)
                bonus += 0.2;
            if (!string.IsNullOrEmpty(compA.UnitNumber) && compA.UnitNumber == compB.UnitNumber)
                bonus += 0.2;
            if (!string.IsNullOrEmpty(compA.Floor) && compA.Floor == compB.Floor)
                bonus += 0.1;

            double finalScore = Math.Min(1.0, jaccard + bonus);

            // More lenient thresholds
            if (finalScore >= 0.6) return (true, finalScore);
            if (finalScore >= 0.4) return (true, 0.5 + finalScore * 0.3); // Uncertain but lean towards match

            return (false, finalScore);
        }

        // Check if building numbers are equivalent (handles ranges)
        private static bool AreBuildingNumbersEquivalent(string buildingA, string buildingB)
        {
            if (buildingA == buildingB) return true;

            // Check if one is a range and the other might be the same range differently formatted
            var rangePattern = @"^(\d+)-(\d+)$";
            var matchA = Regex.Match(buildingA, rangePattern);
            var matchB = Regex.Match(buildingB, rangePattern);

            if (matchA.Success && matchB.Success)
            {
                return matchA.Groups[1].Value == matchB.Groups[1].Value &&
                       matchA.Groups[2].Value == matchB.Groups[2].Value;
            }

            return false;
        }

        // Enhanced abbreviation dictionary
        private static readonly Dictionary<string, string> _abbreviations = new(StringComparer.OrdinalIgnoreCase)
        {
            // Streets
            {"st", "street"}, {"rd", "road"}, {"ave", "avenue"}, {"blvd", "boulevard"},
            {"dr", "drive"}, {"ln", "lane"}, {"ct", "court"}, {"pl", "place"},
            {"cir", "circle"}, {"ter", "terrace"}, {"way", "way"}, {"pkwy", "parkway"},

            // Directions
            {"n", "north"}, {"s", "south"}, {"e", "east"}, {"w", "west"},
            {"ne", "northeast"}, {"nw", "northwest"}, {"se", "southeast"}, {"sw", "southwest"},

            // Units
            {"apt", "apartment"}, {"unit", "unit"}, {"ste", "suite"}, {"rm", "room"},
            {"fl", "floor"}, {"f", "floor"}, {"bldg", "building"}, {"flat", "flat"},

            // Regions
            {"hk", "hong kong"}, {"h.k.", "hong kong"}, {"ny", "new york"}, {"nyc", "new york"},

            // Numbers
            {"1st", "first"}, {"2nd", "second"}, {"3rd", "third"}, {"4th", "fourth"},
            {"5th", "fifth"}, {"6th", "sixth"}, {"7th", "seventh"}, {"8th", "eighth"},
            {"9th", "ninth"}, {"10th", "tenth"}, {"11th", "11"}, {"12th", "12"},
            {"13th", "13"}, {"14th", "14"}, {"15th", "15"}, {"16th", "16"},
            {"17th", "17"}, {"18th", "18"}, {"19th", "19"}, {"20th", "20"}
        };

        // Regex patterns
        private static readonly Regex _ordinals = new(@"(\d+)(st|nd|rd|th)\b", RegexOptions.IgnoreCase);
        private static readonly Regex _floor = new(@"(\d+)\s*(?:/f|f\b|/|th\s*)?floor", RegexOptions.IgnoreCase);
        private static readonly Regex _floorPrefix = new(@"^(\d+)f\b", RegexOptions.IgnoreCase);
        private static readonly Regex _range = new(@"(\d+)\s*(?:-|to|~)\s*(\d+)", RegexOptions.IgnoreCase);
        private static readonly Regex _unitPattern = new(@"(?:apt|apartment|unit|flat|suite|ste|#)\s*(\w+)", RegexOptions.IgnoreCase);
        private static readonly Regex _buildingNumber = new(@"^(\d+(?:-\d+)?)\s+", RegexOptions.Compiled);
        private static readonly Regex _punct = new(@"[.,/#!$%^&*;:{}=_`~()]", RegexOptions.Compiled);

        private static string Normalize(string src)
        {
            if (string.IsNullOrWhiteSpace(src)) return "";

            // 1. lower-case & trim
            var s = src.ToLowerInvariant().Trim();

            // 2. Normalize unit/apartment references BEFORE removing punctuation
            s = Regex.Replace(s, @"#\s*(\w+)", "unit $1", RegexOptions.IgnoreCase);

            // 3. Handle ranges BEFORE removing dashes
            // Convert "100 to 200" → "100-200", "100 - 200" → "100-200", etc.
            s = Regex.Replace(s, @"(\d+)\s*(?:to|-|~)\s*(\d+)", "$1-$2", RegexOptions.IgnoreCase);

            // 4. Unify common punctuation patterns
            s = s.Replace("'s ", " ");  // Queen's → Queen
            s = s.Replace("'", "");     // Remove remaining apostrophes
            s = s.Replace("--", " ");   // Double dash to space

            // 5. Replace dashes with spaces unless a digit sits on either side (keeps ranges like 100-200)
            s = Regex.Replace(s, @"(?<!\d)-(?!\d)", " ");

            // 6. remove other punctuation
            s = _punct.Replace(s, " ");

            // 7. Handle floor notations
            // Convert "15F" -> "15 floor" when it's at the beginning or preceded by space
            s = _floorPrefix.Replace(s, "$1 floor");
            s = Regex.Replace(s, @"\s(\d+)f\b", " $1 floor", RegexOptions.IgnoreCase);
            s = Regex.Replace(s, @"(\d+)\s*/\s*f\b", "$1 floor", RegexOptions.IgnoreCase);
            s = _floor.Replace(s, "$1 floor");

            // 8. unify ordinals: "1st" → "first" for single digits, "15th" → "15" for double digits
            s = _ordinals.Replace(s, m =>
            {
                var num = m.Groups[1].Value;
                if (_abbreviations.TryGetValue(m.Value, out var full))
                    return full;
                return num;
            });

            // 9. replace abbreviations
            var parts = s.Split(' ', StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < parts.Length; i++)
            {
                var p = parts[i].Trim();
                if (_abbreviations.TryGetValue(p, out var full)) parts[i] = full;
            }
            s = string.Join(' ', parts);

            // 10. collapse multiple spaces
            s = Regex.Replace(s, @"\s{2,}", " ");

            return s.Trim();
        }

        private class AddressComponents
        {
            public string BuildingNumber { get; set; } = "";
            public string UnitNumber { get; set; } = "";
            public string Floor { get; set; } = "";
            public string RemainingText { get; set; } = "";
        }

        private static AddressComponents ExtractComponents(string normalized)
        {
            var comp = new AddressComponents();
            var working = normalized;

            // Extract building number (updated to handle ranges properly)
            var buildingMatch = Regex.Match(working, @"^(\d+(?:-\d+)?)\s+");
            if (buildingMatch.Success)
            {
                comp.BuildingNumber = buildingMatch.Groups[1].Value;
                working = working.Substring(buildingMatch.Length).Trim();
            }

            // Extract unit/apartment/flat number (search anywhere in string)
            var unitMatches = Regex.Matches(working, @"(?:apt|apartment|unit|flat|suite|ste)\s+(\w+)", RegexOptions.IgnoreCase);
            if (unitMatches.Count > 0)
            {
                comp.UnitNumber = unitMatches[0].Groups[1].Value.ToLower();
                working = Regex.Replace(working, @"(?:apt|apartment|unit|flat|suite|ste)\s+\w+", " ", RegexOptions.IgnoreCase).Trim();
            }

            // Extract floor (search anywhere in string)
            var floorMatch = Regex.Match(working, @"(\d+)\s+floor", RegexOptions.IgnoreCase);
            if (floorMatch.Success)
            {
                comp.Floor = floorMatch.Groups[1].Value;
                working = Regex.Replace(working, @"\d+\s+floor", " ", RegexOptions.IgnoreCase).Trim();
            }

            // Check if building number is at the end (like "Main Street 25")
            if (string.IsNullOrEmpty(comp.BuildingNumber))
            {
                var endBuildingMatch = Regex.Match(working, @"\s+(\d+(?:-\d+)?)$");
                if (endBuildingMatch.Success)
                {
                    comp.BuildingNumber = endBuildingMatch.Groups[1].Value;
                    working = working.Substring(0, endBuildingMatch.Index).Trim();
                }
            }

            comp.RemainingText = working;
            return comp;
        }

        private static HashSet<string> TokenSet(string s) =>
            s.Split(' ', StringSplitOptions.RemoveEmptyEntries)
             .Where(token => !string.IsNullOrWhiteSpace(token))
             .ToHashSet(StringComparer.OrdinalIgnoreCase);
    }

    // Result class for detailed validation information
    public class ValidationResult
    {
        public ValidationResult(string sourceAddress, string targetAddress)
        {
            this.SourceAddress = sourceAddress;
            this.TargetAddress = targetAddress;
            this.Timestamp = DateTime.UtcNow;
        }

        public string SourceAddress { get; set; }
        public string TargetAddress { get; set; }
        public bool IsMatch { get; set; }
        public double Confidence { get; set; }
        public string SourceNormalized { get; set; } = "";
        public string TargetNormalized { get; set; } = "";
        public bool UsedAI { get; set; }
        public DateTime Timestamp { get; set; }
    }
}

Code Walk-through

flowchart TD
    A[raw addresses] -->|normalise| B(RegEx extraction)
    B --> C{Critical mismatch?}
    C -->|yes, conf=0.1| F[DIFF]
    C -->|no| D[Jaccard + bonus]
    D -->|conf≥0.85| E[MATCH]
    D -->|conf≤0.30| F
    D -->|else| G[LLM ask]
    G --> H{LLM response YES/NO}
    H -- YES --> E
    H -- NO --> F

Key snippets:

Normalisation pipeline

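// Condensed from Normalize(); the full method also handles ranges, floors and ordinals
var s = src.ToLowerInvariant()
           .Replace("'", "")             // O'Connor → oconnor
           .Replace("--", " ");
s = Regex.Replace(s, @"(?<!\d)-(?!\d)", " ");   // keep dashes inside ranges such as 100-200
s = _punct.Replace(s, " ");
s = string.Join(' ', s.Split(' ', StringSplitOptions.RemoveEmptyEntries)
                      .Select(t => _abbreviations.TryGetValue(t, out var full) ? full : t));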

Jaccard score

var inter = setA.Intersect(setB).Count();
var union = setA.Union(setB).Count();
double score = (double)inter / union + bonus;   // clamp to 1.0 later

LLM handshake

string answer = await _intelligenceProvider.AskQuestion(prompt);
bool final = answer.Contains("YES", StringComparison.OrdinalIgnoreCase);
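
One caveat with a plain Contains check: any reply that mentions "yes" anywhere, for instance inside a model's visible reasoning, counts as a match. The sketch below shows one possible stricter parse. `ParseYesNo` is a hypothetical helper, not part of the class above, and it assumes that any reasoning text arrives wrapped in `<think>…</think>` tags, which may not hold for your model or Ollama version. Taking the last verdict word is also just one heuristic.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: reads the verdict from the last whole-word YES/NO in the reply.
// Returns null when the reply contains neither, so the caller can fall back or retry.
static bool? ParseYesNo(string answer)
{
    // Assumption: reasoning, if present, is wrapped in <think>...</think>.
    string visible = Regex.Replace(answer, @"<think>.*?</think>", " ",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Whole-word match, so "cannot" or "yesterday" do not count.
    var verdicts = Regex.Matches(visible, @"\b(yes|no)\b", RegexOptions.IgnoreCase);
    if (verdicts.Count == 0) return null;

    string last = verdicts[verdicts.Count - 1].Value;
    return last.Equals("yes", StringComparison.OrdinalIgnoreCase);
}

Console.WriteLine(ParseYesNo("<think>Yes, the streets look alike, but the units differ.</think>\nNO"));
Console.WriteLine(ParseYesNo("YES"));
Console.WriteLine(ParseYesNo("I cannot tell.") is null);
```

In the first call the reasoning contains "Yes" but the verdict is NO, so the helper returns false where a plain Contains check would have returned true. A null result could fall back to the deterministic score.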

5. Full Demo (▶ copy-paste)

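// Needs: using System; using System.Text; using System.Threading.Tasks; using MyPlaygroundApp.Utils;
public static async Task Main(string[] args)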
{
  Console.OutputEncoding = Encoding.UTF8;

  // 1. Wire an LLM (optional – pass null to stay deterministic)
  IChatbot? chat = new Chatbot(
      endpoint: "http://localhost:11434",
      systemMessage: "You are a postal-address expert. When comparing addresses, you must be precise about unit numbers, floor numbers, and building numbers. Answer ONLY with YES or NO.",
      modelId: "qwen3:4b"
  );

  // 2. Create the validator
  var validator = new SmartAddressValidator(chat);

  // 3. Comprehensive test cases - 50% matches, 50% non-matches
  var pairs = new (string A, string B, bool Expected, string Description)[]
  {
      // Run the 200-pair benchmark
      // Basic formatting differences
      ("123 Main Street", "123 Main St", true, "Basic abbreviation"),
      ("456 First Avenue", "456 1st Ave", true, "Ordinal + abbreviation"),
      ("789 North Park Road", "789 N Park Rd", true, "Direction abbreviation"),
      // Add your testing cases here...  
  };

  Console.WriteLine("Smart Address Validator Extended Test Results");
  Console.WriteLine("============================================\n");

  int correct = 0;
  int total = 0;
  int truePositives = 0;
  int trueNegatives = 0;
  int falsePositives = 0;
  int falseNegatives = 0;

  foreach (var (a, b, expected, description) in pairs)
  {
      total++;
      bool result = await validator.AreSameLocationAsync(a, b);
      bool isCorrect = result == expected;
      if (isCorrect) correct++;

      // Track confusion matrix
      if (expected && result) truePositives++;
      else if (!expected && !result) trueNegatives++;
      else if (!expected && result) falsePositives++;
      else if (expected && !result) falseNegatives++;

      string status = isCorrect ? "✓" : "✗✗✗";
      string expectedStr = expected ? "MATCH" : "DIFF";
      string resultStr = result ? "MATCH" : "DIFF";

      Console.WriteLine($"{status} Expected: {expectedStr}, Got: {resultStr} - {description}");
      Console.WriteLine($"  A: {a}");
      Console.WriteLine($"  B: {b}");

      // Show details for failures
      if (!isCorrect)
      {
          var details = await validator.GetValidationDetailsAsync(a, b);
          Console.WriteLine($"  Details: Confidence={details.Confidence:F2}, UsedAI={details.UsedAI}");
      }
      Console.WriteLine();
  }

  Console.WriteLine($"\nAccuracy: {correct}/{total} ({(double)correct / total * 100:F1}%)");

  // Confusion matrix results
  Console.WriteLine("\nResults by category:");
  Console.WriteLine($"True Positives (correctly matched): {truePositives}");
  Console.WriteLine($"True Negatives (correctly different): {trueNegatives}");
  Console.WriteLine($"False Positives (incorrectly matched): {falsePositives}");
  Console.WriteLine($"False Negatives (incorrectly different): {falseNegatives}");

  // Calculate precision and recall
  double precision = truePositives > 0 ? (double)truePositives / (truePositives + falsePositives) : 0;
  double recall = truePositives > 0 ? (double)truePositives / (truePositives + falseNegatives) : 0;
  double f1Score = precision + recall > 0 ? 2 * (precision * recall) / (precision + recall) : 0;

  Console.WriteLine($"\nMetrics:");
  Console.WriteLine($"Precision: {precision:F3}");
  Console.WriteLine($"Recall: {recall:F3}");
  Console.WriteLine($"F1 Score: {f1Score:F3}");
}

6. Results

| Metric | Value |
| --- | --- |
| Accuracy | 37 / 50 = 74 % |
| Precision | 0.667 |
| Recall | 0.870 |
| F1-score | 0.755 |

Confusion-matrix breakdown

| | Predicted MATCH | Predicted DIFF |
| --- | --- | --- |
| Actual MATCH | 20 (True Positive) | 3 (False Negative) |
| Actual DIFF | 10 (False Positive) | 17 (True Negative) |

The hybrid engine still captures the majority of true matches (high recall), but precision has room for improvement—most of the errors are false positives (10/13 mistakes). Tweaking the deterministic thresholds or tightening the LLM's "YES" bias should lift both precision and overall accuracy.

Challenge for You!

How would you improve these metrics?
Some strategies to consider:

  • Experiment with a different LLM (Large Language Model).
  • Adjust the thresholds to optimize precision, recall, or overall performance.
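
One way to approach the threshold idea is to sweep the accept cut-off over a labelled set and watch precision and recall move. The sketch below does this on toy data: the scores and labels are invented purely for illustration and are not the benchmark results above, and `Evaluate` is a hypothetical helper. It also ignores the LLM stage and the reject threshold, so it covers only part of the tuning problem.

```csharp
using System;
using System.Linq;

// Toy data, invented for this sketch: (deterministic score, ground-truth label).
var scored = new (double Score, bool IsSame)[]
{
    (0.95, true), (0.90, true), (0.88, false), (0.80, true),
    (0.70, false), (0.65, true), (0.50, false), (0.20, false),
};

// Precision and recall if every pair scoring at or above the cut-off is accepted.
static (double Precision, double Recall) Evaluate((double Score, bool IsSame)[] data, double cutoff)
{
    int tp = data.Count(d => d.Score >= cutoff && d.IsSame);
    int fp = data.Count(d => d.Score >= cutoff && !d.IsSame);
    int fn = data.Count(d => d.Score < cutoff && d.IsSame);
    double p = tp + fp > 0 ? (double)tp / (tp + fp) : 0;
    double r = tp + fn > 0 ? (double)tp / (tp + fn) : 0;
    return (p, r);
}

foreach (var threshold in new[] { 0.60, 0.75, 0.85, 0.90 })
{
    var (precision, recall) = Evaluate(scored, threshold);
    Console.WriteLine($"threshold {threshold:F2}: precision {precision:F3}, recall {recall:F3}");
}
```

On this toy set, raising the cut-off from 0.60 to 0.85 drops recall from 1.0 to 0.5 while precision stays at 2/3. That is the kind of trade-off to look for when running the same sweep on real labelled pairs.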

References

  1. Jaccard Similarity
  2. Classification: Accuracy, recall, precision, and related metrics
