Introduction
Data cleaning is never sexy. Every CRM export, vendor feed, or merger spreadsheet comes with "ABC COMPANY LTD." vs "ABC Co. Limited", or "Flat 12B 8/F 25 Main St., HK" vs "25 Main Street, 8th Floor, Central, Hong Kong".
Yet accounting still needs a unique customer list, and marketing still wants a de-duplicated address book. So the question becomes:
"Can we dedupe 1 million addresses with millisecond latency, yet still be right when 'Flat 12B, 8/F' ≈ '8th Floor Flat 12B'?"
Yes. Do it deterministically first; ask an LLM only when you are unsure.
If you’re new to LLM chatbots, check out my earlier post:
Step-by-Step Guide: Write Your First AI Storyteller with Ollama (llama3.2) and Semantic Kernel in C#
(Have your Chatbot helper ready before continuing.)
1. Why a Hybrid?
Requirement | Pure RegEx / Jaccard | Pure LLM | Hybrid (RegEx + Jaccard ⇒ LLM) |
---|---|---|---|
CPU / RAM | ✔ tiny | ✘ tens of GB | ✔ tiny → medium (only on the 3 % "hard" cases) |
Latency per pair | < 2 ms | ≥ 400 ms (token stream) | < 5 ms P95 |
Deterministic | ✔ | ✘ (sampling) | ✔ for the 90 %+ of pairs that never reach the LLM |
Handles gnarly cases ("Tower III Central Plaza ↔ Tower 3 Central Plaza") | ✘ | ✔ | ✔ |
The routing rule fits in three lines:
if score ≥ 0.85 ⇒ accept
else if score ≤ 0.30 ⇒ reject
else ⇒ ask LLM ("YES" / "NO")
Deterministic logic solves almost everything; the LLM is the tie-breaker.
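The three-line rule above can be sketched as a small pure function. This is only an illustration: `Route`, `Triage`, and `Decide` are names made up for this sketch, and the full listing later in the post additionally checks the deterministic `IsMatch` flag before accepting or rejecting.

```csharp
using System;

// Route three illustrative scores through the 0.85 / 0.30 thresholds.
foreach (double score in new[] { 0.92, 0.55, 0.10 })
{
    Console.WriteLine($"{score} -> {Triage.Decide(score)}");
}

// Hypothetical names for this sketch; they do not appear in the article's code.
public enum Route { Accept, Reject, AskLlm }

public static class Triage
{
    public const double AcceptAt = 0.85; // at or above: trust the deterministic match
    public const double RejectAt = 0.30; // at or below: trust the deterministic mismatch

    public static Route Decide(double score) =>
        score >= AcceptAt ? Route.Accept
        : score <= RejectAt ? Route.Reject
        : Route.AskLlm; // everything in between goes to the tie-breaker
}
```

Keeping the two cut-offs as named constants makes it easy to experiment with them without touching the routing logic.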
2. Deterministic Layer
Regular-Expressions + Jaccard
Normalise
- Lower-case, strip punctuation & accents
- Expand abbreviations (“st.” → “street”, “15F” → “15 floor”)
- Canonicalise ordinals (“1st” → “first”, “15th” → “15”)
Extract critical tokens with RegEx
- buildingNumber, unitNumber, floor
Early exits
- Any critical token mismatches? ⇒ return DIFF (0.1)
Jaccard on the remaining token sets
- |A ∩ B| / |A ∪ B|
- Add a small bonus (+0.2 / +0.1) when critical tokens match
Return (isMatch, confidence)
- The whole pass runs in roughly linear time in the two address lengths, needs no GPU or network call, and costs effectively nothing per pair.
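To make the Jaccard step concrete, here is a minimal sketch on whitespace tokens. `TokenJaccard` is a hypothetical helper written for this example, and treating two empty inputs as identical is an assumption of the sketch. For "main street central hong kong" against "main street hong kong" the sets share 4 tokens out of 5 distinct ones, so the raw score is 4/5 = 0.8; adding the +0.2 bonus for a matching critical token and clamping gives 1.0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

double raw = TokenJaccard.Score("main street central hong kong", "main street hong kong");
// +0.2 is the bonus the text mentions for a matching critical token; clamp at 1.0.
double withBonus = Math.Min(1.0, raw + 0.2);
Console.WriteLine($"jaccard={raw}, with bonus={withBonus}");

// Hypothetical helper for illustration only.
public static class TokenJaccard
{
    public static double Score(string a, string b)
    {
        HashSet<string> setA = Tokens(a);
        HashSet<string> setB = Tokens(b);

        // Assumption of this sketch: two empty token sets count as identical.
        if (setA.Count == 0 && setB.Count == 0) return 1.0;

        int intersection = setA.Count(t => setB.Contains(t));
        int union = setA.Count + setB.Count - intersection; // |A ∪ B| = |A| + |B| - |A ∩ B|
        return (double)intersection / union;
    }

    private static HashSet<string> Tokens(string s) =>
        s.Split(' ', StringSplitOptions.RemoveEmptyEntries)
         .ToHashSet(StringComparer.OrdinalIgnoreCase);
}
```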
3. Probabilistic Layer
Tiny Local LLM (Ollama + qwen3:4b)
We tried:
- llama3.2-3B – struggled with address-specific rules.
- qwen3:4B – slightly larger, but instruction-tuned; much crisper “YES/NO” answers.
Prompts
System message (abridged; the full text appears in the demo below):
You are a postal-address expert …
Answer ONLY with YES or NO.
User prompt:
You are an expert in address comparison. Analyze these two addresses carefully:
Address A: {sourceAddress}
Address B: {targetAddress}
Consider:
- They may use different formatting or abbreviations
- Unit/Flat/Apartment numbers must match exactly
- Building numbers must match exactly
- Street names should be the same (accounting for abbreviations)
- Floor numbers must match if specified
Do these addresses refer to the SAME physical location?
Answer with exactly ONE WORD: "YES" or "NO"
Running the model inside an Ollama container keeps the data on-prem, with roughly 300 ms latency per call and no per-request API cost.
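A back-of-envelope check of what the hybrid costs on average, using only figures quoted in this post (about 3 % of pairs reach the LLM, under 2 ms for the deterministic pass, about 300 ms per LLM call). The symbols p, t_det and t_LLM are notation introduced here, and the inputs are rough bounds rather than measurements:

```latex
\begin{aligned}
\mathbb{E}[t] &= (1-p)\,t_{\text{det}} + p\,t_{\text{LLM}} \\
              &\approx 0.97 \times 2\,\text{ms} + 0.03 \times 300\,\text{ms} \\
              &= 1.94\,\text{ms} + 9\,\text{ms} \approx 10.9\,\text{ms}
\end{aligned}
```

Because fewer than 5 % of pairs take the slow path, the 95th percentile still sits on the fast path, which is how a P95 under 5 ms and a mean near 11 ms can both hold. Per million compared pairs that works out to about 30,000 LLM calls, or roughly 9,000 s (2.5 hours) if they run one at a time.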
4. SmartAddressValidator
using System.Text.RegularExpressions;
namespace MyPlaygroundApp.Utils
{
public class SmartAddressValidator
{
private readonly IChatbot? _intelligenceProvider;
// Store validation results for retrieval
private readonly List<ValidationResult> _results = new List<ValidationResult>();
public SmartAddressValidator(IChatbot? intelligenceProvider = null) => _intelligenceProvider = intelligenceProvider;
// Get all validation results
public IReadOnlyList<ValidationResult> GetAllResults() => _results.AsReadOnly();
// Clear results history
public void ClearResults() => _results.Clear();
// Main validation method
public async Task<bool> AreSameLocationAsync(string sourceAddress, string targetAddress)
{
// 1. deterministic validation
var deterministicResult = ValidateWithConfidence(sourceAddress, targetAddress);
Console.WriteLine($"Deterministic validation: {deterministicResult.IsMatch} (confidence: {deterministicResult.Confidence:F2})");
bool finalMatch;
bool usedAI = false;
// If high confidence match, return immediately
if (deterministicResult.IsMatch && deterministicResult.Confidence >= 0.85)
{
finalMatch = true;
}
// If very low confidence and clearly different, return false
else if (!deterministicResult.IsMatch && deterministicResult.Confidence <= 0.3)
{
finalMatch = false;
}
// For medium confidence (0.3-0.85), use deterministic if no AI available
else if (_intelligenceProvider is null)
{
finalMatch = deterministicResult.Confidence > 0.5;
}
// Use AI for uncertain cases
else
{
usedAI = true;
string prompt =
$"""
You are an expert in address comparison. Analyze these two addresses carefully:
Address A: {sourceAddress}
Address B: {targetAddress}
Consider:
- They may use different formatting or abbreviations
- Unit/Flat/Apartment numbers must match exactly
- Building numbers must match exactly
- Street names should be the same (accounting for abbreviations)
- Floor numbers must match if specified
Do these addresses refer to the SAME physical location?
Answer with exactly ONE WORD: "YES" or "NO"
""";
string answer = await _intelligenceProvider.AskQuestion(prompt);
finalMatch = answer.Contains("YES", StringComparison.OrdinalIgnoreCase);
}
// Create validation result
var result = new ValidationResult(sourceAddress, targetAddress)
{
IsMatch = finalMatch,
Confidence = deterministicResult.Confidence,
SourceNormalized = Normalize(sourceAddress),
TargetNormalized = Normalize(targetAddress),
UsedAI = usedAI
};
// Add to results list
_results.Add(result);
return result.IsMatch;
}
// Alternative method name for convenience
public async Task<bool> ValidateMatchAsync(string sourceAddress, string targetAddress)
=> await AreSameLocationAsync(sourceAddress, targetAddress);
// Get detailed validation result - retrieves the most recent validation for these addresses
public async Task<ValidationResult?> GetValidationDetailsAsync(string sourceAddress, string targetAddress)
{
// Check if we have a cached result for these addresses
var cachedResult = _results
.Where(r => r.SourceAddress == sourceAddress && r.TargetAddress == targetAddress)
.LastOrDefault();
if (cachedResult != null)
{
return cachedResult;
}
// If not cached, perform validation and return the result
await AreSameLocationAsync(sourceAddress, targetAddress);
// Return the result that was just added
return _results.Last();
}
// Get validation history for specific addresses
public IEnumerable<ValidationResult> GetValidationHistory(string sourceAddress, string targetAddress)
{
return _results.Where(r =>
(r.SourceAddress == sourceAddress && r.TargetAddress == targetAddress) ||
(r.SourceAddress == targetAddress && r.TargetAddress == sourceAddress));
}
private static (bool IsMatch, double Confidence) ValidateWithConfidence(string a, string b)
{
string nA = Normalize(a);
string nB = Normalize(b);
Console.WriteLine($"Normalized Source: {nA}");
Console.WriteLine($"Normalized Target: {nB}");
// Exact match after normalization
if (nA == nB) return (true, 1.0);
// Extract and compare components
var compA = ExtractComponents(nA);
var compB = ExtractComponents(nB);
Console.WriteLine($"Components Source: Building={compA.BuildingNumber}, Unit={compA.UnitNumber}, Floor={compA.Floor}");
Console.WriteLine($"Components Target: Building={compB.BuildingNumber}, Unit={compB.UnitNumber}, Floor={compB.Floor}");
// Special handling for building number ranges
bool buildingNumbersMatch = AreBuildingNumbersEquivalent(compA.BuildingNumber, compB.BuildingNumber);
// Critical components must match
if (!string.IsNullOrEmpty(compA.BuildingNumber) && !string.IsNullOrEmpty(compB.BuildingNumber))
{
if (!buildingNumbersMatch)
return (false, 0.1);
}
if (!string.IsNullOrEmpty(compA.UnitNumber) && !string.IsNullOrEmpty(compB.UnitNumber))
{
if (compA.UnitNumber != compB.UnitNumber)
return (false, 0.1);
}
if (!string.IsNullOrEmpty(compA.Floor) && !string.IsNullOrEmpty(compB.Floor))
{
if (compA.Floor != compB.Floor)
return (false, 0.1);
}
// Token-based comparison for the rest
var setA = TokenSet(compA.RemainingText);
var setB = TokenSet(compB.RemainingText);
var intersection = new HashSet<string>(setA);
intersection.IntersectWith(setB);
var union = new HashSet<string>(setA);
union.UnionWith(setB);
if (union.Count == 0) return (true, 0.9); // Both empty
double jaccard = (double)intersection.Count / union.Count;
// Add bonus for matching critical components
double bonus = 0;
if (!string.IsNullOrEmpty(compA.BuildingNumber) && buildingNumbersMatch)
bonus += 0.2;
if (!string.IsNullOrEmpty(compA.UnitNumber) && compA.UnitNumber == compB.UnitNumber)
bonus += 0.2;
if (!string.IsNullOrEmpty(compA.Floor) && compA.Floor == compB.Floor)
bonus += 0.1;
double finalScore = Math.Min(1.0, jaccard + bonus);
// More lenient thresholds
if (finalScore >= 0.6) return (true, finalScore);
if (finalScore >= 0.4) return (true, 0.5 + finalScore * 0.3); // Uncertain but lean towards match
return (false, finalScore);
}
// Check if building numbers are equivalent (handles ranges)
private static bool AreBuildingNumbersEquivalent(string buildingA, string buildingB)
{
if (buildingA == buildingB) return true;
// Check if one is a range and the other might be the same range differently formatted
var rangePattern = @"^(\d+)-(\d+)$";
var matchA = Regex.Match(buildingA, rangePattern);
var matchB = Regex.Match(buildingB, rangePattern);
if (matchA.Success && matchB.Success)
{
return matchA.Groups[1].Value == matchB.Groups[1].Value &&
matchA.Groups[2].Value == matchB.Groups[2].Value;
}
return false;
}
// Enhanced abbreviation dictionary
private static readonly Dictionary<string, string> _abbreviations = new(StringComparer.OrdinalIgnoreCase)
{
// Streets
{"st", "street"}, {"rd", "road"}, {"ave", "avenue"}, {"blvd", "boulevard"},
{"dr", "drive"}, {"ln", "lane"}, {"ct", "court"}, {"pl", "place"},
{"cir", "circle"}, {"ter", "terrace"}, {"way", "way"}, {"pkwy", "parkway"},
// Directions
{"n", "north"}, {"s", "south"}, {"e", "east"}, {"w", "west"},
{"ne", "northeast"}, {"nw", "northwest"}, {"se", "southeast"}, {"sw", "southwest"},
// Units
{"apt", "apartment"}, {"unit", "unit"}, {"ste", "suite"}, {"rm", "room"},
{"fl", "floor"}, {"f", "floor"}, {"bldg", "building"}, {"flat", "flat"},
// Regions
{"hk", "hong kong"}, {"h.k.", "hong kong"}, {"ny", "new york"}, {"nyc", "new york"},
// Numbers
{"1st", "first"}, {"2nd", "second"}, {"3rd", "third"}, {"4th", "fourth"},
{"5th", "fifth"}, {"6th", "sixth"}, {"7th", "seventh"}, {"8th", "eighth"},
{"9th", "ninth"}, {"10th", "tenth"}, {"11th", "11"}, {"12th", "12"},
{"13th", "13"}, {"14th", "14"}, {"15th", "15"}, {"16th", "16"},
{"17th", "17"}, {"18th", "18"}, {"19th", "19"}, {"20th", "20"}
};
// Regex patterns
private static readonly Regex _ordinals = new(@"(\d+)(st|nd|rd|th)\b", RegexOptions.IgnoreCase);
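// Accepts any ordinal suffix so "1st floor", "2nd floor" and "3rd floor" normalise like "8th floor"
private static readonly Regex _floor = new(@"(\d+)\s*(?:/f|f\b|/|(?:st|nd|rd|th)\s*)?floor", RegexOptions.IgnoreCase);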
private static readonly Regex _floorPrefix = new(@"^(\d+)f\b", RegexOptions.IgnoreCase);
private static readonly Regex _range = new(@"(\d+)\s*(?:-|to|~)\s*(\d+)", RegexOptions.IgnoreCase);
private static readonly Regex _unitPattern = new(@"(?:apt|apartment|unit|flat|suite|ste|#)\s*(\w+)", RegexOptions.IgnoreCase);
private static readonly Regex _buildingNumber = new(@"^(\d+(?:-\d+)?)\s+", RegexOptions.Compiled);
private static readonly Regex _punct = new(@"[.,/#!$%^&*;:{}=_`~()]", RegexOptions.Compiled);
private static string Normalize(string src)
{
if (string.IsNullOrWhiteSpace(src)) return "";
// 1. lower-case & trim
var s = src.ToLowerInvariant().Trim();
// 2. Normalize unit/apartment references BEFORE removing punctuation
s = Regex.Replace(s, @"#\s*(\w+)", "unit $1", RegexOptions.IgnoreCase);
// 3. Handle ranges BEFORE removing dashes
// Convert "100 to 200" → "100-200", "100 - 200" → "100-200", etc.
s = Regex.Replace(s, @"(\d+)\s*(?:to|-|~)\s*(\d+)", "$1-$2", RegexOptions.IgnoreCase);
// 4. Unify common punctuation patterns
s = s.Replace("'s ", " "); // Queen's → Queen
s = s.Replace("'", ""); // Remove remaining apostrophes
s = s.Replace("--", " "); // Double dash to space
// 5. Replace dashes with spaces EXCEPT between two digits (number ranges such as 100-200)
s = Regex.Replace(s, @"(?<!\d)-|-(?!\d)", " "); // a dash survives only when digits sit on both sides
// 6. remove other punctuation
s = _punct.Replace(s, " ");
// 7. Handle floor notations
// Convert "15F" -> "15 floor" when it's at the beginning or preceded by space
s = _floorPrefix.Replace(s, "$1 floor");
s = Regex.Replace(s, @"\s(\d+)f\b", " $1 floor", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"(\d+)\s*/\s*f\b", "$1 floor", RegexOptions.IgnoreCase);
s = _floor.Replace(s, "$1 floor");
// 8. unify ordinals: 1st to 10th become words ("1st" → "first"); larger ones keep only the number ("15th" → "15")
s = _ordinals.Replace(s, m =>
{
var num = m.Groups[1].Value;
if (_abbreviations.TryGetValue(m.Value, out var full))
return full;
return num;
});
// 9. replace abbreviations
var parts = s.Split(' ', StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < parts.Length; i++)
{
var p = parts[i].Trim();
if (_abbreviations.TryGetValue(p, out var full)) parts[i] = full;
}
s = string.Join(' ', parts);
// 10. collapse multiple spaces
s = Regex.Replace(s, @"\s{2,}", " ");
return s.Trim();
}
private class AddressComponents
{
public string BuildingNumber { get; set; } = "";
public string UnitNumber { get; set; } = "";
public string Floor { get; set; } = "";
public string RemainingText { get; set; } = "";
}
private static AddressComponents ExtractComponents(string normalized)
{
var comp = new AddressComponents();
var working = normalized;
// Extract building number (updated to handle ranges properly)
var buildingMatch = Regex.Match(working, @"^(\d+(?:-\d+)?)\s+");
if (buildingMatch.Success)
{
comp.BuildingNumber = buildingMatch.Groups[1].Value;
working = working.Substring(buildingMatch.Length).Trim();
}
// Extract unit/apartment/flat number (search anywhere in string)
var unitMatches = Regex.Matches(working, @"(?:apt|apartment|unit|flat|suite|ste)\s+(\w+)", RegexOptions.IgnoreCase);
if (unitMatches.Count > 0)
{
comp.UnitNumber = unitMatches[0].Groups[1].Value.ToLower();
working = Regex.Replace(working, @"(?:apt|apartment|unit|flat|suite|ste)\s+\w+", " ", RegexOptions.IgnoreCase).Trim();
}
// Extract floor (search anywhere in string)
var floorMatch = Regex.Match(working, @"(\d+)\s+floor", RegexOptions.IgnoreCase);
if (floorMatch.Success)
{
comp.Floor = floorMatch.Groups[1].Value;
working = Regex.Replace(working, @"\d+\s+floor", " ", RegexOptions.IgnoreCase).Trim();
}
// Check if building number is at the end (like "Main Street 25")
if (string.IsNullOrEmpty(comp.BuildingNumber))
{
var endBuildingMatch = Regex.Match(working, @"\s+(\d+(?:-\d+)?)$");
if (endBuildingMatch.Success)
{
comp.BuildingNumber = endBuildingMatch.Groups[1].Value;
working = working.Substring(0, endBuildingMatch.Index).Trim();
}
}
comp.RemainingText = working;
return comp;
}
private static HashSet<string> TokenSet(string s) =>
s.Split(' ', StringSplitOptions.RemoveEmptyEntries)
.Where(token => !string.IsNullOrWhiteSpace(token))
.ToHashSet(StringComparer.OrdinalIgnoreCase);
}
// Result class for detailed validation information
public class ValidationResult
{
public ValidationResult(string sourceAddress, string targetAddress)
{
this.SourceAddress = sourceAddress;
this.TargetAddress = targetAddress;
this.Timestamp = DateTime.UtcNow;
}
public string SourceAddress { get; set; }
public string TargetAddress { get; set; }
public bool IsMatch { get; set; }
public double Confidence { get; set; }
public string SourceNormalized { get; set; } = "";
public string TargetNormalized { get; set; } = "";
public bool UsedAI { get; set; }
public DateTime Timestamp { get; set; }
}
}
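The listing depends on an `IChatbot` type that comes from the earlier post and is not shown here. The only member it uses is `AskQuestion(prompt)`, awaited and assigned to a string, so a minimal stand-in might look like the sketch below. The interface shape is inferred from that single call and may differ from the real one; `FixedAnswerChatbot` is a hypothetical test double, useful for exercising the LLM branch without a running model. If you already have the real `IChatbot` in your project, keep only the stub class.

```csharp
using System;
using System.Threading.Tasks;

// Usage: a canned bot makes the "uncertain" branch reproducible in unit tests.
IChatbot bot = new FixedAnswerChatbot("NO");
Console.WriteLine(await bot.AskQuestion("Are these the same address?"));

// Assumed shape, inferred from the one call the validator makes.
public interface IChatbot
{
    Task<string> AskQuestion(string prompt);
}

// Hypothetical test double, not part of the original project.
public sealed class FixedAnswerChatbot : IChatbot
{
    private readonly string _answer;

    public FixedAnswerChatbot(string answer) => _answer = answer;

    // Ignores the prompt and always returns the canned answer.
    public Task<string> AskQuestion(string prompt) => Task.FromResult(_answer);
}
```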
Code Walk-through
flowchart TD
A[raw addresses] -->|normalise| B(RegEx extraction)
B --> C{Critical mismatch?}
C -->|yes, conf=0.1| F[DIFF]
C -->|no| D[Jaccard + bonus]
D -->|conf≥0.85| E[MATCH]
D -->|conf≤0.30| F
D -->|else| G[LLM ask]
G --> H{LLM response YES/NO}
H -- YES --> E
H -- NO --> F
Key snippets:
Normalisation pipeline
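s = s.ToLowerInvariant()
     .Replace("'", "")  // O'Connor → oconnor
     .Replace("--", " ");
s = Regex.Replace(s, @"(?<!\d)-|-(?!\d)", " "); // dashes become spaces, except inside ranges like 100-200
s = Regex.Replace(s, @"\s{2,}", " ");
// expand abbreviations token by token
s = string.Join(' ', s.Split(' ', StringSplitOptions.RemoveEmptyEntries)
     .Select(t => _abbreviations.TryGetValue(t, out var full) ? full : t));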
Jaccard score
int inter = setA.Intersect(setB).Count();
int union = setA.Union(setB).Count(); // the full listing returns early when union == 0
double score = Math.Min(1.0, (double)inter / union + bonus); // clamp to 1.0
LLM handshake
string answer = await _intelligenceProvider.AskQuestion(prompt);
bool final = answer.Contains("YES", StringComparison.OrdinalIgnoreCase);
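One caveat with the `Contains("YES")` check: it accepts any reply that merely contains those three letters, such as "NO, although yes, the street matches". A stricter reading might look only at the first word of the visible answer. The sketch below is one way to do that; `LlmAnswerParser` is a hypothetical helper, and the `<think>…</think>` stripping assumes your model or runtime emits its reasoning in such tags, which depends on the model and settings.

```csharp
using System;
using System.Text.RegularExpressions;

foreach (string reply in new[] { "YES", "No.", "<think>yes? the floors differ</think>\nNO", "Maybe" })
{
    bool? verdict = LlmAnswerParser.ParseYesNo(reply);
    string label = verdict.HasValue ? verdict.Value.ToString() : "unclear";
    Console.WriteLine(label);
}

// Hypothetical helper for illustration only.
public static class LlmAnswerParser
{
    // Assumption: reasoning, if present, is wrapped in <think>...</think>.
    private static readonly Regex ThinkBlock =
        new(@"<think>.*?</think>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns true for YES, false for NO, and null when the reply is neither.
    public static bool? ParseYesNo(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;

        string visible = ThinkBlock.Replace(raw, " ");
        Match firstWord = Regex.Match(visible, "[A-Za-z]+");
        if (!firstWord.Success) return null;

        if (firstWord.Value.Equals("YES", StringComparison.OrdinalIgnoreCase)) return true;
        if (firstWord.Value.Equals("NO", StringComparison.OrdinalIgnoreCase)) return false;
        return null;
    }
}
```

When the result is `null`, the caller can fall back to the deterministic score or flag the pair for manual review instead of silently treating it as a mismatch.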
5. Full Demo (▶ copy-paste)
// Needs `using System.Text;` for Encoding.UTF8, plus the Chatbot class from the earlier post.
public static async Task Main(string[] args)
{
Console.OutputEncoding = Encoding.UTF8;
// 1. Wire an LLM (optional – pass null to stay deterministic)
IChatbot? chat = new Chatbot(
endpoint: "http://localhost:11434",
systemMessage: "You are a postal-address expert. When comparing addresses, you must be precise about unit numbers, floor numbers, and building numbers. Answer ONLY with YES or NO.",
modelId: "qwen3:4b"
);
// 2. Create the validator
var validator = new SmartAddressValidator(chat);
// 3. Test cases: a mix of matching and non-matching pairs
var pairs = new (string A, string B, bool Expected, string Description)[]
{
// Benchmark pairs (excerpt; the results in section 6 cover 50 pairs)
// Basic formatting differences
("123 Main Street", "123 Main St", true, "Basic abbreviation"),
("456 First Avenue", "456 1st Ave", true, "Ordinal + abbreviation"),
("789 North Park Road", "789 N Park Rd", true, "Direction abbreviation"),
// Add your testing cases here...
};
Console.WriteLine("Smart Address Validator Extended Test Results");
Console.WriteLine("============================================\n");
int correct = 0;
int total = 0;
int truePositives = 0;
int trueNegatives = 0;
int falsePositives = 0;
int falseNegatives = 0;
foreach (var (a, b, expected, description) in pairs)
{
total++;
bool result = await validator.AreSameLocationAsync(a, b);
bool isCorrect = result == expected;
if (isCorrect) correct++;
// Track confusion matrix
if (expected && result) truePositives++;
else if (!expected && !result) trueNegatives++;
else if (!expected && result) falsePositives++;
else if (expected && !result) falseNegatives++;
string status = isCorrect ? "✓" : "✗✗✗";
string expectedStr = expected ? "MATCH" : "DIFF";
string resultStr = result ? "MATCH" : "DIFF";
Console.WriteLine($"{status} Expected: {expectedStr}, Got: {resultStr} - {description}");
Console.WriteLine($" A: {a}");
Console.WriteLine($" B: {b}");
// Show details for failures
if (!isCorrect)
{
var details = await validator.GetValidationDetailsAsync(a, b);
Console.WriteLine($" Details: Confidence={details.Confidence:F2}, UsedAI={details.UsedAI}");
}
Console.WriteLine();
}
Console.WriteLine($"\nAccuracy: {correct}/{total} ({(double)correct / total * 100:F1}%)");
// Confusion matrix results
Console.WriteLine("\nResults by category:");
Console.WriteLine($"True Positives (correctly matched): {truePositives}");
Console.WriteLine($"True Negatives (correctly different): {trueNegatives}");
Console.WriteLine($"False Positives (incorrectly matched): {falsePositives}");
Console.WriteLine($"False Negatives (incorrectly different): {falseNegatives}");
// Calculate precision and recall
double precision = truePositives > 0 ? (double)truePositives / (truePositives + falsePositives) : 0;
double recall = truePositives > 0 ? (double)truePositives / (truePositives + falseNegatives) : 0;
double f1Score = precision + recall > 0 ? 2 * (precision * recall) / (precision + recall) : 0;
Console.WriteLine($"\nMetrics:");
Console.WriteLine($"Precision: {precision:F3}");
Console.WriteLine($"Recall: {recall:F3}");
Console.WriteLine($"F1 Score: {f1Score:F3}");
}
6. Results
Metric | Value |
---|---|
Accuracy | 37 / 50 = 74 % |
Precision | 0.667 |
Recall | 0.870 |
F1-score | 0.755 |
Confusion-matrix breakdown
Actual \ Predicted | Predicted MATCH | Predicted DIFF |
---|---|---|
Actual MATCH | 20 (True Positive) | 3 (False Negative) |
Actual DIFF | 10 (False Positive) | 17 (True Negative) |
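For readers who want to check the table, the four metrics follow directly from the confusion matrix (TP = 20, FP = 10, FN = 3, TN = 17):

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{20 + 17}{50} = 0.74 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{20}{30} \approx 0.667 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{20}{23} \approx 0.870 \\
F_1              &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{40}{53} \approx 0.755
\end{aligned}
```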
The hybrid engine still captures most true matches (recall 0.870), but precision lags at 0.667: 10 of the 13 mistakes are false positives. Tightening the deterministic thresholds or curbing the LLM's "YES" bias is the obvious place to look for gains in precision and overall accuracy.
Challenge for You!
How would you improve these metrics?
Consider the following strategies:
- Experiment with a different LLM (Large Language Model).
- Adjust the thresholds to optimize precision, recall, or overall performance.
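As a starting point for the threshold experiment, the sketch below recomputes precision and recall at several cut-offs. It is a simplification: it uses a single accept threshold and ignores the LLM band in the middle. `ThresholdSweep` is a hypothetical helper, and the scores in the sample array are made up purely for illustration; substitute the (confidence, expected) pairs from your own benchmark run.

```csharp
using System;
using System.Collections.Generic;

// Made-up scores for illustration only, NOT results from the article's benchmark.
var scored = new (double Score, bool Expected)[]
{
    (0.95, true), (0.90, true), (0.88, false), (0.70, true),
    (0.65, false), (0.40, false), (0.20, false), (0.10, false),
};

foreach (double threshold in new[] { 0.5, 0.7, 0.9 })
{
    var (precision, recall) = ThresholdSweep.PrecisionRecall(scored, threshold);
    Console.WriteLine($"threshold={threshold} precision={precision} recall={recall}");
}

// Hypothetical helper for illustration only.
public static class ThresholdSweep
{
    // Predicts MATCH when score >= threshold and compares against the expected label.
    public static (double Precision, double Recall) PrecisionRecall(
        IEnumerable<(double Score, bool Expected)> scored, double threshold)
    {
        int tp = 0, fp = 0, fn = 0;
        foreach (var (score, expected) in scored)
        {
            bool predicted = score >= threshold;
            if (predicted && expected) tp++;
            else if (predicted && !expected) fp++;
            else if (!predicted && expected) fn++;
        }

        double precision = tp + fp > 0 ? (double)tp / (tp + fp) : 0;
        double recall = tp + fn > 0 ? (double)tp / (tp + fn) : 0;
        return (precision, recall);
    }
}
```

Raising the cut-off typically trades recall for precision, so plotting both across a range of thresholds shows where the false positives start to drop off.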