<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adem Akdoğan</title>
    <description>The latest articles on DEV Community by Adem Akdoğan (@ademakdogan).</description>
    <link>https://dev.to/ademakdogan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1084576%2F99353537-b699-4ce8-b8cf-bfd6ff6590b7.jpeg</url>
      <title>DEV Community: Adem Akdoğan</title>
      <link>https://dev.to/ademakdogan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ademakdogan"/>
    <language>en</language>
    <item>
      <title>Deep Dive into String Similarity: From Edit Distance to Fuzzy Matching Theory and Practice in Python</title>
      <dc:creator>Adem Akdoğan</dc:creator>
      <pubDate>Fri, 27 Feb 2026 13:41:31 +0000</pubDate>
      <link>https://dev.to/ademakdogan/deep-dive-into-string-similarity-from-edit-distance-to-fuzzy-matching-theory-and-practice-in-python-4hnf</link>
      <guid>https://dev.to/ademakdogan/deep-dive-into-string-similarity-from-edit-distance-to-fuzzy-matching-theory-and-practice-in-python-4hnf</guid>
      <description>&lt;p&gt;&lt;em&gt;How similar are two strings? It sounds like a simple question, but the answer runs far deeper than you might think. In this article, we'll explore the world of string similarity algorithms, uncover the mathematics behind each one, and see how they solve real-world problems from catching typos to matching financial transactions.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why String Similarity Matters
&lt;/h2&gt;

&lt;p&gt;In the real world, data is rarely perfect. Users make typos. Databases accumulate duplicates. The same customer might appear as "John Smith", "JOHN SMITH", and "Jon Smyth" across three different records. An e-commerce platform might list "iPhone 15 Pro Max" in one catalog and "Apple iPhone15 ProMax" in another. If you rely on exact string matching, these records will never be linked, and that's a problem.&lt;br&gt;
This is where string similarity and fuzzy matching come in. These techniques allow us to quantify how "close" two strings are to each other, even when they're not identical. The applications are everywhere: spell checkers suggest corrections when you type "algortihm" instead of "algorithm." Search engines understand that "pythn" probably means "Python." CRM systems merge duplicate customer records despite minor variations in spelling. In bioinformatics, researchers compare DNA sequences to find evolutionary relationships. Fraud detection systems flag suspiciously similar addresses. OCR post-processing corrects misread characters. And record linkage systems connect data across databases that were never designed to talk to each other.&lt;br&gt;
All of these problems boil down to one fundamental question: how similar are these two strings? There are dozens of algorithms that answer this question, and each one has strengths suited to different scenarios. In this article, we'll examine the most important ones in depth.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Edit Distance: Measuring Similarity Through Transformation
&lt;/h2&gt;

&lt;p&gt;The edit distance family of algorithms asks a deceptively simple question: what is the minimum number of operations needed to transform one string into another? The definition of "operation" varies by algorithm - it might include insertion, deletion, substitution, or transposition - but the core idea remains the same. The fewer operations required, the more similar the strings are.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 Levenshtein Distance
&lt;/h3&gt;

&lt;p&gt;The Levenshtein Distance, named after Soviet mathematician Vladimir Levenshtein who introduced it in 1965, is the most fundamental and widely used string similarity metric. It measures the minimum number of single-character insertions, deletions, and substitutions required to change one string into the other.&lt;br&gt;
The algorithm works using dynamic programming. It constructs a matrix where each cell represents the edit distance between prefixes of the two input strings. The matrix is filled row by row, with each cell calculated as the minimum of three possible operations: deleting a character (cell above + 1), inserting a character (cell to the left + 1), or substituting a character (diagonal cell + 0 if the characters match, + 1 if they don't). The final answer sits in the bottom-right corner of the matrix.&lt;br&gt;
Let's walk through a classic example. To transform "kitten" into "sitting", we need exactly 3 operations: substitute 'k' with 's', substitute 'e' with 'i', and insert 'g' at the end. The dynamic programming matrix makes this clear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   "" s i t t i n g
""[ 0 1 2 3 4 5 6 7 ]
k [ 1 1 2 3 4 5 6 7 ]
i [ 2 2 1 2 3 4 5 6 ]
t [ 3 3 2 1 2 3 4 5 ]
t [ 4 4 3 2 1 2 3 4 ]
e [ 5 5 4 3 2 2 3 4 ]
n [ 6 6 5 4 3 3 2 3 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value in the bottom-right cell is 3 - that's our Levenshtein Distance.&lt;br&gt;
In practice, raw distance values are hard to compare across string pairs of different lengths. A distance of 3 between two 7-character strings is quite different from a distance of 3 between two 100-character strings. This is why we often use the normalized similarity:&lt;br&gt;
1 - (distance / max(len(s1), len(s2))).&lt;br&gt;
For our example, that gives us 1 - (3/7) = 0.5714, meaning the strings are about 57% similar. For a simple typo like "algorithm" vs "algoritm", the normalized similarity is 1 - (1/9) = 0.8889, about 89% similar, which intuitively makes sense.&lt;br&gt;
The time complexity is O(m × n) where m and n are the string lengths, and with optimization, the space complexity can be reduced to O(min(m, n)) since we only need two rows of the matrix at any time. This makes Levenshtein excellent for short to medium strings (names, cities, product titles) but potentially slow for very long sequences like paragraphs or DNA strings.&lt;/p&gt;
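To make the recurrence concrete, here is a minimal, space-optimized sketch in plain Python (an illustration of the algorithm, not HyperFuzz's Rust implementation), keeping only two matrix rows at a time:

```python
# Two-row Levenshtein sketch: space is O(min(m, n)) because we never
# materialize the full matrix, only the previous and current rows.
def levenshtein(s1: str, s2: str) -> int:
    if len(s2) > len(s1):          # make s2 the shorter string
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(s1: str, s2: str) -> float:
    if not s1 and not s2:
        return 1.0
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(levenshtein("kitten", "sitting"))                       # → 3
print(round(levenshtein_similarity("kitten", "sitting"), 4))  # → 0.5714
```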
&lt;h3&gt;
  
  
  2.2 Damerau-Levenshtein Distance
&lt;/h3&gt;

&lt;p&gt;The standard Levenshtein Distance treats the transposition of two adjacent characters as two separate operations (a deletion and an insertion, or two substitutions). But research has shown that approximately 80% of real-world typographical errors fall into just four categories: wrong character (substitution), extra character (insertion), missing character (deletion), and two adjacent characters swapped (transposition). The Damerau-Levenshtein Distance addresses this by adding transposition as a fourth primitive operation, each costing just 1.&lt;br&gt;
Consider transforming "recieve" into "receive." Under standard Levenshtein, this requires two operations (two substitutions). Under Damerau-Levenshtein, it's a single transposition of the adjacent 'i' and 'e', a distance of just 1. This matters enormously in practice. When a user types "hte" instead of "the," a spell checker using Damerau-Levenshtein recognizes this as a single error, while standard Levenshtein sees two.&lt;br&gt;
There's an important subtlety here: Damerau-Levenshtein has a restricted variant called OSA (Optimal String Alignment). The key difference is that OSA imposes a constraint that no substring may be edited more than once. In the full Damerau-Levenshtein, you can transpose two characters and then further edit the result; for example, transforming "CA" to "ABC" takes just 2 operations (transpose CA → AC, then insert B). But under OSA, once you transpose CA → AC, you can't edit that substring again, so the path becomes CA → A → AB → ABC = 3 operations. The full Damerau-Levenshtein satisfies the triangle inequality (making it a true metric), while OSA does not. However, OSA runs in O(m × n) time compared to O(m × n × max(m, n)) for the unrestricted version, so the choice depends on whether you need the mathematical guarantees or prefer faster computation.&lt;/p&gt;
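The restricted (OSA) recurrence can be sketched by adding a transposition case to the standard Levenshtein DP; this is an illustrative plain-Python version, not the library's code:

```python
# OSA: Levenshtein recurrence plus an adjacent-transposition case.
# The restriction (no substring edited twice) is implicit in the DP shape.
def osa_distance(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("hte", "the"))  # → 1 (one adjacent transposition)
print(osa_distance("CA", "ABC"))   # → 3 (OSA cannot re-edit the swapped pair)
```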
&lt;h3&gt;
  
  
  2.3 Indel Distance and Hamming Distance
&lt;/h3&gt;

&lt;p&gt;Two other members of the edit distance family deserve brief mention. The Indel Distance only allows insertions and deletions, with no substitutions. Its name comes from the bioinformatics term "insertion/deletion." It's closely related to the Longest Common Subsequence (LCS): Indel = len(s1) + len(s2) - 2 × LCS_length. For "kitten" vs "sitting," the LCS length is 4, giving an Indel distance of 6 + 7 - 2 × 4 = 5.&lt;br&gt;
The Hamming Distance is the simplest of all: it counts the number of positions where two equal-length strings differ. "karolin" vs "kathrin" has a Hamming distance of 3 (positions 3, 4, and 5 differ). It's only defined for strings of equal length, which limits its general applicability, but it's extremely useful for error-detecting codes, hash comparison, and fixed-length encoded data like barcodes.&lt;/p&gt;
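A Hamming distance sketch is almost a one-liner; this illustrative version enforces the equal-length requirement from the definition:

```python
# Hamming distance: position-by-position comparison of equal-length strings.
def hamming_distance(s1: str, s2: str) -> int:
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))  # → 3
```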
&lt;h2&gt;
  
  
  3. Character Matching: Jaro and Jaro-Winkler
&lt;/h2&gt;

&lt;p&gt;While edit distance algorithms count the operations needed for transformation, the Jaro family takes a fundamentally different approach. Developed by Matthew A. Jaro in 1989 for the U.S. Census Bureau's record linkage work, the Jaro Similarity is designed specifically for comparing short strings like names and addresses.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 How Jaro Similarity Works
&lt;/h3&gt;

&lt;p&gt;The algorithm operates in three steps. First, it defines a matching window: characters from the two strings are considered matching only if they're identical and their positions differ by no more than floor(max(|s1|, |s2|) / 2) - 1. Second, it counts the transpositions: matching characters that appear in a different order in the two strings. Finally, these values are combined using the formula:&lt;br&gt;
Jaro = (1/3) × (m/|s1| + m/|s2| + (m - t)/m)&lt;br&gt;
where &lt;code&gt;m&lt;/code&gt; is the number of matching characters and &lt;code&gt;t&lt;/code&gt; is half the number of transpositions.&lt;br&gt;
Let's work through "MARTHA" vs "MARHTA." The matching window is floor(6/2) - 1 = 2. All six characters match within this window, so m = 6. However, 'T' and 'H' appear in different orders in the two strings; that's one transposition pair, giving t = 1. Plugging into the formula: Jaro = (1/3) × (6/6 + 6/6 + 5/6) = (1/3) × 2.8333 = 0.9444.&lt;/p&gt;
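The three steps can be sketched in plain Python (illustrative only; HyperFuzz implements this natively in Rust):

```python
# Jaro similarity sketch: match within a window, count transpositions, combine.
def jaro_similarity(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    s1_matches = [False] * len(s1)
    s2_matches = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):               # step 1: matches inside the window
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s2[j] == c:
                s1_matches[i] = s2_matches[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t = 0                                    # step 2: transpositions
    j = 0
    for i, matched in enumerate(s1_matches):
        if matched:
            while not s2_matches[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2                                  # half the out-of-order pairs
    # step 3: combine
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # → 0.9444
```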
&lt;h3&gt;
  
  
  3.2 The Winkler Extension
&lt;/h3&gt;

&lt;p&gt;In 1990, William E. Winkler extended Jaro's work with a key insight: when two strings share a common prefix, they're more likely to refer to the same entity. This is based on the practical observation that people rarely misspell the first few characters of a name. The Jaro-Winkler formula adds a bonus proportional to the common prefix length (up to 4 characters):&lt;br&gt;
Jaro-Winkler = Jaro + (L × P × (1 - Jaro))&lt;br&gt;
where L is the common prefix length (max 4) and P is a scaling factor (typically 0.1).&lt;br&gt;
For "MARTHA" vs "MARHTA," the common prefix is "MAR" (L = 3). So Jaro-Winkler = 0.9444 + (3 × 0.1 × (1 - 0.9444)) = 0.9444 + 0.0167 = 0.9611. The prefix bonus pushed the score from 94.4% to 96.1%. This might seem like a small difference, but at scale, when you're comparing millions of name pairs, it meaningfully improves ranking accuracy.&lt;br&gt;
Jaro-Winkler is the go-to metric for name matching, address comparison, and customer record deduplication. If your problem involves comparing names or short identifiers where the beginning of the string is typically more reliable, Jaro-Winkler is almost certainly better than Levenshtein.&lt;/p&gt;
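The Winkler adjustment itself is simple arithmetic on top of a precomputed Jaro score; here is a small illustrative sketch (the function name `winkler_adjust` is mine, not a library API):

```python
# Winkler prefix bonus: boost a precomputed Jaro score by the shared prefix
# length (capped at 4) times a scaling factor p (typically 0.1).
def winkler_adjust(s1: str, s2: str, jaro: float, p: float = 0.1) -> float:
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

# Using the article's Jaro score of 0.9444 for "MARTHA" vs "MARHTA":
print(round(winkler_adjust("MARTHA", "MARHTA", 0.9444), 4))  # → 0.9611
```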
&lt;h2&gt;
  
  
  4. Sequence-Based Metrics: LCS and Alignment Algorithms
&lt;/h2&gt;

&lt;p&gt;Some string similarity problems require looking at the longest shared structure between two strings rather than counting edits. This is where sequence-based metrics shine.&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 Longest Common Subsequence (LCS)
&lt;/h3&gt;

&lt;p&gt;The LCS finds the longest sequence of characters that appears in both strings in the same order but not necessarily contiguously. This distinction is important. For "ABCBDAB" and "BDCABA," the LCS is "BCBA" (length 4). The characters B, C, B, and A appear in both strings in the same relative order, even though they're not adjacent in either string. The LCS distance is then max(|s1|, |s2|) - LCS_length.&lt;br&gt;
There's a related but distinct concept: the Longest Common Substring (LCSstr), which requires the matching characters to be contiguous. For "the quick brown fox" vs "the quick red fox," the LCS subsequence would be much longer than the LCSstr substring ("the quick "), because the subsequence can skip over non-matching sections while the substring cannot.&lt;br&gt;
LCS computation uses dynamic programming and runs in O(m × n) time, but modern implementations use bit-parallel techniques (processing 64 or 128 characters at once using integer bitmasks) to achieve dramatic speedups on practical inputs.&lt;/p&gt;
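The LCS recurrence, and its relationship to the Indel distance, can be sketched as follows (illustrative plain Python using the same two-row trick as Levenshtein, not the bit-parallel version):

```python
# LCS length via the classic DP recurrence, keeping only two rows.
def lcs_length(s1: str, s2: str) -> int:
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)        # extend the subsequence
            else:
                curr.append(max(prev[j], curr[-1])) # carry the best so far
        prev = curr
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))              # → 4
# Indel distance from the LCS relation: len(s1) + len(s2) - 2 × LCS
print(6 + 7 - 2 * lcs_length("kitten", "sitting"))  # → 5
```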
&lt;h3&gt;
  
  
  4.2 Smith-Waterman and Needleman-Wunsch
&lt;/h3&gt;

&lt;p&gt;These two algorithms come from bioinformatics, where they were designed for aligning DNA and protein sequences, but they're equally powerful for general string comparison.&lt;br&gt;
Needleman-Wunsch (1970) performs global alignment: it aligns two entire sequences from start to finish, maximizing the overall similarity score. It uses a scoring system where matches earn positive points, mismatches earn negative points, and gaps (insertions/deletions) incur a penalty. The algorithm fills a dynamic programming matrix and traces back through it to find the optimal alignment. It's ideal when comparing sequences of similar length that you expect to be similar throughout.&lt;br&gt;
Smith-Waterman (1981) performs local alignment: instead of aligning entire sequences, it finds the most similar region within two sequences. The key difference in implementation is that negative scores in the matrix are reset to zero, which allows the algorithm to "restart" and find locally optimal alignments. The result is the highest-scoring subsequence alignment, which is perfect for finding conserved domains in proteins or matching a short query against a long document.&lt;br&gt;
The distinction matters in practice. If you're comparing "the quick brown fox" to "fox brown quick the," Needleman-Wunsch will give a global score reflecting the overall rearrangement, while Smith-Waterman will identify "fox" or "brown" as strongly matching local regions regardless of the surrounding text.&lt;/p&gt;
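Both alignments share the same DP skeleton and differ mainly in the reset-to-zero step. The sketch below assumes a scoring scheme of match +1, mismatch -1, gap -1, which may differ from any particular library's defaults:

```python
# Global alignment (Needleman-Wunsch): score the entire sequences end to end.
def needleman_wunsch_score(s1, s2, match=1, mismatch=-1, gap=-1):
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        curr = [i * gap]
        for j, c2 in enumerate(s2, 1):
            score = match if c1 == c2 else mismatch
            curr.append(max(prev[j - 1] + score,  # align c1 with c2
                            prev[j] + gap,        # gap in s2
                            curr[-1] + gap))      # gap in s1
        prev = curr
    return prev[-1]

# Local alignment (Smith-Waterman): reset negatives to zero so the DP can
# "restart" and report the best-scoring local region instead.
def smith_waterman_score(s1, s2, match=1, mismatch=-1, gap=-1):
    best = 0
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, 1):
            score = match if c1 == c2 else mismatch
            cell = max(0,                         # the local-alignment reset
                       prev[j - 1] + score,
                       prev[j] + gap,
                       curr[-1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(needleman_wunsch_score("ACGT", "ACGT"))       # → 4
print(smith_waterman_score("xxfoxyy", "aafoxbb"))   # → 3 ("fox" matches locally)
```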
&lt;h2&gt;
  
  
  5. Set-Based Metrics: Looking at Token Overlap
&lt;/h2&gt;

&lt;p&gt;All the algorithms we've discussed so far are sequence-sensitive: the order of characters or tokens matters. But in many real-world scenarios, order doesn't matter at all. "Hello World" and "World Hello" mean the same thing. "New York City" and "City York New" refer to the same place. Set-based metrics address this by treating strings as bags of tokens (typically words) and measuring the overlap between these sets.&lt;/p&gt;
&lt;h3&gt;
  
  
  5.1 Jaccard Similarity
&lt;/h3&gt;

&lt;p&gt;The Jaccard Index, perhaps the most intuitive set similarity metric, measures the size of the intersection divided by the size of the union of two sets:&lt;br&gt;
&lt;em&gt;Jaccard(A, B) = |A ∩ B| / |A ∪ B|&lt;/em&gt;&lt;br&gt;
For "hello world" vs "world hello," both token sets are {"hello", "world"}, so the intersection and union are identical and Jaccard = 1.0, a perfect match. For "the cat sat" vs "the dog sat," the intersection is {"the", "sat"} (size 2) and the union is {"the", "cat", "sat", "dog"} (size 4), giving Jaccard = 0.5. Half the words are shared, so we get 50% similarity, which is perfectly intuitive.&lt;br&gt;
In practice, many implementations use multiset (Counter) semantics, where word frequency matters. Instead of simple sets, they use &lt;code&gt;intersection = Σ min(count_A(x), count_B(x))&lt;/code&gt; and &lt;code&gt;union = Σ max(count_A(x), count_B(x))&lt;/code&gt;. This is important when the same word appears multiple times, as in "the the cat" vs "the dog."&lt;/p&gt;
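A multiset Jaccard sketch using the standard library's Counter (illustrative; it uses the identity max(x, y) = x + y - min(x, y) to get the union size from the totals):

```python
from collections import Counter

# Multiset Jaccard: repeated words count, per the Counter semantics above.
def jaccard_multiset(s1: str, s2: str) -> float:
    a, b = Counter(s1.split()), Counter(s2.split())
    inter = sum(min(a[w], b[w]) for w in a)   # Σ min(count_A(x), count_B(x))
    union = sum((a + b).values()) - inter     # Σ max = totals - Σ min
    return inter / union if union else 1.0

print(jaccard_multiset("hello world", "world hello"))  # → 1.0
print(jaccard_multiset("the cat sat", "the dog sat"))  # → 0.5
print(jaccard_multiset("the the cat", "the dog"))      # → 0.25
```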
&lt;h3&gt;
  
  
  5.2 Sørensen-Dice, Cosine, Tversky, and Overlap
&lt;/h3&gt;

&lt;p&gt;The Sørensen-Dice coefficient is closely related to Jaccard but gives more weight to the intersection: Dice = 2|A∩B| / (|A| + |B|). For the "the cat sat" vs "the dog sat" example, Dice = 2×2 / (3+3) = 0.667, compared to Jaccard's 0.5. Dice is always greater than or equal to Jaccard, and the relationship is Dice = 2 × Jaccard / (1 + Jaccard).&lt;br&gt;
Cosine Similarity takes a different approach entirely. Instead of treating strings as sets, it converts them into term frequency vectors and measures the cosine of the angle between these vectors. For "data science is great" vs "science data is wonderful," we construct vectors over the vocabulary {data, science, is, great, wonderful}. String 1 becomes [1,1,1,1,0] and string 2 becomes [1,1,1,0,1]. The dot product is 3, each vector has magnitude 2, so Cosine = 3/4 = 0.75. Cosine's key advantage over Jaccard is that it considers term frequency (how often a word appears) and is normalized by vector magnitude, making it less sensitive to document length which is why it's the dominant metric in information retrieval and document classification.&lt;br&gt;
The Tversky Index generalizes Jaccard with two asymmetry parameters (α, β): Tversky = |A∩B| / (|A∩B| + α|A\B| + β|B\A|). Setting α = β = 1 gives Jaccard; α = β = 0.5 gives Dice. The asymmetry is useful when comparing a search query (A) against a candidate document (B): you might want to penalize query terms missing from the candidate more heavily than extra terms in the candidate.&lt;br&gt;
The Overlap Coefficient normalizes by the smaller set: |A∩B| / min(|A|, |B|). This measures whether the smaller set is a subset of the larger one. For "hello" vs "hello world test," the overlap is 1/1 = 1.0 - the smaller set is entirely contained in the larger one.&lt;/p&gt;
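These set-based formulas translate directly into a few lines of illustrative Python (the function names here are mine, not HyperFuzz's API):

```python
# Word-set variants of Dice, Overlap, and Tversky, exactly as in the formulas.
def token_set(s: str) -> set:
    return set(s.lower().split())

def dice(s1: str, s2: str) -> float:
    a, b = token_set(s1), token_set(s2)
    return 2 * len(a.intersection(b)) / (len(a) + len(b))

def overlap(s1: str, s2: str) -> float:
    a, b = token_set(s1), token_set(s2)
    return len(a.intersection(b)) / min(len(a), len(b))

def tversky(s1: str, s2: str, alpha: float = 0.5, beta: float = 0.5) -> float:
    a, b = token_set(s1), token_set(s2)
    inter = len(a.intersection(b))
    return inter / (inter + alpha * len(a - b) + beta * len(b - a))

print(round(dice("the cat sat", "the dog sat"), 4))     # → 0.6667
print(overlap("hello", "hello world test"))             # → 1.0
print(round(tversky("the cat sat", "the dog sat"), 4))  # → 0.6667 (α=β=0.5 gives Dice)
```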
&lt;h2&gt;
  
  
  6. Fuzzy Matching: Combining Algorithms for Practical Use
&lt;/h2&gt;

&lt;p&gt;While the individual algorithms above are powerful, real-world string matching often requires combining them intelligently. This is exactly what fuzzy matching libraries do.&lt;br&gt;
The basic ratio computes character-level normalized similarity on a 0–100 scale. For "fuzzy wuzzy" vs "wuzzy fuzzy," it returns just 45.45 because at the character level, these strings are quite different; the characters don't align well positionally.&lt;br&gt;
Token Sort Ratio solves this by tokenizing both strings, sorting the tokens alphabetically, rejoining them, and then computing the ratio. Both "fuzzy wuzzy" and "wuzzy fuzzy" become "fuzzy wuzzy" after sorting, yielding a perfect score of 100.0. This makes it ideal for scenarios where the same information appears in a different order, like "Ahmet Yılmaz" vs "Yılmaz Ahmet."&lt;/p&gt;
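A token-sort sketch is easy to build on top of any character-level ratio. The version below uses the standard library's `difflib.SequenceMatcher`, so its basic-ratio values won't exactly match HyperFuzz's indel-based ratio, but the order-insensitivity it demonstrates is the same:

```python
from difflib import SequenceMatcher

# Basic character-level ratio on a 0-100 scale (difflib-based approximation).
def simple_ratio(s1: str, s2: str) -> float:
    return 100 * SequenceMatcher(None, s1, s2).ratio()

# Token sort: lowercase, split, sort, rejoin, then compare normally.
def token_sort_ratio(s1: str, s2: str) -> float:
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(sort_tokens(s1), sort_tokens(s2))

print(token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy"))  # → 100.0
print(simple_ratio("fuzzy wuzzy", "wuzzy fuzzy"))      # lower: characters misalign
```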

&lt;p&gt;Token Set Ratio goes further by decomposing strings into three components: common tokens, tokens unique to string 1, and tokens unique to string 2. It then computes ratios between various combinations of these components and returns the maximum. This handles cases where one string has extra words that don't disqualify it from matching. "The quick brown fox" vs "the quick fox jumped" shares {"quick", "the", "fox"} as common tokens, with "brown" and "jumped" as extras. The ratio between just the common tokens will be very high.&lt;/p&gt;

&lt;p&gt;Partial Ratio uses a sliding window approach. It takes the shorter string and slides it across the longer string, computing the ratio at each position and returning the maximum. For "test" vs "this is a test string," the 4-character window slides across positions until it lands on "test" within the longer string yielding a score of 100.0. This is invaluable when searching for whether one string is a substring of another, a common pattern in search and autocomplete.&lt;/p&gt;
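The sliding-window idea can be sketched naively with `difflib` (production implementations use smarter alignment-based windowing, but the principle is the same):

```python
from difflib import SequenceMatcher

# Naive partial ratio: slide the shorter string across the longer one and
# keep the best window score, scaled to 0-100.
def partial_ratio(query: str, text: str) -> float:
    if len(query) > len(text):
        query, text = text, query
    best = 0.0
    for start in range(len(text) - len(query) + 1):
        window = text[start:start + len(query)]
        best = max(best, SequenceMatcher(None, query, window).ratio())
    return 100 * best

print(partial_ratio("test", "this is a test string"))  # → 100.0
```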

&lt;p&gt;WRatio (Weighted Ratio) is the most sophisticated fuzz function. It examines the length ratio between the two strings and automatically selects the best strategy. If the strings are similar in length, it uses ratio or token_sort_ratio. If one is much shorter than the other, it switches to partial_ratio. It also considers token_set_ratio for reordered content and applies weights to the results. WRatio embodies the "just give me the best answer" philosophy: call one function and get an intelligent similarity score regardless of the input pattern.&lt;/p&gt;

&lt;p&gt;QRatio (Quick Ratio) is the minimalist counterpart: it lowercases both strings and returns the basic ratio. No token sorting, no partial matching, just speed.&lt;/p&gt;

&lt;p&gt;There is no single "best" string similarity algorithm - the right choice depends entirely on your problem. Here's a practical decision guide.&lt;br&gt;
For simple typo detection in short strings, Levenshtein is the most direct and well-understood option. For name matching, where the beginning of the string is typically reliable, Jaro-Winkler's prefix bonus gives it a clear edge. When word order varies but the content is the same, Token Sort Ratio eliminates order sensitivity. When you need substring matching (is one string contained within another?), Partial Ratio's sliding window is purpose-built for the task. For document-level similarity, where word frequency matters more than position, Cosine Similarity is the standard. And when you're dealing with transposition errors like "hte" instead of "the," Damerau-Levenshtein recognizes these as single-cost operations. If you're unsure which algorithm to use, WRatio is a strong default: it automatically adapts to the input. But understanding the individual algorithms lets you make informed choices when performance or accuracy in specific scenarios matters.&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Putting It Into Practice with Python: The HyperFuzz Library
&lt;/h2&gt;

&lt;p&gt;Understanding the theory is essential, but ultimately we need to apply these algorithms in code. This is where &lt;a href="https://github.com/ademakdogan/hyperfuzz" rel="noopener noreferrer"&gt;HyperFuzz&lt;/a&gt; comes in - an open-source library that implements all of the algorithms discussed in this article in Rust and exposes them to Python through PyO3. Because all computation happens in native Rust, HyperFuzz achieves dramatically better performance than pure Python implementations while maintaining a clean, Pythonic API.&lt;br&gt;
Getting started is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install hyperfuzz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distance module provides all the edit distance and character matching metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from hyperfuzz import distance

Levenshtein - the classic edit distance
distance.levenshtein_distance("kitten", "sitting")                   → 3
distance.levenshtein_normalized_similarity("kitten", "sitting")      → 0.5714
distance.levenshtein_normalized_similarity("algorithm", "algoritm")  → 0.8889

 Jaro &amp;amp; Jaro-Winkler - character matching with prefix bonus
distance.jaro_similarity("MARTHA", "MARHTA")                        → 0.9444
distance.jaro_winkler_similarity("MARTHA", "MARHTA")                → 0.9611

 Damerau-Levenshtein vs OSA - transposition handling
distance.damerau_levenshtein_distance("CA", "ABC")                  → 2
distance.osa_distance("CA", "ABC")                                  → 3

 Other distance metrics
distance.indel_distance("kitten", "sitting")                        → 5
distance.lcs_seq_distance("kitten", "sitting")                      → 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how &lt;code&gt;damerau_levenshtein_distance("CA", "ABC")&lt;/code&gt; returns 2 while osa_distance("CA", "ABC") returns 3. This is exactly the theoretical difference we discussed earlier: Damerau-Levenshtein allows multiple edits to the same substring (transpose CA→AC, then insert B), while OSA does not.&lt;br&gt;
The fuzz module provides the complete family of fuzzy ratios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python
from hyperfuzz import fuzz

# Basic ratio - character-level similarity (0-100 scale)
fuzz.ratio("fuzzy wuzzy", "wuzzy fuzzy")                # → 45.45
fuzz.ratio("hello", "hello")                             # → 100.0

# Token Sort - eliminates word order sensitivity
fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")     # → 100.0
fuzz.token_sort_ratio("New York City", "City York New")  # → 100.0

# Token Set - handles extra words gracefully
fuzz.token_set_ratio("the cat", "the cat sat on mat")   # → 100.0

# Partial Ratio - substring matching via sliding window
fuzz.partial_ratio("test", "this is a test string")      # → 100.0

# WRatio - automatic best-strategy selection
fuzz.w_ratio("New York City", "new york city")           # → 100.0

# QRatio - quick lowercase ratio
fuzz.q_ratio("hello world", "hello world")               # → 100.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand why the basic &lt;code&gt;ratio&lt;/code&gt; for "fuzzy wuzzy" vs "wuzzy fuzzy" is just 45.45, consider what happens at the character level: the strings don't align well position-by-position. The 'f' in position 0 of string 1 corresponds to 'w' in string 2, and so on. But &lt;code&gt;token_sort_ratio&lt;/code&gt; sorts the tokens alphabetically first, so both strings become "fuzzy wuzzy", and then the ratio is a trivial 100.0.&lt;br&gt;
Similarly, &lt;code&gt;partial_ratio("test", "this is a test string")&lt;/code&gt; returns 100.0 because the algorithm slides a 4-character window across the longer string. When the window reaches position 10, aligning with "test" inside "this is a test string", it finds a perfect match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python
from hyperfuzz import (
    jaccard_similarity,
    sorensen_dice_similarity,
    cosine_similarity,
    overlap_similarity,
    tversky_similarity,
)

# Jaccard — intersection over union of token sets
jaccard_similarity("hello world", "world hello")         # → 1.0
jaccard_similarity("the cat sat", "the dog sat")         # → 0.5

# Sørensen-Dice — more weight on intersection
sorensen_dice_similarity("the cat sat", "the dog sat")   # → 0.6667

# Cosine — vector-based similarity, frequency-aware
cosine_similarity("data science", "science data")        # → 1.0

# Overlap — subset detection
overlap_similarity("hello", "hello world test")          # → 1.0

# Tversky — asymmetric similarity
tversky_similarity("the cat", "the cat sat on a mat")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Jaccard result for "the cat sat" vs "the dog sat" is 0.5 because the token sets share {"the", "sat"} (2 tokens) out of a union {"the", "cat", "sat", "dog"} (4 tokens): 2/4 = 0.5. The Dice coefficient for the same pair is 0.6667 because it uses a different formula: 2×2 / (3+3) = 4/6. Both are correct; they simply weight the intersection differently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python
from hyperfuzz import (
    smith_waterman_score,
    needleman_wunsch_score,
    lcs_str_similarity,
)

# Smith-Waterman - best local alignment score
smith_waterman_score("the quick brown fox", "the quick red fox")

# Needleman-Wunsch - global alignment score
needleman_wunsch_score("ACGT", "ACGT")

# Longest Common Substring similarity
lcs_str_similarity("the quick brown fox", "the quick red fox")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you need to compare thousands or millions of string pairs, calling functions one by one becomes a bottleneck, not because of algorithm speed but because of Python function-call overhead. HyperFuzz solves this with batch variants that use Rust's Rayon thread pool for native multi-core parallelism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python
from hyperfuzz import jaccard_similarity_batch

pairs = [
    ("hello world", "world hello"),
    ("foo bar", "baz qux"),
    ("test", "test"),
    ("Python", "Rust"),
    ("New York", "new york"),
]

scores = jaccard_similarity_batch(pairs)  # → [1.0, 0.0, 1.0, 0.0, 1.0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The batch functions release Python's GIL (Global Interpreter Lock) through PyO3, allowing Rust threads to run at full efficiency across all CPU cores. For large-scale deduplication or matching tasks, this can mean orders-of-magnitude speedups compared to sequential Python loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Real-World Example: Financial Transaction Matching
&lt;/h3&gt;

&lt;p&gt;Let's put everything together with a practical example. Imagine you work at a fintech company and need to match bank transaction descriptions against your accounting system records. The bank might record a transaction as "AMAZON MARKETPLACE EU SARL" while your accounting system has "Amazon Marketplace EU."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python
from hyperfuzz import fuzz, jaccard_similarity

bank_record = "AMAZON MARKETPLACE EU SARL"
accounting  = "Amazon Marketplace EU"

print(f"ratio:            {fuzz.ratio(bank_record, accounting):.1f}")
print(f"token_sort_ratio: {fuzz.token_sort_ratio(bank_record, accounting):.1f}")
print(f"token_set_ratio:  {fuzz.token_set_ratio(bank_record, accounting):.1f}")
print(f"partial_ratio:    {fuzz.partial_ratio(bank_record, accounting):.1f}")
print(f"w_ratio:          {fuzz.w_ratio(bank_record, accounting):.1f}")
print(f"jaccard:          {jaccard_similarity(bank_record, accounting):.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this scenario, token_set_ratio and partial_ratio will produce the highest scores. Token Set Ratio recognizes that "AMAZON", "MARKETPLACE", and "EU" are common tokens; the extra "SARL" is isolated as a difference but doesn't penalize the core match. Partial Ratio slides the shorter string across the longer one and finds that "Amazon Marketplace EU" is almost entirely contained within "AMAZON MARKETPLACE EU SARL." The basic ratio, by contrast, will be lower because the raw character-level alignment is thrown off by the extra word and case differences.&lt;br&gt;
Understanding which metric to use isn't just academic: it directly affects whether your system correctly matches 95% or 99.5% of transactions. In production, many systems run multiple metrics in parallel and use the results as features in a scoring model, rather than relying on a single threshold from a single metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;String similarity algorithms are indispensable tools in modern software engineering. As we've seen throughout this article, the landscape is rich and varied: Levenshtein and its variants measure edit distance at the character level. Jaro-Winkler excels at name matching with its prefix bonus. Jaccard and Cosine compare token sets and frequency vectors, ignoring word order entirely. Smith-Waterman and Needleman-Wunsch bring the power of sequence alignment from bioinformatics. And the fuzzy matching family (ratio, token sort/set, partial ratio, WRatio) intelligently combines these primitives into practical, ready-to-use tools.&lt;br&gt;
The key takeaway is that there is no universal "best" algorithm. The right choice depends on your data and your problem. But understanding how each algorithm works (what it measures, what it ignores, and where it excels) empowers you to make informed decisions that directly impact the accuracy and performance of your systems.&lt;br&gt;
Libraries like HyperFuzz make these algorithms accessible in Python while delivering native Rust performance. Whether you're deduplicating customer records, building a search engine, matching financial transactions, or analyzing biological sequences, the combination of Python's ease of use and Rust's computational speed gives you the best of both worlds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project Repo: &lt;a href="https://github.com/ademakdogan/hyperfuzz" rel="noopener noreferrer"&gt;&lt;strong&gt;https://github.com/ademakdogan/hyperfuzz&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/ademakdogan" rel="noopener noreferrer"&gt;&lt;strong&gt;https://github.com/ademakdogan&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/adem-akdo%C4%9Fan-948334177/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/adem-akdo%C4%9Fan-948334177/&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>algorithms</category>
      <category>datascience</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Science of Prompt Optimization and Automated Refinement</title>
      <dc:creator>Adem Akdoğan</dc:creator>
      <pubDate>Fri, 06 Feb 2026 17:33:40 +0000</pubDate>
      <link>https://dev.to/ademakdogan/the-science-of-prompt-optimization-and-automated-refinement-3pkn</link>
      <guid>https://dev.to/ademakdogan/the-science-of-prompt-optimization-and-automated-refinement-3pkn</guid>
      <description>&lt;p&gt;In the rapidly evolving landscape of Large Language Models (LLMs), the prompt has evolved far beyond a simple text input. It has become the instruction set, the compiler, and the interface for modern AI applications. As we transition from playful chat interfaces to deterministic production pipelines, the "art" of prompt engineering is being forced to mature into a rigorous "science". We can no longer afford to treat prompts as magic spells; we must treat them as code components that require optimization, versioning, and architectural stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Technical Debt of Sub-Optimal Prompts
&lt;/h2&gt;

&lt;p&gt;When deploying LLMs in production—specifically for structured tasks like Named Entity Recognition (NER), complex data transformation, or classification—the quality of the prompt dictates the Unit Economics and Reliability of the entire system. A sub-optimal prompt is not merely a cosmetic issue; it represents significant technical debt that manifests in three critical dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Economics of Token Consumption and Latency&lt;/strong&gt;&lt;br&gt;
Every single token in your system message acts as a recurring tax on your infrastructure. It is easy to overlook the impact of a verbose prompt during the prototyping phase, but at scale, the implications are severe. Consider a prompt that carries just 500 unnecessary tokens of "fluff" or redundant instructions. If your application processes one million requests per month, you are effectively processing 500 million phantom tokens. On high-performance models like GPT-4o or Claude 3.5 Sonnet, this directly translates into thousands of dollars in wasted compute every month.&lt;/p&gt;

&lt;p&gt;Beyond direct financial costs, there is a near-linear relationship between input size and latency. The Time-to-First-Token (TTFT) and total generation time are physically constrained by the attention mechanism's need to process the context window. Bloated prompts increase the cognitive load on the model, forcing it to attend to irrelevant information. This "attention dilution" frequently results in slower inference times and a sluggish user experience. In real-time applications, the difference between a concise, optimized prompt and a verbose one can often be measured in hundreds of milliseconds of latency per call.&lt;/p&gt;
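&lt;p&gt;The recurring-tax argument is easy to verify with back-of-envelope arithmetic. The sketch below assumes a hypothetical input price of $3 per million tokens; actual rates vary by model and provider:&lt;/p&gt;

```python
# Back-of-envelope cost of prompt bloat (illustrative price, not a current rate)
extra_tokens = 500              # redundant tokens carried in every request
requests_per_month = 1_000_000
price_per_million_input = 3.0   # USD per 1M input tokens (hypothetical)

phantom_tokens = extra_tokens * requests_per_month            # tokens wasted per month
wasted_usd = phantom_tokens / 1_000_000 * price_per_million_input
```

With these numbers, 500 million phantom tokens translate into $1,500 of pure waste every month before a single useful token is processed.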

&lt;p&gt;&lt;strong&gt;2. The Accuracy-Variance Trade-off and Semantic Drift&lt;/strong&gt;&lt;br&gt;
A more subtle but dangerous issue is the phenomenon of semantic drift. When a prompt is written loosely (e.g., "Please extract the names from this text"), it relies heavily on the model's vast probabilistic training priors rather than specific adherence to your constraints. While this might work for standard inputs, it introduces a high degree of variance. The model is essentially guessing your intent based on what it has seen in its training data, rather than following a strict logic path.&lt;/p&gt;

&lt;p&gt;This reliance on "vibes" rather than explicit instructions makes the system fragile to edge cases. A prompt that works perfectly for 90% of standard inputs may fail catastrophically when encountering null values, unusual formatting, or unexpected characters. Furthermore, scientific analysis suggests that over-constrained or conflicting instructions—common artifacts of manual prompt editing—actually increase the probability of hallucinations. As the model attempts to reconcile ambiguity in the prompt with the input data, it often fabricates information to satisfy what it perceives as the user's contradictory requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prompt Versioning: Treating English as Code&lt;/strong&gt;&lt;br&gt;
In a robust MLOps pipeline, prompts must be treated with the same rigor as compiled code. A prompt is not static text; it is a function definition where the words are the parameters. Changing a single adjective in a prompt can alter the high-dimensional vector space in which the model operates, potentially changing the output schematic entirely. This fragility demands that we adopt "Prompt as Code" methodologies.&lt;/p&gt;

&lt;p&gt;This means every prompt must be immutable and versioned, ideally distinguished by a hash of its content. We must implement regression testing where strictly defined "Golden Datasets" are used to validate performance changes. A change that improves readability for a human developer might degrade performance for the model, or worse, introduce a regression in a previously solved edge case. Therefore, deployment strategies should mirror standard software engineering practices, including A/B testing different prompt versions (V1 vs. V2) against varying metrics like Parse Success Rate, F1 Score, and Token Efficiency.&lt;/p&gt;
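&lt;p&gt;A minimal sketch of content-addressed prompt versioning, assuming SHA-256 as the hash. The function name is illustrative, not part of any specific tool:&lt;/p&gt;

```python
import hashlib

def prompt_version(prompt: str) -> str:
    # Content-addressed, immutable version ID: any edit yields a new version
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

v1 = "Extract all person names from the text. Return JSON."
v2 = "Extract all person names from the text. Return strict JSON."
# Changing a single word produces a distinct version identifier
assert prompt_version(v1) != prompt_version(v2)
```

Storing this hash alongside each evaluation run makes regression results traceable to the exact prompt text that produced them.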
&lt;h2&gt;
  
  
  Prompt Optimizer: Automated Iterative Refinement
&lt;/h2&gt;

&lt;p&gt;The current industry standard for prompt engineering—a cycle of "Write, Test, Edit, Repeat"—is fundamentally inefficient and prone to human bias. Humans are notoriously poor at high-dimensional optimization problems. We tend to fix one instruction (e.g., "extract emails correctly") while accidentally breaking another (e.g., "don't include brackets in the output"). To address this, we developed Prompt Optimizer, a tool designed to solve this problem by treating prompt engineering as an algorithmic optimization task rather than a creative writing exercise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Stochastic Manual Tuning&lt;/strong&gt;&lt;br&gt;
When a human tunes a prompt, they are often hill-climbing blindly. They make a change based on a single failure case, potentially degrading the prompt's performance on the broader dataset. We needed a system that could look at the aggregate performance across a batch of data and make statistically significant adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: A Mentor-Agent Feedback Loop&lt;/strong&gt;&lt;br&gt;
prompt-optimizer implements a feedback loop inspired by Reinforcement Learning from Human Feedback (RLHF), but it automates the "Human" component using a specialized "Mentor" LLM. The architecture consists of three distinct components working in a continuous loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Agent (The Actor): This model attempts to solve the task using the current version of the prompt P(t). It processes a batch of inputs and generates outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Evaluator (The Critic): This component compares the Agent's output against a Ground Truth (JSON) dataset. It calculates precise metrics, including Accuracy Score (Exact match or semantic similarity) and identifies specific formatting errors or hallucinations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Mentor (The Optimizer): This model analyzes the diff between the expected output and the actual output. It looks at the specific failure modes—why did the Agent fail? Was it a formatting error? Did it miss a calculation?—and generates a new prompt P(t+1) specifically designed to correct these errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
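&lt;p&gt;The three-component loop can be sketched as follows. Every callable here is a hypothetical stub standing in for a real LLM call, not prompt-optimizer's actual API; the toy task merely shows the Mentor repairing the Agent's prompt from observed failures:&lt;/p&gt;

```python
def optimize(prompt, dataset, agent, evaluator, mentor, loops=5):
    # Agent/Evaluator/Mentor loop sketch: all callables are stand-ins for LLM calls
    history = []
    for _ in range(loops):
        outputs = [agent(prompt, x) for x, _ in dataset]               # Agent acts with P(t)
        score, failures = evaluator(outputs, [y for _, y in dataset])  # Critic measures
        history.append((prompt, score))
        if score == 1.0:               # perfect on this batch: stop early
            break
        prompt = mentor(prompt, failures)                              # Mentor writes P(t+1)
    return max(history, key=lambda h: h[1])                            # best (prompt, score)

# Toy stand-ins: the task is to upper-case the input text
def agent(p, x):
    return x.upper() if "UPPERCASE" in p else x

def evaluator(outs, golds):
    fails = [(o, g) for o, g in zip(outs, golds) if o != g]
    return 1 - len(fails) / len(golds), fails

def mentor(p, fails):
    return p + " Return the text in UPPERCASE."

best_prompt, best_score = optimize(
    "Echo the text.", [("ab", "AB"), ("cd", "CD")], agent, evaluator, mentor
)
```

In this toy run the initial prompt scores 0%, the Mentor appends the missing instruction, and the second iteration reaches 100%.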

&lt;p&gt;Experiments show that accuracy is often not enough. A prompt that is accurate but 2000 tokens long is not production-ready. A unique feature of prompt-optimizer is its selection algorithm. When two iterations achieve identical accuracy (e.g., both reach 100% on the test set), the system strictly selects the shortest prompt. The optimization function effectively becomes Max(Accuracy) subject to Min(Tokens). This ensures that the final production prompt is not only highly accurate but also cost-optimized for long-term deployment.&lt;/p&gt;
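&lt;p&gt;The selection rule described above, Max(Accuracy) then Min(Tokens), can be expressed in a few lines. This is a sketch: real token counting would use the model's tokenizer rather than whitespace splitting:&lt;/p&gt;

```python
def select_production_prompt(candidates):
    # candidates: (prompt_text, accuracy) pairs from successive iterations.
    # Maximize accuracy; break ties by minimizing token count (here: whitespace tokens).
    return max(candidates, key=lambda c: (c[1], -len(c[0].split())))

cands = [
    ("A very long and redundant instruction " * 20, 1.0),
    ("Extract fields. Return strict JSON.", 1.0),
    ("Short but weak prompt.", 0.8),
]
best = select_production_prompt(cands)
# Both 100%-accurate candidates tie on accuracy, so the shorter one wins
```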
&lt;h3&gt;
  
  
  Technical Deep Dive &amp;amp; Usage
&lt;/h3&gt;

&lt;p&gt;The project is structured to work with any OpenRouter, OpenAI, or Anthropic model, allowing developers to optimize prompts for specific model architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining the Schema&lt;/strong&gt;&lt;br&gt;
Instead of relying on fragile regex parsing or hopeful instructions, prompt-optimizer uses Pydantic models to define the contract. This serves as the ground truth for what the Agent is expected to produce, enforcing strict typing and structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class ExtractionSchema(BaseModel):
    client_name: str
    total_gross: float
    # The system uses this type hint to enforce strict JSON output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core loop is designed to be model-agnostic. It takes your input data and ground truth, and iteratively refines the prompt. One of the most challenging tasks for LLMs is simultaneous extraction and transformation—for example, reading a raw invoice string and calculating a total that isn't explicitly stated.&lt;/p&gt;

&lt;p&gt;Consider an input like "Vendor: TechCorp, Base: $1000, Tax: 10%". The goal is to extract the Vendor and calculate the Total ($1100). A human might write a simple prompt like "Read the text and calculate the total." This often fails because it lacks specific guidance on order of operations or output format.&lt;/p&gt;

&lt;p&gt;The Prompt Optimizer, however, treats this as a learning problem. After 3 iterations of seeing failures, it might generate a highly specific instruction: "Extract the 'Vendor'. Identify 'Base' and 'Tax' values. Calculate 'Total' as Base * (1 + Tax). Return strictly standard JSON." This level of precision is discovered through trial and error, not guessed by a human.&lt;/p&gt;
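&lt;p&gt;For reference, the deterministic version of this extraction-plus-transformation task looks like the following sketch; it is what the optimized prompt ultimately asks the model to emulate. The regex patterns are illustrative and tied to this exact input format:&lt;/p&gt;

```python
import re

def extract_invoice(text: str) -> dict:
    # Sketch of the extraction + calculation task described above
    vendor = re.search(r"Vendor:\s*([^,]+)", text).group(1).strip()
    base = float(re.search(r"Base:\s*\$([\d.]+)", text).group(1))
    tax = float(re.search(r"Tax:\s*([\d.]+)%", text).group(1)) / 100
    return {"vendor": vendor, "total": base * (1 + tax)}  # Total = Base * (1 + Tax)

result = extract_invoice("Vendor: TechCorp, Base: $1000, Tax: 10%")
# vendor 'TechCorp', total 1100.0 (up to float rounding)
```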

&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;&lt;br&gt;
Running the optimizer is straightforward. You define your data and run the CLI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone
git clone https://github.com/ademakdogan/prompt-optimizer.git

# Run with your dataset
make optimize DATA=resources/my_dataset.json SAMPLES=5 LOOPS=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system output provides a real-time view of the learning process, showing exactly how the Mentor is correcting the Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Accuracy 38.3% (Initial guess)
Iteration 2: Accuracy 65.0% (Mentor corrected format)
Iteration 3: Accuracy 93.3% (Mentor corrected calculation logic)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;As we build more complex Agentic workflows, we cannot rely on "vibes-based" prompting. The difference between a demo and a production application often lies in the reliability of the prompts powering it. Tools like prompt-optimizer represent the necessary shift towards Automated Prompt Engineering (APE), where we define the outcome (data) and let the system architecture search for the optimal instruction (prompt).&lt;/p&gt;

&lt;p&gt;Github Link: &lt;a href="https://github.com/ademakdogan/prompt-optimizer" rel="noopener noreferrer"&gt;https://github.com/ademakdogan/prompt-optimizer&lt;/a&gt;&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/in/adem-akdo%C4%9Fan-948334177/" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Practical Image Process with OpenCV</title>
      <dc:creator>Adem Akdoğan</dc:creator>
      <pubDate>Thu, 18 May 2023 07:51:34 +0000</pubDate>
      <link>https://dev.to/ademakdogan/practical-image-process-with-opencv-2dc2</link>
      <guid>https://dev.to/ademakdogan/practical-image-process-with-opencv-2dc2</guid>
      <description>&lt;p&gt;*** If you want to read my article on the medium, you can click &lt;a href="https://medium.com/@adem.akdogan/opencv-k%C3%BCt%C3%BCphanesi-ile-g%C3%B6r%C3%BCnt%C3%BC-i%CC%87%C5%9Fleme-uygulamal%C4%B1-af50033f7d8"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image processing is, at its core, the set of operations that extract useful features from images. It is applied to both still images and video, and these techniques are frequently used to make training more successful in deep learning pipelines.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Processing
&lt;/h2&gt;

&lt;p&gt;Image processing begins with how computers represent the data. First, a matrix is created for the image: each pixel value is stored as one element of this matrix. For example, a 200x200 matrix is created for a 200x200 picture; if the image is in color, the shape becomes 200x200x3 (RGB). In fact, every manipulation in image processing is a matrix operation. Suppose a blur is desired: a filter moves over the matrix, changing either all of its elements or only part of them. As a result, the required part, or the whole, of the image becomes blurred.&lt;/p&gt;
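&lt;p&gt;The "blur is a matrix operation" idea can be demonstrated without OpenCV. The sketch below applies a 3x3 mean filter to a tiny NumPy matrix standing in for a grayscale image:&lt;/p&gt;

```python
import numpy as np

# A tiny 4x4 "grayscale image": each entry is one pixel intensity (0-255)
img = np.array([
    [10,  10,  10, 10],
    [10, 200, 200, 10],
    [10, 200, 200, 10],
    [10,  10,  10, 10],
], dtype=float)

# Blurring really is a matrix operation: replace each interior pixel
# with the mean of its 3x3 neighbourhood (borders left untouched here)
blurred = img.copy()
for i in range(1, img.shape[0] - 1):
    for j in range(1, img.shape[1] - 1):
        blurred[i, j] = img[i - 1:i + 2, j - 1:j + 2].mean()
```

After the loop, the bright interior pixels are pulled toward their darker neighbours, which is exactly the softening effect a blur filter produces.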

&lt;p&gt;Image processing is needed in many situations [1]. Typically, these operations are applied to image data that will be fed into deep learning models. For example, in some projects color carries no useful signal; in that case, training on color images only wastes compute and can hurt performance. One of the most widely used deep learning architectures for images is the Convolutional Neural Network, which learns the attributes required for training through its convolutional layers. At this point, only certain parts of the training images may need to be processed, and emphasizing softer, rounder contours over sharp lines can sometimes improve training success.&lt;br&gt;
In such cases, image processing techniques are used. You can &lt;a href="https://neptune.ai/blog/image-processing-python"&gt;click here&lt;/a&gt; for more information about image processing [9].&lt;/p&gt;

&lt;p&gt;The same logic underlies the image optimization programs we use in daily life. Image processing covers many operations, such as improving image quality, restoring damaged images, removing noise, and histogram equalization.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;OpenCV&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenCV is one of the most popular libraries for image processing [2]. Many companies use it, including Microsoft, Intel, Google, and Yahoo. OpenCV supports a wide variety of programming languages, such as Java, C++, Python, and MATLAB. All of the samples in this article are written in Python.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2
from matplotlib import pyplot as plt
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;First, the libraries are imported. Some OpenCV functions do not work reliably in every environment; one of these is "imshow", which lets us see the changes in the image as a result of our operations. For readers who hit such problems, the matplotlib library is used as an alternative throughout this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVQniMB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2560/1%2AxBb9dtIT9N4QmZInvHtUWg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVQniMB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2560/1%2AxBb9dtIT9N4QmZInvHtUWg.jpeg" alt="Figure 1. Standard Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The processes to be performed will be applied on the image shown above (Figure 1). The image is read initially so that it can be processed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img_path = "/Users/..../opencv/road.jpeg"
img = cv2.imread(img_path)
print(img.shape)

&amp;gt;&amp;gt;&amp;gt;(960, 1280, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The image in Figure 1 is 960 x 1280 pixels, and printing the shape after reading yields (960, 1280, 3): a matrix matching the image dimensions was created, with each element holding one pixel value. The third dimension is 3 because the color image stores separate R, G, and B channels.&lt;/p&gt;

&lt;p&gt;If we want to convert the image to grayscale, the cvtColor function is used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If we want to see the change that occurs as a result of this function, we use the imshow function from matplotlib.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(gray_image, cmap="gray")  # use a gray colormap; matplotlib's default colormap would tint the image
plt.show()
print(gray_image.shape)

&amp;gt;&amp;gt;&amp;gt;(960, 1280)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HAnNmjLL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2956/1%2AJ--5IGoThr8IsESXVZVB4Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HAnNmjLL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2956/1%2AJ--5IGoThr8IsESXVZVB4Q.png" alt="Figure 2. Black-White Image" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in Figure 2, the image has been converted to grayscale. When we check its shape, the third RGB dimension is gone.&lt;br&gt;
Looking at the matrix values of the image, we see that they range from 0 to 255. In some cases, we may want the matrix to contain only the values 0 and 255 [3]. The threshold function is used in such cases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(thresh, blackAndWhiteImage) = cv2.threshold(gray_image, 20, 255, cv2.THRESH_BINARY)
(thresh, blackAndWhiteImage) = cv2.threshold(gray_image, 80, 255, cv2.THRESH_BINARY)
(thresh, blackAndWhiteImage) = cv2.threshold(gray_image, 160, 255, cv2.THRESH_BINARY)
(thresh, blackAndWhiteImage) = cv2.threshold(gray_image, 200, 255, cv2.THRESH_BINARY)
plt.imshow(blackAndWhiteImage)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n3clEUhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2936/1%2Afd0ox6ciXJx3x3YXS9WUbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n3clEUhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2936/1%2Afd0ox6ciXJx3x3YXS9WUbQ.png" alt="Figure 3. Image with Threshold Function Applied" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first parameter of OpenCV's threshold function is the image to be processed. The second is the threshold value, and the third is the value assigned to matrix elements that exceed the threshold. The effects of four different threshold values can be seen in Figure 3. In the first image (Image 1), the threshold was set to 20: all values above 20 are assigned 255 and the rest are set to 0, so only black and very dark colors remain black while all other shades become white. Image 2 and Image 3 use thresholds of 80 and 160. Finally, Image 4 uses a threshold of 200; unlike Image 1, only white and very light colors are assigned 255 while everything else is set to 0. The threshold value must be tuned for each image and each use case.&lt;/p&gt;
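&lt;p&gt;The semantics of THRESH_BINARY can be reproduced on a small NumPy array. This sketch mirrors the documented rule that values strictly above the threshold become the max value and all others become 0:&lt;/p&gt;

```python
import numpy as np

gray = np.array([[0, 50, 120],
                 [180, 220, 255]], dtype=np.uint8)

def threshold_binary(image, thresh, maxval=255):
    # Mirrors cv2.threshold(..., cv2.THRESH_BINARY): pixels strictly above
    # `thresh` become `maxval`, everything else becomes 0
    return np.where(image > thresh, maxval, 0).astype(np.uint8)

low = threshold_binary(gray, 20)    # almost everything survives as white
high = threshold_binary(gray, 200)  # only the brightest pixels stay white
```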

&lt;p&gt;Another method used in image processing is blurring. This can be accomplished with more than one function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output2 = cv2.blur(gray_image, (10, 10))
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dLkGGfo6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2AEcARjuPej_7SzPOLP06iqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dLkGGfo6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2AEcARjuPej_7SzPOLP06iqg.png" alt="Figure 4.Blurred Image with blur Function" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output2 = cv2.GaussianBlur(gray_image, (9, 9), 5)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--daWajzVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUNPrUTWxiqnb3r7-zPrW7Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--daWajzVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUNPrUTWxiqnb3r7-zPrW7Q.png" alt="Figure 5. Blurred Image with GaussianBlur Function" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in Figure 4 and Figure 5, the grayscale image is blurred with the specified filters and blur strengths. This process is usually used to remove noise in images. Also, sharp lines in images can sometimes hurt training, and blurring is used in those cases for the same reason.&lt;/p&gt;

&lt;p&gt;In some cases, the data may need to be rotated for augmentation, or the images to be used as data may be skewed. The following functions can be used in such cases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(h, w) = img.shape[:2]
center = (w / 2, h / 2)
M = cv2.getRotationMatrix2D(center, 13, scale=1.1)
rotated = cv2.warpAffine(gray_image, M, (w, h))
plt.imshow(rotated)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8GOaoHwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2936/1%2AQTBegQlambmrsIW5WvGQOw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8GOaoHwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2936/1%2AQTBegQlambmrsIW5WvGQOw.png" alt="Figure 6. Rotated Image with getRotationMatrix2D Function" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the center of the image is determined, and the rotation is performed around that center. The first parameter of the getRotationMatrix2D function is the computed center, the second is the rotation angle in degrees, and the third is the scaling factor applied along with the rotation. If the scale is set to 1, the image is only rotated by the given angle, without any scaling.&lt;/p&gt;
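&lt;p&gt;For the mathematically curious, the matrix returned by getRotationMatrix2D follows the affine-rotation formula in OpenCV's documentation, which can be reconstructed by hand. The center coordinates below are illustrative:&lt;/p&gt;

```python
import numpy as np

def rotation_matrix_2d(center, angle_deg, scale=1.0):
    # Rotation by angle_deg around `center`, then uniform scaling:
    # the same formula OpenCV documents for getRotationMatrix2D
    cx, cy = center
    a = scale * np.cos(np.radians(angle_deg))
    b = scale * np.sin(np.radians(angle_deg))
    return np.array([
        [a,  b, (1 - a) * cx - b * cy],
        [-b, a, b * cx + (1 - a) * cy],
    ])

# Illustrative center; in the article's code it is (w / 2, h / 2)
M = rotation_matrix_2d((640.0, 480.0), 13, scale=1.1)
# The rotation center maps to itself under this affine transform
```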

&lt;h3&gt;
  
  
  Sample 1
&lt;/h3&gt;

&lt;p&gt;The methods mentioned above are often used together in projects. Let’s build a sample project for a better understanding of these structures and processes.&lt;br&gt;
Say we want to train an autonomous driving pilot for vehicles [4]. Examining the image in Figure 1 for this problem, our autonomous pilot should be able to recognize the road and its lanes. We can use OpenCV for this. Since color does not matter here, the image is converted to grayscale, and the matrix elements are set to 0 or 255 according to a chosen threshold. As mentioned in the explanation of the threshold function, selecting the threshold value is critical; it is set to 200 for this problem, since focusing on road edges and lanes is enough and the other details can be cleared away. To get rid of noise, blurring is performed with the GaussianBlur function. The steps up to here can be examined in detail in Figures 1 through 5.&lt;/p&gt;

&lt;p&gt;After these processes, Canny edge detection is applied.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img = cv2.imread(img_path)
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(thresh, output2) = cv2.threshold(gray_image, 200, 255, cv2.THRESH_BINARY)
output2 = cv2.GaussianBlur(output2, (5, 5), 3)
output2 = cv2.Canny(output2, 180, 255)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O4AAy-6a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2964/1%2A-Hzjy3GPjVvSeExar_K0_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O4AAy-6a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2964/1%2A-Hzjy3GPjVvSeExar_K0_Q.png" alt="Figure 7. Image of Canny Function Result" width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first parameter the Canny function takes is the image to process; the second is the low threshold and the third is the high threshold. The image is scanned pixel by pixel for edges based on gradient strength: pixels whose gradient exceeds the high threshold are accepted as strong edges, pixels below the low threshold are discarded, and pixels between the two thresholds are kept only if they connect to a strong edge (hysteresis). For this reason, the threshold parameters must be tuned for each image and each problem. To better observe the effect of GaussianBlur, let’s repeat the same operations, this time without blurring.&lt;/p&gt;
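&lt;p&gt;The double-threshold rule can be sketched in a few lines of plain Python. This is a simplified classification of gradient magnitudes, not a full Canny implementation; the connectivity check that decides the fate of weak edges is omitted:&lt;/p&gt;

```python
def classify_edges(gradients, low, high):
    # Canny-style double thresholding: magnitudes >= high are strong edges,
    # those in [low, high) are weak (kept only if linked to a strong edge),
    # and those below low are suppressed outright
    labels = []
    for g in gradients:
        if g >= high:
            labels.append("strong")
        elif g >= low:
            labels.append("weak")
        else:
            labels.append("suppressed")
    return labels

labels = classify_edges([30, 200, 255], low=180, high=255)
```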

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img = cv2.imread(img_path)
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(thresh, output2) = cv2.threshold(gray_image, 200, 255, cv2.THRESH_BINARY)
output2 = cv2.Canny(output2, 180, 255)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4TqykK-c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A6E3ISLHYTjkuT5heQSk3HA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4TqykK-c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A6E3ISLHYTjkuT5heQSk3HA.png" alt="Figure 8. Non-Blur Image" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the GaussianBlur function is not applied, the noise is clearly visible in Figure 8. This noise may not be a problem for our project, but it can have a great impact on training success in other projects and situations. After this stage, operations are performed on the real (original) image based on the detected edges, using the HoughLinesP and line functions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lines = cv2.HoughLinesP(output2, 1, np.pi/180,30)
for line in lines:
    x1,y1,x2,y2 = line[0]
    cv2.line(img,(x1,y1),(x2,y2),(0,255,0),4)
plt.imshow(img)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hTYYn99n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2960/1%2Afp9cAk4inn8DT0iiE1MfgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hTYYn99n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2960/1%2Afp9cAk4inn8DT0iiE1MfgA.png" alt="Figure 9. Image with HoughLinesP Function applied" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in Figure 9, road boundaries and lanes were detected nicely. However, a careful look at Figure 9 reveals a problem: although lane and road boundaries were found correctly, the clouds were also perceived as road boundaries. The masking method should be used to prevent this [5].&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def mask_of_image(image):
    height = image.shape[0]
    polygons = np.array([[(0,height),(2200,height),(250,100)]])
    mask = np.zeros_like(image)
    cv2.fillPoly(mask,polygons,255)
    masked_image = cv2.bitwise_and(image,mask)
    return masked_image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can perform the masking with the mask_of_image function. First, the area to keep is defined as a polygon. The polygon’s vertex coordinates are entirely data-specific values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qqlX8hYo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2960/1%2ALeIorBZTEV1FIx0HzTh81g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qqlX8hYo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2960/1%2ALeIorBZTEV1FIx0HzTh81g.png" alt="Figure 10. Determined area for masking" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mask (Figure 10) is applied to the original picture. The regions corresponding to the black area of the mask are left untouched in the original image, while all of the operations above are applied to the regions corresponding to the white area.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tjgMS_kb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2ATtKyVxFidtQTw1nEK_V7Aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tjgMS_kb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2ATtKyVxFidtQTw1nEK_V7Aw.png" alt="Figure 11. Masking Applied Image" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in Figure 11, the masking process solved the problem we saw with the clouds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample 2
&lt;/h3&gt;

&lt;p&gt;We solved the lane detection problem with HoughLinesP. Now let’s apply a similar approach to circular shapes [6].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TiV1f6AP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AL5jzyYrDmdsuTZZ5QYwifA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TiV1f6AP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AL5jzyYrDmdsuTZZ5QYwifA.png" alt="[Figure 12. Coins Image [8]](https://www.mathworks.com/help/examples/images/win64/GetAxesContainingImageExample_01.png)" width="474" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s build an image-processing pipeline that recognizes the coins in Figure 12. The methods used in the lane detection project will be used here as well.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img = cv2.imread("/Users/.../coin.png")
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(thresh, output2) = cv2.threshold(gray_image, 120, 255, cv2.THRESH_BINARY)
output2 = cv2.GaussianBlur(output2, (5, 5), 1)
output2 = cv2.Canny(output2, 180, 255)
plt.imshow(output2, cmap = plt.get_cmap("gray"))

circles = cv2.HoughCircles(output2,cv2.HOUGH_GRADIENT,1,10,                       param1=180,param2=27,minRadius=20,maxRadius=60)
circles = np.uint16(np.around(circles))
for i in circles[0,:]:
    # draw the outer circle
    cv2.circle(img,(i[0],i[1]),i[2],(0,255,0),2)
    # draw the center of the circle
    cv2.circle(img,(i[0],i[1]),2,(0,0,255),3)

plt.imshow(img)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--af-qUWsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Ad-aebBdBWSGjDqr7pzN8mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--af-qUWsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Ad-aebBdBWSGjDqr7pzN8mw.png" alt="Figure 13. Final Coins Image" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result of this processing can be seen in Figure 13. &lt;br&gt;
The image is converted to grayscale, then the threshold function is applied, followed by the GaussianBlur and Canny edge detection functions. &lt;br&gt;
Finally, the circles are drawn with the HoughCircles function.&lt;/p&gt;

&lt;p&gt;Image processing can also be applied to text stored in image format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QAg3nNgc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2812/1%2AyRpTJuECfPptWOTieSD4ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QAg3nNgc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2812/1%2AyRpTJuECfPptWOTieSD4ew.png" alt="Figure 14. Text in Image Format" width="800" height="795"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s say we want to train our system with the text seen in Figure 14. We want all words, or some specific words, to be identified by our model as a result of the training, so we may need to teach the system the position information of the words. OpenCV is also used in such problems. First, the image in Figure 14 is converted into text. An optical character recognition engine called Tesseract is used for this [7].&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pytesseract.image_to_data(img, output_type=Output.DICT, config = "--psm 6")
n_boxes = len(data['text'])
for i in range(n_boxes):
    (x, y, w, h) = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.imshow(img)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--03wRAw5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2104/1%2ASucn9hgPoXc5zq8NnmgmEQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--03wRAw5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2104/1%2ASucn9hgPoXc5zq8NnmgmEQ.png" alt="Figure 15. Processing of Word Position Information" width="800" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image shown in Figure 15 is obtained by combining the information from Tesseract with OpenCV. Each word and each block of words is enclosed in a rectangle. It is also possible to manipulate only certain words in the frame by filtering the information from Tesseract. In addition, image processing can be applied to clean noise from the text. However, the GaussianBlur function used in the other examples would adversely affect the quality and legibility of the text, so the medianBlur function is used instead.&lt;/p&gt;
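&lt;p&gt;Before moving on to noise removal, the idea of working with only certain words can be sketched by filtering the dict that pytesseract.image_to_data returns. The dict below is a hand-made stand-in for real OCR output; its words and coordinates are invented for illustration:&lt;/p&gt;

```python
# Stand-in for pytesseract.image_to_data(..., output_type=Output.DICT);
# the words and box coordinates below are invented for this sketch.
data = {
    "text":   ["Hello", "world", "Hello"],
    "left":   [10, 80, 10],
    "top":    [5, 5, 40],
    "width":  [60, 55, 60],
    "height": [20, 20, 20],
}

target = "Hello"
boxes = [
    (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
    for i, word in enumerate(data["text"])
    if word == target
]
print(boxes)  # [(10, 5, 60, 20), (10, 40, 60, 20)]
```

&lt;p&gt;Each kept box could then be drawn with cv2.rectangle exactly as in the snippet above.&lt;/p&gt;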

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img = cv2.imread(img_path)
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
output2 = cv2.medianBlur(gray_image, ksize=5)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4oIfgnuH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2120/1%2AZFIy4Uz_Ch2qVKvqPPqvbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4oIfgnuH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2120/1%2AZFIy4Uz_Ch2qVKvqPPqvbg.png" alt="Figure 16. medianBlur Function Applied Image" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Figure 14 is examined, dashed lines are clearly visible below some words; these can cause optical character recognition engines to misread words. As a result of the medianBlur process in Figure 16, it can be seen that these dashed lines are gone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Check the dimensions of the matrices of black-and-white images. They often still have three RGB channels even when they look black and white, which may cause dimension errors in some OpenCV functions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The erode and dilate functions can also be used to remove noise from text in image format.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel = np.ones((3,3),np.uint8)
output2 = cv2.dilate(gray_image,kernel,iterations = 3)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--62jMMSka--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2192/1%2AM8SWuUOkvuUOGepkTBKXNA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--62jMMSka--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2192/1%2AM8SWuUOkvuUOGepkTBKXNA.png" alt="Figure 17. The Image Resulting from the Dilated Function" width="800" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the text in Figure 14, some point-shaped noise is visible. Figure 17 shows that this noise is largely eliminated by the dilate function. The amount of thinning applied to the text can be tuned by changing the kernel and the iterations parameter; these values must be chosen carefully to preserve the readability of the text. The erode function, in contrast to dilate, thickens the text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel = np.ones((3,3),np.uint8)
output2 = cv2.erode(gray_image,kernel,iterations = 3)
plt.imshow(output2)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9N5QTyOE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2148/1%2AekZxZWsGQeV7J6C1b8BIuQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9N5QTyOE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2148/1%2AekZxZWsGQeV7J6C1b8BIuQ.png" alt="Figure 18. The Image Resulting from the Erode Function" width="800" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in Figure 18, the erode function increased the font thickness. It is generally used to improve the quality of text written in thin fonts. Another point to note here is that our text is black and our background is white. If the background were black and the text white, the effects of these functions would be reversed.&lt;/p&gt;

&lt;p&gt;OpenCV is also used to increase the quality of some images. For instance, the histogram values of an image with poor contrast are concentrated in a narrow range. &lt;br&gt;
To improve the contrast of such an image, its histogram values need to be spread over a wider range. The equalizeHist function is used for this. Let’s perform histogram equalization on the image in Figure 19.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KmxUek-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2048/1%2AmXVbHChSQmsdAmDkuWQIAw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KmxUek-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2048/1%2AmXVbHChSQmsdAmDkuWQIAw.jpeg" alt="Figure 19. Histogram Values ​​Unmodified Image (Original Image)" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AhbUdmvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5048/1%2A7kmbzLLOXqlosETrbEaKPg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AhbUdmvl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5048/1%2A7kmbzLLOXqlosETrbEaKPg.png" alt="Figure 20. Histogram Distribution of Original Image" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The histogram of the original image (Figure 19) can be seen in Figure 20. &lt;br&gt;
The visibility of the objects in the image is low.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;equ = cv2.equalizeHist(gray_image)
plt.imshow(equ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H2Wm_C75--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3324/1%2ACuDCUVJgnMBrQkwDxmSdNg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H2Wm_C75--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3324/1%2ACuDCUVJgnMBrQkwDxmSdNg.png" alt="Figure 21. Histogram Equalized Image" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nxyBO2yN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4948/1%2A044luhIeGjbOvLkHgbqBgQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nxyBO2yN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4948/1%2A044luhIeGjbOvLkHgbqBgQ.png" alt="Figure 22. Histogram Distribution of Histogram Equalized Image" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image whose histogram has been equalized with the equalizeHist function can be seen in Figure 21; the quality and clarity of the image have increased. Figure 22 shows the histogram of the equalized image: the values that were concentrated in one area in Figure 20 are spread over a much larger range after equalization. These histogram values can be checked for each image, and the quality of the image can be increased by applying histogram equalization when necessary.&lt;/p&gt;

&lt;p&gt;Github: &lt;a href="https://github.com/ademakdogan"&gt;https://github.com/ademakdogan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Linkedin : &lt;a href="https://www.linkedin.com/in/adem-akdo%C4%9Fan-948334177/"&gt;https://www.linkedin.com/in/adem-akdo%C4%9Fan-948334177/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;[1]P.Erbao, Z.Guotong, “Image Processing Technology Research of On-Line Thread Processing”, 2012 International Conference on Future Electrical Power and Energy System, April 2012.&lt;/p&gt;

&lt;p&gt;[2]H.Singh, &lt;em&gt;Practical Machine Learning and Image Processing&lt;/em&gt;, pp.63–88, January 2019.&lt;/p&gt;

&lt;p&gt;[3]R.H.Moss, S.E.Watkins, T.Jones, D.Apel, “Image thresholding in the high resolution target movement monitor”, Proceedings of SPIE — The International Society for Optical Engineering, March 2009.&lt;/p&gt;

&lt;p&gt;[4]Y.Xu, L.Zhang, “Research on Lane Detection Technology Based on OPENCV”, Conference: 2015 3rd International Conference on Mechanical Engineering and Intelligent Systems, January 2015.&lt;/p&gt;

&lt;p&gt;[5]F.J.M.Lizan, F.Llorens, M.Pujol, R.R.Aldeguer, C.Villagrá, “Working with OpenCV and Intel Image Proccessing Libraries. Proccessing image data tools”, Informática Industrial e Inteligencia Artificial, July 2002.&lt;/p&gt;

&lt;p&gt;[6]Q.R.Zhang, P.Peng, Y.M.Jin, “Cherry Picking Robot Vision Recognition System Based on OpenCV”, MATEC Web of Conferences, January 2016.&lt;/p&gt;

&lt;p&gt;[7]R.Smith, “An Overview of the Tesseract OCR Engine”, Conference: Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, Volume: 2, October 2007.&lt;/p&gt;

&lt;p&gt;[8]&lt;a href="https://www.mathworks.com/help/examples/images/win64/GetAxesContainingImageExample_01.png"&gt;https://www.mathworks.com/help/examples/images/win64/GetAxesContainingImageExample_01.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[9]&lt;a href="https://neptune.ai/blog/image-processing-python"&gt;https://neptune.ai/blog/image-processing-python&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opencv</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
