My Python tests passed. Production still broke.
Pushed a fix to production Friday afternoon. All 47 unit tests green. Felt good about it.
Monday morning: 200+ error logs. The code worked fine in tests but crashed on real data.
What happened
Built a data parser that cleaned product prices from scraped HTML. Test suite covered edge cases: empty strings, None values, malformed numbers. Everything passed locally.
Production data had something my tests never checked: Unicode characters mixed with numbers. $19.⁹⁹ instead of $19.99. The superscript 9s came from copying prices that had footnote markers.
My regex assumed normal digits.
# What I tested
import re

def clean_price(text):
    if not text:
        return 0.0
    match = re.search(r'\d+\.?\d*', text)
    return float(match.group()) if match else 0.0
# Test that passed
assert clean_price("$19.99") == 19.99
assert clean_price("") == 0.0
assert clean_price(None) == 0.0
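The regex matched 19. and stopped at the superscript 9. Python's \d only matches decimal digits (Unicode category Nd), and superscript digits are not in that category. float("19.") is 19.0, so the parser returned 19.0 instead of 19.99. Every price with superscript cents got truncated to whole dollars.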
Fun times.
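The truncation is easy to reproduce in isolation. This is a small sketch, not part of the original parser; the variable names are mine, and the superscripts are written as escapes so nothing gets lost in copy-paste.

```python
import re
import unicodedata

raw = "$19.\u2079\u2079"  # "$19.⁹⁹" (U+2079 is SUPERSCRIPT NINE)

# \d matches category "Nd" (decimal digit). Superscript nine is "No"
# (other number), so the regex does not treat it as a digit.
superscript_category = unicodedata.category("\u2079")
ascii_category = unicodedata.category("9")

match = re.search(r'\d+\.?\d*', raw)
matched_text = match.group()   # match ends right after the decimal point
parsed = float(matched_text)   # float() accepts a trailing dot

print(superscript_category, ascii_category, repr(matched_text), parsed)
```

The important part is that nothing raises. The regex finds a perfectly valid number, just not the right one, which is why the tests and the parser both stayed quiet.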
What I changed
Added Unicode normalization before parsing. Also started logging failed parses instead of silently returning 0 (which was dumb in hindsight).
import re
import unicodedata

def clean_price(text):
    if not text:
        return 0.0
    # Normalize Unicode (superscripts → normal digits)
    normalized = unicodedata.normalize('NFKD', text)
    match = re.search(r'\d+\.?\d*', normalized)
    if match:
        return float(match.group())
    # Log the raw input instead of failing silently
    print(f"Failed to parse: {text}")
    return 0.0
NFKD normalization applies Unicode compatibility decomposition, which maps superscript and subscript digits to their regular ASCII counterparts. ⁹ becomes 9.
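To check what normalization does to one of these strings, here is a minimal sketch. The variable names are illustrative. NFKC is included only for comparison, since it applies the same compatibility mappings and gives the same result for this input.

```python
import unicodedata

raw = "$19.\u2079\u2079"  # "$19.⁹⁹"

nfkd = unicodedata.normalize("NFKD", raw)
nfkc = unicodedata.normalize("NFKC", raw)

# Subscript digits go through the same compatibility mapping.
subscript_two = unicodedata.normalize("NFKD", "\u2082")  # "₂"

print(nfkd)             # the string with plain ASCII digits
print(nfkd == nfkc)     # both forms agree for this input
print(nfkd.isascii())   # no non-ASCII characters left
print(subscript_two)
```

One thing to keep in mind: compatibility normalization rewrites more than digits. For example, ½ decomposes into 1, a fraction slash, and 2, so it is worth looking at the normalized string in the logs before trusting the parsed number.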
Also added a test with actual production data I grabbed from the error logs.
def test_unicode_prices():
    assert clean_price("$19.⁹⁹") == 19.99
    assert clean_price("$24.⁹⁵") == 24.95
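The same checks can also be kept as a table, so each new sample pulled from the logs is one extra row. This is just a sketch of one way to organize it: `CASES` and `run_cases` are hypothetical names, the parser is repeated (minus the print) so the block runs on its own, and the rows are only the inputs already shown in this post.

```python
import re
import unicodedata

def clean_price(text):
    # Same logic as the fixed parser, without the failure logging.
    if not text:
        return 0.0
    normalized = unicodedata.normalize('NFKD', text)
    match = re.search(r'\d+\.?\d*', normalized)
    return float(match.group()) if match else 0.0

# (input, expected) pairs. The last two are the production samples.
CASES = [
    ("$19.99", 19.99),
    ("", 0.0),
    (None, 0.0),
    ("$19.\u2079\u2079", 19.99),  # "$19.⁹⁹"
    ("$24.\u2079\u2075", 24.95),  # "$24.⁹⁵"
]

def run_cases():
    # Collect (input, actual, expected) for every row that does not match.
    failures = []
    for text, expected in CASES:
        actual = clean_price(text)
        if actual != expected:
            failures.append((text, actual, expected))
    return failures

print(run_cases())
```

Returning the list of failures instead of asserting on the first one makes it easier to see every broken sample at once.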
Fixed the immediate problem. Still annoyed it took a whole weekend of error logs to figure out.