Originally published at bensevern.dev
10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)

Every data engineer writes throwaway scripts to fix the same problems. Phone numbers in 15 formats. Dates that aren't dates. "N/A" and "null" and "" all meaning the same thing. The scripts are always slightly different, never reusable, and break when the data changes.

Here are the 10 problems we see in every dataset, what they actually look like, and how to fix each one.

1. Phone numbers in 15 formats

Your CRM has phone numbers entered by humans over a decade. No two look the same.

| Before | After |
| --- | --- |
| (555) 123-4567 | +15551234567 |
| 555.123.4567 | +15551234567 |
| +1-555-123-4567 | +15551234567 |
| 5551234567 | +15551234567 |
| 1 (555) 123-4567 | +15551234567 |

```bash
goldenflow transform contacts.csv
```

Zero-config mode detects phone columns and normalizes to E.164 automatically. Every downstream system — Twilio, Salesforce, your matching pipeline — expects E.164. Do it once at the source.
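The digit-crunching behind that normalization is small enough to sketch. Here is a minimal, US-only version of the heuristic (an illustration of the idea, not GoldenFlow's implementation):

```python
import re

def to_e164_us(raw: str) -> str:
    """Normalize a US phone number to E.164 (+1XXXXXXXXXX).

    Strips every non-digit, drops an optional leading country code,
    and leaves anything that isn't a 10-digit US number unchanged
    so a human can find and fix it.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return raw
    return "+1" + digits
```

International prefixes, extensions, and vanity numbers are exactly the mess a library exists to absorb; the sketch just shows why E.164 is cheap to standardize on.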

2. Mojibake from bad CSV exports

Someone exported from Excel on a Windows machine, someone else opened it on a Mac, and now your text looks like this:

| Before | After |
| --- | --- |
| cafÃ© | café |
| itâ€™s | it's |
| rÃ©sumÃ© | résumé |
| naÃ¯ve | naïve |

This happens when UTF-8 bytes get decoded as Latin-1. The fix is to re-encode:


```python
import goldenflow
from goldenflow import GoldenFlowConfig, TransformSpec

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="description", ops=["fix_mojibake"]),
])
result = goldenflow.transform_df(df, config=config)  # df is your DataFrame
```

fix_mojibake tries the Latin-1 → UTF-8 round-trip. If the text wasn't actually garbled, it returns the original unchanged.
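In stdlib Python the round-trip is only a few lines. A sketch of the technique (not GoldenFlow's actual code):

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Re-encode as Latin-1, re-decode as UTF-8. If either step fails,
    the text wasn't garbled this way, so return it untouched.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

In practice cp1252 (a near-superset of Latin-1) catches a few extra characters like € and ™; the fallback is what keeps the transform safe to run on clean text.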

3. European vs US decimal formats

You're merging data from a German office and a US office. One writes 1.234,56 (one thousand two hundred thirty-four and 56 cents). The other writes 1,234.56. Same number, incompatible formats.

| Before (European) | After |
| --- | --- |
| 1.234,56 | 1234.56 |
| 99,99 | 99.99 |
| 10.000,00 | 10000.0 |
| 3.14 | 3.14 |

comma_decimal is smart about this — if there's no comma in the value, it parses as-is (so 3.14 stays 3.14, not 314). It only applies European conversion when a comma is actually present.
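The heuristic is easy to picture as plain Python (a sketch of the rule, not the library's implementation):

```python
def comma_decimal(value: str) -> str:
    """Convert European decimal notation to US notation.

    Only treat the value as European when a comma is present;
    otherwise pass it through, so "3.14" stays "3.14".
    """
    if "," not in value:
        return value
    # "1.234,56" -> drop thousands dots, comma becomes the decimal point
    return value.replace(".", "").replace(",", ".")
```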

4. Gmail dots and +tags hiding duplicates

Gmail ignores dots in the local part and everything after a +. So, to pick an illustrative address, jane.doe@gmail.com, janedoe@gmail.com, jane.doe+promo@gmail.com, and j.a.n.e.doe@gmail.com all deliver to the same inbox.

Your dedup pipeline sees four different emails. Your customer sees one inbox and four copies of the same marketing email.

```python
from goldenflow import GoldenFlowConfig, TransformSpec

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="email", ops=["email_normalize"]),
])
```

This only strips dots for Gmail/Googlemail domains — other providers treat dots as significant.
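The provider-aware rule is simple to sketch in plain Python (illustrative only; the real transform handles more edge cases):

```python
def email_normalize(email: str) -> str:
    """Canonicalize an email address for dedup.

    Lowercases everywhere; for Gmail/Googlemail only, drops dots in
    the local part and anything after '+'. Other providers may treat
    both as significant, so they are left alone.
    """
    local, _, domain = email.lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"
```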

5. "Bob" and "Robert" are the same person

You're matching records across systems. One has "Bob Smith" and the other has "Robert Smith". Fuzzy matching gives you a 60% score — not enough to auto-match, too high to ignore. You end up manually reviewing thousands of maybe-matches.

| Before | After |
| --- | --- |
| Bob | Robert |
| Bill | William |
| Jim | James |
| Mike | Michael |
| Dick | Richard |
| Liz | Elizabeth |
| Peggy | Margaret |

nickname_standardize maps 50+ common English nicknames to their formal equivalents. Run it before dedup and that 60% match becomes 100%. This was built specifically to improve GoldenMatch results — normalizing nicknames before blocking and scoring eliminates an entire class of false negatives.
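Conceptually it's a lookup table with pass-through. A tiny illustrative subset in plain Python (the actual transform ships 50+ mappings):

```python
# Illustrative subset of common English nickname mappings.
NICKNAMES = {
    "bob": "Robert", "bill": "William", "jim": "James",
    "mike": "Michael", "dick": "Richard", "liz": "Elizabeth",
    "peggy": "Margaret",
}

def nickname_standardize(name: str) -> str:
    """Map a single first name to its formal equivalent; unknown names pass through."""
    return NICKNAMES.get(name.strip().lower(), name)
```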

6. Country names in 6 variants

Your international dataset has country names entered by humans in different languages, abbreviations, and conventions:

| Before | After |
| --- | --- |
| United States | US |
| United States of America | US |
| USA | US |
| United Kingdom | GB |
| Great Britain | GB |
| Deutschland | DE |

country_standardize maps common country names, abbreviations, and local names to ISO 3166-1 alpha-2 codes. Unknown values pass through unchanged — you can spot them by filtering for values longer than 2 characters.
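The same lookup-with-pass-through shape applies here. An illustrative subset in plain Python (not the library's table):

```python
# Illustrative subset: aliases to ISO 3166-1 alpha-2 codes.
COUNTRY_ALIASES = {
    "united states": "US", "united states of america": "US", "usa": "US",
    "united kingdom": "GB", "great britain": "GB",
    "deutschland": "DE",
}

def country_standardize(value: str) -> str:
    """Map a country name or alias to its alpha-2 code; unknowns pass through."""
    return COUNTRY_ALIASES.get(value.strip().lower(), value)
```

Because unknowns pass through at their original length, `[v for v in column if len(v) > 2]` gives you the audit list.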

7. SSNs stored as raw digits

Someone dumped SSNs into a spreadsheet without formatting. Now they're sitting in a column as 123456789 — no dashes, no masking, ready to be accidentally emailed to the wrong person.

| Transform | Before | After |
| --- | --- | --- |
| ssn_format | 123456789 | 123-45-6789 |
| ssn_mask | 123456789 | `***-**-6789` |
| ssn_mask | 123-45-6789 | `***-**-6789` |

```bash
goldenflow transform employees.csv -c config.yaml
```

```yaml
# config.yaml
transforms:
  - column: ssn
    ops: [ssn_mask]
```

ssn_format normalizes to the standard dashed format. ssn_mask redacts everything except the last four digits. Both handle raw digits, dashed formats, and spaced formats. Invalid SSNs (wrong digit count) are preserved as-is so you can find and fix them.
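The masking rule, sketched in plain Python (an illustration, not the library's implementation):

```python
import re

def ssn_mask(value: str) -> str:
    """Redact an SSN to ***-**-XXXX, keeping only the last four digits.

    Accepts raw (123456789), dashed, or spaced input. Anything that
    doesn't contain exactly nine digits is returned unchanged so bad
    rows stay visible.
    """
    digits = re.sub(r"\D", "", value)
    if len(digits) != 9:
        return value
    return "***-**-" + digits[-4:]
```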

8. Dates that aren't dates

Your date column has actual dates, free text someone typed into a date field, and blanks masquerading as dates:

| Before | date_validate | date_iso8601 |
| --- | --- | --- |
| 2024-03-15 | true | 2024-03-15 |
| March 15, 2024 | true | 2024-03-15 |
| 03/15/2024 | true | 2024-03-15 |
| not a date | false | not a date |
| TBD | false | TBD |

Run date_validate first to flag the garbage, then date_iso8601 to normalize the real dates. The invalid values pass through unchanged — they don't silently become null or today's date.
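A plain-Python sketch of that normalize-or-pass-through behavior, using only the formats from the table above (the library recognizes many more):

```python
from datetime import datetime

# Illustrative format list; real-world date columns need a longer one.
FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")

def date_iso8601(value: str) -> str:
    """Normalize recognizable dates to ISO 8601; pass garbage through unchanged."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value
```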

9. HTML tags and emoji in scraped data

You scraped product descriptions or customer reviews. The raw data is full of markup and emoji that will break your NLP pipeline:

| Before | After |
| --- | --- |
| `<p>Great product!</p>` | Great product! |
| `<b>Bold</b> claim` | Bold claim |
| Love this! 😀👍 | Love this! |
| `Check <a href="...">here</a>` | Check here |

remove_html_tags strips all HTML tags. remove_emojis strips emoji characters. Chain them: ops: ["remove_html_tags", "remove_emojis", "collapse_whitespace"].
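A rough plain-Python version of that chain (the regexes are a sketch; real HTML and the full set of emoji ranges are messier than this):

```python
import re

def clean_text(s: str) -> str:
    """remove_html_tags + remove_emojis + collapse_whitespace, sketched."""
    s = re.sub(r"<[^>]+>", "", s)                                # strip HTML tags
    s = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", s)   # common emoji ranges
    return re.sub(r"\s+", " ", s).strip()                        # collapse whitespace
```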

10. "N/A", "null", "none", "-", ""

The classic. Five different ways to say "this field is empty" and none of them are actual nulls. Your aggregations count them as values. Your joins fail silently.

| Before | After |
| --- | --- |
| N/A | null |
| null | null |
| none | null |
| NA | null |
| - | null |
| (empty string) | null |

```python
# Zero-config handles this automatically
result = goldenflow.transform_df(df)
```

null_standardize is one of the auto-apply transforms — zero-config mode runs it on every string column without you asking. It catches the variants that trip up every pipeline.
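At its core this transform is just a token set (a sketch, not the library's actual list):

```python
# Illustrative set of "fake null" tokens, compared case-insensitively.
NULL_TOKENS = {"n/a", "null", "none", "na", "-", ""}

def null_standardize(value):
    """Map the common fake-null strings to a real None; everything else passes through."""
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value
```

With real `None` values in place, aggregations skip them and joins fail loudly instead of silently.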


All of this is in GoldenFlow v1.1.0

```bash
pip install goldenflow
goldenflow transform your_data.csv
```

v1.1.0 ships 76 transforms across 11 categories — including the email, identifier, and URL modules that are new in this release. If you're already using GoldenCheck for data profiling, the --from-findings flag now actually works (it was silently broken in v1.0.0 — the finding-to-transform mapping used the wrong check names).

Full release notes: v1.1.0 on GitHub

