DEV Community

Damian Forzani

I built an email cleaner. CSV parsing took longer than the actual validators.

I've been building databridge.so by myself for a while. It's an email list cleaner that explains every decision. Most cleaners give you back "74/100, risky" and that's all you get. You cannot audit it. So I built one where every row in the cleaned CSV carries the actual reason it was flagged.

A few things I expected to be quick that weren't.

CSV parsing ate the most time

This is the part I'd skip in any project brief. "Oh, we'll just use a CSV library." Then you get a real export from a real CRM and it has:

  • A UTF-8 BOM at the start that breaks header parsing if you don't strip it
  • Curly quotes from a Google Sheets export instead of straight quotes
  • Mixed line endings (CRLF and LF in the same file, somehow)
  • Ragged rows where the header has 5 columns and some rows have 4 or 6
  • The file is labeled UTF-8 but is actually latin1
  • Commas inside cells without proper escaping
  • A row that contains a literal newline inside a quoted cell
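The mislabeled-encoding case in the list above can be sketched roughly like this: try a strict UTF-8 decode first and fall back to latin1 only if that fails. The function name `decodeCsvBytes` is hypothetical, and the fallback is a guess rather than a detection, since any byte sequence is valid latin1 (a real file might be Windows-1252 instead).

```typescript
type DecodedCsv = { text: string; encoding: "utf-8" | "latin1" };

// Hypothetical helper: decode raw file bytes, preferring strict UTF-8.
function decodeCsvBytes(bytes: Uint8Array): DecodedCsv {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently inserting U+FFFD replacement characters.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf-8" };
  } catch {
    // Assumption: treat the file as ISO-8859-1, where every byte maps
    // to the Unicode code point with the same value.
    let text = "";
    for (let i = 0; i < bytes.length; i++) {
      text += String.fromCharCode(bytes[i]);
    }
    return { text, encoding: "latin1" };
  }
}
```

Returning the encoding that was actually used lets the caller surface it as a flag instead of hiding the guess.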

Most cleaners just reject malformed files. I spent weeks getting the parser to repair what's repairable and clearly flag what isn't. It is by far the least glamorous part of the codebase and it taught me the most.
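To make "repair what's repairable, flag what isn't" concrete, here is a minimal sketch, not the databridge.so parser. It strips a leading BOM, normalizes mixed line endings, parses with awareness of quoted cells (so commas and newlines inside quotes survive), and reports ragged rows instead of rejecting the file. All names here are hypothetical.

```typescript
type RowIssue = { row: number; expected: number; actual: number };

// Strip a leading BOM and normalize CRLF or lone CR to LF.
function normalizeCsvText(raw: string): string {
  const noBom = raw.charCodeAt(0) === 0xfeff ? raw.slice(1) : raw;
  return noBom.replace(/\r\n?/g, "\n");
}

// Minimal quote-aware parser: commas and newlines inside quoted cells
// stay in the cell, and "" inside quotes is an escaped quote.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let cell = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        cell += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        cell += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(cell);
      cell = "";
    } else if (ch === "\n") {
      row.push(cell);
      rows.push(row);
      row = [];
      cell = "";
    } else {
      cell += ch;
    }
  }
  // Flush the last record if the file does not end with a newline.
  if (cell !== "" || row.length > 0) {
    row.push(cell);
    rows.push(row);
  }
  return rows;
}

// Compare every data row against the header width and report mismatches.
// Row numbers are 1-based record numbers, not physical line numbers.
function findRaggedRows(rows: string[][]): RowIssue[] {
  const expected = rows.length > 0 ? rows[0].length : 0;
  const issues: RowIssue[] = [];
  rows.forEach((r, i) => {
    if (i > 0 && r.length !== expected) {
      issues.push({ row: i + 1, expected, actual: r.length });
    }
  });
  return issues;
}
```

The sketch leaves curly quotes and unescaped commas alone on purpose: both can be legitimate cell content, so rewriting them blindly risks changing user data.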

"Risky" without a reason is useless

The thing that bothered me most about every cleaner I tested was this. You upload 10,000 emails. You get back 6,200 valid, 2,300 risky, 1,500 invalid. Cool. Now your boss asks "why were these 2,300 flagged?" and the answer is a confidence score with no context.

I wanted the answer to be in the export. So the cleaned CSV has a column for the verdict and a column for the reason, per row. "Domain has no MX." "Likely typo on gmial.com." "Role-based address (info@)." Stuff a non-technical person can read.
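A rough sketch of what a verdict-plus-reason pair per row could look like. The type names, the role list, the injected `hasMx` lookup, and the choice of which verdict each check maps to are all assumptions for illustration; only the reason strings come from the examples above.

```typescript
type Verdict = "valid" | "risky" | "invalid";

interface CheckResult {
  verdict: Verdict;
  reason: string; // human-readable, written to its own CSV column
}

// Assumption: a small illustrative list of role-based local parts.
const ROLE_LOCAL_PARTS = new Set(["info", "admin", "support", "sales"]);

// hasMx is injected so the sketch stays free of network calls;
// a real implementation would do a DNS lookup here.
function classify(email: string, hasMx: (domain: string) => boolean): CheckResult {
  const at = email.lastIndexOf("@");
  if (at <= 0 || at === email.length - 1) {
    return { verdict: "invalid", reason: "Missing local part or domain." };
  }
  const local = email.slice(0, at).toLowerCase();
  const domain = email.slice(at + 1).toLowerCase();
  if (!hasMx(domain)) {
    return { verdict: "invalid", reason: "Domain has no MX." };
  }
  if (ROLE_LOCAL_PARTS.has(local)) {
    return { verdict: "risky", reason: "Role-based address (" + local + "@)." };
  }
  return { verdict: "valid", reason: "" };
}

// Append the verdict and reason as two extra columns on the original row.
function annotateRow(row: string[], result: CheckResult): string[] {
  return row.concat([result.verdict, result.reason]);
}
```

Keeping the reason as plain text next to the verdict means the export can be audited in a spreadsheet without any extra tooling.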

Things that surprised me

A few smaller things worth mentioning:

Typo repair is easy to do too aggressively. Fine to suggest gmial.com becomes gmail.com. Not fine to suggest a random 4-letter domain becomes anything, because the false positive rate is too high. I gated repair to longer domains and a small set of high-confidence target providers.
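One way such a gate could work, as a sketch under stated assumptions: compare the domain against a short allow-list of target providers using an edit distance that counts an adjacent swap as one edit (optimal string alignment), skip short domains entirely, and only suggest when exactly one target is a single edit away. The provider list, the length threshold, and the distance threshold are illustrative guesses, not the real values.

```typescript
// Assumption: illustrative target list and thresholds.
const TARGET_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"];
const MIN_LABEL_LENGTH = 5; // skip short names such as a 4-letter domain
const MAX_DISTANCE = 1;

// Optimal string alignment distance: insert, delete, substitute,
// or swap two adjacent characters, each costing 1.
function osaDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push(new Array<number>(b.length + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

// Returns a suggested domain, or null when no confident repair exists.
function suggestDomain(domain: string): string | null {
  const lower = domain.toLowerCase();
  if (TARGET_DOMAINS.indexOf(lower) >= 0) return null; // already a known provider
  const label = lower.split(".")[0];
  if (label.length < MIN_LABEL_LENGTH) return null; // too short to repair safely
  const candidates = TARGET_DOMAINS.filter((t) => osaDistance(lower, t) <= MAX_DISTANCE);
  // Only suggest when there is exactly one obvious fix.
  return candidates.length === 1 ? candidates[0] : null;
}
```

With these assumptions `suggestDomain("gmial.com")` suggests `gmail.com`, while a 4-letter domain like `abcd.com` gets no suggestion at all.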

Auto-repair at the file level is way more work than you think. It's not just parsing: you have to be confident enough to say "this broken row, here's what I think it should be" without overwriting actual user data. I lean conservative: repair when there's one obvious fix, flag otherwise.

Stack

Next.js, Supabase Postgres, Railway workers for the heavy validation, Upstash for queue and cache, Resend for transactional mail. Nothing exotic.

Three open questions

Genuinely curious what people think:

  1. Single deliverability score vs per-check breakdown. A single number flattens signal. A bad TLD plus a role-based address is worse than the sum of its parts. But people want one number. I expose both and let the caller pick. Not sure that's right.

  2. Plus aliases in dedup. user+tag@gmail.com vs user@gmail.com. Outreach users want them collapsed, lifecycle people want them separate. I keep them separate by default but get this question a lot.

  3. Async vs sync for small batch jobs. Right now batches are async with a webhook on completion. Some users want a sync response for under 100 rows. Worth supporting both shapes or is that just complexity?
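For the plus-alias question (item 2), one possible shape is a dedup key with an opt-in flag, so the default keeps aliases separate and outreach users can collapse them. This is a sketch, not the actual implementation: the function name and flag are hypothetical, and lowercasing the local part is a practical assumption rather than something the email standards guarantee.

```typescript
// Build the key used to decide whether two addresses are duplicates.
// collapsePlusAliases defaults to false, matching the "separate by default" choice.
function dedupKey(email: string, collapsePlusAliases: boolean = false): string {
  const trimmed = email.trim();
  const at = trimmed.lastIndexOf("@");
  if (at < 0) return trimmed.toLowerCase(); // not an address; compare as-is
  let local = trimmed.slice(0, at).toLowerCase();
  const domain = trimmed.slice(at + 1).toLowerCase();
  if (collapsePlusAliases) {
    const plus = local.indexOf("+");
    // Keep the part before the first "+", but never produce an empty local part.
    if (plus > 0) local = local.slice(0, plus);
  }
  return local + "@" + domain;
}
```

With the flag off, `user+tag@gmail.com` and `user@gmail.com` produce different keys; with it on, both map to `user@gmail.com`.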

It's at databridge.so if you want to poke at it. Happy to take pushback in the comments.
