<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damian Forzani</title>
    <description>The latest articles on DEV Community by Damian Forzani (@damianfz).</description>
    <link>https://dev.to/damianfz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944115%2F98c80bc5-9ba8-4328-8f23-e5a20d5b12e9.png</url>
      <title>DEV Community: Damian Forzani</title>
      <link>https://dev.to/damianfz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/damianfz"/>
    <language>en</language>
    <item>
      <title>I built an email cleaner. CSV parsing took longer than the actual validators.</title>
      <dc:creator>Damian Forzani</dc:creator>
      <pubDate>Thu, 21 May 2026 13:16:10 +0000</pubDate>
      <link>https://dev.to/damianfz/i-built-an-email-cleaner-csv-parsing-took-longer-than-the-actual-validators-58jb</link>
      <guid>https://dev.to/damianfz/i-built-an-email-cleaner-csv-parsing-took-longer-than-the-actual-validators-58jb</guid>
      <description>&lt;p&gt;I've been building &lt;a href="https://databridge.so" rel="noopener noreferrer"&gt;databridge.so&lt;/a&gt; by myself for a while. It's an email list cleaner that explains every decision. Most cleaners give you back "74/100, risky" and that's all you get. You cannot audit it. So I built one where every row in the cleaned CSV carries the actual reason it was flagged.&lt;/p&gt;

&lt;p&gt;A few things I expected to be quick that weren't.&lt;/p&gt;

&lt;h2&gt;CSV parsing ate the most time&lt;/h2&gt;

&lt;p&gt;This is the part I'd skip in any project brief. "Oh, we'll just use a CSV library." Then you get a real export from a real CRM and it has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A UTF-8 BOM at the start that breaks header parsing if you don't strip it&lt;/li&gt;
&lt;li&gt;Curly quotes from a Google Sheets export instead of straight quotes&lt;/li&gt;
&lt;li&gt;Mixed line endings (CRLF and LF in the same file, somehow)&lt;/li&gt;
&lt;li&gt;Ragged rows where the header has 5 columns and some rows have 4 or 6&lt;/li&gt;
&lt;li&gt;A file labeled UTF-8 that is actually Latin-1&lt;/li&gt;
&lt;li&gt;Commas inside cells without proper escaping&lt;/li&gt;
&lt;li&gt;A row that contains a literal newline inside a quoted cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most cleaners just reject malformed files. I spent weeks getting the parser to repair what's repairable and clearly flag what isn't. It is by far the least glamorous part of the codebase and it taught me the most.&lt;/p&gt;
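
&lt;p&gt;To make the cheapest items on that list concrete, here is a minimal sketch of a pre-parse cleanup pass covering the BOM, mixed line endings, and curly quotes. The function names are hypothetical and this is an illustration under simple assumptions, not the actual databridge parser.&lt;/p&gt;

```typescript
// Hypothetical helpers: a sketch of pre-parse cleanup, not production code.

/** Remove a leading byte order mark (U+FEFF), if present. */
function stripBom(text: string): string {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

/** Convert CRLF and bare CR line endings to LF. */
function normalizeLineEndings(text: string): string {
  return text.replace(/\r\n?/g, "\n");
}

/** Replace curly double quotes with straight ones so quoted cells parse. */
function straightenQuotes(text: string): string {
  return text.replace(/[\u201C\u201D]/g, '"');
}

/** Run all three fixes in a fixed order before handing the text to a CSV parser. */
function normalizeCsvText(raw: string): string {
  return straightenQuotes(normalizeLineEndings(stripBom(raw)));
}
```

&lt;p&gt;One caveat with this sketch: the blanket quote replacement also rewrites curly quotes that legitimately appear inside cell text, so a more careful version would probably only touch quotes that sit at a cell boundary.&lt;/p&gt;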

&lt;h2&gt;"Risky" without a reason is useless&lt;/h2&gt;

&lt;p&gt;The thing that bothered me most about every cleaner I tested was this. You upload 10,000 emails. You get back 6,200 valid, 2,300 risky, 1,500 invalid. Cool. Now your boss asks "why were these 2,300 flagged?" and the answer is a confidence score with no context.&lt;/p&gt;

&lt;p&gt;I wanted the answer to be in the export. So the cleaned CSV has a column for the verdict and a column for the reason, per row. "Domain has no MX." "Likely typo on gmial.com." "Role-based address (info@)." Stuff a non-technical person can read.&lt;/p&gt;
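
&lt;p&gt;As a rough illustration of that export shape, the sketch below writes one verdict and one reason per row. The &lt;code&gt;CleanedRow&lt;/code&gt; type, the column names, and the helper names are assumptions made for this example; the real export may be laid out differently.&lt;/p&gt;

```typescript
// Assumed row shape and column names, for illustration only.
type Verdict = "valid" | "risky" | "invalid";

interface CleanedRow {
  email: string;
  verdict: Verdict;
  reason: string; // plain-language explanation; may be empty for valid rows
}

/** Quote a field when it contains a comma, a double quote, or a line break. */
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

/** Serialize rows to CSV text with a header line. */
function toCsv(rows: CleanedRow[]): string {
  const header = "email,verdict,reason";
  const lines = rows.map((r) => [r.email, r.verdict, r.reason].map(csvField).join(","));
  return [header].concat(lines).join("\n");
}
```

&lt;p&gt;Reasons are free text, so they need the same escaping as any other cell: a reason that contains a comma or a quote would otherwise break the very file it is supposed to explain.&lt;/p&gt;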

&lt;h2&gt;Things that surprised me&lt;/h2&gt;

&lt;p&gt;A few smaller things worth mentioning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typo repair is easy to do too aggressively.&lt;/strong&gt; It's fine to suggest that &lt;code&gt;gmial.com&lt;/code&gt; becomes &lt;code&gt;gmail.com&lt;/code&gt;. It's not fine to suggest that a random 4-letter domain becomes anything, because the false positive rate is too high. I gated repair to longer domains and a small set of high-confidence target providers.&lt;/p&gt;
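
&lt;p&gt;A minimal sketch of that kind of gating is below. The provider list, the length threshold, and the "one edit or one adjacent swap" rule are all assumptions picked for illustration; the post does not state the real values or the real distance measure.&lt;/p&gt;

```typescript
// Assumed values for illustration: the real provider list and threshold are not given.
const TARGET_PROVIDERS: string[] = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"];
const MIN_LABEL_LENGTH = 5; // skip domain labels of 4 letters or fewer

/** Length of the shared prefix of a and b. */
function commonPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i !== a.length) {
    if (a.charAt(i) !== b.charAt(i)) break;
    i++;
  }
  return i;
}

/** True when swapping one adjacent pair of characters in a yields b. */
function isAdjacentSwap(a: string, b: string): boolean {
  if (a.length !== b.length || a === b) return false;
  const i = commonPrefixLength(a, b);
  const swapped = a.slice(0, i) + a.charAt(i + 1) + a.charAt(i) + a.slice(i + 2);
  return swapped === b;
}

/** True when a and b differ by one inserted, deleted, or substituted character. */
function isSingleEdit(a: string, b: string): boolean {
  if (a === b || Math.abs(a.length - b.length) > 1) return false;
  const shorter = a.length > b.length ? b : a;
  const longer = a.length > b.length ? a : b;
  const i = commonPrefixLength(shorter, longer);
  // After the first mismatch, skip one character in the longer string, and also in
  // the shorter one when the lengths match (substitution rather than insertion).
  const restShorter = shorter.slice(shorter.length === longer.length ? i + 1 : i);
  return restShorter === longer.slice(i + 1);
}

/** Suggest a provider domain, or null when there is no single confident fix. */
function suggestDomain(domain: string): string | null {
  const d = domain.trim().toLowerCase();
  const dot = d.indexOf(".");
  const label = dot === -1 ? d : d.slice(0, dot);
  if (MIN_LABEL_LENGTH > label.length) return null; // too short to repair safely
  if (TARGET_PROVIDERS.indexOf(d) !== -1) return null; // already a known provider
  const matches = TARGET_PROVIDERS.filter((p) => isAdjacentSwap(d, p) || isSingleEdit(d, p));
  return matches.length === 1 ? (matches[0] ?? null) : null;
}
```

&lt;p&gt;With these assumed settings, &lt;code&gt;suggestDomain("gmial.com")&lt;/code&gt; proposes &lt;code&gt;gmail.com&lt;/code&gt;, while &lt;code&gt;mail.com&lt;/code&gt; is left alone even though it is one character away from a target, because its label is only four letters long.&lt;/p&gt;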

&lt;p&gt;&lt;strong&gt;Auto-repair at the file level is way more work than you think.&lt;/strong&gt; It's not just parsing. You have to be confident enough to say "this broken row, here's what I think it should be" without overwriting actual user data. I lean conservative: repair when there's one obvious fix, flag otherwise.&lt;/p&gt;
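
&lt;p&gt;Here is one way the "repair when there's one obvious fix, flag otherwise" rule could look for ragged rows. This is a sketch under my own assumptions: the &lt;code&gt;RowResult&lt;/code&gt; type and &lt;code&gt;checkRowWidth&lt;/code&gt; are hypothetical names, and treating empty trailing cells as the only safe repair is a deliberately narrow choice for the example.&lt;/p&gt;

```typescript
// Hypothetical types and names: a sketch of conservative row repair.
type RowResult =
  | { status: "ok"; cells: string[] }
  | { status: "repaired"; cells: string[]; note: string }
  | { status: "flagged"; cells: string[]; note: string };

/** Compare a parsed row against the header width and repair only the unambiguous case. */
function checkRowWidth(cells: string[], width: number): RowResult {
  if (cells.length === width) return { status: "ok", cells };
  if (cells.length > width) {
    const extras = cells.slice(width);
    // Extra cells that are all empty carry no data, so dropping them loses nothing.
    if (extras.every((c) => c.trim() === "")) {
      return { status: "repaired", cells: cells.slice(0, width), note: "Dropped empty trailing cells." };
    }
  }
  // Short rows and rows with non-empty extras are ambiguous: leave the data untouched.
  return { status: "flagged", cells, note: "Expected " + width + " columns, found " + cells.length + "." };
}
```

&lt;p&gt;A short row is flagged rather than padded here because there is no way to tell from the row alone which column is missing.&lt;/p&gt;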

&lt;h2&gt;Stack&lt;/h2&gt;

&lt;p&gt;Next.js, Supabase Postgres, Railway workers for the heavy validation, Upstash for queue and cache, Resend for transactional mail. Nothing exotic.&lt;/p&gt;

&lt;h2&gt;Three open questions&lt;/h2&gt;

&lt;p&gt;Genuinely curious what people think:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single deliverability score vs per-check breakdown.&lt;/strong&gt; A single number flattens signal. &lt;code&gt;bad TLD + role-based&lt;/code&gt; is worse than the sum of its parts. But people want one number. I expose both and let the caller pick. Not sure that's right.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plus aliases in dedup.&lt;/strong&gt; &lt;code&gt;user+tag@gmail.com&lt;/code&gt; vs &lt;code&gt;user@gmail.com&lt;/code&gt;. Outreach users want them collapsed; lifecycle people want them separate. I keep them separate by default, but I get this question a lot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Async vs sync for small batch jobs.&lt;/strong&gt; Right now batches are async with a webhook on completion. Some users want a sync response for under 100 rows. Worth supporting both shapes or is that just complexity?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
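
&lt;p&gt;For the plus-alias question, the trade-off is easier to see in code. The sketch below makes collapsing an opt-in flag and keeps aliases separate by default, as described in the list above. The function names are hypothetical, and lowercasing the whole address is an assumption that matches how most providers behave rather than what the mail standards strictly require.&lt;/p&gt;

```typescript
// Hypothetical helpers: a sketch of opt-in plus-alias collapsing for dedup.

/** Build the comparison key for an address. Aliases stay distinct unless collapsePlus is true. */
function dedupKey(email: string, collapsePlus: boolean = false): string {
  const normalized = email.trim().toLowerCase(); // assumption: treat addresses as case-insensitive
  const at = normalized.lastIndexOf("@");
  if (at === -1 || !collapsePlus) return normalized;
  const local = normalized.slice(0, at);
  const plus = local.indexOf("+");
  const base = plus === -1 ? local : local.slice(0, plus);
  return base + normalized.slice(at);
}

/** Keep the first address seen for each key, preserving input order. */
function dedupe(emails: string[], collapsePlus: boolean = false): string[] {
  const seen: { [key: string]: boolean } = Object.create(null);
  const kept: string[] = [];
  emails.forEach((email) => {
    const key = dedupKey(email, collapsePlus);
    if (!seen[key]) {
      seen[key] = true;
      kept.push(email);
    }
  });
  return kept;
}
```

&lt;p&gt;With the flag off, &lt;code&gt;user+tag@gmail.com&lt;/code&gt; and &lt;code&gt;user@gmail.com&lt;/code&gt; both survive; with it on, only the first one seen is kept. Plus addressing is a provider convention, so a stricter version might collapse only for domains known to support it.&lt;/p&gt;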

&lt;p&gt;The link is below if you want to poke at it. Happy to take pushback in the comments.&lt;/p&gt;

&lt;p&gt;databridge.so&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>typescript</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
