Six months ago we migrated from HubSpot to Salesforce. The migration itself went fine. Data mapped correctly, custom fields transferred, nothing broke. We celebrated for about three days.
Then our sales team started complaining. "Why do I have two records for the same company?" "Why is this contact listed three times?" "I just called someone and they said another rep already reached out this morning."
We pulled a report. 40,000 duplicate contacts. Out of roughly 95,000 total records. More than 40% of our database was duplicates.
And the thing is, most of those duplicates already existed in HubSpot. We just hadn't noticed because HubSpot's dedup was handling some of it silently. When we moved to Salesforce, all the silent duplicates became visible, and the mess that had been building for three years landed on our desk at once.
Why CRM migrations create duplicate nightmares
The duplicate problem in CRM migrations comes from multiple sources and they compound in ways that are hard to predict.
Pre-existing duplicates. Every CRM accumulates duplicates over time. Reps create new contacts instead of finding existing ones. Marketing imports lists that overlap with existing data. Web forms create new records even when the person already exists. According to Salesforce research, the average CRM database degrades at about 30% per year.
Merge conflicts during migration. When you map fields between two systems, name fields can split differently. HubSpot might have "Full Name" as one field. Salesforce might have "First Name" and "Last Name" as separate fields. The migration tool splits "Dr. Sarah Jane Smith-Williams" into first name "Dr. Sarah Jane" and last name "Smith-Williams." Meanwhile another record already exists with first name "Sarah" and last name "Smith-Williams." These don't get flagged as duplicates.
Email variations. The same person might have sarah@company.com in one record and s.williams@company.com in another. Both are valid emails for the same person. But automated dedup based on email won't catch it, because the emails are different.
Company name inconsistencies. "Acme Corp" "Acme Corporation" "ACME" "Acme Inc." All the same company. All creating separate account records.
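Normalizing company names before comparison catches most of these variations. Here's a minimal sketch in Python; the suffix list and function name are illustrative, not taken from any specific tool:

```python
import re

# Common legal suffixes to strip before comparing (illustrative list; extend for your data)
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "ltd", "co"}

def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip legal suffixes for comparison."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    core = [t for t in tokens if t not in SUFFIXES]
    return " ".join(core)

for raw in ["Acme Corp", "Acme Corporation", "ACME", "Acme Inc."]:
    print(normalize_company(raw))  # all four print "acme"
```

All four variants collapse to the same key, so a simple exact match on the normalized name groups them into one account.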
What 40,000 duplicates actually costs
This isn't just a cosmetic problem. Duplicate records have real financial impact.
Sales team productivity. Our reps were spending an average of 30 minutes a day dealing with duplicate-related issues. Finding the right record, merging duplicates they stumbled on, apologizing to prospects who got contacted twice. For a team of 12 reps, that's 6 hours of wasted time per day. That's like having a full-time employee who does nothing but clean up data.
Email marketing costs. We were paying for 95,000 contacts in our email platform. If 40,000 were duplicates, we were overpaying by roughly 42%. At our per-contact rate, that was about $800/month in wasted email costs.
Reporting accuracy. Our pipeline reports were inflated. Lead counts were wrong. Attribution was broken. When the same person exists as three different leads, your funnel metrics are fiction.
A Gartner study estimated that poor data quality costs organizations an average of $12.9 million annually. For a company our size, duplicates alone were probably a six-figure problem.
The dedup approach that doesn't work
Our first attempt at fixing this was Salesforce's built-in duplicate management. You set up matching rules (match on email, match on name + company) and it flags potential duplicates.
The problem: it found about 8,000 duplicates based on exact email match. That's helpful, but it missed the other 32,000 that had different emails, slightly different names, or variations in company names. Exact matching catches the easy duplicates and misses the hard ones.
Our second attempt was a manual review project. We assigned two ops people to go through flagged duplicates and merge them. After a week they had processed about 2,000 records and were losing their minds. At that rate, the project would take five months and cost more than just living with the duplicates.
Third attempt: we bought a Salesforce dedup app from the AppExchange. $200/month. It was better than the built-in tools but still relied heavily on exact matching. It caught maybe 60% of our duplicates. The other 40% (the ones with name variations, different emails, partial information) still required manual review.
Why fuzzy matching changes everything
The breakthrough came when we stopped trying to find exact matches and started looking for fuzzy matches with confidence scores.
Instead of asking "is this record identical to that record?" we asked "how similar are these records, and how confident are we that they represent the same entity?"
A fuzzy dedup approach looks at multiple fields simultaneously:
- Name similarity (using algorithms like Jaro-Winkler that can handle "Sarah Williams" matching "S. Williams")
- Company similarity ("Acme Corp" matching "Acme Corporation Inc")
- Phone number matching (ignoring formatting differences)
- Address similarity (handling abbreviations and format variations)
- Email domain matching (two records at @acmecorp.com are more likely to be from the same company)
Each field contributes to an overall confidence score. Two records might not match on any single field exactly, but when you combine name similarity of 85%, same company domain, and a phone number that's off by one digit, the confidence that they're the same person is very high.
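Here's a rough sketch of that scoring idea. The stdlib's `difflib.SequenceMatcher` stands in for Jaro-Winkler (libraries like jellyfish implement the real algorithm), and the field weights, record shape, and sample data are all illustrative assumptions, not values from any particular tool:

```python
import re
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """String similarity in [0, 1]. Stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def digits(phone: str) -> str:
    """Strip phone formatting so (555) 010-1234 equals 555-010-1234."""
    return re.sub(r"\D", "", phone)

def match_confidence(rec_a: dict, rec_b: dict) -> float:
    """Weighted blend of per-field similarities (weights are illustrative)."""
    name_score = sim(rec_a["name"], rec_b["name"])
    company_score = sim(rec_a["company"], rec_b["company"])
    phone_score = sim(digits(rec_a["phone"]), digits(rec_b["phone"]))
    same_domain = (rec_a["email"].split("@")[-1].lower()
                   == rec_b["email"].split("@")[-1].lower())
    return (0.4 * name_score + 0.25 * company_score
            + 0.2 * phone_score + 0.15 * float(same_domain))

a = {"name": "Sarah Williams", "company": "Acme Corp",
     "phone": "(555) 010-1234", "email": "sarah@acmecorp.com"}
b = {"name": "S. Williams", "company": "Acme Corporation",
     "phone": "555-010-1234", "email": "s.williams@acmecorp.com"}
print(round(match_confidence(a, b), 2))  # high score despite no exact field match
```

No single field matches exactly, yet the combined score lands well above any sensible duplicate threshold. That's the whole point of multi-field fuzzy matching.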
This is exactly the problem I built DataReconIQ to solve. Upload your export, select which columns to compare, and it returns clustered duplicates with confidence scores. Multi-field fuzzy dedup without writing any code.
The dedup playbook
After going through this mess, here's the process I'd recommend for anyone doing a CRM migration or tackling an existing duplicate problem.
Step 1: Export and baseline. Export your entire contact database. Count total records. This is your "before" number.
Step 2: Exact dedup first. Remove exact duplicates (same email, same phone, identical names). This is the easy stuff and reduces your dataset for the harder matching.
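The exact pass can be as simple as keying on a normalized field. A minimal sketch, assuming email is your key; the record shape, field names, and keep-first policy are assumptions to adapt:

```python
def exact_dedup(records):
    """Drop records whose normalized email was already seen.
    Keeps the first occurrence (an assumption; you may prefer newest)."""
    seen, unique = set(), []
    for rec in records:
        key = rec["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

contacts = [
    {"name": "Sarah Williams", "email": "sarah@acmecorp.com"},
    {"name": "Sarah Williams", "email": "Sarah@acmecorp.com "},  # same email, different casing
    {"name": "Bob Lee", "email": "bob@example.com"},
]
print(len(exact_dedup(contacts)))  # 2
```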
Step 3: Fuzzy matching. Run fuzzy matching on the remaining records using name, company, and any other identifying fields. Get confidence scores.
Step 4: Auto-merge high confidence. Records with 95%+ confidence can usually be auto-merged. These are obvious duplicates that just have minor formatting differences.
Step 5: Human review for medium confidence. Records in the 70-94% range need a human to look at them. But instead of reviewing 40,000 records, you're reviewing maybe 3,000-5,000. Much more manageable.
Step 6: Ignore low confidence. Records below 70% similarity are probably not duplicates. Set them aside.
Step 7: Ongoing monitoring. Set up rules to prevent new duplicates from being created. This is the step most teams skip, which is why the problem comes back.
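Steps 4 through 6 amount to bucketing scored candidate pairs by threshold. A small sketch of that triage, with the thresholds from above as tunable defaults (the pair format and names are illustrative):

```python
def triage(scored_pairs, auto=0.95, review=0.70):
    """Split (pair, score) tuples into auto-merge / human review / ignore
    buckets. Thresholds match steps 4-6; tune them for your data."""
    buckets = {"auto_merge": [], "review": [], "ignore": []}
    for pair, score in scored_pairs:
        if score >= auto:
            buckets["auto_merge"].append(pair)
        elif score >= review:
            buckets["review"].append(pair)
        else:
            buckets["ignore"].append(pair)
    return buckets

scored = [(("rec1", "rec2"), 0.97), (("rec3", "rec4"), 0.82), (("rec5", "rec6"), 0.41)]
result = triage(scored)
print({k: len(v) for k, v in result.items()})  # {'auto_merge': 1, 'review': 1, 'ignore': 1}
```

The review bucket is the one that consumes human time, so it's worth tightening the auto-merge threshold only after spot-checking a sample of what it merges.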
Prevention is easier than cleanup
Honestly, the best advice I can give is: don't let it get to 40,000 duplicates in the first place. Run dedup quarterly. Set up duplicate prevention rules in your CRM. Train reps to search before creating new records.
But if you're already sitting on a mountain of duplicates (and statistically, you probably are), the approach above works. We went from 95,000 records to 62,000 clean records. Our sales team is faster. Our reporting is accurate. Our email costs dropped.
The migration created the crisis, but the duplicates had been building for years. The migration just made them impossible to ignore. And honestly, that's probably the silver lining. Better to face the problem than to keep pretending your data is clean.
According to Validity's State of CRM Data report, 44% of companies estimate they lose over 10% of annual revenue due to poor CRM data quality. Duplicates are the single biggest contributor to that loss.
If you're planning a CRM migration, budget time for dedup. If you just finished one and the numbers look suspiciously high, pull a duplicate report. You might not like what you find, but you'll be glad you looked.