Snappy Tools

Posted on May 30

Deduplication Techniques for Text, Lists, and Data: When to Use Which

#webdev #productivity #javascript #tools

Removing duplicates is one of those tasks that sounds trivial until you encounter it at scale. Here's a systematic guide to deduplication — from quick browser-based tools to programmatic approaches for large datasets.

The quick browser approach

For plain text lists — a list of email addresses, domain names, keywords, URLs, or IP addresses — the fastest approach is to paste into an online tool, enable deduplication, and copy the result. No code, no terminal, no file I/O.

The Remove duplicate lines tool handles the most common cases: exact-match deduplication, case-insensitive deduplication, whitespace-trimmed deduplication, and sorted output. For most day-to-day deduplication tasks (cleaning a keyword list, deduplicating a subscriber export), this is the right tool.
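For reference, the options listed above map onto a few lines of code. The following is a minimal Python sketch, not the tool's actual implementation; the function name `dedupe_lines` and its parameters are hypothetical.

```python
def dedupe_lines(text, ignore_case=False, trim=False, sort_output=False):
    """Return the unique lines of `text`, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in text.splitlines():
        kept = line.strip() if trim else line
        # Only the comparison key is lowercased; the kept line keeps its casing.
        key = kept.lower() if ignore_case else kept
        if key not in seen:
            seen.add(key)
            unique.append(kept)
    return sorted(unique) if sort_output else unique


sample = "b@x.com\nA@x.com \na@x.com\nb@x.com"
print(dedupe_lines(sample, ignore_case=True, trim=True))
# ['b@x.com', 'A@x.com']
```

With every option left off, only byte-for-byte identical lines are treated as duplicates.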

When to step up to programmatic approaches: when the list has millions of entries, when you need fuzzy matching (not just exact), or when deduplication is part of a larger data pipeline.

Deduplication in JavaScript

Array of strings:

const unique = [...new Set(arr)];

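Set preserves insertion order and handles exact-match deduplication in one line. For large arrays, this is O(n), significantly faster than the O(n²) filter-plus-indexOf approach. Note that Set compares objects by reference, so this works for primitives such as strings and numbers, not for structurally equal objects.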

Case-insensitive deduplication:

const seen = new Set();
const unique = arr.filter(item => {
  const key = item.toLowerCase().trim();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});

This preserves the first occurrence of each value (in its original casing and spacing) while filtering out duplicates that differ only in case or surrounding whitespace.

Array of objects (by a key):

const seen = new Set();
const unique = arr.filter(obj => {
  if (seen.has(obj.id)) return false;
  seen.add(obj.id);
  return true;
});

With lodash:

import _ from 'lodash';
const uniqueById = _.uniqBy(arr, 'id'); // objects, by key
const uniqueValues = _.uniq(arr);       // primitives

Deduplication in Python

List of strings:

unique = list(dict.fromkeys(items))  # preserves order (Python 3.7+)
# or
unique = list(set(items))  # doesn't preserve order

dict.fromkeys() uses the items as dictionary keys (which must be unique) and preserves the original insertion order — the recommended approach when order matters.
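A small illustration of the difference between the two approaches; the sample list is made up for the example.

```python
items = ["pear", "apple", "pear", "fig", "apple"]

# dict.fromkeys builds {"pear": None, "apple": None, "fig": None}:
# repeated keys collapse, and each key stays at its first-seen position.
ordered_unique = list(dict.fromkeys(items))
print(ordered_unique)  # ['pear', 'apple', 'fig']

# set() also removes duplicates, but its iteration order is arbitrary,
# so sort explicitly if a stable order is needed.
sorted_unique = sorted(set(items))
print(sorted_unique)  # ['apple', 'fig', 'pear']
```

Both approaches require hashable items; a list of lists, for example, would need converting to tuples first.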

Case-insensitive:

seen = set()
unique = []
for item in items:
    key = item.lower().strip()
    if key not in seen:
        seen.add(key)
        unique.append(item)

Pandas DataFrame:

import pandas as pd

# Remove duplicate rows
df = df.drop_duplicates()

# Remove duplicates by specific column
df = df.drop_duplicates(subset=['email'])

# Keep last occurrence instead of first
df = df.drop_duplicates(subset=['email'], keep='last')

# See which rows are duplicates (the first occurrence is not flagged;
# pass keep=False to flag every row in a duplicate group)
df[df.duplicated(subset=['email'])]

Deduplication in SQL

Find duplicates:

SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

Delete duplicates, keeping the row with the lowest id (MySQL rejects a subquery on the table being deleted from unless it is wrapped in a derived table):

DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY email
);
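To try the statement safely before running it on real data, it can be exercised against a throwaway in-memory SQLite database. This is a sketch using Python's standard `sqlite3` module; the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "b@example.com"),
        (5, "c@example.com"),
    ],
)

# Keep only the lowest id in each email group; delete everything else.
conn.execute(
    """
    DELETE FROM users
    WHERE id NOT IN (
        SELECT MIN(id) FROM users GROUP BY email
    )
    """
)

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(remaining)
# [(1, 'a@example.com'), (2, 'b@example.com'), (5, 'c@example.com')]
```

On a production table, running the inner SELECT on its own first, or wrapping the DELETE in a transaction, makes it easier to confirm which rows will survive.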

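PostgreSQL: Using ROW_NUMBER (keeps the most recent row per email):

WITH ranked AS (
  SELECT id,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
  FROM users
)
-- Delete from the base table; PostgreSQL cannot DELETE from a CTE.
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);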

PostgreSQL: Deduplicate into a new table:

CREATE TABLE users_clean AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC;

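DISTINCT ON is PostgreSQL-specific and very efficient for this pattern. It keeps the first row of each email group in ORDER BY order, so sorting by created_at DESC keeps the most recent row.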

Command-line deduplication

Sort and deduplicate:

sort file.txt | uniq > unique.txt

uniq only removes consecutive duplicates, so you must sort first. sort -u does both in one step:

sort -u file.txt > unique.txt

Deduplicate without sorting (preserve order):

awk '!seen[$0]++' file.txt > unique.txt

This is the canonical awk approach for order-preserving deduplication. The seen array tracks lines; !seen[$0]++ prints a line only the first time it's seen.
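The same idea can be written as a streaming generator in Python. This is a rough equivalent of the awk one-liner, not a drop-in replacement; the name `unique_lines` is hypothetical.

```python
import io


def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")  # compare content only, as awk does with $0
        if line not in seen:
            seen.add(line)
            yield line


sample = io.StringIO("b\na\nb\nc\na\n")
result = list(unique_lines(sample))
print(result)  # ['b', 'a', 'c']
```

Like the awk version, this holds every distinct line in memory, so memory use grows with the number of unique lines rather than with the file size.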

Case-insensitive:

awk '!seen[tolower($0)]++' file.txt > unique.txt

For very large files (millions of lines):

LC_ALL=C sort -u file.txt > unique.txt

LC_ALL=C makes sort compare raw bytes instead of applying locale collation rules, which is significantly faster for ASCII text.

Fuzzy deduplication

Exact-match deduplication misses near-duplicates: "new york", "New York", "new-york", "new york city". For fuzzy matching, you need similarity metrics:

Python: SequenceMatcher (standard library)

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Keep an item only if it is not more than 80% similar to one already kept
threshold = 0.8
unique = []
for item in items:
    if not any(similarity(item, kept) > threshold for kept in unique):
        unique.append(item)

This is O(n²) — fine for hundreds of items, slow for millions. For large-scale fuzzy deduplication, consider MinHash/LSH (Locality-Sensitive Hashing) or the dedupe Python library.
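To give a feel for the MinHash idea, here is a minimal sketch using only the standard library. It covers the signature step only: it estimates the Jaccard similarity of character 3-grams, and it does not include the LSH bucketing that actually avoids comparing every pair. The function names, the shingle size, and the number of hashes are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of lowercase character k-grams in `text`."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


near = estimated_jaccard(minhash_signature("new york"), minhash_signature("new york city"))
far = estimated_jaccard(minhash_signature("new york"), minhash_signature("los angeles"))
print(near, far)  # near should be clearly higher than far
```

Signatures are fixed-size, so they can be computed once per item and then grouped into buckets by bands of positions; only items sharing a bucket need a full comparison.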

Python: Levenshtein package (faster)

import Levenshtein  # third-party: pip install Levenshtein

distance = Levenshtein.distance("new york", "new-york")  # = 1
ratio = Levenshtein.ratio("new york", "New York")        # = 0.75
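To make clear what the distance measures, here is a plain-Python sketch of the classic dynamic-programming algorithm. It is for illustration only; the package above is a compiled implementation and will be far faster. The function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("new york", "new-york"))  # 1
print(levenshtein("new york", "New York"))  # 2
print(levenshtein("kitten", "sitting"))     # 3
```

The second result shows why lowercasing before comparing matters: case differences count as edits.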

Deduplication of structured data

For JSON arrays, CSV files, and database tables, exact-match deduplication by key is usually appropriate. The key decision is which record to keep when duplicates differ:

Keep first: Use for data where the earliest version is authoritative (first submission, first signup)
Keep last: Use for data where the latest version supersedes earlier ones (most recent update, last activity)
Merge: Complex — combine fields from multiple records. Usually done in application code, not SQL.
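The three strategies can be sketched in Python as follows. The sample records and function names are invented for the example, and the merge rule used here (later records win, but a missing value never overwrites real data) is just one possible policy, not a standard.

```python
records = [
    {"email": "a@example.com", "name": "Ann", "phone": "555-0199"},
    {"email": "b@example.com", "name": "Bob", "phone": "555-0100"},
    {"email": "a@example.com", "name": "Ann Lee", "phone": None},
]


def keep_first(rows, key):
    result = {}
    for row in rows:
        result.setdefault(row[key], row)  # later duplicates are ignored
    return list(result.values())


def keep_last(rows, key):
    result = {}
    for row in rows:
        result[row[key]] = row  # later duplicates overwrite earlier ones
    return list(result.values())


def merge(rows, key):
    """Assumed policy: later records win, but None never overwrites a value."""
    result = {}
    for row in rows:
        merged = result.setdefault(row[key], {})
        for field, value in row.items():
            if value is not None or field not in merged:
                merged[field] = value
    return list(result.values())


print(keep_first(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann', 'phone': '555-0199'}
print(keep_last(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann Lee', 'phone': None}
print(merge(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann Lee', 'phone': '555-0199'}
```

Only the merge variant ends up with both the newer name and the phone number, which is why it usually needs a field-by-field policy agreed on in advance.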

Summary: choosing the right approach

Situation	Tool
Plain text list, one-off	Browser deduplication tool
Array in JavaScript	`new Set()`
List in Python	`dict.fromkeys()`
DataFrame column	`df.drop_duplicates(subset=[...])`
SQL table by key	`DELETE ... WHERE id NOT IN (SELECT MIN(id)...GROUP BY ...)`
Large file	`sort -u` or `LC_ALL=C sort -u`
Fuzzy matching	Levenshtein / SequenceMatcher

The right tool depends on scale, order-preservation requirements, and whether you need exact or fuzzy matching.

DEV Community