DEV Community

Snappy Tools

Deduplication Techniques for Text, Lists, and Data: When to Use Which

Removing duplicates is one of those tasks that sounds trivial until you encounter it at scale. Here's a systematic guide to deduplication — from quick browser-based tools to programmatic approaches for large datasets.

The quick browser approach

For plain text lists — a list of email addresses, domain names, keywords, URLs, or IP addresses — the fastest approach is to paste into an online tool, enable deduplication, and copy the result. No code, no terminal, no file I/O.

The Remove duplicate lines tool handles the most common cases: exact-match, case-insensitive, and whitespace-trimmed deduplication, plus sorted output. For most day-to-day tasks (cleaning a keyword list, deduplicating a subscriber export), this is the right tool.

When to step up to programmatic approaches: when the list has millions of entries, when you need fuzzy matching (not just exact), or when deduplication is part of a larger data pipeline.

Deduplication in JavaScript

Array of strings:

const unique = [...new Set(arr)];

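Set preserves insertion order and handles exact-match deduplication in one line. Building the Set is O(n) on average, which on large arrays is much faster than the common filter() + indexOf() idiom, which is O(n²). Note that Set compares objects by reference, so this one-liner is for primitives such as strings and numbers.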

Case-insensitive deduplication:

const seen = new Set();
const unique = arr.filter(item => {
  const key = item.toLowerCase().trim();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});

This keeps the first occurrence of each value (in its original casing and spacing) and drops later entries that differ only in case or in leading or trailing whitespace.

Array of objects (by a key):

const seen = new Set();
const unique = arr.filter(obj => {
  if (seen.has(obj.id)) return false;
  seen.add(obj.id);
  return true;
});

With lodash:

import _ from 'lodash';
// Array of objects: keep the first object for each id
const uniqueById = _.uniqBy(arr, 'id');
// Array of primitives
const uniqueValues = _.uniq(arr);

Deduplication in Python

List of strings:

unique = list(dict.fromkeys(items))  # preserves order (Python 3.7+)
# or
unique = list(set(items))  # doesn't preserve order

dict.fromkeys() uses the items as dictionary keys (which must be unique) and preserves the original insertion order — the recommended approach when order matters.
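As a quick illustration of the difference, here is a minimal sketch using a made-up `items` list (the sample addresses are assumptions for the example, not real data). It shows that dict.fromkeys() keeps the first occurrence of each value in its original position, while set() only guarantees uniqueness.

```python
# Hypothetical sample data, for illustration only.
items = [
    "b@example.com",
    "a@example.com",
    "b@example.com",
    "c@example.com",
    "a@example.com",
]

# dict.fromkeys() keeps the first occurrence of each item, in original order.
ordered_unique = list(dict.fromkeys(items))
print(ordered_unique)  # ['b@example.com', 'a@example.com', 'c@example.com']

# set() removes duplicates too, but its iteration order is arbitrary.
unordered_unique = set(items)
print(sorted(unordered_unique))  # sorted only to get a stable display order
```

Both approaches require the items to be hashable, so they work for strings, numbers, and tuples but not for lists or dicts.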

Case-insensitive:

seen = set()
unique = []
for item in items:
    key = item.lower().strip()
    if key not in seen:
        seen.add(key)
        unique.append(item)
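The same loop generalizes to any normalization rule if the key computation is passed in as a function. The sketch below is one way to write that; `unique_by` and the sample `emails` list are hypothetical names chosen for this example, and `casefold()` is used here as a slightly more aggressive alternative to `lower()` for non-ASCII text.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def unique_by(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item for each distinct key, preserving input order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


# Hypothetical sample data: duplicates differ only by case and whitespace.
emails = ["Ann@Example.com", " ann@example.com", "bob@example.com", "BOB@example.com "]
print(unique_by(emails, key=lambda s: s.strip().casefold()))
# ['Ann@Example.com', 'bob@example.com']
```

The same helper covers lists of dicts, for example `unique_by(rows, key=lambda r: r["id"])`.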

Pandas DataFrame:

import pandas as pd

# Remove duplicate rows
df = df.drop_duplicates()

# Remove duplicates by specific column
df = df.drop_duplicates(subset=['email'])

# Keep last occurrence instead of first
df = df.drop_duplicates(subset=['email'], keep='last')

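# Inspect duplicate rows (run this before dropping them).
# By default the first occurrence is not flagged; pass keep=False to flag every copy.
df[df.duplicated(subset=['email'])]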

Deduplication in SQL

Find duplicates:

SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

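Delete duplicates, keeping the row with the lowest id (this form works in PostgreSQL and SQLite; MySQL rejects a subquery that reads from the table being deleted from unless it is wrapped in a derived table):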

DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY email
);
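Because this statement is destructive, it can help to rehearse it on throwaway data first. The sketch below runs the same DELETE pattern against an in-memory SQLite database from Python; SQLite is only a stand-in here, the table contents are invented, and other engines may behave differently.

```python
import sqlite3

# In-memory database with a hypothetical users table, for rehearsal only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "a@example.com"),
    ],
)

# Same pattern as above: keep the lowest id per email, delete the rest.
conn.execute(
    "DELETE FROM users WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)"
)

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(remaining)  # [(1, 'a@example.com'), (2, 'b@example.com')]
conn.close()
```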

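PostgreSQL: using ROW_NUMBER() to keep the most recent row per email:

-- A CTE is not a valid DELETE target in PostgreSQL,
-- so rank the rows in the CTE and delete from users by id.
WITH ranked AS (
  SELECT id,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
  FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);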

PostgreSQL: Deduplicate into a new table:

CREATE TABLE users_clean AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC;

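DISTINCT ON is PostgreSQL-specific and concise for this pattern: it keeps the first row of each email group in ORDER BY order, which here is the most recent row per email.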

Command-line deduplication

Sort and deduplicate:

sort file.txt | uniq > unique.txt

uniq only removes consecutive duplicates, so you must sort first. sort -u does both in one step:

sort -u file.txt > unique.txt
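To see why uniq needs sorted input, here is a small Python sketch that mimics both commands with itertools.groupby, which, like uniq, only collapses runs of adjacent equal items. The `lines` list is invented sample data.

```python
from itertools import groupby

# Hypothetical input: "apple" appears again after "banana".
lines = ["apple", "apple", "banana", "apple"]

# Like `uniq`: collapse only runs of adjacent equal lines.
uniq_like = [line for line, _ in groupby(lines)]
print(uniq_like)  # ['apple', 'banana', 'apple']

# Like `sort -u`: sort first so equal lines become adjacent, then collapse.
sort_u_like = [line for line, _ in groupby(sorted(lines))]
print(sort_u_like)  # ['apple', 'banana']
```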

Deduplicate without sorting (preserve order):

awk '!seen[$0]++' file.txt > unique.txt

This is the canonical awk approach for order-preserving deduplication. The seen array tracks lines; !seen[$0]++ prints a line only the first time it's seen.
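For readers less familiar with awk, the one-liner relies on seen[$0] starting out empty (falsy): the negation is true the first time a line appears, and the post-increment makes every later lookup non-zero. A rough Python equivalent, written as a generator so it can stream a file line by line, might look like this (`unique_lines` is a hypothetical helper name):

```python
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:  # plays the role of !seen[$0]
            seen.add(line)    # plays the role of the ++
            yield line


sample = ["b", "a", "b", "c", "a"]
print(list(unique_lines(sample)))  # ['b', 'a', 'c']
```

As with the awk version, memory use grows with the number of distinct lines, because every unique line is kept in the seen set.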

Case-insensitive:

awk '!seen[tolower($0)]++' file.txt > unique.txt

For very large files (millions of lines):

LC_ALL=C sort -u file.txt > unique.txt

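LC_ALL=C makes sort compare raw bytes instead of applying locale collation rules, which is usually much faster. The output is in byte order, so for ASCII text uppercase letters sort before lowercase ones.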

Fuzzy deduplication

Exact-match deduplication misses near-duplicates: "new york", "New York", "new-york", "new york city". For fuzzy matching, you need similarity metrics:

Python: similarity ratio with difflib (standard library)

from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1] based on matching blocks (not Levenshtein distance);
    # 1.0 means the lowercased strings are identical
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Keep an item only if it is not more than 80% similar to one already kept
threshold = 0.8
unique = []
for item in items:
    if not any(similarity(item, kept) > threshold for kept in unique):
        unique.append(item)

This is O(n²) — fine for hundreds of items, slow for millions. For large-scale fuzzy deduplication, consider MinHash/LSH (Locality-Sensitive Hashing) or the dedupe Python library.

Python: Levenshtein package (faster)

import Levenshtein

distance = Levenshtein.distance("new york", "new-york")  # = 1
ratio = Levenshtein.ratio("new york", "New York")        # = 0.75 (the comparison is case-sensitive)
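To make the distance value concrete, here is a minimal pure-Python sketch of Levenshtein distance using the classic dynamic-programming recurrence. `edit_distance` is a hypothetical name for this illustration; it is not how the package is implemented, and for real workloads the compiled package will be much faster.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("new york", "new-york"))  # 1
print(edit_distance("new york", "New York"))  # 2
```

The second result shows why it is worth lowercasing before comparing: case differences count as edits.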

Deduplication of structured data

For JSON arrays, CSV files, and database tables, exact-match deduplication by key is usually appropriate. The key decision is which record to keep when duplicates differ:

  • Keep first: Use for data where the earliest version is authoritative (first submission, first signup)
  • Keep last: Use for data where the latest version supersedes earlier ones (most recent update, last activity)
  • Merge: Complex — combine fields from multiple records. Usually done in application code, not SQL.
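The three strategies can be sketched in a few lines of Python for a list of dicts keyed by email. The `records` data is invented, and the merge rule used here (fill fields that are None from later records) is only one possible policy, chosen as an assumption for the example.

```python
# Hypothetical records: two entries share the same email.
records = [
    {"email": "a@example.com", "name": "Ann", "phone": None},
    {"email": "b@example.com", "name": "Bob", "phone": "555-0100"},
    {"email": "a@example.com", "name": None, "phone": "555-0199"},
]

keep_first, keep_last, merged = {}, {}, {}
for rec in records:
    key = rec["email"]
    keep_first.setdefault(key, rec)  # first record for a key wins
    keep_last[key] = rec             # later records overwrite earlier ones
    # Merge: start from a copy of the first record, then fill empty fields.
    target = merged.setdefault(key, dict(rec))
    for field, value in rec.items():
        if target.get(field) is None and value is not None:
            target[field] = value

print(list(keep_first.values()))
print(list(keep_last.values()))
print(list(merged.values()))
```

Here keep-first retains Ann's name but no phone number, keep-last retains the phone number but loses the name, and the merge keeps both, which is why the choice depends on which version of the data is authoritative.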

Summary: choosing the right approach

Situation | Tool
Plain text list, one-off | Browser deduplication tool
Array in JavaScript | new Set()
List in Python | dict.fromkeys()
DataFrame column | df.drop_duplicates(subset=[...])
SQL table by key | DELETE ... WHERE id NOT IN (SELECT MIN(id) ... GROUP BY ...)
Large file | sort -u or LC_ALL=C sort -u
Fuzzy matching | Levenshtein / SequenceMatcher

The right tool depends on scale, order-preservation requirements, and whether you need exact or fuzzy matching.
