Removing duplicates is one of those tasks that sounds trivial until you encounter it at scale. Here's a systematic guide to deduplication — from quick browser-based tools to programmatic approaches for large datasets.
The quick browser approach
For plain text lists — a list of email addresses, domain names, keywords, URLs, or IP addresses — the fastest approach is to paste into an online tool, enable deduplication, and copy the result. No code, no terminal, no file I/O.
Remove duplicate lines handles the most common cases: exact-match deduplication, case-insensitive deduplication, whitespace-trimmed deduplication, and sorted output. For most day-to-day deduplication tasks (cleaning a keyword list, deduplicating a subscriber export), this is the right tool.
When to step up to programmatic approaches: when the list has millions of entries, when you need fuzzy matching (not just exact), or when deduplication is part of a larger data pipeline.
Deduplication in JavaScript
Array of strings:
const unique = [...new Set(arr)];
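Set preserves insertion order and handles exact-match deduplication in one line. For large arrays, this runs in O(n) time on average, significantly faster than the O(n²) filter-plus-indexOf pattern. Note that Set compares objects by reference, so this one-liner is meant for primitives such as strings and numbers.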
Case-insensitive deduplication:
const seen = new Set();
const unique = arr.filter(item => {
  // Normalize so "Foo " and "foo" map to the same key
  const key = item.toLowerCase().trim();
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});
This preserves the first occurrence of each value (in its original casing) while filtering out case-variant duplicates.
Array of objects (by a key):
const seen = new Set();
const unique = arr.filter(obj => {
  // Keep only the first object seen for each id
  if (seen.has(obj.id)) return false;
  seen.add(obj.id);
  return true;
});
With lodash:
import _ from 'lodash';
// Objects: deduplicate by the 'id' property
const uniqueById = _.uniqBy(arr, 'id');
// Primitives: deduplicate by value
const uniqueValues = _.uniq(arr);
Deduplication in Python
List of strings:
unique = list(dict.fromkeys(items)) # preserves order (Python 3.7+)
# or
unique = list(set(items)) # doesn't preserve order
dict.fromkeys() uses the items as dictionary keys (which must be unique) and preserves the original insertion order — the recommended approach when order matters.
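As a quick illustration of the difference between the two one-liners, here is a minimal sketch. The sample list is made up for the example; the only assumption is that the items are hashable (strings here).

```python
# Hypothetical sample data, used only for illustration
items = [
    "b@example.com",
    "a@example.com",
    "b@example.com",
    "c@example.com",
    "a@example.com",
]

# dict.fromkeys keeps the first occurrence of each item, in original order
ordered_unique = list(dict.fromkeys(items))
print(ordered_unique)  # ['b@example.com', 'a@example.com', 'c@example.com']

# set() removes the same duplicates but makes no ordering promise,
# so sort the result if a stable order is needed
sorted_unique = sorted(set(items))
print(sorted_unique)  # ['a@example.com', 'b@example.com', 'c@example.com']
```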
Case-insensitive:
seen = set()
unique = []
for item in items:
    # Normalize so "Foo " and "foo" map to the same key
    key = item.lower().strip()
    if key not in seen:
        seen.add(key)
        unique.append(item)  # keep the first occurrence, original casing
Pandas DataFrame:
import pandas as pd
# Remove duplicate rows
df = df.drop_duplicates()
# Remove duplicates by specific column
df = df.drop_duplicates(subset=['email'])
# Keep last occurrence instead of first
df = df.drop_duplicates(subset=['email'], keep='last')
# See which rows would be dropped (every occurrence after the first)
df[df.duplicated(subset=['email'])]
Deduplication in SQL
Find duplicates:
SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
Delete duplicates, keeping the row with the lowest id:
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id)
FROM users
GROUP BY email
);
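To try the two queries above without touching a real database, the following sketch runs them against an in-memory SQLite database from Python. The table contents are invented for the example, and SQLite stands in for whatever database you actually use, so treat it as a dry run rather than a production procedure.

```python
import sqlite3

# In-memory database with a hypothetical users table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "b@example.com"),
        (3, "a@example.com"),
        (4, "c@example.com"),
        (5, "a@example.com"),
    ],
)

# Step 1: report which emails are duplicated and how often
duplicates = conn.execute(
    "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)  # [('a@example.com', 3)]

# Step 2: delete every row except the lowest id for each email
conn.execute(
    "DELETE FROM users WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)"
)
conn.commit()

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(remaining)  # [(1, 'a@example.com'), (2, 'b@example.com'), (4, 'c@example.com')]
```

Running the SELECT first and checking its output before the DELETE is a reasonable habit, since the delete cannot be undone once committed.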
PostgreSQL: Using ROW_NUMBER:
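WITH ranked AS (
  SELECT id,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
  FROM users
)
-- PostgreSQL cannot delete from a CTE directly, so delete from users by id.
-- rn = 1 is the most recent row per email; everything else is removed.
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);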
PostgreSQL: Deduplicate into a new table:
CREATE TABLE users_clean AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC;
DISTINCT ON is PostgreSQL-specific. It keeps the first row of each email group as ordered by the ORDER BY clause, so the query above keeps the most recent row per email.
Command-line deduplication
Sort and deduplicate:
sort file.txt | uniq > unique.txt
uniq only removes consecutive duplicates, so you must sort first. sort -u does both in one step:
sort -u file.txt > unique.txt
Deduplicate without sorting (preserve order):
awk '!seen[$0]++' file.txt > unique.txt
This is the canonical awk approach for order-preserving deduplication. The seen array tracks lines; !seen[$0]++ prints a line only the first time it's seen.
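For readers more comfortable in Python, the same "print a line only the first time it is seen" idea can be sketched as a generator. The function name unique_lines and its key parameter are illustrative choices, not part of any library, and io.StringIO stands in for a real file.

```python
import io

def unique_lines(lines, key=lambda line: line):
    """Yield each line the first time its key is seen, preserving input order."""
    seen = set()
    for line in lines:
        text = line.rstrip("\n")
        k = key(text)
        if k not in seen:
            seen.add(k)
            yield text

# io.StringIO stands in for an open file here; with a real file, use
# `with open("file.txt") as f:` and pass f instead
sample = io.StringIO("apple\nBanana\napple\nbanana\ncherry\n")
print(list(unique_lines(sample)))  # ['apple', 'Banana', 'banana', 'cherry']

# Passing a normalizing key makes the comparison case-insensitive
sample.seek(0)
print(list(unique_lines(sample, key=str.lower)))  # ['apple', 'Banana', 'cherry']
```

Like the awk version, this holds one entry per distinct line in memory, so memory use grows with the number of unique lines rather than the file size.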
Case-insensitive:
awk '!seen[tolower($0)]++' file.txt > unique.txt
For very large files (millions of lines):
LC_ALL=C sort -u file.txt > unique.txt
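LC_ALL=C makes sort compare raw bytes instead of applying locale collation rules, which is usually much faster. The output is in byte order, and lines count as duplicates only when they are byte-for-byte identical.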
Fuzzy deduplication
Exact-match deduplication misses near-duplicates: "new york", "New York", "new-york", "new york city". For fuzzy matching, you need similarity metrics:
Python: difflib.SequenceMatcher (standard library)
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1] based on matching blocks (not Levenshtein distance);
    # 1.0 means the lowercased strings are identical
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Keep an item only if it is not > 80% similar to an item already kept
threshold = 0.8
unique = []
for item in items:
    if not any(similarity(item, kept) > threshold for kept in unique):
        unique.append(item)
This is O(n²) — fine for hundreds of items, slow for millions. For large-scale fuzzy deduplication, consider MinHash/LSH (Locality-Sensitive Hashing) or the dedupe Python library.
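To give a feel for how MinHash with LSH avoids comparing every pair, here is a rough standard-library sketch. Everything in it is an assumption made for illustration: the character-bigram shingles, the 32-hash signature, the 16 bands, and the helper names. It is not how the dedupe library works, and real workloads would normally use a tested implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # signature length (assumed value for this sketch)
BANDS = 16       # number of LSH bands; rows per band = NUM_HASHES // BANDS

def shingles(text, n=2):
    """Set of lowercase character n-grams (the whole string if it is shorter than n)."""
    t = text.lower()
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def _hash(seed, shingle):
    """Seeded 64-bit hash of one shingle."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(text):
    """Signature: for each seed, the minimum hash over the text's shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))

def candidate_pairs(items):
    """Index pairs (i, j), i < j, whose signatures agree on at least one band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for idx, item in enumerate(items):
        sig = minhash(item)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs

def jaccard(a, b):
    """Exact Jaccard similarity of the two shingle sets, used to verify candidates."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

items = ["new york", "New York", "new-york", "boston"]
for i, j in sorted(candidate_pairs(items)):
    # Only candidate pairs are checked exactly, instead of all n*(n-1)/2 pairs
    print(items[i], "|", items[j], "|", round(jaccard(items[i], items[j]), 2))
```

Items that share a bucket in any band become candidates, so similar strings are likely (not guaranteed) to be paired, while most dissimilar pairs are never compared. More bands with fewer rows each catch more near-duplicates at the cost of more false candidates.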
Python: Levenshtein package (faster)
import Levenshtein
distance = Levenshtein.distance("new york", "new-york")  # 1 (one substitution)
ratio = Levenshtein.ratio("new york", "New York")  # 0.75 (case-sensitive; lowercase both strings first if case should not matter)
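To make the distance metric itself concrete, here is a dependency-free sketch of the classic dynamic-programming edit distance. It is written for clarity, not speed, and the function name levenshtein is just a label for this example; the compiled package above is the better choice for real workloads.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("new york", "new-york"))  # 1
print(levenshtein("new york", "New York"))  # 2
print(levenshtein("new york".lower(), "New York".lower()))  # 0
```

The second and third calls show why normalizing case before comparing matters: the capital letters alone account for two edits.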
Deduplication of structured data
For JSON arrays, CSV files, and database tables, exact-match deduplication by key is usually appropriate. The key decision is which record to keep when duplicates differ:
- Keep first: Use for data where the earliest version is authoritative (first submission, first signup)
- Keep last: Use for data where the latest version supersedes earlier ones (most recent update, last activity)
- Merge: Complex — combine fields from multiple records. Usually done in application code, not SQL.
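The three strategies above can be sketched in a few lines of Python. The records, field names, and the merge rule (a later non-empty value wins, an empty value never overwrites) are assumptions chosen for illustration; a real merge policy depends on your data.

```python
# Hypothetical records keyed by email, listed oldest first
records = [
    {"email": "a@example.com", "name": "Ann", "phone": "555-0100"},
    {"email": "b@example.com", "name": "Bob", "phone": "555-0111"},
    {"email": "a@example.com", "name": "Ann Lee", "phone": None},
]

def keep_first(rows, key):
    result = {}
    for row in rows:
        result.setdefault(row[key], row)  # later duplicates are ignored
    return list(result.values())

def keep_last(rows, key):
    result = {}
    for row in rows:
        result[row[key]] = row  # later duplicates overwrite earlier ones
    return list(result.values())

def merge(rows, key):
    """Later non-empty fields overwrite earlier ones; None never overwrites."""
    result = {}
    for row in rows:
        merged = result.setdefault(row[key], {})
        for field, value in row.items():
            if value is not None or field not in merged:
                merged[field] = value
    return list(result.values())

print(keep_first(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann', 'phone': '555-0100'}
print(keep_last(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann Lee', 'phone': None}
print(merge(records, "email")[0])
# {'email': 'a@example.com', 'name': 'Ann Lee', 'phone': '555-0100'}
```

Here keep-last loses the phone number and keep-first loses the updated name, which is the kind of trade-off that makes merging worth the extra complexity.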
Summary: choosing the right approach
| Situation | Tool |
|---|---|
| Plain text list, one-off | Browser deduplication tool |
| Array in JavaScript | new Set() |
| List in Python | dict.fromkeys() |
| DataFrame column | df.drop_duplicates(subset=[...]) |
| SQL table by key | DELETE ... WHERE id NOT IN (SELECT MIN(id)...GROUP BY ...) |
| Large file | sort -u or LC_ALL=C sort -u |
| Fuzzy matching | Levenshtein / SequenceMatcher |
The right tool depends on scale, order-preservation requirements, and whether you need exact or fuzzy matching.