When you aggregate trending videos from 9 regions, the same video often appears in multiple regions simultaneously. A K-pop release might trend in KR, TW, TH, SG, VN, HK, US, and GB — all at the same time. Without deduplication, your database fills with redundant copies.
Here's how I handle deduplication on TopVideoHub.
The Simple Case: Same Video ID
The easiest deduplication is by YouTube video ID. The same video trending in multiple regions has the same ID:
class VideoDeduplicator {
public function __construct(private readonly \PDO $db) {}
/**
* Insert or merge a video. Returns the internal ID.
*/
public function upsert(array $videoData, string $region): int {
$existing = $this->db->prepare(
"SELECT id FROM videos WHERE video_id = ?"
);
$existing->execute([$videoData['video_id']]);
$row = $existing->fetch(\PDO::FETCH_ASSOC);
if ($row) {
// Video exists — add this region
$this->addRegion((int)$row['id'], $region, $videoData);
return (int)$row['id'];
}
// New video — insert
return $this->insertNew($videoData, $region);
}
private function insertNew(array $data, string $region): int {
$stmt = $this->db->prepare(
"INSERT INTO videos (video_id, title, channel_title, channel_id,
thumbnail_url, view_count, like_count, comment_count,
published_at, category_id, duration)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
);
$stmt->execute([
$data['video_id'], $data['title'], $data['channel_title'],
$data['channel_id'], $data['thumbnail_url'],
$data['view_count'], $data['like_count'], $data['comment_count'],
$data['published_at'], $data['category_id'], $data['duration'],
]);
$id = (int)$this->db->lastInsertId();
$this->addRegion($id, $region, $data);
return $id;
}
private function addRegion(int $internalId, string $region, array $data): void {
// UPSERT region data — update stats if region already exists
$this->db->prepare(
"INSERT INTO video_regions (video_id, region, trending_rank, fetched_at)
VALUES (?, ?, ?, CURRENT_TIMESTAMP)
ON CONFLICT(video_id, region) DO UPDATE SET
trending_rank = excluded.trending_rank,
fetched_at = CURRENT_TIMESTAMP"
)->execute([
$internalId,
$region,
$data['trending_rank'] ?? null,
]);
// Update view count (always use latest)
if (isset($data['view_count'])) {
$this->db->prepare(
"UPDATE videos SET view_count = MAX(view_count, ?)
WHERE id = ?"
)->execute([$data['view_count'], $internalId]);
}
}
}
Cross-Region Scoring
Videos trending in multiple regions are inherently more interesting. I boost their score:
class TrendingScorer {
/**
* Calculate a trending score that rewards cross-region popularity.
*/
public static function score(array $video, int $regionCount): float {
$viewScore = log10(max($video['view_count'], 1));
$ageHours = (time() - strtotime($video['published_at'])) / 3600;
$freshnessDecay = 1 / (1 + ($ageHours / 48)); // Half-life: 48 hours
// Cross-region multiplier: logarithmic boost
// 1 region = 1.0x, 3 regions = 1.6x, 9 regions = 2.2x
$regionMultiplier = 1.0 + (log($regionCount) / log(9));
return $viewScore * $freshnessDecay * $regionMultiplier;
}
}
Query: Videos Trending Across Multiple Regions
-- Find videos trending in 3+ regions, ordered by score
SELECT
v.*,
COUNT(DISTINCT vr.region) as region_count,
GROUP_CONCAT(DISTINCT vr.region) as regions
FROM videos v
JOIN video_regions vr ON vr.video_id = v.id
WHERE vr.fetched_at > datetime('now', '-24 hours')
GROUP BY v.id
HAVING region_count >= 3
ORDER BY region_count DESC, v.view_count DESC
LIMIT 20;
This query powers the "Trending Across Asia" section on TopVideoHub, showing content that's breaking out across multiple markets.
Cleanup: Removing Stale Data
Videos don't trend forever. Periodically clean up old data:
function cleanupStaleVideos(\PDO $db, int $maxAgeDays = 7): int {
// Remove old region associations
$db->exec(
"DELETE FROM video_regions
WHERE fetched_at < datetime('now', '-{$maxAgeDays} days')"
);
// Remove videos with no remaining region associations
$result = $db->exec(
"DELETE FROM videos
WHERE id NOT IN (SELECT DISTINCT video_id FROM video_regions)"
);
// Rebuild FTS index
$db->exec("INSERT INTO video_search(video_search) VALUES('rebuild')");
return $result;
}
Batch Processing for Efficiency
When processing 450 videos (50 per region x 9 regions), wrap operations in a transaction:
function batchUpsert(\PDO $db, array $allVideos): array {
$dedup = new VideoDeduplicator($db);
$stats = ['inserted' => 0, 'merged' => 0];
$db->beginTransaction();
try {
foreach ($allVideos as $regionData) {
$region = $regionData['region'];
foreach ($regionData['videos'] as $rank => $video) {
$video['trending_rank'] = $rank + 1;
$existsBefore = $db->prepare(
"SELECT 1 FROM videos WHERE video_id = ?"
)->execute([$video['video_id']])->fetchColumn();
$dedup->upsert($video, $region);
if ($existsBefore) {
$stats['merged']++;
} else {
$stats['inserted']++;
}
}
}
$db->commit();
} catch (\Throwable $e) {
$db->rollBack();
throw $e;
}
return $stats;
}
Output after a typical fetch:
Fetch complete:
Inserted: 312 new videos
Merged: 138 cross-region duplicates
Total unique: 312
Cross-region videos: 138 (30.7%)
That 30% cross-region rate is consistent with what we see on TopVideoHub — about a third of trending videos appear in multiple markets, with K-pop content being the most frequent cross-region hit.
Deduplication is a fundamental building block for any multi-region content aggregator. Get it right and your data stays clean. Get it wrong and your storage grows linearly with regions instead of being bounded.
Top comments (0)