DEV Community

ahmet gedik
ahmet gedik

Posted on

Creating an Efficient Video Deduplication System

When you aggregate trending videos from 9 regions, the same video often appears in multiple regions simultaneously. A K-pop release might trend in KR, TW, TH, SG, VN, HK, US, and GB — all at the same time. Without deduplication, your database fills with redundant copies.

Here's how I handle deduplication on TopVideoHub.

The Simple Case: Same Video ID

The easiest deduplication is by YouTube video ID. The same video trending in multiple regions has the same ID:

class VideoDeduplicator {
    public function __construct(private readonly \PDO $db) {}

    /**
     * Insert or merge a video. Returns the internal ID.
     */
    public function upsert(array $videoData, string $region): int {
        $existing = $this->db->prepare(
            "SELECT id FROM videos WHERE video_id = ?"
        );
        $existing->execute([$videoData['video_id']]);
        $row = $existing->fetch(\PDO::FETCH_ASSOC);

        if ($row) {
            // Video exists — add this region
            $this->addRegion((int)$row['id'], $region, $videoData);
            return (int)$row['id'];
        }

        // New video — insert
        return $this->insertNew($videoData, $region);
    }

    private function insertNew(array $data, string $region): int {
        $stmt = $this->db->prepare(
            "INSERT INTO videos (video_id, title, channel_title, channel_id,
             thumbnail_url, view_count, like_count, comment_count,
             published_at, category_id, duration)
             VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
        );
        $stmt->execute([
            $data['video_id'], $data['title'], $data['channel_title'],
            $data['channel_id'], $data['thumbnail_url'],
            $data['view_count'], $data['like_count'], $data['comment_count'],
            $data['published_at'], $data['category_id'], $data['duration'],
        ]);

        $id = (int)$this->db->lastInsertId();
        $this->addRegion($id, $region, $data);
        return $id;
    }

    private function addRegion(int $internalId, string $region, array $data): void {
        // UPSERT region data — update stats if region already exists
        $this->db->prepare(
            "INSERT INTO video_regions (video_id, region, trending_rank, fetched_at)
             VALUES (?, ?, ?, CURRENT_TIMESTAMP)
             ON CONFLICT(video_id, region) DO UPDATE SET
                trending_rank = excluded.trending_rank,
                fetched_at = CURRENT_TIMESTAMP"
        )->execute([
            $internalId,
            $region,
            $data['trending_rank'] ?? null,
        ]);

        // Update view count (always use latest)
        if (isset($data['view_count'])) {
            $this->db->prepare(
                "UPDATE videos SET view_count = MAX(view_count, ?)
                 WHERE id = ?"
            )->execute([$data['view_count'], $internalId]);
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

Cross-Region Scoring

Videos trending in multiple regions are inherently more interesting. I boost their score:

class TrendingScorer {
    /**
     * Calculate a trending score that rewards cross-region popularity.
     */
    public static function score(array $video, int $regionCount): float {
        $viewScore = log10(max($video['view_count'], 1));
        $ageHours = (time() - strtotime($video['published_at'])) / 3600;
        $freshnessDecay = 1 / (1 + ($ageHours / 48)); // Half-life: 48 hours

        // Cross-region multiplier: logarithmic boost
        // 1 region = 1.0x, 3 regions = 1.6x, 9 regions = 2.2x
        $regionMultiplier = 1.0 + (log($regionCount) / log(9));

        return $viewScore * $freshnessDecay * $regionMultiplier;
    }
}
Enter fullscreen mode Exit fullscreen mode

Query: Videos Trending Across Multiple Regions

-- Find videos trending in 3+ regions, ordered by score
SELECT 
    v.*,
    COUNT(DISTINCT vr.region) as region_count,
    GROUP_CONCAT(DISTINCT vr.region) as regions
FROM videos v
JOIN video_regions vr ON vr.video_id = v.id
WHERE vr.fetched_at > datetime('now', '-24 hours')
GROUP BY v.id
HAVING region_count >= 3
ORDER BY region_count DESC, v.view_count DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

This query powers the "Trending Across Asia" section on TopVideoHub, showing content that's breaking out across multiple markets.

Cleanup: Removing Stale Data

Videos don't trend forever. Periodically clean up old data:

function cleanupStaleVideos(\PDO $db, int $maxAgeDays = 7): int {
    // Remove old region associations
    $db->exec(
        "DELETE FROM video_regions 
         WHERE fetched_at < datetime('now', '-{$maxAgeDays} days')"
    );

    // Remove videos with no remaining region associations
    $result = $db->exec(
        "DELETE FROM videos 
         WHERE id NOT IN (SELECT DISTINCT video_id FROM video_regions)"
    );

    // Rebuild FTS index
    $db->exec("INSERT INTO video_search(video_search) VALUES('rebuild')");

    return $result;
}
Enter fullscreen mode Exit fullscreen mode

Batch Processing for Efficiency

When processing 450 videos (50 per region x 9 regions), wrap operations in a transaction:

function batchUpsert(\PDO $db, array $allVideos): array {
    $dedup = new VideoDeduplicator($db);
    $stats = ['inserted' => 0, 'merged' => 0];

    $db->beginTransaction();
    try {
        foreach ($allVideos as $regionData) {
            $region = $regionData['region'];
            foreach ($regionData['videos'] as $rank => $video) {
                $video['trending_rank'] = $rank + 1;

                $existsBefore = $db->prepare(
                    "SELECT 1 FROM videos WHERE video_id = ?"
                )->execute([$video['video_id']])->fetchColumn();

                $dedup->upsert($video, $region);

                if ($existsBefore) {
                    $stats['merged']++;
                } else {
                    $stats['inserted']++;
                }
            }
        }
        $db->commit();
    } catch (\Throwable $e) {
        $db->rollBack();
        throw $e;
    }

    return $stats;
}
Enter fullscreen mode Exit fullscreen mode

Output after a typical fetch:

Fetch complete:
  Inserted: 312 new videos
  Merged: 138 cross-region duplicates
  Total unique: 312
  Cross-region videos: 138 (30.7%)
Enter fullscreen mode Exit fullscreen mode

That 30% cross-region rate is consistent with what we see on TopVideoHub — about a third of trending videos appear in multiple markets, with K-pop content being the most frequent cross-region hit.

Deduplication is a fundamental building block for any multi-region content aggregator. Get it right and your data stays clean. Get it wrong and your storage grows linearly with regions instead of being bounded.

Top comments (0)