DEV Community

Cover image for Algorithmic Entity Resolution in Music Metadata
Jorge Martinez
Jorge Martinez

Posted on

Algorithmic Entity Resolution in Music Metadata

In the global streaming economy, Spotify, Apple Music, and other DSPs process billions of plays daily. Behind this massive transaction layer lies a fragmented, dual-copyright structure:

  1. The Recording Copyright (Master Right): Identifies the audio file, registered using the ISRC (International Standard Recording Code).
  2. The Composition Copyright (Publishing Right): Identifies the melody, lyrics, and arrangement, registered using the ISWC (International Standard Musical Work Code).

Because these registries are managed by separate global entities (IFPI for ISRCs and CISAC for ISWCs), there is no central mapping registry between them. This gap causes millions of dollars in mechanical royalties to sit unclaimed in collective management organization (CMO) "Black Boxes" before being liquidated to major publishers.

In this article, we'll design and implement a high-performance Semantic Entity Resolution Protocol (SERP) to bridge this metadata gap programmatically.


The SERP Resolution Pipeline

Reconciling these records requires a multi-layered classification pipeline. Since manual matching is logistically impossible, we implement a three-tiered algorithmic approach:

┌────────────────────────┐
│ Raw Recording & Work   │
│ Data Ingestion         │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│ 1. Normalized Title    │ ──[Similarity < 0.85]──> [Unmatched Queue]
│    Distance Filter     │
└───────────┬────────────┘
            │ [Similarity >= 0.85]
            ▼
┌────────────────────────┐
│ 2. Creator Overlap     │ ──[No Overlap]──────────> [Unmatched Queue]
│    Intersection Matrix │
└───────────┬────────────┘
            │ [Intersection >= 1]
            ▼
┌────────────────────────┐
│ 3. Duration Tolerance  │ ──[Delta > 4s]──────────> [Manual Verification]
│    Guard Check         │
└───────────┬────────────┘
            │ [Delta <= 4s]
            ▼
┌────────────────────────┐
│   Verified Link &      │
│   CMO Dispute Ready    │
└────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Step 1: Normalization & String Similarity Filter

Title comparisons often fail due to punctuation mismatches, subtitle variations, or case discrepancies. We normalize text (converting to lowercase and stripping non-alphanumeric characters) and execute a punctuation-insensitive Levenshtein Distance comparison:

Sim(s1, s2) = 1 − [ Levenshtein(s1, s2) ÷ max(|s1|, |s2|) ]

If the similarity score exceeds 0.85, we pass the record to the next phase.

Step 2: Collaborator Overlap Intersection Matrix

To prevent matching cover songs or sample tracks of similar titles, we construct an intersection matrix of recording artists and registered songwriters. If the intersection of credited creators is greater than or equal to 1, the match proceeds.

Step 3: Duration Tolerance Guard

To distinguish between original mixes, radio edits, and acoustic/live variants (which alter the shares and statutory splits), we apply a strict duration check. The play length of the audio recording must match the composition work registry within a tolerance of ±4 seconds:

DISRC − DISWC ≤ 4 seconds


Programmatic Implementation with Polars

Below is a reference implementation using the high-performance Polars library in Python to execute this pipeline over large datasets:

import polars as pl

def resolve_metadata_gap(recordings_df: pl.DataFrame, compositions_df: pl.DataFrame) -> pl.DataFrame:
    # 1. Clean and normalize identifiers (removing hyphens, dots, spaces)
    cleaned_rec = recordings_df.with_columns([
        pl.col("isrc").str.replace_all(r"[\s-]", "").str.to_uppercase(),
        pl.col("track_title").str.to_lowercase().str.replace_all(r"[^\w\s]", "").str.strip_chars()
    ])

    cleaned_comp = compositions_df.with_columns([
        pl.col("iswc").str.replace_all(r"[\s.-]", "").str.to_uppercase(),
        pl.col("work_title").str.to_lowercase().str.replace_all(r"[^\w\s]", "").str.strip_chars()
    ])

    # 2. Join datasets on matching metadata fields
    # Here, we perform an initial join based on normalized titles to reduce search space
    joined = cleaned_rec.join(cleaned_comp, left_on="track_title", right_on="work_title", how="inner")

    # 3. Apply the Duration Tolerance Guard (<= 4 seconds)
    matched_candidates = joined.filter(
        (pl.col("duration_sec_rec") - pl.col("duration_sec_comp")).abs() <= 4
    )

    # 4. Creator Overlap Check
    # Verify at least one songwriter is present in the artist name or recording credits
    matched_candidates = matched_candidates.filter(
        pl.struct(["artist_name", "songwriters"]).map_batches(
            lambda s: pl.Series([
                any(writer.lower() in row["artist_name"].lower() for writer in row["songwriters"])
                for row in s.to_list()
            ])
        )
    )

    return matched_candidates.select([
        "isrc", "iswc", "track_title", "artist_name", "duration_sec_rec", "duration_sec_comp"
    ])

# Example Mock Data
recordings = pl.DataFrame({
    "isrc": ["US-RC1-23-00001", "GB-AHT-24-99882"],
    "track_title": ["Starlight (Midnight Edit)", "Hold Me Back"],
    "artist_name": ["Jane Doe & The Band", "John Smith"],
    "duration_sec_rec": [242, 180]
})

compositions = pl.DataFrame({
    "iswc": ["T-123.456.789-1", "T-987.654.321-0"],
    "work_title": ["Starlight", "Hold Me Back (Live Version)"],
    "songwriters": [["Jane Doe", "Alex Wright"], ["John Smith"]],
    "duration_sec_comp": [240, 310] # Track 2 fails duration check (live version mismatch)
})

resolution = resolve_metadata_gap(recordings, compositions)
print("Successfully Resolved Metadata Links:")
print(resolution)
Enter fullscreen mode Exit fullscreen mode

Scaling Up the Pipeline

When handling catalogs containing millions of stream rows and payout statements, performing inline string processing is slow. In production systems, we load normalized Arrow caches and use stateful orchestration engines (like LangGraph and vector databases) to handle semantic edge-cases.

For a deeper dive into open-source music metadata standards and reference implementations, check out the Universal Music Rights Metadata Standard (UMRMS) Spec.

Top comments (0)