Brandon Lozano

Building an AI DJ: What I Got Wrong About Music Embeddings 🎶🕺

It All Begins With a Vibe Gripe

We have all been there, at weddings, maybe even our own. The DJ decides to whip out the Cupid Shuffle, the Cha Cha Slide, and Cotton Eyed Joe (no offense if you genuinely enjoy these tracks). The intent is to drive engagement, but obligatory engagement isn't engagement. The purpose of a wedding reception is to get people dancing and give them a great time.

Base Premise of the Application

I had a hunch that using metadata, embeddings, personal data, location, and weather, I could build an algorithm that outperforms 95% of wedding DJs. I am not claiming computers will replace true art -- Tiesto, Steve Aoki, and Diplo are definitely safe IMO. The goal is to build something budget-friendly for couples who don't have endless money to spend but still want their guests to have a good time.

At the end of the day, engagement and music itself can be represented with tags and math.

Initial thoughts

I initially leveraged Spotify due to the countless public playlists; the like count is a decent proxy for crowd-tested quality. A few examples: SONGS TO DANCE TO AT 3:33 AM, sad songs to cry out to, and of course wedding reception BANGERS.

My original plan was to pull the three most-liked playlists each for wedding ceremonies, cocktail hours, dinners, and receptions using Spotify's API. I assumed that BPM, Camelot key, popularity, genre, and artist would be enough to start building a vibe curve.

Importing Music

After copying 12 of the most popular playlists, I set up a basic API to seed my PostgreSQL database. I quickly found that Spotify's /audio-features endpoint (which includes BPM, key, energy, etc.) was blocked for our app credentials. Rather than fight with API permissions, I built a script for manually entering the BPM and Camelot key of each track.

# Manual import script
def import_playlist_manual(playlist_id, stage):
    """Import playlist with manual BPM/Key entry from songbpm.com"""

    # Fetch all tracks from Spotify
    results = sp.playlist_tracks(playlist_id)
    # ... handle pagination ...

    for track in tracks:
        # Pull the fields we need from Spotify's track object
        track_id = track['id']
        track_name = track['name']
        artist_name = track['artists'][0]['name']
        album_name = track['album']['name']
        popularity = track['popularity']
        duration_ms = track['duration_ms']
        explicit = track['explicit']
        # ... genres / release_year from artist + album lookups ...

        # Display track info
        print(f"🎵 {track_name}")
        print(f"🎤 {artist_name}")

        # Manual entry (skip tracks we can't find a BPM for)
        bpm_input = input("Enter BPM (or 's' to skip): ").strip()
        if bpm_input.lower() == 's':
            continue
        bpm = int(bpm_input)

        key_input = input("Enter Key (e.g. 10A, 2B) or press Enter: ").strip()
        key = key_input or None

        # Estimate energy from BPM
        energy = 0.8 if bpm >= 130 else 0.6 if bpm >= 110 else 0.4

        # Save to database
        cursor.execute("""
            INSERT INTO curated_tracks 
            (spotify_id, name, artist, album, bpm, key, energy, stages, 
             popularity, duration_ms, genres, release_year, explicit)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """, (track_id, track_name, artist_name, album_name, bpm, key, 
              energy, [stage], popularity, duration_ms, genres, 
              release_year, explicit))

# Function call with manual stage denotation, based on playlist
import_playlist_manual('40baa6dacb5a4ad2', 'reception')

After building out the initial playlist generation services, I noticed the recommendations were entirely off. I was adhering to the vibe curve for each stage, but the genre and semantic meaning of the music were way off. The vibe curve was one of the initial ideas: the energy of the music should follow a curve based on the stage of the event, to maximize engagement and minimize burnout.

CEREMONY = {
    'duration': 30,  # minutes
    'energy_range': (0.2, 0.4),
    'curve': 'flat_low',
    'description': 'Intimate and elegant throughout'
}

COCKTAIL = {
    'duration': 60,
    'energy_range': (0.4, 0.6),
    'curve': 'gentle_rise',
    'description': 'Conversational background, slight build'
}

DINNER = {
    'duration': 90,
    'energy_range': (0.3, 0.5),
    'curve': 'flat_medium',
    'description': 'Steady background, family-friendly'
}

RECEPTION = {
    'duration': 180,
    'energy_range': (0.5, 0.95),
    'curve': 'two_peak',
    'description': 'Gradual build → peak → sustain → final peak'
}
# Visual Representation
Energy
1.0 |                            ┌────────┐
    |                        ┌───┘        └─
0.8 |                 ┌──────┘
    |            ┌────┘
0.6 |      ┌─────┘
    | ─────┘
0.4 |
    └─────────────────────────────────────> Time
    0%   15%   40%  55%      85%       100%
    warm  peak1 dip  peak2    sendoff
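For illustration, here is a minimal sketch of how a curve name could be turned into a target energy at any fraction of a stage, using linear interpolation between control points. The control-point values are my own illustrative guesses (roughly tracing the diagram above), not the app's actual numbers:

```python
# Sketch: turn a curve name into a target energy at fraction t (0.0 to 1.0)
# of the stage. Control points below are illustrative, not the real values.
CURVE_POINTS = {
    'flat_low':    [(0.0, 0.3), (1.0, 0.3)],
    'gentle_rise': [(0.0, 0.4), (1.0, 0.6)],
    'flat_medium': [(0.0, 0.4), (1.0, 0.4)],
    'two_peak':    [(0.0, 0.5), (0.15, 0.6), (0.40, 0.9),
                    (0.55, 0.7), (0.85, 0.95), (1.0, 0.8)],
}

def target_energy(curve, t):
    """Linearly interpolate the target energy at fraction t of the stage."""
    points = CURVE_POINTS[curve]
    for (t0, e0), (t1, e1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return e0 + (t - t0) / (t1 - t0) * (e1 - e0)
    return points[-1][1]
```

The queue builder can then pick the candidate whose `energy` value sits closest to `target_energy(stage['curve'], elapsed / duration)` at each slot.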

Music Catalog Enhancements

From architecting RAG systems and my overall experience with AI, I was familiar with embeddings and wondered whether I could use them to build a list of the most similar songs for the queues. Using the iTunes preview API to download 30-second previews, I generated audio embeddings with LAION CLAP, and used Sentence Transformers to create embeddings from each song's metadata. The approach: generate 512-dimensional vectors that capture both sonic texture (tempo, timbre, instrumentation) and textual metadata (genres, artist, stage tags). By combining these with a 60/40 audio-to-metadata weighting, we could find songs that both sound similar and fit similar contexts.

import librosa
import numpy as np
from laion_clap import CLAP_Module
from sentence_transformers import SentenceTransformer

# Initialize models
clap_model = CLAP_Module(enable_fusion=False)
clap_model.load_ckpt()  # LAION CLAP (HTSAT) for audio
text_model = SentenceTransformer('distiluse-base-multilingual-cased')

def generate_embeddings(track):
    # 1. Audio embedding from 30-second iTunes preview
    # CLAP expects 48 kHz, batched input, so resample and add a batch dim
    audio, sr = librosa.load(f"previews/{track.spotify_id}.m4a", sr=48000)
    audio_embedding = clap_model.get_audio_embedding_from_data(
        x=audio.reshape(1, -1),
        use_tensor=False
    )[0]  # Returns 512-dim vector

    # 2. Metadata text embedding
    metadata_text = f"{track.genres}, {track.stages}, {track.artist}"
    metadata_embedding = text_model.encode(metadata_text)  # 512-dim

    # 3. Weighted combination (audio = 60%, metadata = 40%)
    combined = (audio_embedding * 0.6) + (metadata_embedding * 0.4)
    combined = combined / np.linalg.norm(combined)  # Normalize

    # Store in PostgreSQL with pgvector
    cursor.execute("""
        UPDATE curated_tracks 
        SET audio_embedding = %s,
            metadata_embedding = %s,
            combined_embedding = %s
        WHERE spotify_id = %s
    """, (audio_embedding.tolist(), metadata_embedding.tolist(),
          combined.tolist(), track.spotify_id))

PostgreSQL vector similarity search

Using pgvector extension (which adds vector similarity search to PostgreSQL), we can query for similar songs using cosine distance:

-- Find 10 most similar songs by audio + metadata
SELECT name, artist, bpm, energy,
       combined_embedding <=> %s::vector AS distance  -- Cosine distance (lower = more similar)
FROM curated_tracks
WHERE combined_embedding IS NOT NULL
ORDER BY distance
LIMIT 10;
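Calling that query from application code might look like the sketch below, assuming `conn` is an open psycopg2 connection and `seed_embedding` is the 512-dim combined vector of the track you want neighbors for. The helper names are hypothetical; pgvector accepts a `'[0.1, 0.2, ...]'` string literal when cast to `::vector`:

```python
def to_vector_literal(embedding):
    """Format a Python sequence as a pgvector literal, e.g. '[0.1, 0.2]'."""
    return '[' + ', '.join(str(float(x)) for x in embedding) + ']'

def similar_tracks(conn, seed_embedding, limit=10):
    """Nearest tracks by cosine distance; conn is an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT name, artist, bpm, energy,
                   combined_embedding <=> %s::vector AS distance
            FROM curated_tracks
            WHERE combined_embedding IS NOT NULL
            ORDER BY distance
            LIMIT %s
        """, (to_vector_literal(seed_embedding), limit))
        return cur.fetchall()
```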

This did improve the relevancy of suggested songs, but I could tell something was missing. The problem: audio similarity captures sonic texture but not semantic intent. The suggested songs had a similar essence, lyrics aside, but the queue would add songs about war, club music, and romantic melodies all to the ceremony (if you are thinking of this split for your wedding, please reconsider). A song about heartbreak can sound identical to a love song: same tempo, same instrumentation, completely different meaning. Embeddings alone couldn't distinguish between "Fortunate Son" (war protest) and "Signed, Sealed, Delivered" (celebration) if they had similar acoustic profiles.

Enter Claude: Semantic Tagging for Intent

I decided to use Claude's API to tag music with its semantic meaning apart from audio qualities, and more importantly to capture the intent of each song. The cost was negligible: around $15 for 2.5k songs.

def tag_song_with_claude(track):
    """Extract semantic meaning beyond audio qualities"""

    prompt = f"""Analyze this wedding song:

    Title: {track.name}
    Artist: {track.artist}
    BPM: {track.bpm}
    Energy: {track.energy}

    Provide wedding-specific tags in JSON:
    {{
      "vibe_tags": ["romantic", "celebratory", "energetic", etc.],
      "occasion_tags": ["first-dance", "processional", "general-dance", etc.],
      "mood_tags": ["joyful", "tender", "euphoric", etc.],
      "lyric_themes": ["love", "commitment", "celebration", etc.],
      "content_flags": ["war-themes", "breakup", "explicit", "club-music"],
      "wedding_appropriate": true/false,
      "wedding_notes": "Brief explanation"
    }}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    tags = json.loads(message.content[0].text)

    # Save semantic tags to database
    cursor.execute("""
        UPDATE curated_tracks 
        SET vibe_tags = %s,
            occasion_tags = %s,
            mood_tags = %s,
            lyric_themes = %s,
            content_flags = %s,
            wedding_appropriate = %s,
            wedding_notes = %s
        WHERE spotify_id = %s
    """, (tags['vibe_tags'], tags['occasion_tags'], tags['mood_tags'],
          tags['lyric_themes'], tags['content_flags'], 
          tags['wedding_appropriate'], tags['wedding_notes'],
          track.spotify_id))

The impact was immediate:

# Before Claude tagging - embeddings alone:
ceremony_queue = [
    "Canon in D",           # ✅ Perfect
    "The Sound of Silence", # ❌ Too melancholic
    "Fortunate Son",        # ❌ War protest song
    "Call Me Maybe"         # ❌ Club pop
]

# After Claude tagging - filtered by semantic meaning:
ceremony_queue = [
    "Canon in D",           # ✅ occasion_tags: ["processional"]
    "A Thousand Years",     # ✅ vibe_tags: ["romantic", "intimate"]
    "At Last",              # ✅ mood_tags: ["tender", "joyful"]
    "Marry You"             # ✅ lyric_themes: ["commitment", "love"]
]
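The filtering step itself can be as simple as checking Claude's `content_flags` against a per-stage blocklist. A minimal sketch, assuming the tags come back as Python lists; the blocklist contents here are illustrative, not the production lists:

```python
# Illustrative per-stage blocklists built from Claude's content_flags;
# the real lists would be tuned per event.
STAGE_BLOCKLIST = {
    'ceremony':  {'war-themes', 'breakup', 'explicit', 'club-music'},
    'dinner':    {'explicit', 'club-music'},
    'reception': {'war-themes', 'breakup'},
}

def passes_semantic_filter(track, stage):
    """Reject tracks whose content flags clash with the given stage."""
    if not track.get('wedding_appropriate', True):
        return False
    flags = set(track.get('content_flags', []))
    return not (flags & STAGE_BLOCKLIST.get(stage, set()))
```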

The Hybrid Approach

The final song selection algorithm uses a weighted scoring system:

  • Embedding similarity (60%) - Heaviest weight because sonic continuity matters most for dance floor flow
  • BPM/energy matching (25%) - Adheres to the vibe curve progression
  • Semantic tags (15%) - Acts as a filter rather than primary driver; prevents obvious mismatches

These weights came from iterative testing - we found that prioritizing sonic flow over semantic perfection kept energy up while still avoiding thematic disasters.

This hybrid approach prevents the "sounds right but feels wrong" problem: songs flow naturally while staying thematically appropriate for each wedding moment.
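As a sketch of how those weights could combine, assuming cosine distance has already been converted to a similarity in [0, 1], an illustrative ±30 BPM tolerance window, and a hypothetical per-stage set of desired moods (none of these normalization choices are the production values):

```python
def hybrid_score(candidate, embed_similarity, target_bpm, target_energy,
                 desired_moods=frozenset({'romantic', 'celebratory'})):
    """60% sonic similarity, 25% vibe-curve fit, 15% semantic tag fit."""
    # Hard gate: semantically inappropriate tracks never score
    if not candidate.get('wedding_appropriate', True):
        return 0.0

    # Vibe-curve fit: full credit at the target, linear falloff
    bpm_fit = max(0.0, 1.0 - abs(candidate['bpm'] - target_bpm) / 30.0)
    energy_fit = max(0.0, 1.0 - abs(candidate['energy'] - target_energy))
    curve_score = (bpm_fit + energy_fit) / 2

    # Tag fit: fraction of the stage's desired moods the track carries
    moods = set(candidate.get('mood_tags', [])) | set(candidate.get('vibe_tags', []))
    tag_score = len(desired_moods & moods) / len(desired_moods)

    return 0.60 * embed_similarity + 0.25 * curve_score + 0.15 * tag_score
```

Candidates are then sorted by this score when filling the next queue slot.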

In Conclusion

Vibe Curator is still in active development -- I am making queue generation, conversation service, and UI enhancements before launching to my network for testing. Afterwards, I plan to integrate real-time crowd engagement analysis to dynamically adjust the queue based on how people are actually responding on the dance floor.

The biggest takeaway from this whole process isn't specific to music though: embeddings are powerful, but they capture how something sounds or reads, not what it means. Pairing vector similarity with semantic intent is becoming a pattern I reach for anytime embeddings alone feel shallow.

If you've tackled a similar hybrid approach, or have thoughts on the scoring weights, I'd love to hear how you approached it in the comments.
