SoundCloud remains one of the largest audio platforms in the world, with over 300 million tracks and 76 million monthly active users. For music industry analysts, A&R scouts, content curators, and developers building music-related applications, the ability to extract SoundCloud data programmatically is incredibly valuable.
In this guide, we'll explore SoundCloud's data architecture, walk through practical scraping techniques with code examples, handle pagination challenges, and show you how to use the Apify platform for production-grade data extraction at scale.
Understanding SoundCloud's Data Architecture
SoundCloud organizes its data around several core entities that form a rich, interconnected graph of audio content. Understanding these relationships is key to effective scraping.
Tracks
Each track on SoundCloud contains extensive metadata that's useful for analytics and discovery:
- Track ID: Unique numerical identifier
- Title and description: The track's name and creator's notes
- Duration: Length in milliseconds
- Genre and tags: Classification metadata for discovery
- Play count, likes, reposts, and comments: Engagement metrics
- Waveform data: Visual representation of the audio signal
- Created/uploaded date: When the track was published
- Downloadable flag: Whether the creator allows downloads
- Stream URL: The audio streaming endpoint (requires client_id)
- Artwork URL: Cover image available in various sizes (t500x500, crop, large, etc.)
- License type: Creative Commons or All Rights Reserved
- BPM and key: Musical metadata (when provided by the uploader)
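Artwork URLs follow a predictable size-suffix pattern, so you rarely need to re-scrape to get a different resolution. A small sketch of a suffix-swapping helper — note the `-large` default and `-t500x500` variant are commonly observed conventions, not officially documented:

```javascript
// Swap the size suffix on a SoundCloud artwork URL.
// Assumes the commonly observed pattern where the filename
// ends in a size token such as "-large" or "-t500x500".
function artworkAtSize(url, size = 't500x500') {
  if (!url) return null;
  return url.replace(/-(large|t\d+x\d+|crop|original)(\.\w+)$/, `-${size}$2`);
}

console.log(
  artworkAtSize('https://i1.sndcdn.com/artworks-abc123-large.jpg')
);
// → https://i1.sndcdn.com/artworks-abc123-t500x500.jpg
```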
Artist Profiles (Users)
Creator profiles contain business-critical information for talent scouting and market analysis:
- User ID and permalink: Unique identifiers
- Username and display name
- Bio/description with rich text
- Location (city, country): Geographic data for regional analysis
- Follower and following counts: Social graph metrics
- Track count and playlist count: Content volume indicators
- Verified status and Pro/Pro Unlimited badges: Account tier information
- Website and social links: External presence
- Avatar and banner images: Visual branding assets
- Monthly listener count: Audience reach metric
Playlists (Sets)
Playlists group tracks together and include their own metadata:
- Playlist ID and permalink
- Title, description, and tags
- Track listing with order: The full tracklist
- Total duration: Combined length of all tracks
- Like and repost counts: Engagement on the playlist itself
- Public/private status: Visibility setting
- Created and last modified dates: Activity indicators
- Playlist type: Album, EP, single, compilation, or playlist
Comments
SoundCloud's unique timed-comment system provides engagement data:
- Comment text: The actual message
- Timestamp position: Where in the track the comment was placed
- Author information: Who left the comment
- Creation date: When the comment was posted
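Because every comment carries a millisecond position in the track, you can turn a comment list into an engagement heatmap. A minimal sketch, assuming comments have already been extracted with a `timestamp` field in milliseconds (the field name mirrors what the hydration and API payloads appear to expose):

```javascript
// Group timed comments into fixed buckets (default 30s) to see
// which parts of a track attract the most engagement.
function commentHeatmap(comments, bucketMs = 30000) {
  const buckets = {};
  for (const c of comments) {
    const bucket = Math.floor((c.timestamp || 0) / bucketMs);
    buckets[bucket] = (buckets[bucket] || 0) + 1;
  }
  return Object.entries(buckets)
    .map(([bucket, count]) => ({
      startMs: Number(bucket) * bucketMs,
      count,
    }))
    .sort((a, b) => b.count - a.count);
}

const heat = commentHeatmap([
  { timestamp: 1000 },
  { timestamp: 2000 },
  { timestamp: 61000 },
]);
console.log(heat[0]); // the busiest 30s window comes first
```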
Why Scrape SoundCloud Data?
There are many legitimate and valuable reasons to extract SoundCloud data:
- A&R and Talent Scouting: Discovering emerging artists based on engagement growth, genre performance, and audience metrics before they blow up
- Market Analysis: Understanding genre trends, popular sounds, release patterns, and audience preferences across regions
- Playlist Curation: Building data-driven playlists based on track metrics, genre classification, and BPM matching
- Music Industry Research: Academic studies on music distribution, consumption patterns, and platform economics
- Competitive Analysis: Tracking competitor labels, monitoring artist release strategies, and benchmarking performance
- Content Monitoring: Detecting unauthorized uploads of copyrighted content or tracking remix culture
- Recommendation Engines: Building music discovery tools based on listening patterns and engagement signals
SoundCloud's Technical Landscape
Before writing any code, let's understand the technical environment.
The SoundCloud API Situation
SoundCloud officially closed public API registration in 2017. While the documented API endpoints still exist and function with valid client IDs, obtaining a new client ID through official channels is no longer possible. However, the internal API (api-v2.soundcloud.com) that powers the web client remains accessible if you can extract a valid client_id from the page's JavaScript bundles.
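Once you have a client_id, the web client appears to map public permalink URLs to entity JSON via a resolve endpoint. A hedged sketch of building that request URL — the path and parameters are undocumented internals and may change without notice, and `YOUR_CLIENT_ID` is a placeholder:

```javascript
// Build an api-v2 request URL for a public permalink. The /resolve
// endpoint is what the web client appears to use to map a URL to
// its track/user/playlist JSON; treat the path as an assumption.
function buildResolveUrl(permalinkUrl, clientId) {
  const params = new URLSearchParams({
    url: permalinkUrl,
    client_id: clientId, // placeholder — extract from page source
  });
  return `https://api-v2.soundcloud.com/resolve?${params}`;
}

console.log(
  buildResolveUrl('https://soundcloud.com/artist/track-name', 'YOUR_CLIENT_ID')
);
```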
Hydration Data: The Scraper's Best Friend
SoundCloud uses a hybrid rendering approach that's actually favorable for scraping. Initial page loads include server-rendered HTML with embedded JSON data via window.__sc_hydration. This means you can often extract structured data from a simple HTTP request without needing a headless browser — a huge advantage over platforms like TikTok or Instagram.
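To make the structure concrete, here is a simplified sketch of what the hydration array looks like and how to pull an entity out of it. The exact payload varies by page type, and the sample values below are invented for illustration:

```javascript
// Simplified shape of window.__sc_hydration: an array of
// { hydratable, data } entries, one per entity on the page.
const hydration = [
  { hydratable: 'anonymousId', data: 'abc-123' },
  {
    hydratable: 'sound',
    data: { id: 123456789, title: 'Example Track', playback_count: 42 },
  },
];

// Pick an entity by its hydratable type, the same lookup the
// scraping functions in this guide perform.
function pickHydratable(hydrationData, type) {
  return hydrationData.find(item => item.hydratable === type)?.data ?? null;
}

console.log(pickHydratable(hydration, 'sound').title); // → Example Track
```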
Rate Limiting
SoundCloud implements rate limiting on both its web pages and internal API endpoints. You'll typically see 429 responses after sustained high-volume requests. Their limits are more generous than most social platforms, but aggressive scraping will still get you blocked.
Method 1: Extracting Hydration Data from SoundCloud Pages
The most efficient scraping approach leverages SoundCloud's embedded hydration data. No headless browser needed:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeSoundCloudTrack(trackUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
'Accept-Language': 'en-US,en;q=0.9',
};
try {
const response = await axios.get(
trackUrl, { headers }
);
const $ = cheerio.load(response.data);
// SoundCloud embeds structured data in script tags
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(s =>
s && s.includes('window.__sc_hydration')
);
if (!hydrationScript) {
throw new Error(
'Could not find hydration data'
);
}
// Extract the JSON array from the script
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
if (!jsonMatch) {
throw new Error(
'Could not parse hydration JSON'
);
}
const hydrationData = JSON.parse(jsonMatch[1]);
// Find the track data in the hydration array
const trackData = hydrationData.find(
item => item.hydratable === 'sound'
);
if (!trackData) {
throw new Error('Track data not found');
}
const track = trackData.data;
return {
id: track.id,
title: track.title,
description: track.description,
duration: track.duration,
durationFormatted: formatDuration(
track.duration
),
genre: track.genre,
tags: track.tag_list,
playCount: track.playback_count,
likeCount: track.likes_count,
repostCount: track.reposts_count,
commentCount: track.comment_count,
createdAt: track.created_at,
artworkUrl: track.artwork_url,
waveformUrl: track.waveform_url,
downloadable: track.downloadable,
license: track.license,
author: {
id: track.user?.id,
username: track.user?.username,
displayName: track.user?.full_name,
permalink: track.user?.permalink,
avatarUrl: track.user?.avatar_url,
verified: track.user?.verified,
followers: track.user?.followers_count,
},
};
} catch (error) {
console.error(`Scraping error: ${error.message}`);
return null;
}
}
function formatDuration(ms) {
const minutes = Math.floor(ms / 60000);
const seconds = Math.floor((ms % 60000) / 1000);
return `${minutes}:${seconds.toString().padStart(2, '0')}`;
}
// Usage
scrapeSoundCloudTrack(
'https://soundcloud.com/artist/track-name'
).then(data => {
if (data) {
console.log(`${data.title} by ${data.author.displayName}`);
console.log(`${data.playCount.toLocaleString()} plays`);
console.log(`Duration: ${data.durationFormatted}`);
}
});
This approach is fast and efficient because it only requires a single HTTP request per page — no browser rendering overhead.
Method 2: Scraping Artist Profiles and Their Catalog
Extracting a complete artist profile follows the same hydration pattern:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeArtistProfile(artistUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0',
};
const response = await axios.get(artistUrl, { headers });
const $ = cheerio.load(response.data);
// Extract hydration data
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(
s => s && s.includes('window.__sc_hydration')
);
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
const hydrationData = JSON.parse(jsonMatch[1]);
// Extract user profile data
const userData = hydrationData.find(
item => item.hydratable === 'user'
);
const user = userData.data;
const profile = {
id: user.id,
username: user.username,
displayName: user.full_name,
permalink: user.permalink,
bio: user.description,
location: {
city: user.city,
country: user.country_code,
},
stats: {
followers: user.followers_count,
following: user.followings_count,
tracks: user.track_count,
playlists: user.playlist_count,
likes: user.likes_count,
},
verified: user.verified,
avatarUrl: user.avatar_url?.replace(
'-large', '-t500x500'
),
bannerUrl: user.visuals?.visuals?.[0]?.visual_url,
website: user.website,
createdAt: user.created_at,
lastModified: user.last_modified,
};
// Also extract initial track listing if available
const trackListing = hydrationData.find(
item => item.hydratable === 'playlist'
);
if (trackListing?.data?.tracks) {
profile.recentTracks = trackListing.data.tracks
.slice(0, 10)
.map(t => ({
id: t.id,
title: t.title,
plays: t.playback_count,
likes: t.likes_count,
duration: formatDuration(t.duration),
}));
}
return profile;
}
// Usage
scrapeArtistProfile(
'https://soundcloud.com/artist-name'
).then(profile => {
console.log(`${profile.displayName} (@${profile.username})`);
console.log(`${profile.stats.followers.toLocaleString()} followers`);
console.log(`${profile.stats.tracks} tracks`);
console.log(`Location: ${profile.location.city || 'Unknown'}`);
});
Method 3: Handling Pagination for Large Track Catalogs
SoundCloud uses cursor-based pagination for its internal API. When an artist has hundreds of tracks, you need to handle this pagination properly:
const puppeteer = require('puppeteer');
async function scrapeAllTracks(artistUrl, maxTracks = 200) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
const tracks = [];
const seenIds = new Set();
// Intercept API calls to capture paginated data
page.on('response', async (response) => {
const url = response.url();
if (url.includes('api-v2.soundcloud.com') &&
(url.includes('/tracks') ||
url.includes('/stream') ||
url.includes('collection'))) {
try {
const data = await response.json();
const collection = data.collection || [];
collection.forEach(item => {
// Handle both direct tracks and
// wrapped track objects
const track = item.track || item;
if (track.id &&
track.title &&
!seenIds.has(track.id)) {
seenIds.add(track.id);
tracks.push({
id: track.id,
title: track.title,
duration: track.duration,
genre: track.genre,
tags: track.tag_list,
playCount: track.playback_count,
likeCount: track.likes_count,
repostCount: track.reposts_count,
commentCount: track.comment_count,
createdAt: track.created_at,
artworkUrl: track.artwork_url,
permalink: track.permalink_url,
});
}
});
} catch (e) {
// Not a JSON response, skip
}
}
});
await page.goto(`${artistUrl}/tracks`, {
waitUntil: 'networkidle2'
});
// Scroll to trigger pagination API calls
let previousCount = 0;
let staleRounds = 0;
const maxStaleRounds = 5;
while (tracks.length < maxTracks &&
staleRounds < maxStaleRounds) {
await page.evaluate(() => {
window.scrollTo(
0, document.body.scrollHeight
);
});
await new Promise(r => setTimeout(r, 2500));
if (tracks.length === previousCount) {
staleRounds++;
} else {
staleRounds = 0;
// Progress logging whenever new tracks arrived
console.log(
`Collected ${tracks.length} tracks...`
);
}
previousCount = tracks.length;
}
await browser.close();
return tracks.slice(0, maxTracks);
}
// Usage
scrapeAllTracks(
'https://soundcloud.com/artist-name', 500
).then(tracks => {
console.log(`Scraped ${tracks.length} tracks total`);
// Sort by play count to find top performers
const topTracks = [...tracks]
.sort((a, b) =>
(b.playCount || 0) - (a.playCount || 0)
)
.slice(0, 10);
console.log('\nTop 10 tracks by plays:');
topTracks.forEach((t, i) => {
console.log(
`${i + 1}. ${t.title} - ` +
`${(t.playCount || 0).toLocaleString()} plays`
);
});
});
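If you have a valid client_id, you can skip the browser entirely: each api-v2 response appears to carry a next_href pointing at the next page of results. A hedged sketch of the cursor-following loop, with the page fetcher injected so the pagination logic stays independent of the HTTP layer (the collection/next_href field names are assumptions based on observed responses):

```javascript
// Follow cursor-based pagination: fetchPage(url) must return a
// parsed response shaped like { collection: [...], next_href: string|null }.
async function collectPaginated(startUrl, fetchPage, maxItems = 200) {
  const items = [];
  let url = startUrl;
  while (url && items.length < maxItems) {
    const page = await fetchPage(url);
    items.push(...(page.collection || []));
    url = page.next_href || null;
  }
  return items.slice(0, maxItems);
}
```

A real fetcher would wrap axios.get and append the client_id query parameter to each request; the injected design also makes the loop trivial to unit-test with canned pages.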
Method 4: Scraping Playlist Data
Playlists require extracting nested track data. Here's a clean approach:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapePlaylist(playlistUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0',
};
const response = await axios.get(
playlistUrl, { headers }
);
const $ = cheerio.load(response.data);
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(
s => s && s.includes('window.__sc_hydration')
);
if (!hydrationScript) {
throw new Error('Hydration data not found');
}
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
const hydrationData = JSON.parse(jsonMatch[1]);
const playlistData = hydrationData.find(
item => item.hydratable === 'playlist'
);
if (!playlistData) {
throw new Error('Playlist data not found');
}
const playlist = playlistData.data;
return {
id: playlist.id,
title: playlist.title,
description: playlist.description,
duration: playlist.duration,
durationFormatted: formatDuration(
playlist.duration
),
trackCount: playlist.track_count,
likeCount: playlist.likes_count,
repostCount: playlist.reposts_count,
createdAt: playlist.created_at,
lastModified: playlist.last_modified,
genre: playlist.genre,
tags: playlist.tag_list,
isAlbum: playlist.is_album,
setType: playlist.set_type,
author: {
username: playlist.user?.username,
displayName: playlist.user?.full_name,
permalink: playlist.user?.permalink,
verified: playlist.user?.verified,
},
tracks: (playlist.tracks || []).map(track => ({
id: track.id,
title: track.title,
duration: track.duration,
durationFormatted: formatDuration(
track.duration
),
playCount: track.playback_count,
likeCount: track.likes_count,
genre: track.genre,
author: track.user?.username,
})),
};
}
// Usage
scrapePlaylist(
'https://soundcloud.com/user/sets/playlist-name'
).then(playlist => {
console.log(`${playlist.title} by @${playlist.author.username}`);
console.log(`${playlist.trackCount} tracks, ${playlist.durationFormatted}`);
console.log(`${playlist.likeCount} likes\n`);
playlist.tracks.forEach((t, i) => {
console.log(
`${i + 1}. ${t.title} (${t.durationFormatted}) - ` +
`${(t.playCount || 0).toLocaleString()} plays`
);
});
});
Scaling with Apify: Production-Ready SoundCloud Scraping
Building and maintaining custom scrapers is time-consuming, especially when SoundCloud changes its frontend, API structure, or anti-bot measures. For production workloads requiring reliability and scale, the Apify platform provides battle-tested infrastructure.
Using SoundCloud Actors from the Apify Store
The Apify Store has pre-built actors for SoundCloud scraping that handle all the complexities we've discussed:
const { ApifyClient } = require('apify-client');
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
async function scrapeSoundCloudWithApify() {
// Run a SoundCloud scraper actor
const run = await client.actor(
'natasha.lekh/soundcloud-scraper'
).call({
startUrls: [
{ url: 'https://soundcloud.com/artist-name' },
{ url: 'https://soundcloud.com/discover/sets/charts-top:all-music' },
],
maxItems: 100,
proxy: {
useApifyProxy: true,
apifyProxyGroups: ['RESIDENTIAL'],
},
});
// Fetch and process results
const { items } = await client.dataset(
run.defaultDatasetId
).listItems();
console.log(`Extracted ${items.length} items`);
// Quick genre distribution analysis
const genreDistribution = {};
items.forEach(item => {
const genre = item.genre || 'Unknown';
genreDistribution[genre] =
(genreDistribution[genre] || 0) + 1;
});
console.log(
'Genre distribution:',
JSON.stringify(genreDistribution, null, 2)
);
return items;
}
scrapeSoundCloudWithApify();
Why Apify for SoundCloud Scraping?
- Maintained Scrapers: The Apify community continuously updates actors when SoundCloud changes its structure or API
- Proxy Rotation: Automatic IP rotation prevents blocks during large-scale operations
- Parallel Execution: Scrape multiple profiles, playlists, and search results simultaneously
- Data Export: Export to JSON, CSV, Excel, or push directly to databases via built-in integrations
- Webhooks: Trigger downstream processes (email alerts, database updates) when scraping completes
- Cost Efficiency: Pay-per-use pricing means you only pay for compute time actually used
- Scheduling: Set up recurring scrapes on any CRON schedule for continuous monitoring
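As a local complement to the platform's export options, a minimal dependency-free CSV serializer covers quick ad-hoc exports of scraped items. This is a naive sketch — it quotes every field and does no streaming, which is fine for small result sets:

```javascript
// Serialize an array of flat objects to CSV: quotes every field
// and escapes embedded double quotes by doubling them.
function toCsv(items, columns) {
  const cols = columns || Object.keys(items[0] || {});
  const escape = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
  const rows = items.map(item => cols.map(c => escape(item[c])).join(','));
  return [cols.map(escape).join(','), ...rows].join('\n');
}

const csv = toCsv(
  [{ title: 'Track "A"', playCount: 42 }],
  ['title', 'playCount']
);
console.log(csv);
// "title","playCount"
// "Track ""A""","42"
```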
Building a SoundCloud Analytics Pipeline
Here's a complete analytics pipeline that scrapes, processes, and generates insights:
const { ApifyClient } = require('apify-client');
class SoundCloudAnalytics {
constructor(apifyToken) {
this.client = new ApifyClient(
{ token: apifyToken }
);
}
async scrapeArtistCatalog(artistUrls) {
const run = await this.client.actor(
'natasha.lekh/soundcloud-scraper'
).call({
startUrls: artistUrls.map(
url => ({ url: `${url}/tracks` })
),
maxItems: 500,
});
const { items } = await this.client.dataset(
run.defaultDatasetId
).listItems();
return items;
}
analyzeGrowthPatterns(tracks) {
const sorted = [...tracks].sort(
(a, b) =>
new Date(a.createdAt) -
new Date(b.createdAt)
);
const windowSize = 5;
return sorted.map((track, i) => {
const window = sorted.slice(
Math.max(0, i - windowSize + 1),
i + 1
);
const avgPlays = window.reduce(
(sum, t) => sum + (t.playCount || 0), 0
) / window.length;
return {
title: track.title,
date: track.createdAt,
plays: track.playCount,
rollingAvgPlays: Math.round(avgPlays),
trend: avgPlays > (track.playCount || 0)
? 'declining' : 'growing',
};
});
}
findBreakoutTracks(tracks, threshold = 3) {
const avgPlays = tracks.reduce(
(sum, t) => sum + (t.playCount || 0), 0
) / tracks.length;
return tracks
.filter(t =>
(t.playCount || 0) > avgPlays * threshold
)
.sort((a, b) =>
(b.playCount || 0) - (a.playCount || 0)
);
}
generateReport(tracks) {
if (!tracks.length) {
return { error: 'No tracks to analyze' };
}
const totalPlays = tracks.reduce(
(sum, t) => sum + (t.playCount || 0), 0
);
const totalLikes = tracks.reduce(
(sum, t) => sum + (t.likeCount || 0), 0
);
const engRate = totalPlays > 0
? (totalLikes / totalPlays * 100).toFixed(2)
: '0.00';
// Genre breakdown
const genres = {};
tracks.forEach(t => {
const g = t.genre || 'Unknown';
genres[g] = (genres[g] || 0) + 1;
});
return {
summary: {
totalTracks: tracks.length,
totalPlays: totalPlays.toLocaleString(),
totalLikes: totalLikes.toLocaleString(),
avgEngagementRate: `${engRate}%`,
avgPlaysPerTrack: Math.round(
totalPlays / tracks.length
).toLocaleString(),
},
genres: Object.entries(genres)
.sort((a, b) => b[1] - a[1]),
breakoutTracks:
this.findBreakoutTracks(tracks),
growthAnalysis:
this.analyzeGrowthPatterns(tracks),
};
}
}
// Usage
const analytics = new SoundCloudAnalytics(
'YOUR_APIFY_TOKEN'
);
analytics.scrapeArtistCatalog([
'https://soundcloud.com/artist1',
'https://soundcloud.com/artist2',
]).then(tracks => {
const report = analytics.generateReport(tracks);
console.log('=== Summary ===');
console.log(JSON.stringify(report.summary, null, 2));
console.log(`\n=== Breakout Tracks (${report.breakoutTracks.length}) ===`);
report.breakoutTracks.forEach(t =>
console.log(` ${t.title}: ${t.playCount.toLocaleString()} plays`)
);
});
Best Practices for SoundCloud Scraping
1. Respect Rate Limits
Always implement polite scraping with delays and backoff:
async function politeRequest(url, options = {}) {
const {
minDelay = 1000,
maxDelay = 3000,
retries = 3,
} = options;
// Random delay to appear more natural
const delay = Math.floor(
Math.random() * (maxDelay - minDelay) + minDelay
);
await new Promise(r => setTimeout(r, delay));
try {
return await axios.get(url, {
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; ' +
'Win64; x64) AppleWebKit/537.36 ' +
'Chrome/120.0.0.0',
},
timeout: 15000,
});
} catch (error) {
if (error.response?.status === 429 && retries > 0) {
const retryAfter =
error.response.headers['retry-after']
|| 60;
console.log(
`Rate limited. Waiting ${retryAfter}s`
);
await new Promise(
r => setTimeout(r, retryAfter * 1000)
);
// Cap retries to avoid looping forever on sustained blocks
return politeRequest(url, {
...options,
retries: retries - 1,
});
}
throw error;
}
}
2. Handle Missing Data Gracefully
SoundCloud data isn't always complete — some fields are optional or user-dependent:
function safeExtract(obj, path, defaultValue = null) {
return path.split('.').reduce(
(acc, key) =>
acc && acc[key] !== undefined
? acc[key]
: defaultValue,
obj
);
}
// Usage
const playCount = safeExtract(
track, 'stats.playback_count', 0
);
const city = safeExtract(
user, 'location.city', 'Unknown'
);
3. Deduplicate Results Across Runs
When scraping across multiple pages or running recurring jobs, deduplication is essential:
function deduplicateTracks(tracks) {
const seen = new Map();
return tracks.filter(track => {
if (seen.has(track.id)) {
const existing = seen.get(track.id);
// Keep the more recently scraped version
if (new Date(track.scrapedAt) >
new Date(existing.scrapedAt)) {
seen.set(track.id, track);
}
return false;
}
seen.set(track.id, track);
return true;
});
}
4. Extracting the Client ID
If you need to make direct API calls, you can extract the client_id from SoundCloud's JavaScript bundles:
async function extractClientId() {
const response = await axios.get(
'https://soundcloud.com',
{
headers: {
'User-Agent':
'Mozilla/5.0 Chrome/120.0.0.0'
}
}
);
// Find script URLs in the page
const scriptUrls = response.data.match(
/https:\/\/a-v2\.sndcdn\.com\/assets\/[^"]+\.js/g
);
if (!scriptUrls) return null;
// The client_id usually lives in one of the last few bundles
for (const scriptUrl of scriptUrls.slice(-3)) {
const scriptResp = await axios.get(scriptUrl);
const match = scriptResp.data.match(
/client_id:"([a-zA-Z0-9]+)"/
);
if (match) return match[1];
}
return null;
}
Legal and Ethical Considerations
As with any web scraping project, operating ethically is paramount:
- Public Data Only: Only scrape publicly available content. Don't attempt to access private tracks, playlists, or profiles.
- Respect robots.txt: Check SoundCloud's robots.txt before scraping specific paths.
- Don't Download Audio: Scraping metadata is fundamentally different from downloading copyrighted audio content. Stick to metadata and engagement metrics.
- Rate Limit Compliance: Don't overwhelm SoundCloud's servers. Use reasonable delays and respect 429 responses.
- Data Privacy: If you collect user data (profiles, comments), comply with GDPR, CCPA, and other applicable privacy regulations.
- Terms of Service: Review SoundCloud's Terms of Service before beginning any scraping project.
- Attribution: If you publish analysis based on SoundCloud data, credit the platform and creators appropriately.
Conclusion
SoundCloud scraping opens up powerful possibilities for music industry analytics, talent discovery, playlist curation, and content monitoring. The platform's embedded hydration data makes it more accessible than many modern SPAs — you can extract structured data from simple HTTP requests without always needing a full headless browser.
For small-scale projects and learning, the Node.js approaches shown in this guide will serve you well. For production workloads requiring reliability, scale, and minimal maintenance, the Apify platform provides purpose-built infrastructure with pre-made actors, automatic proxy management, and built-in scheduling.
Whether you're building a music analytics dashboard, scouting for emerging artists, or researching audio content trends, the combination of smart scraping techniques and robust infrastructure will give you the data foundation you need.
Start with a specific use case, validate your approach on a small dataset, and scale up methodically. The SoundCloud ecosystem has over 300 million tracks worth of data waiting to be analyzed.
Need production-ready SoundCloud scraping? Browse the Apify Store for maintained actors with built-in proxy rotation, pagination handling, and data export capabilities.