SoundCloud remains one of the largest audio platforms in the world, with over 300 million tracks and 76 million monthly active users. For music industry analysts, A&R scouts, content curators, and developers building music-related applications, the ability to extract SoundCloud data programmatically is incredibly valuable.
In this guide, we'll explore SoundCloud's data architecture, walk through practical scraping techniques with code examples, handle pagination challenges, and show you how to use the Apify platform for production-grade data extraction at scale.
Understanding SoundCloud's Data Architecture
SoundCloud organizes its data around several core entities that form a rich, interconnected graph of audio content. Understanding these relationships is key to effective scraping.
Tracks
Each track on SoundCloud contains extensive metadata that's useful for analytics and discovery:
- Track ID: Unique numerical identifier
- Title and description: The track's name and creator's notes
- Duration: Length in milliseconds
- Genre and tags: Classification metadata for discovery
- Play count, likes, reposts, and comments: Engagement metrics
- Waveform data: Visual representation of the audio signal
- Created/uploaded date: When the track was published
- Downloadable flag: Whether the creator allows downloads
- Stream URL: The audio streaming endpoint (requires client_id)
- Artwork URL: Cover image available in various sizes (t500x500, crop, large, etc.)
- License type: Creative Commons or All Rights Reserved
- BPM and key: Musical metadata (when provided by the uploader)
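Artwork URLs follow a predictable size-suffix pattern, so you rarely need to re-scrape to get a different resolution. A small sketch of a suffix-swapping helper — note the `-large` default and `-t500x500` variant are commonly observed conventions, not officially documented:

```javascript
// Swap the size suffix on a SoundCloud artwork URL.
// Assumes the commonly observed pattern where the filename
// ends in a size token such as "-large" or "-t500x500".
function artworkAtSize(url, size = 't500x500') {
  if (!url) return null;
  return url.replace(/-(large|t\d+x\d+|crop|original)(\.\w+)$/, `-${size}$2`);
}

console.log(
  artworkAtSize('https://i1.sndcdn.com/artworks-abc123-large.jpg')
);
// → https://i1.sndcdn.com/artworks-abc123-t500x500.jpg
```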
Artist Profiles (Users)
Creator profiles contain business-critical information for talent scouting and market analysis:
- User ID and permalink: Unique identifiers
- Username and display name
- Bio/description with rich text
- Location (city, country): Geographic data for regional analysis
- Follower and following counts: Social graph metrics
- Track count and playlist count: Content volume indicators
- Verified status and Pro/Pro Unlimited badges: Account tier information
- Website and social links: External presence
- Avatar and banner images: Visual branding assets
- Monthly listener count: Audience reach metric
Playlists (Sets)
Playlists group tracks together and include their own metadata:
- Playlist ID and permalink
- Title, description, and tags
- Track listing with order: The full tracklist
- Total duration: Combined length of all tracks
- Like and repost counts: Engagement on the playlist itself
- Public/private status: Visibility setting
- Created and last modified dates: Activity indicators
- Playlist type: Album, EP, single, compilation, or playlist
Comments
SoundCloud's unique timed-comment system provides engagement data:
- Comment text: The actual message
- Timestamp position: Where in the track the comment was placed
- Author information: Who left the comment
- Creation date: When the comment was posted
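Because every comment carries a millisecond position in the track, you can turn a comment list into an engagement heatmap. A minimal sketch, assuming comments have already been extracted with a `timestamp` field in milliseconds (the field name mirrors what the hydration and API payloads appear to expose):

```javascript
// Group timed comments into fixed buckets (default 30s) to see
// which parts of a track attract the most engagement.
function commentHeatmap(comments, bucketMs = 30000) {
  const buckets = {};
  for (const c of comments) {
    const bucket = Math.floor((c.timestamp || 0) / bucketMs);
    buckets[bucket] = (buckets[bucket] || 0) + 1;
  }
  return Object.entries(buckets)
    .map(([bucket, count]) => ({
      startMs: Number(bucket) * bucketMs,
      count,
    }))
    .sort((a, b) => b.count - a.count);
}

const heat = commentHeatmap([
  { timestamp: 1000 },
  { timestamp: 2000 },
  { timestamp: 61000 },
]);
console.log(heat[0]); // the busiest 30s window comes first
```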
Why Scrape SoundCloud Data?
There are many legitimate and valuable reasons to extract SoundCloud data:
- A&R and Talent Scouting: Discovering emerging artists based on engagement growth, genre performance, and audience metrics before they blow up
- Market Analysis: Understanding genre trends, popular sounds, release patterns, and audience preferences across regions
- Playlist Curation: Building data-driven playlists based on track metrics, genre classification, and BPM matching
- Music Industry Research: Academic studies on music distribution, consumption patterns, and platform economics
- Competitive Analysis: Tracking competitor labels, monitoring artist release strategies, and benchmarking performance
- Content Monitoring: Detecting unauthorized uploads of copyrighted content or tracking remix culture
- Recommendation Engines: Building music discovery tools based on listening patterns and engagement signals
SoundCloud's Technical Landscape
Before writing any code, let's understand the technical environment.
The SoundCloud API Situation
SoundCloud officially closed public API registration in 2017. While the documented API endpoints still exist and function with valid client IDs, obtaining a new client ID through official channels is no longer possible. However, the internal API (api-v2.soundcloud.com) that powers the web client remains accessible if you can extract a valid client_id from the page's JavaScript bundles.
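Once you have a client_id, the web client appears to map public permalink URLs to entity JSON via a resolve endpoint. A hedged sketch of building that request URL — the path and parameters are undocumented internals and may change without notice, and `YOUR_CLIENT_ID` is a placeholder:

```javascript
// Build an api-v2 request URL for a public permalink. The /resolve
// endpoint is what the web client appears to use to map a URL to
// its track/user/playlist JSON; treat the path as an assumption.
function buildResolveUrl(permalinkUrl, clientId) {
  const params = new URLSearchParams({
    url: permalinkUrl,
    client_id: clientId, // placeholder — extract from page source
  });
  return `https://api-v2.soundcloud.com/resolve?${params}`;
}

console.log(
  buildResolveUrl('https://soundcloud.com/artist/track-name', 'YOUR_CLIENT_ID')
);
```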
Hydration Data: The Scraper's Best Friend
SoundCloud uses a hybrid rendering approach that's actually favorable for scraping. Initial page loads include server-rendered HTML with embedded JSON data via window.__sc_hydration. This means you can often extract structured data from a simple HTTP request without needing a headless browser — a huge advantage over platforms like TikTok or Instagram.
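To make the structure concrete, here is a simplified sketch of what the hydration array looks like and how to pull an entity out of it. The exact payload varies by page type, and the sample values below are invented for illustration:

```javascript
// Simplified shape of window.__sc_hydration: an array of
// { hydratable, data } entries, one per entity on the page.
const hydration = [
  { hydratable: 'anonymousId', data: 'abc-123' },
  {
    hydratable: 'sound',
    data: { id: 123456789, title: 'Example Track', playback_count: 42 },
  },
];

// Pick an entity by its hydratable type, the same lookup the
// scraping functions in this guide perform.
function pickHydratable(hydrationData, type) {
  return hydrationData.find(item => item.hydratable === type)?.data ?? null;
}

console.log(pickHydratable(hydration, 'sound').title); // → Example Track
```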
Rate Limiting
SoundCloud implements rate limiting on both its web pages and internal API endpoints. You'll typically see 429 responses after sustained high-volume requests. Their limits are more generous than most social platforms, but aggressive scraping will still get you blocked.
Method 1: Extracting Hydration Data from SoundCloud Pages
The most efficient scraping approach leverages SoundCloud's embedded hydration data. No headless browser needed:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeSoundCloudTrack(trackUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
'Accept-Language': 'en-US,en;q=0.9',
};
try {
const response = await axios.get(
trackUrl, { headers }
);
const $ = cheerio.load(response.data);
// SoundCloud embeds structured data in script tags
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(s =>
s && s.includes('window.__sc_hydration')
);
if (!hydrationScript) {
throw new Error(
'Could not find hydration data'
);
}
// Extract the JSON array from the script
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
if (!jsonMatch) {
throw new Error(
'Could not parse hydration JSON'
);
}
const hydrationData = JSON.parse(jsonMatch[1]);
// Find the track data in the hydration array
const trackData = hydrationData.find(
item => item.hydratable === 'sound'
);
if (!trackData) {
throw new Error('Track data not found');
}
const track = trackData.data;
return {
id: track.id,
title: track.title,
description: track.description,
duration: track.duration,
durationFormatted: formatDuration(
track.duration
),
genre: track.genre,
tags: track.tag_list,
playCount: track.playback_count,
likeCount: track.likes_count,
repostCount: track.reposts_count,
commentCount: track.comment_count,
createdAt: track.created_at,
artworkUrl: track.artwork_url,
waveformUrl: track.waveform_url,
downloadable: track.downloadable,
license: track.license,
author: {
id: track.user?.id,
username: track.user?.username,
displayName: track.user?.full_name,
permalink: track.user?.permalink,
avatarUrl: track.user?.avatar_url,
verified: track.user?.verified,
followers: track.user?.followers_count,
},
};
} catch (error) {
console.error(`Scraping error: ${error.message}`);
return null;
}
}
function formatDuration(ms) {
const minutes = Math.floor(ms / 60000);
const seconds = Math.floor((ms % 60000) / 1000);
return `${minutes}:${seconds.toString().padStart(2, '0')}`;
}
// Usage
scrapeSoundCloudTrack(
'https://soundcloud.com/artist/track-name'
).then(data => {
if (data) {
console.log(`${data.title} by ${data.author.displayName}`);
console.log(`${data.playCount.toLocaleString()} plays`);
console.log(`Duration: ${data.durationFormatted}`);
}
});
This approach is fast and efficient because it only requires a single HTTP request per page — no browser rendering overhead.
Method 2: Scraping Artist Profiles and Their Catalog
Extracting a complete artist profile follows the same hydration pattern:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeArtistProfile(artistUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0',
};
const response = await axios.get(artistUrl, { headers });
const $ = cheerio.load(response.data);
// Extract hydration data
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(
s => s && s.includes('window.__sc_hydration')
);
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
const hydrationData = JSON.parse(jsonMatch[1]);
// Extract user profile data
const userData = hydrationData.find(
item => item.hydratable === 'user'
);
const user = userData.data;
const profile = {
id: user.id,
username: user.username,
displayName: user.full_name,
permalink: user.permalink,
bio: user.description,
location: {
city: user.city,
country: user.country_code,
},
stats: {
followers: user.followers_count,
following: user.followings_count,
tracks: user.track_count,
playlists: user.playlist_count,
likes: user.likes_count,
},
verified: user.verified,
avatarUrl: user.avatar_url?.replace(
'-large', '-t500x500'
),
bannerUrl: user.visuals?.visuals?.[0]?.visual_url,
website: user.website,
createdAt: user.created_at,
lastModified: user.last_modified,
};
// Also extract initial track listing if available
const trackListing = hydrationData.find(
item => item.hydratable === 'playlist'
);
if (trackListing?.data?.tracks) {
profile.recentTracks = trackListing.data.tracks
.slice(0, 10)
.map(t => ({
id: t.id,
title: t.title,
plays: t.playback_count,
likes: t.likes_count,
duration: formatDuration(t.duration),
}));
}
return profile;
}
// Usage
scrapeArtistProfile(
'https://soundcloud.com/artist-name'
).then(profile => {
console.log(`${profile.displayName} (@${profile.username})`);
console.log(`${profile.stats.followers.toLocaleString()} followers`);
console.log(`${profile.stats.tracks} tracks`);
console.log(`Location: ${profile.location.city || 'Unknown'}`);
});
Method 3: Handling Pagination for Large Track Catalogs
SoundCloud uses cursor-based pagination for its internal API. When an artist has hundreds of tracks, you need to handle this pagination properly:
const puppeteer = require('puppeteer');
async function scrapeAllTracks(artistUrl, maxTracks = 200) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
const tracks = [];
const seenIds = new Set();
// Intercept API calls to capture paginated data
page.on('response', async (response) => {
const url = response.url();
if (url.includes('api-v2.soundcloud.com') &&
(url.includes('/tracks') ||
url.includes('/stream') ||
url.includes('collection'))) {
try {
const data = await response.json();
const collection = data.collection || [];
collection.forEach(item => {
// Handle both direct tracks and
// wrapped track objects
const track = item.track || item;
if (track.id &&
track.title &&
!seenIds.has(track.id)) {
seenIds.add(track.id);
tracks.push({
id: track.id,
title: track.title,
duration: track.duration,
genre: track.genre,
tags: track.tag_list,
playCount: track.playback_count,
likeCount: track.likes_count,
repostCount: track.reposts_count,
commentCount: track.comment_count,
createdAt: track.created_at,
artworkUrl: track.artwork_url,
permalink: track.permalink_url,
});
}
});
} catch (e) {
// Not a JSON response, skip
}
}
});
await page.goto(`${artistUrl}/tracks`, {
waitUntil: 'networkidle2'
});
// Scroll to trigger pagination API calls
let previousCount = 0;
let staleRounds = 0;
const maxStaleRounds = 5;
while (tracks.length < maxTracks &&
staleRounds < maxStaleRounds) {
await page.evaluate(() => {
window.scrollTo(
0, document.body.scrollHeight
);
});
await new Promise(r => setTimeout(r, 2500));
if (tracks.length === previousCount) {
staleRounds++;
} else {
staleRounds = 0;
// Progress logging whenever new tracks arrived
console.log(
`Collected ${tracks.length} tracks...`
);
}
previousCount = tracks.length;
}
await browser.close();
return tracks.slice(0, maxTracks);
}
// Usage
scrapeAllTracks(
'https://soundcloud.com/artist-name', 500
).then(tracks => {
console.log(`Scraped ${tracks.length} tracks total`);
// Sort by play count to find top performers
const topTracks = [...tracks]
.sort((a, b) =>
(b.playCount || 0) - (a.playCount || 0)
)
.slice(0, 10);
console.log('\nTop 10 tracks by plays:');
topTracks.forEach((t, i) => {
console.log(
`${i + 1}. ${t.title} - ` +
`${(t.playCount || 0).toLocaleString()} plays`
);
});
});
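If you have a valid client_id, you can skip the browser entirely: each api-v2 response appears to carry a next_href pointing at the next page of results. A hedged sketch of the cursor-following loop, with the page fetcher injected so the pagination logic stays independent of the HTTP layer (the collection/next_href field names are assumptions based on observed responses):

```javascript
// Follow cursor-based pagination: fetchPage(url) must return a
// parsed response shaped like { collection: [...], next_href: string|null }.
async function collectPaginated(startUrl, fetchPage, maxItems = 200) {
  const items = [];
  let url = startUrl;
  while (url && items.length < maxItems) {
    const page = await fetchPage(url);
    items.push(...(page.collection || []));
    url = page.next_href || null;
  }
  return items.slice(0, maxItems);
}
```

A real fetcher would wrap axios.get and append the client_id query parameter to each request; the injected design also makes the loop trivial to unit-test with canned pages.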
Method 4: Scraping Playlist Data
Playlists require extracting nested track data. Here's a clean approach:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapePlaylist(playlistUrl) {
const headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
'AppleWebKit/537.36 Chrome/120.0.0.0',
};
const response = await axios.get(
playlistUrl, { headers }
);
const $ = cheerio.load(response.data);
const scripts = $('script')
.map((_, el) => $(el).html())
.get();
const hydrationScript = scripts.find(
s => s && s.includes('window.__sc_hydration')
);
if (!hydrationScript) {
throw new Error('Hydration data not found');
}
const jsonMatch = hydrationScript.match(
/window\.__sc_hydration\s*=\s*(\[.*?\]);/s
);
const hydrationData = JSON.parse(jsonMatch[1]);
const playlistData = hydrationData.find(
item => item.hydratable === 'playlist'
);
if (!playlistData) {
throw new Error('Playlist data not found');
}
const playlist = playlistData.data;
return {
id: playlist.id,
title: playlist.title,
description: playlist.description,
duration: playlist.duration,
durationFormatted: formatDuration(
playlist.duration
),
trackCount: playlist.track_count,
likeCount: playlist.likes_count,
repostCount: playlist.reposts_count,
createdAt: playlist.created_at,
lastModified: playlist.last_modified,
genre: playlist.genre,
tags: playlist.tag_list,
isAlbum: playlist.is_album,
setType: playlist.set_type,
author: {
username: playlist.user?.username,
displayName: playlist.user?.full_name,
permalink: playlist.user?.permalink,
verified: playlist.user?.verified,
},
tracks: (playlist.tracks || []).map(track => ({
id: track.id,
title: track.title,
duration: track.duration,
durationFormatted: formatDuration(
track.duration
),
playCount: track.playback_count,
likeCount: track.likes_count,
genre: track.genre,
author: track.user?.username,
})),
};
}
// Usage
scrapePlaylist(
'https://soundcloud.com/user/sets/playlist-name'
).then(playlist => {
console.log(`${playlist.title} by @${playlist.author.username}`);
console.log(`${playlist.trackCount} tracks, ${playlist.durationFormatted}`);
console.log(`${playlist.likeCount} likes\n`);
playlist.tracks.forEach((t, i) => {
console.log(
`${i + 1}. ${t.title} (${t.durationFormatted}) - ` +
`${(t.playCount || 0).toLocaleString()} plays`
);
});
});
Scaling with Apify: Production-Ready SoundCloud Scraping
Building and maintaining custom scrapers is time-consuming, especially when SoundCloud changes its frontend, API structure, or anti-bot measures. For production workloads requiring reliability and scale, the Apify platform provides battle-tested infrastructure.
Using SoundCloud Actors from the Apify Store
The Apify Store has pre-built actors for SoundCloud scraping that handle all the complexities we've discussed:
const { ApifyClient } = require('apify-client');
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
async function scrapeSoundCloudWithApify() {
// Run a SoundCloud scraper actor
const run = await client.actor(
'natasha.lekh/soundcloud-scraper'
).call({
startUrls: [
{ url: 'https://soundcloud.com/artist-name' },
{ url: 'https://soundcloud.com/discover/sets/charts-top:all-music' },
],
maxItems: 100,
proxy: {
useApifyProxy: true,
apifyProxyGroups: ['RESIDENTIAL'],
},
});
// Fetch and process results
const { items } = await client.dataset(
run.defaultDatasetId
).listItems();
console.log(`Extracted ${items.length} items`);
// Quick genre distribution analysis
const genreDistribution = {};
items.forEach(item => {
const genre = item.genre || 'Unknown';
genreDistribution[genre] =
(genreDistribution[genre] || 0) + 1;
});
console.log(
'Genre distribution:',
JSON.stringify(genreDistribution, null, 2)
);
return items;
}
scrapeSoundCloudWithApify();
Why Apify for SoundCloud Scraping?
- Maintained Scrapers: The Apify community continuously updates actors when SoundCloud changes its structure or API
- Proxy Rotation: Automatic IP rotation prevents blocks during large-scale operations
- Parallel Execution: Scrape multiple profiles, playlists, and search results simultaneously
- Data Export: Export to JSON, CSV, Excel, or push directly to databases via built-in integrations
- Webhooks: Trigger downstream processes (email alerts, database updates) when scraping completes
- Cost Efficiency: Pay-per-use pricing means you only pay for compute time actually used
- Scheduling: Set up recurring scrapes on any CRON schedule for continuous monitoring
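As a local complement to the platform's export options, a minimal dependency-free CSV serializer covers quick ad-hoc exports of scraped items. This is a naive sketch — it quotes every field and does no streaming, which is fine for small result sets:

```javascript
// Serialize an array of flat objects to CSV: quotes every field
// and escapes embedded double quotes by doubling them.
function toCsv(items, columns) {
  const cols = columns || Object.keys(items[0] || {});
  const escape = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
  const rows = items.map(item => cols.map(c => escape(item[c])).join(','));
  return [cols.map(escape).join(','), ...rows].join('\n');
}

const csv = toCsv(
  [{ title: 'Track "A"', playCount: 42 }],
  ['title', 'playCount']
);
console.log(csv);
// "title","playCount"
// "Track ""A""","42"
```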
Building a SoundCloud Analytics Pipeline
Here's a complete analytics pipeline that scrapes, processes, and generates insights:
const { ApifyClient } = require('apify-client');
class SoundCloudAnalytics {
constructor(apifyToken) {
this.client = new ApifyClient(
{ token: apifyToken }
);
}
async scrapeArtistCatalog(artistUrls) {
const run = await this.client.actor(
'natasha.lekh/soundcloud-scraper'
).call({
startUrls: artistUrls.map(
url => ({ url: `${url}/tracks` })
),
maxItems: 500,
});
const { items } = await this.client.dataset(
run.defaultDatasetId
).listItems();
return items;
}
analyzeGrowthPatterns(tracks) {
const sorted = [...tracks].sort(
(a, b) =>
new Date(a.createdAt) -
new Date(b.createdAt)
);
const windowSize = 5;
return sorted.map((track, i) => {
const window = sorted.slice(
Math.max(0, i - windowSize + 1),
i + 1
);
const avgPlays = window.reduce(
(sum, t) => sum + (t.playCount || 0), 0
) / window.length;
return {
title: track.title,
date: track.createdAt,
plays: track.playCount,
rollingAvgPlays: Math.round(avgPlays),
trend: avgPlays > (track.playCount || 0)
? 'declining' : 'growing',
};
});
}
findBreakoutTracks(tracks, threshold = 3) {
const avgPlays = tracks.reduce(
(sum, t) => sum + (t.playCount || 0), 0
) / tracks.length;
return tracks
.filter(t =>
(t.playCount || 0) > avgPlays * threshold
)
.sort((a, b) =>
(b.playCount || 0) - (a.playCount || 0)
);
}
generateReport(tracks) {
if (!tracks.length) {
return { error: 'No tracks to analyze' };
}
const totalPlays = tracks.reduce(
(sum, t) => sum + (t.playCount || 0), 0
);
const totalLikes = tracks.reduce(
(sum, t) => sum + (t.likeCount || 0), 0
);
const engRate = totalPlays > 0
? (totalLikes / totalPlays * 100).toFixed(2)
: '0.00';
// Genre breakdown
const genres = {};
tracks.forEach(t => {
const g = t.genre || 'Unknown';
genres[g] = (genres[g] || 0) + 1;
});
return {
summary: {
totalTracks: tracks.length,
totalPlays: totalPlays.toLocaleString(),
totalLikes: totalLikes.toLocaleString(),
avgEngagementRate: `${engRate}%`,
avgPlaysPerTrack: Math.round(
totalPlays / tracks.length
).toLocaleString(),
},
genres: Object.entries(genres)
.sort((a, b) => b[1] - a[1]),
breakoutTracks:
this.findBreakoutTracks(tracks),
growthAnalysis:
this.analyzeGrowthPatterns(tracks),
};
}
}
// Usage
const analytics = new SoundCloudAnalytics(
'YOUR_APIFY_TOKEN'
);
analytics.scrapeArtistCatalog([
'https://soundcloud.com/artist1',
'https://soundcloud.com/artist2',
]).then(tracks => {
const report = analytics.generateReport(tracks);
console.log('=== Summary ===');
console.log(JSON.stringify(report.summary, null, 2));
console.log(`\n=== Breakout Tracks (${report.breakoutTracks.length}) ===`);
report.breakoutTracks.forEach(t =>
console.log(` ${t.title}: ${t.playCount.toLocaleString()} plays`)
);
});
Best Practices for SoundCloud Scraping
1. Respect Rate Limits
Always implement polite scraping with delays and backoff:
async function politeRequest(url, options = {}) {
const {
minDelay = 1000,
maxDelay = 3000,
retries = 3,
} = options;
// Random delay to appear more natural
const delay = Math.floor(
Math.random() * (maxDelay - minDelay) + minDelay
);
await new Promise(r => setTimeout(r, delay));
try {
return await axios.get(url, {
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; ' +
'Win64; x64) AppleWebKit/537.36 ' +
'Chrome/120.0.0.0',
},
timeout: 15000,
});
} catch (error) {
if (error.response?.status === 429 && retries > 0) {
const retryAfter =
error.response.headers['retry-after']
|| 60;
console.log(
`Rate limited. Waiting ${retryAfter}s`
);
await new Promise(
r => setTimeout(r, retryAfter * 1000)
);
// Cap retries to avoid looping forever on sustained blocks
return politeRequest(url, {
...options,
retries: retries - 1,
});
}
throw error;
}
}
2. Handle Missing Data Gracefully
SoundCloud data isn't always complete — some fields are optional or user-dependent:
function safeExtract(obj, path, defaultValue = null) {
return path.split('.').reduce(
(acc, key) =>
acc && acc[key] !== undefined
? acc[key]
: defaultValue,
obj
);
}
// Usage
const playCount = safeExtract(
track, 'stats.playback_count', 0
);
const city = safeExtract(
user, 'location.city', 'Unknown'
);
3. Deduplicate Results Across Runs
When scraping across multiple pages or running recurring jobs, deduplication is essential:
function deduplicateTracks(tracks) {
const seen = new Map();
return tracks.filter(track => {
if (seen.has(track.id)) {
const existing = seen.get(track.id);
// Keep the more recently scraped version
if (new Date(track.scrapedAt) >
new Date(existing.scrapedAt)) {
seen.set(track.id, track);
}
return false;
}
seen.set(track.id, track);
return true;
});
}
4. Extracting the Client ID
If you need to make direct API calls, you can extract the client_id from SoundCloud's JavaScript bundles:
async function extractClientId() {
const response = await axios.get(
'https://soundcloud.com',
{
headers: {
'User-Agent':
'Mozilla/5.0 Chrome/120.0.0.0'
}
}
);
// Find script URLs in the page
const scriptUrls = response.data.match(
/https:\/\/a-v2\.sndcdn\.com\/assets\/[^"]+\.js/g
);
if (!scriptUrls) return null;
// The client_id usually lives in one of the last few bundles
for (const scriptUrl of scriptUrls.slice(-3)) {
const scriptResp = await axios.get(scriptUrl);
const match = scriptResp.data.match(
/client_id:"([a-zA-Z0-9]+)"/
);
if (match) return match[1];
}
return null;
}
Legal and Ethical Considerations
As with any web scraping project, operating ethically is paramount:
- Public Data Only: Only scrape publicly available content. Don't attempt to access private tracks, playlists, or profiles.
- Respect robots.txt: Check SoundCloud's robots.txt before scraping specific paths.
- Don't Download Audio: Scraping metadata is fundamentally different from downloading copyrighted audio content. Stick to metadata and engagement metrics.
- Rate Limit Compliance: Don't overwhelm SoundCloud's servers. Use reasonable delays and respect 429 responses.
- Data Privacy: If you collect user data (profiles, comments), comply with GDPR, CCPA, and other applicable privacy regulations.
- Terms of Service: Review SoundCloud's Terms of Service before beginning any scraping project.
- Attribution: If you publish analysis based on SoundCloud data, credit the platform and creators appropriately.
Conclusion
SoundCloud scraping opens up powerful possibilities for music industry analytics, talent discovery, playlist curation, and content monitoring. The platform's embedded hydration data makes it more accessible than many modern SPAs — you can extract structured data from simple HTTP requests without always needing a full headless browser.
For small-scale projects and learning, the Node.js approaches shown in this guide will serve you well. For production workloads requiring reliability, scale, and minimal maintenance, the Apify platform provides purpose-built infrastructure with pre-made actors, automatic proxy management, and built-in scheduling.
Whether you're building a music analytics dashboard, scouting for emerging artists, or researching audio content trends, the combination of smart scraping techniques and robust infrastructure will give you the data foundation you need.
Start with a specific use case, validate your approach on a small dataset, and scale up methodically. The SoundCloud ecosystem has over 300 million tracks worth of data waiting to be analyzed.
Need production-ready SoundCloud scraping? Browse the Apify Store for maintained actors with built-in proxy rotation, pagination handling, and data export capabilities.