Medium has become one of the most influential content platforms on the web, hosting millions of articles across technology, business, science, self-improvement, and countless other topics. For researchers, content analysts, competitive intelligence teams, and data scientists, Medium represents a goldmine of structured knowledge waiting to be extracted.
In this comprehensive guide, we'll walk through how to scrape Medium effectively — covering article metadata, publication pages, clap counts, author profiles, and tag-based discovery. We'll look at Medium's underlying structure, write practical code examples in both Python and Node.js, and show how to scale your scraping with Apify.
Understanding Medium's Structure
Before writing a single line of scraping code, it helps to understand how Medium organizes its content. Medium's architecture revolves around several core entities:
Articles (Posts): The fundamental content unit. Each article has a unique URL slug, title, subtitle, body content (in a custom markup format), publication date, reading time, and engagement metrics (claps, responses).
Publications: Curated collections of articles managed by editorial teams. Publications like Better Programming, Towards Data Science, and The Startup aggregate content around specific themes. Each publication has its own homepage, archive, and contributor list.
Authors (Users): Individual writers with profile pages showing their bio, follower count, published articles, and engagement history.
Tags: Medium's categorization system. Tags like "machine-learning," "javascript," or "startup" group related articles together and power Medium's recommendation engine.
Responses: Medium's comment system, where responses are themselves full articles linked to a parent post.
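To keep these relationships straight in code, it can help to model them explicitly before scraping anything. A minimal sketch in Python (the class and field names here are our own, not Medium's internal schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Author:
    username: str
    name: Optional[str] = None
    follower_count: Optional[int] = None

@dataclass
class Article:
    url: str
    title: str
    author: Author
    publication: Optional[str] = None   # e.g. "towards-data-science"
    tags: List[str] = field(default_factory=list)
    claps: Optional[int] = None
    reading_time_min: Optional[float] = None

# Mirrors Medium's design: a response is itself a full article
# linked to a parent post.
@dataclass
class Response(Article):
    parent_url: str = ""
```

Modeling `Response` as a subclass of `Article` reflects the fact that Medium responses carry the same metadata as any other post.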
Medium's URL Patterns
Understanding URL patterns is crucial for building efficient scrapers:
# Article URL patterns
https://medium.com/@username/article-slug-hexid
https://medium.com/publication-name/article-slug-hexid
https://username.medium.com/article-slug-hexid
# Author profile
https://medium.com/@username
# Publication homepage
https://medium.com/publication-name
# Tag page
https://medium.com/tag/javascript
# Publication archive
https://medium.com/publication-name/archive
Each article URL ends with a hexadecimal ID (typically 12 characters) that uniquely identifies the post in Medium's database.
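That trailing ID makes article URLs easy to parse programmatically. Here's a sketch of a parser for the patterns above (`parse_article_url` and its regex are our own, so expect edge cases on unusual slugs):

```python
import re

# Matches the three article URL shapes: medium.com/@user/...,
# medium.com/publication/..., and user.medium.com/...
ARTICLE_RE = re.compile(
    r'^https://(?:medium\.com/(@?[\w.-]+)|([\w-]+)\.medium\.com)'
    r'/([\w-]+)-([0-9a-f]{8,12})/?$'
)

def parse_article_url(url):
    """Split a Medium article URL into owner, slug, and hex post ID."""
    m = ARTICLE_RE.match(url)
    if not m:
        return None
    owner = m.group(1) or m.group(2)   # @user, publication slug, or subdomain
    return {'owner': owner, 'slug': m.group(3), 'post_id': m.group(4)}
```

This is handy both for deduplicating articles (two different URLs can point at the same post ID) and for filtering article links out of a page full of navigation links.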
Extracting Article Metadata
Articles are where the richest data lives. Here's what you can extract from a typical Medium article page:
- Title and subtitle
- Author name and profile URL
- Publication name (if published under one)
- Publication date and last modified date
- Reading time estimate
- Clap count (Medium's equivalent of likes)
- Response count
- Tags/topics
- Article body (text, images, embeds)
- Canonical URL
Python Example: Scraping Article Data
import requests
from bs4 import BeautifulSoup
import json
import re
def scrape_medium_article(url):
    """Extract metadata and content from a Medium article."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/120.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch article: {response.status_code}")

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract structured data from JSON-LD
    script_tags = soup.find_all('script', type='application/ld+json')
    structured_data = {}
    for script in script_tags:
        try:
            data = json.loads(script.string)
            if data.get('@type') == 'Article':
                structured_data = data
                break
        except (json.JSONDecodeError, TypeError):
            continue

    # Extract Open Graph metadata as fallback
    og_title = soup.find('meta', property='og:title')
    og_description = soup.find('meta', property='og:description')
    og_image = soup.find('meta', property='og:image')

    # Extract article body paragraphs
    article_body = soup.find('article')
    paragraphs = []
    if article_body:
        for p in article_body.find_all(['p', 'h1', 'h2', 'h3', 'h4']):
            text = p.get_text(strip=True)
            if text:
                paragraphs.append({
                    'tag': p.name,
                    'text': text
                })

    # Extract clap count from page data
    clap_count = None
    clap_button = soup.find('button', {'data-testid': 'headerClapButton'})
    if clap_button:
        clap_text = clap_button.get_text(strip=True)
        clap_match = re.search(r'[\d,.]+[KkMm]?', clap_text)
        if clap_match:
            clap_count = clap_match.group()

    # Build the result
    result = {
        'url': url,
        'title': structured_data.get('headline')
                 or (og_title['content'] if og_title else None),
        'description': structured_data.get('description')
                       or (og_description['content'] if og_description else None),
        'image': og_image['content'] if og_image else None,
        'author': structured_data.get('author', {}).get('name'),
        'date_published': structured_data.get('datePublished'),
        'date_modified': structured_data.get('dateModified'),
        'publisher': structured_data.get('publisher', {}).get('name'),
        'claps': clap_count,
        'word_count': sum(len(p['text'].split()) for p in paragraphs),
        'content': paragraphs
    }
    return result


# Usage
article = scrape_medium_article(
    'https://medium.com/@example/sample-article-abc123def456'
)
print(json.dumps(article, indent=2))
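Note that the clap count comes back as display text such as "1.2K" or "3M". If you want numbers you can aggregate, a small normalizer helps (a sketch; `normalize_count` is our own helper and assumes Medium's K/M abbreviations):

```python
def normalize_count(text):
    """Convert display counts like '1.2K', '3M', or '12,400' to integers."""
    if not text:
        return None
    text = text.strip().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    suffix = text[-1].lower()
    if suffix in multipliers:
        # '1.2K' -> 1.2 * 1000 -> 1200
        return int(float(text[:-1]) * multipliers[suffix])
    try:
        return int(float(text))
    except ValueError:
        return None
```

Abbreviated counts lose precision by design ("1.2K" could be anything from 1,150 to 1,249), so treat normalized values as approximations.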
Node.js Example: Scraping with Cheerio
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeMediumArticle(url) {
    const { data: html } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
        }
    });
    const $ = cheerio.load(html);

    // Parse JSON-LD structured data
    let structuredData = {};
    $('script[type="application/ld+json"]').each((_, el) => {
        try {
            const data = JSON.parse($(el).html());
            if (data['@type'] === 'Article') {
                structuredData = data;
            }
        } catch (e) { /* skip invalid JSON */ }
    });

    // Extract reading time from meta tags
    const readingTime = $('meta[name="twitter:data1"]').attr('content');

    // Extract tags from meta keywords
    const keywords = $('meta[name="keywords"]').attr('content');
    const tags = keywords ? keywords.split(',').map(t => t.trim()) : [];

    // Extract all article paragraphs
    const content = [];
    $('article p, article h1, article h2, article h3').each((_, el) => {
        const text = $(el).text().trim();
        if (text) {
            content.push({
                tag: el.tagName,
                text: text
            });
        }
    });

    return {
        url,
        title: structuredData.headline
            || $('meta[property="og:title"]').attr('content'),
        description: structuredData.description
            || $('meta[property="og:description"]').attr('content'),
        author: structuredData.author?.name || null,
        datePublished: structuredData.datePublished || null,
        dateModified: structuredData.dateModified || null,
        publisher: structuredData.publisher?.name || null,
        readingTime: readingTime || null,
        tags,
        wordCount: content.reduce(
            (sum, p) => sum + p.text.split(/\s+/).length, 0
        ),
        content
    };
}

// Usage
scrapeMediumArticle('https://medium.com/@example/sample-article-abc123')
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(console.error);
Scraping Publication Pages
Medium publications are powerful aggregation points. A single publication like Towards Data Science contains thousands of articles organized by topic, making it perfect for bulk data collection.
Publication Archive Strategy
The most efficient way to collect articles from a publication is through its archive pages. Medium provides monthly archive pages that list all articles published in a given month:
https://medium.com/towards-data-science/archive/2025/01
https://medium.com/better-programming/archive/2024/12
import re

import requests
from bs4 import BeautifulSoup
from datetime import datetime

def scrape_publication_archive(publication_slug, year, month):
    """Scrape all articles from a publication's monthly archive."""
    url = f"https://medium.com/{publication_slug}/archive/{year}/{month:02d}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []

    # Find all article links in the archive
    for article_link in soup.find_all('a', href=True):
        href = article_link['href']
        # Medium article URLs end with a hex ID; strip any query string
        # first so tracking parameters don't interfere with the check
        path = href.split('?')[0]
        if f'/{publication_slug}/' in path and re.search(r'-[0-9a-f]{8,12}$', path):
            title_el = article_link.find(['h2', 'h3'])
            if title_el:
                articles.append({
                    'title': title_el.get_text(strip=True),
                    'url': path if path.startswith('http')
                           else f"https://medium.com{path}",
                    'publication': publication_slug,
                    'archive_month': f"{year}-{month:02d}"
                })

    # Deduplicate by URL
    seen = set()
    unique_articles = []
    for article in articles:
        if article['url'] not in seen:
            seen.add(article['url'])
            unique_articles.append(article)
    return unique_articles

def scrape_publication_range(publication_slug, start_year,
                             start_month, end_year, end_month):
    """Scrape a publication archive across multiple months."""
    all_articles = []
    current = datetime(start_year, start_month, 1)
    end = datetime(end_year, end_month, 1)
    while current <= end:
        print(f"Scraping {publication_slug} - {current.strftime('%Y-%m')}")
        articles = scrape_publication_archive(
            publication_slug, current.year, current.month
        )
        all_articles.extend(articles)
        # Move to the next month
        if current.month == 12:
            current = current.replace(year=current.year + 1, month=1)
        else:
            current = current.replace(month=current.month + 1)
    return all_articles
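The month-rollover logic at the end of that loop is easy to get wrong, so it's worth factoring into a standalone generator you can test in isolation (`iter_months` is our own helper, not part of the code above):

```python
def iter_months(start_year, start_month, end_year, end_month):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield year, month
        # Roll over to January of the next year after December
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1

# Example: iterating across a year boundary
months = list(iter_months(2024, 11, 2025, 2))
```

With this in place, `scrape_publication_range` reduces to a simple `for year, month in iter_months(...)` loop.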
Extracting Author Profiles
Author data is valuable for influencer analysis, content sourcing, and understanding expertise distribution across topics.
def scrape_author_profile(username):
    """Extract author profile data from Medium."""
    url = f"https://medium.com/@{username}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract profile metadata
    name = soup.find('h2')
    bio = soup.find('meta', property='og:description')

    # Find follower count in page text
    follower_text = None
    for span in soup.find_all('span'):
        text = span.get_text()
        if 'Follower' in text:
            follower_text = text
            break

    # Collect recent article links
    recent_articles = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if f'@{username}/' in href or f'{username}.medium.com' in href:
            h_tag = link.find(['h2', 'h3'])
            if h_tag:
                recent_articles.append({
                    'title': h_tag.get_text(strip=True),
                    'url': href if href.startswith('http')
                           else f"https://medium.com{href}"
                })

    return {
        'username': username,
        'name': name.get_text(strip=True) if name else None,
        'bio': bio['content'] if bio else None,
        'followers': follower_text,
        'profile_url': url,
        'recent_articles': recent_articles[:10]
    }
Tag-Based Discovery
Medium's tag system is one of the best ways to discover content in specific niches. Each tag page shows trending and recent articles for that topic.
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeTagPage(tag) {
    const url = `https://medium.com/tag/${tag}`;
    const { data: html } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
        }
    });
    const $ = cheerio.load(html);
    const articles = [];

    // Extract articles listed under the tag
    $('article').each((_, el) => {
        const titleEl = $(el).find('h2, h3').first();
        const linkEl = $(el).find('a[href*="medium.com"]').first();
        const authorEl = $(el).find('a[href*="@"]').first();
        if (titleEl.length && linkEl.length) {
            articles.push({
                title: titleEl.text().trim(),
                url: linkEl.attr('href'),
                author: authorEl.length ? authorEl.text().trim() : null,
                tag: tag
            });
        }
    });

    return articles;
}

// Discover content across multiple tags
async function discoverByTags(tags) {
    const results = {};
    for (const tag of tags) {
        console.log(`Scraping tag: ${tag}`);
        results[tag] = await scrapeTagPage(tag);
        // Be polite with rate limiting
        await new Promise(r => setTimeout(r, 2000));
    }
    return results;
}

// Usage
discoverByTags(['javascript', 'machine-learning', 'web-development'])
    .then(data => console.log(JSON.stringify(data, null, 2)));
Handling Medium's Anti-Scraping Measures
Medium employs several techniques to prevent automated scraping:
JavaScript rendering: Many article pages require JavaScript execution to fully render content. A simple HTTP request may return an incomplete page.
Rate limiting: Medium throttles requests from single IP addresses, returning 429 status codes after too many requests.
Paywall: Member-only articles require authentication to access full content.
Dynamic loading: Publication and tag pages use infinite scroll, loading content dynamically as users scroll down.
Strategies for Overcoming These Challenges
Use headless browsers for JavaScript-rendered content. Tools like Playwright or Puppeteer can execute JavaScript and wait for content to load before extracting data.
Implement request delays between page fetches. A 2-5 second delay between requests significantly reduces the chance of rate limiting.
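The delay-plus-backoff pattern can be sketched as a small wrapper (`polite_get` is a hypothetical helper of our own; it accepts any fetch callable, such as a bound `requests.Session().get`, so the retry logic stays testable without a network):

```python
import random
import time

def polite_get(fetch, url, base_delay=2.0, max_retries=4, sleep=time.sleep):
    """Call fetch(url) with a polite delay, backing off on HTTP 429.

    fetch is any callable returning an object with a status_code
    attribute (for example requests.Session().get).
    """
    response = None
    for attempt in range(max_retries):
        # Random jitter makes the request pattern look less robotic
        sleep(base_delay + random.uniform(0, 1))
        response = fetch(url)
        if response.status_code != 429:
            return response
        # Back off exponentially after each 429: 2s, 4s, 8s, ...
        sleep(2 ** (attempt + 1))
    return response
```

Injecting the `sleep` function as a parameter is what makes the backoff schedule verifiable in a unit test.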
Rotate proxies for large-scale scraping to distribute requests across multiple IP addresses.
Use Medium's RSS feeds as a lightweight alternative for recent articles. Medium provides RSS feeds for users, publications, and tags:
https://medium.com/feed/@username
https://medium.com/feed/publication-name
https://medium.com/feed/tag/javascript
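Since RSS feeds are plain XML, the standard library is enough to parse them. A sketch using `xml.etree.ElementTree` (the element names follow the standard RSS 2.0 schema plus the `dc:creator` extension, which Medium's feeds appear to use; verify against a live feed before relying on it):

```python
import xml.etree.ElementTree as ET

DC_NS = '{http://purl.org/dc/elements/1.1/}'

def parse_medium_feed(xml_text):
    """Extract article entries from an RSS 2.0 feed document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        items.append({
            'title': item.findtext('title'),
            'link': item.findtext('link'),
            'pub_date': item.findtext('pubDate'),
            'creator': item.findtext(f'{DC_NS}creator'),
        })
    return items
```

In practice you'd feed it the response body of, say, `requests.get('https://medium.com/feed/tag/javascript').text`. Feeds only include recent posts, so they complement rather than replace archive scraping.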
Scaling with Apify
While the code examples above work for small-scale scraping, production workloads need infrastructure for proxy rotation, scheduling, retry logic, and data storage. This is where Apify comes in.
Apify provides a cloud platform for running web scrapers (called Actors) with built-in proxy management, scheduling, and storage. For Medium scraping specifically, you can find ready-made Actors on the Apify Store that handle all the complexity of extracting Medium data at scale.
Using the Apify SDK
const { Actor } = require('apify');
const { PuppeteerCrawler } = require('crawlee');

Actor.main(async () => {
    const input = await Actor.getInput();
    const { urls, maxArticles = 100 } = input;

    const crawler = new PuppeteerCrawler({
        maxConcurrency: 5,
        requestHandlerTimeoutSecs: 60,
        async requestHandler({ page, request }) {
            // Wait for the article to fully render
            await page.waitForSelector('article', { timeout: 15000 });

            const articleData = await page.evaluate(() => {
                const title = document.querySelector('h1');
                const author = document.querySelector('a[rel="author"]');
                const dateEl = document.querySelector('time');
                const article = document.querySelector('article');

                // Extract all paragraph text
                const paragraphs = Array.from(
                    article?.querySelectorAll('p') || []
                ).map(p => p.textContent.trim()).filter(Boolean);

                return {
                    title: title?.textContent?.trim(),
                    author: author?.textContent?.trim(),
                    date: dateEl?.getAttribute('datetime'),
                    content: paragraphs.join('\n\n'),
                    wordCount: paragraphs.join(' ').split(/\s+/).length
                };
            });

            await Actor.pushData({
                ...articleData,
                url: request.url,
                scrapedAt: new Date().toISOString()
            });
        },
        async failedRequestHandler({ request }) {
            console.error(`Failed: ${request.url}`);
        }
    });

    await crawler.run(urls.slice(0, maxArticles).map(url => ({ url })));
});
Benefits of Using Apify for Medium Scraping
- Built-in proxy rotation prevents IP bans and handles rate limiting automatically
- Automatic retries for failed requests with configurable retry strategies
- Scheduled runs to collect new articles daily, weekly, or on any custom schedule
- Dataset storage with export to JSON, CSV, or Excel formats
- Monitoring and alerts for scraping health and error rates
- Headless browser support through Puppeteer and Playwright integrations
Use Cases for Medium Data
Once you have a pipeline for extracting Medium data, the possibilities are extensive:
Content research: Identify trending topics, popular writing styles, and high-engagement formats across specific niches.
Competitive intelligence: Track what publications in your industry are covering, which authors are gaining traction, and what messaging resonates.
Academic research: Analyze writing patterns, content distribution, and engagement metrics across the platform.
Training data: Build datasets of high-quality technical writing for natural language processing models.
Influencer identification: Find top authors in specific domains based on clap counts, follower numbers, and publication frequency.
Ethical Considerations and Best Practices
When scraping Medium, keep these guidelines in mind:
Respect robots.txt — Check Medium's robots.txt for disallowed paths and honor those restrictions.
Rate limit your requests — Don't overwhelm Medium's servers. Implement polite delays between requests.
Cache aggressively — Store already-scraped data locally to avoid redundant requests.
Don't bypass the paywall — Respect Medium's membership model. Scraping paywalled content without authorization violates their terms of service.
Attribute content — If you republish or analyze scraped content, always credit the original authors.
Review Medium's Terms of Service — Ensure your scraping use case complies with their current terms.
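Python ships the robots.txt check in the standard library as `urllib.robotparser`. A sketch (the sample rules in the test are illustrative; fetch the live file from https://medium.com/robots.txt in practice):

```python
from urllib import robotparser

def allowed_paths(robots_txt, paths, user_agent='*'):
    """Return a mapping of path -> whether robots.txt permits fetching it."""
    rp = robotparser.RobotFileParser()
    # parse() accepts the raw robots.txt split into lines
    rp.parse(robots_txt.splitlines())
    return {
        path: rp.can_fetch(user_agent, f'https://medium.com{path}')
        for path in paths
    }
```

Running this check once at startup, and caching the result, keeps the compliance cost negligible.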
Conclusion
Medium's rich content ecosystem makes it a valuable target for data extraction, whether you're analyzing content trends, building research datasets, or monitoring competitive activity. By understanding Medium's structure — articles, publications, authors, and tags — you can build targeted scrapers that extract exactly the data you need.
For small-scale projects, the Python and Node.js examples in this guide provide a solid foundation. For production workloads, leveraging Apify's infrastructure for proxy rotation, scheduling, and storage transforms what would otherwise be a fragile script into a reliable data pipeline.
The key is starting with a clear understanding of what data you need and working backward from there to build the simplest scraper that gets the job done. Happy scraping!