DEV Community

agenthustler

Medium Scraping: Extract Articles, Publications and Author Data

Medium has become one of the most influential content platforms on the web, hosting millions of articles across technology, business, science, self-improvement, and countless other topics. For researchers, content analysts, competitive intelligence teams, and data scientists, Medium represents a goldmine of structured knowledge waiting to be extracted.

In this comprehensive guide, we'll walk through how to scrape Medium effectively — covering article metadata, publication pages, clap counts, author profiles, and tag-based discovery. We'll look at Medium's underlying structure, write practical code examples in both Python and Node.js, and show how to scale your scraping with Apify.

Understanding Medium's Structure

Before writing a single line of scraping code, it helps to understand how Medium organizes its content. Medium's architecture revolves around several core entities:

Articles (Posts): The fundamental content unit. Each article has a unique URL slug, title, subtitle, body content (in a custom markup format), publication date, reading time, and engagement metrics (claps, responses).

Publications: Curated collections of articles managed by editorial teams. Publications like Better Programming, Towards Data Science, and The Startup aggregate content around specific themes. Each publication has its own homepage, archive, and contributor list.

Authors (Users): Individual writers with profile pages showing their bio, follower count, published articles, and engagement history.

Tags: Medium's categorization system. Tags like "machine-learning," "javascript," or "startup" group related articles together and power Medium's recommendation engine.

Responses: Medium's comment system, where responses are themselves full articles linked to a parent post.
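
These entities map naturally onto simple record types. As a purely illustrative sketch, the Python dataclasses below model the scraped output of such a pipeline; the field names are our own assumptions, not Medium's internal schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Author:
    """A Medium user, keyed by the @username in profile URLs."""
    username: str
    name: Optional[str] = None
    followers: Optional[int] = None

@dataclass
class Article:
    """One Medium post and the metadata a scraper typically collects."""
    post_id: str                       # trailing hex ID from the URL
    title: str
    author: Author
    publication: Optional[str] = None  # None for self-published posts
    tags: list = field(default_factory=list)
    claps: int = 0
    responses: int = 0
```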

Medium's URL Patterns

Understanding URL patterns is crucial for building efficient scrapers:

# Article URL patterns
https://medium.com/@username/article-slug-hexid
https://medium.com/publication-name/article-slug-hexid
https://username.medium.com/article-slug-hexid

# Author profile
https://medium.com/@username

# Publication homepage
https://medium.com/publication-name

# Tag page
https://medium.com/tag/javascript

# Publication archive
https://medium.com/publication-name/archive

Each article URL ends with a hexadecimal ID (typically 12 characters) that uniquely identifies the post in Medium's database.
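
Since every URL variant ends in that hex ID, a small helper can normalize any article URL to its post ID. A minimal sketch (the 8-16 character range is an assumption to tolerate variation around the typical 12):

```python
import re
from urllib.parse import urlparse

def extract_post_id(url):
    """Return the trailing hex post ID from a Medium article URL, or None."""
    # Strip query strings like ?source=rss, then look at the last path segment
    last_segment = urlparse(url).path.rstrip('/').split('/')[-1]
    match = re.search(r'-([0-9a-f]{8,16})$', last_segment)
    return match.group(1) if match else None
```

This gives you a stable key for deduplicating articles collected from different URL forms (custom subdomain, publication path, or @username path).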

Extracting Article Metadata

Articles are where the richest data lives. Here's what you can extract from a typical Medium article page:

  • Title and subtitle
  • Author name and profile URL
  • Publication name (if published under one)
  • Publication date and last modified date
  • Reading time estimate
  • Clap count (Medium's equivalent of likes)
  • Response count
  • Tags/topics
  • Article body (text, images, embeds)
  • Canonical URL

Python Example: Scraping Article Data

import requests
from bs4 import BeautifulSoup
import json
import re

def scrape_medium_article(url):
    """Extract metadata and content from a Medium article."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/120.0.0.0 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch article: {response.status_code}")

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract structured data from JSON-LD
    script_tags = soup.find_all('script', type='application/ld+json')
    structured_data = {}
    for script in script_tags:
        try:
            data = json.loads(script.string)
            if data.get('@type') == 'Article':
                structured_data = data
                break
        except (json.JSONDecodeError, TypeError):
            continue

    # Extract Open Graph metadata as fallback
    og_title = soup.find('meta', property='og:title')
    og_description = soup.find('meta', property='og:description')
    og_image = soup.find('meta', property='og:image')

    # Extract article body paragraphs
    article_body = soup.find('article')
    paragraphs = []
    if article_body:
        for p in article_body.find_all(['p', 'h1', 'h2', 'h3', 'h4']):
            text = p.get_text(strip=True)
            if text:
                paragraphs.append({
                    'tag': p.name,
                    'text': text
                })

    # Extract clap count from page data
    clap_count = None
    clap_button = soup.find('button', {'data-testid': 'headerClapButton'})
    if clap_button:
        clap_text = clap_button.get_text(strip=True)
        clap_match = re.search(r'[\d,.]+[KkMm]?', clap_text)
        if clap_match:
            clap_count = clap_match.group()

    # Build the result
    result = {
        'url': url,
        'title': structured_data.get('headline')
                 or (og_title['content'] if og_title else None),
        'description': structured_data.get('description')
                       or (og_description['content'] if og_description else None),
        'image': og_image['content'] if og_image else None,
        'author': structured_data.get('author', {}).get('name'),
        'date_published': structured_data.get('datePublished'),
        'date_modified': structured_data.get('dateModified'),
        'publisher': structured_data.get('publisher', {}).get('name'),
        'claps': clap_count,
        'word_count': sum(len(p['text'].split()) for p in paragraphs),
        'content': paragraphs
    }

    return result


# Usage
article = scrape_medium_article(
    'https://medium.com/@example/sample-article-abc123def456'
)
print(json.dumps(article, indent=2))

Node.js Example: Scraping with Cheerio

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeMediumArticle(url) {
    const { data: html } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
        }
    });

    const $ = cheerio.load(html);

    // Parse JSON-LD structured data
    let structuredData = {};
    $('script[type="application/ld+json"]').each((_, el) => {
        try {
            const data = JSON.parse($(el).html());
            if (data['@type'] === 'Article') {
                structuredData = data;
            }
        } catch (e) { /* skip invalid JSON */ }
    });

    // Extract reading time from meta tags
    const readingTime = $('meta[name="twitter:data1"]').attr('content');

    // Extract tags from meta keywords
    const keywords = $('meta[name="keywords"]').attr('content');
    const tags = keywords ? keywords.split(',').map(t => t.trim()) : [];

    // Extract all article paragraphs
    const content = [];
    $('article p, article h1, article h2, article h3').each((_, el) => {
        const text = $(el).text().trim();
        if (text) {
            content.push({
                tag: el.tagName,
                text: text
            });
        }
    });

    return {
        url,
        title: structuredData.headline
            || $('meta[property="og:title"]').attr('content'),
        description: structuredData.description
            || $('meta[property="og:description"]').attr('content'),
        author: structuredData.author?.name || null,
        datePublished: structuredData.datePublished || null,
        dateModified: structuredData.dateModified || null,
        publisher: structuredData.publisher?.name || null,
        readingTime: readingTime || null,
        tags,
        wordCount: content.reduce(
            (sum, p) => sum + p.text.split(/\s+/).length, 0
        ),
        content
    };
}

// Usage
scrapeMediumArticle('https://medium.com/@example/sample-article-abc123')
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(console.error);

Scraping Publication Pages

Medium publications are powerful aggregation points. A single publication like Towards Data Science contains thousands of articles organized by topic, making it perfect for bulk data collection.

Publication Archive Strategy

The most efficient way to collect articles from a publication is through its archive pages. Medium provides monthly archive pages that list all articles published in a given month:

https://medium.com/towards-data-science/archive/2025/01
https://medium.com/better-programming/archive/2024/12
import requests
from bs4 import BeautifulSoup
from datetime import datetime

def scrape_publication_archive(publication_slug, year, month):
    """Scrape all articles from a publication's monthly archive."""
    url = f"https://medium.com/{publication_slug}/archive/{year}/{month:02d}"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    articles = []
    # Find all article links in the archive
    for article_link in soup.find_all('a', href=True):
        href = article_link['href']
        # Medium article URLs contain a hex ID at the end
        if f'/{publication_slug}/' in href and len(href.split('-')[-1]) >= 10:
            title_el = article_link.find(['h2', 'h3'])
            if title_el:
                articles.append({
                    'title': title_el.get_text(strip=True),
                    'url': href if href.startswith('http')
                           else f"https://medium.com{href}",
                    'publication': publication_slug,
                    'archive_month': f"{year}-{month:02d}"
                })

    # Deduplicate by URL
    seen = set()
    unique_articles = []
    for article in articles:
        if article['url'] not in seen:
            seen.add(article['url'])
            unique_articles.append(article)

    return unique_articles


def scrape_publication_range(publication_slug, start_year,
                              start_month, end_year, end_month):
    """Scrape a publication archive across multiple months."""
    all_articles = []
    current = datetime(start_year, start_month, 1)
    end = datetime(end_year, end_month, 1)

    while current <= end:
        print(f"Scraping {publication_slug} - {current.strftime('%Y-%m')}")
        articles = scrape_publication_archive(
            publication_slug, current.year, current.month
        )
        all_articles.extend(articles)
        # Move to the next month
        if current.month == 12:
            current = current.replace(year=current.year + 1, month=1)
        else:
            current = current.replace(month=current.month + 1)

    return all_articles

Extracting Author Profiles

Author data is valuable for influencer analysis, content sourcing, and understanding expertise distribution across topics.

def scrape_author_profile(username):
    """Extract author profile data from Medium."""
    url = f"https://medium.com/@{username}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract profile metadata
    name = soup.find('h2')
    bio = soup.find('meta', property='og:description')

    # Find follower count in page text
    follower_text = None
    for span in soup.find_all('span'):
        text = span.get_text()
        if 'Follower' in text:
            follower_text = text
            break

    # Collect recent article links
    recent_articles = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if f'@{username}/' in href or f'{username}.medium.com' in href:
            h_tag = link.find(['h2', 'h3'])
            if h_tag:
                recent_articles.append({
                    'title': h_tag.get_text(strip=True),
                    'url': href if href.startswith('http')
                           else f"https://medium.com{href}"
                })

    return {
        'username': username,
        'name': name.get_text(strip=True) if name else None,
        'bio': bio['content'] if bio else None,
        'followers': follower_text,
        'profile_url': url,
        'recent_articles': recent_articles[:10]
    }

Tag-Based Discovery

Medium's tag system is one of the best ways to discover content in specific niches. Each tag page shows trending and recent articles for that topic.

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeTagPage(tag) {
    const url = `https://medium.com/tag/${tag}`;

    const { data: html } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
        }
    });

    const $ = cheerio.load(html);
    const articles = [];

    // Extract articles listed under the tag
    $('article').each((_, el) => {
        const titleEl = $(el).find('h2, h3').first();
        const linkEl = $(el).find('a[href*="medium.com"]').first();
        const authorEl = $(el).find('a[href*="@"]').first();

        if (titleEl.length && linkEl.length) {
            articles.push({
                title: titleEl.text().trim(),
                url: linkEl.attr('href'),
                author: authorEl.length ? authorEl.text().trim() : null,
                tag: tag
            });
        }
    });

    return articles;
}

// Discover content across multiple tags
async function discoverByTags(tags) {
    const results = {};
    for (const tag of tags) {
        console.log(`Scraping tag: ${tag}`);
        results[tag] = await scrapeTagPage(tag);
        // Be polite with rate limiting
        await new Promise(r => setTimeout(r, 2000));
    }
    return results;
}

// Usage
discoverByTags(['javascript', 'machine-learning', 'web-development'])
    .then(data => console.log(JSON.stringify(data, null, 2)));

Handling Medium's Anti-Scraping Measures

Medium employs several techniques to prevent automated scraping:

  1. JavaScript rendering: Many article pages require JavaScript execution to fully render content. A simple HTTP request may return an incomplete page.

  2. Rate limiting: Medium throttles requests from single IP addresses, returning 429 status codes after too many requests.

  3. Paywall: Member-only articles require authentication to access full content.

  4. Dynamic loading: Publication and tag pages use infinite scroll, loading content dynamically as users scroll down.

Strategies for Overcoming These Challenges

Use headless browsers for JavaScript-rendered content. Tools like Playwright or Puppeteer can execute JavaScript and wait for content to load before extracting data.

Implement request delays between page fetches. A 2-5 second delay between requests significantly reduces the chance of rate limiting.
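
A sketch of this pattern in Python, combining exponential backoff with the Retry-After header when a 429 arrives (the helper names here are our own, not from any library):

```python
import time
import requests

def next_delay(attempt, retry_after=None, base=2.0):
    """Seconds to wait before retry `attempt`; honor Retry-After if sent."""
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt)   # 2s, 4s, 8s, ...

def polite_get(url, max_retries=4, **kwargs):
    """GET a URL, backing off and retrying whenever the server returns 429."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(next_delay(attempt, response.headers.get('Retry-After')))
    return response
```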

Rotate proxies for large-scale scraping to distribute requests across multiple IP addresses.
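
A minimal rotation sketch using itertools.cycle; the endpoints below are placeholders, and a real deployment would pull from a managed proxy pool instead:

```python
from itertools import cycle
import requests

# Placeholder endpoints: substitute your real proxy pool
PROXIES = cycle([
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
])

def get_with_proxy(url, **kwargs):
    """Send each request through the next proxy in the rotation."""
    proxy = next(PROXIES)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, **kwargs)
```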

Use Medium's RSS feeds as a lightweight alternative for recent articles. Medium provides RSS feeds for users, publications, and tags:

https://medium.com/feed/@username
https://medium.com/feed/publication-name
https://medium.com/feed/tag/javascript
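
Because these feeds are plain RSS 2.0 XML, the standard library is enough to parse them. A sketch (the fetch helper assumes one of the feed URLs above):

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

def parse_medium_feed(xml_text):
    """Pull title, link, and pubDate out of an RSS document."""
    root = ET.fromstring(xml_text)
    return [{
        'title': item.findtext('title'),
        'link': item.findtext('link'),
        'published': item.findtext('pubDate'),
    } for item in root.iter('item')]

def fetch_medium_feed(feed_url):
    """Download and parse a Medium RSS feed."""
    req = Request(feed_url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req) as resp:
        return parse_medium_feed(resp.read().decode('utf-8'))
```

Note that the feeds only cover recent posts, so they complement rather than replace the archive-scraping approach above.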

Scaling with Apify

While the code examples above work for small-scale scraping, production workloads need infrastructure for proxy rotation, scheduling, retry logic, and data storage. This is where Apify comes in.

Apify provides a cloud platform for running web scrapers (called Actors) with built-in proxy management, scheduling, and storage. For Medium scraping specifically, you can find ready-made Actors on the Apify Store that handle all the complexity of extracting Medium data at scale.

Using the Apify SDK

const { Actor } = require('apify');
const { PuppeteerCrawler } = require('crawlee');

Actor.main(async () => {
    const input = await Actor.getInput();
    const { urls, maxArticles = 100 } = input;

    const crawler = new PuppeteerCrawler({
        maxConcurrency: 5,
        maxRequestsPerCrawl: maxArticles,
        requestHandlerTimeoutSecs: 60,

        async requestHandler({ page, request }) {
            // Wait for the article to fully render
            await page.waitForSelector('article', { timeout: 15000 });

            const articleData = await page.evaluate(() => {
                const title = document.querySelector('h1');
                const author = document.querySelector('a[rel="author"]');
                const dateEl = document.querySelector('time');
                const article = document.querySelector('article');

                // Extract all paragraph text
                const paragraphs = Array.from(
                    article?.querySelectorAll('p') || []
                ).map(p => p.textContent.trim()).filter(Boolean);

                return {
                    title: title?.textContent?.trim(),
                    author: author?.textContent?.trim(),
                    date: dateEl?.getAttribute('datetime'),
                    content: paragraphs.join('\n\n'),
                    wordCount: paragraphs.join(' ').split(/\s+/).length
                };
            });

            await Actor.pushData({
                ...articleData,
                url: request.url,
                scrapedAt: new Date().toISOString()
            });
        },

        async failedRequestHandler({ request }) {
            console.error(`Failed: ${request.url}`);
        }
    });

    await crawler.run(urls.map(url => ({ url })));
});

Benefits of Using Apify for Medium Scraping

  • Built-in proxy rotation prevents IP bans and handles rate limiting automatically
  • Automatic retries for failed requests with configurable retry strategies
  • Scheduled runs to collect new articles daily, weekly, or on any custom schedule
  • Dataset storage with export to JSON, CSV, or Excel formats
  • Monitoring and alerts for scraping health and error rates
  • Headless browser support through Puppeteer and Playwright integrations

Use Cases for Medium Data

Once you have a pipeline for extracting Medium data, the possibilities are extensive:

Content research: Identify trending topics, popular writing styles, and high-engagement formats across specific niches.

Competitive intelligence: Track what publications in your industry are covering, which authors are gaining traction, and what messaging resonates.

Academic research: Analyze writing patterns, content distribution, and engagement metrics across the platform.

Training data: Build datasets of high-quality technical writing for natural language processing models.

Influencer identification: Find top authors in specific domains based on clap counts, follower numbers, and publication frequency.

Ethical Considerations and Best Practices

When scraping Medium, keep these guidelines in mind:

  1. Respect robots.txt — Check Medium's robots.txt for disallowed paths and honor those restrictions.

  2. Rate limit your requests — Don't overwhelm Medium's servers. Implement polite delays between requests.

  3. Cache aggressively — Store already-scraped data locally to avoid redundant requests.

  4. Don't bypass the paywall — Respect Medium's membership model. Scraping paywalled content without authorization violates their terms of service.

  5. Attribute content — If you republish or analyze scraped content, always credit the original authors.

  6. Review Medium's Terms of Service — Ensure your scraping use case complies with their current terms.
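
Point 1 can be automated with the standard library's urllib.robotparser. A sketch that checks URLs against robots.txt rules supplied as plain text (the rules in the usage example are made up for illustration, not Medium's actual file):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent='*'):
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice you would fetch https://medium.com/robots.txt once, cache it, and run every candidate URL through a check like this before requesting it.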

Conclusion

Medium's rich content ecosystem makes it a valuable target for data extraction, whether you're analyzing content trends, building research datasets, or monitoring competitive activity. By understanding Medium's structure — articles, publications, authors, and tags — you can build targeted scrapers that extract exactly the data you need.

For small-scale projects, the Python and Node.js examples in this guide provide a solid foundation. For production workloads, leveraging Apify's infrastructure for proxy rotation, scheduling, and storage transforms what would otherwise be a fragile script into a reliable data pipeline.

The key is starting with a clear understanding of what data you need and working backward from there to build the simplest scraper that gets the job done. Happy scraping!
