agenthustler
How to Scrape Bluesky Posts: AT Protocol Public Data Extraction with Python

Bluesky is a decentralized social network built on the AT Protocol. Unlike traditional platforms, it is open by design: public data is freely accessible without authentication. That openness is a game-changer for data extraction.

Here's how to scrape Bluesky posts and profiles using Python.

Why Bluesky Data is Special

  • Open by design: The AT Protocol makes public data accessible via standard APIs
  • No auth required: Public posts, profiles, and feeds are openly available
  • Growing fast: Millions of users migrating from Twitter/X
  • Rich data: Posts, replies, likes, reposts, follows — all accessible
  • Decentralized: Data is portable and not locked behind one corporation

Understanding the AT Protocol

Each user has a data repository identified by their DID (Decentralized Identifier). The public API endpoints let you read this data directly.

import requests
import json
import time

BSKY_PUBLIC_API = "https://public.api.bsky.app"

def resolve_handle(handle):
    """Convert a Bluesky handle to a DID."""
    url = f"{BSKY_PUBLIC_API}/xrpc/com.atproto.identity.resolveHandle"
    response = requests.get(url, params={"handle": handle}, timeout=10)
    if response.status_code == 200:
        return response.json().get("did")
    return None

def get_profile(handle_or_did):
    """Get a user's profile information."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.actor.getProfile"
    response = requests.get(url, params={"actor": handle_or_did}, timeout=10)
    if response.status_code != 200:
        return None
    data = response.json()
    return {
        "did": data.get("did"),
        "handle": data.get("handle"),
        "display_name": data.get("displayName", ""),
        "followers_count": data.get("followersCount", 0),
        "follows_count": data.get("followsCount", 0),
        "posts_count": data.get("postsCount", 0),
    }
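For quick inspection while you iterate, a small formatting helper makes the dicts returned by get_profile easy to eyeball. This is just an illustrative sketch (format_profile is not part of any API, and the field names match the dict shape above):

```python
def format_profile(profile):
    """Render a profile dict from get_profile() as a one-line summary."""
    return (f"@{profile['handle']} | "
            f"{profile['followers_count']} followers, "
            f"{profile['posts_count']} posts")
```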

Fetching User Posts

def get_author_feed(handle_or_did, limit=50):
    """Get posts from a specific user."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.getAuthorFeed"
    all_posts = []
    cursor = None

    while len(all_posts) < limit:
        params = {"actor": handle_or_did, "limit": min(100, limit - len(all_posts))}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 200:
            break

        data = response.json()
        feed = data.get("feed", [])
        if not feed:
            break

        for item in feed:
            post = item.get("post", {})
            record = post.get("record", {})
            all_posts.append({
                "uri": post.get("uri", ""),
                "author_handle": post.get("author", {}).get("handle", ""),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt", ""),
                "like_count": post.get("likeCount", 0),
                "repost_count": post.get("repostCount", 0),
                "reply_count": post.get("replyCount", 0),
            })

        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)

    return all_posts[:limit]
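Each post's uri field uses the at:// scheme: at://&lt;DID&gt;/&lt;collection&gt;/&lt;record key&gt;. A minimal sketch for splitting one apart, which is handy for deduplicating by record key or reconstructing bsky.app links:

```python
def parse_at_uri(uri):
    """Split an at:// URI into (did, collection, rkey).

    e.g. at://did:plc:abc/app.bsky.feed.post/3k44
    """
    if not uri.startswith("at://"):
        raise ValueError(f"not an at:// URI: {uri!r}")
    did, _, rest = uri[len("at://"):].partition("/")
    collection, _, rkey = rest.partition("/")
    return did, collection or None, rkey or None
```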

Searching Bluesky Posts

def search_posts(query, limit=50):
    """Search for posts containing specific terms."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.searchPosts"
    all_results = []
    cursor = None

    while len(all_results) < limit:
        params = {"q": query, "limit": min(100, limit - len(all_results))}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 200:
            break

        data = response.json()
        posts = data.get("posts", [])
        if not posts:
            break

        for post in posts:
            record = post.get("record", {})
            all_results.append({
                "uri": post.get("uri"),
                "author": post.get("author", {}).get("handle"),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt"),
                "likes": post.get("likeCount", 0),
                "reposts": post.get("repostCount", 0),
            })

        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)

    return all_results[:limit]
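To persist search results between runs, JSON Lines (one object per line) is a convenient, append-friendly format. A sketch assuming the dict shape returned by search_posts above:

```python
import json

def save_jsonl(posts, path):
    """Append each post dict to a file as one JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        for post in posts:
            f.write(json.dumps(post, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read post dicts back from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```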

Building a Bluesky Monitor

def monitor_keywords(keywords, interval_seconds=300):
    """Monitor Bluesky for specific keywords."""
    seen_uris = set()
    while True:
        for keyword in keywords:
            results = search_posts(keyword, limit=25)
            new_posts = [p for p in results if p["uri"] not in seen_uris]
            for post in new_posts:
                seen_uris.add(post["uri"])
                print(f"[NEW] @{post['author']}: {post['text'][:100]}")
        print(f"Checked {len(keywords)} keywords, sleeping {interval_seconds}s...")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    monitor_keywords(["web scraping", "data extraction", "python scraper"])
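The monitor keeps seen_uris in memory, so a restart reports everything as new. A hedged sketch that persists the set to disk between runs (the file name is arbitrary, and this is one simple option, not the only one):

```python
import json
import os

SEEN_PATH = "seen_uris.json"  # arbitrary file name

def load_seen(path=SEEN_PATH):
    """Load previously seen post URIs, or an empty set on first run."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def save_seen(seen, path=SEEN_PATH):
    """Write the seen-URI set back to disk (sorted for stable diffs)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen), f)
```

Call load_seen() before the monitoring loop and save_seen() after each pass, and the monitor survives restarts without re-reporting old posts.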

Scaling Bluesky Scraping

For production-scale Bluesky data collection, the Bluesky Scraper on Apify handles the heavy lifting with pagination, rate limits, and data normalization.

For proxy management when making high-volume API calls, ScrapeOps provides rotating proxy infrastructure that works perfectly with the AT Protocol endpoints.

Best Practices

  1. No auth needed: Public data is freely available — don't over-complicate it
  2. Use the public API: public.api.bsky.app is the correct endpoint
  3. Rate limit gently: 0.5-1 second between requests
  4. Use ScrapeOps for proxy rotation at scale
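Rate limiting deserves more than a fixed sleep: when the API returns 429 or a transient server error, exponential backoff is the usual approach. A sketch of one way to do it (the retried status codes and delay values are reasonable defaults, not limits documented by Bluesky):

```python
import time
import requests

def backoff_delays(max_retries=4, base_delay=0.5):
    """Exponential backoff schedule: 0.5s, 1.0s, 2.0s, 4.0s by default."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]

def get_with_backoff(url, params=None, max_retries=4):
    """GET that retries on rate limits (429) and transient server errors."""
    response = None
    for delay in backoff_delays(max_retries):
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response
        if response.status_code in (429, 500, 502, 503):
            time.sleep(delay)  # back off, then retry
        else:
            break  # client error: retrying won't help
    return response
```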

Conclusion

Bluesky's open AT Protocol makes it the most scraper-friendly social network today. Whether you use the public API directly or the Bluesky Scraper on Apify, the data is readily accessible for analysis.

Happy scraping!
