agenthustler

How to Scrape Bluesky Posts: AT Protocol Public Data Extraction with Python

Bluesky is a decentralized social network built on the AT Protocol. Unlike traditional platforms, the AT Protocol is designed to be open: public data can be read through documented API endpoints without authentication, which removes most of the login, token, and HTML-parsing work that scraping usually involves.

Here's how to scrape Bluesky posts and profiles using Python.

Why Bluesky Data is Special

  • Open by design: The AT Protocol makes public data accessible via standard APIs
  • No auth required: Public posts, profiles, and feeds are openly available
  • Growing fast: Millions of users migrating from Twitter/X
  • Rich data: Posts, replies, likes, reposts, follows — all accessible
  • Decentralized: Data is portable and not locked behind one corporation

Understanding the AT Protocol

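Each user's data lives in a repository identified by a DID (Decentralized Identifier), a permanent ID that stays the same even if the user changes their handle. The public API endpoints let you read this data directly, addressing an account by either its handle or its DID.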

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
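As a rough illustration of the handle-to-DID relationship described above, here is a minimal sketch using only the standard library. It assumes the `com.atproto.identity.resolveHandle` and `app.bsky.actor.getProfile` XRPC methods are served unauthenticated by `public.api.bsky.app`; the helper names (`build_xrpc_url`, `xrpc_get`, `resolve_handle`, `get_profile`) and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public host named in this article serves these methods.
BASE_URL = "https://public.api.bsky.app/xrpc"


def build_xrpc_url(method, params):
    """Build the URL for an XRPC query method with URL-encoded parameters."""
    return f"{BASE_URL}/{method}?{urllib.parse.urlencode(params)}"


def xrpc_get(method, params, timeout=10):
    """Send an unauthenticated GET request and decode the JSON response."""
    request = urllib.request.Request(
        build_xrpc_url(method, params),
        headers={"User-Agent": "bluesky-example/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def resolve_handle(handle):
    """Return the DID that a handle such as 'bsky.app' currently points to."""
    data = xrpc_get("com.atproto.identity.resolveHandle", {"handle": handle})
    return data["did"]


def get_profile(actor):
    """Fetch a public profile; `actor` may be a handle or a DID."""
    return xrpc_get("app.bsky.actor.getProfile", {"actor": actor})
```

Calling `resolve_handle("bsky.app")` should return a string starting with `did:`, and that DID can then be passed to `get_profile` in place of the handle. Storing the DID rather than the handle keeps your records stable when users rename themselves.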

Fetching User Posts

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
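One possible way to fetch a user's posts is sketched below. It assumes the `app.bsky.feed.getAuthorFeed` method, which accepts `actor`, `limit` (up to 100), and `cursor` parameters and returns a `feed` list plus an optional `cursor` for the next page. The function names and the flattened output fields are choices made for this example, not part of any official client.

```python
import json
import urllib.parse
import urllib.request

# Assumption: unauthenticated reads are allowed on this host.
FEED_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"


def flatten_feed_item(item):
    """Reduce one feed item to a flat dict of commonly used fields."""
    post = item["post"]
    record = post.get("record", {})
    return {
        "uri": post["uri"],
        "author": post["author"]["handle"],
        "text": record.get("text", ""),
        "created_at": record.get("createdAt"),
        "likes": post.get("likeCount", 0),
        "reposts": post.get("repostCount", 0),
        "replies": post.get("replyCount", 0),
    }


def fetch_feed_page(actor, limit=50, cursor=None):
    """Request a single page of an author's feed."""
    params = {"actor": actor, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    url = f"{FEED_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def get_user_posts(actor, max_posts=100, fetch_page=fetch_feed_page):
    """Collect up to max_posts items by following the pagination cursor."""
    posts, cursor = [], None
    while len(posts) < max_posts:
        page = fetch_page(actor, limit=min(100, max_posts - len(posts)), cursor=cursor)
        items = page.get("feed", [])
        posts.extend(flatten_feed_item(item) for item in items)
        cursor = page.get("cursor")
        if not cursor or not items:
            break  # no further pages
    return posts[:max_posts]
```

Note that an author feed can include reposts, so the `author` field may name someone other than the account you requested. The `fetch_page` parameter exists only so the pagination loop can be exercised with canned data instead of live requests.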

Searching Bluesky Posts

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
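A search helper might look like the sketch below. It assumes the `app.bsky.feed.searchPosts` method with `q` and `limit` parameters and a `posts` list in the response. Whether this method answers anonymous requests on `public.api.bsky.app` is an assumption here and may vary over time, so the code treats an HTTP error as "no results" rather than crashing. The return shape (`uri`, `author`, `text`) is chosen to match what the monitor in the next section expects from `search_posts`.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumption: this host accepts unauthenticated search requests.
SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"


def build_search_url(query, limit=25, cursor=None):
    """Build a searchPosts URL with URL-encoded parameters."""
    params = {"q": query, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    return f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"


def simplify_post(post):
    """Keep only the fields the keyword monitor needs."""
    record = post.get("record", {})
    return {
        "uri": post["uri"],
        "author": post["author"]["handle"],
        "text": record.get("text", ""),
        "created_at": record.get("createdAt"),
    }


def search_posts(query, limit=25):
    """Return simplified posts matching the query, or [] on an HTTP error."""
    try:
        with urllib.request.urlopen(build_search_url(query, limit), timeout=10) as response:
            data = json.load(response)
    except urllib.error.HTTPError as error:
        # Assumption: a 401/403 here means the host restricts anonymous search.
        print(f"Search failed with HTTP {error.code}")
        return []
    return [simplify_post(post) for post in data.get("posts", [])]
```

If anonymous search is rejected, the same request shape should still apply against an authenticated session; only the host and headers would change.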

Building a Bluesky Monitor

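import time

# Assumes search_posts(keyword, limit) from the search section is defined and
# returns a list of dicts with "uri", "author", and "text" keys.

def monitor_keywords(keywords, interval_seconds=300):
    """Poll Bluesky search for each keyword and print posts not seen before."""
    seen_uris = set()  # URIs already reported, so each post prints only once
    while True:
        for keyword in keywords:
            results = search_posts(keyword, limit=25)
            new_posts = [p for p in results if p["uri"] not in seen_uris]
            for post in new_posts:
                seen_uris.add(post["uri"])
                print(f"[NEW] @{post['author']}: {post['text'][:100]}")
        print(f"Checked {len(keywords)} keywords, sleeping {interval_seconds}s...")
        time.sleep(interval_seconds)

# Runs until interrupted (Ctrl+C).
monitor_keywords(["web scraping", "data extraction", "python scraper"])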

Scaling Bluesky Scraping

For production-scale Bluesky data collection, the Bluesky Scraper on Apify handles the heavy lifting with pagination, rate limits, and data normalization.

For proxy management when making high-volume API calls, ScrapeOps provides rotating proxy infrastructure that can be placed in front of requests to the AT Protocol endpoints.

Best Practices

  1. No auth needed: Public data is freely available, so don't over-complicate it
  2. Use the public API: public.api.bsky.app is the host for unauthenticated reads
  3. Rate limit gently: Wait 0.5 to 1 second between requests
  4. Rotate proxies at scale: Use ScrapeOps for proxy rotation
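The pacing advice in the list above can be sketched as a small throttle object. This is one possible approach, not an official client feature: the `Throttle` class and its parameters are invented for this example, and the 0.5 second default simply mirrors the lower bound suggested above.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Create one `Throttle` and call its `wait()` method immediately before every request. Because it measures elapsed time rather than always sleeping a fixed amount, slow responses do not add unnecessary extra delay.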

Conclusion

Bluesky's open AT Protocol makes it one of the more scraper-friendly social networks today. Whether you use the public API directly or the Bluesky Scraper on Apify, public posts and profiles are readily accessible for analysis.

Happy scraping!