agenthustler
How to Scrape Bluesky Posts: AT Protocol Public Data Extraction with Python

Bluesky is a decentralized social network built on the AT Protocol. Unlike traditional platforms, it is open by design: public data is freely accessible without authentication. That openness is a game-changer for data extraction.

Here's how to scrape Bluesky posts and profiles using Python.

Why Bluesky Data is Special

  • Open by design: The AT Protocol makes public data accessible via standard APIs
  • No auth required: Public posts, profiles, and feeds are openly available
  • Growing fast: Millions of users migrating from Twitter/X
  • Rich data: Posts, replies, likes, reposts, follows — all accessible
  • Decentralized: Data is portable and not locked behind one corporation

Understanding the AT Protocol

Each user has a data repository identified by their DID (Decentralized Identifier). The public API endpoints let you read this data directly.

import requests
import json
import time

BSKY_PUBLIC_API = "https://public.api.bsky.app"

def resolve_handle(handle):
    """Convert a Bluesky handle to a DID."""
    url = f"{BSKY_PUBLIC_API}/xrpc/com.atproto.identity.resolveHandle"
    response = requests.get(url, params={"handle": handle}, timeout=10)
    if response.status_code == 200:
        return response.json().get("did")
    return None

def get_profile(handle_or_did):
    """Get a user's profile information."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.actor.getProfile"
    response = requests.get(url, params={"actor": handle_or_did}, timeout=10)
    if response.status_code != 200:
        return None
    data = response.json()
    return {
        "did": data.get("did"),
        "handle": data.get("handle"),
        "display_name": data.get("displayName", ""),
        "followers_count": data.get("followersCount", 0),
        "follows_count": data.get("followsCount", 0),
        "posts_count": data.get("postsCount", 0),
    }
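For quick inspection while you iterate, a small formatting helper makes the dicts returned by get_profile easy to eyeball. This is just an illustrative sketch (format_profile is not part of any API, and the field names match the dict shape above):

```python
def format_profile(profile):
    """Render a profile dict from get_profile() as a one-line summary."""
    return (f"@{profile['handle']} | "
            f"{profile['followers_count']} followers, "
            f"{profile['posts_count']} posts")
```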

Fetching User Posts

def get_author_feed(handle_or_did, limit=50):
    """Get posts from a specific user."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.getAuthorFeed"
    all_posts = []
    cursor = None

    while len(all_posts) < limit:
        params = {"actor": handle_or_did, "limit": min(100, limit - len(all_posts))}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 200:
            break

        data = response.json()
        feed = data.get("feed", [])
        if not feed:
            break

        for item in feed:
            post = item.get("post", {})
            record = post.get("record", {})
            all_posts.append({
                "uri": post.get("uri", ""),
                "author_handle": post.get("author", {}).get("handle", ""),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt", ""),
                "like_count": post.get("likeCount", 0),
                "repost_count": post.get("repostCount", 0),
                "reply_count": post.get("replyCount", 0),
            })

        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)

    return all_posts[:limit]
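Each post's uri field uses the at:// scheme: at://&lt;DID&gt;/&lt;collection&gt;/&lt;record key&gt;. A minimal sketch for splitting one apart, which is handy for deduplicating by record key or reconstructing bsky.app links:

```python
def parse_at_uri(uri):
    """Split an at:// URI into (did, collection, rkey).

    e.g. at://did:plc:abc/app.bsky.feed.post/3k44
    """
    if not uri.startswith("at://"):
        raise ValueError(f"not an at:// URI: {uri!r}")
    did, _, rest = uri[len("at://"):].partition("/")
    collection, _, rkey = rest.partition("/")
    return did, collection or None, rkey or None
```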

Searching Bluesky Posts

def search_posts(query, limit=50):
    """Search for posts containing specific terms."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.searchPosts"
    all_results = []
    cursor = None

    while len(all_results) < limit:
        params = {"q": query, "limit": min(100, limit - len(all_results))}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 200:
            break

        data = response.json()
        posts = data.get("posts", [])
        if not posts:
            break

        for post in posts:
            record = post.get("record", {})
            all_results.append({
                "uri": post.get("uri"),
                "author": post.get("author", {}).get("handle"),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt"),
                "likes": post.get("likeCount", 0),
                "reposts": post.get("repostCount", 0),
            })

        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)

    return all_results[:limit]
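To persist search results between runs, JSON Lines (one object per line) is a convenient, append-friendly format. A sketch assuming the dict shape returned by search_posts above:

```python
import json

def save_jsonl(posts, path):
    """Append each post dict to a file as one JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        for post in posts:
            f.write(json.dumps(post, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read post dicts back from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```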

Building a Bluesky Monitor

def monitor_keywords(keywords, interval_seconds=300):
    """Monitor Bluesky for specific keywords."""
    seen_uris = set()
    while True:
        for keyword in keywords:
            results = search_posts(keyword, limit=25)
            new_posts = [p for p in results if p["uri"] not in seen_uris]
            for post in new_posts:
                seen_uris.add(post["uri"])
                print(f"[NEW] @{post['author']}: {post['text'][:100]}")
        print(f"Checked {len(keywords)} keywords, sleeping {interval_seconds}s...")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    monitor_keywords(["web scraping", "data extraction", "python scraper"])
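The monitor keeps seen_uris in memory, so a restart reports everything as new. A hedged sketch that persists the set to disk between runs (the file name is arbitrary, and this is one simple option, not the only one):

```python
import json
import os

SEEN_PATH = "seen_uris.json"  # arbitrary file name

def load_seen(path=SEEN_PATH):
    """Load previously seen post URIs, or an empty set on first run."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def save_seen(seen, path=SEEN_PATH):
    """Write the seen-URI set back to disk (sorted for stable diffs)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen), f)
```

Call load_seen() before the monitoring loop and save_seen() after each pass, and the monitor survives restarts without re-reporting old posts.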

Scaling Bluesky Scraping

For production-scale Bluesky data collection, the Bluesky Scraper on Apify handles the heavy lifting with pagination, rate limits, and data normalization.

For proxy management when making high-volume API calls, ScrapeOps provides rotating proxy infrastructure that works perfectly with the AT Protocol endpoints.

Best Practices

  1. No auth needed: Public data is freely available — don't over-complicate it
  2. Use the public API: public.api.bsky.app is the correct endpoint
  3. Rate limit gently: 0.5-1 second between requests
  4. Use ScrapeOps for proxy rotation at scale
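Rate limiting deserves more than a fixed sleep: when the API returns 429 or a transient server error, exponential backoff is the usual approach. A sketch of one way to do it (the retried status codes and delay values are reasonable defaults, not limits documented by Bluesky):

```python
import time
import requests

def backoff_delays(max_retries=4, base_delay=0.5):
    """Exponential backoff schedule: 0.5s, 1.0s, 2.0s, 4.0s by default."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]

def get_with_backoff(url, params=None, max_retries=4):
    """GET that retries on rate limits (429) and transient server errors."""
    response = None
    for delay in backoff_delays(max_retries):
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response
        if response.status_code in (429, 500, 502, 503):
            time.sleep(delay)  # back off, then retry
        else:
            break  # client error: retrying won't help
    return response
```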

Conclusion

Bluesky's open AT Protocol makes it the most scraper-friendly social network today. Whether you use the public API directly or the Bluesky Scraper on Apify, the data is readily accessible for analysis.

Happy scraping!
