Bluesky is a decentralized social network built on the AT Protocol. Unlike traditional platforms, Bluesky is open by design: the AT Protocol makes public data freely accessible without authentication. This is a game-changer for data extraction.
Here's how to scrape Bluesky posts and profiles using Python.
Why Bluesky Data is Special
- Open by design: The AT Protocol makes public data accessible via standard APIs
- No auth required: Public posts, profiles, and feeds are openly available
- Growing fast: Millions of users migrating from Twitter/X
- Rich data: Posts, replies, likes, reposts, follows — all accessible
- Decentralized: Data is portable and not locked behind one corporation
Understanding the AT Protocol
Each user has a data repository identified by their DID (Decentralized Identifier). The public API endpoints let you read this data directly.
```python
import requests
import time

BSKY_PUBLIC_API = "https://public.api.bsky.app"

def resolve_handle(handle):
    """Convert a Bluesky handle to a DID."""
    url = f"{BSKY_PUBLIC_API}/xrpc/com.atproto.identity.resolveHandle"
    response = requests.get(url, params={"handle": handle})
    if response.status_code == 200:
        return response.json().get("did")
    return None

def get_profile(handle_or_did):
    """Get a user's profile information."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.actor.getProfile"
    response = requests.get(url, params={"actor": handle_or_did})
    if response.status_code != 200:
        return None
    data = response.json()
    return {
        "did": data.get("did"),
        "handle": data.get("handle"),
        "display_name": data.get("displayName", ""),
        "followers_count": data.get("followersCount", 0),
        "follows_count": data.get("followsCount", 0),
        "posts_count": data.get("postsCount", 0),
    }
```
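The API identifies every record with an `at://` URI of the form `at://<did>/<collection>/<rkey>`. A small helper for splitting these apart can be handy later (parse_at_uri is my own name, not part of any library):

```python
def parse_at_uri(uri):
    """Split an at:// URI into (did, collection, rkey).

    Example: at://did:plc:abc123/app.bsky.feed.post/3kxyz
    """
    if not uri.startswith("at://"):
        raise ValueError(f"not an at:// URI: {uri}")
    parts = uri[len("at://"):].split("/")
    if len(parts) != 3:
        raise ValueError(f"unexpected at:// URI shape: {uri}")
    did, collection, rkey = parts
    return did, collection, rkey
```

For posts, the rkey is the same identifier that appears at the end of the web URL (https://bsky.app/profile/&lt;handle&gt;/post/&lt;rkey&gt;), which makes it easy to link scraped records back to the site.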
Fetching User Posts
```python
def get_author_feed(handle_or_did, limit=50):
    """Get posts from a specific user."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.getAuthorFeed"
    all_posts = []
    cursor = None
    while len(all_posts) < limit:
        params = {"actor": handle_or_did, "limit": 30}
        if cursor:
            params["cursor"] = cursor
        response = requests.get(url, params=params)
        if response.status_code != 200:
            break
        data = response.json()
        feed = data.get("feed", [])
        if not feed:
            break
        for item in feed:
            post = item.get("post", {})
            record = post.get("record", {})
            all_posts.append({
                "uri": post.get("uri", ""),
                "author_handle": post.get("author", {}).get("handle", ""),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt", ""),
                "like_count": post.get("likeCount", 0),
                "repost_count": post.get("repostCount", 0),
                "reply_count": post.get("replyCount", 0),
            })
        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)
    return all_posts[:limit]
```
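With the list of dicts that get_author_feed returns, you can aggregate engagement locally with no further API calls. A minimal sketch (summarize_posts is a helper name I'm introducing for illustration):

```python
def summarize_posts(posts):
    """Aggregate like/repost/reply counts from get_author_feed output."""
    total = {"likes": 0, "reposts": 0, "replies": 0}
    for p in posts:
        total["likes"] += p.get("like_count", 0)
        total["reposts"] += p.get("repost_count", 0)
        total["replies"] += p.get("reply_count", 0)
    # Most-liked post, if any posts were collected at all
    top = max(posts, key=lambda p: p.get("like_count", 0)) if posts else None
    return {
        "post_count": len(posts),
        **total,
        "top_post_uri": top["uri"] if top else None,
    }
```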
Searching Bluesky Posts
```python
def search_posts(query, limit=50):
    """Search for posts containing specific terms."""
    url = f"{BSKY_PUBLIC_API}/xrpc/app.bsky.feed.searchPosts"
    all_results = []
    cursor = None
    while len(all_results) < limit:
        params = {"q": query, "limit": 25}
        if cursor:
            params["cursor"] = cursor
        response = requests.get(url, params=params)
        if response.status_code != 200:
            break
        data = response.json()
        posts = data.get("posts", [])
        if not posts:
            break
        for post in posts:
            record = post.get("record", {})
            all_results.append({
                "uri": post.get("uri"),
                "author": post.get("author", {}).get("handle"),
                "text": record.get("text", ""),
                "created_at": record.get("createdAt"),
                "likes": post.get("likeCount", 0),
                "reposts": post.get("repostCount", 0),
            })
        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(0.5)
    return all_results[:limit]
```
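When you search several related terms, the same post often comes back more than once. A small URI-based merge helper keeps the first occurrence of each post (dedupe_posts is my own name):

```python
def dedupe_posts(*result_lists):
    """Merge search-result lists, keeping the first occurrence of each URI."""
    seen = set()
    merged = []
    for results in result_lists:
        for post in results:
            uri = post.get("uri")
            if uri and uri not in seen:
                seen.add(uri)
                merged.append(post)
    return merged
```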
Building a Bluesky Monitor
```python
def monitor_keywords(keywords, interval_seconds=300):
    """Monitor Bluesky for specific keywords."""
    seen_uris = set()
    while True:
        for keyword in keywords:
            results = search_posts(keyword, limit=25)
            new_posts = [p for p in results if p["uri"] not in seen_uris]
            for post in new_posts:
                seen_uris.add(post["uri"])
                print(f"[NEW] @{post['author']}: {post['text'][:100]}")
        print(f"Checked {len(keywords)} keywords, sleeping {interval_seconds}s...")
        time.sleep(interval_seconds)

monitor_keywords(["web scraping", "data extraction", "python scraper"])
```
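One caveat with the monitor above: seen_uris grows without bound on a long-running process. A sketch of a capped, oldest-first store you could swap in (BoundedSeenSet is my own name; the eviction policy is an assumption you may want to tune):

```python
from collections import OrderedDict

class BoundedSeenSet:
    """Remembers up to max_size URIs, evicting the oldest first."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, uri):
        if uri in self._items:
            return
        self._items[uri] = None
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry

    def __contains__(self, uri):
        return uri in self._items
```

Because it supports the `in` operator, it is a drop-in replacement for the plain set in monitor_keywords.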
Scaling Bluesky Scraping
For production-scale Bluesky data collection, the Bluesky Scraper on Apify handles the heavy lifting with pagination, rate limits, and data normalization.
For proxy management when making high-volume API calls, ScrapeOps provides rotating proxy infrastructure that works with the AT Protocol's public endpoints.
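At higher volumes you will also hit transient failures (HTTP 429s, timeouts) regardless of the proxy layer. A retry-with-exponential-backoff wrapper is a common pattern; this is a sketch with a helper name and delay values of my own choosing:

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5):
    """Call fetch() until it succeeds, backing off exponentially on failure.

    fetch should raise an exception on retryable errors, e.g. by calling
    response.raise_for_status() so 429/5xx responses trigger a retry.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

You might wrap a call as `fetch_with_retry(lambda: requests.get(url, params=params, timeout=10))`, adding the `raise_for_status()` check inside the lambda or a small wrapper function.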
Best Practices
- No auth needed: Public data is freely available — don't over-complicate it
- Use the public API: public.api.bsky.app is the correct endpoint
- Rate limit gently: wait 0.5-1 second between requests
- Use ScrapeOps for proxy rotation at scale
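The rate-limiting guideline above can be enforced in one place rather than scattering time.sleep calls through every function. A minimal sketch (the RateLimiter class is my own construction):

```python
import time

class RateLimiter:
    """Enforces a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to honor the minimum interval
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `limiter.wait()` immediately before each `requests.get` keeps every code path on the same schedule, and the interval becomes a single knob to tune.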
Conclusion
Bluesky's open AT Protocol makes it the most scraper-friendly social network today. Whether you use the public API directly or the Bluesky Scraper on Apify, the data is readily accessible for analysis.
Happy scraping!