Bluesky crossed 40 million users in early 2026. For anyone doing social media analysis, brand monitoring, or influencer research, that's a dataset growing faster than any other decentralized platform — and unlike Twitter/X, the data is genuinely accessible.
The reason comes down to architecture. Bluesky runs on the AT Protocol, an open, decentralized protocol where public posts are truly public. No API keys. No OAuth flow. No $5,000/month Enterprise tier. You can query the public API right now with curl and get structured JSON back.
In this guide, I'll walk through every mode of the Bluesky Scraper on Apify — search, profile, and followers — with Python code examples for each. Whether you're building a social listening dashboard or doing academic research on decentralized social networks, this is the complete reference.
Why Bluesky Data Matters in 2026
Before diving into the technical walkthrough, here's why Bluesky deserves a spot in your data pipeline:
It's where early adopters moved. When Twitter/X locked down API access in 2023 and priced meaningful access at $5,000/month (the Pro tier), a significant chunk of the developer, researcher, and journalist community migrated to Bluesky. The conversations happening there often lead mainstream coverage by days.
The data is open by design. The AT Protocol treats every user's public data as a signed, portable data repository. This isn't a policy decision that could change next quarter — it's baked into the protocol's architecture. Public posts are served over unauthenticated HTTP endpoints.
Growing fast. With 40M+ users and over 3.5 million daily active users, Bluesky has reached critical mass for meaningful social media analysis across tech, politics, media, and academic communities.
Understanding the AT Protocol (Quick Primer)
The AT Protocol uses a few concepts you should understand:
- XRPC endpoints — API calls are namespaced methods like `app.bsky.feed.searchPosts`, called via HTTPS GET/POST.
- DIDs — Decentralized Identifiers. Every user has a DID (like `did:plc:abc123`) that's their permanent identity, separate from their handle.
- Public AppView — Bluesky runs a public aggregator at `https://public.api.bsky.app` that serves read-only data with zero authentication.
- Lexicon schemas — Every data type is formally defined, so API responses are predictable and well-typed.
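To make the handle/DID distinction concrete, here's a small sketch that builds unauthenticated XRPC GET URLs for the public AppView. The `xrpc_url` helper is my own illustration, not part of any SDK; `com.atproto.identity.resolveHandle` is the real endpoint that maps a handle to its permanent DID.

```python
from urllib.parse import urlencode

APPVIEW = "https://public.api.bsky.app"

def xrpc_url(method, **params):
    """Build an unauthenticated XRPC GET URL against the public AppView."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{APPVIEW}/xrpc/{method}" + (f"?{query}" if query else "")

# Resolve a handle to its permanent DID
url = xrpc_url("com.atproto.identity.resolveHandle", handle="bsky.app")
print(url)
# https://public.api.bsky.app/xrpc/com.atproto.identity.resolveHandle?handle=bsky.app

# To actually call it (needs network access):
# import requests
# did = requests.get(url).json()["did"]
```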
The practical upshot: you can search all of Bluesky right now:
```shell
curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&limit=25"
```
No headers. No tokens. That returns up to 25 posts with full text, author handle, timestamps, and engagement metrics.
Mode 1: Search Mode (Keywords & Hashtags)
The search mode is the most common starting point. You provide keywords or hashtag terms, and the scraper returns matching posts with full metadata.
Running via the Apify Console
1. Open cryptosignals/bluesky-scraper
2. Set your search terms (e.g., `["machine learning", "LLM agents"]`)
3. Set `maxResults` to control volume (start with 100-200 for testing)
4. Choose sort order: `latest` for chronological, `top` for highest engagement
5. Click Start
Running via Python
```python
import json
import time

import requests

# Apify API endpoint for running actors
APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals/bluesky-scraper"

# Start the actor run
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "searchTerms": ["artificial intelligence", "AI agents"],
        "maxResults": 200,
        "sort": "latest",
    },
)
run_data = run_response.json()["data"]
run_id = run_data["id"]
dataset_id = run_data["defaultDatasetId"]
print(f"Run started: {run_id}")

# Poll until the run finishes
while True:
    status_resp = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    )
    status = status_resp.json()["data"]["status"]
    if status in ("SUCCEEDED", "FAILED", "ABORTED"):
        break
    time.sleep(5)
print(f"Run finished with status: {status}")

# Fetch the results
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()

for post in results[:5]:
    print(f"@{post['author']['handle']}: {post['text'][:100]}...")
    print(f"  Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
    print(f"  Posted: {post['createdAt']}")
    print()
```
What You Get Back
Each post object includes:
```json
{
  "text": "Just shipped our new AI agent framework...",
  "author": {
    "handle": "developer.bsky.social",
    "displayName": "Jane Dev",
    "did": "did:plc:abc123..."
  },
  "createdAt": "2026-03-15T14:30:00.000Z",
  "likeCount": 42,
  "repostCount": 12,
  "replyCount": 8,
  "uri": "at://did:plc:abc123/app.bsky.feed.post/3k..."
}
```
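Because every post arrives in the same Lexicon-defined shape, flattening results for a spreadsheet takes only a few lines. This sketch assumes the field names shown above; `flatten_post` and `write_csv` are illustrative helpers, not part of the scraper's output.

```python
import csv

def flatten_post(post):
    """Pull the fields most analyses need into a flat dict."""
    return {
        "handle": post["author"]["handle"],
        "text": post["text"],
        "created_at": post["createdAt"],
        "likes": post.get("likeCount", 0),
        "reposts": post.get("repostCount", 0),
        "replies": post.get("replyCount", 0),
    }

def write_csv(posts, path):
    """Write a list of post objects to a CSV file."""
    rows = [flatten_post(p) for p in posts]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

(Apify can also export datasets to CSV directly; this is for when you want the transformation inside your own pipeline.)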
Search Tips
- Quoted phrases work: `"machine learning"` matches the exact phrase
- Hashtags: search for `#buildinpublic` or `#python` to track community hashtags
- Date filtering: use `since` and `until` parameters to narrow to a time window
- Author filtering: combine with an author handle to search one person's posts
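If you're hitting the public endpoint directly rather than going through the Actor, the same filters map onto `app.bsky.feed.searchPosts` query parameters (`q`, `sort`, `since`, `until`, `author`, `limit`). A minimal sketch; `build_search_params` is my own helper for dropping unset filters:

```python
SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def build_search_params(q, sort="latest", since=None, until=None,
                        author=None, limit=25):
    """Assemble searchPosts query parameters, dropping any filters left unset."""
    params = {"q": q, "sort": sort, "since": since, "until": until,
              "author": author, "limit": limit}
    return {k: v for k, v in params.items() if v is not None}

params = build_search_params(
    '"machine learning"',            # quoted phrase = exact match
    since="2026-03-01T00:00:00Z",
    until="2026-03-15T00:00:00Z",
)

# To run the search (needs network access):
# import requests
# posts = requests.get(SEARCH_URL, params=params).json()["posts"]
```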
Mode 2: Profile Mode
Profile mode lets you fetch detailed information about specific Bluesky users — their bio, follower/following counts, post counts, and metadata.
Use Case: Influencer Research
Say you're building a list of AI thought leaders on Bluesky. You have a list of handles, and you need their follower counts, bio text, and activity levels.
Python Example
```python
# Fetch profiles for a list of handles
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "handles": [
            "jay.bsky.team",
            "pfrazee.com",
            "samuel.bsky.team",
        ],
        "mode": "profiles",
    },
)
```
Profile Data Fields
| Field | Description |
|---|---|
| `handle` | User's Bluesky handle |
| `displayName` | Display name |
| `did` | Decentralized Identifier |
| `description` | Bio text |
| `followersCount` | Number of followers |
| `followsCount` | Number of accounts followed |
| `postsCount` | Total post count |
| `avatar` | Avatar image URL |
| `createdAt` | Account creation date |
This is invaluable for influencer scoring. You can rank users by follower count, calculate follower-to-following ratios, and identify accounts that are growing fastest by tracking profiles over time with scheduled runs.
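As a sketch of that scoring idea (field names as in the table above; the weighting formula is my own illustration, not a standard metric):

```python
def influence_score(profile):
    """Naive score: followers weighted by the follower-to-following ratio."""
    followers = profile.get("followersCount", 0)
    follows = max(profile.get("followsCount", 0), 1)  # avoid division by zero
    return followers * (followers / follows)

def rank_profiles(profiles, top_n=10):
    """Return the top_n profiles by influence score, highest first."""
    return sorted(profiles, key=influence_score, reverse=True)[:top_n]

profiles = [
    {"handle": "a.bsky.social", "followersCount": 5000, "followsCount": 100},
    {"handle": "b.bsky.social", "followersCount": 8000, "followsCount": 7900},
]
for p in rank_profiles(profiles):
    print(p["handle"], round(influence_score(p)))
```

Note how the ratio term demotes accounts that follow nearly as many people as follow them, a common pattern for follow-back farming.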
Mode 3: Followers Mode
Followers mode extracts the follower list for a specific account. This is the mode you need for network analysis and audience overlap studies.
Use Case: Audience Overlap Analysis
Want to know how much overlap exists between two competing brands on Bluesky? Pull the follower lists for both accounts, then compute the intersection:
```python
import time

import requests

def get_followers(handle, token, actor_id):
    """Fetch followers for a Bluesky handle via the scraper."""
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"handle": handle, "mode": "followers", "maxResults": 5000},
    )
    run = resp.json()["data"]
    # Poll until the run finishes
    while True:
        status = requests.get(
            f"https://api.apify.com/v2/actor-runs/{run['id']}",
            headers={"Authorization": f"Bearer {token}"},
        ).json()["data"]["status"]
        if status in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(5)
    # Fetch the follower records from the run's dataset
    followers_data = requests.get(
        f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
        headers={"Authorization": f"Bearer {token}"},
    ).json()
    return {f["handle"] for f in followers_data}

brand_a_followers = get_followers("brand-a.bsky.social", APIFY_TOKEN, ACTOR_ID)
brand_b_followers = get_followers("brand-b.bsky.social", APIFY_TOKEN, ACTOR_ID)

overlap = brand_a_followers & brand_b_followers
print(f"Overlap: {len(overlap)} shared followers")
print(f"Brand A unique: {len(brand_a_followers - overlap)}")
print(f"Brand B unique: {len(brand_b_followers - overlap)}")
```
Scheduling Recurring Scrapes
One-off collection is useful, but most real use cases need ongoing monitoring. Apify's scheduling runs the scraper on any cron schedule — hourly, daily, weekly.
```python
# Create a daily schedule via the API
schedule_resp = requests.post(
    "https://api.apify.com/v2/schedules",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "name": "bluesky-daily-brand-monitor",
        "cronExpression": "0 9 * * *",
        "timezone": "America/New_York",
        "actions": [{
            "type": "RUN_ACTOR",
            "actorId": "cryptosignals/bluesky-scraper",
            "runInput": {
                "body": json.dumps({
                    "searchTerms": ["your brand", "competitor brand"],
                    "maxResults": 500,
                    "sort": "latest",
                }),
                "contentType": "application/json",
            },
        }],
    },
)
print(f"Schedule created: {schedule_resp.json()['data']['id']}")
```
Each scheduled run creates a new dataset, giving you a clean time series of results.
Real-World Use Cases
Social Listening & Brand Monitoring
Track mentions of your brand, product, or competitors daily. Export to CSV and feed into your analytics dashboard. With Bluesky's user base concentrated in tech, media, and academic communities, it's high-signal data for B2B companies.
Trend Detection
Run broad keyword searches on a schedule and compare volume over time. Spot emerging topics — new frameworks, security vulnerabilities, industry shifts — before they hit mainstream platforms.
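One way to quantify "volume over time" is to bucket scraped posts by calendar day using their `createdAt` timestamps. A minimal sketch in pure Python, assuming the ISO-8601 timestamps shown earlier; `daily_volume` is an illustrative helper:

```python
from collections import Counter
from datetime import datetime

def daily_volume(posts):
    """Count posts per UTC day from their createdAt timestamps."""
    days = Counter()
    for post in posts:
        # fromisoformat (pre-3.11) needs an explicit offset instead of "Z"
        ts = post["createdAt"].replace("Z", "+00:00")
        days[datetime.fromisoformat(ts).date().isoformat()] += 1
    return dict(sorted(days.items()))

posts = [
    {"createdAt": "2026-03-14T09:00:00.000Z"},
    {"createdAt": "2026-03-14T21:30:00.000Z"},
    {"createdAt": "2026-03-15T14:30:00.000Z"},
]
print(daily_volume(posts))
# {'2026-03-14': 2, '2026-03-15': 1}
```

A sudden jump in a keyword's daily count across scheduled runs is your trend signal.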
Sentiment Analysis Pipeline
The structured output (clean text + timestamps + engagement metrics) plugs directly into NLP pipelines:
```python
from textblob import TextBlob

for post in results:
    blob = TextBlob(post["text"])
    print(f"Sentiment: {blob.sentiment.polarity:.2f} — {post['text'][:80]}")
```
Academic Research
Bluesky is one of the few platforms where large-scale social media research doesn't require a special academic API tier. The AT Protocol's openness makes it ideal for studying information spread, community dynamics, and platform migration patterns.
Handling Proxies and Rate Limits at Scale
For most use cases, the Apify Actor handles rate limiting automatically. But if you're also hitting the AT Protocol directly from your own infrastructure (for real-time streaming or custom collection), you'll want a proxy rotation layer.
Services like ScraperAPI provide automatic proxy rotation with residential and datacenter IPs, handling retries and CAPTCHAs for you. This is especially useful when combining Bluesky data collection with scraping from other platforms that do require authentication.
Monitoring Your Scraping Pipeline
When you're running scheduled scrapes across multiple search terms, monitoring becomes important. You need to know when a run fails, when output volume drops unexpectedly, or when your data pipeline breaks.
ScrapeOps provides a monitoring dashboard for scraping operations — track success rates, run durations, and data volumes across all your Apify actors and custom scrapers from a single interface.
Bluesky vs. Twitter/X API: Cost Comparison
| Feature | Twitter/X API | Bluesky AT Protocol |
|---|---|---|
| Authentication | OAuth 2.0 required | None for public data |
| Cost | $5,000/month (Pro) | Free |
| Rate limits | 300 requests/15 min (Basic) | Generous, undocumented |
| Historical data | Limited on lower tiers | Full-text search via public endpoints |
| Export formats | JSON only | JSON, CSV, Excel, XML (via Apify) |
| Developer approval | Required, can take weeks | Not needed |
For social media analysis, the economics are clear. You can monitor Bluesky for the cost of Apify compute credits (pennies per run) versus $5,000/month for comparable Twitter/X access.
Getting Started
The shortest path from zero to data:
- Create a free Apify account
- Open the Bluesky Scraper
- Enter search terms, set a result limit, click Start
- Download results as JSON, CSV, or Excel
For production pipelines, grab your API token from the Apify Console and use the Python examples above to integrate into your workflow.
Bluesky's open protocol is a rare thing in social media: a platform where public data is actually public. Whether you're building a monitoring dashboard, training a sentiment model, or studying platform migration, the data is there — and it costs orders of magnitude less than the alternatives.
The Bluesky Scraper is available at apify.com/cryptosignals/bluesky-scraper. For AT Protocol documentation, see docs.bsky.app.