agenthustler

How to Use the Bluesky Scraper: AT Protocol Data for Social Media Analysis

Bluesky crossed 40 million users in early 2026. For anyone doing social media analysis, brand monitoring, or influencer research, that's a dataset growing faster than any other decentralized platform — and unlike Twitter/X, the data is genuinely accessible.

The reason comes down to architecture. Bluesky runs on the AT Protocol, an open, decentralized protocol where public posts are truly public. No API keys. No OAuth flow. No $5,000/month Enterprise tier. You can query the public API right now with curl and get structured JSON back.

In this guide, I'll walk through every mode of the Bluesky Scraper on Apify — search, profile, and followers — with Python code examples for each. Whether you're building a social listening dashboard or doing academic research on decentralized social networks, this is the complete reference.


Why Bluesky Data Matters in 2026

Before diving into the technical walkthrough, here's why Bluesky deserves a spot in your data pipeline:

It's where early adopters moved. When Twitter/X locked down API access in 2023 and priced meaningful API access at $5,000/month (the Pro tier), a significant chunk of the developer, researcher, and journalist community migrated to Bluesky. The conversations happening there often lead mainstream coverage by days.

The data is open by design. The AT Protocol treats every user's public data as a signed, portable data repository. This isn't a policy decision that could change next quarter — it's baked into the protocol's architecture. Public posts are served over unauthenticated HTTP endpoints.

Growing fast. With 40M+ users and over 3.5 million daily active users, Bluesky has reached critical mass for meaningful social media analysis across tech, politics, media, and academic communities.


Understanding the AT Protocol (Quick Primer)

The AT Protocol uses a few concepts you should understand:

  • XRPC endpoints — API calls are namespaced methods like app.bsky.feed.searchPosts. You call them via HTTPS GET/POST.
  • DIDs — Decentralized Identifiers. Every user has a DID (like did:plc:abc123) that's their permanent identity, separate from their handle.
  • Public AppView — Bluesky runs a public aggregator at https://public.api.bsky.app that serves read-only data with zero authentication.
  • Lexicon schemas — Every data type is formally defined, so API responses are predictable and well-typed.

The practical upshot: you can search all of Bluesky right now:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&limit=25"

No headers. No tokens. That returns up to 25 posts with full text, author handle, timestamps, and engagement metrics.
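The same call works from Python. The helper below (`search_posts_url` is a hypothetical name, not part of any SDK) just builds a correctly encoded query string for the public AppView endpoint shown above:

```python
from urllib.parse import urlencode

PUBLIC_APPVIEW = "https://public.api.bsky.app/xrpc"

def search_posts_url(query: str, limit: int = 25, sort: str = "latest") -> str:
    """Build an unauthenticated app.bsky.feed.searchPosts URL."""
    params = urlencode({"q": query, "limit": limit, "sort": sort})
    return f"{PUBLIC_APPVIEW}/app.bsky.feed.searchPosts?{params}"

url = search_posts_url("web scraping")
print(url)
# A plain GET on this URL (e.g. requests.get(url).json()) returns a
# {"posts": [...]} payload with the same fields the curl call shows.
```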


Mode 1: Search Mode (Keywords & Hashtags)

The search mode is the most common starting point. You provide keywords or hashtag terms, and the scraper returns matching posts with full metadata.

Running via the Apify Console

  1. Open cryptosignals/bluesky-scraper
  2. Set your search terms (e.g., ["machine learning", "LLM agents"])
  3. Set maxResults to control volume (start with 100-200 for testing)
  4. Choose sort order: latest for chronological, top for highest engagement
  5. Click Start
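Steps 2–4 of the console form correspond to an input JSON like the sketch below. The field names mirror those used in the Python example later in this section; confirm the exact schema on the actor's input tab:

```python
# Sketch of the actor input the console form produces
actor_input = {
    "searchTerms": ["machine learning", "LLM agents"],  # step 2
    "maxResults": 100,                                  # step 3: start small
    "sort": "latest",                                   # step 4: or "top"
}
print(actor_input)
```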

Running via Python

import requests
import json
import time

# Apify API credentials and target Actor
APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals~bluesky-scraper"  # Apify API paths use "~", not "/"

# Start the actor run
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "searchTerms": ["artificial intelligence", "AI agents"],
        "maxResults": 200,
        "sort": "latest",
    },
)

run_data = run_response.json()["data"]
run_id = run_data["id"]
dataset_id = run_data["defaultDatasetId"]
print(f"Run started: {run_id}")

# Poll until the run reaches a terminal state
while True:
    status_resp = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    )
    status = status_resp.json()["data"]["status"]
    if status in ("SUCCEEDED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

print(f"Run finished with status: {status}")

# Fetch the results
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()

for post in results[:5]:
    print(f"@{post['author']['handle']}: {post['text'][:100]}...")
    print(f"  Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
    print(f"  Posted: {post['createdAt']}")
    print()

What You Get Back

Each post object includes:

{
  "text": "Just shipped our new AI agent framework...",
  "author": {
    "handle": "developer.bsky.social",
    "displayName": "Jane Dev",
    "did": "did:plc:abc123..."
  },
  "createdAt": "2026-03-15T14:30:00.000Z",
  "likeCount": 42,
  "repostCount": 12,
  "replyCount": 8,
  "uri": "at://did:plc:abc123/app.bsky.feed.post/3k..."
}
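A practical note on the uri field: AT URIs follow the shape at://&lt;did&gt;/&lt;collection&gt;/&lt;rkey&gt;, and for posts they map directly onto bsky.app web links. A small conversion helper (hypothetical name, not part of the scraper's output):

```python
def post_uri_to_web_url(at_uri: str) -> str:
    """Convert at://<did>/app.bsky.feed.post/<rkey> to a bsky.app link."""
    did, collection, rkey = at_uri.removeprefix("at://").split("/")
    if collection != "app.bsky.feed.post":
        raise ValueError(f"not a post URI: {at_uri}")
    return f"https://bsky.app/profile/{did}/post/{rkey}"

print(post_uri_to_web_url("at://did:plc:abc123/app.bsky.feed.post/3kabc"))
# → https://bsky.app/profile/did:plc:abc123/post/3kabc
```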

Search Tips

  • Quoted phrases work: "machine learning" matches the exact phrase
  • Hashtags: search for #buildinpublic or #python to track community hashtags
  • Date filtering: use since and until parameters to narrow to a time window
  • Author filtering: combine with an author handle to search one person's posts
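Combining those tips, a filtered input might look like the sketch below. The since, until, and author names follow the AT Protocol's searchPosts lexicon; whether the actor forwards them under exactly these names is an assumption to verify against its input schema:

```python
# Hypothetical filtered input — confirm field names against the actor's schema
filtered_input = {
    "searchTerms": ['"machine learning"', "#buildinpublic"],
    "since": "2026-03-01T00:00:00Z",    # ISO 8601 lower bound
    "until": "2026-03-08T00:00:00Z",    # ISO 8601 upper bound
    "author": "developer.bsky.social",  # restrict to one account's posts
    "maxResults": 200,
    "sort": "top",
}
print(filtered_input["since"] < filtered_input["until"])  # sanity-check window
```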

Mode 2: Profile Mode

Profile mode lets you fetch detailed information about specific Bluesky users — their bio, follower/following counts, post counts, and metadata.

Use Case: Influencer Research

Say you're building a list of AI thought leaders on Bluesky. You have a list of handles, and you need their follower counts, bio text, and activity levels.

Python Example

# Fetch profiles for a list of handles
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "handles": [
            "jay.bsky.team",
            "pfrazee.com",
            "samuel.bsky.team",
        ],
        "mode": "profiles",
    },
)

Profile Data Fields

| Field          | Description                 |
| -------------- | --------------------------- |
| handle         | User's Bluesky handle       |
| displayName    | Display name                |
| did            | Decentralized Identifier    |
| description    | Bio text                    |
| followersCount | Number of followers         |
| followsCount   | Number of accounts followed |
| postsCount     | Total post count            |
| avatar         | Avatar image URL            |
| createdAt      | Account creation date       |

This is invaluable for influencer scoring. You can rank users by follower count, calculate follower-to-following ratios, and identify accounts that are growing fastest by tracking profiles over time with scheduled runs.
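As a sketch of that scoring step, here is a ranking pass over profile records shaped like the fields above (the sample data is made up for illustration):

```python
profiles = [  # sample records using the profile fields described above
    {"handle": "a.bsky.social", "followersCount": 12000, "followsCount": 300},
    {"handle": "b.bsky.social", "followersCount": 8000, "followsCount": 4000},
    {"handle": "c.bsky.social", "followersCount": 15000, "followsCount": 100},
]

for p in profiles:
    # Follower-to-following ratio; guard against accounts that follow nobody
    p["ratio"] = p["followersCount"] / max(p["followsCount"], 1)

ranked = sorted(profiles, key=lambda p: p["ratio"], reverse=True)
for p in ranked:
    print(f"{p['handle']}: ratio {p['ratio']:.1f}")
```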


Mode 3: Followers Mode

Followers mode extracts the follower list for a specific account. This is the mode you need for network analysis and audience overlap studies.

Use Case: Audience Overlap Analysis

Want to know how much overlap exists between two competing brands on Bluesky? Pull the follower lists for both accounts, then compute the intersection:

import requests
import time

def get_followers(handle, token, actor_id):
    """Fetch followers for a Bluesky handle via the scraper."""
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"handle": handle, "mode": "followers", "maxResults": 5000},
    )
    run = resp.json()["data"]
    # Poll until the run finishes, then read its default dataset
    auth = {"Authorization": f"Bearer {token}"}
    while True:
        status = requests.get(
            f"https://api.apify.com/v2/actor-runs/{run['id']}", headers=auth
        ).json()["data"]["status"]
        if status in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(5)
    followers_data = requests.get(
        f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
        headers=auth,
    ).json()
    return {f["handle"] for f in followers_data}

brand_a_followers = get_followers("brand-a.bsky.social", APIFY_TOKEN, ACTOR_ID)
brand_b_followers = get_followers("brand-b.bsky.social", APIFY_TOKEN, ACTOR_ID)

overlap = brand_a_followers & brand_b_followers
print(f"Overlap: {len(overlap)} shared followers")
print(f"Brand A unique: {len(brand_a_followers - overlap)}")
print(f"Brand B unique: {len(brand_b_followers - overlap)}")
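Raw overlap counts favor large accounts. To compare overlap across account pairs of different sizes, a Jaccard index over the two follower sets is a common normalization (a small helper, not part of the scraper):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

brand_a = {"u1", "u2", "u3", "u4"}
brand_b = {"u3", "u4", "u5"}
print(f"Jaccard: {jaccard(brand_a, brand_b):.2f}")  # 2 shared / 5 total → 0.40
```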

Scheduling Recurring Scrapes

One-off collection is useful, but most real use cases need ongoing monitoring. Apify's scheduling runs the scraper on any cron schedule — hourly, daily, weekly.

# Create a daily schedule via the API
schedule_resp = requests.post(
    "https://api.apify.com/v2/schedules",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "name": "bluesky-daily-brand-monitor",
        "cronExpression": "0 9 * * *",
        "timezone": "America/New_York",
        "actions": [{
            "type": "RUN_ACTOR",
            "actorId": "cryptosignals/bluesky-scraper",
            "runInput": {
                "body": json.dumps({
                    "searchTerms": ["your brand", "competitor brand"],
                    "maxResults": 500,
                    "sort": "latest",
                }),
                "contentType": "application/json",
            },
        }],
    },
)
print(f"Schedule created: {schedule_resp.json()['data']['id']}")

Each scheduled run creates a new dataset, giving you a clean time series of results.


Real-World Use Cases

Social Listening & Brand Monitoring

Track mentions of your brand, product, or competitors daily. Export to CSV and feed into your analytics dashboard. With Bluesky's user base concentrated in tech, media, and academic communities, it's high-signal data for B2B companies.

Trend Detection

Run broad keyword searches on a schedule and compare volume over time. Spot emerging topics — new frameworks, security vulnerabilities, industry shifts — before they hit mainstream platforms.
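The comparison step can be as simple as bucketing posts by day using the createdAt field on each result (ISO 8601 strings, so the date is the first 10 characters):

```python
from collections import Counter

def daily_volume(posts):
    """Count posts per UTC day from ISO 8601 createdAt timestamps."""
    return Counter(p["createdAt"][:10] for p in posts)

sample = [
    {"createdAt": "2026-03-15T14:30:00.000Z"},
    {"createdAt": "2026-03-15T18:01:00.000Z"},
    {"createdAt": "2026-03-16T09:00:00.000Z"},
]
for day, count in sorted(daily_volume(sample).items()):
    print(day, count)
# 2026-03-15 2
# 2026-03-16 1
```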

Sentiment Analysis Pipeline

The structured output (clean text + timestamps + engagement metrics) plugs directly into NLP pipelines:

from textblob import TextBlob

for post in results:
    blob = TextBlob(post["text"])
    print(f"Sentiment: {blob.sentiment.polarity:+.2f} | {post['text'][:80]}")

Academic Research

Bluesky is one of the few platforms where large-scale social media research doesn't require a special academic API tier. The AT Protocol's openness makes it ideal for studying information spread, community dynamics, and platform migration patterns.


Handling Proxies and Rate Limits at Scale

For most use cases, the Apify Actor handles rate limiting automatically. But if you're also hitting the AT Protocol directly from your own infrastructure (for real-time streaming or custom collection), you'll want a proxy rotation layer.

Services like ScraperAPI provide automatic proxy rotation with residential and datacenter IPs, handling retries and CAPTCHAs for you. This is especially useful when combining Bluesky data collection with scraping from other platforms that do require authentication.


Monitoring Your Scraping Pipeline

When you're running scheduled scrapes across multiple search terms, monitoring becomes important. You need to know when a run fails, when output volume drops unexpectedly, or when your data pipeline breaks.

ScrapeOps provides a monitoring dashboard for scraping operations — track success rates, run durations, and data volumes across all your Apify actors and custom scrapers from a single interface.


Bluesky vs. Twitter/X API: Cost Comparison

| Feature            | Twitter/X API               | Bluesky AT Protocol               |
| ------------------ | --------------------------- | --------------------------------- |
| Authentication     | OAuth 2.0 required          | None for public data              |
| Cost               | $5,000/month (Pro)          | Free                              |
| Rate limits        | 300 requests/15 min (Basic) | Generous, undocumented            |
| Historical data    | Limited on lower tiers      | Full search via the public AppView |
| Export formats     | JSON only                   | JSON, CSV, Excel, XML (via Apify) |
| Developer approval | Required, can take weeks    | Not needed                        |

For social media analysis, the economics are clear. You can monitor Bluesky for the cost of Apify compute credits (pennies per run) versus $5,000/month for comparable Twitter/X access.


Getting Started

The shortest path from zero to data:

  1. Create a free Apify account
  2. Open the Bluesky Scraper
  3. Enter search terms, set a result limit, click Start
  4. Download results as JSON, CSV, or Excel

For production pipelines, grab your API token from the Apify Console and use the Python examples above to integrate into your workflow.

Bluesky's open protocol is a rare thing in social media: a platform where public data is actually public. Whether you're building a monitoring dashboard, training a sentiment model, or studying platform migration, the data is there — and it costs orders of magnitude less than the alternatives.


The Bluesky Scraper is available at apify.com/cryptosignals/bluesky-scraper. For AT Protocol documentation, see docs.bsky.app.
