Bluesky crossed 40 million users in early 2026. For anyone doing social media analysis, brand monitoring, or influencer research, that's a dataset growing faster than any other decentralized platform — and unlike Twitter/X, the data is genuinely accessible.
The reason comes down to architecture. Bluesky runs on the AT Protocol, an open, decentralized protocol where public posts are truly public. No API keys. No OAuth flow. No $5,000/month Enterprise tier. You can query the public API right now with curl and get structured JSON back.
In this guide, I'll walk through every mode of the Bluesky Scraper on Apify — search, profile, and followers — with Python code examples for each. Whether you're building a social listening dashboard or doing academic research on decentralized social networks, this is the complete reference.
Why Bluesky Data Matters in 2026
Before diving into the technical walkthrough, here's why Bluesky deserves a spot in your data pipeline:
It's where early adopters moved. When Twitter/X locked down API access in 2023 and priced meaningful access at $5,000/month (the Pro tier), a significant chunk of the developer, researcher, and journalist community migrated to Bluesky. The conversations happening there often lead mainstream coverage by days.
The data is open by design. The AT Protocol treats every user's public data as a signed, portable data repository. This isn't a policy decision that could change next quarter — it's baked into the protocol's architecture. Public posts are served over unauthenticated HTTP endpoints.
Growing fast. With 40M+ users and over 3.5 million daily active users, Bluesky has reached critical mass for meaningful social media analysis across tech, politics, media, and academic communities.
Understanding the AT Protocol (Quick Primer)
The AT Protocol uses a few concepts you should understand:
- XRPC endpoints — API calls are namespaced methods like `app.bsky.feed.searchPosts`, called via HTTPS GET/POST.
- DIDs — Decentralized Identifiers. Every user has a DID (like `did:plc:abc123`) that's their permanent identity, separate from their handle.
- Public AppView — Bluesky runs a public aggregator at `https://public.api.bsky.app` that serves read-only data with zero authentication.
- Lexicon schemas — Every data type is formally defined, so API responses are predictable and well-typed.
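To make the handle/DID distinction concrete, here's a small sketch that builds unauthenticated XRPC GET URLs for the public AppView. The `xrpc_url` helper is my own illustration, not part of any SDK; `com.atproto.identity.resolveHandle` is the real endpoint that maps a handle to its permanent DID.

```python
from urllib.parse import urlencode

APPVIEW = "https://public.api.bsky.app"

def xrpc_url(method, **params):
    """Build an unauthenticated XRPC GET URL against the public AppView."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{APPVIEW}/xrpc/{method}" + (f"?{query}" if query else "")

# Resolve a handle to its permanent DID
url = xrpc_url("com.atproto.identity.resolveHandle", handle="bsky.app")
print(url)
# https://public.api.bsky.app/xrpc/com.atproto.identity.resolveHandle?handle=bsky.app

# To actually call it (needs network access):
# import requests
# did = requests.get(url).json()["did"]
```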
The practical upshot: you can search all of Bluesky right now:
```shell
curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&limit=25"
```
No headers. No tokens. That returns up to 25 posts with full text, author handle, timestamps, and engagement metrics.
Mode 1: Search Mode (Keywords & Hashtags)
The search mode is the most common starting point. You provide keywords or hashtag terms, and the scraper returns matching posts with full metadata.
Running via the Apify Console
1. Open cryptosignals/bluesky-scraper
2. Set your search terms (e.g., `["machine learning", "LLM agents"]`)
3. Set `maxResults` to control volume (start with 100-200 for testing)
4. Choose sort order: `latest` for chronological, `top` for highest engagement
5. Click Start
Running via Python
```python
import json
import time

import requests

# Apify API endpoint for running actors
APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals/bluesky-scraper"

# Start the actor run
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "searchTerms": ["artificial intelligence", "AI agents"],
        "maxResults": 200,
        "sort": "latest",
    },
)
run_data = run_response.json()["data"]
run_id = run_data["id"]
dataset_id = run_data["defaultDatasetId"]
print(f"Run started: {run_id}")

# Poll until the run finishes
while True:
    status_resp = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    )
    status = status_resp.json()["data"]["status"]
    if status in ("SUCCEEDED", "FAILED", "ABORTED"):
        break
    time.sleep(5)
print(f"Run finished with status: {status}")

# Fetch the results
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()

for post in results[:5]:
    print(f"@{post['author']['handle']}: {post['text'][:100]}...")
    print(f"  Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
    print(f"  Posted: {post['createdAt']}")
    print()
```
What You Get Back
Each post object includes:
```json
{
  "text": "Just shipped our new AI agent framework...",
  "author": {
    "handle": "developer.bsky.social",
    "displayName": "Jane Dev",
    "did": "did:plc:abc123..."
  },
  "createdAt": "2026-03-15T14:30:00.000Z",
  "likeCount": 42,
  "repostCount": 12,
  "replyCount": 8,
  "uri": "at://did:plc:abc123/app.bsky.feed.post/3k..."
}
```
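Because every post arrives in the same Lexicon-defined shape, flattening results for a spreadsheet takes only a few lines. This sketch assumes the field names shown above; `flatten_post` and `write_csv` are illustrative helpers, not part of the scraper's output.

```python
import csv

def flatten_post(post):
    """Pull the fields most analyses need into a flat dict."""
    return {
        "handle": post["author"]["handle"],
        "text": post["text"],
        "created_at": post["createdAt"],
        "likes": post.get("likeCount", 0),
        "reposts": post.get("repostCount", 0),
        "replies": post.get("replyCount", 0),
    }

def write_csv(posts, path):
    """Write a list of post objects to a CSV file."""
    rows = [flatten_post(p) for p in posts]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

(Apify can also export datasets to CSV directly; this is for when you want the transformation inside your own pipeline.)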
Search Tips
- Quoted phrases work: `"machine learning"` matches the exact phrase
- Hashtags: search for `#buildinpublic` or `#python` to track community hashtags
- Date filtering: use `since` and `until` parameters to narrow to a time window
- Author filtering: combine with an author handle to search one person's posts
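If you're hitting the public endpoint directly rather than going through the Actor, the same filters map onto `app.bsky.feed.searchPosts` query parameters (`q`, `sort`, `since`, `until`, `author`, `limit`). A minimal sketch; `build_search_params` is my own helper for dropping unset filters:

```python
SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def build_search_params(q, sort="latest", since=None, until=None,
                        author=None, limit=25):
    """Assemble searchPosts query parameters, dropping any filters left unset."""
    params = {"q": q, "sort": sort, "since": since, "until": until,
              "author": author, "limit": limit}
    return {k: v for k, v in params.items() if v is not None}

params = build_search_params(
    '"machine learning"',            # quoted phrase = exact match
    since="2026-03-01T00:00:00Z",
    until="2026-03-15T00:00:00Z",
)

# To run the search (needs network access):
# import requests
# posts = requests.get(SEARCH_URL, params=params).json()["posts"]
```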
Mode 2: Profile Mode
Profile mode lets you fetch detailed information about specific Bluesky users — their bio, follower/following counts, post counts, and metadata.
Use Case: Influencer Research
Say you're building a list of AI thought leaders on Bluesky. You have a list of handles, and you need their follower counts, bio text, and activity levels.
Python Example
```python
# Fetch profiles for a list of handles
run_response = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "handles": [
            "jay.bsky.team",
            "pfrazee.com",
            "samuel.bsky.team",
        ],
        "mode": "profiles",
    },
)
```
Profile Data Fields
| Field | Description |
|---|---|
| `handle` | User's Bluesky handle |
| `displayName` | Display name |
| `did` | Decentralized Identifier |
| `description` | Bio text |
| `followersCount` | Number of followers |
| `followsCount` | Number of accounts followed |
| `postsCount` | Total post count |
| `avatar` | Avatar image URL |
| `createdAt` | Account creation date |
This is invaluable for influencer scoring. You can rank users by follower count, calculate follower-to-following ratios, and identify accounts that are growing fastest by tracking profiles over time with scheduled runs.
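As a sketch of that scoring idea (field names as in the table above; the weighting formula is my own illustration, not a standard metric):

```python
def influence_score(profile):
    """Naive score: followers weighted by the follower-to-following ratio."""
    followers = profile.get("followersCount", 0)
    follows = max(profile.get("followsCount", 0), 1)  # avoid division by zero
    return followers * (followers / follows)

def rank_profiles(profiles, top_n=10):
    """Return the top_n profiles by influence score, highest first."""
    return sorted(profiles, key=influence_score, reverse=True)[:top_n]

profiles = [
    {"handle": "a.bsky.social", "followersCount": 5000, "followsCount": 100},
    {"handle": "b.bsky.social", "followersCount": 8000, "followsCount": 7900},
]
for p in rank_profiles(profiles):
    print(p["handle"], round(influence_score(p)))
```

Note how the ratio term demotes accounts that follow nearly as many people as follow them, a common pattern for follow-back farming.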
Mode 3: Followers Mode
Followers mode extracts the follower list for a specific account. This is the mode you need for network analysis and audience overlap studies.
Use Case: Audience Overlap Analysis
Want to know how much overlap exists between two competing brands on Bluesky? Pull the follower lists for both accounts, then compute the intersection:
```python
import time

import requests

def get_followers(handle, token, actor_id):
    """Fetch followers for a Bluesky handle via the scraper."""
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"handle": handle, "mode": "followers", "maxResults": 5000},
    )
    run = resp.json()["data"]
    # Poll until the run finishes
    while True:
        status = requests.get(
            f"https://api.apify.com/v2/actor-runs/{run['id']}",
            headers={"Authorization": f"Bearer {token}"},
        ).json()["data"]["status"]
        if status in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(5)
    # Fetch the follower records from the run's dataset
    followers_data = requests.get(
        f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
        headers={"Authorization": f"Bearer {token}"},
    ).json()
    return {f["handle"] for f in followers_data}

brand_a_followers = get_followers("brand-a.bsky.social", APIFY_TOKEN, ACTOR_ID)
brand_b_followers = get_followers("brand-b.bsky.social", APIFY_TOKEN, ACTOR_ID)

overlap = brand_a_followers & brand_b_followers
print(f"Overlap: {len(overlap)} shared followers")
print(f"Brand A unique: {len(brand_a_followers - overlap)}")
print(f"Brand B unique: {len(brand_b_followers - overlap)}")
```
Scheduling Recurring Scrapes
One-off collection is useful, but most real use cases need ongoing monitoring. Apify's scheduling runs the scraper on any cron schedule — hourly, daily, weekly.
```python
# Create a daily schedule via the API
schedule_resp = requests.post(
    "https://api.apify.com/v2/schedules",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "name": "bluesky-daily-brand-monitor",
        "cronExpression": "0 9 * * *",
        "timezone": "America/New_York",
        "actions": [{
            "type": "RUN_ACTOR",
            "actorId": "cryptosignals/bluesky-scraper",
            "runInput": {
                "body": json.dumps({
                    "searchTerms": ["your brand", "competitor brand"],
                    "maxResults": 500,
                    "sort": "latest",
                }),
                "contentType": "application/json",
            },
        }],
    },
)
print(f"Schedule created: {schedule_resp.json()['data']['id']}")
```
Each scheduled run creates a new dataset, giving you a clean time series of results.
Real-World Use Cases
Social Listening & Brand Monitoring
Track mentions of your brand, product, or competitors daily. Export to CSV and feed into your analytics dashboard. With Bluesky's user base concentrated in tech, media, and academic communities, it's high-signal data for B2B companies.
Trend Detection
Run broad keyword searches on a schedule and compare volume over time. Spot emerging topics — new frameworks, security vulnerabilities, industry shifts — before they hit mainstream platforms.
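One way to quantify "volume over time" is to bucket scraped posts by calendar day using their `createdAt` timestamps. A minimal sketch in pure Python, assuming the ISO-8601 timestamps shown earlier; `daily_volume` is an illustrative helper:

```python
from collections import Counter
from datetime import datetime

def daily_volume(posts):
    """Count posts per UTC day from their createdAt timestamps."""
    days = Counter()
    for post in posts:
        # fromisoformat (pre-3.11) needs an explicit offset instead of "Z"
        ts = post["createdAt"].replace("Z", "+00:00")
        days[datetime.fromisoformat(ts).date().isoformat()] += 1
    return dict(sorted(days.items()))

posts = [
    {"createdAt": "2026-03-14T09:00:00.000Z"},
    {"createdAt": "2026-03-14T21:30:00.000Z"},
    {"createdAt": "2026-03-15T14:30:00.000Z"},
]
print(daily_volume(posts))
# {'2026-03-14': 2, '2026-03-15': 1}
```

A sudden jump in a keyword's daily count across scheduled runs is your trend signal.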
Sentiment Analysis Pipeline
The structured output (clean text + timestamps + engagement metrics) plugs directly into NLP pipelines:
```python
from textblob import TextBlob

for post in results:
    blob = TextBlob(post["text"])
    print(f"Sentiment: {blob.sentiment.polarity:.2f} — {post['text'][:80]}")
```
Academic Research
Bluesky is one of the few platforms where large-scale social media research doesn't require a special academic API tier. The AT Protocol's openness makes it ideal for studying information spread, community dynamics, and platform migration patterns.
Handling Proxies and Rate Limits at Scale
For most use cases, the Apify Actor handles rate limiting automatically. But if you're also hitting the AT Protocol directly from your own infrastructure (for real-time streaming or custom collection), you'll want a proxy rotation layer.
Services like ScraperAPI provide automatic proxy rotation with residential and datacenter IPs, handling retries and CAPTCHAs for you. This is especially useful when combining Bluesky data collection with scraping from other platforms that do require authentication.
Monitoring Your Scraping Pipeline
When you're running scheduled scrapes across multiple search terms, monitoring becomes important. You need to know when a run fails, when output volume drops unexpectedly, or when your data pipeline breaks.
ScrapeOps provides a monitoring dashboard for scraping operations — track success rates, run durations, and data volumes across all your Apify actors and custom scrapers from a single interface.
Bluesky vs. Twitter/X API: Cost Comparison
| Feature | Twitter/X API | Bluesky AT Protocol |
|---|---|---|
| Authentication | OAuth 2.0 required | None for public data |
| Cost | $5,000/month (Pro) | Free |
| Rate limits | 300 requests/15 min (Basic) | Generous, undocumented |
| Historical data | Limited on lower tiers | Full-text search via public endpoints |
| Export formats | JSON only | JSON, CSV, Excel, XML (via Apify) |
| Developer approval | Required, can take weeks | Not needed |
For social media analysis, the economics are clear. You can monitor Bluesky for the cost of Apify compute credits (pennies per run) versus $5,000/month for comparable Twitter/X access.
Getting Started
The shortest path from zero to data:
- Create a free Apify account
- Open the Bluesky Scraper
- Enter search terms, set a result limit, click Start
- Download results as JSON, CSV, or Excel
For production pipelines, grab your API token from the Apify Console and use the Python examples above to integrate into your workflow.
Bluesky's open protocol is a rare thing in social media: a platform where public data is actually public. Whether you're building a monitoring dashboard, training a sentiment model, or studying platform migration, the data is there — and it costs orders of magnitude less than the alternatives.
The Bluesky Scraper is available at apify.com/cryptosignals/bluesky-scraper. For AT Protocol documentation, see docs.bsky.app.