
agenthustler

Posted on • Originally published at thedatacollector.substack.com

How to Scrape Bluesky Posts in 2026 (No Code Required)

What Is the AT Protocol?

The AT Protocol (Authenticated Transfer Protocol) is an open-source framework for building decentralized social applications. Bluesky is its flagship app, but the protocol itself is platform-agnostic — anyone can build on it.

Here is what matters for data collection:

Data repositories. Every Bluesky user has a signed data repository — think of it like a personal Git repo for their social data. It stores their posts, likes, follows, reposts, and profile information as structured records.

Lexicon schema. The protocol uses a schema system called Lexicon to define the shape of every data type and API call. This means endpoints are predictable and well-typed. A post is always a post, with the same fields, regardless of which server hosts it.

Public API. Bluesky runs a public AppView at https://public.api.bsky.app that aggregates data from across the network. Read-only endpoints on this server require zero authentication. You send an HTTP GET request, you get JSON back.

XRPC. API calls use an HTTP convention called XRPC, where endpoints are namespaced method names rather than REST-style resource paths. For example, searching posts uses app.bsky.feed.searchPosts, and fetching a profile uses app.bsky.actor.getProfile.

The practical upshot: if you want to search Bluesky posts by keyword, you can do this right now:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=artificial+intelligence&limit=25"

No headers, no tokens, no setup. That returns a JSON object with up to 25 posts matching "artificial intelligence," including the full post text, author handle, timestamp, like count, repost count, and reply count.
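If you are parsing that JSON in code, the useful fields live in a couple of nested objects: author info under author, the original record (text, timestamp) under record, and engagement counts at the top level. A minimal Python sketch that flattens one item from the posts array — the field names follow the postView shape the endpoint returns, so adjust if your responses differ:

```python
def flatten_post(post: dict) -> dict:
    """Pull the commonly used fields out of one searchPosts item.

    Assumes the postView shape: the post text sits under record.text,
    not at the top level. Engagement counts default to 0 when absent.
    """
    return {
        "handle": post["author"]["handle"],
        "text": post["record"].get("text", ""),
        "created_at": post["record"].get("createdAt"),
        "likes": post.get("likeCount", 0),
        "reposts": post.get("repostCount", 0),
        "replies": post.get("replyCount", 0),
        "uri": post["uri"],
    }
```
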


Querying the AT Protocol Directly

The public API is useful for quick lookups and small-scale collection. Here are the key endpoints:

Searching Posts

Endpoint: GET /xrpc/app.bsky.feed.searchPosts

| Parameter | Type | Description |
| --- | --- | --- |
| q | string | Search query (supports Lucene syntax) |
| limit | int | Results per page (max 100) |
| cursor | string | Pagination cursor from the previous response |
| sort | string | top or latest |
| since | string | Only posts after this datetime (ISO 8601) |
| until | string | Only posts before this datetime (ISO 8601) |
| author | string | Only posts by a specific DID or handle |

The search supports Lucene-style query syntax, so you can use quoted phrases ("machine learning"), boolean operators, and field-specific queries.

Example — searching for recent posts about web scraping:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&sort=latest&limit=10"

Fetching Profiles

Endpoint: GET /xrpc/app.bsky.actor.getProfile

curl "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=jay.bsky.team"

This returns the user's display name, bio, avatar URL, follower count, following count, and post count.

Getting an Author's Feed

Endpoint: GET /xrpc/app.bsky.feed.getAuthorFeed

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=jay.bsky.team&limit=30"

The Pagination Problem

Here is where the direct API approach starts to break down. Each response includes a cursor field. To get the next page, you pass that cursor back in your next request. For 100 posts, that is straightforward. For 10,000 posts across multiple search terms with date filtering, you are writing a pagination loop, handling edge cases, managing rate limits, and parsing nested JSON.
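The loop itself is not complicated, just tedious. A Python sketch of the cursor loop, assuming the searchPosts response shape described above; the fetch_page function is injected so the same loop works with urllib, requests, or a stub in tests:

```python
import urllib.parse

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def collect_posts(fetch_page, query, max_posts=10_000, page_size=100):
    """Paginate searchPosts by feeding each response's cursor back in.

    `fetch_page` takes a full URL and returns the decoded JSON dict.
    Stops when max_posts is reached or the API stops returning a cursor.
    """
    posts, cursor = [], None
    while len(posts) < max_posts:
        params = {"q": query, "limit": min(page_size, max_posts - len(posts))}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(SEARCH_URL + "?" + urllib.parse.urlencode(params))
        posts.extend(page.get("posts", []))
        cursor = page.get("cursor")
        if not cursor or not page.get("posts"):
            break  # no more pages
    return posts
```

This still leaves rate-limit handling (429 responses) and retries as an exercise, which is the point of the comparison that follows.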

This is exactly the kind of plumbing that an Apify Actor handles for you.


Using the Bluesky Scraper on Apify

The Bluesky Scraper is a ready-to-use Actor on the Apify platform that wraps the AT Protocol API with production-grade data collection logic: automatic pagination, rate limit handling, structured output, and multiple export formats.

Step 1: Open the Actor

Go to apify.com/cryptosignals/bluesky-scraper. You can run it directly from the web console without writing any code.

Step 2: Configure Your Search

In the Actor's input configuration, set your parameters:

  • Search terms — one or more keywords or phrases to search for
  • Max results — cap the number of posts returned (useful for controlling costs and runtime)
  • Date range — filter posts to a specific time window using since and until parameters
  • Sort order — choose between top (most engaged) or latest (most recent)

Step 3: Run and Export

Click Start to run the Actor. Depending on the volume of data, runs typically complete in seconds to a few minutes. Once finished, you can export the results in multiple formats:

  • JSON — for programmatic consumption
  • CSV — for spreadsheets and data analysis tools
  • Excel — for business reporting
  • XML — for legacy system integration

Each post record includes the full text, author handle, author DID, timestamp, like count, repost count, reply count, any embedded links or images, and the original AT Protocol URI.


Running the Actor Programmatically

The web console is great for one-off runs, but the real power is in the API. You can trigger the Bluesky Scraper from your own code, wait for it to finish, and process the results — all programmatically.

JavaScript (Node.js)

Install the Apify client:

npm install apify-client

Then run the Actor and fetch results:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_APIFY_API_TOKEN',
});

// Run the Bluesky Scraper actor
const run = await client.actor('cryptosignals/bluesky-scraper').call({
    searchTerms: ['AI agents', 'LLM applications'],
    maxResults: 200,
    sort: 'latest',
});

// Fetch results from the default dataset
const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

console.log(`Collected ${items.length} posts`);

// Process each post
for (const post of items) {
    console.log(`@${post.author.handle}: ${post.text}`);
    console.log(`  Likes: ${post.likeCount} | Reposts: ${post.repostCount}`);
    console.log(`  Posted: ${post.createdAt}`);
    console.log('---');
}

The client.actor().call() method starts the Actor, waits for it to finish (with smart polling and exponential backoff built in), and returns the run metadata. You then use client.dataset().listItems() to pull the actual data.

Python

Install the Apify client:

pip install apify-client

Then run and fetch:

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_API_TOKEN')

# Configure and run the actor
run = client.actor('cryptosignals/bluesky-scraper').call(run_input={
    'searchTerms': ['AI agents', 'LLM applications'],
    'maxResults': 200,
    'sort': 'latest',
})

# Fetch results
dataset_items = client.dataset(run['defaultDatasetId']).list_items().items

print(f'Collected {len(dataset_items)} posts')

for post in dataset_items:
    print(f"@{post['author']['handle']}: {post['text']}")
    print(f"  Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
    print(f"  Posted: {post['createdAt']}")
    print('---')

Both examples follow the same pattern: initialize the client with your API token, call the Actor with your desired input, wait for completion, then iterate over the dataset. The Apify client libraries handle all the HTTP polling and retry logic.


Scheduling Recurring Scrapes

One-off data collection is useful, but most real-world use cases require ongoing monitoring. Apify's built-in scheduling system lets you run the Bluesky Scraper on a recurring basis — hourly, daily, weekly, or on any custom cron schedule.

Setting Up a Schedule via the Console

  1. Go to the Schedules section in your Apify Console
  2. Click Create new schedule
  3. Select the Bluesky Scraper actor
  4. Define your cron expression (e.g., 0 9 * * * for every day at 9:00 AM)
  5. Set the Actor input — the same search terms, max results, and filters you would use for a manual run
  6. Save the schedule

Each scheduled run creates a new dataset, so you have a clean historical record of results over time.
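If a downstream job only ever needs the newest results, Apify's "last run" shortcut endpoint saves you from tracking dataset IDs at all. A small Python sketch of building that URL — note that Apify's API paths address an Actor with ~ in place of the / in its name:

```python
def last_run_items_url(actor_name: str, token: str, status: str = "SUCCEEDED") -> str:
    """Build the URL for the dataset items of an Actor's most recent run.

    Apify URL paths use `~` instead of `/` in the Actor name
    (e.g. cryptosignals~bluesky-scraper). Filtering by status avoids
    reading a failed run's partial data.
    """
    path_id = actor_name.replace("/", "~")
    return (
        f"https://api.apify.com/v2/acts/{path_id}/runs/last"
        f"/dataset/items?status={status}&token={token}"
    )
```

A GET on that URL returns the items of the latest successful scheduled run as JSON, ready for a dashboard or ETL job.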

Setting Up a Schedule via the API

You can also create schedules programmatically. Here is a JavaScript example:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_APIFY_API_TOKEN',
});

// Create a schedule that runs daily at 8:00 AM UTC
const schedule = await client.schedules().create({
    name: 'bluesky-daily-monitor',
    cronExpression: '0 8 * * *',
    timezone: 'UTC',
    actions: [{
        type: 'RUN_ACTOR',
        actorId: 'cryptosignals/bluesky-scraper', // the schedules API may expect the Actor's unique ID rather than its name — check the console if this fails
        runInput: {
            body: JSON.stringify({
                searchTerms: ['your brand name', 'competitor name'],
                maxResults: 500,
                sort: 'latest',
            }),
            contentType: 'application/json',
        },
    }],
});

console.log(`Schedule created: ${schedule.id}`);

Integrating with Webhooks

For a fully automated pipeline, configure a webhook on the Actor run to trigger when it finishes. This lets you push fresh data to a database, a Slack channel, a Google Sheet, or any downstream system the moment the scrape completes.
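The webhook's POST body identifies the finished run, including its default dataset, so the receiver can pull the items immediately. A minimal Python sketch of the receiving side's parsing step — the field names assume Apify's default webhook payload template, where resource is the run object, so verify them against an actual payload from your account:

```python
def dataset_export_url(payload: dict, fmt: str = "json") -> str:
    """From a run-finished webhook payload, build the dataset export URL.

    Assumes the default payload template: `eventType` at the top level
    and the run object (with `defaultDatasetId`) under `resource`.
    """
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        raise ValueError(f"unexpected event: {payload.get('eventType')!r}")
    dataset_id = payload["resource"]["defaultDatasetId"]
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
```

Your webhook handler would call this on the incoming JSON body, fetch the URL, and forward the rows to Slack, a database, or a sheet.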


Practical Use Cases

Social Media Monitoring

Track mentions of your brand, product, or competitors across Bluesky. Schedule a daily scrape for your brand keywords, export to CSV, and feed it into your existing analytics dashboard. With Bluesky's user base growing fast (over 3.5 million daily active users as of late 2025), it is increasingly important to include it in your social listening stack.

Sentiment Analysis

Collect posts about a topic over time and run them through a sentiment analysis model. The structured text output from the scraper (clean post text, timestamps, engagement metrics) is ready for NLP pipelines without additional cleaning.

Academic Research

Researchers studying online discourse, information spread, or platform dynamics can use the scraper to build datasets of public conversations. The AT Protocol's openness makes Bluesky one of the few platforms where large-scale social media research does not require a special academic API tier.

Competitive Intelligence

Monitor what people are saying about competing products or services. Track engagement patterns — which types of posts get the most likes and reposts in your industry. Identify influential voices in your niche by analyzing author metrics alongside post performance.

Trend Detection

Run broad keyword searches on a recurring schedule and compare results over time. Spot emerging topics before they hit mainstream platforms by watching what the Bluesky community is discussing.


Why Use an Apify Actor vs. Direct API Calls

You might wonder: if the AT Protocol API is free and open, why use an Apify Actor at all? Here is a practical comparison:

| Concern | Direct API | Apify Actor |
| --- | --- | --- |
| Pagination | Manual cursor loop implementation | Handled automatically |
| Rate limits | Must detect and handle 429 responses | Built-in backoff and retry |
| Data formatting | Raw nested JSON from XRPC responses | Cleaned, flattened output |
| Multiple queries | Sequential loops for each search term | Parallel execution across terms |
| Export formats | JSON only (you build the rest) | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, error handling | Built-in cron with monitoring |
| Error handling | Build it yourself | Automatic retries and failure alerts |
| Storage | You manage where data lands | Managed datasets with 7-day retention |

For a quick one-off query, the direct API is perfect — it is literally a curl command. But for production data collection (multiple search terms, daily runs, thousands of posts, integration with downstream systems), the Actor saves significant engineering time.


Getting Started

Here is the shortest path from zero to data:

  1. Create a free Apify account
  2. Open the Bluesky Scraper
  3. Enter your search terms, set a result limit, and click Start
  4. Download your results in your preferred format

For recurring collection, set up a schedule. For programmatic access, grab your API token from the Apify Console and use the code examples above.

Bluesky's open protocol is a rare thing in social media: a platform that treats public data as actually public. Whether you are building a monitoring dashboard, training a sentiment model, or studying online communities, the data is there. You just need to collect it.


The Bluesky Scraper Actor is available at apify.com/cryptosignals/bluesky-scraper. For questions about the AT Protocol, see the official Bluesky API documentation.


Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.


Compare web scraping APIs:

  • ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
  • Scrape.do — From $29/mo, strong Cloudflare bypass
  • ScrapeOps — Proxy comparison + monitoring dashboard

Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.
