
agenthustler

Posted on • Originally published at thedatacollector.substack.com

How to Scrape Bluesky Posts in 2026 (No Code Required)

What Is the AT Protocol?

The AT Protocol (Authenticated Transfer Protocol) is an open-source framework for building decentralized social applications. Bluesky is its flagship app, but the protocol itself is platform-agnostic — anyone can build on it.

Here is what matters for data collection:

Data repositories. Every Bluesky user has a signed data repository — think of it like a personal Git repo for their social data. It stores their posts, likes, follows, reposts, and profile information as structured records.

Lexicon schema. The protocol uses a schema system called Lexicon to define the shape of every data type and API call. This means endpoints are predictable and well-typed. A post is always a post, with the same fields, regardless of which server hosts it.

Public API. Bluesky runs a public AppView at https://public.api.bsky.app that aggregates data from across the network. Read-only endpoints on this server require zero authentication. You send an HTTP GET request, you get JSON back.

XRPC. API calls use an HTTP convention called XRPC, where endpoints are namespaced method names rather than REST-style resource paths. For example, searching posts uses app.bsky.feed.searchPosts, and fetching a profile uses app.bsky.actor.getProfile.

The practical upshot: if you want to search Bluesky posts by keyword, you can do this right now:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=artificial+intelligence&limit=25"

No headers, no tokens, no setup. That returns a JSON object with up to 25 posts matching "artificial intelligence," including the full post text, author handle, timestamp, like count, repost count, and reply count.
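If you are parsing that JSON in code, the useful fields live in a couple of nested objects: author info under author, the original record (text, timestamp) under record, and engagement counts at the top level. A minimal Python sketch that flattens one item from the posts array — the field names follow the postView shape the endpoint returns, so adjust if your responses differ:

```python
def flatten_post(post: dict) -> dict:
    """Pull the commonly used fields out of one searchPosts item.

    Assumes the postView shape: the post text sits under record.text,
    not at the top level. Engagement counts default to 0 when absent.
    """
    return {
        "handle": post["author"]["handle"],
        "text": post["record"].get("text", ""),
        "created_at": post["record"].get("createdAt"),
        "likes": post.get("likeCount", 0),
        "reposts": post.get("repostCount", 0),
        "replies": post.get("replyCount", 0),
        "uri": post["uri"],
    }
```
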


Querying the AT Protocol Directly

The public API is useful for quick lookups and small-scale collection. Here are the key endpoints:

Searching Posts

Endpoint: GET /xrpc/app.bsky.feed.searchPosts

| Parameter | Type | Description |
| --- | --- | --- |
| q | string | Search query (supports Lucene syntax) |
| limit | int | Results per page (max 100) |
| cursor | string | Pagination cursor from the previous response |
| sort | string | top or latest |
| since | string | Only posts after this datetime (ISO 8601) |
| until | string | Only posts before this datetime (ISO 8601) |
| author | string | Only posts by a specific DID or handle |

The search supports Lucene-style query syntax, so you can use quoted phrases ("machine learning"), boolean operators, and field-specific queries.

Example — searching for recent posts about web scraping:

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&sort=latest&limit=10"

Fetching Profiles

Endpoint: GET /xrpc/app.bsky.actor.getProfile

curl "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=jay.bsky.team"

This returns the user's display name, bio, avatar URL, follower count, following count, and post count.

Getting an Author's Feed

Endpoint: GET /xrpc/app.bsky.feed.getAuthorFeed

curl "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=jay.bsky.team&limit=30"

The Pagination Problem

Here is where the direct API approach starts to break down. Each response includes a cursor field. To get the next page, you pass that cursor back in your next request. For 100 posts, that is straightforward. For 10,000 posts across multiple search terms with date filtering, you are writing a pagination loop, handling edge cases, managing rate limits, and parsing nested JSON.
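The loop itself is not complicated, just tedious. A Python sketch of the cursor loop, assuming the searchPosts response shape described above; the fetch_page function is injected so the same loop works with urllib, requests, or a stub in tests:

```python
import urllib.parse

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def collect_posts(fetch_page, query, max_posts=10_000, page_size=100):
    """Paginate searchPosts by feeding each response's cursor back in.

    `fetch_page` takes a full URL and returns the decoded JSON dict.
    Stops when max_posts is reached or the API stops returning a cursor.
    """
    posts, cursor = [], None
    while len(posts) < max_posts:
        params = {"q": query, "limit": min(page_size, max_posts - len(posts))}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(SEARCH_URL + "?" + urllib.parse.urlencode(params))
        posts.extend(page.get("posts", []))
        cursor = page.get("cursor")
        if not cursor or not page.get("posts"):
            break  # no more pages
    return posts
```

This still leaves rate-limit handling (429 responses) and retries as an exercise, which is the point of the comparison that follows.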

This is exactly the kind of plumbing that an Apify Actor handles for you.


Using the Bluesky Scraper on Apify

The Bluesky Scraper is a ready-to-use Actor on the Apify platform that wraps the AT Protocol API with production-grade data collection logic: automatic pagination, rate limit handling, structured output, and multiple export formats.

Step 1: Open the Actor

Go to apify.com/cryptosignals/bluesky-scraper. You can run it directly from the web console without writing any code.

Step 2: Configure Your Search

In the Actor's input configuration, set your parameters:

  • Search terms — one or more keywords or phrases to search for
  • Max results — cap the number of posts returned (useful for controlling costs and runtime)
  • Date range — filter posts to a specific time window using since and until parameters
  • Sort order — choose between top (most engaged) or latest (most recent)

Step 3: Run and Export

Click Start to run the Actor. Depending on the volume of data, runs typically complete in seconds to a few minutes. Once finished, you can export the results in multiple formats:

  • JSON — for programmatic consumption
  • CSV — for spreadsheets and data analysis tools
  • Excel — for business reporting
  • XML — for legacy system integration

Each post record includes the full text, author handle, author DID, timestamp, like count, repost count, reply count, any embedded links or images, and the original AT Protocol URI.


Running the Actor Programmatically

The web console is great for one-off runs, but the real power is in the API. You can trigger the Bluesky Scraper from your own code, wait for it to finish, and process the results — all programmatically.

JavaScript (Node.js)

Install the Apify client:

npm install apify-client

Then run the Actor and fetch results:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_APIFY_API_TOKEN',
});

// Run the Bluesky Scraper actor
const run = await client.actor('cryptosignals/bluesky-scraper').call({
    searchTerms: ['AI agents', 'LLM applications'],
    maxResults: 200,
    sort: 'latest',
});

// Fetch results from the default dataset
const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

console.log(`Collected ${items.length} posts`);

// Process each post
for (const post of items) {
    console.log(`@${post.author.handle}: ${post.text}`);
    console.log(`  Likes: ${post.likeCount} | Reposts: ${post.repostCount}`);
    console.log(`  Posted: ${post.createdAt}`);
    console.log('---');
}

The client.actor().call() method starts the Actor, waits for it to finish (with smart polling and exponential backoff built in), and returns the run metadata. You then use client.dataset().listItems() to pull the actual data.

Python

Install the Apify client:

pip install apify-client

Then run and fetch:

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_API_TOKEN')

# Configure and run the actor
run = client.actor('cryptosignals/bluesky-scraper').call(run_input={
    'searchTerms': ['AI agents', 'LLM applications'],
    'maxResults': 200,
    'sort': 'latest',
})

# Fetch results
dataset_items = client.dataset(run['defaultDatasetId']).list_items().items

print(f'Collected {len(dataset_items)} posts')

for post in dataset_items:
    print(f"@{post['author']['handle']}: {post['text']}")
    print(f"  Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
    print(f"  Posted: {post['createdAt']}")
    print('---')

Both examples follow the same pattern: initialize the client with your API token, call the Actor with your desired input, wait for completion, then iterate over the dataset. The Apify client libraries handle all the HTTP polling and retry logic.


Scheduling Recurring Scrapes

One-off data collection is useful, but most real-world use cases require ongoing monitoring. Apify's built-in scheduling system lets you run the Bluesky Scraper on a recurring basis — hourly, daily, weekly, or on any custom cron schedule.

Setting Up a Schedule via the Console

  1. Go to the Schedules section in your Apify Console
  2. Click Create new schedule
  3. Select the Bluesky Scraper actor
  4. Define your cron expression (e.g., 0 9 * * * for every day at 9:00 AM)
  5. Set the Actor input — the same search terms, max results, and filters you would use for a manual run
  6. Save the schedule

Each scheduled run creates a new dataset, so you have a clean historical record of results over time.
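If a downstream job only ever needs the newest results, Apify's "last run" shortcut endpoint saves you from tracking dataset IDs at all. A small Python sketch of building that URL — note that Apify's API paths address an Actor with ~ in place of the / in its name:

```python
def last_run_items_url(actor_name: str, token: str, status: str = "SUCCEEDED") -> str:
    """Build the URL for the dataset items of an Actor's most recent run.

    Apify URL paths use `~` instead of `/` in the Actor name
    (e.g. cryptosignals~bluesky-scraper). Filtering by status avoids
    reading a failed run's partial data.
    """
    path_id = actor_name.replace("/", "~")
    return (
        f"https://api.apify.com/v2/acts/{path_id}/runs/last"
        f"/dataset/items?status={status}&token={token}"
    )
```

A GET on that URL returns the items of the latest successful scheduled run as JSON, ready for a dashboard or ETL job.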

Setting Up a Schedule via the API

You can also create schedules programmatically. Here is a JavaScript example:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_APIFY_API_TOKEN',
});

// Create a schedule that runs daily at 8:00 AM UTC
const schedule = await client.schedules().create({
    name: 'bluesky-daily-monitor',
    cronExpression: '0 8 * * *',
    timezone: 'UTC',
    actions: [{
        type: 'RUN_ACTOR',
        actorId: 'cryptosignals/bluesky-scraper', // the schedules API may expect the Actor's unique ID rather than its name — check the console if this fails
        runInput: {
            body: JSON.stringify({
                searchTerms: ['your brand name', 'competitor name'],
                maxResults: 500,
                sort: 'latest',
            }),
            contentType: 'application/json',
        },
    }],
});

console.log(`Schedule created: ${schedule.id}`);

Integrating with Webhooks

For a fully automated pipeline, configure a webhook on the Actor run to trigger when it finishes. This lets you push fresh data to a database, a Slack channel, a Google Sheet, or any downstream system the moment the scrape completes.
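The webhook's POST body identifies the finished run, including its default dataset, so the receiver can pull the items immediately. A minimal Python sketch of the receiving side's parsing step — the field names assume Apify's default webhook payload template, where resource is the run object, so verify them against an actual payload from your account:

```python
def dataset_export_url(payload: dict, fmt: str = "json") -> str:
    """From a run-finished webhook payload, build the dataset export URL.

    Assumes the default payload template: `eventType` at the top level
    and the run object (with `defaultDatasetId`) under `resource`.
    """
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        raise ValueError(f"unexpected event: {payload.get('eventType')!r}")
    dataset_id = payload["resource"]["defaultDatasetId"]
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
```

Your webhook handler would call this on the incoming JSON body, fetch the URL, and forward the rows to Slack, a database, or a sheet.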


Practical Use Cases

Social Media Monitoring

Track mentions of your brand, product, or competitors across Bluesky. Schedule a daily scrape for your brand keywords, export to CSV, and feed it into your existing analytics dashboard. With Bluesky's user base growing fast (over 3.5 million daily active users as of late 2025), it is increasingly important to include it in your social listening stack.

Sentiment Analysis

Collect posts about a topic over time and run them through a sentiment analysis model. The structured text output from the scraper (clean post text, timestamps, engagement metrics) is ready for NLP pipelines without additional cleaning.

Academic Research

Researchers studying online discourse, information spread, or platform dynamics can use the scraper to build datasets of public conversations. The AT Protocol's openness makes Bluesky one of the few platforms where large-scale social media research does not require a special academic API tier.

Competitive Intelligence

Monitor what people are saying about competing products or services. Track engagement patterns — which types of posts get the most likes and reposts in your industry. Identify influential voices in your niche by analyzing author metrics alongside post performance.

Trend Detection

Run broad keyword searches on a recurring schedule and compare results over time. Spot emerging topics before they hit mainstream platforms by watching what the Bluesky community is discussing.


Why Use an Apify Actor vs. Direct API Calls

You might wonder: if the AT Protocol API is free and open, why use an Apify Actor at all? Here is a practical comparison:

| Concern | Direct API | Apify Actor |
| --- | --- | --- |
| Pagination | Manual cursor loop implementation | Handled automatically |
| Rate limits | Must detect and handle 429 responses | Built-in backoff and retry |
| Data formatting | Raw nested JSON from XRPC responses | Cleaned, flattened output |
| Multiple queries | Sequential loops for each search term | Parallel execution across terms |
| Export formats | JSON only (you build the rest) | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, error handling | Built-in cron with monitoring |
| Error handling | Build it yourself | Automatic retries and failure alerts |
| Storage | You manage where data lands | Managed datasets with 7-day retention |

For a quick one-off query, the direct API is perfect — it is literally a curl command. But for production data collection (multiple search terms, daily runs, thousands of posts, integration with downstream systems), the Actor saves significant engineering time.


Getting Started

Here is the shortest path from zero to data:

  1. Create a free Apify account
  2. Open the Bluesky Scraper
  3. Enter your search terms, set a result limit, and click Start
  4. Download your results in your preferred format

For recurring collection, set up a schedule. For programmatic access, grab your API token from the Apify Console and use the code examples above.

Bluesky's open protocol is a rare thing in social media: a platform that treats public data as actually public. Whether you are building a monitoring dashboard, training a sentiment model, or studying online communities, the data is there. You just need to collect it.


The Bluesky Scraper Actor is available at apify.com/cryptosignals/bluesky-scraper. For questions about the AT Protocol, see the official Bluesky API documentation.


Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.


Compare web scraping APIs:

  • ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
  • Scrape.do — From $29/mo, strong Cloudflare bypass
  • ScrapeOps — Proxy comparison + monitoring dashboard

Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.
