What Is the AT Protocol?
The AT Protocol (Authenticated Transfer Protocol) is an open-source framework for building decentralized social applications. Bluesky is its flagship app, but the protocol itself is platform-agnostic — anyone can build on it.
Here is what matters for data collection:
Data repositories. Every Bluesky user has a signed data repository — think of it like a personal Git repo for their social data. It stores their posts, likes, follows, reposts, and profile information as structured records.
Lexicon schema. The protocol uses a schema system called Lexicon to define the shape of every data type and API call. This means endpoints are predictable and well-typed. A post is always a post, with the same fields, regardless of which server hosts it.
Public API. Bluesky runs a public AppView at https://public.api.bsky.app that aggregates data from across the network. Read-only endpoints on this server require zero authentication. You send an HTTP GET request, you get JSON back.
XRPC. API calls use a convention called XRPC, a lightweight RPC layer over HTTP in which endpoints are namespaced method names. For example, searching posts uses app.bsky.feed.searchPosts, and fetching a profile uses app.bsky.actor.getProfile.
The practical upshot: if you want to search Bluesky posts by keyword, you can do this right now:
curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=artificial+intelligence&limit=25"
No headers, no tokens, no setup. That returns a JSON object with up to 25 posts matching "artificial intelligence," including the full post text, author handle, timestamp, like count, repost count, and reply count.
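The same request is easy to script. Here is a minimal sketch using only Python's standard library, against the endpoint and response fields described above (the try/except just keeps the demo from crashing when offline):

```python
import json
import urllib.parse
import urllib.request

BASE = "https://public.api.bsky.app/xrpc"

def build_search_url(query: str, limit: int = 25) -> str:
    """Build a searchPosts URL with properly encoded query parameters."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{BASE}/app.bsky.feed.searchPosts?{params}"

def search_posts(query: str, limit: int = 25) -> list:
    """Fetch matching posts and return the raw post objects."""
    with urllib.request.urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp)["posts"]

if __name__ == "__main__":
    try:
        for post in search_posts("artificial intelligence", limit=5):
            print(f'@{post["author"]["handle"]}: {post["record"]["text"][:80]}')
    except OSError as exc:
        print(f"Request failed: {exc}")
```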
Querying the AT Protocol Directly
The public API is useful for quick lookups and small-scale collection. Here are the key endpoints:
Searching Posts
Endpoint: GET /xrpc/app.bsky.feed.searchPosts
| Parameter | Type | Description |
|---|---|---|
| `q` | string | Search query (supports Lucene syntax) |
| `limit` | int | Results per page (max 100) |
| `cursor` | string | Pagination cursor from previous response |
| `sort` | string | `top` or `latest` |
| `since` | string | Filter to posts after this datetime |
| `until` | string | Filter to posts before this datetime |
| `author` | string | Filter to posts by a specific DID or handle |
The search supports Lucene-style query syntax, so you can use quoted phrases ("machine learning"), boolean operators, and field-specific queries.
Example — searching for recent posts about web scraping:
curl "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=web+scraping&sort=latest&limit=10"
Fetching Profiles
Endpoint: GET /xrpc/app.bsky.actor.getProfile
curl "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=jay.bsky.team"
This returns the user's display name, bio, avatar URL, follower count, following count, and post count.
Getting an Author's Feed
Endpoint: GET /xrpc/app.bsky.feed.getAuthorFeed
curl "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=jay.bsky.team&limit=30"
The Pagination Problem
Here is where the direct API approach starts to break down. Each response includes a cursor field. To get the next page, you pass that cursor back in your next request. For 100 posts, that is straightforward. For 10,000 posts across multiple search terms with date filtering, you are writing a pagination loop, handling edge cases, managing rate limits, and parsing nested JSON.
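To make that bookkeeping concrete, here is a minimal cursor loop sketched with Python's standard library. It handles pagination only — no rate-limit detection, no retries — which is exactly the plumbing you would still need to add for production use:

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def collect_posts(query: str, target: int = 1000) -> list:
    """Page through search results until `target` posts or the cursor runs out."""
    posts, cursor = [], None
    while len(posts) < target:
        params = {"q": query, "limit": 100}
        if cursor:
            # Feed the previous response's cursor back to get the next page.
            params["cursor"] = cursor
        url = f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as resp:
            page = json.load(resp)
        posts.extend(page.get("posts", []))
        cursor = page.get("cursor")
        if not cursor:  # no cursor means the last page was reached
            break
    return posts[:target]
```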
This is exactly the kind of plumbing that an Apify Actor handles for you.
Using the Bluesky Scraper on Apify
The Bluesky Scraper is a ready-to-use Actor on the Apify platform that wraps the AT Protocol API with production-grade data collection logic: automatic pagination, rate limit handling, structured output, and multiple export formats.
Step 1: Open the Actor
Go to apify.com/cryptosignals/bluesky-scraper. You can run it directly from the web console without writing any code.
Step 2: Configure Your Search
In the Actor's input configuration, set your parameters:
- Search terms — one or more keywords or phrases to search for
- Max results — cap the number of posts returned (useful for controlling costs and runtime)
- Date range — filter posts to a specific time window using `since` and `until` parameters
- Sort order — choose between `top` (most engaged) and `latest` (most recent)
Step 3: Run and Export
Click Start to run the Actor. Depending on the volume of data, runs typically complete in seconds to a few minutes. Once finished, you can export the results in multiple formats:
- JSON — for programmatic consumption
- CSV — for spreadsheets and data analysis tools
- Excel — for business reporting
- XML — for legacy system integration
Each post record includes the full text, author handle, author DID, timestamp, like count, repost count, reply count, any embedded links or images, and the original AT Protocol URI.
Running the Actor Programmatically
The web console is great for one-off runs, but the real power is in the API. You can trigger the Bluesky Scraper from your own code, wait for it to finish, and process the results — all programmatically.
JavaScript (Node.js)
Install the Apify client:
npm install apify-client
Then run the Actor and fetch results:
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
// Run the Bluesky Scraper actor
const run = await client.actor('cryptosignals/bluesky-scraper').call({
searchTerms: ['AI agents', 'LLM applications'],
maxResults: 200,
sort: 'latest',
});
// Fetch results from the default dataset
const { items } = await client
.dataset(run.defaultDatasetId)
.listItems();
console.log(`Collected ${items.length} posts`);
// Process each post
for (const post of items) {
console.log(`@${post.author.handle}: ${post.text}`);
console.log(` Likes: ${post.likeCount} | Reposts: ${post.repostCount}`);
console.log(` Posted: ${post.createdAt}`);
console.log('---');
}
The client.actor().call() method starts the Actor, waits for it to finish (with smart polling and exponential backoff built in), and returns the run metadata. You then use client.dataset().listItems() to pull the actual data.
Python
Install the Apify client:
pip install apify-client
Then run and fetch:
from apify_client import ApifyClient
client = ApifyClient('YOUR_APIFY_API_TOKEN')
# Configure and run the actor
run = client.actor('cryptosignals/bluesky-scraper').call(run_input={
'searchTerms': ['AI agents', 'LLM applications'],
'maxResults': 200,
'sort': 'latest',
})
# Fetch results
dataset_items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(dataset_items)} posts')
for post in dataset_items:
print(f"@{post['author']['handle']}: {post['text']}")
print(f" Likes: {post.get('likeCount', 0)} | Reposts: {post.get('repostCount', 0)}")
print(f" Posted: {post['createdAt']}")
print('---')
Both examples follow the same pattern: initialize the client with your API token, call the Actor with your desired input, wait for completion, then iterate over the dataset. The Apify client libraries handle all the HTTP polling and retry logic.
Scheduling Recurring Scrapes
One-off data collection is useful, but most real-world use cases require ongoing monitoring. Apify's built-in scheduling system lets you run the Bluesky Scraper on a recurring basis — hourly, daily, weekly, or on any custom cron schedule.
Setting Up a Schedule via the Console
- Go to the Schedules section in your Apify Console
- Click Create new schedule
- Select the Bluesky Scraper actor
- Define your cron expression (e.g., `0 9 * * *` for every day at 9:00 AM)
- Set the Actor input — the same search terms, max results, and filters you would use for a manual run
- Save the schedule
Each scheduled run creates a new dataset, so you have a clean historical record of results over time.
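Once a few scheduled runs have accumulated, turning their exported records into a daily time series takes only a few lines. A sketch in plain Python, assuming the `createdAt` ISO-8601 timestamp present in each post record:

```python
from collections import Counter
from datetime import datetime

def posts_per_day(items: list) -> dict:
    """Count posts per calendar day from their createdAt timestamps."""
    counts = Counter()
    for post in items:
        # createdAt looks like "2025-11-02T09:15:00.000Z"
        ts = datetime.fromisoformat(post["createdAt"].replace("Z", "+00:00"))
        counts[ts.date().isoformat()] += 1
    return dict(counts)
```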
Setting Up a Schedule via the API
You can also create schedules programmatically. Here is a JavaScript example:
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
// Create a schedule that runs daily at 8:00 AM UTC
const schedule = await client.schedules().create({
name: 'bluesky-daily-monitor',
cronExpression: '0 8 * * *',
timezone: 'UTC',
actions: [{
type: 'RUN_ACTOR',
actorId: 'cryptosignals/bluesky-scraper',
runInput: {
body: JSON.stringify({
searchTerms: ['your brand name', 'competitor name'],
maxResults: 500,
sort: 'latest',
}),
contentType: 'application/json',
},
}],
});
console.log(`Schedule created: ${schedule.id}`);
Integrating with Webhooks
For a fully automated pipeline, configure a webhook on the Actor run to trigger when it finishes. This lets you push fresh data to a database, a Slack channel, a Google Sheet, or any downstream system the moment the scrape completes.
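The payload Apify POSTs to your URL is JSON; with the default payload template, the finished run's details (including its default dataset ID) arrive nested under a `resource` key. A sketch of the receiving side, reduced to the parsing step — verify the field names against the payload your account actually sends:

```python
import json

def extract_dataset_id(payload: bytes) -> str:
    """Pull the finished run's dataset ID out of an Apify webhook payload."""
    body = json.loads(payload)
    # With the default payload template, run details sit under "resource".
    return body["resource"]["defaultDatasetId"]
```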
Practical Use Cases
Social Media Monitoring
Track mentions of your brand, product, or competitors across Bluesky. Schedule a daily scrape for your brand keywords, export to CSV, and feed it into your existing analytics dashboard. With Bluesky's user base growing fast (over 3.5 million daily active users as of late 2025), it is increasingly important to include it in your social listening stack.
Sentiment Analysis
Collect posts about a topic over time and run them through a sentiment analysis model. The structured text output from the scraper (clean post text, timestamps, engagement metrics) is ready for NLP pipelines without additional cleaning.
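To illustrate how little glue code is needed, here is a toy lexicon-based scorer over the scraper's `text` field. A real pipeline would swap in VADER or a transformer model, but the plumbing around it looks the same:

```python
# Tiny illustrative word lists -- a stand-in for a real sentiment model.
POSITIVE = {"great", "love", "amazing", "good", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "broken"}

def score(text: str) -> int:
    """Crude polarity: +1 per positive word, -1 per negative word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

def label_posts(items: list) -> list:
    """Attach a sentiment label to each scraped post."""
    for post in items:
        s = score(post["text"])
        post["sentiment"] = "positive" if s > 0 else "negative" if s < 0 else "neutral"
    return items
```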
Academic Research
Researchers studying online discourse, information spread, or platform dynamics can use the scraper to build datasets of public conversations. The AT Protocol's openness makes Bluesky one of the few platforms where large-scale social media research does not require a special academic API tier.
Competitive Intelligence
Monitor what people are saying about competing products or services. Track engagement patterns — which types of posts get the most likes and reposts in your industry. Identify influential voices in your niche by analyzing author metrics alongside post performance.
Trend Detection
Run broad keyword searches on a recurring schedule and compare results over time. Spot emerging topics before they hit mainstream platforms by watching what the Bluesky community is discussing.
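The comparison step can be sketched as a diff of term frequencies between two scheduled runs — say, yesterday's batch and today's. Tokenization is kept deliberately naive here; the `text` field is as in the scraper output:

```python
from collections import Counter

def term_counts(items: list) -> Counter:
    """Naive whitespace tokenization over post text."""
    counts = Counter()
    for post in items:
        counts.update(w.strip(".,!?#").lower() for w in post["text"].split())
    return counts

def rising_terms(old: list, new: list, min_growth: int = 5) -> list:
    """Terms appearing at least `min_growth` more times in the new batch."""
    before, after = term_counts(old), term_counts(new)
    return sorted(t for t in after if after[t] - before[t] >= min_growth)
```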
Why Use an Apify Actor vs. Direct API Calls
You might wonder: if the AT Protocol API is free and open, why use an Apify Actor at all? Here is a practical comparison:
| Concern | Direct API | Apify Actor |
|---|---|---|
| Pagination | Manual cursor loop implementation | Handled automatically |
| Rate limits | Must detect and handle 429 responses | Built-in backoff and retry |
| Data formatting | Raw nested JSON from XRPC responses | Cleaned, flattened output |
| Multiple queries | Sequential loops for each search term | Parallel execution across terms |
| Export formats | JSON only (you build the rest) | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, error handling | Built-in cron with monitoring |
| Error handling | Build it yourself | Automatic retries and failure alerts |
| Storage | You manage where data lands | Managed datasets with 7-day retention |
For a quick one-off query, the direct API is perfect — it is literally a curl command. But for production data collection (multiple search terms, daily runs, thousands of posts, integration with downstream systems), the Actor saves significant engineering time.
Getting Started
Here is the shortest path from zero to data:
- Create a free Apify account
- Open the Bluesky Scraper
- Enter your search terms, set a result limit, and click Start
- Download your results in your preferred format
For recurring collection, set up a schedule. For programmatic access, grab your API token from the Apify Console and use the code examples above.
Bluesky's open protocol is a rare thing in social media: a platform that treats public data as actually public. Whether you are building a monitoring dashboard, training a sentiment model, or studying online communities, the data is there. You just need to collect it.
The Bluesky Scraper Actor is available at apify.com/cryptosignals/bluesky-scraper. For questions about the AT Protocol, see the official Bluesky API documentation.
Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.