Devil Scrapes

Posted on May 31

Bluesky Feed Scraper: export any custom or algorithm feed to clean JSON

#webscraping #python #apify #data

Quick answer: The Bluesky AT Protocol exposes a public, unauthenticated app.bsky.feed.getFeed endpoint that returns posts from any custom or algorithm feed. To get that data as a flat, analytics-ready CSV or JSON — with engagement counts and feed metadata on every row — you call that endpoint with cursor pagination, stitch in metadata from getFeedGenerator, normalise DIDs, and handle backoff on rate limits. The Bluesky Feed Posts Scraper does it for $0.002 per post (~$2.05 per 1,000), no Bluesky account required.

Bluesky ships a Feed Generator protocol that lets any developer publish an algorithm. The result is hundreds of community-curated feeds — topic feeds, language feeds, niche hobby feeds — each one a named curator's idea of what the most relevant posts look like. And unlike Twitter/X's opaque ranked feed, the curation logic and the post inventory are publicly queryable.

The catch is that "publicly queryable" and "easily queryable" are not the same sentence. The AT Protocol surfaces the data across three separate endpoints, each cursor-paginated, each returning nested structures that need denormalising before they're useful in a spreadsheet. This is the gap the Bluesky Feed Posts Scraper, an Apify Actor, fills.

What is Bluesky? 🔎

Bluesky is a decentralised social platform built on the AT Protocol, an open federated standard for social data. Every post, like, and follow is a signed, addressable record stored in a Personal Data Server (PDS). A public AppView aggregates those records and serves them over XRPC, the protocol's JSON-over-HTTP API, at public.api.bsky.app/xrpc/, with no authentication, no API key, and no OAuth dance.

The feed system sits on top of this: a Feed Generator is a small service that responds to getFeed calls with a list of post URIs. Bluesky's own algorithms ("Discover", "What's Hot", "With Friends") are feed generators, and so are the thousands of community-built ones. Each has a stable AT URI in the form at://did:plc:.../app.bsky.feed.generator/<rkey>.

Does Bluesky have an API for feed posts? 🔌

Yes — and it is intentionally public. The AT Protocol AppView exposes app.bsky.feed.getFeed for cursor-paginated post retrieval, app.bsky.feed.getFeedGenerator for feed metadata, and app.bsky.feed.getActorFeeds to enumerate every feed a creator has published. All three are unauthenticated. Bluesky's open-protocol commitments mean this is not an accident: the data portability is by design.

What the API does not give you: a flat, analytics-ready row with feed metadata and engagement counts already joined. It gives you nested JSON, positional cursors, and a DID per author that you carry through the schema yourself. That joining and validation is the whole job.
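For orientation, a single getFeed request can be sketched with nothing but the standard library. This is a minimal sketch, not the Actor's code: the helper names get_feed_url and fetch_page are hypothetical, and the parameter names (feed, limit, cursor) are the ones used by the endpoint as described in this post.

```python
import json
import urllib.parse
import urllib.request

# Public AppView base, as given in this post.
APPVIEW = "https://public.api.bsky.app/xrpc"


def get_feed_url(feed_uri, limit=100, cursor=None):
    """Build the app.bsky.feed.getFeed URL for one page (hypothetical helper)."""
    params = {"feed": feed_uri, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    return f"{APPVIEW}/app.bsky.feed.getFeed?{urllib.parse.urlencode(params)}"


def fetch_page(feed_uri, limit=100, cursor=None):
    """Fetch one page and return the parsed JSON body (performs a network call)."""
    url = get_feed_url(feed_uri, limit, cursor)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

Calling fetch_page with the whats-hot AT URI should return a nested JSON body with a list of posts and, when more pages exist, a cursor. Everything after that point is the joining and validation work described above.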

What the data looks like 📋

Each post comes back as one typed, denormalised row. Feed metadata — display name, creator handle, description — lands on every row so a CSV export is self-contained. Here is a real record from the "Discover" feed:

{
  "feed_uri": "at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot",
  "feed_display_name": "Discover",
  "feed_creator_handle": "bsky.app",
  "feed_description": "Trending content from your personal network",
  "post_uri": "at://did:plc:sj5wj7libgr7omqiotenxadx/app.bsky.feed.post/3mlxmr4jyfs2s",
  "post_cid": "bafyreidgimgd7v3g3pazsp5oq7ur6bvedpnwohul26mss7cbffg6bdqjkm",
  "post_indexed_at": "2026-05-16T10:20:40.467Z",
  "post_text": "If you never read the book or saw the movie, you missed one of the greatest Pulitzer Prize winning sagas ever written.",
  "post_lang": "en",
  "post_reply_count": 89,
  "post_repost_count": 414,
  "post_like_count": 1288,
  "post_quote_count": 27,
  "author_did": "did:plc:sj5wj7libgr7omqiotenxadx",
  "author_handle": "louiseplease.bsky.social",
  "author_display_name": "Louise",
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Seventeen fields, Pydantic-validated before they're written. Every optional field (feed_description, post_lang, author_display_name) is null when the API omits it — rows are never dropped for missing optional data.
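To make the flattening concrete, here is a rough sketch of how one nested feed item plus one feed-generator view could be turned into the seventeen-field row above. It is an illustration under assumptions, not the Actor's implementation: flatten_item is a hypothetical name, the input key names (displayName, creator, record, langs, likeCount and so on) reflect my reading of the public lexicons, taking the first entry of langs as post_lang is my choice, and missing counts default to 0. It also uses plain dicts instead of Pydantic to stay dependency-free.

```python
from datetime import datetime, timezone


def flatten_item(feed_view, item, scraped_at=None):
    """Flatten one getFeed item and one getFeedGenerator view into a flat row."""
    post = item["post"]
    author = post.get("author", {})
    record = post.get("record", {})
    langs = record.get("langs") or []
    return {
        # Feed metadata, repeated on every row so exports are self-contained.
        "feed_uri": feed_view["uri"],
        "feed_display_name": feed_view.get("displayName"),
        "feed_creator_handle": feed_view.get("creator", {}).get("handle"),
        "feed_description": feed_view.get("description"),  # optional -> None
        # Post identity and content.
        "post_uri": post["uri"],
        "post_cid": post["cid"],
        "post_indexed_at": post.get("indexedAt"),
        "post_text": record.get("text", ""),
        "post_lang": langs[0] if langs else None,  # optional -> None
        # Engagement counts (assumed to default to 0 when absent).
        "post_reply_count": post.get("replyCount", 0),
        "post_repost_count": post.get("repostCount", 0),
        "post_like_count": post.get("likeCount", 0),
        "post_quote_count": post.get("quoteCount", 0),
        # Author.
        "author_did": author.get("did"),
        "author_handle": author.get("handle"),
        "author_display_name": author.get("displayName"),  # optional -> None
        "scraped_at": scraped_at
        or datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

The point of the design is that optional fields fall through to None rather than raising, so a post without a language tag or display name still produces a row.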

The naive approach (and why it falls apart) 🛠️

The obvious path: hit https://public.api.bsky.app/xrpc/app.bsky.feed.getFeed?feed=at://...&limit=100, parse the JSON, paginate until the cursor runs out. It works for the first page. Then the edges appear.

1. DID resolution and AT URI construction. Bluesky web URLs look like bsky.app/profile/bsky.app/feed/whats-hot. The AT Protocol wants at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot. Those are not the same string. The DID is what app.bsky.actor.getProfile returns when you look up the handle bsky.app, and a single typo in it returns {"error":"InvalidRequest","message":"could not find feed"}. We resolve web URLs to AT URIs via getProfile on every run so you never have to track which DID is current.

2. Feed metadata is on a different endpoint. getFeed returns post URIs and engagement counts. It does not return the feed's display name, description, or creator handle — those come from getFeedGenerator. For a self-contained dataset you need both calls stitched together. We make the getFeedGenerator call once per feed and denormalise the result onto every row.

3. Cursor pagination and the client-side cap. The public AppView paginates at up to 100 posts per page. A feed with 500 posts needs 5 round trips with cursor threading. A feed whose length is unknown in advance needs a sensible client-side cap so the run does not accumulate unbounded cost. We thread cursors, respect the per-feed cap you set (maxPostsPerFeed, default 100, max 5,000), and stop cleanly when the cursor is exhausted or the cap is hit.

4. Rate limits at scale. We retry on 408, 429, and 503 with exponential backoff — base 2 seconds, doubling each attempt, capped at 30 seconds, up to 5 attempts per request — and honour Retry-After headers when the API sets them. We rotate browser fingerprints via curl-cffi so the TLS handshake looks like a real browser client, not a Python script. On a partial-success run we surface the count via Actor.set_status_message rather than returning a green status with a silently truncated dataset.
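The retry policy in step 4 can be sketched as follows. The numbers (base 2 seconds, doubling, 30-second cap, 5 attempts, retry on 408/429/503) come from the description above; the function names, the send callable, and the choice to honour only numeric Retry-After values (not HTTP-date values) are assumptions made for this illustration.

```python
import time

RETRYABLE = {408, 429, 503}


def backoff_delay(attempt, retry_after=None, base=2.0, cap=30.0):
    """Seconds to wait after a failed attempt (attempt is 0-indexed).

    A Retry-After value, when present, takes priority over the schedule.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send() is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt + 1 < max_attempts:
            header = headers.get("Retry-After", "")
            # Only plain-seconds values are handled in this sketch.
            retry_after = float(header) if header.isdigit() else None
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

Passing sleep as a parameter keeps the schedule easy to test without actually waiting.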

None of this is conceptually hard. It is just engineering tax that adds up to a weekend from scratch.
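As one example of that tax, the cursor threading and client-side cap from step 3 might look like the sketch below. The collect_posts name and the fetch_page(cursor, limit) callable are hypothetical; the response keys feed and cursor are assumed to match what getFeed returns.

```python
def collect_posts(fetch_page, max_posts=100, page_size=100):
    """Thread cursors until the feed is exhausted or max_posts is reached.

    fetch_page(cursor, limit) is a hypothetical callable returning a dict
    with a "feed" list and, when more pages exist, a "cursor" string.
    """
    items = []
    cursor = None
    while len(items) < max_posts:
        # Never ask for more than the remaining budget or the page maximum.
        limit = min(page_size, max_posts - len(items))
        page = fetch_page(cursor, limit)
        batch = page.get("feed", [])
        items.extend(batch[: max_posts - len(items)])
        cursor = page.get("cursor")
        if not cursor or not batch:
            break  # cursor exhausted or empty page: stop cleanly
    return items
```

Shrinking limit on the last page means the cap is enforced in the request itself, not only by trimming the response afterwards.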

The Actor 🚀

I packaged the result as an Apify Actor: Bluesky Feed Posts Scraper.

Paste a feed URI or creator handle in the Apify Console and click Start, or run it programmatically via the Python SDK:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Single-feed mode — pull up to 200 posts from Bluesky's Discover feed
run = client.actor("DevilScrapes/bluesky-feed-posts").call(
    run_input={
        "feedUri": "at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot",
        "maxPostsPerFeed": 200,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["author_handle"], item["post_like_count"], item["post_text"][:60])

Or in creator-discovery mode — enumerate every feed a creator publishes and pull posts from each:

run = client.actor("DevilScrapes/bluesky-feed-posts").call(
    run_input={
        "creatorHandle": "bsky.app",
        "maxPostsPerFeed": 100,
        "maxFeeds": 10,
    }
)

Two input modes, mutually exclusive. Setting both feedUri and creatorHandle causes the Actor to fail fast with a clear error before any network call — no silent half-runs. You can also paste a bsky.app web URL directly into feedUri — the Actor converts it to AT URI form automatically. The raw AT URI skips the resolution step and runs slightly faster.
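The fail-fast check is simple enough to sketch. This is an illustration of the rule described above, not the Actor's source: resolve_mode is a hypothetical name, and rejecting an input with neither field set is my assumption, since the post only describes the case where both are set.

```python
def resolve_mode(run_input):
    """Return ("feed", uri) or ("creator", handle); raise before any network call."""
    feed_uri = (run_input.get("feedUri") or "").strip()
    creator = (run_input.get("creatorHandle") or "").strip()
    if feed_uri and creator:
        raise ValueError("Set either feedUri or creatorHandle, not both.")
    if not feed_uri and not creator:
        # Assumed behaviour: an empty input is rejected as well.
        raise ValueError("Set one of feedUri or creatorHandle.")
    return ("feed", feed_uri) if feed_uri else ("creator", creator)
```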

Use cases 💡

Algorithm research. Sample what the "Discover" or "What's Hot" feeds surface across multiple days or weeks and track topic drift, amplification patterns, or the ratio of original posts to reposts. The AT Protocol's open data makes this the most researcher-accessible large social feed available right now.

Newsroom social listening. Subscribe to curated topic feeds and pipe new posts into Slack or a Google Sheet via Apify Webhooks. Because feed metadata is denormalised onto every row, the Slack message template needs no join.

NLP corpus building. Collect labelled training data from topic-curated feeds for sentiment models, topic classifiers, or RAG systems. A feed labelled "AI news" or "climate science" gives you weakly supervised labels without manual tagging of raw timelines.

Creator and feed analytics. Pull every post a niche feed generator surfaces and rank by like / repost / quote ratios. Benchmark your own Bluesky posts against what the feed amplifies, and see which content formats dominate the engagement distribution.

Competitive monitoring. Track community-curated feeds that aggregate competitor announcements, support complaints, or product mentions. Creator-discovery mode pulls a creator's full feed catalogue in a single run.

Pricing — exact numbers 💰

Pay-per-event. You pay for posts that land in your dataset. No data, no charge (beyond the $0.05 run warm-up).

Event	Price
Actor start	$0.05 per run
Post row written	$0.002 per row

Posts scraped	Cost
100	$0.25
500	$1.05
1,000	$2.05
5,000	$10.05

The maximum single-run input (50 feeds × 100 posts = 5,000 rows) comes out to $10.05. Apify's $5 free trial credit covers roughly 2,475 posts in a single run, with no credit card needed.
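The arithmetic behind those figures is one start fee plus a per-row fee. A small sketch, using the two prices from the table above (the function names are made up for illustration, and Decimal is used only to avoid floating-point rounding noise):

```python
from decimal import Decimal

START_FEE = Decimal("0.05")   # Actor start, per run
PER_ROW = Decimal("0.002")    # per post row written


def run_cost(rows):
    """Cost in USD of a single run that writes `rows` post rows."""
    return START_FEE + PER_ROW * rows


def rows_for_budget(budget):
    """How many rows a single run can write within `budget` USD."""
    remaining = Decimal(str(budget)) - START_FEE
    return max(0, int(remaining // PER_ROW))
```

Splitting the same budget across several runs buys fewer rows, because each run pays the start fee again.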

The technically interesting part

The AT Protocol attaches a Content Identifier (CID), an IPLD content-addressed hash, to every post record. The AT URI is the record's stable address; the post_cid field in each row is the cryptographic fingerprint of the exact post record at time of indexing. Two runs returning the same post_cid for a post_uri are guaranteed to have seen the same record content; two different post_cid values mean the record behind that URI changed in between. This makes longitudinal feed studies possible: you can track not just which posts appeared in the feed, but whether their content changed over time.
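A longitudinal comparison along those lines could be sketched like this, assuming two exported datasets loaded as lists of row dicts with the post_uri and post_cid fields shown earlier. The diff_runs helper is hypothetical.

```python
def diff_runs(earlier, later):
    """Compare two runs of the same feed by post_uri and post_cid."""
    before = {row["post_uri"]: row["post_cid"] for row in earlier}
    after = {row["post_uri"]: row["post_cid"] for row in later}
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),      # new in the feed
        "dropped": sorted(before.keys() - after.keys()),    # no longer surfaced
        "changed": sorted(u for u in shared if before[u] != after[u]),
    }
```

Posts in "changed" kept their URI but have a different content hash, which is the signal to re-read their text.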

Limitations 🚧

Private or access-restricted feeds are not exposed by the public AppView API. Only feeds visible at public.api.bsky.app can be scraped.
Global feed discovery by keyword is not supported. Bluesky's getPopularFeedGenerators endpoint returns MethodNotImplemented on the public AppView. Use creator-discovery mode (creatorHandle) to enumerate one creator's feeds.
Post images, embeds, and quoted-post bodies are not extracted. Only the plain-text post_text is captured. Image ALT text and quoted-post content are outside the current schema.
Reply thread expansion is out of scope. Only the top-level post row is emitted. Thread context (parent/root posts) would require additional getPostThread calls.
The maxPostsPerFeed cap is client-side. If a feed has fewer posts than the cap, fewer rows are returned — expected behaviour, not a failure.
Storage retention on Apify's FREE plan is 7 days. Export your dataset immediately after the run, or use a named dataset for longer retention.

FAQ ❓

Is scraping public Bluesky feeds legal?
The AT Protocol is an open, federated standard. public.api.bsky.app is explicitly unauthenticated and publicly accessible without login. The Bluesky Terms of Service permit accessing public data programmatically, as long as you do not impersonate users or violate AT Protocol data-portability principles. Always verify the current Terms of Service and your local jurisdiction's data-protection rules before using scraped data commercially.

Is this a replacement for the Twitter/X API?
No. Bluesky's AT Protocol is a different architecture: the open feed-generator system, content-addressed records, and unauthenticated public AppView are native to its design, not workarounds. If you need Twitter/X data, use a Twitter scraper. If you want Bluesky's unique feed-curation graph, this Actor is built for that.

Can I export to Google Sheets or a data warehouse?
Yes. Export CSV/Excel/JSON from the Apify Console after the run, webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull via the Apify REST API: GET /datasets/{id}/items?format=csv&clean=true. Because feed metadata is already denormalised onto every row, no pivot or VLOOKUP is needed.
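If you are scripting the REST export, the URL can be assembled like this. The path and query string are the ones quoted above; the base URL https://api.apify.com/v2 and the helper name are my assumptions, and authentication is left out of the sketch.

```python
import urllib.parse

API_BASE = "https://api.apify.com/v2"  # assumed base URL for the Apify REST API


def dataset_items_url(dataset_id, fmt="csv", clean=True):
    """Build the dataset-items export URL for a finished run."""
    query = urllib.parse.urlencode(
        {"format": fmt, "clean": "true" if clean else "false"}
    )
    safe_id = urllib.parse.quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{safe_id}/items?{query}"
```

The dataset ID is the defaultDatasetId value returned by the run in the Python SDK example earlier.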

What is a feed URI and how do I find one?
An AT URI looks like at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot. Every Bluesky feed also has a bsky.app web URL in the form https://bsky.app/profile/<creator>/feed/<rkey>. You can paste either format into the feedUri field — the Actor converts web URLs automatically. To get all feeds from a creator, use creatorHandle and set maxFeeds.
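The relationship between the two formats can be sketched as below. This only covers the string handling: looking up the creator's DID from the handle is a separate network call (getProfile, as described earlier) and is not shown. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def parse_feed_web_url(url):
    """Split a bsky.app feed URL into (creator, rkey)."""
    if "://" not in url:
        url = "https://" + url  # accept scheme-less bsky.app/profile/... input
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if (
        parsed.netloc != "bsky.app"
        or len(parts) != 4
        or parts[0] != "profile"
        or parts[2] != "feed"
    ):
        raise ValueError(f"not a bsky.app feed URL: {url}")
    return parts[1], parts[3]


def to_at_uri(did, rkey):
    """Build the feed generator AT URI from an already-resolved DID."""
    return f"at://{did}/app.bsky.feed.generator/{rkey}"
```

For example, parse_feed_web_url("https://bsky.app/profile/bsky.app/feed/whats-hot") yields the creator bsky.app and the rkey whats-hot; combined with the DID from the example above, to_at_uri reproduces the whats-hot AT URI.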

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/bluesky-feed-posts.

Free $5 trial credit, no credit card. Run it against the whats-hot feed URI above with maxPostsPerFeed: 100 and you will have a clean dataset of today's Discover feed posts in under 30 seconds. The AT Protocol documentation and Apify Python SDK docs are the two reference links you will reach for most.

Have a feed analysis use case I haven't covered, or a field you wish was in the output? Drop it in the comments — I ship based on what people actually need.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community