Bluesky publishes every post, like, follow, and block through a public firehose — a WebSocket stream of every event on the network in real time. I built a system that ingests all of it, classifies posts with AI, and turns the result into analytics anyone can use.
Here's how it works and what I learned processing ~2.2 million posts per day on a single server.
The Architecture
The stack is straightforward: Python 3.11, FastAPI, PostgreSQL 16, and Redis 7, all running on a single Hetzner CPX52 (~$50/month). Docker Compose orchestrates 13+ services.
The firehose consumer connects to Bluesky's relay via WebSocket and receives every event on the network. At peak hours, that's 130K+ posts per hour. The consumer writes raw post data (text, author DID, timestamps) to PostgreSQL, where an enricher service resolves author handles and profile metadata in batches.
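At that event rate, writing rows one at a time would swamp PostgreSQL, so the write path batches. Here's a minimal sketch of that idea — the class name, batch size, and flush interval are illustrative, not the actual implementation:

```python
import time

class BatchWriter:
    """Buffers raw post events and flushes them to storage in batches.

    In the real pipeline flush_fn would be an executemany() INSERT into
    PostgreSQL; the thresholds below are illustrative defaults.
    """

    def __init__(self, flush_fn, max_batch=500, max_age_s=2.0):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        # Flush when the batch fills up or has been sitting too long.
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

The time-based condition matters as much as the size-based one: during quiet hours a size-only buffer would hold posts in memory indefinitely.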
AI Classification at Scale
The interesting part is the content intelligence pipeline. Ingested posts are sampled and sent to Claude Haiku for classification; each classified post gets tagged with:
- Topics (politics, tech, art, humor, news, etc.)
- Language
- Content type (original, reply, quote)
The challenge is budget. Classifying all 2M+ posts/day would cost hundreds of dollars a day. So the pipeline samples strategically — not every post, but enough to build statistically useful topic distributions. The current budget is $8/day, which classifies roughly 20-25K posts, enough to surface real trends.
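The post doesn't specify the sampling scheme, but one simple way to do strategic sampling is deterministic hashing: hash each post's URI and classify it only if the hash falls below a threshold. Unlike `random.random()`, the same post always gets the same decision, across restarts and across workers. The function name and rate below are assumptions for illustration:

```python
import hashlib

def should_classify(post_uri: str, sample_rate: float) -> bool:
    """Deterministically decide whether a post enters the AI pipeline.

    Maps a SHA-256 of the URI to a float in [0, 1) and compares it
    to the target sample rate.
    """
    digest = hashlib.sha256(post_uri.encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < sample_rate

# At ~2.2M posts/day, a rate around 0.01 yields roughly the
# 20-25K classified posts/day that an $8 budget covers.
```

Because the decision is a pure function of the URI, any number of pipeline workers agree on which posts to classify without coordinating through Redis.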
The budget is tracked in Redis with a TTL-based daily counter. When the limit is hit, the pipeline pauses until the TTL expires. Simple, but it took a few late-night debugging sessions to get the reset timing right.
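A sketch of that counter, written against a generic Redis-like client (a tiny in-memory stub is included so the logic is self-contained; the real system talks to Redis, and the names here are illustrative). The reset-timing subtlety: the TTL must be set only when the key is first created, otherwise every increment refreshes it and the daily window slides forever:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the two Redis commands used below."""

    def __init__(self):
        self.data, self.expiry = {}, {}

    def incrbyfloat(self, key, amount):
        now = time.monotonic()
        if key in self.expiry and now >= self.expiry[key]:
            self.data.pop(key, None)
            self.expiry.pop(key, None)
        self.data[key] = self.data.get(key, 0.0) + amount
        return self.data[key]

    def expire(self, key, seconds, nx=False):
        # nx=True mirrors Redis 7's EXPIRE ... NX: only set if no TTL yet.
        if nx and key in self.expiry:
            return False
        self.expiry[key] = time.monotonic() + seconds
        return True

def record_spend(client, cost_usd, key="ai:budget:daily",
                 limit_usd=8.0, window_s=86400):
    """Add a classification cost to the daily counter.

    Returns True while spending is still within budget. The TTL is set
    with NX so it anchors to the window start instead of being refreshed
    on every increment.
    """
    total = client.incrbyfloat(key, cost_usd)
    client.expire(key, window_s, nx=True)
    return total <= limit_usd
```

When `record_spend` returns False, the pipeline simply stops submitting posts; once the key expires, the next increment recreates it and classification resumes.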
What the Data Shows
Some things I've learned from watching the firehose:
- 454K unique accounts post on any given day (out of ~3-8M total)
- Peak activity hits around 12 PM UTC consistently
- "Personal" is always the top topic — people mostly post about their lives, not news
- The engagement distribution is brutal: the vast majority of posts get zero likes
Turning It Into a Product
I turned all of this into BlueData — a free tool where you type any Bluesky handle and get an instant profile analysis: follower count, engagement rate, top topics, posting patterns, and growth trends.
For developers building Bluesky tools, bots, or dashboards, there's a REST API. A single GET request returns clean JSON with full profile analytics:
curl -H "X-API-Key: your_key" \
https://bskydatalive.com/api/v1/pro/profile/jay.bsky.team
The free web version is at bskydatalive.com. API access starts at $9/month for 100 requests.
Lessons Learned
WebSocket connections drop. The firehose consumer disconnects every few hours (keepalive timeouts). Auto-reconnect with exponential backoff is essential.
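The backoff policy can be sketched as a small generator (the base, cap, and jitter values are illustrative choices, not the exact ones in the consumer):

```python
import random

def backoff_delays(base=1.0, cap=60.0, factor=2.0):
    """Yield reconnect delays: ~1s, ~2s, ~4s, ... capped at ~60s.

    A little jitter on each delay avoids a thundering herd of clients
    reconnecting in lockstep after a relay-side hiccup.
    """
    delay = base
    while True:
        yield delay + random.uniform(0, delay * 0.1)
        delay = min(delay * factor, cap)

# Usage sketch (the real consumer wraps its WebSocket read loop):
# for delay in backoff_delays():
#     try:
#         run_firehose_consumer()   # blocks until the socket drops
#     except ConnectionError:
#         time.sleep(delay)
```

One detail worth getting right: create a fresh generator after a connection has stayed healthy for a while, so a drop after hours of uptime starts back at the short delay rather than the cap.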
AI budget management is its own system. You need real-time cost tracking, automatic pausing, and easy reset mechanisms. Ours is Redis-based with TTL counters.
Snapshot data lies. We initially captured like counts at analysis time (seconds after posting) — they were all zero. Engagement data needs delayed re-scoring or a separate collection pass.
A single server handles more than you think. 2.2M posts/day, 13 Docker services, AI classification, API serving, and analytics — all on one $50/month box. Don't scale prematurely.
Distribution is harder than building. The product took weeks to build. Getting anyone to see it is the actual hard problem.
Open Questions
I'm still figuring out the best way to handle engagement scoring (delayed re-check vs. separate collection) and whether topic classification at the current sample rate is representative enough. If you're working with the AT Protocol or building Bluesky tools, I'd love to compare notes.
The tool is free to try: bskydatalive.com