<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arielle Houlier</title>
    <description>The latest articles on DEV Community by Arielle Houlier (@arielle_houlier_21f42996e).</description>
    <link>https://dev.to/arielle_houlier_21f42996e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3841239%2Fada9f52c-bfb5-4c70-b11a-983a0eebdc98.jpg</url>
      <title>DEV Community: Arielle Houlier</title>
      <link>https://dev.to/arielle_houlier_21f42996e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arielle_houlier_21f42996e"/>
    <language>en</language>
    <item>
      <title>Building real-time Bluesky analytics: ingesting 2.2M posts/day from the firehose</title>
      <dc:creator>Arielle Houlier</dc:creator>
      <pubDate>Tue, 24 Mar 2026 07:13:34 +0000</pubDate>
      <link>https://dev.to/arielle_houlier_21f42996e/building-real-time-bluesky-analytics-ingesting-22m-postsday-from-the-firehose-ccm</link>
      <guid>https://dev.to/arielle_houlier_21f42996e/building-real-time-bluesky-analytics-ingesting-22m-postsday-from-the-firehose-ccm</guid>
      <description>&lt;p&gt;Bluesky publishes every post, like, follow, and block through a public firehose: a WebSocket stream of every event on the network in real time. I built a system that ingests all of it, classifies a sample of posts with AI, and turns the results into analytics anyone can use.&lt;/p&gt;

&lt;p&gt;Here's how it works and what I learned processing ~2.2 million posts per day on a single server.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;The stack is straightforward: Python 3.11, FastAPI, PostgreSQL 16, and Redis 7, all running on a single Hetzner CPX52 (~$50/month). Docker Compose orchestrates 13+ services.&lt;/p&gt;

&lt;p&gt;The firehose consumer connects to Bluesky's relay via WebSocket and receives every event on the network. At peak hours, that's 130K+ posts per hour. The consumer writes raw post data (text, author DID, timestamps) to PostgreSQL, where an enricher service resolves author handles and profile metadata in batches.&lt;/p&gt;

&lt;h2&gt;AI Classification at Scale&lt;/h2&gt;

&lt;p&gt;The interesting part is the content intelligence pipeline. A sample of ingested posts is sent to Claude Haiku for classification. Each sampled post gets tagged with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topics&lt;/strong&gt; (politics, tech, art, humor, news, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content type&lt;/strong&gt; (original, reply, quote)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge is budget. Classifying all 2M+ posts would cost hundreds of dollars per day. So the pipeline samples instead: not every post, but enough to build statistically useful topic distributions. The current budget is $8/day, which classifies roughly 20-25K posts (about 1% of daily volume), enough to surface real trends.&lt;/p&gt;
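
&lt;p&gt;The arithmetic behind that budget can be sketched in a few lines of Python. The figures are the ones quoted above, with 22,500 taken as the midpoint of the 20-25K range. The hash-based sampler and the name &lt;code&gt;should_classify&lt;/code&gt; are illustrative assumptions, not necessarily how the pipeline actually picks posts.&lt;/p&gt;

```python
import hashlib

DAILY_POSTS = 2_200_000      # firehose volume quoted in the article
DAILY_BUDGET_USD = 8.0       # classification budget quoted in the article
CLASSIFIED_PER_DAY = 22_500  # assumed midpoint of the 20-25K range

cost_per_post = DAILY_BUDGET_USD / CLASSIFIED_PER_DAY
full_cost = cost_per_post * DAILY_POSTS        # roughly 780 USD/day
sample_rate = CLASSIFIED_PER_DAY / DAILY_POSTS  # roughly 1 post in 100


def should_classify(post_uri, rate=sample_rate):
    """Hash-based sampling: the same post always gets the same decision."""
    digest = hashlib.sha256(post_uri.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return rate > bucket
```

&lt;p&gt;Hashing the post URI instead of drawing a random number keeps the decision stable across retries, which is one reasonable design choice among several.&lt;/p&gt;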

&lt;p&gt;The budget is tracked in Redis with a TTL-based daily counter. When the day's spend reaches the limit, the pipeline pauses until the TTL expires and the counter resets. Simple, but it took a few late-night debugging sessions to get the reset timing right.&lt;/p&gt;
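
&lt;p&gt;A minimal sketch of that counter logic, using an in-memory stand-in so it runs without a Redis server. The class and method names are hypothetical. In Redis itself the same idea would presumably map to &lt;code&gt;INCRBYFLOAT&lt;/code&gt; on a key plus a single &lt;code&gt;EXPIRE&lt;/code&gt; when the key is first created.&lt;/p&gt;

```python
import time


class DailyBudget:
    """In-memory stand-in for a Redis counter with a TTL (hypothetical names)."""

    def __init__(self, limit_usd, ttl_seconds=86_400, clock=time.time):
        self.limit_usd = limit_usd
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.spent = 0.0
        self.expires_at = None  # no counter yet, like a missing Redis key

    def _expire_if_due(self):
        # Redis would drop the key by itself once the TTL runs out.
        if self.expires_at is not None and self.clock() >= self.expires_at:
            self.spent = 0.0
            self.expires_at = None

    def can_spend(self, cost_usd):
        """Return True while the next call still fits inside the limit."""
        self._expire_if_due()
        return self.limit_usd >= self.spent + cost_usd

    def record(self, cost_usd):
        """Add a cost to the counter, starting the TTL on first use."""
        self._expire_if_due()
        if self.expires_at is None:
            # Set the TTL only when the counter is created. Refreshing it on
            # every write would keep pushing the reset into the future.
            self.expires_at = self.clock() + self.ttl_seconds
        self.spent += cost_usd
```

&lt;p&gt;Passing the clock in as a parameter makes the reset timing testable without waiting a day. Refreshing the TTL on every write is one common way such counters go wrong, though not necessarily the bug hit here.&lt;/p&gt;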

&lt;h2&gt;What the Data Shows&lt;/h2&gt;

&lt;p&gt;Some things I've learned from watching the firehose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;454K unique accounts&lt;/strong&gt; post on any given day (out of ~3-8M total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak activity&lt;/strong&gt; hits around 12 PM UTC consistently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Personal"&lt;/strong&gt; is always the top topic — people mostly post about their lives, not news&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;engagement distribution&lt;/strong&gt; is brutal: the vast majority of posts get zero likes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Turning It Into a Product&lt;/h2&gt;

&lt;p&gt;I turned all of this into &lt;a href="https://bskydatalive.com" rel="noopener noreferrer"&gt;BlueData&lt;/a&gt; — a free tool where you type any Bluesky handle and get an instant profile analysis: follower count, engagement rate, top topics, posting patterns, and growth trends.&lt;/p&gt;

&lt;p&gt;For developers building Bluesky tools, bots, or dashboards, there's a &lt;a href="https://bskydatalive.com/docs" rel="noopener noreferrer"&gt;REST API&lt;/a&gt;. A single GET request returns clean JSON with full profile analytics:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"X-API-Key: your_key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://bskydatalive.com/api/v1/pro/profile/jay.bsky.team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
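
&lt;p&gt;The same call can be sketched in Python with only the standard library. The URL and the &lt;code&gt;X-API-Key&lt;/code&gt; header mirror the curl example. The function names are hypothetical, and nothing is assumed about the fields in the returned JSON.&lt;/p&gt;

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://bskydatalive.com/api/v1/pro/profile/"


def build_request(handle, api_key):
    """Build the GET request; the endpoint and header mirror the curl call."""
    url = API_BASE + urllib.parse.quote(handle)
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


def fetch_profile(handle, api_key, timeout=10):
    """Send the request and decode the JSON body (needs a valid key)."""
    request = build_request(handle, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

&lt;p&gt;Calling &lt;code&gt;fetch_profile("jay.bsky.team", "your_key")&lt;/code&gt; would return the decoded response as a Python object, assuming the key is valid.&lt;/p&gt;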



&lt;p&gt;The free web version is at &lt;a href="https://bskydatalive.com" rel="noopener noreferrer"&gt;bskydatalive.com&lt;/a&gt;. API access starts at $9/month for 100 requests.&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSocket connections drop.&lt;/strong&gt; The firehose consumer disconnects every few hours (keepalive timeouts). Auto-reconnect with exponential backoff is essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI budget management is its own system.&lt;/strong&gt; You need real-time cost tracking, automatic pausing, and easy reset mechanisms. Mine is Redis-based with TTL counters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snapshot data lies.&lt;/strong&gt; I initially captured like counts at analysis time (seconds after posting), so they were all zero. Engagement data needs delayed re-scoring or a separate collection pass.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A single server handles more than you think.&lt;/strong&gt; 2.2M posts/day, 13+ Docker services, AI classification, API serving, and analytics, all on one $50/month box. Don't scale prematurely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distribution is harder than building.&lt;/strong&gt; The product took weeks to build. Getting anyone to see it is the actual hard problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
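
&lt;p&gt;The reconnect advice in the first lesson can be sketched as follows. This is a generic pattern under stated assumptions, not the consumer's actual code: &lt;code&gt;consume&lt;/code&gt; is a hypothetical callable that opens the WebSocket and blocks while reading events, and the base delay, cap, jitter, and 60-second "healthy" threshold are illustrative values.&lt;/p&gt;

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, jitter=0.5, rng=random.random):
    """Yield sleep times that double up to cap, with random jitter added."""
    attempt = 0
    while True:
        delay = min(cap, base * (2 ** attempt))
        yield delay + rng() * jitter * delay
        attempt += 1


def run_with_reconnect(consume, max_failures=None, sleep=time.sleep,
                       clock=time.monotonic, healthy_after=60.0):
    """Call consume() in a loop, backing off between reconnect attempts.

    consume is expected to raise ConnectionError when the stream drops.
    max_failures exists for demos and tests; None means run forever.
    """
    delays = backoff_delays()
    failures = 0
    while max_failures is None or max_failures > failures:
        started = clock()
        try:
            consume()
        except ConnectionError:
            failures += 1
        if clock() - started >= healthy_after:
            # The session lasted a while, so start the backoff over.
            delays = backoff_delays()
        sleep(next(delays))
    return failures
```

&lt;p&gt;Resetting the delay after a long-lived session matters here: a connection that drops every few hours should reconnect after about a second, not after the capped delay left over from an earlier outage.&lt;/p&gt;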

&lt;h2&gt;Open Questions&lt;/h2&gt;

&lt;p&gt;I'm still figuring out the best way to handle engagement scoring (delayed re-check vs. separate collection) and whether topic classification at the current sample rate is representative enough. If you're working with the AT Protocol or building Bluesky tools, I'd love to compare notes.&lt;/p&gt;

&lt;p&gt;The tool is free to try: &lt;a href="https://bskydatalive.com" rel="noopener noreferrer"&gt;bskydatalive.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>atprotocol</category>
      <category>fastapi</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
