<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guilherme Cavalcante</title>
    <description>The latest articles on DEV Community by Guilherme Cavalcante (@gscdataanalytic).</description>
    <link>https://dev.to/gscdataanalytic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940424%2F95cbe02d-c71e-4b20-96f8-70e8a4ac0f69.jpeg</url>
      <title>DEV Community: Guilherme Cavalcante</title>
      <link>https://dev.to/gscdataanalytic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gscdataanalytic"/>
    <language>en</language>
    <item>
      <title>A data platform tracking news and social media across 167 cities in Rio Grande do Norte for under R$5/month.</title>
      <dc:creator>Guilherme Cavalcante</dc:creator>
      <pubDate>Tue, 19 May 2026 13:58:25 +0000</pubDate>
      <link>https://dev.to/gscdataanalytic/a-data-platform-tracking-news-and-social-media-across-167-cities-in-rio-grande-do-norte-for-under-5hh8</link>
      <guid>https://dev.to/gscdataanalytic/a-data-platform-tracking-news-and-social-media-across-167-cities-in-rio-grande-do-norte-for-under-5hh8</guid>
      <description>&lt;p&gt;I put a data pipeline into production for about one-thirtieth of what a traditional setup would cost.&lt;/p&gt;

&lt;p&gt;Over the last few months, I built a platform that monitors all 167 municipalities in Rio Grande do Norte using data from news outlets, Facebook, Instagram, X, and TikTok. The data goes through Portuguese text processing, is served by a FastAPI API, and reaches a React dashboard in near real time.&lt;/p&gt;

&lt;p&gt;What reduced the cost the most? These five decisions:&lt;/p&gt;

&lt;p&gt;→ Cloud Run Jobs instead of keeping Airflow running 24/7. The container starts, runs the pipeline, and shuts down. Since processing takes only a few minutes and happens a few times a day, the bill practically disappears.&lt;br&gt;
→ Local DuckDB during development, BigQuery only in production. I can test everything without burning cloud quota.&lt;br&gt;
→ NLP without relying on LLMs for everything. I used spaCy with deterministic rules for Portuguese: fast, cheap, and auditable. LLMs only come in when someone asks for an explanation of a specific piece of content.&lt;br&gt;
→ A Parquet data lake in front of BigQuery. Reprocessing became trivial and the raw data stays preserved.&lt;br&gt;
→ GitHub Actions authenticating to GCP through Workload Identity Federation with OIDC. Zero private keys stored in secrets.&lt;/p&gt;
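
&lt;p&gt;To make the "deterministic rules" point concrete, here is a minimal sketch of the idea. It is not the project's code: it uses only Python's standard library instead of spaCy, and the four-city list and the names normalize and find_municipalities are hypothetical, chosen for illustration. It shows why rules like this are cheap and auditable: every match traces back to one explicit pattern.&lt;/p&gt;

```python
import re
import unicodedata

# Hypothetical sample; the platform described above covers all 167 municipalities.
MUNICIPALITIES = ["Natal", "Mossoró", "Parnamirim", "Caicó"]


def normalize(text: str):
    """Lowercase and strip accents so 'Mossoró' and 'mossoro' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_accents.lower()


# One whole-word pattern per municipality, compiled once at import time.
PATTERNS = {
    name: re.compile(r"\b" + re.escape(normalize(name)) + r"\b")
    for name in MUNICIPALITIES
}


def find_municipalities(text: str):
    """Return the municipalities mentioned in text, in MUNICIPALITIES order."""
    normalized = normalize(text)
    return [name for name, pattern in PATTERNS.items() if pattern.search(normalized)]


headline = "Prefeitura de Mossoro anuncia obras; Natal registra chuva forte"
print(find_municipalities(headline))  # prints ['Natal', 'Mossoró']
```

&lt;p&gt;A rule this simple has obvious limits. For example, it cannot tell the city of Natal from "Natal" meaning Christmas, which is the kind of case where extra rules or a language-aware library such as spaCy would be needed.&lt;/p&gt;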

&lt;p&gt;A lot of expensive architecture exists because copying tutorial stacks became a habit. Every project decision (and every alternative I discarded) is documented in the README.&lt;br&gt;
Stack: Python, dbt, BigQuery, Terraform, GCP, FastAPI, React, TypeScript, spaCy, and a uv workspace.&lt;/p&gt;

&lt;p&gt;Part of the development was done through pair programming with Claude Code. I make that explicit because the tool accelerates writing, but does not replace technical decision-making.&lt;/p&gt;

&lt;p&gt;More details, architecture, and dashboard: guilhermecavalcante.works&lt;br&gt;
Open source: &lt;a href="https://lnkd.in/da3ZZiZ3" rel="noopener noreferrer"&gt;https://lnkd.in/da3ZZiZ3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m open to opportunities in Data Engineering, Analytics Engineering, and AI solutions.&lt;/p&gt;

&lt;p&gt;Question for people working with infrastructure and data: do you also feel that many stacks today already start oversized?&lt;/p&gt;

&lt;p&gt;#DataEngineering #AnalyticsEngineering #MLOps #Python #BigQuery #GCP&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>googlecloud</category>
      <category>python</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
