<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ivan-digital</title>
    <description>The latest articles on DEV Community by ivan-digital (@aufklarer).</description>
    <link>https://dev.to/aufklarer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870529%2Ff331cc2f-71fd-455b-9b04-9c986e588dd9.jpeg</url>
      <title>DEV Community: ivan-digital</title>
      <link>https://dev.to/aufklarer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aufklarer"/>
    <language>en</language>
    <item>
      <title>Building an NLP Pipeline to Classify 225,000 Central Bank Sentences</title>
      <dc:creator>ivan-digital</dc:creator>
      <pubDate>Thu, 09 Apr 2026 21:21:13 +0000</pubDate>
      <link>https://dev.to/aufklarer/building-an-nlp-pipeline-to-classify-225000-central-bank-sentences-gaf</link>
      <guid>https://dev.to/aufklarer/building-an-nlp-pipeline-to-classify-225000-central-bank-sentences-gaf</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Central banks communicate through dense, jargon-heavy documents: policy statements, meeting minutes, press conferences. A single Fed statement runs 1,500+ words. The ECB publishes minutes of 10,000+ words. Multiply that by 26 central banks, each publishing monthly or quarterly, and you have far more text than anyone can track manually.&lt;/p&gt;

&lt;p&gt;I wanted to answer a simple question: &lt;strong&gt;which central banks are turning hawkish and which are turning dovish — right now?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;Instead of summarizing entire documents, I break them into individual sentences and classify each one. Every sentence gets two labels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentiment&lt;/strong&gt; (what policy direction does it signal?):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rate_hike&lt;/code&gt;, &lt;code&gt;rate_cut&lt;/code&gt;, &lt;code&gt;rate_hold&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;guidance_hawkish&lt;/code&gt;, &lt;code&gt;guidance_dovish&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dissent_hawkish&lt;/code&gt;, &lt;code&gt;dissent_dovish&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;liquidity_easing&lt;/code&gt;, &lt;code&gt;liquidity_tightening&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;neutral&lt;/code&gt;, &lt;code&gt;irrelevant&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt; (what economic area?):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mp_inflation&lt;/code&gt;, &lt;code&gt;mp_interest_rate&lt;/code&gt;, &lt;code&gt;mp_economic_activity&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mp_labor_market&lt;/code&gt;, &lt;code&gt;mp_exchange_rate&lt;/code&gt;, &lt;code&gt;mp_credit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;financial_stability&lt;/code&gt;, &lt;code&gt;fiscal_policy&lt;/code&gt;, &lt;code&gt;governance&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives a granular view — not just "the Fed is hawkish" but "the Fed's inflation language is hawkish while its labor market language is turning dovish."&lt;/p&gt;
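
&lt;p&gt;One way to pin the two-label scheme down in code is a pair of enums plus a small record type. This is a minimal sketch, not the pipeline's actual data model: the names &lt;code&gt;Sentiment&lt;/code&gt;, &lt;code&gt;Topic&lt;/code&gt;, &lt;code&gt;LabeledSentence&lt;/code&gt; and &lt;code&gt;parse_labels&lt;/code&gt;, and the &lt;code&gt;bank&lt;/code&gt; field, are assumptions. Only the label strings come from the lists above.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Sentiment(str, Enum):
    """Policy-direction labels, copied from the list in the post."""
    RATE_HIKE = "rate_hike"
    RATE_CUT = "rate_cut"
    RATE_HOLD = "rate_hold"
    GUIDANCE_HAWKISH = "guidance_hawkish"
    GUIDANCE_DOVISH = "guidance_dovish"
    DISSENT_HAWKISH = "dissent_hawkish"
    DISSENT_DOVISH = "dissent_dovish"
    LIQUIDITY_EASING = "liquidity_easing"
    LIQUIDITY_TIGHTENING = "liquidity_tightening"
    NEUTRAL = "neutral"
    IRRELEVANT = "irrelevant"


class Topic(str, Enum):
    """Economic-area labels, copied from the list in the post."""
    MP_INFLATION = "mp_inflation"
    MP_INTEREST_RATE = "mp_interest_rate"
    MP_ECONOMIC_ACTIVITY = "mp_economic_activity"
    MP_LABOR_MARKET = "mp_labor_market"
    MP_EXCHANGE_RATE = "mp_exchange_rate"
    MP_CREDIT = "mp_credit"
    FINANCIAL_STABILITY = "financial_stability"
    FISCAL_POLICY = "fiscal_policy"
    GOVERNANCE = "governance"


@dataclass(frozen=True)
class LabeledSentence:
    bank: str            # hypothetical field: short bank code such as "fed"
    text: str
    sentiment: Sentiment
    topic: Topic


def parse_labels(bank, text, sentiment, topic):
    """Validate raw label strings; raises ValueError for an unknown label."""
    return LabeledSentence(bank, text, Sentiment(sentiment), Topic(topic))
```

&lt;p&gt;Validating at this boundary means a label the model invents (say, &lt;code&gt;very_hawkish&lt;/code&gt;) fails loudly instead of ending up in storage.&lt;/p&gt;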

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline has four stages:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Crawling
&lt;/h3&gt;

&lt;p&gt;Each central bank has a custom async crawler (Python + aiohttp). Some banks publish clean HTML, others publish only PDFs, and a few require Playwright for JavaScript-rendered pages. The crawlers run daily via Airflow.&lt;/p&gt;

&lt;p&gt;Sources per bank:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policy statements and decisions&lt;/li&gt;
&lt;li&gt;Meeting minutes&lt;/li&gt;
&lt;li&gt;Press conference transcripts&lt;/li&gt;
&lt;li&gt;Speeches (for some banks)&lt;/li&gt;
&lt;/ul&gt;
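
&lt;p&gt;The concurrent fan-out across banks can be sketched with the standard library alone. The real crawlers use aiohttp and Playwright; here the network call is an injected coroutine so the example runs offline. Everything named below (&lt;code&gt;crawl_all&lt;/code&gt;, &lt;code&gt;fake_fetch&lt;/code&gt;, the example.org URLs, the concurrency limit) is hypothetical.&lt;/p&gt;

```python
import asyncio


async def crawl_all(sources, fetch, max_concurrent=5):
    """Fetch every (bank, url) pair concurrently, at most max_concurrent at once.

    Returns {bank: [body, ...]}. A failed fetch is skipped, not fatal.
    """
    limiter = asyncio.Semaphore(max_concurrent)

    async def fetch_one(bank, url):
        async with limiter:
            try:
                return bank, await fetch(url)
            except Exception:
                # One broken page should not stop the other banks.
                return bank, None

    # gather() returns results in the same order as the inputs.
    pairs = await asyncio.gather(*(fetch_one(b, u) for b, u in sources))
    results = {}
    for bank, body in pairs:
        if body is not None:
            results.setdefault(bank, []).append(body)
    return results


async def fake_fetch(url):
    """Stand-in for an aiohttp or Playwright call (hypothetical)."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError("simulated fetch failure")
    return f"body of {url}"


# Placeholder URLs, not the banks' real endpoints.
SOURCES = [
    ("fed", "https://example.org/fed/statement"),
    ("fed", "https://example.org/fed/minutes"),
    ("ecb", "https://example.org/ecb/broken"),
]

pages = asyncio.run(crawl_all(SOURCES, fake_fetch))
```

&lt;p&gt;Passing the fetcher in as a parameter is a design choice for the sketch: it lets an HTML bank, a PDF bank, and a Playwright bank share one scheduling loop while keeping their own download logic.&lt;/p&gt;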

&lt;h3&gt;
  
  
  2. Sentence Splitting
&lt;/h3&gt;

&lt;p&gt;Documents are split into sentences using rule-based splitting tuned for central bank language. This matters because naive splitting breaks on abbreviations like "Fed." or "Q4." and on the numbered lists common in policy documents.&lt;/p&gt;
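
&lt;p&gt;A minimal sketch of that kind of rule-based splitter, assuming a small hand-written abbreviation list. The names &lt;code&gt;ABBREVIATIONS&lt;/code&gt;, &lt;code&gt;BOUNDARY&lt;/code&gt;, &lt;code&gt;LIST_MARKER&lt;/code&gt; and &lt;code&gt;split_sentences&lt;/code&gt; are illustrative, and the actual rules in the pipeline are not shown in this post.&lt;/p&gt;

```python
import re

# Assumed abbreviation list; a real one would be tuned per bank.
ABBREVIATIONS = {"fed.", "q1.", "q2.", "q3.", "q4.", "no.", "vs.", "e.g.", "i.e."}
BOUNDARY = re.compile(r"[.!?]\s+")    # candidate sentence end
LIST_MARKER = re.compile(r"\d+\.")    # "1.", "2.", ... standing alone


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, skipping known false boundaries."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Text from the last accepted boundary up to and including the punctuation.
        candidate = text[start:match.start() + 1]
        words = candidate.split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # "Fed." or "Q4." does not end the sentence
        if LIST_MARKER.fullmatch(candidate.strip()):
            continue  # keep "1." attached to the list item that follows
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

&lt;p&gt;The trade-off in this sketch is that a sentence which genuinely ends in "Q4." gets merged with the next one. Numbers such as "8.5%" are safe because the period is not followed by whitespace.&lt;/p&gt;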

&lt;h3&gt;
  
  
  3. Classification
&lt;/h3&gt;

&lt;p&gt;Each sentence is classified by an LLM with bank-specific prompt rules. The key insight: &lt;strong&gt;central bank language is domain-specific enough that generic sentiment analysis fails badly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples that trip up generic classifiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sentence&lt;/th&gt;
&lt;th&gt;Naive Classification&lt;/th&gt;
&lt;th&gt;Correct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Future monetary policy decisions will be conditional on the inflation outlook"&lt;/td&gt;
&lt;td&gt;guidance_hawkish&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;neutral&lt;/strong&gt; (boilerplate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"The member voted against the rate increase"&lt;/td&gt;
&lt;td&gt;dissent_hawkish&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;dissent_dovish&lt;/strong&gt; (wanted lower rates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Average interest rate on ruble loans rose to 8.5%"&lt;/td&gt;
&lt;td&gt;rate_hike&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;neutral&lt;/strong&gt; (market rate description, not policy)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To catch these errors, each sentence is classified twice at different temperatures (0.0 and 0.1). Disagreements are flagged for review.&lt;/p&gt;
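
&lt;p&gt;The double-classification check can be sketched as follows. The LLM call is abstracted as a hypothetical callable &lt;code&gt;classify(sentence, temperature)&lt;/code&gt;, and &lt;code&gt;stub_classify&lt;/code&gt; is a deterministic stand-in so the example runs without a model. Withholding the label on disagreement is an assumption of this sketch.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckedLabel:
    sentence: str
    label: Optional[str]   # None when the two runs disagree (assumed policy)
    needs_review: bool
    runs: tuple            # raw labels from each run, kept for the reviewer


def classify_twice(sentence, classify, temperatures=(0.0, 0.1)):
    """Run the classifier once per temperature and flag any disagreement."""
    runs = tuple(classify(sentence, t) for t in temperatures)
    agreed = len(set(runs)) == 1
    return CheckedLabel(sentence, runs[0] if agreed else None, not agreed, runs)


def stub_classify(sentence, temperature):
    """Stand-in for the LLM call (hypothetical); wavers on conditional language."""
    if "conditional on" in sentence:
        return "neutral" if temperature == 0.0 else "guidance_hawkish"
    return "dissent_dovish"


stable = classify_twice("The member voted against the rate increase", stub_classify)
shaky = classify_twice(
    "Future monetary policy decisions will be conditional on the inflation outlook",
    stub_classify,
)
```

&lt;p&gt;Here &lt;code&gt;stable&lt;/code&gt; keeps its label, while &lt;code&gt;shaky&lt;/code&gt; comes back with &lt;code&gt;needs_review&lt;/code&gt; set and both raw answers preserved for whoever reviews it.&lt;/p&gt;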

&lt;h3&gt;
  
  
  4. Aggregation
&lt;/h3&gt;

&lt;p&gt;Sentence-level classifications are aggregated into document-level and bank-level metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hawk/dove ratio per document&lt;/li&gt;
&lt;li&gt;Stance shifts over time&lt;/li&gt;
&lt;li&gt;Dissent tracking (who dissented and in which direction)&lt;/li&gt;
&lt;/ul&gt;
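
&lt;p&gt;As an illustration of the first two metrics, the sketch below computes a net hawk/dove score per document and the change between two documents. Both the grouping of labels into hawkish and dovish sets and the formula itself are assumptions. The post does not define the ratio, and a plain hawkish share would work just as well.&lt;/p&gt;

```python
from collections import Counter

# Assumed grouping of the sentiment labels; the post does not spell this out.
HAWKISH = {"rate_hike", "guidance_hawkish", "dissent_hawkish", "liquidity_tightening"}
DOVISH = {"rate_cut", "guidance_dovish", "dissent_dovish", "liquidity_easing"}


def hawk_dove_score(labels):
    """Return (hawkish - dovish) / (hawkish + dovish), a value in [-1, 1].

    Neutral, hold and irrelevant sentences are ignored. Returns None when
    the document has no directional sentences at all.
    """
    counts = Counter(labels)
    hawk = sum(counts[label] for label in HAWKISH)
    dove = sum(counts[label] for label in DOVISH)
    if hawk + dove == 0:
        return None
    return (hawk - dove) / (hawk + dove)


def stance_shift(previous_labels, current_labels):
    """Score change between two documents; None if either score is undefined."""
    before = hawk_dove_score(previous_labels)
    after = hawk_dove_score(current_labels)
    if before is None or after is None:
        return None
    return after - before
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; instead of zero for a document with no directional sentences keeps "no signal" distinct from "evenly balanced" when the scores are charted over time.&lt;/p&gt;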

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dissent direction is counterintuitive.&lt;/strong&gt; If the majority voted to hike rates and one member voted against the hike, that dissent is &lt;em&gt;dovish&lt;/em&gt;: the dissenter wanted lower rates. This seems obvious in retrospect, but getting the prompts right took several iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boilerplate is the enemy.&lt;/strong&gt; Every central bank repeats the same conditional phrases meeting after meeting: "future decisions will depend on incoming data." These aren't signals — they're filler. The classifier needed explicit examples of common boilerplate to avoid false positives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bank-specific rules matter.&lt;/strong&gt; The &lt;a href="https://monetary.live/pboc.html" rel="noopener noreferrer"&gt;PBOC&lt;/a&gt; communicates completely differently from the &lt;a href="https://monetary.live/fed.html" rel="noopener noreferrer"&gt;Fed&lt;/a&gt;. PBOC statements are short and formulaic. Fed minutes are discursive, with extensive debate. The &lt;a href="https://monetary.live/cbr.html" rel="noopener noreferrer"&gt;Bank of Russia&lt;/a&gt;'s quarterly reviews describe market conditions that look like policy decisions but aren't. Each bank required tailored prompt rules.&lt;/p&gt;
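
&lt;p&gt;The three lessons above suggest a prompt assembled from a shared dissent rule, a handful of boilerplate examples, and per-bank rules. The sketch below shows one possible layout. The rule wording, the example list, and the names &lt;code&gt;BANK_RULES&lt;/code&gt;, &lt;code&gt;BOILERPLATE_EXAMPLES&lt;/code&gt; and &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical paraphrases of the post, not the prompts actually used.&lt;/p&gt;

```python
BASE_INSTRUCTIONS = (
    "Classify the sentence from a central bank document. "
    "Answer with one sentiment label and one topic label."
)

# Rule that applies to every bank (paraphrased from the dissent lesson).
SHARED_RULES = [
    "A vote against a rate increase is dissent_dovish, not dissent_hawkish.",
]

# Hypothetical per-bank rules keyed by short bank code.
BANK_RULES = {
    "cbr": [
        "Descriptions of market lending rates are not policy decisions; "
        "label them neutral.",
    ],
}

# Boilerplate the model should learn to ignore, with the expected label.
BOILERPLATE_EXAMPLES = [
    ("Future decisions will depend on incoming data.", "neutral"),
]


def build_prompt(bank, sentence):
    """Assemble base instructions, rules and examples into one prompt string."""
    lines = [BASE_INSTRUCTIONS, "", "Rules:"]
    for rule in SHARED_RULES + BANK_RULES.get(bank, []):
        lines.append(f"- {rule}")
    lines += ["", "Examples:"]
    for text, label in BOILERPLATE_EXAMPLES:
        lines.append(f'- "{text}" = {label}')
    lines += ["", f"Sentence: {sentence}"]
    return "\n".join(lines)
```

&lt;p&gt;Keeping the per-bank rules in a plain dictionary means adding a new bank is a data change, and banks without special rules fall back to the shared ones.&lt;/p&gt;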

&lt;h2&gt;
  
  
  Current Scale
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;26 central banks&lt;/strong&gt;, including &lt;a href="https://monetary.live/fed.html" rel="noopener noreferrer"&gt;Fed&lt;/a&gt;, &lt;a href="https://monetary.live/ecb.html" rel="noopener noreferrer"&gt;ECB&lt;/a&gt;, &lt;a href="https://monetary.live/boj.html" rel="noopener noreferrer"&gt;BOJ&lt;/a&gt;, &lt;a href="https://monetary.live/boe.html" rel="noopener noreferrer"&gt;BoE&lt;/a&gt;, &lt;a href="https://monetary.live/pboc.html" rel="noopener noreferrer"&gt;PBOC&lt;/a&gt;, &lt;a href="https://monetary.live/rbi.html" rel="noopener noreferrer"&gt;RBI&lt;/a&gt;, &lt;a href="https://monetary.live/bcb.html" rel="noopener noreferrer"&gt;BCB&lt;/a&gt;, &lt;a href="https://monetary.live/boc.html" rel="noopener noreferrer"&gt;BoC&lt;/a&gt;, &lt;a href="https://monetary.live/rba.html" rel="noopener noreferrer"&gt;RBA&lt;/a&gt;, &lt;a href="https://monetary.live/tcmb.html" rel="noopener noreferrer"&gt;TCMB&lt;/a&gt;, &lt;a href="https://monetary.live/snb.html" rel="noopener noreferrer"&gt;SNB&lt;/a&gt;, &lt;a href="https://monetary.live/cbr.html" rel="noopener noreferrer"&gt;CBR&lt;/a&gt;, &lt;a href="https://monetary.live/bok.html" rel="noopener noreferrer"&gt;BoK&lt;/a&gt;, &lt;a href="https://monetary.live/banxico.html" rel="noopener noreferrer"&gt;Banxico&lt;/a&gt;, &lt;a href="https://monetary.live/sarb.html" rel="noopener noreferrer"&gt;SARB&lt;/a&gt;, &lt;a href="https://monetary.live/cbn.html" rel="noopener noreferrer"&gt;CBN&lt;/a&gt;, &lt;a href="https://monetary.live/mas.html" rel="noopener noreferrer"&gt;MAS&lt;/a&gt;, &lt;a href="https://monetary.live/boi.html" rel="noopener noreferrer"&gt;BoI&lt;/a&gt;, &lt;a href="https://monetary.live/nbp.html" rel="noopener noreferrer"&gt;NBP&lt;/a&gt;, &lt;a href="https://monetary.live/norges.html" rel="noopener noreferrer"&gt;Norges&lt;/a&gt;, &lt;a href="https://monetary.live/riksbank.html" rel="noopener noreferrer"&gt;Riksbank&lt;/a&gt;, &lt;a href="https://monetary.live/rbnz.html" rel="noopener noreferrer"&gt;RBNZ&lt;/a&gt;, &lt;a href="https://monetary.live/mnb.html" rel="noopener noreferrer"&gt;MNB&lt;/a&gt;, &lt;a href="https://monetary.live/nbs.html" rel="noopener noreferrer"&gt;NBS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;225,000+ classified sentences&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 sentiment classes, 9 topic categories&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Daily updates via Airflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Divergences Right Now
&lt;/h2&gt;

&lt;p&gt;Some current policy stances that stand out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bank&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Stance&lt;/th&gt;
&lt;th&gt;Notable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/boj.html" rel="noopener noreferrer"&gt;BOJ&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.75%&lt;/td&gt;
&lt;td&gt;Cautiously hawkish&lt;/td&gt;
&lt;td&gt;Normalizing after decades at zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/snb.html" rel="noopener noreferrer"&gt;SNB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;Neutral&lt;/td&gt;
&lt;td&gt;Back to the floor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/tcmb.html" rel="noopener noreferrer"&gt;TCMB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;Hawkish&lt;/td&gt;
&lt;td&gt;Emergency tightening&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/pboc.html" rel="noopener noreferrer"&gt;PBOC&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3.00%&lt;/td&gt;
&lt;td&gt;Dovish&lt;/td&gt;
&lt;td&gt;Supporting growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/bcb.html" rel="noopener noreferrer"&gt;BCB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14.75%&lt;/td&gt;
&lt;td&gt;Hawkish&lt;/td&gt;
&lt;td&gt;Among the highest in the G20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://monetary.live/fed.html" rel="noopener noreferrer"&gt;Fed&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3.75%&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;Cutting, but with cautious language&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Live Dashboard
&lt;/h2&gt;

&lt;p&gt;The full dashboard is at &lt;strong&gt;&lt;a href="https://monetary.live" rel="noopener noreferrer"&gt;monetary.live&lt;/a&gt;&lt;/strong&gt; — each bank has its own page with statement history, sentiment breakdowns, and policy metrics.&lt;/p&gt;

&lt;p&gt;I'm also tracking tech trends with a separate pipeline at &lt;a href="https://pulsar.ivan.digital" rel="noopener noreferrer"&gt;pulsar.ivan.digital&lt;/a&gt; (arXiv papers, GitHub repos, Reddit discussions).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python async crawlers (aiohttp, Playwright)&lt;/li&gt;
&lt;li&gt;LLM classification with self-validation&lt;/li&gt;
&lt;li&gt;SQLite for storage&lt;/li&gt;
&lt;li&gt;Airflow for orchestration&lt;/li&gt;
&lt;li&gt;Firebase Hosting for the dashboard&lt;/li&gt;
&lt;li&gt;structlog for logging&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Would love feedback on the methodology. If you work with central bank text data or NLP for finance, I'd be curious to hear what approaches you've tried.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
