<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emmanuel Boakye</title>
    <description>The latest articles on DEV Community by Emmanuel Boakye (@emmanuel_boakye_6b990ae6c).</description>
    <link>https://dev.to/emmanuel_boakye_6b990ae6c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909067%2F7f829aa3-71d9-43c5-99b2-cebf8a808027.jpg</url>
      <title>DEV Community: Emmanuel Boakye</title>
      <link>https://dev.to/emmanuel_boakye_6b990ae6c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emmanuel_boakye_6b990ae6c"/>
    <language>en</language>
    <item>
      <title>Building a RAG Pipeline for Greenwashing Detection in Oil &amp; Gas</title>
      <dc:creator>Emmanuel Boakye</dc:creator>
      <pubDate>Sat, 30 May 2026 15:56:03 +0000</pubDate>
      <link>https://dev.to/emmanuel_boakye_6b990ae6c/building-a-rag-pipeline-for-greenwashing-detection-in-oil-gas-3bep</link>
      <guid>https://dev.to/emmanuel_boakye_6b990ae6c/building-a-rag-pipeline-for-greenwashing-detection-in-oil-gas-3bep</guid>
      <description>&lt;h1&gt;
  
  
  Automated Greenwashing Detection for Oil &amp;amp; Gas Sustainability Reporting
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python · React · Vercel | &lt;strong&gt;Live:&lt;/strong&gt; &lt;a href="https://claimify-esg.vercel.app" rel="noopener noreferrer"&gt;claimify-esg.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stat&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claims scored&lt;/td&gt;
&lt;td&gt;2,203 across 10 oil &amp;amp; gas majors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval accuracy&lt;/td&gt;
&lt;td&gt;86.7% (52/60 hand-labelled set)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Historical pledges tracked&lt;/td&gt;
&lt;td&gt;710 from 2021 sustainability reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence corpus&lt;/td&gt;
&lt;td&gt;290 chunks across 7 NGO sources + Guardian&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;Corporate sustainability reports are structurally difficult to audit. Claims range from specific quantified commitments ("reduced Scope 1 emissions by 37% against our 2019 baseline") to vague aspiration ("we support the energy transition"). Claimify is a &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; pipeline that scores each claim against a curated NGO evidence corpus rather than relying on the LLM's training weights alone. Retrieval gives the model citable, up-to-date source material for each claim; generation produces a structured verdict grounded in that material, not in generalities about the sector.&lt;/p&gt;

&lt;p&gt;The pipeline has five stages: PDF ingestion, NLP filtering (ClimateBERT + GPT-4o structured extraction), two-stage retrieval (SBERT shortlisting + cross-encoder reranking), LLM scoring (a structured two-step prompt), and materiality adjustment (category-specific multipliers). A separate Commitment Tracker layer reuses the same retrieval stack to score 2021 pledges against 2023–2025 NGO evidence, measuring how much follow-through actually happened.&lt;/p&gt;

&lt;p&gt;The pipeline runs as an offline batch job. All model and API calls happen during ingestion, and none happen at read time. The frontend reads two pre-generated JSON files, served as static files on Vercel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline stages:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;PDF download&lt;/code&gt; → &lt;code&gt;pdfminer parse&lt;/code&gt; → &lt;code&gt;ClimateBERT filter&lt;/code&gt; → &lt;code&gt;GPT-4o extract&lt;/code&gt; → &lt;code&gt;SBERT + rerank&lt;/code&gt; → &lt;code&gt;GPT-4o-mini score&lt;/code&gt; → &lt;code&gt;Materiality weight&lt;/code&gt; → &lt;code&gt;React / Vercel&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  §01 — Background and Design Decisions
&lt;/h2&gt;

&lt;p&gt;Oil and gas sustainability reports are written for several audiences at once: shareholders, regulators, and the general public. The same company can disclose a precise Scope 1 reduction figure in one paragraph and an unfalsifiable aspiration in the next. Distinguishing between the two is not a keyword problem or a sentiment problem. It requires knowing whether a claim makes a falsifiable factual assertion and, separately, whether available evidence refutes it.&lt;/p&gt;

&lt;p&gt;Existing approaches had a shared gap: they either flagged anything containing "net-zero" or "carbon neutral" (keyword matching) or relied on journalists manually cross-referencing company statements with NGO reports. Neither approach scales, and neither produces a traceable reasoning chain. This pipeline automates that cross-referencing step and keeps the reasoning traceable.&lt;/p&gt;

&lt;p&gt;The choice to use RAG rather than a prompt-only classifier is deliberate. LLM training weights are frozen at a cutoff date and do not contain the specific Carbon Tracker reports, Reclaim Finance assessments, or InfluenceMap briefings that contradict individual company claims. Giving the model that material at inference time means every verdict can be traced back to specific retrieved documents, not to generalised knowledge about the sector.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural decisions and their reasons
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Why we made it&lt;/th&gt;
&lt;th&gt;What we accepted in exchange&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline batch, not live API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All scoring runs once per ingestion cycle. The frontend reads a static JSON file. No cost per page-view, no failure modes at read time.&lt;/td&gt;
&lt;td&gt;New reports need a full pipeline re-run. There is no live update between cycles.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Two-step LLM scoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A single-step prompt that asked the model to weigh evidence and classify simultaneously inflated the &lt;em&gt;contradicted&lt;/em&gt; rate whenever NGO evidence was critical, even for well-quantified claims. Separating the two decisions fixed it.&lt;/td&gt;
&lt;td&gt;Two LLM calls per claim rather than one.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ClimateBERT before GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A 310-page PDF produces thousands of sentences. Running GPT-4o over all of them is expensive and introduces noise. ClimateBERT drops roughly 65% of sentences before extraction, cheaply and locally.&lt;/td&gt;
&lt;td&gt;ClimateBERT misclassifies a small fraction of climate sentences as off-topic and drops them permanently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Company-scoped retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each corpus chunk carries an &lt;code&gt;applies_to&lt;/code&gt; field listing relevant company IDs. Retrieval filters by this before cosine search so BP evidence never surfaces for a Shell claim.&lt;/td&gt;
&lt;td&gt;Cross-company sector patterns must be duplicated into each company's scope list manually.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Materiality multipliers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A net-zero claim and a biodiversity mention with the same raw LLM score carry different reputational and legal risk. The multipliers encode this as an explicit, version-controlled domain judgement rather than leaving it implicit.&lt;/td&gt;
&lt;td&gt;The multiplier values are not empirically calibrated. They are expert judgement and should be revisited.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gpt-4o-mini, not gpt-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Roughly 10x cheaper per token at comparable accuracy on structured JSON classification. The prompt examples are the primary quality driver; the model choice is a cost trade-off once the eval bar is met.&lt;/td&gt;
&lt;td&gt;Higher error rate on edge cases at the well/weakly boundary.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
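
&lt;p&gt;The materiality row can be pictured as a small lookup applied after the LLM score. The sketch below is illustrative only: the category names, the multiplier values, and the &lt;code&gt;adjust_score&lt;/code&gt; helper are hypothetical placeholders, and the post does not state how Claimify combines the multiplier with the raw score. Plain multiplication is assumed here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
# Hypothetical multipliers: placeholder values, not the ones Claimify uses.
MATERIALITY_MULTIPLIERS = {
    "net_zero": 1.5,
    "biodiversity": 0.8,
}
DEFAULT_MULTIPLIER = 1.0  # assumed fallback for categories with no entry


def adjust_score(raw_score, category):
    """Scale a raw LLM score by its category multiplier (assumed: plain product)."""
    multiplier = MATERIALITY_MULTIPLIERS.get(category, DEFAULT_MULTIPLIER)
    return raw_score * multiplier
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the values in one dictionary is what makes the judgement explicit and version-controlled: a change to any multiplier shows up as a one-line diff.&lt;/p&gt;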

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Platform note:&lt;/strong&gt; The Commitment Tracker extracts historical claims in a background subprocess. On Windows, PyTorch DLL loading fails in that context. The Tracker substitutes a keyword regex filter for ClimateBERT. The main ingestion pipeline, which runs in the foreground, keeps ClimateBERT. This is a platform limitation, not a design choice.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  §02 — System Architecture
&lt;/h2&gt;

&lt;p&gt;Two input streams meet at the retrieval stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claims stream:&lt;/strong&gt; Sustainability PDFs → pdfminer.six → ClimateBERT filter → GPT-4o extract → &lt;code&gt;claims.jsonl&lt;/code&gt; + SBERT embeddings&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence stream:&lt;/strong&gt; NGO sources + Guardian API → HTTP/PDF scraping with &lt;code&gt;applies_to&lt;/code&gt; tagging → 350-word chunks / 50-word overlap → SBERT corpus embeddings → &lt;code&gt;corpus_vectors.npy&lt;/code&gt;&lt;/p&gt;
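
&lt;p&gt;The 350-word / 50-word-overlap chunking in the evidence stream could look roughly like the sketch below. The &lt;code&gt;chunk_words&lt;/code&gt; name and whitespace-based word splitting are assumptions; the post does not show the actual chunker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
def chunk_words(text, size=350, overlap=50):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if overlap not in range(size):
        raise ValueError("overlap must be an integer from 0 up to size - 1")
    words = text.split()  # assumption: words are whitespace-delimited
    if not words:
        return []
    step = size - overlap  # how far each window advances
    # Stop once a window has reached the final word,
    # so no redundant tail chunk is emitted.
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[start:start + size])
        for start in range(0, last_start, step)
    ]
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the default settings each window advances 300 words, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.&lt;/p&gt;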

&lt;p&gt;&lt;strong&gt;Both streams converge at:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SBERT ANN search&lt;/strong&gt; (company-scoped, k=20) — cosine similarity over &lt;code&gt;applies_to&lt;/code&gt;-filtered corpus subset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-encoder rerank&lt;/strong&gt; (ms-marco-MiniLM-L-6-v2) — pairwise (claim, evidence) attention scoring → top-5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o-mini scorer&lt;/strong&gt; — CLAIM + RETRIEVED EVIDENCE in prompt → verdict grounded in docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materiality weight&lt;/strong&gt; → &lt;code&gt;rationales.json&lt;/code&gt; → React / Vite / Vercel static&lt;/li&gt;
&lt;/ol&gt;
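
&lt;p&gt;The first two convergence steps can be sketched in plain Python as follows. This is a simplified illustration, not the project's code: the chunk dictionary layout (&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;vector&lt;/code&gt;, &lt;code&gt;applies_to&lt;/code&gt;), the function names, and the &lt;code&gt;pair_score&lt;/code&gt; callable standing in for the cross-encoder are all assumptions, and the search here is exhaustive rather than approximate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(claim_vec, chunks, company_id, k=20):
    """Filter by applies_to first, then rank the remaining chunks by cosine."""
    scoped = [c for c in chunks if company_id in c["applies_to"]]
    scoped.sort(key=lambda c: cosine(claim_vec, c["vector"]), reverse=True)
    return scoped[:k]


def rerank(claim, candidates, pair_score, top_n=5):
    """Re-order the shortlist with a pairwise (claim, evidence) scorer."""
    ranked = sorted(
        candidates,
        key=lambda c: pair_score(claim, c["text"]),
        reverse=True,
    )
    return ranked[:top_n]
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the company filter runs before similarity ranking, a chunk tagged only for BP can never reach the reranker for a Shell claim, no matter how similar its embedding is.&lt;/p&gt;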

&lt;p&gt;&lt;strong&gt;Commitment Tracker branch (separate):&lt;/strong&gt; 2021 PDFs → keyword filter → 710 pledges → same SBERT + cross-encoder retrieval → GPT-4o-mini gap scoring&lt;/p&gt;




&lt;h2&gt;
  
  
  §03 — NLP Pipeline: Filtering and Extraction
&lt;/h2&gt;

&lt;p&gt;Before retrieval can happen, the pipeline needs claims. A 300-page corporate sustainability report typically contains thousands of sentences. Most of them are financial tables, legal disclaimers, headers, and general corporate narrative that has nothing to do with climate commitments. Running GPT-4o over all of them would be slow, expensive, and would pull in a lot of noise. Three steps handle this: sentence splitting, ClimateBERT relevance filtering, and structured claim extraction.&lt;/p&gt;

&lt;p&gt;ClimateBERT is the gating step. It runs locally, no API call required, and processes sentences in batches of 32. In practice it drops 60–70% of a typical report. Only what passes goes to GPT-4o for extraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  ClimateBERT relevance filter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
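# nlp/relevance_filter.py
from functools import lru_cache

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = (
    "climatebert/"
    "distilroberta-base-climate-detector"
)
BATCH_SIZE = 32

@lru_cache(maxsize=1)
def get_model():
    # Assumed loader (the original helper is not shown): load once, then reuse.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model

def predict_batch(sentences: list[str]) -&amp;gt; list[bool]:
    """Return one flag per sentence: True if labelled climate-relevant."""
    tokenizer, model = get_model()
    inputs = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=-1).tolist()
    # label 1 = climate-relevant
    return [p == 1 for p in preds]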
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
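
&lt;p&gt;A driver loop for the filter might look like the sketch below. It is a hedged illustration: &lt;code&gt;filter_sentences&lt;/code&gt; and &lt;code&gt;batched&lt;/code&gt; are hypothetical helper names, and the predictor is passed in as a parameter so the loop can be exercised without loading the model. In the pipeline, the predictor would be &lt;code&gt;predict_batch&lt;/code&gt; and the batch size would be &lt;code&gt;BATCH_SIZE&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filter_sentences(sentences, predict, batch_size=32):
    """Keep only the sentences that `predict` flags as climate-relevant.

    `predict` takes a list of sentences and returns one bool per sentence,
    which is the same contract as predict_batch.
    """
    kept = []
    for batch in batched(sentences, batch_size):
        flags = predict(batch)
        kept.extend(s for s, keep in zip(batch, flags) if keep)
    return kept
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only the sentences returned by this step would go on to GPT-4o extraction, which is where the cost saving comes from.&lt;/p&gt;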

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
