Emmanuel Boakye

Posted on May 30

Building a RAG Pipeline for Greenwashing Detection in Oil & Gas

#python #machinelearning #nlp #rag

Automated Greenwashing Detection for Oil & Gas Sustainability Reporting

Stack: Python · React · Vercel | Live: claimify-esg.vercel.app

Stat	Value
Claims scored	2,203 across 10 oil & gas majors
Eval accuracy	86.7% (52/60 hand-labelled set)
Historical pledges tracked	710 from 2021 sustainability reports
Evidence corpus	290 chunks across 7 NGO sources + Guardian

Abstract

Corporate sustainability reports are structurally difficult to audit. Claims range from specific quantified commitments ("reduced Scope 1 emissions by 37% against our 2019 baseline") to vague aspirations ("we support the energy transition"). Claimify is a RAG (Retrieval-Augmented Generation) pipeline that scores each claim against a curated NGO evidence corpus rather than relying on the LLM's training weights alone. Retrieval gives the model citable, up-to-date source material for each claim; generation produces a structured verdict grounded in that material, not in generalities about the sector.

The pipeline has five stages: PDF ingestion, NLP filtering (ClimateBERT + GPT-4o structured extraction), two-stage retrieval (SBERT shortlisting + cross-encoder reranking), LLM scoring (a structured two-step prompt), and materiality adjustment (category-specific multipliers). A separate Commitment Tracker layer reuses the same retrieval stack to score 2021 pledges against 2023–2025 NGO evidence, measuring how much follow-through actually happened.

The pipeline runs as an offline batch job. Every model and LLM call happens at ingestion time, and no API calls happen at read time. The frontend reads two pre-generated JSON files, served as static files on Vercel.

Pipeline stages:

PDF download → pdfminer parse → ClimateBERT filter → GPT-4o extract → SBERT + rerank → GPT-4o-mini score → Materiality weight → React / Vercel

§01 — Background and Design Decisions

Oil and gas sustainability reports are written for several audiences at once: shareholders, regulators, and the general public. The same company can disclose a precise Scope 1 reduction figure in one paragraph and an unfalsifiable aspiration in the next. Distinguishing between these is not a keyword problem or a sentiment problem. It requires knowing whether a claim makes a falsifiable factual assertion and, separately, whether available evidence refutes it.

Existing approaches shared a gap: they either flagged anything containing "net-zero" or "carbon neutral" (keyword matching) or relied on journalists manually cross-referencing company statements with NGO reports. Neither approach scales, and neither produces a traceable reasoning chain. This pipeline automates that cross-referencing step and records the evidence behind each verdict.

The choice to use RAG rather than a prompt-only classifier is deliberate. LLM training weights are frozen at a cutoff date and do not contain the specific Carbon Tracker reports, Reclaim Finance assessments, or InfluenceMap briefings that contradict individual company claims. Giving the model that material at inference time means every verdict can be traced back to specific retrieved documents, not to generalised knowledge about the sector.

Architectural decisions and their reasons

Decision	Why we made it	What we accepted in exchange
Offline batch, not live API	All scoring runs once per ingestion cycle. The frontend reads a static JSON file. No cost per page-view, no failure modes at read time.	New reports need a full pipeline re-run. There is no live update between cycles.
Two-step LLM scoring	A single-step prompt that asked the model to weigh evidence and classify simultaneously inflated the contradicted rate whenever NGO evidence was critical, even for well-quantified claims. Separating the two decisions fixed it.	Two LLM calls per claim rather than one.
ClimateBERT before GPT-4o	A 310-page PDF produces thousands of sentences. Running GPT-4o over all of them is expensive and introduces noise. ClimateBERT drops roughly 65% of sentences before extraction, cheaply and locally.	ClimateBERT misclassifies a small fraction of climate sentences as off-topic and drops them permanently.
Company-scoped retrieval	Each corpus chunk carries an `applies_to` field listing relevant company IDs. Retrieval filters by this before cosine search so BP evidence never surfaces for a Shell claim.	Cross-company sector patterns must be duplicated into each company's scope list manually.
Materiality multipliers	A net-zero claim and a biodiversity mention with the same raw LLM score carry different reputational and legal risk. The multipliers encode this as an explicit, version-controlled domain judgement rather than leaving it implicit.	The multiplier values are not empirically calibrated. They are expert judgement and should be revisited.
gpt-4o-mini, not gpt-4o	Roughly 10x cheaper per token at comparable accuracy on structured JSON classification. The prompt examples are the primary quality driver; the model choice is a cost trade-off once the eval bar is met.	Higher error rate on edge cases at the well/weakly boundary.
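The two-step scoring and materiality rows in the table above can be sketched together. This is a minimal sketch, not Claimify's implementation: the prompts, the JSON fields, the category names, the multiplier values, and the cap at 1.0 are all assumptions made for illustration, and `llm` is any callable that takes a prompt string and returns a JSON string.

```python
import json
from typing import Callable

# Hypothetical categories and values; the article does not publish the real ones.
MATERIALITY = {"net_zero": 1.5, "biodiversity": 0.8}


def score_claim(claim: str, category: str, evidence: list[str],
                llm: Callable[[str], str]) -> dict:
    """Two LLM calls per claim, then a category multiplier on the raw score."""
    # Step 1: judge the claim on its own wording; no evidence in the prompt.
    step1 = json.loads(llm(
        "STEP 1. Is this claim specific and quantified? "
        'Reply as JSON: {"quantified": true|false}\n'
        f"CLAIM: {claim}"
    ))
    # Step 2: weigh the retrieved evidence, with the step-1 result held fixed.
    evidence_block = "\n".join(f"- {e}" for e in evidence)
    step2 = json.loads(llm(
        "STEP 2. Given the evidence, how strongly is the claim contradicted? "
        'Reply as JSON: {"verdict": str, "risk": number in [0, 1]}\n'
        f"CLAIM: {claim}\nQUANTIFIED: {step1['quantified']}\n"
        f"EVIDENCE:\n{evidence_block}"
    ))
    raw = float(step2["risk"])
    # Unknown categories fall back to a neutral 1.0; the cap at 1.0 is an assumption.
    weighted = min(1.0, raw * MATERIALITY.get(category, 1.0))
    return {
        "quantified": bool(step1["quantified"]),
        "verdict": step2["verdict"],
        "raw_score": raw,
        "weighted_score": round(weighted, 3),
    }
```

Keeping the multipliers in a plain dictionary is one way to make the domain judgement explicit and easy to diff under version control.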

Platform note: The Commitment Tracker extracts historical claims in a background subprocess. On Windows, PyTorch DLL loading fails in that context. The Tracker substitutes a keyword regex filter for ClimateBERT. The main ingestion pipeline, which runs in the foreground, keeps ClimateBERT. This is a platform limitation, not a design choice.
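A keyword fallback of the kind the platform note describes might look like the sketch below. The keyword list is hypothetical (the Tracker's actual pattern is not shown in this post); the function returns one boolean per sentence so it can stand in for `predict_batch` without changing the caller.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
CLIMATE_PATTERN = re.compile(
    r"\b(net[- ]zero|emissions?|scope [123]|carbon|methane|flaring|renewables?)\b",
    re.IGNORECASE,
)


def keyword_filter(sentences: list[str]) -> list[bool]:
    """Return True for each sentence that mentions a climate keyword."""
    return [bool(CLIMATE_PATTERN.search(s)) for s in sentences]
```

A regex like this needs no PyTorch import, which is what makes it usable in the background subprocess, at the cost of missing climate sentences that use none of the listed terms.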

§02 — System Architecture

Two input streams meet at the retrieval stage.

Claims stream: Sustainability PDFs → pdfminer.six → ClimateBERT filter → GPT-4o extract → claims.jsonl + SBERT embeddings

Evidence stream: NGO sources + Guardian API → HTTP/PDF scraping with applies_to tagging → 350-word chunks / 50-word overlap → SBERT corpus embeddings → corpus_vectors.npy
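The 350-word / 50-word-overlap chunking in the evidence stream can be sketched as a sliding window over whitespace-split words. This is a minimal sketch under that assumption; the function name and the whitespace tokenisation are illustrative, not taken from the Claimify codebase.

```python
def chunk_words(text: str, size: int = 350, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # each new window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.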

Both streams converge at:

SBERT ANN search (company-scoped, k=20) — cosine similarity over applies_to-filtered corpus subset
Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) — pairwise (claim, evidence) attention scoring → top-5
GPT-4o-mini scorer — CLAIM + RETRIEVED EVIDENCE in prompt → verdict grounded in docs
Materiality weight → rationales.json → React / Vite / Vercel static
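The retrieval steps above (scope by `applies_to`, shortlist by cosine similarity with k=20, rerank to the top 5) can be sketched without any model dependencies. This is a toy illustration: it uses an exact brute-force cosine search rather than an ANN index, the chunk dictionary layout is an assumption, and `rerank_score` is a placeholder callable where the real pipeline would use a cross-encoder's pair score.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(claim_text: str, claim_vec: list[float], company_id: str,
             corpus: list[dict], rerank_score: Callable[[str, str], float],
             k: int = 20, top_n: int = 5) -> list[dict]:
    """Company-scoped shortlist by cosine similarity, then rerank the shortlist."""
    # 1. Scope first, so another company's evidence can never be returned.
    scoped = [c for c in corpus if company_id in c["applies_to"]]
    # 2. Cheap stage: rank the scoped chunks by embedding similarity, keep k.
    shortlist = sorted(
        scoped, key=lambda c: cosine(claim_vec, c["vector"]), reverse=True
    )[:k]
    # 3. Expensive stage: score each (claim, evidence) pair, keep top_n.
    return sorted(
        shortlist, key=lambda c: rerank_score(claim_text, c["text"]), reverse=True
    )[:top_n]
```

Filtering before the similarity search, rather than after, guarantees the scoping constraint even when another company's chunk is the closest match in embedding space.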

Commitment Tracker branch (separate): 2021 PDFs → keyword filter → 710 pledges → same SBERT + cross-encoder retrieval → GPT-4o-mini gap scoring

§03 — NLP Pipeline: Filtering and Extraction

Before retrieval can happen, the pipeline needs claims. A 300-page corporate sustainability report typically contains thousands of sentences. Most of them are financial tables, legal disclaimers, headers, and general corporate narrative that has nothing to do with climate commitments. Running GPT-4o over all of them would be slow, expensive, and would pull in a lot of noise. Three steps handle this: sentence splitting, ClimateBERT relevance filtering, and structured claim extraction.

ClimateBERT is the gating step. It runs locally, no API call required, and processes sentences in batches of 32. In practice it drops 60–70% of a typical report. Only what passes goes to GPT-4o for extraction.

ClimateBERT relevance filter


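```python
# nlp/relevance_filter.py
from functools import lru_cache

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = (
    "climatebert/"
    "distilroberta-base-climate-detector"
)
BATCH_SIZE = 32  # callers pass at most this many sentences per call


@lru_cache(maxsize=1)
def get_model():
    # Assumed helper (not shown in the original excerpt): load once, then reuse.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()  # inference only
    return tokenizer, model


def predict_batch(sentences: list[str]) -> list[bool]:
    if not sentences:
        return []  # the tokenizer rejects an empty batch
    tokenizer, model = get_model()
    inputs = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=-1).tolist()
    # label 1 = climate-relevant
    return [p == 1 for p in preds]
```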
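The excerpt classifies one batch at a time; the loop that feeds it batches of 32 is not shown. A minimal sketch of that driver follows. The names `batched` and `filter_relevant` are hypothetical, and the classifier is passed in as a callable so the loop can be exercised without loading the model.

```python
from typing import Callable, Iterator

BATCH_SIZE = 32


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def filter_relevant(sentences: list[str],
                    predict: Callable[[list[str]], list[bool]],
                    batch_size: int = BATCH_SIZE) -> list[str]:
    """Keep only the sentences the classifier flags, preserving order."""
    kept = []
    for batch in batched(sentences, batch_size):
        flags = predict(batch)  # one bool per sentence in the batch
        kept.extend(s for s, keep in zip(batch, flags) if keep)
    return kept
```

In the pipeline, `predict` would be `predict_batch`; only the sentences this returns would go on to GPT-4o for extraction.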

Top comments (2)

Harjot Singh • May 31

Greenwashing detection is a genuinely interesting RAG application because it's not "find the relevant passage" - it's "compare claims against evidence and flag the gap," which is a harder, more adversarial retrieval problem. You're not just retrieving what a company SAYS about sustainability; you need to retrieve the contradicting facts (emissions data, actual practices) and surface the mismatch. That cross-referencing-for-contradiction is a meaningfully tougher pipeline than standard Q&A RAG.

The part I'd guard hardest: an accusatory output (this is greenwashing) had better be grounded and defensible, because a false positive here isn't just a wrong answer, it's potentially defamatory. So the verification/citation layer matters more than usual - every flag should trace to specific evidence, not the model's vibe. That ground-and-cite-or-don't-claim discipline is exactly what I build into Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) for any high-stakes output. Fascinating domain application. How are you handling the false-positive risk - confidence thresholds, requiring N pieces of contradicting evidence before flagging, or human review on flags? In a domain this sensitive the precision side seems as important as the recall.

Harjot Singh • May 31

Greenwashing detection is a genuinely good RAG use case. It's fundamentally a claim-vs-evidence problem, exactly what retrieval-grounded systems are for: does this corporate statement hold up against the actual filings and data? The risk is the same as in all high-stakes RAG: a confident wrong classification (flagging something as greenwashing, or clearing it) carries real reputational and legal weight, so retrieval quality and provenance matter more than the model. Every flag should trace to the source text it's based on. That grounding-and-cite discipline is how I think about trustworthy output in Moonshift. How are you handling false positives: a confidence threshold, or human review on the flags?