DEV Community

SIKOUTRIS
Using NLP to Detect Greenwashing: Building a Claim Verification Engine

Greenwashing is everywhere. Companies claim to be "eco-friendly," "sustainable," and "carbon-neutral" without evidence to back it up. We built Greenwashing Checker to automatically analyze environmental claims and flag potential greenwashing.

The technical challenge: teaching a system to distinguish genuine sustainability commitments from marketing fluff.

What Makes a Claim Greenwashing?

Before writing any code, we needed a taxonomy. Based on research from the European Commission and the FTC Green Guides, we identified seven patterns of greenwashing:

  1. Vague claims: "eco-friendly" with no specifics
  2. Irrelevant claims: Technically true but meaningless ("CFC-free" when CFCs are banned)
  3. Hidden tradeoffs: Highlighting one green attribute while ignoring larger impacts
  4. No proof: Claims without third-party certification or data
  5. Lesser of two evils: "Green" cigarettes, "sustainable" fast fashion
  6. Fake labels: Made-up certifications that look official
  7. Outright lies: Fabricated data or false certifications

Our system focuses on patterns 1, 2, 4, and 6 — the ones most amenable to automated detection.
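For reference, the taxonomy can be encoded as plain data. This is a sketch with our own shorthand names (not official FTC or EC terminology), where `automatable` marks the patterns the pipeline targets:

```python
# Greenwashing pattern taxonomy; names are our shorthand, not official terms.
GREENWASHING_PATTERNS = {
    1: {"name": "vague_claim", "automatable": True},
    2: {"name": "irrelevant_claim", "automatable": True},
    3: {"name": "hidden_tradeoff", "automatable": False},
    4: {"name": "no_proof", "automatable": True},
    5: {"name": "lesser_of_two_evils", "automatable": False},
    6: {"name": "fake_label", "automatable": True},
    7: {"name": "outright_lie", "automatable": False},
}

# The subset amenable to automated detection
AUTOMATED = [pid for pid, p in GREENWASHING_PATTERNS.items() if p["automatable"]]
```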

The Analysis Pipeline

When a user submits a URL or text for analysis, it goes through four stages:

Input Text → Claim Extraction → Vagueness Scoring → Certification Check → AI Analysis → Risk Score

Stage 1: Claim Extraction

First, we need to identify which sentences are actually making environmental claims. Not every sentence on a sustainability page is a claim — many are just filler.

import re

GREEN_KEYWORDS = [
    "sustainable", "eco-friendly", "green", "carbon neutral",
    "carbon negative", "net zero", "renewable", "biodegradable",
    "recyclable", "organic", "natural", "clean energy",
    "zero waste", "climate positive", "ethically sourced",
    "fair trade", "cruelty free", "plant based", "compostable"
]

def extract_claims(text):
    # Naive sentence splitting; a proper sentence tokenizer handles
    # abbreviations and decimals better
    sentences = text.split(".")
    claims = []
    for sentence in sentences:
        sentence = sentence.strip().lower()
        if not sentence:
            continue  # skip empty fragments from trailing periods
        keyword_matches = [
            kw for kw in GREEN_KEYWORDS 
            if kw in sentence
        ]
        if keyword_matches:
            claims.append({
                "text": sentence,
                "keywords": keyword_matches,
                # Concrete quantities make a claim checkable later
                "has_quantifier": bool(re.search(r"\d+%|\d+ tons?|\d+ tonnes?", sentence))
            })
    return claims
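The `has_quantifier` flag is just that regex, so it is worth spot-checking what it does and does not count as a concrete quantity (the sample sentences here are ours):

```python
import re

# Same pattern used in extract_claims above
QUANTIFIER = re.compile(r"\d+%|\d+ tons?|\d+ tonnes?")

samples = {
    "we are proudly eco-friendly": False,          # no numbers at all
    "we cut landfill waste by 40% in 2024": True,  # percentage
    "we diverted 12 tonnes of plastic": True,      # mass
}

for text, expected in samples.items():
    assert bool(QUANTIFIER.search(text)) == expected
```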

Stage 2: Vagueness Detection

This is where the NLP gets interesting. We score each claim on a specificity scale:

def vagueness_score(claim):
    score = 0
    text = claim["text"]

    # Vague qualifiers increase score
    vague_patterns = [
        (r"\b(some|many|various|several)\b", 0.2),
        (r"\b(striving|working towards|committed to)\b", 0.3),
        (r"\b(eco-friendly|green|sustainable)\b", 0.15),  # Without specifics
        (r"\b(better for|good for|helps?)\b", 0.2),
    ]

    for pattern, weight in vague_patterns:
        if re.search(pattern, text):
            score += weight

    # Specific quantifiers reduce score
    if claim["has_quantifier"]:
        score -= 0.3

    # Named certifications reduce score
    certifications = ["iso 14001", "b corp", "leed", "energy star", 
                      "fsc", "rainforest alliance", "cradle to cradle"]
    for cert in certifications:
        if cert in text:
            score -= 0.4

    return max(0, min(1, score))

A claim like "We are committed to being more sustainable" scores high on vagueness. A claim like "We reduced Scope 1 emissions by 23% between 2023-2025, verified by ISO 14064" scores low.
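Walking the weights by hand for those two example claims (note the second mentions ISO 14064, which is not in the certification list above, so only the quantifier deduction applies):

```python
# "We are committed to being more sustainable":
#   "committed to" (+0.3) + "sustainable" (+0.15) = 0.45
vague_claim_score = round(0.3 + 0.15, 2)

# "We reduced Scope 1 emissions by 23% ... verified by ISO 14064":
#   no vague patterns (0), quantifier present (-0.3), clamped to [0, 1]
specific_claim_score = max(0, min(1, 0.0 - 0.3))

assert vague_claim_score == 0.45
assert specific_claim_score == 0
```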

Stage 3: Certification Verification

We maintain a database of legitimate environmental certifications and their visual identifiers:

CREATE TABLE certifications (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(200),
    issuing_body VARCHAR(200),
    verification_url VARCHAR(500),
    is_legitimate BOOLEAN,
    category VARCHAR(100)
);

-- Known fake/misleading certifications
INSERT INTO certifications (name, is_legitimate) VALUES
('Green Approved', false),
('Eco Safe', false),
('Nature Certified', false),
('100% Green', false);

-- Legitimate certifications
INSERT INTO certifications (name, is_legitimate, issuing_body) VALUES
('B Corporation', true, 'B Lab'),
('ISO 14001', true, 'ISO'),
('Energy Star', true, 'EPA'),
('FSC Certified', true, 'Forest Stewardship Council');

When our system detects a certification mentioned in text, it cross-references this database. Fake or self-awarded certifications are flagged immediately.
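The lookup itself is simple string matching against that table. A minimal in-memory sketch (a dict standing in for the database, with a tiny subset of entries):

```python
# Stand-in for the certifications table: name -> is_legitimate
CERTIFICATIONS = {
    "green approved": False,
    "eco safe": False,
    "b corporation": True,
    "iso 14001": True,
}

def flag_fake_certifications(text):
    """Return certification mentions that are known fakes."""
    text = text.lower()
    return [name for name, legit in CERTIFICATIONS.items()
            if name in text and not legit]

flags = flag_fake_certifications("Our packaging is Green Approved and ISO 14001 certified.")
# flags == ["green approved"]
```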

Stage 4: AI-Powered Deep Analysis

For nuanced cases that rule-based patterns miss, we use an LLM as a second opinion:

import json

def ai_deep_analysis(claims, company_context):
    prompt = f"""Analyze these environmental claims for potential greenwashing.

    Company: {company_context}
    Claims: {json.dumps(claims)}

    For each claim, evaluate:
    1. Specificity (1-10): How specific and measurable is this claim?
    2. Verifiability (1-10): Could a third party verify this?
    3. Materiality (1-10): Does this address the company's actual environmental impact?
    4. Red flags: Any classic greenwashing patterns?

    Return structured JSON."""

    return call_ai_api(prompt)

The AI layer catches the subtle patterns that rules miss: a fast-fashion brand, say, highlighting its recycled hangers while ignoring the environmental impact of producing millions of garments.
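The model's "structured JSON" still has to be treated as untrusted input. A defensive parse might look like this (the field names mirror the prompt but are our assumption, and `parse_ai_verdict` is a hypothetical helper):

```python
import json

def parse_ai_verdict(raw):
    """Parse the LLM response, clamping scores to 1-10 and tolerating junk."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # fall back to rule-based scores only

    def clamp(v):
        return max(1, min(10, int(v)))

    return {
        "specificity": clamp(data.get("specificity", 1)),
        "verifiability": clamp(data.get("verifiability", 1)),
        "materiality": clamp(data.get("materiality", 1)),
        "red_flags": list(data.get("red_flags", [])),
    }

verdict = parse_ai_verdict('{"specificity": 2, "verifiability": 14, "red_flags": ["vague"]}')
# out-of-range 14 is clamped to 10; missing "materiality" defaults to 1
```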

The Scoring System

We output a "Greenwashing Risk Score" from 0 to 100:

  • 0-25: Low risk. Claims are specific, verified, and material.
  • 26-50: Moderate risk. Some vague claims but generally substantiated.
  • 51-75: High risk. Multiple vague or unsubstantiated claims.
  • 76-100: Very high risk. Classic greenwashing patterns detected.

The score is a weighted combination of all four analysis stages. We deliberately avoid a binary "greenwashing yes/no" label because the reality is always a spectrum.
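As an illustration of that combination (the weights here are invented; the real values are tuned internally), the score is a weighted average of per-stage risk scaled to 0-100:

```python
# Hypothetical stage weights; real values are tuned on labeled examples.
WEIGHTS = {"vagueness": 0.35, "evidence": 0.25, "certification": 0.2, "ai": 0.2}

def risk_score(stage_scores):
    """Combine per-stage risk (each in [0, 1]) into a 0-100 score."""
    combined = sum(WEIGHTS[s] * stage_scores[s] for s in WEIGHTS)
    return round(100 * combined)

score = risk_score({"vagueness": 0.8, "evidence": 0.6, "certification": 0.0, "ai": 0.5})
# 0.35*0.8 + 0.25*0.6 + 0.2*0.0 + 0.2*0.5 = 0.53 -> 53 ("high risk")
```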

Technical Challenges We Solved

Multi-language support: Greenwashing is not an English-only problem. We support French, German, Spanish, and Italian claim detection. Each language has its own set of vague qualifiers and green buzzwords.
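In practice that means a keyword table per language rather than one global list. A sketch with a few illustrative entries (the real lists are much longer):

```python
# Per-language buzzword tables (illustrative subset)
GREEN_KEYWORDS_BY_LANG = {
    "en": ["sustainable", "carbon neutral", "eco-friendly"],
    "fr": ["durable", "neutre en carbone", "écologique"],
    "de": ["nachhaltig", "klimaneutral", "umweltfreundlich"],
    "es": ["sostenible", "carbono neutral", "ecológico"],
    "it": ["sostenibile", "a impatto zero", "ecologico"],
}

def keywords_for(lang):
    # Fall back to English for unsupported languages
    return GREEN_KEYWORDS_BY_LANG.get(lang, GREEN_KEYWORDS_BY_LANG["en"])
```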

PDF parsing: Many sustainability reports are published as PDFs. We use a combination of pdftotext and custom layout parsing to extract meaningful text from formatted reports.
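After pdftotext, the raw text still carries layout noise before claim extraction can run. A simplified version of the cleanup pass (dehyphenation and re-joining wrapped lines) looks like:

```python
import re

def clean_pdf_text(raw):
    """Undo common pdftotext artifacts before claim extraction."""
    # Re-join words hyphenated across line breaks: "sustain-\nable" -> "sustainable"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse single newlines (wrapped lines) but keep paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squash runs of spaces left over from columnar layouts
    return re.sub(r"[ \t]{2,}", " ", text).strip()

cleaned = clean_pdf_text("Our sustain-\nable supply\nchain cut waste.")
# cleaned == "Our sustainable supply chain cut waste."
```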

Rate limiting the AI layer: To keep costs manageable, we cache AI analyses per domain and only re-analyze when the page content changes significantly (when similarity to the cached version, measured via content hashing, drops below 85%).
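A sketch of that cache policy, using difflib as a stand-in for the real similarity hash (`analyze` here is whatever callable wraps the AI layer):

```python
import difflib

CACHE = {}  # domain -> (page_text, analysis)

def cached_analysis(domain, page_text, analyze):
    """Reuse a cached AI analysis unless the page changed significantly."""
    if domain in CACHE:
        old_text, old_result = CACHE[domain]
        similarity = difflib.SequenceMatcher(None, old_text, page_text).ratio()
        if similarity >= 0.85:
            return old_result  # close enough: skip the AI call
    result = analyze(page_text)
    CACHE[domain] = (page_text, result)
    return result
```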

Accuracy and Limitations

We are transparent about what the tool can and cannot do. It excels at:

  • Detecting vague, unsubstantiated claims
  • Identifying fake certifications
  • Flagging missing evidence

It struggles with:

  • Evaluating whether specific numbers are accurate (we cannot audit a company)
  • Detecting hidden tradeoffs without industry-specific context
  • Analyzing claims in images or videos

We clearly state these limitations on every analysis result page at greenwashing-checker.com.

Why Open Methodology Matters

We publish our scoring methodology in full. If a company disagrees with their score, they can see exactly which claims triggered which flags. This transparency has led to several companies actually improving their sustainability communications after seeing their analysis — which might be the best outcome we could hope for.


Analyze any company's sustainability page at greenwashing-checker.com — the methodology is fully transparent.
