DEV Community

agenthustler
agenthustler

Posted on

Building an Automated Fact-Checker with Web Scraping

Misinformation is everywhere. In this guide, we'll build a Python-based fact-checker that scrapes multiple sources to verify claims automatically.

How Automated Fact-Checking Works

Our fact-checker will:

  1. Parse a claim into searchable components
  2. Search multiple authoritative sources
  3. Compare findings against the claim
  4. Return a confidence score

Setting Up

pip install requests beautifulsoup4 newspaper3k
Enter fullscreen mode Exit fullscreen mode

Step 1: Query Builder

import re
from typing import List

class QueryBuilder:
    """Turn a free-text claim into a short, prioritized list of search queries."""

    # Common function words that carry no search signal.
    STOP_WORDS = {
        "the", "a", "an", "is", "are", "was", "were", "be", "been",
        "have", "has", "had", "do", "does", "did", "will", "would",
        "could", "should", "that", "this", "what", "which", "who"
    }

    def build_queries(self, claim: str) -> List[str]:
        """Return de-duplicated search queries derived from *claim*.

        Candidates, in priority order: all extracted keywords, the first
        five keywords, and the claim itself (truncated to 80 chars) as an
        exact-phrase search. Empty and duplicate candidates are dropped —
        previously a short claim produced two identical keyword queries,
        wasting a full round of source searches.
        """
        words = re.findall(r'\b[a-zA-Z]+\b', claim.lower())
        keywords = [w for w in words if w not in self.STOP_WORDS and len(w) > 2]
        candidates = [
            " ".join(keywords),
            " ".join(keywords[:5]),
            f'"{claim[:80]}"',
        ]
        seen = set()
        queries = []
        for query in candidates:
            if query and query not in seen:
                seen.add(query)
                queries.append(query)
        return queries
Enter fullscreen mode Exit fullscreen mode

Step 2: Multi-Source Scraper

import time
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

class SourceScraper:
    """Search a fixed list of fact-checking sites and scrape result snippets."""

    # Each entry: display name + search URL template with a {} query slot.
    FACT_CHECK_SOURCES = [
        {"name": "Snopes", "search": "https://www.snopes.com/?s={}"},
        {"name": "PolitiFact", "search": "https://www.politifact.com/search/?q={}"},
        {"name": "FactCheck.org", "search": "https://www.factcheck.org/?s={}"},
    ]

    def __init__(self):
        # One session so connections and headers are reused across requests.
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "FactChecker/1.0 (Research Tool)"
        })

    def search_source(self, source, query):
        """Return up to five {source, title, url} dicts from one site.

        The query is URL-encoded before substitution — raw spaces and
        quote characters (the exact-phrase query starts with '"')
        previously produced malformed search URLs. Any network, HTTP,
        or parse failure is logged and yields an empty list so one bad
        source never aborts the whole check (best-effort by design).
        """
        # quote_plus encodes spaces as '+' and escapes quotes/specials
        # so the query is safe to embed in a URL query string.
        url = source["search"].format(quote_plus(query))
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()  # surface HTTP 4xx/5xx as errors
            soup = BeautifulSoup(response.text, "html.parser")
            results = []
            for article in soup.find_all("article")[:5]:
                title_el = article.find(["h2", "h3", "h4"])
                link_el = article.find("a", href=True)
                if title_el:
                    results.append({
                        "source": source["name"],
                        "title": title_el.get_text(strip=True),
                        "url": link_el["href"] if link_el else "",
                    })
            return results
        except Exception as e:
            print(f"Error searching {source['name']}: {e}")
            return []

    def search_all_sources(self, query):
        """Run *query* against every configured source, politely rate-limited."""
        all_results = []
        for source in self.FACT_CHECK_SOURCES:
            results = self.search_source(source, query)
            all_results.extend(results)
            time.sleep(2)  # be polite: pause between sites
        return all_results
Enter fullscreen mode Exit fullscreen mode

Step 3: Evidence Analyzer

from collections import Counter

class EvidenceAnalyzer:
    """Score scraped fact-check headlines against verdict keyword lists."""

    # Keyword/phrase lists that signal each verdict in a result title.
    VERDICT_KEYWORDS = {
        "true": ["true", "correct", "confirmed", "verified", "accurate"],
        "false": ["false", "incorrect", "debunked", "fake", "misleading",
                  "pants on fire", "fabricated", "hoax"],
        "mixed": ["partly true", "half true", "mixture", "mostly",
                  "context", "out of context"],
    }

    def analyze(self, results, claim):
        """Tally verdict keywords across result titles and return a summary.

        Returns a dict with "verdict" (majority verdict or "unverified"),
        "confidence" (percentage of matching titles agreeing, 0-100),
        "sources_checked", and — when any keyword matched — a "breakdown"
        Counter snapshot. *claim* is accepted for interface stability but
        the scoring uses only the scraped titles.
        """
        verdicts = Counter()
        for result in results:
            title = result.get("title", "").lower()
            for verdict, keywords in self.VERDICT_KEYWORDS.items():
                # Word-boundary search instead of bare substring test:
                # "true" must not match "untrue", nor "fake" match "fakery".
                if any(re.search(r'\b' + re.escape(kw) + r'\b', title)
                       for kw in keywords):
                    verdicts[verdict] += 1

        total = sum(verdicts.values())
        if total == 0:
            # No keyword hits at all — report unverified, not a guess.
            return {
                "verdict": "unverified",
                "confidence": 0,
                "sources_checked": len(results)
            }

        top_verdict = verdicts.most_common(1)[0]
        confidence = top_verdict[1] / total
        return {
            "verdict": top_verdict[0],
            "confidence": round(confidence * 100, 1),
            "sources_checked": len(results),
            "breakdown": dict(verdicts)
        }
Enter fullscreen mode Exit fullscreen mode

Step 4: Putting It All Together

class FactChecker:
    """End-to-end pipeline: build queries, scrape sources, score evidence."""

    def __init__(self):
        self.query_builder = QueryBuilder()
        self.scraper = SourceScraper()
        self.analyzer = EvidenceAnalyzer()

    def check(self, claim):
        """Check *claim* against the fact-check sources.

        Returns the analyzer's verdict dict augmented with "claim" and
        up to five supporting "articles".
        """
        print(f"Checking: {claim}")
        queries = self.query_builder.build_queries(claim)

        all_results = []
        # Only the first two queries are searched to bound request volume.
        for query in queries[:2]:
            all_results.extend(self.scraper.search_all_sources(query))

        # De-duplicate by URL, but only for results that actually have
        # one. Previously the empty-string URL of link-less results was
        # added to `seen_urls`, silently discarding every later result
        # whose link could not be scraped.
        seen_urls = set()
        unique_results = []
        for r in all_results:
            url = r.get("url", "")
            if url:
                if url in seen_urls:
                    continue
                seen_urls.add(url)
            unique_results.append(r)

        verdict = self.analyzer.analyze(unique_results, claim)
        verdict["claim"] = claim
        verdict["articles"] = unique_results[:5]
        return verdict

if __name__ == "__main__":
    # Demo entry point. Guarded so importing this module does not
    # immediately fire network requests at the fact-check sites.
    checker = FactChecker()
    result = checker.check("The Great Wall of China is visible from space")
    print(f"Verdict: {result['verdict']} ({result['confidence']}% confidence)")
Enter fullscreen mode Exit fullscreen mode

Scaling the Fact-Checker

For production use, a proxy service such as ScraperAPI can handle rotation when you are checking many sources at scale; providers like ThorData offer residential IPs for news sites, and tools like ScrapeOps help you monitor scraper health.

Limitations

  • Not a replacement for human judgment
  • Only as good as the sources checked
  • Nuanced claims may be oversimplified
  • Always present confidence levels transparently

Conclusion

An automated fact-checker is a powerful screening tool. By scraping multiple authoritative sources, you can quickly assess claims at scale. Always show your sources and confidence levels — let users make the final judgment.

Top comments (0)