DEV Community

agenthustler

How to Build a Disinformation Tracker with Web Scraping

Disinformation spreads faster than fact-checks can keep up. Researchers, journalists, and analysts need automated tools to monitor narratives across the web. In this guide, we'll build a Python-based disinformation tracker that scrapes news sources, detects recurring false claims, and flags coordinated behavior patterns.

Why Track Disinformation Programmatically?

Manual monitoring doesn't scale. A single false claim can appear on hundreds of sites within hours. By scraping multiple sources simultaneously, you can detect patterns that human analysts would miss — like identical phrasing across unrelated domains or synchronized posting times.

Architecture Overview

Our tracker has three components:

  1. Multi-source scraper — collects articles from news sites and aggregators
  2. Similarity engine — detects duplicate/near-duplicate content across sources
  3. Timeline analyzer — identifies coordinated posting patterns

Setting Up the Scraper

First, install dependencies:

pip install requests beautifulsoup4 scikit-learn pandas

We need a proxy service to avoid getting blocked when scraping multiple news sources. ScraperAPI handles rotating proxies and CAPTCHA solving automatically.

import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
import hashlib

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_article(url):
    """Fetch a page through the proxy API and extract its paragraph text."""
    payload = {
        "api_key": SCRAPER_API_KEY,
        "url": url,
        "render": "false"  # skip JS rendering for plain news pages
    }
    response = requests.get(
        "http://api.scraperapi.com",
        params=payload,
        timeout=30
    )
    response.raise_for_status()  # fail loudly on blocks or quota errors
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = soup.find_all("p")
    text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return {
        "url": url,
        "text": text,
        "scraped_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "content_hash": hashlib.sha256(text.encode()).hexdigest()
    }
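Before running the heavier similarity comparison, the content_hash field lets you drop byte-identical copies cheaply. A minimal sketch (the sample article dicts are illustrative):

```python
import hashlib

def dedupe_exact(articles):
    """Keep only the first article seen for each content hash."""
    seen = set()
    unique = []
    for article in articles:
        if article["content_hash"] not in seen:
            seen.add(article["content_hash"])
            unique.append(article)
    return unique

# Two byte-identical articles on different domains collapse to one record
articles = [
    {"url": "https://site-a.example/story",
     "content_hash": hashlib.sha256(b"same text").hexdigest()},
    {"url": "https://site-b.example/story",
     "content_hash": hashlib.sha256(b"same text").hexdigest()},
]
print(len(dedupe_exact(articles)))  # 1
```

Exact-hash deduplication only catches verbatim copies; reworded or lightly edited reposts need the similarity engine below.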

Detecting Near-Duplicate Content

The key to tracking disinformation is finding articles that say essentially the same thing but appear on different sites. We use TF-IDF vectorization and cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime

def find_duplicates(articles, threshold=0.85):
    """Pairwise-compare article texts and return highly similar pairs."""
    texts = [a["text"] for a in articles]
    vectorizer = TfidfVectorizer(
        max_features=5000,
        stop_words="english",
        ngram_range=(1, 3)  # phrases up to trigrams catch copied wording
    )
    tfidf_matrix = vectorizer.fit_transform(texts)
    sim_matrix = cosine_similarity(tfidf_matrix)

    clusters = []
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if sim_matrix[i, j] >= threshold:
                clusters.append({
                    "article_a": articles[i]["url"],
                    "article_b": articles[j]["url"],
                    "similarity": float(sim_matrix[i, j]),
                    "time_delta": abs(
                        datetime.fromisoformat(articles[i]["scraped_at"])
                        - datetime.fromisoformat(articles[j]["scraped_at"])
                    ).total_seconds()
                })
    return sorted(clusters, key=lambda x: x["similarity"], reverse=True)
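To get a feel for the 0.85 threshold, here is a standalone check of the TF-IDF/cosine step on three short synthetic texts (made-up examples; real articles are longer and generally score lower):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Officials confirmed the reservoir was contaminated by factory runoff.",
    "Officials confirmed the reservoir was contaminated by factory runoff yesterday.",
    "The city council approved a new budget for park renovations.",
]
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
sim = cosine_similarity(vectorizer.fit_transform(texts))

print(round(sim[0, 1], 2))  # near-duplicate pair scores high
print(round(sim[0, 2], 2))  # unrelated pair scores near zero
```

The one-word difference between the first two texts still leaves most n-grams shared, so the pair clears the threshold, while the unrelated third text shares no terms at all.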

Coordination Detection

When multiple sites publish near-identical content within a short window, it suggests coordination rather than organic spread. (The scrape timestamp serves as a proxy for publication time here; parsing each article's published date from its markup is more accurate when available.)

def detect_coordination(clusters, time_window_hours=4):
    """Flag similar pairs published within a suspiciously short window."""
    coordinated = []
    window_seconds = time_window_hours * 3600
    for cluster in clusters:
        if cluster["time_delta"] < window_seconds:
            coordinated.append({
                **cluster,
                "flag": "COORDINATED",
                # Confidence rises with similarity and shrinks as the
                # publishing gap approaches the window boundary
                "confidence": min(
                    cluster["similarity"] * (1 - cluster["time_delta"] / window_seconds),
                    1.0
                )
            })
    return coordinated
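To see how the confidence score behaves, here is a small standalone check of that formula with illustrative numbers:

```python
def coordination_confidence(similarity, time_delta_s, window_s=4 * 3600):
    # Same scoring as detect_coordination: weight textual similarity
    # by how early in the coordination window the pair appeared.
    return min(similarity * (1 - time_delta_s / window_s), 1.0)

# Two near-identical articles (similarity 0.92) published one hour apart
# inside a four-hour window: 0.92 * (1 - 1/4) = 0.69
print(coordination_confidence(0.92, 3600))
```

A pair published at the very edge of the window scores near zero even at high similarity, which keeps slow organic reposts from being flagged.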

Scaling with Proxy Infrastructure

When tracking disinformation at scale, you'll scrape hundreds of domains daily. Residential proxies from ThorData provide diverse IP pools that avoid detection, while ScrapeOps offers proxy aggregation across multiple providers for cost optimization.
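If you manage your own proxy pool rather than a single API, a simple round-robin rotation spreads requests across IPs. A sketch (the endpoints below are placeholders, not real ThorData or ScrapeOps URLs):

```python
import itertools

# Placeholder proxy endpoints -- substitute your provider's credentials
PROXY_POOL = [
    "http://user:pass@proxy1.example:8000",
    "http://user:pass@proxy2.example:8000",
    "http://user:pass@proxy3.example:8000",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy():
    """Round-robin over the pool so no single IP carries all traffic."""
    p = next(proxy_cycle)
    return {"http": p, "https": p}

# Each call rotates to the next endpoint, e.g.:
# requests.get(url, proxies=next_proxy(), timeout=30)
```

Round-robin is the simplest strategy; production setups usually also retire endpoints that start returning blocks or CAPTCHAs.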

Building a Monitoring Dashboard

Store the flagged results in a SQLite database and generate daily reports. Note that the `flag` column only exists on records that went through `detect_coordination`, so that is the output we persist:

import sqlite3
import pandas as pd
from datetime import datetime, timezone

def store_results(clusters, db_path="disinfo_tracker.db"):
    conn = sqlite3.connect(db_path)
    df = pd.DataFrame(clusters)
    # Timestamp each batch so daily reports can filter by detection date
    df["detected_at"] = datetime.now(timezone.utc).isoformat(timespec="seconds")
    df.to_sql("clusters", conn, if_exists="append", index=False)
    conn.close()

def daily_report(db_path="disinfo_tracker.db"):
    conn = sqlite3.connect(db_path)
    query = '''
        SELECT flag, COUNT(*) AS count, AVG(similarity) AS avg_sim
        FROM clusters
        WHERE date(detected_at) = date('now')
        GROUP BY flag
    '''
    report = pd.read_sql(query, conn)
    conn.close()
    return report
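A quick end-to-end check of the storage layer, using a temporary database and two hand-made rows (hypothetical values mirroring the `detect_coordination` output, plus the `detected_at` column added at storage time):

```python
import os
import sqlite3
import tempfile
import pandas as pd
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat(timespec="seconds")
rows = pd.DataFrame([
    {"article_a": "https://a.example", "article_b": "https://b.example",
     "similarity": 0.91, "time_delta": 1800.0, "flag": "COORDINATED",
     "confidence": 0.80, "detected_at": now},
    {"article_a": "https://c.example", "article_b": "https://d.example",
     "similarity": 0.87, "time_delta": 3600.0, "flag": "COORDINATED",
     "confidence": 0.65, "detected_at": now},
])

db_path = os.path.join(tempfile.mkdtemp(), "disinfo_tracker.db")
conn = sqlite3.connect(db_path)
rows.to_sql("clusters", conn, if_exists="append", index=False)

report = pd.read_sql(
    "SELECT flag, COUNT(*) AS count, AVG(similarity) AS avg_sim "
    "FROM clusters WHERE date(detected_at) = date('now') GROUP BY flag",
    conn,
)
conn.close()
print(report)
```

SQLite's `date()` function understands ISO-8601 strings with a timezone suffix, which is why storing `isoformat()` timestamps as plain text is enough for date filtering here.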

Ethical Considerations

Disinformation tracking is a sensitive domain. Always respect robots.txt, rate-limit your requests, and handle data responsibly. This tool is designed for research and journalism — not for censorship or surveillance.
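A minimal politeness layer can enforce both habits before each fetch. This sketch uses the standard library's robotparser; the rules shown are illustrative, and in production you would load the live file with `set_url()` and `read()`:

```python
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse illustrative rules directly instead of fetching a live robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

def polite_fetch_allowed(url, user_agent="disinfo-tracker"):
    """Check robots.txt permission, then wait out the crawl delay."""
    if not rp.can_fetch(user_agent, url):
        return False
    delay = rp.crawl_delay(user_agent) or 1
    time.sleep(delay)  # rate-limit between requests
    return True

print(polite_fetch_allowed("https://news.example/private/post"))  # False
```

Honoring `Crawl-delay` also keeps your scraper from being the reason a small news site slows down for its actual readers.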

Next Steps

  • Add NLP-based claim extraction to identify specific false claims
  • Integrate with fact-checking APIs for automated verification
  • Build alert systems for newly detected coordinated campaigns

The complete code provides a foundation for systematic disinformation research. Scale responsibly and verify findings before publishing conclusions.
