How to Build a Disinformation Tracker with Web Scraping
Disinformation spreads faster than fact-checks can keep up. Researchers, journalists, and analysts need automated tools to monitor narratives across the web. In this guide, we'll build a Python-based disinformation tracker that scrapes news sources, detects recurring false claims, and flags coordinated behavior patterns.
Why Track Disinformation Programmatically?
Manual monitoring doesn't scale. A single false claim can appear on hundreds of sites within hours. By scraping multiple sources simultaneously, you can detect patterns that human analysts would miss — like identical phrasing across unrelated domains or synchronized posting times.
Architecture Overview
Our tracker has three components:
- Multi-source scraper — collects articles from news sites and aggregators
- Similarity engine — detects duplicate/near-duplicate content across sources
- Timeline analyzer — identifies coordinated posting patterns
Setting Up the Scraper
First, install dependencies:
```shell
pip install requests beautifulsoup4 scikit-learn pandas
```
We need a proxy service to avoid getting blocked when scraping multiple news sources. ScraperAPI handles rotating proxies and CAPTCHA solving automatically.
```python
import hashlib
import requests
from bs4 import BeautifulSoup
from datetime import datetime

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_article(url):
    payload = {
        "api_key": SCRAPER_API_KEY,
        "url": url,
        "render": "false",  # set to "true" for JavaScript-heavy pages
    }
    response = requests.get(
        "http://api.scraperapi.com",
        params=payload,
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on blocks or quota errors
    soup = BeautifulSoup(response.text, "html.parser")
    # Naive extraction: concatenate all <p> tags. Swap in a dedicated
    # article extractor (e.g. trafilatura) if you need cleaner bodies.
    paragraphs = soup.find_all("p")
    text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return {
        "url": url,
        "text": text,
        "scraped_at": datetime.utcnow().isoformat(),
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
    }
```
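The content_hash field gives you cheap exact-duplicate detection before any heavier similarity scoring. A minimal standalone sketch (the article texts here are invented for illustration):

```python
import hashlib

def content_hash(text: str) -> str:
    # Same SHA-256 digest the scraper stores per article
    return hashlib.sha256(text.encode()).hexdigest()

# Two sites republishing the exact same body collapse to one hash
a = content_hash("Officials deny the claim circulating online.")
b = content_hash("Officials deny the claim circulating online.")
c = content_hash("Officials confirm the claim circulating online.")

assert a == b   # verbatim copy: exact duplicate
assert a != c   # one changed word: needs similarity scoring instead
```

Hashing only catches verbatim copies; reworded versions of the same claim fall through to the TF-IDF step below.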
Detecting Near-Duplicate Content
The key to tracking disinformation is finding articles that say essentially the same thing but appear on different sites. We use TF-IDF vectorization and cosine similarity:
```python
from datetime import datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_duplicates(articles, threshold=0.85):
    texts = [a["text"] for a in articles]
    vectorizer = TfidfVectorizer(
        max_features=5000,
        stop_words="english",
        ngram_range=(1, 3),  # phrase-level n-grams catch copied wording
    )
    tfidf_matrix = vectorizer.fit_transform(texts)
    sim_matrix = cosine_similarity(tfidf_matrix)

    # Pairwise comparison is O(n^2); fine for hundreds of articles,
    # consider approximate nearest neighbors beyond that
    clusters = []
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if sim_matrix[i, j] >= threshold:
                clusters.append({
                    "article_a": articles[i]["url"],
                    "article_b": articles[j]["url"],
                    "similarity": float(sim_matrix[i, j]),
                    # Seconds between the two scrape timestamps
                    "time_delta": abs(
                        datetime.fromisoformat(articles[i]["scraped_at"])
                        - datetime.fromisoformat(articles[j]["scraped_at"])
                    ).total_seconds(),
                })
    return sorted(clusters, key=lambda x: x["similarity"], reverse=True)
```
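To get a feel for the 0.85 threshold, here is a self-contained toy run of the same TF-IDF plus cosine-similarity approach on three invented snippets (treat the default threshold as a starting point to tune against your own corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Miracle cure suppressed by regulators, insiders say",
    "Insiders say regulators suppressed a miracle cure",
    "City council approves new bike lane budget for 2025",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
sim = cosine_similarity(tfidf)

# The reworded copy scores far above the threshold;
# the unrelated story shares no content words and scores near zero
assert sim[0, 1] > 0.85
assert sim[0, 2] < 0.3
```

Raising the threshold reduces false positives from routine wire-service copy; lowering it surfaces paraphrased variants at the cost of more manual review.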
Coordination Detection
When multiple sites publish near-identical content within a short window, it can indicate coordination rather than organic spread (note that legitimate wire-service syndication produces the same signature, so treat flags as leads for human review, not conclusions):
```python
def detect_coordination(clusters, time_window_hours=4):
    coordinated = []
    window_seconds = time_window_hours * 3600
    for cluster in clusters:
        if cluster["time_delta"] < window_seconds:
            coordinated.append({
                **cluster,
                "flag": "COORDINATED",
                # High similarity plus a short delay scores near 1.0;
                # confidence decays linearly as the delay approaches the window
                "confidence": min(
                    cluster["similarity"] * (1 - cluster["time_delta"] / window_seconds),
                    1.0,
                ),
            })
    return coordinated
```
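The confidence score blends how similar a pair is with how tightly synchronized it was. A quick check of the formula with hypothetical numbers:

```python
def coordination_confidence(similarity, time_delta_s, window_s=4 * 3600):
    # Mirrors the scoring used in detect_coordination above
    return min(similarity * (1 - time_delta_s / window_s), 1.0)

# Near-identical articles published 10 minutes apart: strong signal
strong = coordination_confidence(0.97, 600)

# Same similarity, but 3.5 hours apart: the signal decays with delay
weak = coordination_confidence(0.97, 3.5 * 3600)

assert strong > 0.9
assert weak < 0.2
```

The linear decay is a deliberate simplification; an exponential decay would penalize mid-window delays less harshly if that better matches your data.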
Scaling with Proxy Infrastructure
When tracking disinformation at scale, you'll scrape hundreds of domains daily. Residential proxies from ThorData provide diverse IP pools that avoid detection, while ScrapeOps offers proxy aggregation across multiple providers for cost optimization.
Building a Monitoring Dashboard
Store the flagged pairs from detect_coordination in a SQLite database and generate daily reports. We stamp each row with a detection date at insert time, since the cluster records themselves carry only per-article scrape timestamps:

```python
import sqlite3
import pandas as pd
from datetime import datetime

def store_results(clusters, db_path="disinfo_tracker.db"):
    conn = sqlite3.connect(db_path)
    df = pd.DataFrame(clusters)
    # Stamp rows so daily reports can filter by detection date
    df["detected_at"] = datetime.utcnow().isoformat()
    df.to_sql("clusters", conn, if_exists="append", index=False)
    conn.close()

def daily_report(db_path="disinfo_tracker.db"):
    conn = sqlite3.connect(db_path)
    query = """
        SELECT flag, COUNT(*) AS count, AVG(similarity) AS avg_sim
        FROM clusters
        WHERE date(detected_at) = date('now')
        GROUP BY flag
    """
    report = pd.read_sql(query, conn)
    conn.close()
    return report
```
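The store-and-report flow can be sanity-checked without touching a real database by running the same SQL against an in-memory SQLite connection; the flagged pairs below are invented, shaped like detect_coordination output:

```python
import sqlite3
import pandas as pd

# Hypothetical coordinated pairs (URLs are placeholders)
clusters = [
    {"article_a": "https://site-a.example/story", "article_b": "https://site-b.example/story",
     "similarity": 0.94, "time_delta": 1200.0, "flag": "COORDINATED", "confidence": 0.86},
    {"article_a": "https://site-c.example/story", "article_b": "https://site-d.example/story",
     "similarity": 0.91, "time_delta": 2400.0, "flag": "COORDINATED", "confidence": 0.76},
]

conn = sqlite3.connect(":memory:")
pd.DataFrame(clusters).to_sql("clusters", conn, if_exists="append", index=False)

# Same aggregation the daily report runs, minus the date filter
report = pd.read_sql(
    "SELECT flag, COUNT(*) AS count, AVG(similarity) AS avg_sim "
    "FROM clusters GROUP BY flag",
    conn,
)
conn.close()

assert int(report.loc[0, "count"]) == 2
```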
Ethical Considerations
Disinformation tracking is a sensitive domain. Always respect robots.txt, rate-limit your requests, and handle data responsibly. This tool is designed for research and journalism — not for censorship or surveillance.
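The robots.txt check can be automated with the standard library's urllib.robotparser before any URL enters the scrape queue; the rules and URLs below are an invented example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Public articles are fetchable; disallowed paths are skipped
assert rp.can_fetch("disinfo-tracker/1.0", "https://news.example/article/123")
assert not rp.can_fetch("disinfo-tracker/1.0", "https://news.example/private/draft")
```

In production you would load each site's live rules with rp.set_url(...) and rp.read(), and cache the parser per domain.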
Next Steps
- Add NLP-based claim extraction to identify specific false claims
- Integrate with fact-checking APIs for automated verification
- Build alert systems for newly detected coordinated campaigns
The complete code provides a foundation for systematic disinformation research. Scale responsibly and verify findings before publishing conclusions.