agenthustler
Building a Hate Speech Dataset with Responsible Web Scraping

Why Build Hate Speech Datasets?

AI moderation models are only as good as their training data. Researchers and companies building content moderation systems need labeled datasets of harmful content. Building these datasets responsibly requires careful ethical consideration and technical skill.

Ethical Framework First

Before writing any code, establish guidelines:

  • Purpose limitation — data used only for building detection models
  • Minimization — collect only what is needed for training
  • No amplification — never republish or redistribute raw hate speech
  • IRB approval — get institutional review board clearance for academic work
  • Secure storage — encrypt datasets, limit access

Architecture

Scraper -> Anonymizer -> Labeler -> Encrypted Storage

Setup

pip install requests beautifulsoup4 pandas cryptography

For accessing forums at scale, ScraperAPI handles proxy rotation and rate limiting.
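
A proxy service gets you past rate limits, but responsible collection still honors a site's robots.txt. A minimal pre-flight check with the standard library is sketched below; the function name and user-agent string are illustrative, and in practice you would fetch the robots.txt body once per site and cache it.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="research-scraper"):
    """Parse an already-fetched robots.txt body and check one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example: a forum that disallows its private-messages area
rules = """User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "https://forum.example/private/inbox"))  # False
print(allowed_by_robots(rules, "https://forum.example/forum/page1"))    # True
```

Skipping a disallowed path costs you a few samples; ignoring it can cost you the dataset's legitimacy.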

The Responsible Scraper

import requests
from bs4 import BeautifulSoup
import pandas as pd
import hashlib
from datetime import datetime

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_forum_posts(forum_url, max_pages=5):
    posts = []
    for page in range(1, max_pages + 1):
        # Let requests percent-encode the target URL, so its own
        # query string survives the trip through the proxy endpoint
        response = requests.get(
            "http://api.scraperapi.com",
            params={"api_key": SCRAPER_API_KEY, "url": f"{forum_url}?page={page}"},
            timeout=60,
        )
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        for post in soup.select(".post, .comment, .message"):
            content = post.select_one(".content, .body")
            if content:
                posts.append({
                    "text": content.text.strip(),
                    "source_hash": hashlib.sha256(forum_url.encode()).hexdigest()[:12],
                    "scraped_at": datetime.now().isoformat()
                })
    return posts

Anonymization Pipeline

Critical step: remove all personally identifiable information before labeling.

import re

def anonymize_text(text):
    # Redact emails before @-handles; otherwise the handle pattern
    # eats the "@domain" part of every address
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"@\w+", "[USER]", text)
    text = re.sub(r"https?://\S+", "[URL]", text)
    # Remove phone numbers
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE]", text)
    # Crude name redaction: any "Capitalized Capitalized" pair.
    # A proper NER model catches far more, but this is a baseline.
    text = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)
    return text

def process_posts(raw_posts):
    processed = []
    for post in raw_posts:
        processed.append({
            "text": anonymize_text(post["text"]),
            "source_hash": post["source_hash"],
            "scraped_at": post["scraped_at"],
            "anonymized": True
        })
    return processed
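A quick way to sanity-check a regex pipeline like this is to run it against a synthetic post with known PII. The snippet below is a self-contained copy of the substitutions (with emails redacted before @-handles, since the handle pattern would otherwise consume the "@domain" half of an address):

```python
import re

def anonymize_text(text):
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"@\w+", "[USER]", text)
    text = re.sub(r"https?://\S+", "[URL]", text)
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE]", text)
    text = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)
    return text

sample = "reach John Smith at john@mail.com or 555-123-4567, see https://example.com"
print(anonymize_text(sample))
# reach [NAME] at [EMAIL] or [PHONE], see [URL]
```

Keep a small suite of cases like this under version control; every time a PII pattern slips through to an annotator, add it as a new test case.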

Labeling Framework

LABEL_SCHEMA = {
    0: "clean",
    1: "offensive_language",
    2: "hate_speech",
    3: "threat"
}

def create_labeling_batch(posts, batch_size=50):
    df = pd.DataFrame(posts)
    df["label"] = None
    df["annotator"] = None
    df["confidence"] = None

    batches = [df[i:i+batch_size] for i in range(0, len(df), batch_size)]

    for i, batch in enumerate(batches):
        batch.to_csv(f"batch_{i:03d}.csv", index=False)
        print(f"Created batch_{i:03d}.csv with {len(batch)} samples")

    return len(batches)
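Once annotators return their CSVs, the batches have to be merged and disagreements resolved. One common policy is majority vote, with ties escalated to an adjudicator. A sketch, assuming each row carries the `text`, `label`, and `annotator` columns from the batch files above (the function itself is illustrative, not part of any library):

```python
import pandas as pd

def resolve_labels(annotations):
    """Majority vote per sample; ties are flagged for adjudication."""
    resolved = []
    for text, group in annotations.groupby("text", sort=False):
        counts = group["label"].value_counts()
        # A strict majority exists when the top label beats the runner-up
        if len(counts) == 1 or counts.iloc[0] > counts.iloc[1]:
            resolved.append({"text": text, "label": counts.index[0],
                             "needs_review": False})
        else:
            resolved.append({"text": text, "label": None,
                             "needs_review": True})
    return pd.DataFrame(resolved)

merged = pd.DataFrame({
    "text":      ["a", "a", "a", "b", "b"],
    "label":     [2, 2, 1, 0, 1],
    "annotator": ["x", "y", "z", "x", "y"],
})
print(resolve_labels(merged))
```

Here sample "a" resolves to label 2 (two votes against one), while "b" is a 1–1 tie and gets flagged for review.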

Encrypted Storage

from cryptography.fernet import Fernet
import json

def encrypt_dataset(data, key_file="dataset.key"):
    key = Fernet.generate_key()
    with open(key_file, "wb") as f:
        f.write(key)

    cipher = Fernet(key)
    json_data = json.dumps(data).encode()
    encrypted = cipher.encrypt(json_data)

    with open("dataset.enc", "wb") as f:
        f.write(encrypted)

    print(f"Dataset encrypted. Key saved to {key_file}")
    print("Store the key separately from the dataset!")

def decrypt_dataset(key_file="dataset.key"):
    with open(key_file, "rb") as f:
        key = f.read()
    cipher = Fernet(key)
    with open("dataset.enc", "rb") as f:
        encrypted = f.read()
    return json.loads(cipher.decrypt(encrypted))
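A round-trip check confirms nothing is lost between encryption and decryption. The in-memory sketch below skips the files for brevity; in production, the key belongs on a different machine (or in a KMS) than the ciphertext:

```python
import json
from cryptography.fernet import Fernet

# Generate a key and encrypt a toy record set in memory
key = Fernet.generate_key()
records = [{"text": "[USER] posted [URL]", "label": 1}]
token = Fernet(key).encrypt(json.dumps(records).encode())

# Decrypting with the same key restores the exact records;
# a wrong key raises cryptography.fernet.InvalidToken
restored = json.loads(Fernet(key).decrypt(token))
print(restored == records)  # True
```

Fernet tokens are also authenticated, so any tampering with `dataset.enc` surfaces as an `InvalidToken` error rather than silently corrupted data.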

Quality Metrics

def calculate_inter_annotator_agreement(labels_a, labels_b):
    agreement = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreement / len(labels_a)

def dataset_statistics(df):
    print(f"Total samples: {len(df)}")
    print("Label distribution:")
    for code, name in LABEL_SCHEMA.items():
        count = (df["label"] == code).sum()
        print(f"  {name}: {count} ({count/len(df)*100:.1f}%)")
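Raw percent agreement, as computed above, overstates reliability when one label dominates: two annotators who both mark almost everything "clean" agree often by chance alone. Cohen's kappa corrects for this. A self-contained sketch for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)

print(cohens_kappa([0, 0, 1, 2], [0, 0, 1, 1]))  # 0.6
```

In this example raw agreement is 0.75, but kappa drops to 0.6 once chance is factored out. For more than two annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations.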

Handling Protected Sources

Some forums require authentication or have aggressive bot detection. Use ThorData residential proxies for access, and ScrapeOps to monitor success rates.

Best Practices

  1. Never scrape private conversations or DMs
  2. Get ethics board approval for academic research
  3. Use content warnings when sharing results
  4. Balance your dataset to avoid bias
  5. Document your methodology thoroughly
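
On point 4: hate speech is rare in the wild, so a naive scrape yields a heavily skewed label distribution. The simplest remedy is downsampling every class to the size of the rarest one; class weights at training time or targeted sampling of underrepresented classes are alternatives. A sketch (the helper name and seed are illustrative):

```python
import pandas as pd

def balance_by_downsampling(df, label_col="label", seed=42):
    """Downsample every class to the size of the rarest one."""
    min_count = df[label_col].value_counts().min()
    return (
        df.groupby(label_col)
          .sample(n=min_count, random_state=seed)
          .reset_index(drop=True)
    )

skewed = pd.DataFrame({"label": [0] * 8 + [1] * 3 + [2] * 3,
                       "text": ["sample"] * 14})
balanced = balance_by_downsampling(skewed)
print(balanced["label"].value_counts())  # 3 samples per class
```

Downsampling throws data away, so record the discarded sample counts in your methodology notes; reviewers will ask.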

Conclusion

Building hate speech datasets is essential for AI safety but must be done responsibly. With tools like ScraperAPI, proper anonymization, and encrypted storage, you can create valuable training data while respecting privacy and ethics.
