agenthustler
Building a Hate Speech Dataset with Responsible Web Scraping

Why Build Hate Speech Datasets?

AI moderation models are only as good as their training data. Researchers and companies building content moderation systems need labeled datasets of harmful content. Building these datasets responsibly requires careful ethical consideration and technical skill.

Ethical Framework First

Before writing any code, establish guidelines:

  • Purpose limitation — data used only for building detection models
  • Minimization — collect only what is needed for training
  • No amplification — never republish or redistribute raw hate speech
  • IRB approval — get institutional review board clearance for academic work
  • Secure storage — encrypt datasets, limit access

Architecture

Scraper -> Anonymizer -> Labeler -> Encrypted Storage

Setup

pip install requests beautifulsoup4 pandas cryptography

For accessing forums at scale, ScraperAPI handles proxy rotation and rate limiting.
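
A proxy service gets you past rate limits, but responsible collection still honors a site's robots.txt. A minimal pre-flight check with the standard library is sketched below; the function name and user-agent string are illustrative, and in practice you would fetch the robots.txt body once per site and cache it.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="research-scraper"):
    """Parse an already-fetched robots.txt body and check one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example: a forum that disallows its private-messages area
rules = """User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "https://forum.example/private/inbox"))  # False
print(allowed_by_robots(rules, "https://forum.example/forum/page1"))    # True
```

Skipping a disallowed path costs you a few samples; ignoring it can cost you the dataset's legitimacy.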

The Responsible Scraper

import requests
from bs4 import BeautifulSoup
import pandas as pd
import hashlib
from datetime import datetime

SCRAPER_API_KEY = "YOUR_KEY"

def scrape_forum_posts(forum_url, max_pages=5):
    posts = []
    for page in range(1, max_pages + 1):
        # Let requests percent-encode the target URL, so its own
        # query string survives the trip through the proxy endpoint
        response = requests.get(
            "http://api.scraperapi.com",
            params={"api_key": SCRAPER_API_KEY, "url": f"{forum_url}?page={page}"},
            timeout=60,
        )
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        for post in soup.select(".post, .comment, .message"):
            content = post.select_one(".content, .body")
            if content:
                posts.append({
                    "text": content.text.strip(),
                    "source_hash": hashlib.sha256(forum_url.encode()).hexdigest()[:12],
                    "scraped_at": datetime.now().isoformat()
                })
    return posts

Anonymization Pipeline

Critical step: remove all personally identifiable information before labeling.

import re

def anonymize_text(text):
    # Redact emails before @-handles; otherwise the handle pattern
    # eats the "@domain" part of every address
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"@\w+", "[USER]", text)
    text = re.sub(r"https?://\S+", "[URL]", text)
    # Remove phone numbers
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE]", text)
    # Crude name redaction: any "Capitalized Capitalized" pair.
    # A proper NER model catches far more, but this is a baseline.
    text = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)
    return text

def process_posts(raw_posts):
    processed = []
    for post in raw_posts:
        processed.append({
            "text": anonymize_text(post["text"]),
            "source_hash": post["source_hash"],
            "scraped_at": post["scraped_at"],
            "anonymized": True
        })
    return processed
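A quick way to sanity-check a regex pipeline like this is to run it against a synthetic post with known PII. The snippet below is a self-contained copy of the substitutions (with emails redacted before @-handles, since the handle pattern would otherwise consume the "@domain" half of an address):

```python
import re

def anonymize_text(text):
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"@\w+", "[USER]", text)
    text = re.sub(r"https?://\S+", "[URL]", text)
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE]", text)
    text = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)
    return text

sample = "reach John Smith at john@mail.com or 555-123-4567, see https://example.com"
print(anonymize_text(sample))
# reach [NAME] at [EMAIL] or [PHONE], see [URL]
```

Keep a small suite of cases like this under version control; every time a PII pattern slips through to an annotator, add it as a new test case.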

Labeling Framework

LABEL_SCHEMA = {
    0: "clean",
    1: "offensive_language",
    2: "hate_speech",
    3: "threat"
}

def create_labeling_batch(posts, batch_size=50):
    df = pd.DataFrame(posts)
    df["label"] = None
    df["annotator"] = None
    df["confidence"] = None

    batches = [df[i:i+batch_size] for i in range(0, len(df), batch_size)]

    for i, batch in enumerate(batches):
        batch.to_csv(f"batch_{i:03d}.csv", index=False)
        print(f"Created batch_{i:03d}.csv with {len(batch)} samples")

    return len(batches)
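Once annotators return their CSVs, the batches have to be merged and disagreements resolved. One common policy is majority vote, with ties escalated to an adjudicator. A sketch, assuming each row carries the `text`, `label`, and `annotator` columns from the batch files above (the function itself is illustrative, not part of any library):

```python
import pandas as pd

def resolve_labels(annotations):
    """Majority vote per sample; ties are flagged for adjudication."""
    resolved = []
    for text, group in annotations.groupby("text", sort=False):
        counts = group["label"].value_counts()
        # A strict majority exists when the top label beats the runner-up
        if len(counts) == 1 or counts.iloc[0] > counts.iloc[1]:
            resolved.append({"text": text, "label": counts.index[0],
                             "needs_review": False})
        else:
            resolved.append({"text": text, "label": None,
                             "needs_review": True})
    return pd.DataFrame(resolved)

merged = pd.DataFrame({
    "text":      ["a", "a", "a", "b", "b"],
    "label":     [2, 2, 1, 0, 1],
    "annotator": ["x", "y", "z", "x", "y"],
})
print(resolve_labels(merged))
```

Here sample "a" resolves to label 2 (two votes against one), while "b" is a 1–1 tie and gets flagged for review.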

Encrypted Storage

from cryptography.fernet import Fernet
import json

def encrypt_dataset(data, key_file="dataset.key"):
    key = Fernet.generate_key()
    with open(key_file, "wb") as f:
        f.write(key)

    cipher = Fernet(key)
    json_data = json.dumps(data).encode()
    encrypted = cipher.encrypt(json_data)

    with open("dataset.enc", "wb") as f:
        f.write(encrypted)

    print(f"Dataset encrypted. Key saved to {key_file}")
    print("Store the key separately from the dataset!")

def decrypt_dataset(key_file="dataset.key"):
    with open(key_file, "rb") as f:
        key = f.read()
    cipher = Fernet(key)
    with open("dataset.enc", "rb") as f:
        encrypted = f.read()
    return json.loads(cipher.decrypt(encrypted))
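A round-trip check confirms nothing is lost between encryption and decryption. The in-memory sketch below skips the files for brevity; in production, the key belongs on a different machine (or in a KMS) than the ciphertext:

```python
import json
from cryptography.fernet import Fernet

# Generate a key and encrypt a toy record set in memory
key = Fernet.generate_key()
records = [{"text": "[USER] posted [URL]", "label": 1}]
token = Fernet(key).encrypt(json.dumps(records).encode())

# Decrypting with the same key restores the exact records;
# a wrong key raises cryptography.fernet.InvalidToken
restored = json.loads(Fernet(key).decrypt(token))
print(restored == records)  # True
```

Fernet tokens are also authenticated, so any tampering with `dataset.enc` surfaces as an `InvalidToken` error rather than silently corrupted data.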

Quality Metrics

def calculate_inter_annotator_agreement(labels_a, labels_b):
    agreement = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreement / len(labels_a)

def dataset_statistics(df):
    print(f"Total samples: {len(df)}")
    print("Label distribution:")
    for code, name in LABEL_SCHEMA.items():
        count = (df["label"] == code).sum()
        print(f"  {name}: {count} ({count/len(df)*100:.1f}%)")
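Raw percent agreement, as computed above, overstates reliability when one label dominates: two annotators who both mark almost everything "clean" agree often by chance alone. Cohen's kappa corrects for this. A self-contained sketch for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)

print(cohens_kappa([0, 0, 1, 2], [0, 0, 1, 1]))  # 0.6
```

In this example raw agreement is 0.75, but kappa drops to 0.6 once chance is factored out. For more than two annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations.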

Handling Protected Sources

Some forums require authentication or have aggressive bot detection. Use ThorData residential proxies for access, and ScrapeOps to monitor success rates.

Best Practices

  1. Never scrape private conversations or DMs
  2. Get ethics board approval for academic research
  3. Use content warnings when sharing results
  4. Balance your dataset to avoid bias
  5. Document your methodology thoroughly
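
On point 4: hate speech is rare in the wild, so a naive scrape yields a heavily skewed label distribution. The simplest remedy is downsampling every class to the size of the rarest one; class weights at training time or targeted sampling of underrepresented classes are alternatives. A sketch (the helper name and seed are illustrative):

```python
import pandas as pd

def balance_by_downsampling(df, label_col="label", seed=42):
    """Downsample every class to the size of the rarest one."""
    min_count = df[label_col].value_counts().min()
    return (
        df.groupby(label_col)
          .sample(n=min_count, random_state=seed)
          .reset_index(drop=True)
    )

skewed = pd.DataFrame({"label": [0] * 8 + [1] * 3 + [2] * 3,
                       "text": ["sample"] * 14})
balanced = balance_by_downsampling(skewed)
print(balanced["label"].value_counts())  # 3 samples per class
```

Downsampling throws data away, so record the discarded sample counts in your methodology notes; reviewers will ask.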

Conclusion

Building hate speech datasets is essential for AI safety but must be done responsibly. With tools like ScraperAPI, proper anonymization, and encrypted storage, you can create valuable training data while respecting privacy and ethics.
