DEV Community

agenthustler


Scraping Dark Web Data: Tor Hidden Services with Python

Web scraping isn't limited to the surface web. Researchers, journalists, and security analysts often need to collect data from Tor hidden services for threat intelligence and academic research.

Setting Up Tor with Python

pip install requests[socks] stem beautifulsoup4
sudo apt install tor  # Ubuntu/Debian

Connecting Through Tor

import requests
from stem import Signal
from stem.control import Controller

def get_tor_session():
    """Return a requests session that routes all traffic through Tor."""
    session = requests.Session()
    # socks5h (note the 'h') resolves DNS through the proxy too, which is
    # required for .onion addresses and prevents DNS leaks.
    session.proxies = {
        'http': 'socks5h://127.0.0.1:9050',
        'https': 'socks5h://127.0.0.1:9050'
    }
    return session

def renew_tor_identity():
    """Ask Tor for a new circuit (and thus a new exit identity)."""
    # Requires ControlPort 9051 enabled in torrc; with no password set,
    # authenticate() falls back to cookie authentication.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)

session = get_tor_session()
response = session.get('https://check.torproject.org/api/ip')
print(response.json())
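Note that renew_tor_identity() talks to Tor's control port, which is disabled by default. A minimal sketch of the torrc lines that enable it (assuming the stock Ubuntu/Debian package, where the file lives at /etc/tor/torrc):

```
ControlPort 9051
CookieAuthentication 1
```

Restart Tor afterwards (e.g. sudo systemctl restart tor) and make sure your user can read the auth cookie, or authenticate() will fail.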

Scraping an Onion Site

from bs4 import BeautifulSoup

def scrape_onion(url, session):
    """Fetch a page over Tor and extract its title, text, and links."""
    try:
        # Onion services are slow; a generous timeout avoids spurious failures.
        resp = session.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        return {
            'title': soup.title.string if soup.title else 'N/A',
            'paragraphs': [p.get_text(strip=True) for p in soup.find_all('p')],
            'links': [a['href'] for a in soup.find_all('a', href=True)]
        }
    except Exception as e:
        print(f"Error: {e}")
        return None
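The links list returned above holds raw href values, many of which will be relative paths. A small stdlib helper can normalize them before you queue follow-up requests (a sketch; resolve_links is an assumed name, not part of the code above):

```python
from urllib.parse import urljoin, urlparse

def resolve_links(base_url, hrefs):
    """Resolve relative hrefs against the page URL, dropping non-HTTP schemes."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Skip mailto:, javascript:, etc. -- only keep crawlable URLs.
        if urlparse(absolute).scheme in ('http', 'https'):
            resolved.append(absolute)
    return resolved

# resolve_links('http://example.onion/dir/', ['page.html', 'mailto:a@b'])
# → ['http://example.onion/dir/page.html']
```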

Rotating Identities

import time

def scrape_with_rotation(urls, delay=10):
    session = get_tor_session()
    results = []
    for i, url in enumerate(urls):
        # Request a fresh Tor circuit every 5 URLs, then rebuild the session
        # so pooled connections don't keep reusing the old circuit.
        if i > 0 and i % 5 == 0:
            renew_tor_identity()
            time.sleep(5)  # give Tor time to establish the new circuit
            session = get_tor_session()
        result = scrape_onion(url, session)
        if result:
            results.append(result)
        time.sleep(delay)
    return results

Storing Results

import json, hashlib
from datetime import datetime, timezone

def save_results(results, filename='dark_web_data.json'):
    for r in results:
        # Timestamp each record, then fingerprint it for later integrity checks.
        # datetime.utcnow() is deprecated; use an aware UTC timestamp instead.
        r['scraped_at'] = datetime.now(timezone.utc).isoformat()
        r['hash'] = hashlib.sha256(json.dumps(r).encode()).hexdigest()
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
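Because each record carries a SHA-256 digest, a companion check (a sketch; verify_record is an assumed helper) can later detect tampering by recomputing the digest over everything except the hash field, relying on dicts preserving insertion order just as save_results did:

```python
import json, hashlib

def verify_record(record):
    """Return True if the record's stored hash matches its current contents."""
    # Rebuild the dict without 'hash'; insertion order of the remaining
    # keys is unchanged, so json.dumps reproduces the original serialization.
    body = {k: v for k, v in record.items() if k != 'hash'}
    expected = hashlib.sha256(json.dumps(body).encode()).hexdigest()
    return expected == record.get('hash')
```

Run it when reloading the JSON file to confirm nothing was modified after collection.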

Ethical Considerations

  • Only scrape publicly accessible content
  • Respect robots.txt even on onion sites
  • Never interact with illegal marketplaces
  • Check legal requirements in your jurisdiction
  • Store data securely
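For the robots.txt point above, the stdlib's urllib.robotparser can do the checking. Since it can't fetch through a SOCKS proxy itself, a sketch like this (the is_allowed helper is an assumption) parses a robots.txt body you've already fetched through the Tor session:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent='*'):
    """Check a URL against an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Grab the body first with something like session.get(base_url + '/robots.txt', timeout=30).text, then gate each URL before calling scrape_onion().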

Scaling Your Research

For surface-web scraping that complements dark web research, ScraperAPI handles proxy rotation automatically. ThorData offers residential proxies for geo-targeted collection, and ScrapeOps monitors your pipelines.

Conclusion

Dark web scraping with Python is technically straightforward but requires careful ethical consideration. Always ensure your research serves a legitimate purpose and handle data responsibly.
