DEV Community

agenthustler


Scraping Dark Web Data: Tor Hidden Services with Python

Web scraping isn't limited to the surface web. Researchers, journalists, and security analysts often need to collect data from Tor hidden services for threat intelligence and academic research.

Setting Up Tor with Python

pip install requests[socks] stem beautifulsoup4
sudo apt install tor  # Ubuntu/Debian

Connecting Through Tor

import requests
from stem import Signal
from stem.control import Controller

def get_tor_session():
    """Return a requests session that routes all traffic through Tor."""
    session = requests.Session()
    # socks5h (note the 'h') resolves DNS through the proxy too, which is
    # required for .onion addresses and prevents DNS leaks.
    session.proxies = {
        'http': 'socks5h://127.0.0.1:9050',
        'https': 'socks5h://127.0.0.1:9050'
    }
    return session

def renew_tor_identity():
    """Ask Tor for a new circuit (and thus a new exit identity)."""
    # Requires ControlPort 9051 enabled in torrc; with no password set,
    # authenticate() falls back to cookie authentication.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)

session = get_tor_session()
response = session.get('https://check.torproject.org/api/ip')
print(response.json())
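Note that renew_tor_identity() talks to Tor's control port, which is disabled by default. A minimal sketch of the torrc lines that enable it (assuming the stock Ubuntu/Debian package, where the file lives at /etc/tor/torrc):

```
ControlPort 9051
CookieAuthentication 1
```

Restart Tor afterwards (e.g. sudo systemctl restart tor) and make sure your user can read the auth cookie, or authenticate() will fail.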

Scraping an Onion Site

from bs4 import BeautifulSoup

def scrape_onion(url, session):
    """Fetch a page over Tor and extract its title, text, and links."""
    try:
        # Onion services are slow; a generous timeout avoids spurious failures.
        resp = session.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        return {
            'title': soup.title.string if soup.title else 'N/A',
            'paragraphs': [p.get_text(strip=True) for p in soup.find_all('p')],
            'links': [a['href'] for a in soup.find_all('a', href=True)]
        }
    except Exception as e:
        print(f"Error: {e}")
        return None
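The links list returned above holds raw href values, many of which will be relative paths. A small stdlib helper can normalize them before you queue follow-up requests (a sketch; resolve_links is an assumed name, not part of the code above):

```python
from urllib.parse import urljoin, urlparse

def resolve_links(base_url, hrefs):
    """Resolve relative hrefs against the page URL, dropping non-HTTP schemes."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Skip mailto:, javascript:, etc. -- only keep crawlable URLs.
        if urlparse(absolute).scheme in ('http', 'https'):
            resolved.append(absolute)
    return resolved

# resolve_links('http://example.onion/dir/', ['page.html', 'mailto:a@b'])
# → ['http://example.onion/dir/page.html']
```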

Rotating Identities

import time

def scrape_with_rotation(urls, delay=10):
    session = get_tor_session()
    results = []
    for i, url in enumerate(urls):
        # Request a fresh Tor circuit every 5 URLs, then rebuild the session
        # so pooled connections don't keep reusing the old circuit.
        if i > 0 and i % 5 == 0:
            renew_tor_identity()
            time.sleep(5)  # give Tor time to establish the new circuit
            session = get_tor_session()
        result = scrape_onion(url, session)
        if result:
            results.append(result)
        time.sleep(delay)
    return results

Storing Results

import json, hashlib
from datetime import datetime, timezone

def save_results(results, filename='dark_web_data.json'):
    for r in results:
        # Timestamp each record, then fingerprint it for later integrity checks.
        # datetime.utcnow() is deprecated; use an aware UTC timestamp instead.
        r['scraped_at'] = datetime.now(timezone.utc).isoformat()
        r['hash'] = hashlib.sha256(json.dumps(r).encode()).hexdigest()
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
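Because each record carries a SHA-256 digest, a companion check (a sketch; verify_record is an assumed helper) can later detect tampering by recomputing the digest over everything except the hash field, relying on dicts preserving insertion order just as save_results did:

```python
import json, hashlib

def verify_record(record):
    """Return True if the record's stored hash matches its current contents."""
    # Rebuild the dict without 'hash'; insertion order of the remaining
    # keys is unchanged, so json.dumps reproduces the original serialization.
    body = {k: v for k, v in record.items() if k != 'hash'}
    expected = hashlib.sha256(json.dumps(body).encode()).hexdigest()
    return expected == record.get('hash')
```

Run it when reloading the JSON file to confirm nothing was modified after collection.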

Ethical Considerations

  • Only scrape publicly accessible content
  • Respect robots.txt even on onion sites
  • Never interact with illegal marketplaces
  • Check legal requirements in your jurisdiction
  • Store data securely
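For the robots.txt point above, the stdlib's urllib.robotparser can do the checking. Since it can't fetch through a SOCKS proxy itself, a sketch like this (the is_allowed helper is an assumption) parses a robots.txt body you've already fetched through the Tor session:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent='*'):
    """Check a URL against an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Grab the body first with something like session.get(base_url + '/robots.txt', timeout=30).text, then gate each URL before calling scrape_onion().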

Scaling Your Research

For surface-web scraping that complements dark web research, ScraperAPI handles proxy rotation automatically. ThorData offers residential proxies for geo-targeted collection, and ScrapeOps monitors your pipelines.

Conclusion

Dark web scraping with Python is technically straightforward but requires careful ethical consideration. Always ensure your research serves a legitimate purpose and handle data responsibly.
