Web scraping isn't limited to the surface web. Researchers, journalists, and security analysts often need to collect data from Tor hidden services for threat intelligence and academic research.
## Setting Up Tor with Python

Install the Python dependencies and the Tor daemon:

```bash
pip install requests[socks] stem beautifulsoup4
sudo apt install tor  # Ubuntu/Debian
```

For `stem` to send control commands (such as identity renewal, used below), enable the control port in `/etc/tor/torrc` by uncommenting `ControlPort 9051` and `CookieAuthentication 1`, then restart the `tor` service.
## Connecting Through Tor

Route all requests through Tor's SOCKS proxy on port 9050. The `socks5h` scheme matters: it makes DNS resolution happen inside Tor, so `.onion` hostnames resolve and no DNS queries leak to your local resolver.

```python
import requests
from stem import Signal
from stem.control import Controller

def get_tor_session():
    """Return a requests session routed through the local Tor SOCKS proxy."""
    session = requests.Session()
    session.proxies = {
        'http': 'socks5h://127.0.0.1:9050',   # socks5h: resolve DNS through Tor
        'https': 'socks5h://127.0.0.1:9050'
    }
    return session

def renew_tor_identity():
    """Signal Tor to build new circuits via the control port."""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()  # cookie or password auth, per your torrc
        controller.signal(Signal.NEWNYM)

session = get_tor_session()
response = session.get('https://check.torproject.org/api/ip')
print(response.json())
```

If the routing works, the response reports that you are connecting through Tor along with your exit node's IP.
## Scraping an Onion Site

```python
from bs4 import BeautifulSoup

def scrape_onion(url, session):
    """Fetch a page through the Tor session and extract basic content."""
    try:
        # Onion services are slow; allow a generous timeout
        resp = session.get(url, timeout=30)
        resp.raise_for_status()  # don't parse error pages as content
        soup = BeautifulSoup(resp.text, 'html.parser')
        return {
            'title': soup.title.string if soup.title else 'N/A',
            'paragraphs': [p.get_text(strip=True) for p in soup.find_all('p')],
            'links': [a['href'] for a in soup.find_all('a', href=True)]
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
```
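The `links` extracted above are often relative paths or point off-site, so it helps to validate candidates before queuing them for scraping. Version 3 onion services use a 56-character base32 hostname; a minimal validator sketch (the helper name `is_valid_onion_url` is my own):

```python
import re
from urllib.parse import urlparse

# v3 onion hostnames are exactly 56 base32 characters (a-z, 2-7) plus ".onion"
ONION_V3_RE = re.compile(r'^[a-z2-7]{56}\.onion$')

def is_valid_onion_url(url):
    """Return True if the URL points at a syntactically valid v3 onion host."""
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        return False
    return bool(ONION_V3_RE.match(parsed.hostname or ''))
```

Filtering a crawl frontier is then a one-liner: `urls = [u for u in urls if is_valid_onion_url(u)]`.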
## Rotating Identities

```python
import time

def scrape_with_rotation(urls, delay=10):
    """Scrape a list of URLs, requesting a new Tor circuit every 5 pages."""
    session = get_tor_session()
    results = []
    for i, url in enumerate(urls):
        if i > 0 and i % 5 == 0:
            renew_tor_identity()
            time.sleep(5)  # give Tor time to build fresh circuits
            session = get_tor_session()
        result = scrape_onion(url, session)
        if result:
            results.append(result)
        time.sleep(delay)  # be polite: onion services have limited capacity
    return results
```

Note that Tor rate-limits `NEWNYM` signals to roughly one every ten seconds, so don't rotate more aggressively than this.
## Storing Results

```python
import json, hashlib
from datetime import datetime, timezone

def save_results(results, filename='dark_web_data.json'):
    """Timestamp and hash each record, then write everything to JSON."""
    for r in results:
        r['scraped_at'] = datetime.now(timezone.utc).isoformat()
        # sort_keys makes the digest deterministic; the digest covers the
        # record before its own 'hash' field is added
        r['hash'] = hashlib.sha256(
            json.dumps(r, sort_keys=True).encode()
        ).hexdigest()
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
```
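A stored hash is only useful if it can be rechecked later. A sketch of an integrity pass, assuming each record's digest was computed with `json.dumps(..., sort_keys=True)` over the record minus its own `hash` field (the helper names are mine):

```python
import json, hashlib

def record_hash(record):
    """Deterministic SHA-256 of a record, excluding its own 'hash' field."""
    payload = {k: v for k, v in record.items() if k != 'hash'}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def find_tampered(filename):
    """Return records whose stored hash no longer matches their content."""
    with open(filename) as f:
        results = json.load(f)
    return [r for r in results if r.get('hash') != record_hash(r)]
```

Running `find_tampered` over an archive gives you a quick chain-of-custody check before citing the data in a report.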
## Ethical Considerations

- Only scrape publicly accessible content
- Respect robots.txt even on onion sites
- Never interact with illegal marketplaces
- Check legal requirements in your jurisdiction
- Store data securely
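The robots.txt point is straightforward to automate. The standard library's `RobotFileParser` can't fetch through a SOCKS proxy itself, but it can parse text you retrieved with the Tor session. A sketch, with helper names of my own:

```python
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def robots_allows(robots_text, url, agent='research-bot'):
    """Check a URL against the text of a robots.txt file."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp.can_fetch(agent, url)

def allowed_by_robots(session, base_url, path, agent='research-bot'):
    """Fetch robots.txt through the (Tor) session before scraping a path."""
    try:
        resp = session.get(urljoin(base_url, '/robots.txt'), timeout=30)
        if resp.status_code >= 400:
            return True  # no robots.txt published: treat as allowed
        return robots_allows(resp.text, urljoin(base_url, path), agent)
    except Exception:
        return True  # unreachable robots.txt: fall back to allowed
```

Call `allowed_by_robots(session, base_url, path)` before each `scrape_onion` call and skip disallowed paths.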
## Scaling Your Research
For surface-web scraping that complements dark web research, ScraperAPI handles proxy rotation automatically. ThorData offers residential proxies for geo-targeted collection, and ScrapeOps monitors your pipelines.
## Conclusion
Dark web scraping with Python is technically straightforward but requires careful ethical consideration. Always ensure your research serves a legitimate purpose and handle data responsibly.