Collecting web data efficiently requires navigating modern complexities. I've found that combining several Python techniques creates robust scrapers capable of handling dynamic content while maintaining ethical standards. Here's what works reliably in production environments.
JavaScript-heavy sites often require full browser rendering. I use Playwright because it handles single-page applications effectively. Here's how I extract content after client-side rendering completes:
from playwright.sync_api import sync_playwright

def extract_dynamic_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()
        try:
            page.goto(url, timeout=60000)
            page.wait_for_selector("#dynamic-content", state="visible", timeout=15000)
            # Handle lazy-loaded elements
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)
            return page.inner_html("#content-container")
        finally:
            browser.close()

# Usage
html_content = extract_dynamic_content("https://modern-web-app.com/data-feed")
print(f"Extracted {len(html_content)} characters of rendered content")
Server defenses often block repetitive requests. Rotating headers helps significantly. I combine user agent rotation with proxy switching for better results:
from itertools import cycle

import requests
from fake_useragent import UserAgent

# Placeholder proxy endpoints; replace with real ones
proxies = cycle([
    "http://user:pass@192.168.1.1:8080",
    "http://user:pass@192.168.1.2:8080",
])

def safe_request(url):
    ua = UserAgent()
    headers = {
        "User-Agent": ua.random,
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://google.com",
    }
    proxy = next(proxies)
    try:
        response = requests.get(
            url,
            headers=headers,
            # Map both schemes; an "http"-only entry is ignored for https:// URLs
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Usage
content = safe_request("https://protected-site.com/inventory")
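When a request fails, retrying immediately tends to make blocking worse. The sketch below shows one way to retry with exponential backoff. It is an illustration, not part of the scraper itself: `backoff_delays`, `fetch_with_backoff`, their parameters, and the injectable `sleep` argument are hypothetical names, and `fetch` is assumed to be any callable that returns `None` on failure, as `safe_request` does.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, max_delay=30.0):
    """Return the wait time before each retry, capped at max_delay."""
    return [min(base * factor ** attempt, max_delay) for attempt in range(retries)]


def fetch_with_backoff(fetch, url, retries=3, sleep=time.sleep, **delay_options):
    """Call fetch(url) until it returns a non-None result or retries run out.

    fetch is assumed to return None on failure.
    sleep is injectable so the function can be exercised without waiting.
    """
    result = fetch(url)
    for delay in backoff_delays(retries, **delay_options):
        if result is not None:
            break
        sleep(delay)
        result = fetch(url)
    return result
```

With the request helper from this section, a call might look like `fetch_with_backoff(safe_request, "https://protected-site.com/inventory")`, which makes up to four attempts with waits of 1, 2, and 4 seconds between them.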
For precise element targeting, XPath can express queries that CSS selectors cannot, such as matching on text content or walking sibling axes. This approach handles deeply nested structures:
from lxml import html

def first_text(node, path):
    """Return the first stripped text match for an XPath, or None if absent."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else None

def parse_product_page(html_content):
    tree = html.fromstring(html_content)
    results = []
    for product in tree.xpath('//div[contains(@class, "product-card")]'):
        name = first_text(product, './/h2[@itemprop="name"]/text()')
        # Handle price variations: prefer the sale price, fall back to the base price
        price = (first_text(product, './/span[@class="sale-price"]/text()')
                 or first_text(product, './/span[@class="base-price"]/text()'))
        # Extract metadata using the following-sibling axis
        sku = first_text(product, './/dt[text()="SKU"]/following-sibling::dd/text()')
        results.append({"name": name, "price": price, "sku": sku})
    return results

# Usage
products = parse_product_page(html_content)
print(f"Found {len(products)} product listings")
CAPTCHAs require specialized services. I integrate solvers directly into automation scripts:
from selenium.webdriver.common.by import By
from twocaptcha import TwoCaptcha

def solve_captcha(driver):
    solver = TwoCaptcha("YOUR_API_KEY")
    # Identify CAPTCHA parameters
    sitekey = driver.find_element(By.CSS_SELECTOR, ".h-captcha").get_attribute("data-sitekey")
    page_url = driver.current_url
    # Solve and inject solution
    result = solver.hcaptcha(sitekey=sitekey, url=page_url)
    # Pass the token as a script argument instead of interpolating it into the JavaScript source
    driver.execute_script(
        "document.querySelector('[name=h-captcha-response]').value = arguments[0]",
        result["code"],
    )
    driver.find_element(By.ID, "submit-btn").click()
    return "Verification Success" in driver.page_source

# Usage in Selenium
# driver.get("https://secure-site.com/login")
# if "CAPTCHA" in driver.page_source:
#     solve_captcha(driver)
Large projects demand distributed systems. Scrapy-Redis handles queue management effectively:
# Scrapy project structure
# ├── scrapy.cfg
# ├── myproject/
# │   ├── __init__.py
# │   ├── items.py
# │   ├── middlewares.py
# │   ├── pipelines.py
# │   ├── settings.py
# │   └── spiders/
# │       └── distributed_spider.py

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://:password@server-ip:6379/0"
ITEM_PIPELINES = {
    "myproject.pipelines.DatabasePipeline": 300,
}

# distributed_spider.py
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = "distributed_crawler"
    redis_key = "crawler:start_urls"

    def parse(self, response):
        # Extraction logic here; the keys match the fields DatabasePipeline inserts
        yield {"url": response.url, "content": response.css("title::text").get()}
Database integration keeps pipelines flowing. This PostgreSQL loader handles continuous inserts:
import psycopg2
from contextlib import contextmanager

@contextmanager
def db_connection():
    conn = psycopg2.connect(
        dbname="scraped_data",
        user="loader",
        password="securepass",
        host="db-server.com"
    )
    try:
        yield conn
        # psycopg2 does not autocommit; without this the insert is lost on close
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

class DatabasePipeline:
    def process_item(self, item, spider):
        with db_connection() as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    INSERT INTO scraped_items (url, content, timestamp)
                    VALUES (%s, %s, NOW())
                    ON CONFLICT (url) DO UPDATE
                    SET content = EXCLUDED.content,
                        timestamp = NOW()
                """, (item["url"], item["content"]))
        return item
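The upsert above can be tried locally without a PostgreSQL server. The following is a minimal sketch that assumes SQLite (version 3.24 or newer, which supports a similar `ON CONFLICT ... DO UPDATE` clause) as a stand-in. The `create_store` and `save_item` helpers and the in-memory database are illustrative names, not part of the pipeline.

```python
import sqlite3


def create_store():
    """Open an in-memory SQLite database with a table shaped like scraped_items."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scraped_items ("
        "url TEXT PRIMARY KEY, content TEXT, timestamp TEXT)"
    )
    return conn


def save_item(conn, item):
    """Insert a row, or refresh content and timestamp if the URL already exists."""
    conn.execute(
        """
        INSERT INTO scraped_items (url, content, timestamp)
        VALUES (?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT (url) DO UPDATE
        SET content = excluded.content,
            timestamp = CURRENT_TIMESTAMP
        """,
        (item["url"], item["content"]),
    )
    conn.commit()


conn = create_store()
save_item(conn, {"url": "https://example.com/a", "content": "first"})
save_item(conn, {"url": "https://example.com/a", "content": "second"})
rows = conn.execute("SELECT url, content FROM scraped_items").fetchall()
print(rows)
```

Saving the same URL twice leaves a single row holding the newer content, which is the behavior a re-crawl needs.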
Respecting website rules is non-negotiable. This robots.txt checker prevents policy violations:
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

USER_AGENT = "MyBot/1.0"

def check_crawl_permission(target_url):
    parsed = urlparse(target_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    # Look up the delay for the same agent that can_fetch() checks
    crawl_delay = rp.crawl_delay(USER_AGENT)
    if crawl_delay:
        print(f"Respecting crawl delay: {crawl_delay} seconds")
    return rp.can_fetch(USER_AGENT, target_url)

# Usage
if check_crawl_permission("https://example.com/restricted-area"):
    print("Proceeding with extraction")
else:
    print("Access prohibited by robots.txt")
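The same parser can be exercised offline, which helps when testing crawl rules. This sketch feeds an inline robots.txt to `RobotFileParser.parse()` instead of fetching one. The sample rules, the `load_rules` and `polite_delay` helpers, and the one-second default are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /restricted-area
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def polite_delay(rp, user_agent, default=1.0):
    """Return the site's crawl delay, or a default when none is declared."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("MyBot/1.0", "https://example.com/products")
blocked = rules.can_fetch("MyBot/1.0", "https://example.com/restricted-area")
delay = polite_delay(rules, "MyBot/1.0")
```

Passing the value from `polite_delay` to `time.sleep()` between requests is one way to act on the declared delay instead of only printing it.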
Monitoring site changes prevents unexpected extraction failures. This version tracker detects layout modifications:
import hashlib
import requests
from difflib import HtmlDiff

class SiteMonitor:
    def __init__(self, url):
        self.url = url
        self.last_hash = None
        self.previous_content = None

    def fetch_snapshot(self):
        response = requests.get(self.url, timeout=15)
        content = response.text
        # Hash a whitespace-stripped copy so formatting noise is ignored,
        # but keep the raw text so the diff still has line breaks
        normalized = content.replace(" ", "").replace("\n", "")
        current_hash = hashlib.sha256(normalized.encode()).hexdigest()
        return content, current_hash

    def detect_changes(self):
        content, current_hash = self.fetch_snapshot()
        if self.last_hash is None:
            print("Initial version stored")
            self.last_hash, self.previous_content = current_hash, content
            return False
        changed = current_hash != self.last_hash
        if changed:
            print("Structure change detected")
            # Generate change report
            diff = HtmlDiff().make_file(
                self.previous_content.splitlines(),
                content.splitlines(),
                context=True,
            )
            with open("change_report.html", "w", encoding="utf-8") as f:
                f.write(diff)
        # Store the latest snapshot as the baseline for the next comparison
        self.last_hash, self.previous_content = current_hash, content
        return changed

# Usage: the first call only stores a baseline, so run detect_changes() on a schedule
monitor = SiteMonitor("https://frequently-updated.com")
if monitor.detect_changes():
    print("Site structure modified - update selectors")
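Hashing the whole page also fires when only text changes, such as a new price or headline. If the goal is to catch layout changes that break selectors, one option is to hash only the tag structure. The sketch below is an assumption about how that could look, using the standard library's `html.parser`. `StructureFingerprint` and `structure_hash` are hypothetical names, and the fingerprint tracks opening tag names plus `id` and `class` attributes only.

```python
import hashlib
from html.parser import HTMLParser


class StructureFingerprint(HTMLParser):
    """Collect opening tags with their id and class, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        self.parts.append(
            f"{tag}#{attr_map.get('id') or ''}.{attr_map.get('class') or ''}"
        )


def structure_hash(html_text):
    """Return a SHA-256 digest of the page's tag skeleton."""
    parser = StructureFingerprint()
    parser.feed(html_text)
    parser.close()
    return hashlib.sha256("|".join(parser.parts).encode()).hexdigest()


before = '<div class="product-card"><h2>Widget</h2><span class="price">$10</span></div>'
text_only = '<div class="product-card"><h2>Widget</h2><span class="price">$12</span></div>'
restructured = '<div class="card"><h3>Widget</h3><span class="price">$10</span></div>'

same_layout = structure_hash(before) == structure_hash(text_only)
layout_changed = structure_hash(before) != structure_hash(restructured)
```

A price update leaves the fingerprint unchanged, while a renamed class or a swapped heading tag changes it, so alerts line up more closely with selector breakage.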
These methods form a comprehensive approach to modern web data collection. Each addresses specific challenges I've encountered in real projects. The key is balancing extraction capability with respectful crawling practices. Always verify legality and respect website terms before scraping. Proper rate limiting and error handling make the difference between sustainable data collection and blocked IPs. Start with small requests and scale gradually while monitoring server responses.