VARUN

Posted on • Originally published at varunn.hashnode.dev

Build a Robust Web Scraper with Python: A Complete Guide

Overview

Over the past few weeks, I’ve been scraping structured contractor data from sites like YellowPages, and I kept hitting the same problems: bot detection, dynamic loading delays, and unreliable internet. Web scraping gets genuinely hard once a site actively fights automation or your connection drops mid-run.

When I first got into web scraping, I had no idea where to start. There were countless tutorials online, but most were either outdated, too simplistic, or broke on real-world websites.

I wanted to extract structured data, but I ran into walls:

  • Websites blocked my bot within minutes
  • Pages didn’t load reliably due to dynamic content
  • Most examples never went beyond static requests + BeautifulSoup
  • No one talked about error handling or resuming after failure

Resources were scattered, and trial-and-error became my best teacher. I spent countless hours debugging XPath issues, avoiding bot detection, and building something reliable — only to watch it fail the next day due to a slight website change or IP block.

All these challenges eventually motivated me to put together a flexible, reliable scraper: the tool I wish I’d had when I started, one that would:

  • Work on real-world sites
  • Handle network issues gracefully
  • Mimic human-like browsing to bypass detection
  • Save data in structured formats for future use

How It Works

Here’s a breakdown of how the scraper works, piece by piece. Each section of the code handles a specific task, from setting up the browser to navigating pages and extracting data, so the whole run stays smooth and reliable.
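
All the snippets below come from a single script and share one set of imports at the top. Roughly (a sketch; adjust to whatever modules your copy of the script actually uses):

import os
import socket
import time
from datetime import datetime

import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC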

  • Wait for the Internet Connection
def wait_for_internet(timeout=120):
    while True:
        try:
            socket.create_connection(("8.8.8.8", 53), timeout=3)
            print(f"Internet is back at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
            return True
        except OSError:
            print(f"Internet down at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}. Retrying in 2 min...")
            time.sleep(timeout)

Before making any requests, this function checks for connectivity by opening a quick socket connection to Google’s public DNS server (8.8.8.8) on port 53; any other reliable host works just as well. If the connection drops mid-run, the scraper waits and retries instead of crashing.
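
If you prefer to probe a different host, the target is easy to make configurable. A small variation (the host and port parameters here are my own addition, not part of the original function):

def wait_for_internet(retry_seconds=120, host="1.1.1.1", port=53):
    # Same loop as above, but the probe target and retry interval are arguments.
    while True:
        try:
            socket.create_connection((host, port), timeout=3)
            return True
        except OSError:
            print(f"Internet down at {datetime.now():%Y-%m-%d %H:%M:%S}. Retrying in {retry_seconds}s...")
            time.sleep(retry_seconds)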

  • Setup a Stealth WebDriver
def setup_driver():
    options = uc.ChromeOptions()
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--start-maximized")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1080")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    )
    # Remove --headless above if you want to watch the browser during development
    driver = uc.Chrome(options=options)
    driver.execute_cdp_cmd("Network.setUserAgentOverride", {
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                     "Chrome/114.0.0.0 Safari/537.36"
    })
    return driver

We use undetected_chromedriver instead of the regular Selenium ChromeDriver. It sidesteps common bot-detection mechanisms (like automation fingerprinting) that websites often deploy.

Key options:

--headless: Runs the browser without a GUI (useful on servers; see the toggle sketch after this list)

Custom User-Agent: Mimics a real browser session

--disable-blink-features=AutomationControlled: Hides automation signals
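
The snippet above hard-codes --headless. A convenient tweak (my own variation, not part of the original code) is to make it a parameter, so you can watch the browser locally and still run headless on a server:

def setup_driver(headless=True):
    options = uc.ChromeOptions()
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--window-size=1920,1080")
    if headless:
        options.add_argument("--headless")  # no GUI; drop this locally to watch it work
    return uc.Chrome(options=options)

driver = setup_driver(headless=False)  # visible browser while debugging selectors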

  • Safe Data Extraction
def extract_contractor_data(driver, link):
    wait_for_internet()
    driver.get(link)
    time.sleep(2)
    wait = WebDriverWait(driver, 10)

    def safe_select(css, attr=None):
        try:
            el = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
            return el.get_attribute(attr).strip() if attr else el.text.strip()
        except Exception:
            return ""

    name    = safe_select("h1.dockable.business-name")
    phone   = safe_select("a.phone.dockable span.full")
    website = safe_select("a.website-link.dockable", "href")
    address = safe_select("span.address")
    email_href = safe_select("a.email-business", "href")
    email   = email_href.replace("mailto:", "") if email_href.startswith("mailto:") else ""

    return {
        "Name": name,
        "Phone": phone,
        "Website": website,
        "Address": address,
        "Email": email,
        "Profile Link": link
    }

This function opens an individual profile page and extracts:

  1. Name
  2. Phone Number
  3. Website
  4. Address
  5. Email

It uses CSS Selectors and WebDriverWait to ensure elements are present before scraping — avoiding premature lookups.
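
The same safe_select pattern extends to any extra field you care about. For example (both selectors below are hypothetical; check your target page’s HTML):

    # Inside extract_contractor_data(), alongside the existing fields:
    categories = safe_select("div.categories")            # hypothetical selector; returns "" if absent
    hours      = safe_select("div.open-hours", "title")   # hypothetical selector and attribute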

  • Resume-Capable Main Scraper
def scrape(city_name, contractor_url, max_pages=100):
    driver = setup_driver()
    wait = WebDriverWait(driver, 10)
    data = []

    safe_city_name = city_name.replace(" ", "_")
    filename = f"{safe_city_name}.csv"
    existing_links = set()

    if os.path.exists(filename):
        print(f"Resuming from existing file: {filename}")
        existing_df = pd.read_csv(filename)
        existing_links = set(existing_df["Profile Link"].dropna().tolist())
        data.extend(existing_df.to_dict(orient="records"))

    try:
        for page in range(1, max_pages + 1):
            print(f"\n{city_name} — Processing page {page}...")
            url = f"https://www.your_website.com/{contractor_url}?page={page}" if page > 1 else f"https://www.your_website.com/{contractor_url}"
            wait_for_internet()
            driver.get(url)

            for attempt in range(3):
                try:
                    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a.business-name")))
                    break
                except Exception:
                    print(f"Retry {attempt+1}/3 — waiting for contractor links…")
                    time.sleep(2)
            else:
                print("Could not load contractor links. Skipping city.")
                break

            links = [el.get_attribute("href")
                     for el in driver.find_elements(By.CSS_SELECTOR, "a.business-name")
                     if el.get_attribute("href")]

            for link in links:
                if link in existing_links:
                    print(f"Skipping already scraped: {link}")
                    continue

                print(f"Scraping: {link}")
                contractor_data = extract_contractor_data(driver, link)
                data.append(contractor_data)
                existing_links.add(link)

                # Optional: Save after each entry (safer but slower)
                pd.DataFrame(data).to_csv(filename, index=False)
                time.sleep(1)

    except KeyboardInterrupt:
        print("\nInterrupted by user! Saving collected data for this city…")

    finally:
        driver.quit()
        df = pd.DataFrame(data)
        df.to_csv(filename, index=False)
        print(f"Saved {len(df)} entries to '{filename}'")

The heart of the scraper is the scrape() function. It navigates through multiple pages of search results (up to max_pages) and extracts individual contractor details from each profile page.

Here’s how it works:

  1. Construct the URL for each results page
  2. Find all the listing links on that page
  3. Rate-limit by pausing briefly between requests (see the jitter sketch after this list)
  4. Open each profile and extract its data
  5. Save the data to CSV after every new entry (optional, but safer)
  6. Resume support (very useful!): if a CSV file already exists for a city, the scraper loads it first and skips any links already listed in it, so you can stop and restart at any point without losing previous work
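
One note on step 3: the code above sleeps a fixed second between profiles. A slightly more human-looking variant (my own tweak, not in the original) randomizes the pause:

import random

# Instead of time.sleep(1) after each profile:
time.sleep(random.uniform(1.0, 3.5))  # randomized delay reads less like a bot
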
  • Main driver function
def main():
    cities_df = pd.read_csv("your_file_name.csv") 
    for _, row in cities_df.iterrows():
        city_name = row["city"]
        contractor_url = row["url"]
        scrape(city_name, contractor_url, max_pages=100)


if __name__ == "__main__":
    main()

This reads a list of cities from a CSV file (with columns city and url) and invokes the scraper for each one sequentially.
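
For reference, the input file is just a two-column CSV. A made-up example (the city names and URL slugs are placeholders):

city,url
Austin,austin-tx/contractors
Denver,denver-co/contractors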

  • Project Dependencies
pandas
undetected-chromedriver
selenium
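
Install them with pip (or list them in a requirements.txt):

pip install pandas undetected-chromedriver selenium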

Usage Tips

To adapt this scraper for other sites:

  • Update CSS Selectors: Modify the selectors in extract_contractor_data() (see the sketch after this list)
  • Change URL Pattern: Update the URL construction logic
  • Adjust Wait Conditions: Modify what elements to wait for
  • Update Data Fields: Change the returned dictionary structure
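
To make the first tip concrete, here is a sketch of the extraction function pointed at a hypothetical directory site. Every selector and field name below is invented, so inspect your target site’s HTML and substitute the real ones:

def extract_listing_data(driver, link):
    # Same shape as extract_contractor_data(), just different selectors and fields.
    wait_for_internet()
    driver.get(link)
    wait = WebDriverWait(driver, 10)

    def safe_select(css, attr=None):
        try:
            el = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
            return el.get_attribute(attr).strip() if attr else el.text.strip()
        except Exception:
            return ""

    return {
        "Name": safe_select("h1.listing-title"),     # hypothetical selector
        "Rating": safe_select("span.star-rating"),   # hypothetical selector
        "Profile Link": link,
    }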

Conclusion

This scraper template provides a solid foundation for reliable web scraping projects. The combination of stealth techniques, robust error handling, and data persistence makes it suitable for production environments where reliability is crucial.

The code successfully helped me scrape large datasets without encountering blocking issues, and the resume functionality saved countless hours when dealing with interruptions.

Feel free to adapt this template for your own scraping needs, and remember to always scrape responsibly!

I’ve made the entire code public on GitHub for others to use and contribute. Clone, fork, or star the repo:

https://github.com/Varun3507/selenium-stealth-scraper.git

Found this helpful? Give it a ⭐ on GitHub and follow me for more web scraping and automation content!
