Web scraping is the automated extraction of data from websites. It sounds simple, and for basic cases it is. Fetch a page, parse the HTML, extract the data you need. But modern websites are increasingly hostile to scrapers, and the legal landscape has become significantly more complex.
The technical landscape
Static HTML sites. These are the easy case. Fetch the page, parse the DOM, extract data using CSS selectors or XPath. Tools: fetch + cheerio or jsdom (Node.js), requests + BeautifulSoup (Python).
const { JSDOM } = require('jsdom');

const response = await fetch('https://example.com/products');
const html = await response.text();
const dom = new JSDOM(html);
// JSDOM exposes the parsed document under dom.window
const prices = dom.window.document.querySelectorAll('.product-price');
JavaScript-rendered sites (SPAs). The HTML source contains a root <div> and a JavaScript bundle. The actual content is rendered client-side. Traditional fetch-and-parse does not work because the content does not exist in the initial HTML.
Solution: headless browsers. Puppeteer (Chrome) or Playwright (multi-browser) load the page, execute JavaScript, wait for rendering, then give you the fully rendered DOM.
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/products');
await page.waitForSelector('.product-price'); // wait for client-side render
const prices = await page.$$eval('.product-price', els => els.map(el => el.textContent));
await browser.close();
API-backed sites. Many modern sites load data via API calls that return JSON. Inspecting the Network tab in DevTools reveals these endpoints. If you can call the API directly, you skip HTML parsing entirely and get structured data.
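As a sketch of that approach: suppose the Network tab reveals a JSON endpoint. The endpoint path and the { products: [...] } response shape below are assumptions, not from any real site; inspect the actual traffic to find yours.

```javascript
// The response shape is an assumption -- adapt to what the Network tab shows.
function extractPrices(payload) {
  return payload.products.map((p) => p.price);
}

// Call the discovered endpoint directly: structured data, no HTML parsing.
async function fetchProductPrices(url) {
  const res = await fetch(url, { headers: { Accept: 'application/json' } });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return extractPrices(await res.json());
}
```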
Anti-scraping defenses
Rate limiting. Too many requests too fast triggers blocks. Solution: add delays between requests (1-3 seconds), rotate IP addresses with proxies.
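The delay part can be sketched as a simple polite-crawl loop; fetchPage stands in for whatever fetch-and-parse function you already have, and proxy rotation is left out:

```javascript
// Pause a randomized 1-3 seconds between requests so the crawl
// does not look like a burst of automated traffic.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const randomDelayMs = () => 1000 + Math.random() * 2000; // 1000-3000 ms

async function crawlPolitely(urls, fetchPage, delayMs = randomDelayMs) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(delayMs()); // wait before the next request
  }
  return results;
}
```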
CAPTCHAs. Triggered by suspicious traffic patterns. Headless browsers can sometimes avoid triggering them by mimicking human behavior (realistic mouse movements, human-like timing).
Bot detection (Cloudflare, PerimeterX, DataDome). These services fingerprint browsers and detect automation. They check for headless browser indicators, JavaScript engine inconsistencies, and behavioral patterns. Bypassing them is an arms race.
Dynamic class names. React and CSS-in-JS generate randomized class names like .css-1a2b3c4. You cannot rely on class names for selectors. Use semantic HTML attributes (data-testid, aria-label) or structural selectors (main > section:nth-child(2) > div).
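One way to cope is to try a list of selectors in order, most stable first. The selectors below are illustrative, not taken from any particular site:

```javascript
// Ordered from most stable to most brittle; adjust to the target page.
const PRICE_SELECTORS = [
  '[data-testid="product-price"]',      // semantic test hook
  '[aria-label="Price"]',               // accessibility attribute
  'main > section:nth-child(2) .price', // structural fallback
];

// Works with any DOM-like root (document, a JSDOM window.document, etc.).
function firstMatch(root, selectors) {
  for (const selector of selectors) {
    const el = root.querySelector(selector);
    if (el) return el;
  }
  return null;
}
```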
Honeypot traps. Hidden links that are invisible to users but visible to crawlers. Following them identifies you as a bot.
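A crawler can screen candidate links with a few visibility heuristics before following them. This is only a sketch: it checks inline styles and the hidden attribute, and real honeypots can be concealed in other ways (off-screen positioning, stylesheet rules).

```javascript
// Heuristic check for links a human could not see or click.
// Expects an element-like object with style and hidden properties.
function isLikelyHoneypot(link) {
  const style = link.style || {};
  return (
    link.hidden === true ||
    style.display === 'none' ||
    style.visibility === 'hidden' ||
    parseFloat(style.opacity) === 0
  );
}

const safeLinks = (links) => links.filter((l) => !isLikelyHoneypot(l));
```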
Legal considerations
The legal status of web scraping varies by jurisdiction and context:
Public data. In hiQ v. LinkedIn, the Ninth Circuit held in 2022 (reaffirming its earlier decision after a Supreme Court remand) that scraping publicly available data does not violate the Computer Fraud and Abuse Act. But this is US case law and may not apply in other jurisdictions, and hiQ still ultimately lost on breach-of-contract grounds.
Terms of Service. Many websites prohibit scraping in their ToS. Violating ToS is a contractual issue, not necessarily a criminal one, but can lead to civil liability.
GDPR and personal data. Scraping personal data of EU residents triggers GDPR obligations regardless of where the scraper is located.
Copyright. The data itself may be copyrighted even if publicly accessible. Scraping copyrighted content for commercial use is risky.
The safe zone: scraping publicly available, non-personal, non-copyrighted data at reasonable rates, for legitimate purposes. Everything beyond that requires legal counsel.
The practical tool
I built a web scraper tool at zovo.one/free-tools/web-scraper that provides a visual interface for selecting elements from a web page and extracting structured data. Point at the elements you want, define the extraction pattern, and export as CSV or JSON. It handles pagination and is useful for one-off data extraction tasks that do not justify writing a custom scraper.
I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.