I Built 23 Free Web Scrapers on Apify — Here is What I Learned
Building in public is one thing, but building scrapers in public is a whole different beast. Over the last few months, I’ve developed and released 23 free web scrapers on the Apify platform. From Amazon and TikTok to Google Maps and LinkedIn, I’ve touched almost every corner of the web where data lives.
If you’re an indie dev, a data enthusiast, or someone looking to break into the world of web automation, this is my story of why I built them, the technical hurdles I faced, and what it’s actually like to maintain a fleet of scrapers in 2026.
Why Build 23 Free Scrapers?
Most developers start building scrapers for a specific project. I started because I saw a gap. While there are plenty of enterprise-grade scraping solutions, many indie developers, students, and small researchers just need a quick, reliable way to get data without a $200/month subscription.
I wanted to build a "Swiss Army Knife" of data extraction tools. By releasing them for free on the Apify Store, I wasn't just building tools; I was building a portfolio and a reputation. In the world of "Scraper-as-a-Service," your best marketing is a tool that actually works.
The goal was simple:
- Master the art of scraping: You don't really know how a site works until you try to automate it.
- Help the community: Data shouldn't be gated by technical complexity.
- Explore the Ecosystem: Apify is unique because it handles the infrastructure. By building on top of it, I could focus 100% on the logic.
The Portfolio: A Tour of the "Big 5"
Out of the 23 actors I’ve built, five have consistently dominated the charts in terms of usage. Here’s a breakdown of what they do and the technical challenges that made them interesting.
1. Amazon Product Scraper: The "Scale" Challenge
The "OG" of scrapers. Everyone wants Amazon data for price monitoring or competitor analysis. This one extracts everything: ASIN, title, price, ratings, reviews, and even BSR (Best Seller Rank).
The Challenge: Amazon is a master of A/B testing. On any given day, you might see three different versions of the product page. Some have the price in a span with a specific class; others have it buried in a "Buy Box" iframe.
The Lesson: Instead of just scraping the HTML with brittle CSS selectors, I learned to target the specific JSON blobs hidden in the page. If you look for window.P.register('twister-js-init-dpx-data', ...) in the source, you'll find the entire product state. This is far more stable than trying to find the right div.
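To make this concrete, here is a minimal sketch of pulling an embedded JSON blob out of raw HTML by walking braces rather than trusting CSS selectors. The marker name matches the one above, but the blob shape shown is a simplified stand-in; real Amazon pages vary.

```typescript
// Sketch: extract an embedded JSON object that follows a known marker.
// Limitation: the brace-walk below ignores braces inside JSON strings,
// which is fine for simple blobs but not bulletproof.
function extractEmbeddedJson(html: string, marker: string): unknown | null {
  const start = html.indexOf(marker);
  if (start === -1) return null;
  const open = html.indexOf("{", start); // first '{' after the marker
  if (open === -1) return null;
  let depth = 0;
  for (let i = open; i < html.length; i++) {
    if (html[i] === "{") depth++;
    else if (html[i] === "}" && --depth === 0) {
      try {
        return JSON.parse(html.slice(open, i + 1));
      } catch {
        return null;
      }
    }
  }
  return null;
}

// Usage with a simplified, hypothetical page snippet:
const pageHtml = `<script>window.P.register('twister-js-init-dpx-data',
  {"asin":"B000TEST","price":19.99});</script>`;
const productState = extractEmbeddedJson(pageHtml, "twister-js-init-dpx-data");
```

The payoff is that the same extraction keeps working even when Amazon reshuffles its markup, because the JSON state tends to outlive any individual `div`.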
2. Google Maps Scraper: The "Lead Gen" Goldmine
This is the ultimate business-to-business (B2B) tool. It pulls business names, addresses, phone numbers, and ratings.
The Innovation: I noticed most users didn't just want the Google Maps data—they wanted to contact the businesses. I added an "Include Website" option. If enabled, the scraper doesn't just stay on Google Maps; it follows the business's website link and attempts to find emails and social media profiles.
Technical Hurdle: Scraping 1,000 different websites is harder than scraping one big site like Google. Every website has different anti-bot measures. I had to implement a recursive crawler that searches for "Contact Us" and "About" pages while strictly limiting its depth to avoid getting stuck in a "spider trap."
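The depth-limited traversal can be sketched as plain breadth-first search. The page fetcher is injected here so the logic stays testable offline; the function names are illustrative, not Apify APIs.

```typescript
// Sketch of a depth-limited contact-page crawl with a spider-trap guard.
type FetchPage = (url: string) => { links: string[]; emails: string[] };

function crawlForContacts(
  startUrl: string,
  fetchPage: FetchPage,
  maxDepth = 2,  // hard depth cap: the spider-trap guard
  maxPages = 20  // page-count cap as a second safety net
): string[] {
  const seen = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const emails = new Set<string>();

  while (queue.length > 0 && seen.size <= maxPages) {
    const { url, depth } = queue.shift()!;
    const page = fetchPage(url);
    page.emails.forEach((e) => emails.add(e));
    if (depth >= maxDepth) continue;
    for (const link of page.links) {
      // Only follow links that look like contact/about pages.
      if (!/contact|about/i.test(link) || seen.has(link)) continue;
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return [...emails];
}
```

The two caps matter more than the traversal order: a calendar widget or faceted navigation can generate infinite URLs, and without both limits the crawl never terminates.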
3. TikTok Profile & Video Scraper: The "Rehydration" Hack
TikTok is notoriously difficult because its internal structure changes weekly. If you try to scrape it with a browser, you'll constantly run into "Verify you are human" sliders.
The Breakthrough: I discovered the __UNIVERSAL_DATA_FOR_REHYDRATION__ script tag. When you load a TikTok profile, the server sends down a massive JSON object containing the profile data and the first 30 videos. By parsing this JSON instead of the DOM, the scraper became 10x more stable and significantly faster. It turns a "Browser" problem into a "JSON" problem.
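A minimal version of that parse looks like this. The script id is the real one TikTok uses; the JSON payload in the usage example is a made-up simplification of the actual structure.

```typescript
// Sketch: pull the rehydration JSON out of a TikTok-style page
// and parse it, turning a "browser" problem into a "JSON" problem.
function parseRehydrationData(html: string): unknown | null {
  const match = html.match(
    /<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Usage with a hypothetical, heavily simplified payload:
const tiktokHtml =
  '<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">' +
  '{"userInfo":{"user":{"uniqueId":"testuser"}}}</script>';
const rehydrated = parseRehydrationData(tiktokHtml);
```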
4. LinkedIn Job Scraper: The "Final Boss"
LinkedIn is the final boss of anti-bot protection. They use sophisticated fingerprinting to detect if you're using a headless browser.
The Strategy: While others were struggling with complex browser automation, I focused on a "human-mimicry" implementation using Playwright.
- Fingerprinting: I used crawlee's built-in fingerprinting to rotate headers, screen resolutions, and WebGL signatures.
- Scrolling: Instead of jumping to the bottom of the page, the scraper simulates a variable-speed scroll, pausing occasionally as if a human is reading the job description.
- The Result: A scraper that can pull 100+ jobs in minutes without triggering a login wall.
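The scrolling idea above can be separated from the browser entirely: generate a plan of variable-size scroll bursts with occasional long "reading" pauses, then feed each step to Playwright (e.g. via `page.mouse.wheel`). The tuning numbers below are illustrative guesses, not measured values.

```typescript
// Sketch: build a human-looking scroll plan. Driving the browser with it
// is left out so the plan itself stays plain, testable logic.
interface ScrollStep {
  deltaY: number;  // pixels to scroll in this burst
  pauseMs: number; // how long to wait before the next burst
}

function humanScrollPlan(totalPx: number, rng: () => number = Math.random): ScrollStep[] {
  const steps: ScrollStep[] = [];
  let scrolled = 0;
  while (scrolled < totalPx) {
    // Bursts of roughly 200-600 px, clamped to the remaining distance.
    const deltaY = Math.min(200 + Math.floor(rng() * 400), totalPx - scrolled);
    // ~20% of the time, take a long "reading" pause instead of a short one.
    const pauseMs =
      rng() < 0.2
        ? 1500 + Math.floor(rng() * 2000)
        : 100 + Math.floor(rng() * 300);
    steps.push({ deltaY, pauseMs });
    scrolled += deltaY;
  }
  return steps;
}
```

Injecting `rng` also makes the behavior reproducible in tests while staying random in production.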
5. Shopify Store Scraper: The "Competitor Intelligence" Tool
Great for dropshippers and e-commerce researchers. It identifies the theme, the apps installed on the store, and extracts the full product catalog.
The Trick: Most Shopify stores have a /products.json endpoint. It’s often hidden or paginated, but it provides perfectly structured data. My scraper identifies if a site is running on Shopify and then hits this endpoint directly, saving minutes of rendering time.
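The pagination loop over that endpoint is short. The `/products.json` path, the `products` array, and the 250-item page limit are real Shopify conventions; the fetcher is injected here so the loop is testable without network access.

```typescript
// Sketch of the /products.json pagination loop: keep requesting pages
// until an empty one comes back, then return the accumulated catalog.
type Product = { id: number; title: string };

async function fetchShopifyCatalog(
  baseUrl: string,
  fetchJson: (url: string) => Promise<{ products: Product[] }>
): Promise<Product[]> {
  const all: Product[] = [];
  for (let page = 1; ; page++) {
    const { products } = await fetchJson(
      `${baseUrl}/products.json?limit=250&page=${page}`
    );
    if (products.length === 0) break; // empty page = end of catalog
    all.push(...products);
  }
  return all;
}
```

In production, `fetchJson` would be a real HTTP call (plus a quick check that the endpoint exists at all, which doubles as Shopify detection).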
Technical Lessons: What Works (and What Doesn't)
After 23 iterations, my "stack" has become very opinionated.
The Power of Crawlee
If you aren't using Crawlee, you're playing on hard mode. It’s the engine behind all my scrapers. It handles the boring stuff—request retries, proxy rotation, and session management—so I can focus on the parsing logic. The CheerioCrawler is my favorite for speed, while PlaywrightCrawler is my heavy hitter for dynamic sites.
Cheerio vs. Playwright: The Efficiency Ratio
This is the eternal debate.
- Cheerio (Static/Hydrated): Whenever possible, use Cheerio. It’s light, fast, and uses 1/10th of the RAM. Most modern sites actually "hydrate" their data into a JSON object inside a script tag. Find that tag, and you don't need a browser.
- Playwright (Dynamic): Use this only when you must. If the page literally won't show data until a button is clicked or a script is executed, Playwright is your friend.
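One cheap way to make that call automatically: fetch the raw HTML once and check whether a known data marker (a hydrated JSON script tag, or the field you need) is already present. The marker names below are just common examples.

```typescript
// Sketch: decide whether a page needs a real browser. If any known
// hydration marker appears in the static HTML, Cheerio-style parsing
// is enough and the expensive Playwright path can be skipped.
function needsBrowser(html: string, markers: string[]): boolean {
  return !markers.some((m) => html.includes(m));
}

// Usage: "__NEXT_DATA__" is the hydration tag Next.js sites embed.
const staticEnough = !needsBrowser(
  '<script id="__NEXT_DATA__" type="application/json">{}</script>',
  ["__NEXT_DATA__"]
);
```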
The Anti-Bot War: Proxies and Fingerprinting
In 2026, simple IP rotation isn't enough. Anti-bot services like Cloudflare and Akamai look at your "TLS Fingerprint"—the way your computer shakes hands with the server.
- Residential Proxies: Mandatory for LinkedIn and Amazon. They make your traffic look like it's coming from a home Wi-Fi network.
- Header Order: Did you know browsers send headers in a very specific order? If you send them out of order, you're immediately flagged as a bot.
- Canvas Fingerprinting: Browsers render graphics differently based on your OS and GPU. Tools like crawlee help spoof these so every request looks like it's from a unique, "real" machine.
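The header-order point deserves a concrete illustration. The core idea is to treat headers as an ordered list matched against a browser template, not an unordered object. The order below approximates what Chrome sends; in practice a library like got-scraping handles this for you.

```typescript
// Sketch: sort outgoing headers into a Chrome-like order.
// The template is an approximation, not an exact capture.
const CHROME_HEADER_ORDER = [
  "host", "connection", "sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform",
  "upgrade-insecure-requests", "user-agent", "accept", "sec-fetch-site",
  "sec-fetch-mode", "sec-fetch-dest", "accept-encoding", "accept-language",
];

function orderHeaders(headers: Record<string, string>): Array<[string, string]> {
  const rank = (name: string) => {
    const i = CHROME_HEADER_ORDER.indexOf(name.toLowerCase());
    // Unknown headers sink to the end; known ones keep the template order.
    return i === -1 ? CHROME_HEADER_ORDER.length : i;
  };
  return Object.entries(headers).sort(([a], [b]) => rank(a) - rank(b));
}

// Usage: regardless of insertion order, Host comes out first.
const ordered = orderHeaders({
  accept: "*/*",
  "user-agent": "Mozilla/5.0",
  host: "example.com",
});
```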
Maintenance: The Soul-Crushing Reality
Building the scraper is the easy part. Maintenance is where the real work happens.
- The "Tuesday" Problem: Big tech companies often push updates on Tuesdays. I've woken up many Wednesday mornings to find five scrapers broken because a single CSS class changed from price-value to top-val.
- The Solution: Build for failure. Wrap your parsers in try-catch blocks and use detailed logging. I use a "Sentinel" pattern where the scraper regularly checks if it's still finding the "core" fields (like Price or Title). If the "missing field" rate goes above 20%, it alerts me immediately.
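A minimal version of that Sentinel pattern: count how often core fields come back empty and alert once the miss rate crosses the threshold. The alert hook is a placeholder; in production it might post to Slack or email.

```typescript
// Sketch of the "Sentinel" check. Returns a function you call with every
// scraped item; it tracks the missing-field rate and alerts past 20%.
function makeSentinel(
  coreFields: string[],
  alert: (msg: string) => void,
  threshold = 0.2,
  minSample = 10 // don't alert until we have a meaningful sample
) {
  let total = 0;
  let missing = 0;
  return (item: Record<string, unknown>) => {
    total++;
    if (coreFields.some((f) => item[f] === undefined || item[f] === null)) missing++;
    const rate = missing / total;
    if (total >= minSample && rate > threshold) {
      alert(`Missing-field rate ${(rate * 100).toFixed(0)}% — selectors may have broken`);
    }
  };
}
```

The `minSample` guard is there so a single bad item at the start of a run doesn't trip the alarm.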
Monetization Insights: Why "Free" is a Smart Business Move
You might wonder why I release these for free. On Apify, "free" actors still generate revenue and opportunities:
- The Lead Magnet: A "free" scraper is the best business card. I’ve had dozens of companies reach out asking for custom integrations or private versions of my scrapers. They see it works, they see the code quality, and they trust me to build their enterprise solution.
- Apify Platform Credits: Users still pay for the "compute" and "proxies" they use. This builds the ecosystem, which in turn brings more users to the platform where I can offer premium services.
- The Portfolio Effect: When I apply for a contract, I don't just say "I know web scraping." I say "I maintain 23 scrapers with 10,000+ monthly runs." That proof of scale is invaluable.
AI and the Future of Scraping
We can't talk about scraping in 2026 without mentioning AI. I've started using LLMs (like GPT-4o) to help with the "fallback" parsing.
If my CSS selectors fail, I send a snippet of the HTML to an LLM and ask it to "Extract the price from this mess." It’s slower and more expensive, but it prevents the scraper from returning zero results. AI is making scrapers "self-healing," and that’s the next frontier I’m exploring.
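The control flow of that fallback is simple: try the cheap selector path first, and only pay for the LLM when it fails. Here `askLlm` is a stand-in for a real API call (e.g. a chat completion request) and is injected so the flow is testable offline; the fast-path regex is illustrative.

```typescript
// Sketch of the self-healing fallback: fast pattern first, LLM second.
async function extractPrice(
  html: string,
  askLlm: (prompt: string) => Promise<string>
): Promise<number | null> {
  // Fast path: a cheap pattern for the common page layout (illustrative).
  const m = html.match(/class="price-value"[^>]*>\s*\$?([\d.]+)/);
  if (m) return parseFloat(m[1]);

  // Slow path: hand a trimmed snippet to the LLM and parse its answer.
  const snippet = html.slice(0, 4000);
  const answer = await askLlm(`Extract the numeric price from this HTML: ${snippet}`);
  const n = parseFloat(answer);
  return Number.isNaN(n) ? null : n;
}
```

Trimming the snippet keeps token costs bounded, and returning `null` on an unparseable answer keeps a confused LLM from poisoning the dataset.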
My Tech Stack Overview
If you want to build something similar, here is what I use:
- Language: TypeScript (Type safety is non-negotiable for complex parsers)
- Framework: Crawlee
- Libraries:
  - cheerio: For lightning-fast HTML parsing.
  - playwright: For heavy-duty browser automation.
  - got-scraping: For making HTTP requests that look like real browsers.
- Platform: Apify for hosting, scheduling, and proxy rotation.
Conclusion
Building 23 scrapers taught me more about the architecture of the web than years of standard web development. It’s a constant cat-and-mouse game, but there is something incredibly satisfying about turning the messy, unstructured web into a clean CSV file.
The web is the world's largest database, but it's a database with a terrible API. Web scraping is the bridge that fixes that.
If you’re interested in seeing the scrapers in action or using them for your own projects, you can find the whole collection here:
Whether you need to monitor Amazon prices, find leads on Google Maps, or track TikTok trends, these tools are there for you to use.
What's next? Probably 23 more. The demand for data isn't slowing down, and as long as there are websites, there will be a need for people who know how to (respectfully) scrape them.
I'm an indie developer focusing on web automation and data extraction. If you found this useful, follow me for more technical deep dives into the world of automation!