<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Solomon Williams</title>
    <description>The latest articles on DEV Community by Solomon Williams (@solomon344).</description>
    <link>https://dev.to/solomon344</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1581185%2F39e1e4b6-0a13-4d0c-a762-78b3457bd9e6.jpeg</url>
      <title>DEV Community: Solomon Williams</title>
      <link>https://dev.to/solomon344</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/solomon344"/>
    <language>en</language>
    <item>
      <title>You're Using ScraperAPI or Scrape.do. You're Still Writing Parsers. There's a Better Way.</title>
      <dc:creator>Solomon Williams</dc:creator>
      <pubDate>Mon, 04 May 2026 18:51:14 +0000</pubDate>
      <link>https://dev.to/solomon344/youre-using-scraperapi-or-scrapedo-youre-still-writing-parsers-theres-a-better-way-2kl8</link>
      <guid>https://dev.to/solomon344/youre-using-scraperapi-or-scrapedo-youre-still-writing-parsers-theres-a-better-way-2kl8</guid>
<description>&lt;p&gt;If you're using a scraping API like ScraperAPI, Scrape.do, or ScrapingBee, you've already solved the hard fetching problem — proxy rotation, CAPTCHA, JS rendering, IP blocks.&lt;/p&gt;

&lt;p&gt;But here's what happens after the fetch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;scraperApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com/products&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// now what?&lt;/span&gt;
&lt;span class="c1"&gt;// cheerio? puppeteer? regex?&lt;/span&gt;
&lt;span class="c1"&gt;// custom parser that breaks every time the site updates?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get raw HTML back and then you spend hours writing and maintaining a parser on top. Every time the site updates its markup, your selectors break. You fix them. They break again.&lt;/p&gt;

&lt;p&gt;That's the part nobody talks about in scraping API comparisons.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two-Layer Problem
&lt;/h2&gt;

&lt;p&gt;Web scraping has two distinct problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetching&lt;/strong&gt; — getting the HTML past bot detection, CAPTCHAs, and IP blocks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt; — turning that HTML into structured, typed data your application can actually use&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ScraperAPI, Scrape.do, ScrapingBee — these tools are excellent at layer 1. They've invested heavily in proxy infrastructure, fingerprint evasion, and rendering pipelines. That's genuinely hard to build.&lt;/p&gt;

&lt;p&gt;But layer 2 is still your problem. And it's not a small problem.&lt;/p&gt;
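
&lt;p&gt;To make layer 2 concrete, here's roughly what that hand-rolled extraction code tends to look like with cheerio. The class names below are hypothetical: every target site needs its own set of selectors, and they silently break whenever the markup changes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// A rough sketch of the layer-2 code most teams end up maintaining themselves.
// The selectors are hypothetical; every site needs its own set.
import * as cheerio from 'cheerio';

function parseProducts(html) {
  const $ = cheerio.load(html);
  return $('.product-card').map((_, el) =&gt; ({
    name: $(el).find('.product-title').text().trim(),
    // prices arrive as strings like "$49.99" and still need cleanup
    price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
    available: $(el).find('.stock').text().toLowerCase().includes('in stock'),
  })).get();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;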




&lt;h2&gt;
  
  
  What the Parsing Tax Actually Costs You
&lt;/h2&gt;

&lt;p&gt;Let's be honest about what maintaining a custom parser costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial build time — hours to days depending on page complexity&lt;/li&gt;
&lt;li&gt;Ongoing maintenance — sites change their markup, your selectors break&lt;/li&gt;
&lt;li&gt;Edge case handling — missing fields, null values, type inconsistencies&lt;/li&gt;
&lt;li&gt;Testing — every site update potentially breaks your extraction&lt;/li&gt;
&lt;li&gt;Scaling — each new site you want to scrape needs a new parser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One analysis put it well: &lt;em&gt;an AI scraper that costs slightly more per page but requires zero parsing overhead often beats a cheaper raw HTML API once you factor in engineering time.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  DivParser as Your Extraction Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://divparser.com" rel="noopener noreferrer"&gt;DivParser&lt;/a&gt; is an AI extraction API. You give it HTML — from any source — and describe what you want in plain English. It returns clean, typed JSON.&lt;/p&gt;

&lt;p&gt;The key endpoint is &lt;code&gt;/v1/parse&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.divparser.com/v1/parse"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "html": "&amp;lt;html&amp;gt;...your scraped content...&amp;lt;/html&amp;gt;",
    "schema": "Extract product name, price, rating and availability"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Widget Pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;49.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Widget Lite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No selectors. No cheerio. No regex. No parser to maintain.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Combined Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ScraperAPI / Scrape.do
  → handles: proxy rotation, CAPTCHA, JS rendering, IP blocks
  → returns: raw HTML

DivParser /v1/parse
  → handles: intelligent extraction, type casting, schema enforcement
  → returns: clean typed JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You keep the fetching infrastructure you already trust. You drop in DivParser as the extraction step. No custom parser to write or maintain.&lt;/p&gt;
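
&lt;p&gt;A minimal sketch of that wiring in Node 18+ (built-in &lt;code&gt;fetch&lt;/code&gt;). The ScraperAPI call is simplified to its basic GET form and may need adjusting for your plan or provider; the DivParser request matches the &lt;code&gt;/v1/parse&lt;/code&gt; call shown earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: keep your existing fetcher for layer 1, add DivParser for layer 2.
const SCRAPER_KEY = process.env.SCRAPERAPI_KEY;
const DIVPARSER_KEY = process.env.DIVPARSER_KEY;

async function fetchAndExtract(targetUrl) {
  // Layer 1: fetching (proxies, rendering, anti-bot) stays with your current provider.
  // Simplified GET form; check your provider's docs for the exact parameters.
  const fetchUrl = `http://api.scraperapi.com/?api_key=${SCRAPER_KEY}&amp;url=${encodeURIComponent(targetUrl)}`;
  const html = await (await fetch(fetchUrl)).text();

  // Layer 2: extraction goes to DivParser's /v1/parse.
  const res = await fetch('https://api.divparser.com/v1/parse', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${DIVPARSER_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      html,
      schema: 'Extract product name, price, rating and availability',
    }),
  });
  return res.json(); // clean, typed JSON
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;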




&lt;h2&gt;
  
  
  When This Combo Makes Sense
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You're already using a scraping API&lt;/strong&gt; and spending significant engineering time on parsing and selector maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're scraping multiple different sites&lt;/strong&gt; — each with different markup. With a custom parser, that's N parsers to write and maintain. With DivParser, it's one schema per site written in plain English.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need strict output types&lt;/strong&gt; — DivParser supports Nestlang, a typed schema language that enforces output structure. If you define &lt;code&gt;price&lt;/code&gt; as a number, you get a number — not a string with a dollar sign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're building for AI pipelines&lt;/strong&gt; — LLMs need structured data, not raw HTML. The fetcher gets the page, DivParser formats it for your pipeline.&lt;/p&gt;
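
&lt;p&gt;On the strict-types point, the practical payoff is that you can use fields directly. A small sketch, assuming the product shape from the response example above is already in a &lt;code&gt;products&lt;/code&gt; array:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// price and rating come back as numbers, availability as a boolean,
// so there's no "$49.99" string cleanup step before you can compute with them.
const inStock = products.filter(p =&gt; p.availability);
const cheapest = Math.min(...inStock.map(p =&gt; p.price));
const avgRating = products.reduce((sum, p) =&gt; sum + p.rating, 0) / products.length;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;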




&lt;h2&gt;
  
  
  What DivParser Doesn't Replace
&lt;/h2&gt;

&lt;p&gt;To be clear — DivParser doesn't replace your fetching layer. It has its own scraper for public pages, but if you're already paying for ScraperAPI or Scrape.do for their proxy network and anti-bot capabilities, keep using them for fetching. DivParser just removes the parsing step that follows.&lt;/p&gt;

&lt;p&gt;It also doesn't handle auth-required pages, CAPTCHA solving, or residential proxy rotation — that's still your fetching layer's job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;DivParser has a free tier — no credit card required. If you're already fetching HTML and writing custom parsers on top, it's worth testing against one of your existing targets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://divparser.com" rel="noopener noreferrer"&gt;divparser.com&lt;/a&gt; — docs and API reference included.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments about how the extraction engine works or how to integrate it with your existing stack.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Firecrawl vs Apify vs DivParser: Picking the Right Web Scraping API in 2026</title>
      <dc:creator>Solomon Williams</dc:creator>
      <pubDate>Mon, 04 May 2026 18:37:55 +0000</pubDate>
      <link>https://dev.to/solomon344/firecrawl-vs-apify-vs-divparser-picking-the-right-web-scraping-api-in-2026-50eh</link>
      <guid>https://dev.to/solomon344/firecrawl-vs-apify-vs-divparser-picking-the-right-web-scraping-api-in-2026-50eh</guid>
      <description>&lt;p&gt;The web scraping API market has matured a lot in the last two years. There are now tools for every layer of the pipeline — fetching, rendering, extraction, and scheduling. But picking the wrong one costs you time, money, and broken selectors at 2am.&lt;/p&gt;

&lt;p&gt;This is a practical breakdown of three tools that cover different parts of the stack: Firecrawl, Apify, and DivParser.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Distinction Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Before comparing features, it helps to understand that these tools are solving different problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fetching tools&lt;/strong&gt; — handle proxy rotation, CAPTCHA, JS rendering. They return raw HTML or markdown. You still parse it yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction tools&lt;/strong&gt; — take HTML (or a URL) and return structured data. The AI understands the page and returns typed JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platforms&lt;/strong&gt; — combine both, plus scheduling, storage, and pre-built scrapers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most tools in 2026 are fetching tools with some extraction bolted on. A few are extraction-first. That distinction matters a lot depending on your use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Firecrawl
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Fast single-page fetches feeding into LLM pipelines&lt;/p&gt;

&lt;p&gt;Firecrawl is clean, fast, and developer-friendly. Its core value is turning a URL into markdown or structured content with minimal setup. Pre-warmed browsers mean sub-second latency on cached pages, and the credit pricing is predictable — 1 page = 1 credit under standard conditions.&lt;/p&gt;

&lt;p&gt;The extraction ("Extract" feature) is an add-on that starts at $89/month on top of your base plan. So if clean structured JSON is your primary need, you're paying for two things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very fast on simple fetches&lt;/li&gt;
&lt;li&gt;Self-hostable (AGPL)&lt;/li&gt;
&lt;li&gt;Low entry cost ($16 Hobby tier)&lt;/li&gt;
&lt;li&gt;Stealth proxies included&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credits disappear fast on large crawls&lt;/li&gt;
&lt;li&gt;Structured extraction is a separate, expensive add-on&lt;/li&gt;
&lt;li&gt;Limited built-in scheduling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Apify
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large-scale scraping with fine-grained control&lt;/p&gt;

&lt;p&gt;Apify is a full platform — 6,000+ pre-built Actors (scrapers), a global proxy pool, CAPTCHA solving, cron scheduling, webhooks, and SOC 2 Type II compliance. If you need to scrape Amazon, LinkedIn, or Google at scale with minimal custom code, Apify probably has an Actor for it.&lt;/p&gt;

&lt;p&gt;The tradeoff is complexity. The Actor/Compute Unit model has a learning curve, and costs can spike with inefficient code. Cold starts add ~1.5s latency. And the entry price ($39/month) is higher than alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Breadth — pre-built scrapers for almost every major site&lt;/li&gt;
&lt;li&gt;Effective anti-blocking technology&lt;/li&gt;
&lt;li&gt;Enterprise-ready (SOC 2, GDPR)&lt;/li&gt;
&lt;li&gt;You can monetize your own scrapers on their marketplace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actor/CU concepts add friction for new users&lt;/li&gt;
&lt;li&gt;Consumption costs can spike unexpectedly&lt;/li&gt;
&lt;li&gt;Overkill for teams that just need structured data from a handful of sites&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  DivParser
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Getting clean structured JSON from any page without writing or maintaining a parser&lt;/p&gt;

&lt;p&gt;DivParser takes a different approach. Instead of returning raw HTML for you to parse, it does the extraction for you — you describe what you want in plain English (or use Nestlang, a typed schema language), and it returns typed JSON directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.divparser.com/v1/scrapes"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com/jobs",
    "schema": "Extract job title, company and salary",
    "pageType": "LISTING"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Backend Engineer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"company"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Acme Corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$120k"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Data Engineer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"company"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Startup Inc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$110k"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also has a parse-only endpoint — you POST raw HTML and get structured data back without any fetching involved. This is useful when you already have HTML from another scraper, a dataset, or even a page you downloaded manually.&lt;/p&gt;
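
&lt;p&gt;A quick sketch of that flow in Node 18+, assuming the markup is already in a variable (from a dataset, an archive, or a saved page) and the same plain-English schema style as the &lt;code&gt;/v1/scrapes&lt;/code&gt; call above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: extraction only, no fetching involved.
// savedHtml is assumed to hold the page markup you already have.
const res = await fetch('https://api.divparser.com/v1/parse', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.DIVPARSER_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    html: savedHtml,
    schema: 'Extract job title, company and salary',
  }),
});
const jobs = await res.json();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;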

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean typed JSON in one API call — no parsing layer needed&lt;/li&gt;
&lt;li&gt;Parse endpoint accepts raw HTML (bring your own)&lt;/li&gt;
&lt;li&gt;Nestlang for strict schema enforcement&lt;/li&gt;
&lt;li&gt;Built-in scheduling via BullMQ&lt;/li&gt;
&lt;li&gt;Lowest entry price ($10.99 Starter)&lt;/li&gt;
&lt;li&gt;JS rendering + gradual scroll for complete listing extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No residential proxies yet (planned)&lt;/li&gt;
&lt;li&gt;No pre-built scrapers for specific sites&lt;/li&gt;
&lt;li&gt;Earlier stage — smaller scale limits than Apify/Firecrawl&lt;/li&gt;
&lt;li&gt;No CAPTCHA solving&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Firecrawl&lt;/th&gt;
&lt;th&gt;Apify&lt;/th&gt;
&lt;th&gt;DivParser&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output format&lt;/td&gt;
&lt;td&gt;Markdown / HTML&lt;/td&gt;
&lt;td&gt;Raw HTML / JSON (Actor-dependent)&lt;/td&gt;
&lt;td&gt;Typed JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI extraction&lt;/td&gt;
&lt;td&gt;Add-on ($89+/mo)&lt;/td&gt;
&lt;td&gt;Actor-dependent&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parse raw HTML&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema enforcement&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Nestlang&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduling&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Cron + interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-bot&lt;/td&gt;
&lt;td&gt;✅ Stealth proxies&lt;/td&gt;
&lt;td&gt;✅ Strong&lt;/td&gt;
&lt;td&gt;Basic (proxies planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-built scrapers&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ 6,000+&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entry price&lt;/td&gt;
&lt;td&gt;$16/mo&lt;/td&gt;
&lt;td&gt;$39/mo&lt;/td&gt;
&lt;td&gt;$10.99/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host&lt;/td&gt;
&lt;td&gt;✅ AGPL&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise compliance&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ SOC 2&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Which One Should You Use?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Firecrawl if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're feeding page content into an LLM pipeline and need fast markdown&lt;/li&gt;
&lt;li&gt;You want to self-host your scraping infrastructure&lt;/li&gt;
&lt;li&gt;You're doing simple fetches at moderate volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Apify if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to scrape a heavily protected site and there's an Actor for it&lt;/li&gt;
&lt;li&gt;You're operating at serious scale (100k+ pages/month)&lt;/li&gt;
&lt;li&gt;You need enterprise compliance (SOC 2, GDPR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use DivParser if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want structured JSON out of the box without building a parser&lt;/li&gt;
&lt;li&gt;You're working with HTML you already have (datasets, archives, manual downloads)&lt;/li&gt;
&lt;li&gt;You need strict schema-enforced output via Nestlang&lt;/li&gt;
&lt;li&gt;You want simple, predictable scheduling without the Actor/CU complexity&lt;/li&gt;
&lt;li&gt;You're building a data pipeline and want extraction as a composable API step&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Honest Summary
&lt;/h2&gt;

&lt;p&gt;Firecrawl and Apify are excellent at fetching. DivParser is focused on extraction. They're not always competing — in fact, if you're already using Firecrawl or a proxy-based fetcher and still building your own parser on top, DivParser's &lt;code&gt;/v1/parse&lt;/code&gt; endpoint might be worth a look as the extraction step in your pipeline.&lt;/p&gt;

&lt;p&gt;The scraping market in 2026 is moving toward output quality as the key differentiator. Raw HTML is cheap. Clean, typed, structured data is what pipelines actually need.&lt;/p&gt;

&lt;p&gt;All three tools have free tiers. Test them against your actual URLs before committing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built an AI Extraction API, Got Zero Paying Users, Then Rebuilt the Whole Engine</title>
      <dc:creator>Solomon Williams</dc:creator>
      <pubDate>Mon, 04 May 2026 18:20:36 +0000</pubDate>
      <link>https://dev.to/solomon344/i-built-an-ai-extraction-api-got-zero-paying-users-then-rebuilt-the-whole-engine-i7f</link>
      <guid>https://dev.to/solomon344/i-built-an-ai-extraction-api-got-zero-paying-users-then-rebuilt-the-whole-engine-i7f</guid>
      <description>&lt;p&gt;I'm Solomon, founder of &lt;a href="https://divparser.com" rel="noopener noreferrer"&gt;DivParser&lt;/a&gt; — an AI-powered web extraction API. I launched it a few months ago, got users testing it, and ended up with zero paying customers.&lt;/p&gt;

&lt;p&gt;This is the honest story of what went wrong, what I rebuilt, and what I discovered along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What DivParser Does
&lt;/h2&gt;

&lt;p&gt;You give DivParser a URL or raw HTML. You describe the data you want in plain English (or use Nestlang, our typed schema language). It returns clean, structured JSON — no selectors, no regex, no scraper maintenance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.divparser.com/v1/scrapes"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com/products",
    "schema": "Extract product name, price and availability",
    "pageType": "LISTING"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Widget Pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;49.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Widget Lite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. Clean. No parsing layer to maintain.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Zero Paying Users
&lt;/h2&gt;

&lt;p&gt;I launched. People signed up. Nobody paid.&lt;/p&gt;

&lt;p&gt;After sitting with that for a while, I dug into why. The honest answer was that the product had a real flaw — &lt;strong&gt;incomplete data extraction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The original engine worked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the page with a headless Playwright browser&lt;/li&gt;
&lt;li&gt;Run it through a proprietary trimmer that converts raw HTML into a compact intermediate format&lt;/li&gt;
&lt;li&gt;Feed the trimmed content + a massive system prompt into an LLM&lt;/li&gt;
&lt;li&gt;Get back JSON&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem was step 3. The system prompt was carrying too much weight — it was teaching the model our intermediate format with examples, teaching it Nestlang with examples, handling fallback prompt recognition, detecting blocked sites, AND processing the actual page data. All in one inference call.&lt;/p&gt;

&lt;p&gt;On large pages, the model would lose attention halfway through and return partial results. A product listing with 48 items might come back with 20. That's not a product people pay for.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Chunking + Merge
&lt;/h2&gt;

&lt;p&gt;The solution turned out to be simpler than I expected.&lt;/p&gt;

&lt;p&gt;Instead of one massive AI call, I split the trimmed content into chunks and run extraction on each chunk in parallel. Then a final AI call merges the results and removes duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trimmed content
  → Chunk 1 → AI extraction → partial JSON
  → Chunk 2 → AI extraction → partial JSON  
  → Chunk 3 → AI extraction → partial JSON
       ↓
  Merge AI → deduplicated, complete JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chunk size is dynamic — short pages get one call; large pages are split into as many chunks as needed. Items that fall on chunk boundaries come back with null fields from both adjacent chunks, and the merge AI reconciles them into one complete record.&lt;/p&gt;
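
&lt;p&gt;In rough JavaScript, the shape of it looks like this (not the production engine; &lt;code&gt;splitIntoChunks&lt;/code&gt;, &lt;code&gt;extractChunk&lt;/code&gt; and &lt;code&gt;mergeResults&lt;/code&gt; are stand-ins for the real trimmer and LLM calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough sketch of the chunk-and-merge idea, not the actual implementation.
async function extract(trimmedContent, schema) {
  // Dynamic chunking: short pages produce a single chunk, large pages produce many.
  const chunks = splitIntoChunks(trimmedContent);

  // Each chunk is small enough for the model to give it full attention.
  const partials = await Promise.all(
    chunks.map(chunk =&gt; extractChunk(chunk, schema))
  );

  // A final call reconciles duplicates and items split across chunk boundaries.
  return mergeResults(partials.flat(), schema);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;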

&lt;p&gt;This solved two problems at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete extraction&lt;/strong&gt; — each chunk is small enough for the model to give full attention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large page support&lt;/strong&gt; — no page is too big anymore; it just gets more chunks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Parse Layer: "Bot Protected? Download and Parse."
&lt;/h2&gt;

&lt;p&gt;While rebuilding the engine, I added something I didn't originally plan — a parse-only endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.divparser.com/v1/parse"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "html": "&amp;lt;html&amp;gt;...your content...&amp;lt;/html&amp;gt;",
    "schema": "Extract company name, phone, rating and business type"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You POST raw HTML. DivParser extracts structured data. No fetching, no bot detection concerns, no proxies needed.&lt;/p&gt;

&lt;p&gt;I tested it on a Google Maps search results page I downloaded locally — searched for "companies in Gambia", saved the HTML, uploaded it to DivParser. Got back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Neotec Company Limited"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.8 (21)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"799 0990"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Real estate developer"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ZigTech"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5.0 (19)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"260 0001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Software company"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;20 structured business records. From Google Maps. Without touching Google's servers once.&lt;/p&gt;

&lt;p&gt;I also tested it on a Jumia e-commerce page — 333 products extracted cleanly in one parse call.&lt;/p&gt;

&lt;p&gt;The parse layer essentially turns bot protection into a non-problem for a whole class of use cases. If DivParser can't scrape it, you can download it and parse it.&lt;/p&gt;
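
&lt;p&gt;In practice that workflow is a file read plus one API call. A sketch in Node 18+ (the file name is only an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: parse a page you saved locally, with no fetching and no bot detection involved.
import { readFile } from 'node:fs/promises';

const html = await readFile('./google-maps-results.html', 'utf8');

const res = await fetch('https://api.divparser.com/v1/parse', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.DIVPARSER_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    html,
    schema: 'Extract company name, phone, rating and business type',
  }),
});
const businesses = await res.json();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;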




&lt;h2&gt;
  
  
  What DivParser Looks Like Now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;POST /v1/scrapes&lt;/strong&gt; — fetch + extract from a live URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POST /v1/parse&lt;/strong&gt; — extract from raw HTML you already have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POST /v1/schedules&lt;/strong&gt; — recurring scrapes on a cron or interval via BullMQ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nestlang&lt;/strong&gt; — optional typed schema for strict output enforcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination&lt;/strong&gt; — auto-detects URL patterns and scrapes across pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; — visual interface for non-API users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free tier available. No credit card required.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The zero-paying-users problem wasn't a marketing problem. It was a product problem. The extraction was incomplete, and developers noticed immediately.&lt;/p&gt;

&lt;p&gt;Fixing the engine first, then talking about it, is the right order.&lt;/p&gt;

&lt;p&gt;If you're building data pipelines, doing market research, or just tired of maintaining brittle scrapers — give DivParser a try: &lt;a href="https://divparser.com" rel="noopener noreferrer"&gt;divparser.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I read every reply. Happy to talk architecture, Nestlang, or anything else in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>startup</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
