
Solomon Williams

I Built an AI Extraction API, Got Zero Paying Users, Then Rebuilt the Whole Engine

I'm Solomon, founder of DivParser — an AI-powered web extraction API. I launched it a few months ago, got users testing it, and ended up with zero paying customers.

This is the honest story of what went wrong, what I rebuilt, and what I discovered along the way.


What DivParser Does

You give DivParser a URL or raw HTML. You describe the data you want in plain English (or use Nestlang, our typed schema language). It returns clean, structured JSON — no selectors, no regex, no scraper maintenance.

curl -X POST "https://api.divparser.com/v1/scrapes" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "schema": "Extract product name, price and availability",
    "pageType": "LISTING"
  }'

You get back:

{
  "results": [
    { "name": "Widget Pro", "price": 49.99, "availability": true },
    { "name": "Widget Lite", "price": 19.99, "availability": true }
  ]
}

Simple. Clean. No parsing layer to maintain.


The Problem: Zero Paying Users

I launched. People signed up. Nobody paid.

After sitting with that for a while, I dug into why. The honest answer was that the product had a real flaw — incomplete data extraction.

The original engine worked like this:

  1. Fetch the page with a headless Playwright browser
  2. Run it through a proprietary trimmer that converts raw HTML into a compact intermediate format
  3. Feed the trimmed content + a massive system prompt into an LLM
  4. Get back JSON

The problem was step 3. The system prompt was carrying too much weight — it was teaching the model our intermediate format with examples, teaching it Nestlang with examples, handling fallback prompt recognition, detecting blocked sites, AND processing the actual page data. All in one inference call.
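
To make the shape of the problem concrete, here's roughly what that pipeline looked like. This is a minimal sketch, not the real code: trimHtml, extractWithLLM, and MEGA_SYSTEM_PROMPT are hypothetical stand-ins for the proprietary trimmer, the model call, and that overloaded prompt.

import { chromium } from 'playwright';

// Hypothetical stand-ins for internals not shown here.
declare function trimHtml(html: string): string;
declare function extractWithLLM(systemPrompt: string, content: string): Promise<unknown>;
declare const MEGA_SYSTEM_PROMPT: string; // format spec + Nestlang + fallbacks + block detection

async function scrapeOnce(url: string): Promise<unknown> {
  // 1. Fetch the rendered page with headless Playwright
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const html = await page.content();
  await browser.close();

  // 2. Raw HTML -> compact intermediate format
  const trimmed = trimHtml(html);

  // 3. One inference call carrying everything at once: this is
  //    where attention ran out on large pages
  return extractWithLLM(MEGA_SYSTEM_PROMPT, trimmed);
}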

On large pages, the model would lose attention halfway through and return partial results. A product listing with 48 items might come back with 20. That's not a product people pay for.


The Fix: Chunking + Merge

The solution turned out to be simpler than I expected.

Instead of one massive AI call, I split the trimmed content into chunks and run extraction on each chunk in parallel. Then a final AI call merges the results and removes duplicates.

Trimmed content
  → Chunk 1 → AI extraction → partial JSON
  → Chunk 2 → AI extraction → partial JSON  
  → Chunk 3 → AI extraction → partial JSON
       ↓
  Merge AI → deduplicated, complete JSON

Chunk size is dynamic — short pages get one call, large pages get split accordingly. Items that fall on chunk boundaries come back with null fields from both adjacent chunks, and the merge AI reconciles them into one complete record.
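
Here's what that reconciliation looks like, with illustrative values rather than real output:

From chunk 2: { "name": "Widget Max", "price": 89.99, "availability": null }
From chunk 3: { "name": "Widget Max", "price": null, "availability": true }
After merge:  { "name": "Widget Max", "price": 89.99, "availability": true }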

This solved two problems at once:

  • Incomplete extraction — each chunk is small enough for the model to give full attention
  • Large page support — no page is too big anymore; it just gets more chunks
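
In code, the fan-out/merge flow looks roughly like this. It's a minimal sketch: extractChunk, mergeResults, and the fixed character budget are hypothetical stand-ins, and the real chunker sizes chunks dynamically rather than by a flat constant.

// Hypothetical stand-ins for the per-chunk extraction call and the merge call.
declare function extractChunk(schema: string, chunk: string): Promise<object[]>;
declare function mergeResults(schema: string, partials: object[][]): Promise<object[]>;

const MAX_CHUNK_CHARS = 20_000; // illustrative budget per inference call

function splitIntoChunks(trimmed: string): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < trimmed.length; i += MAX_CHUNK_CHARS) {
    chunks.push(trimmed.slice(i, i + MAX_CHUNK_CHARS));
  }
  return chunks;
}

async function extract(schema: string, trimmed: string): Promise<object[]> {
  const chunks = splitIntoChunks(trimmed);

  // Short page: a single call, no merge step needed
  if (chunks.length === 1) return extractChunk(schema, chunks[0]);

  // Large page: run every chunk in parallel...
  const partials = await Promise.all(chunks.map((c) => extractChunk(schema, c)));

  // ...then one final AI call deduplicates and reconciles
  // boundary items whose fields came back null from both sides
  return mergeResults(schema, partials);
}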

The Parse Layer: "Bot Protected? Download and Parse."

While rebuilding the engine, I added something I didn't originally plan — a parse-only endpoint.

curl -X POST "https://api.divparser.com/v1/parse" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<html>...your content...</html>",
    "schema": "Extract company name, phone, rating and business type"
  }'

You POST raw HTML. DivParser extracts structured data. No fetching, no bot detection concerns, no proxies needed.
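
For example, parsing a page you saved locally takes a few lines of Node. A sketch, assuming Node 18+ for the global fetch; the file name is made up, but the endpoint and body fields match the curl call above:

import { readFile } from 'node:fs/promises';

// Read the HTML you saved from your browser
const html = await readFile('saved-page.html', 'utf8');

const res = await fetch('https://api.divparser.com/v1/parse', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer YOUR_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    html,
    schema: 'Extract company name, phone, rating and business type',
  }),
});

console.log(await res.json());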

I tested it on a Google Maps search results page I downloaded locally — searched for "companies in Gambia", saved the HTML, uploaded it to DivParser. Got back:

[
  { "name": "Neotec Company Limited", "rating": "4.8 (21)", "phone": "799 0990", "type": "Real estate developer" },
  { "name": "ZigTech", "rating": "5.0 (19)", "phone": "260 0001", "type": "Software company" },
  ...
]

20 structured business records. From Google Maps. Without touching Google's servers once.

I also tested it on a Jumia e-commerce page — 333 products extracted cleanly in one parse call.

The parse layer essentially turns bot protection into a non-problem for a whole class of use cases. If DivParser can't scrape it, you can download it and parse it.


What DivParser Looks Like Now

  • POST /v1/scrapes — fetch + extract from a live URL
  • POST /v1/parse — extract from raw HTML you already have
  • POST /v1/schedules — recurring scrapes on a cron or interval via BullMQ (see the sketch after this list)
  • Nestlang — optional typed schema for strict output enforcement
  • Pagination — auto-detects URL patterns and scrapes across pages
  • Dashboard — visual interface for non-API users
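
The scheduling piece rides on BullMQ's repeatable jobs. Here's the general mechanism (a sketch of plain BullMQ, not DivParser's actual code; the queue name and payloads are illustrative):

import { Queue } from 'bullmq';

const queue = new Queue('scheduled-scrapes', {
  connection: { host: 'localhost', port: 6379 }, // Redis backs BullMQ
});

// Re-run the same scrape every day at 06:00 via a cron pattern...
await queue.add(
  'daily-products',
  { url: 'https://example.com/products', schema: 'Extract product name and price' },
  { repeat: { pattern: '0 6 * * *' } },
);

// ...or on a fixed interval (every 15 minutes)
await queue.add(
  'interval-scrape',
  { url: 'https://example.com' },
  { repeat: { every: 15 * 60 * 1000 } },
);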

Free tier available. No credit card required.


What I Learned

The zero paying users problem wasn't a marketing problem. It was a product problem. The extraction was incomplete and developers noticed immediately.

Fixing the engine first, then talking about it, is the right order.

If you're building data pipelines, doing market research, or just tired of maintaining brittle scrapers — give DivParser a try: divparser.com

I read every reply. Happy to talk architecture, Nestlang, or anything else in the comments.
