I Built a CLI That Extracts Content From URLs and Turns It Into JSON for AI

Omar Fuentes — Fri, 24 Apr 2026 17:04:38 +0000

Whenever I wanted to analyze several articles on the same topic, the process was always the same:

open multiple tabs
read each article
copy headings
copy paragraphs
organize everything manually

It was slow and repetitive.

So I decided to automate it.

That’s how content-scraper-cli was created.

The Idea

Instead of manually collecting information from multiple articles, I wanted a tool that could:

Take a list of URLs
Extract the important parts of each page
Organize everything in a structured format

The result is a JSON file that can be used as research input for AI tools or content analysis.

Installing the CLI

You can install the tool globally with npm:

npm install -g content-scraper-cli

Then run it from your terminal:

content-scraper

The CLI will ask for:

the URLs you want to analyze
the name of the JSON output file

Example:

📎 Enter URLs separated by comma:
https://blog.com/article-1, https://site.com/post-2

💾 Output JSON file name:
research-data

After processing the pages, the tool generates a JSON file with structured information from each article.
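To make the comma-separated prompt concrete, here is a minimal Python sketch of how such input could be cleaned before any page is fetched. It is not the CLI's actual code (the tool is an npm package), and the function name `parse_url_list` is hypothetical. It only assumes that entries are separated by commas and that anything that is not an http(s) URL should be skipped.

```python
from urllib.parse import urlparse


def parse_url_list(raw: str) -> list[str]:
    """Split a comma-separated string into http(s) URLs.

    Hypothetical helper: keeps the original order, trims whitespace,
    and drops empty entries, non-URLs, and duplicates.
    """
    seen = set()
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue
        parsed = urlparse(candidate)
        # Assumption: only absolute http/https URLs are worth fetching.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls
```

Passing the example input from the prompt returns the two URLs in the order they were typed.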

What the Tool Extracts

For every URL, the CLI extracts:

Metadata

title
meta description
author (if available)
publication date
site language
meta keywords

Content Structure

H1 headings
H2 headings
H3 headings

Content

paragraphs long enough to be meaningful (very short fragments are skipped)
lists (ul / ol)

Statistics

total paragraphs
total word count

This makes it easy to analyze how different articles are structured.
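The kind of extraction described above can be sketched with Python's standard `html.parser` module. This is an illustration of the technique, not the CLI's implementation: the class name `ArticleOutline`, the `summarize` function, the result keys, and the 40-character paragraph threshold are all assumptions made for the example. Lists and metadata are left out to keep it short.

```python
from html.parser import HTMLParser

# Hypothetical cutoff for "meaningful length"; the real tool's rule is not documented here.
MIN_PARAGRAPH_CHARS = 40


class ArticleOutline(HTMLParser):
    """Collects h1/h2/h3 headings and sufficiently long paragraphs."""

    TRACKED = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.paragraphs = []
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when not already inside a tracked element.
        if tag in self.TRACKED and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of an inline child such as <a> or <em>
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "p":
            if len(text) >= MIN_PARAGRAPH_CHARS:
                self.paragraphs.append(text)
        elif text:
            self.headings[tag].append(text)
        self._current = None


def summarize(html: str) -> dict:
    """Return headings, paragraphs, and simple statistics for one page."""
    parser = ArticleOutline()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "paragraphs": parser.paragraphs,
        "paragraph_count": len(parser.paragraphs),
        "word_count": sum(len(p.split()) for p in parser.paragraphs),
    }
```

Counting words only over the kept paragraphs is a design choice of this sketch; a different tool might count every text node on the page.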

Example Output

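The output file contains structured data like this (the keys are in Spanish: generado_en means "generated at", fuentes means "sources", titulo "title", estructura "structure", parrafos "paragraphs", and listas "lists"):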

{
  "generado_en": "...",
  "total_fuentes": 3,
  "fuentes": [
    {
      "url": "...",
      "titulo": "...",
      "estructura": {
        "h1": [],
        "h2": [],
        "h3": []
      },
      "parrafos": [],
      "listas": []
    }
  ]
}

This dataset can then be used for:

content research
SEO analysis
AI article generation workflows
studying how top articles structure their content
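As one example of the last use case, the following Python sketch reads the generated file and prints a short outline per source. It relies only on the keys shown in the sample output above (fuentes, url, titulo, estructura, parrafos). The function names are hypothetical, and the sketch assumes that each entry under h2 is a plain string, which the sample does not confirm.

```python
import json


def outline_report(data: dict) -> list[str]:
    """Build a plain-text outline from the parsed JSON output."""
    lines = []
    for fuente in data.get("fuentes", []):
        estructura = fuente.get("estructura", {})
        h2_headings = estructura.get("h2", [])
        lines.append(f'{fuente.get("titulo", "(untitled)")} <{fuente.get("url", "")}>')
        lines.append(f"  h2 sections: {len(h2_headings)}")
        lines.append(f'  paragraphs: {len(fuente.get("parrafos", []))}')
        for heading in h2_headings:
            # Assumption: headings are stored as strings.
            lines.append(f"    - {heading}")
    return lines


def load_report(path: str) -> list[str]:
    """Read a JSON file produced by the CLI and return its outline."""
    with open(path, encoding="utf-8") as handle:
        return outline_report(json.load(handle))
```

Calling `print("\n".join(load_report(path)))` with the path of the generated file shows how many sections and paragraphs each article has, which is a quick way to compare structures before passing the data to an AI tool.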

Why I Built It

The goal was simple:

Make content research faster.

Instead of manually reading and copying data from multiple pages, the CLI collects the structure of several articles in seconds.

Notes

The tool works best with:

blogs
news websites
long-form articles

Some sites with strong anti-bot protection may block requests.
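When you script something similar yourself, a blocked request should not abort the whole batch. The Python sketch below shows one way to handle that with the standard library: failed URLs are recorded and the rest continue. The function names, the User-Agent string, and the 10-second timeout are assumptions for illustration and say nothing about how the CLI itself handles failures.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it using the declared charset."""
    # Assumption: a simple custom User-Agent; some sites will still refuse it.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (research script)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetch=fetch_html):
    """Fetch every URL, collecting failures instead of stopping at the first one."""
    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as exc:
            # URLError, HTTPError (e.g. 403) and timeouts are all OSError subclasses.
            failures[url] = str(exc)
    return pages, failures
```

The `fetch` parameter is there so the loop can be exercised with a stand-in function, without touching the network.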

The CLI intentionally does not extract images and does not rely on paid APIs.

Everything runs locally.

Open Source

The project is open source and available on npm.

If you want to try it:

npm install -g content-scraper-cli

Created by Omar Fuentes, web developer and programmer.

DEV Community: Omar Fuentes