DEV Community

Omar Fuentes

I Built a CLI That Extracts Content From URLs and Turns It Into JSON for AI

Whenever I wanted to analyze several articles on the same topic, the process was always the same:

  • open multiple tabs
  • read each article
  • copy headings
  • copy paragraphs
  • organize everything manually

It was slow and repetitive.

So I decided to automate it.

That’s how content-scraper-cli was created.

The Idea

Instead of manually collecting information from multiple articles, I wanted a tool that could:

  1. Take a list of URLs
  2. Extract the important parts of each page
  3. Organize everything in a structured format

The result is a JSON file that can be used as research input for AI tools or content analysis.

Installing the CLI

You can install the tool globally with npm:

npm install -g content-scraper-cli

Then run it from your terminal:

content-scraper

The CLI will ask for:

  • the URLs you want to analyze
  • the name of the JSON output file

Example:

📎 Enter URLs separated by comma:
https://blog.com/article-1, https://site.com/post-2

💾 Output JSON file name:
research-data

After processing the pages, the tool generates a JSON file with structured information from each article.
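
The comma-separated input in the example above has to be split and cleaned before any page can be fetched. Here is a minimal Python sketch of that step. The article does not show how the CLI itself handles it, so the function name `parse_url_list` and its rules (trim whitespace, keep only http/https URLs, drop duplicates) are my own assumptions for illustration.

```python
from urllib.parse import urlsplit


def parse_url_list(raw: str) -> list[str]:
    """Split a comma-separated string into a de-duplicated list of http(s) URLs."""
    urls: list[str] = []
    for piece in raw.split(","):
        candidate = piece.strip()
        if not candidate:
            continue  # skip empty entries such as a trailing comma
        parts = urlsplit(candidate)
        # Keep only absolute web URLs; anything else is dropped (assumed rule).
        if parts.scheme in ("http", "https") and parts.netloc:
            if candidate not in urls:
                urls.append(candidate)
    return urls
```

Note that a URL which itself contains a comma would be split in two by this approach, which is a limitation of any comma-separated prompt.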

What the Tool Extracts

For every URL, the CLI extracts:

Metadata

  • title
  • meta description
  • author (if available)
  • publication date
  • site language
  • meta keywords

Content Structure

  • H1 headings
  • H2 headings
  • H3 headings

Content

  • paragraphs long enough to be meaningful
  • lists (ul / ol)

Statistics

  • total paragraphs
  • total word count

This makes it easy to analyze how different articles are structured.
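
To give a feel for what this kind of extraction involves, here is a rough Python sketch that pulls H1 to H3 headings and paragraphs out of an HTML string and counts paragraphs and words. It is not the CLI's code. The class `ArticleOutline`, the function `outline`, the keys of the returned dictionary, and the 40-character cutoff for a "meaningful" paragraph are all assumptions made for this example.

```python
from html.parser import HTMLParser

MIN_PARAGRAPH_CHARS = 40  # assumed cutoff; the article does not state the real one


class ArticleOutline(HTMLParser):
    """Collects h1-h3 headings and paragraph text from an HTML document."""

    TRACKED = {"h1", "h2", "h3", "p"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def _flush(self) -> None:
        """Store the text captured so far under the current tag, then reset."""
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if self._current == "p":
            if len(text) >= MIN_PARAGRAPH_CHARS:
                self.paragraphs.append(text)
        elif self._current is not None and text:
            self.headings[self._current].append(text)
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._flush()  # an unclosed <p> ends when the next block starts
            self._current = tag

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def close(self):
        super().close()
        self._flush()


def outline(html: str) -> dict:
    """Parse an HTML string and return headings, paragraphs, and basic counts."""
    parser = ArticleOutline()
    parser.feed(html)
    parser.close()
    return {
        "headings": parser.headings,
        "paragraphs": parser.paragraphs,
        "total_paragraphs": len(parser.paragraphs),
        "total_words": sum(len(p.split()) for p in parser.paragraphs),
    }
```

A real page also needs metadata and list handling, plus filtering of navigation and footer text, which this sketch leaves out.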

Example Output

The output file contains structured data like this:

{
  "generado_en": "...",
  "total_fuentes": 3,
  "fuentes": [
    {
      "url": "...",
      "titulo": "...",
      "estructura": {
        "h1": [],
        "h2": [],
        "h3": []
      },
      "parrafos": [],
      "listas": []
    }
  ]
}

This dataset can then be used for:

  • content research
  • SEO analysis
  • AI article generation workflows
  • studying how top articles structure their content
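
As one example of the last use case, the Python sketch below loads the output file and builds a row of counts per source. It only relies on the keys visible in the sample output (`fuentes`, `url`, `titulo`, `estructura`, `parrafos`, `listas`). The function names and the keys of the summary rows are my own, and the file name `research-data.json` in the usage note assumes the CLI appends a `.json` extension, which the article does not confirm.

```python
import json


def load_dataset(path: str) -> dict:
    """Read a JSON file produced by the CLI."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(dataset: dict) -> list[dict]:
    """Return one row of counts per source, using the keys from the sample output."""
    rows = []
    for source in dataset.get("fuentes", []):
        structure = source.get("estructura", {})
        rows.append({
            "url": source.get("url", ""),
            "title": source.get("titulo", ""),
            "h2_count": len(structure.get("h2", [])),
            "h3_count": len(structure.get("h3", [])),
            "paragraph_count": len(source.get("parrafos", [])),
            "list_count": len(source.get("listas", [])),
        })
    return rows
```

Usage would look like `summarize(load_dataset("research-data.json"))`, and the resulting rows can be printed as a table or passed to an AI tool as context.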

Why I Built It

The goal was simple:

Make content research faster.

Instead of manually reading and copying data from multiple pages, the CLI collects the structure of several articles in seconds.

Notes

The tool works best with:

  • blogs
  • news websites
  • long-form articles

Some sites with strong anti-bot protection may block requests.
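
Because a blocked page can leave a gap in the results, it may be worth comparing the output against the URLs you asked for. The Python sketch below assumes a blocked URL shows up in one of two ways: it has no entry under `fuentes`, or it has an entry with no headings and no paragraphs. The article does not say how the CLI reports failures, so both cases are guesses, and `find_gaps` is a hypothetical helper.

```python
def find_gaps(requested: list[str], dataset: dict) -> list[str]:
    """Return requested URLs that are missing from the output or came back empty."""
    by_url = {source.get("url"): source for source in dataset.get("fuentes", [])}
    gaps = []
    for url in requested:
        source = by_url.get(url)
        if source is None:
            gaps.append(url)  # no entry at all (assumes output URLs match the input exactly)
            continue
        structure = source.get("estructura", {})
        has_headings = any(structure.get(level) for level in ("h1", "h2", "h3"))
        if not has_headings and not source.get("parrafos"):
            gaps.append(url)  # entry exists but nothing was extracted
    return gaps
```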

The CLI intentionally does not extract images and does not rely on paid APIs.

Everything runs locally.

Open Source

The project is open source and available on npm.

If you want to try it:

npm install -g content-scraper-cli

Created by Omar Fuentes, web developer and programmer.
