DEV Community

Casualwriter

A portable, lightweight web crawler using Powerpage

I just coded a portable, lightweight web crawler using Powerpage. Powerpage Web Crawler is a portable JavaScript application that runs on Powerpage. It is written in vanilla JavaScript, in about 350 lines of code, with no dependencies.

Powerpage Web Crawler is a portable program: simply download it and run powerpage.exe. It is a powerful and easy-to-use web crawler, suited to crawling blog sites and reading them offline.

Simply define the following settings, for example:

  • base-url := https://dev.to/casualwriter // the home page of your favorite blog site
  • index-pattern := none // RegExp for the URL pattern of category pages
  • page-pattern := /casualwriter/[a-z] // RegExp for the URL pattern of content pages
  • content-css := #main-title h1, #article-body // CSS selector for the blog content
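
To show how two URL patterns like these can drive a crawler, here is a minimal sketch in vanilla JavaScript. It is not the actual Powerpage Web Crawler code: the `site` object, its property names, and the `classifyUrl` function are hypothetical names chosen for illustration, and testing `page-pattern` before `index-pattern` is an assumption.

```javascript
// Hypothetical settings object mirroring the four settings above.
const site = {
  baseUrl: 'https://dev.to/casualwriter',
  indexPattern: null,                       // "none": no category pages to follow
  pagePattern: /\/casualwriter\/[a-z]/,     // matches content page paths
  contentCss: '#main-title h1, #article-body'
};

// Classify an absolute URL as 'page' (content), 'index' (category), or 'skip'.
// Assumption: the patterns are tested against the URL path only.
function classifyUrl(url, cfg) {
  const path = new URL(url).pathname;
  if (cfg.pagePattern && cfg.pagePattern.test(path)) return 'page';
  if (cfg.indexPattern && cfg.indexPattern.test(path)) return 'index';
  return 'skip';
}

console.log(classifyUrl('https://dev.to/casualwriter/some-post-1abc', site)); // page
console.log(classifyUrl('https://dev.to/about', site));                       // skip
```

With `index-pattern` set to none, only the base page is scanned for links, and every link whose path matches `page-pattern` is treated as a content page.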

The program will:

  • crawl all category pages.
  • find the URLs of all content pages.
  • crawl the content of one page, or of all pages.
  • save the settings and links to a database (multiple sites are supported).
  • save content pages to local files.
  • allow offline reading from the local files.
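
The link-discovery step in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the crawler's real implementation: `extractLinks` and `findContentPages` are hypothetical helper names, the HTML is scanned with a simple regular expression (inside a browser such as Powerpage, querying the DOM would be the more natural choice), and fetching, database storage, and file saving are left out.

```javascript
// Pull the href of every <a> tag out of an HTML string and resolve it
// against baseUrl. Fragments (#...) are dropped; unresolvable hrefs are ignored.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"'#]+)[^"']*["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.push(new URL(match[1], baseUrl).href);
    } catch (e) {
      // skip hrefs that cannot be parsed as a URL
    }
  }
  return links;
}

// Keep only same-site links whose path matches pagePattern, without duplicates.
function findContentPages(html, baseUrl, pagePattern) {
  const origin = new URL(baseUrl).origin;
  const found = new Set();
  for (const link of extractLinks(html, baseUrl)) {
    const url = new URL(link);
    if (url.origin === origin && pagePattern.test(url.pathname)) found.add(link);
  }
  return [...found];
}

// Usage with a small made-up category page.
const sampleHtml = `
  <a href="/casualwriter/post-one">One</a>
  <a class="more" href="/casualwriter/post-one#comments">Comments</a>
  <a href='https://dev.to/casualwriter/post-two'>Two</a>
  <a href="https://example.com/casualwriter/elsewhere">External</a>
  <a href="#top">Top</a>`;

console.log(findContentPages(sampleHtml, 'https://dev.to/casualwriter', /\/casualwriter\/[a-z]/));
// [ 'https://dev.to/casualwriter/post-one', 'https://dev.to/casualwriter/post-two' ]
```

Collecting the URLs in a `Set` keeps the crawl list free of duplicates when the same post is linked several times on one page.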

About Powerpage

Powerpage Web Crawler runs on Powerpage, a lightweight web browser with database capability and Windows accessibility, built for quick development of JavaScript/HTML/CSS applications.

For the source code of Powerpage, please visit https://github.com/casualwriter/powerpage/tree/main/source/src

By the way, sorry for the beginner coding style and the rough screen layout (a trade-off for staying free of dependencies).

Enjoy!
