DEV Community

CoderHXL


x-crawl: a flexible, multifunctional crawler library for Node.js

x-crawl

x-crawl is a flexible, multifunctional crawler library for Node.js. Its flexible usage and broad feature set help you crawl pages, interfaces (APIs), and files quickly, safely, and stably.

If you like x-crawl, you can support it by starring the x-crawl repository on GitHub. Thank you for your support!

Features

  • 🔥 Asynchronous/Synchronous - Switch between asynchronous and synchronous crawling by changing the mode property.
  • ⚙️ Multiple uses - Supports crawling dynamic pages, static pages, interface (API) data, and files, as well as polling operations.
  • ⚒️ Page control - When crawling dynamic pages, supports automated operations, keyboard input, event operations, and more.
  • 🖋️ Flexible writing style - The same crawling API accepts several configuration forms, each with its own strengths.
  • ⏱️ Interval crawling - Choose no interval, a fixed interval, or a random interval to allow or avoid highly concurrent crawling.
  • 🔄 Failed retry - Retries failed crawls so that short-term problems do not cause a failure; the number of retries is customizable.
  • ➡️ Proxy rotation - Automatically rotates proxies on failed retries, based on a custom error count and HTTP status codes.
  • 👀 Device fingerprinting - With zero configuration or custom configuration, helps prevent fingerprinting from identifying and tracking you across different locations.
  • 🚀 Priority queue - A crawl target with a higher priority is crawled ahead of other targets.
  • 🧾 Crawl log - Logs crawling activity and highlights messages with colored text in the terminal.
  • 🦾 TypeScript - Ships its own types and provides complete typing through generics.
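
To make the interval and retry features above more concrete, here is a minimal conceptual sketch in plain JavaScript. It is not x-crawl's actual implementation: the helper names resolveInterval and withRetry are hypothetical, and the only thing borrowed from the library is the shape of the intervalTime and maxRetry options used in the example further down.

```javascript
// Conceptual sketch only; not x-crawl's internal code.
// resolveInterval and withRetry are hypothetical helper names.

// Turn an intervalTime setting into a delay in milliseconds.
// undefined -> no interval, number -> fixed interval,
// { min, max } -> random interval in the range [min, max).
function resolveInterval(intervalTime, random = Math.random) {
  if (intervalTime === undefined) return 0
  if (typeof intervalTime === 'number') return intervalTime
  const { min = 0, max } = intervalTime
  return Math.floor(min + random() * (max - min))
}

// Run an async task and retry it up to maxRetry more times if it throws.
// The last error is rethrown when every attempt fails.
async function withRetry(task, maxRetry = 0) {
  let lastError
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      lastError = error
    }
  }
  throw lastError
}
```

With maxRetry: 3, a task in this sketch runs at most four times: the first attempt plus three retries.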

Example

As an example, the following crawler automatically fetches photos of experiences and homes around the world once a day:

// 1. Import module ES/CJS
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 2000, min: 1000 } })

// 3. Set the crawling task
/*
  Call the startPolling API to start polling.
  With { d: 1 }, the callback function is called once a day.
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call the crawlPage API to crawl the page
  const pageResults = await myXCrawl.crawlPage({
    targets: [
      'https://www.airbnb.cn/s/*/experiences',
      'https://www.airbnb.cn/s/plus_homes'
    ],
    viewport: { width: 1920, height: 1080 }
  })

  // Obtain the image URL by traversing the crawled page results
  const imgUrls = []
  for (const item of pageResults) {
    const { id } = item
    const { page } = item.data
    const elSelector = id === 1 ? '.i9cqrtb' : '.c4mnd7m'

    // wait for the page element to appear
    await page.waitForSelector(elSelector)

    // Get the URL of the page image
    const urls = await page.$$eval(`${elSelector} picture img`, (imgEls) =>
      imgEls.map((item) => item.src)
    )
    imgUrls.push(...urls.slice(0, 6))

    // Close the page and wait for it to finish closing
    await page.close()
  }

  // Call crawlFile API to crawl pictures
  await myXCrawl.crawlFile({ targets: imgUrls, storeDirs: './upload' })
})
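
The { d: 1 } argument passed to startPolling above describes the polling period. As a rough sketch of how such a setting maps to a timer delay, the snippet below converts it to milliseconds. The function pollingIntervalMs is a hypothetical helper, and the h (hours) and m (minutes) fields are assumptions for illustration; only d appears in the example above.

```javascript
// Hypothetical helper; it illustrates the arithmetic, not the library's code.
const MS_PER_MINUTE = 60 * 1000
const MS_PER_HOUR = 60 * MS_PER_MINUTE
const MS_PER_DAY = 24 * MS_PER_HOUR

// Convert a { d, h, m } period (days, hours, minutes) to milliseconds.
function pollingIntervalMs({ d = 0, h = 0, m = 0 } = {}) {
  return d * MS_PER_DAY + h * MS_PER_HOUR + m * MS_PER_MINUTE
}

console.log(pollingIntervalMs({ d: 1 })) // 86400000
```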


Note: Do not crawl indiscriminately; check the site's robots.txt before crawling. The website's class names may change; this example only demonstrates how to use x-crawl.
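
One way to act on the robots.txt advice is to check a path against the rules before adding it to the crawl targets. The sketch below is a deliberately simplified, hypothetical helper (isPathAllowed is not part of x-crawl): it reads only the "User-agent: *" group, treats rules as plain path prefixes, and ignores the * and $ wildcards, so treat it as a starting point rather than a complete parser.

```javascript
// Simplified robots.txt check; isPathAllowed is a hypothetical helper.
// Only the "User-agent: *" group and plain prefix rules are handled.
function isPathAllowed(robotsTxt, path) {
  let inWildcardGroup = false
  let prevWasUserAgent = false
  let longestAllow = -1
  let longestDisallow = -1

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim() // drop comments
    const sep = line.indexOf(':')
    if (sep === -1) continue

    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      inWildcardGroup = (prevWasUserAgent && inWildcardGroup) || value === '*'
      prevWasUserAgent = true
      continue
    }
    prevWasUserAgent = false

    // An empty rule value matches nothing, so it is skipped.
    if (!inWildcardGroup || value === '' || !path.startsWith(value)) continue
    if (field === 'allow') longestAllow = Math.max(longestAllow, value.length)
    if (field === 'disallow') longestDisallow = Math.max(longestDisallow, value.length)
  }

  // The longest matching rule wins; Allow wins a tie.
  return longestAllow >= longestDisallow
}
```

In practice, the robots.txt text would be fetched from the target site first, and any path for which this check returns false would be left out of the targets list.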

More

For more information, see https://github.com/coder-hxl/x-crawl


Top comments (2)

CoderHXL

Come and try it

Emmanuel

Thank you for sharing. I will give it a spin in the coming months.
