CoderHXL

Traditional crawler or AI-assisted crawler? How to choose?

Foreword

In the field of data scraping, traditional crawlers and AI-assisted crawlers each have their own strengths. A traditional crawler extracts data according to fixed rules, which works well for websites with a stable structure and clear patterns. However, as website structures change more frequently and grow more complex, traditional crawlers have gradually exposed their limitations. In contrast, an AI-assisted crawler uses artificial intelligence to analyze web pages intelligently and adapt to changes, offering greater flexibility and accuracy. So, faced with different scraping needs, how should we choose? This article takes a close look at the characteristics, advantages, and disadvantages of both approaches to provide a reference for that decision.

What are traditional crawlers and AI-assisted crawlers?

Traditional crawler

Traditional crawlers rely mainly on fixed rules or patterns to scrape web data. They typically locate and extract the required information by matching specific elements in a web page, such as class names, tags, or structures. The limitation of this approach is obvious: once a site update changes the original class names, tags, or structure, the traditional crawler can no longer recognize the elements it depends on, and data extraction fails or returns errors.
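
As a minimal illustration (the URL and selector below are made up; the extraction pattern follows the same x-crawl page API used later in this article), a selector-based extraction is only valid for as long as the site keeps that exact class name:

import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl()

crawlApp.crawlPage('https://www.example.com').then(async (res) => {
  const { page, browser } = res.data

  // Works only while the page keeps the ".movie-title" class name.
  // If a redesign renames it, this call throws and the extraction
  // fails until someone updates the selector by hand.
  const title = await page.$eval('.movie-title', (el) => el.textContent)
  console.log(title)

  browser.close()
})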

AI-assisted crawler

AI-assisted crawlers intelligently analyze and understand the content of web pages in order to locate and extract the required information more accurately. Using techniques such as natural language processing, they can understand the semantic information of a page and pinpoint the required data more precisely; even after the website is updated, an AI-assisted crawler can continue to extract data effectively.

Example

The examples use x-crawl. The crawled website is real; to avoid disputes, https://www.example.com is shown in its place.

  • Traditional crawler: obtains movie information from a movie ranking page through specific elements in the web page
  • Crawler + AI: the crawler is paired with AI to obtain movie information from the movie ranking page

Traditional crawler

The traditional crawler obtains movie information from the movie ranking page through specific elements in the web page.

import { createCrawl } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl()

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.example.com').then(async (res) => {
  const { page, browser } = res.data

  // Wait for the target element to appear on the page
  await page.waitForSelector('#wrapper #content .article')
  const filmHandleList = await page.$$('#wrapper #content .article table')

  const pendingTask = []
  for (const filmHandle of filmHandleList) {
    // Cover link (picture)
    const picturePending = filmHandle.$eval('td img', (img) => img.src)
    // Movie name (name)
    const namePending = filmHandle.$eval(
      'td:nth-child(2) a',
      (el) => el.innerText.split(' / ')[0]
    )
    // Introduction (info)
    const infoPending = filmHandle.$eval(
      'td:nth-child(2) .pl',
      (el) => el.textContent
    )
    // Rating (score)
    const scorePending = filmHandle.$eval(
      'td:nth-child(2) .star .rating_nums',
      (el) => el.textContent
    )
    // Number of comments (commentsNumber)
    const commentsNumberPending = filmHandle.$eval(
      'td:nth-child(2) .star .pl',
      (el) => el.textContent?.replace(/\(|\)/g, '')
    )

    pendingTask.push([
      namePending,
      picturePending,
      infoPending,
      scorePending,
      commentsNumberPending
    ])
  }

  const filmInfoResult = []
  let i = 0
  for (const item of pendingTask) {
    Promise.all(item).then((res) => {
      // filmInfo is a movie information object; the key order matches
      // the order in which the promises were pushed above
      const filmInfo = [
        'name',
        'picture',
        'info',
        'score',
        'commentsNumber'
      ].reduce((pre, key, i) => {
        pre[key] = res[i]
        return pre
      }, {})

      // Save each movie's information
      filmInfoResult.push(filmInfo)

      // Final processing once every task has resolved
      if (pendingTask.length === ++i) {
        browser.close()

        // Wrap the result; the type depends on how many items were found
        const filmResult = {
          elements: filmInfoResult,
          type: filmInfoResult.length > 1 ? 'multiple' : 'single'
        }

        console.log(filmResult)
      }
    })
  }
})

AI-assisted crawler

Crawler + AI: the crawler is paired with AI to obtain movie information from the movie ranking page.

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl()

// Create an AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.example.com').then(async (res) => {
  const { page, browser } = res.data

  // Wait for the target element to appear on the page and get its HTML
  await page.waitForSelector('#wrapper #content .article')
  const targetHTML = await page.$eval(
    '#wrapper #content .article',
    (e) => e.outerHTML
  )

  browser.close()

  // Let AI obtain the movie information (the more detailed the description, the better)
  const filmResult = await crawlOpenAIApp.parseElements(
    targetHTML,
    `This is a list of movies. You need to get the movie name (name), cover link (picture), introduction (info), rating (score), and number of comments (commentsNumber). Use bracketed words as attribute names`
  )

  console.log(filmResult)
})

Results of the two examples

The movie information ultimately produced by both examples:

{
   "elements": [
     {
       "name": "Old Fox",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2900908599.webp",
       "info": "2023-10-27 (Tokyo International Film Festival) / 2023-11-24 (Taiwan, China) / Bai Runyin / Liu Guanting / Chen Muyi / Liu Yier / Kadowaki Mai / Huang Jianwei / Wen Shenghao / Ban Tiexiang / Yang Liyin / Fu Mengbai/Gao Yingxuan/Zhuang Yizeng/Zhang Zaixing/Xu Bowei/Guan Qing/Zhong Yao/You Jiaxuan/Zheng Yangen/Dai Yazhi/Jiang Ren/Xiao Hongwen...",
       "score": "8.1",
       "commentsNumber": "29211 people commented"
     },
     {
       "name": "Robot Dream",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2899644068.webp",
       "info": "2023-05-20(Cannes Film Festival) / 2023-12-06(Spain) / 2024(Mainland China) / Ivan Labanda/Albert Trevor Segarra/ Rafa Calvo/José Garcia Toss/José Luis Mediavilla/Garcia Molina/Esther Sollance/Spain/France/Pablo· Berger...",
       "score": "9.1",
       "commentsNumber": "64650 people commented"
     },
     {
       "name": "Under the sun",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2904961420.webp",
       "info": "2023-06-11 (Shanghai International Film Festival) / 2023-11-02 (Hong Kong, China) / 2024-04-12 (Mainland China) / David Jiang / Yu Xiangning / Lin Baoyi / Liang Zhongheng / Chen Zhanwen / Zhou Hanning / Liang Yongting/Gong Cien/Bao Peiru/Zhu Baiqian/Zhu Baikang/Xu Yuexiang/Hu Feng/Bao Qijing/Gao Hanwen/Peng Xingying/Luo Haoming/Tan Yuying...",
       "score": "8.0",
       "commentsNumber": "36540 people commented"
     },
     {
       "name": "Poor thing",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2897662939.webp",
       "info": "2023-09-01 (Venice Film Festival) / 2023-12-08 (USA) / Emma Stone / Mark Ruffalo / Willem Dafoe / Rami Yusuf / Christopher Abbott/Susie Bemba/Jerrod Carmichael/Katherine Hunter/Vicki Pepperdine/Margaret Qualley/Hannah Schigula/Jack Patton... .",
       "score": "7.0",
       "commentsNumber": "130113 people commented"
     },
     {
       "name": "Perfect Day",
       "picture": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2898894527.webp",
       "info": "2023-05-25 (Cannes Film Festival) / 2023-12-21 (Germany) / 2023-12-22 (Japan) / Yakusho Koji / Emoto Tokio / Nakano Arisa / Yamada Aoi / Aso Yumi/Sayuri Ishikawa/Tomokazu Miura/Mini Tanaka/Hiroto Oshita/Inuko Inuyama/Motomi Makiguchi/Tan Nagai/Ken Naoko/Moro Shioka/Ken Moriyu/Iru Katagiri/Xinguto Serizawa...",
       "score": "8.3",
       "commentsNumber": "33562 people commented"
     },
     {
       "name": "New Dragon Killing Formation",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2905374090.webp",
       "info": "2024-03-08 (South by Southwest Film Festival) / 2024-03-21 (USA Network) / Jake Gyllenhaal / Conor McGregor / Jessica Williams / Billy Magnussen/Daniela Melchior/Jimisola Ekumero/Lucas Gage/Travis Van Winkle/Darren Barnett/Joe Quem de Almeida...",
       "score": "6.3",
       "commentsNumber": "9980 people commented"
     },
     {
       "name": "Seoul Spring",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2905204009.webp",
       "info": "2023-11-22 (South Korea) / Hwang Jung-min / Jung Woo-sung / Lee Sung-min / Park Hae-jun / Kim Sung-joon / Park Hoon / An Se-ho / Jung Yun-ha / Jung Hae-in / Nam Yun-ho / Jeon Soo-ji / South Korea / Kim Sung-soo / 141 Minutes/Seoul Spring/Drama/Sung-su Kim/Korean",
       "score": "8.8",
       "commentsNumber": "171858 people commented"
     },
     {
       "name": "Goldfinger",
       "picture": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2901830629.webp",
       "info": "2023-12-30 (Mainland China) / Tony Leung / Andy Lau / Charlene Choi / Simon Yam / Alex Fong / Chan Ka Lok / Bai Zhi / Jiang Haowen / Pacific Insurance / Chin Ka Lok / Anita Yuen / Chow Jiayi / Sam Jiaqi / Li Jingjun / Ng Siu Hin / Ke Weilin / Feng Yongxian / Du Yaoyu/Li Jiancheng/Gu Yongfeng/Hong Kong, China/Mainland China/Chuang Wenqiang...",
       "score": "6.1",
       "commentsNumber": "135956 people commented"
     },
     {
       "name": "American novel",
       "picture": "https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2902166424.webp",
       "info": "2023-09-08(Toronto International Film Festival) / 2023-12-15(USA) / Jeffrey Wright/ Tracey Ellis Ross/ John Ortiz/ Issa· Ray/Sterling K. Brown/Erica Alexander/Leslie Goseth/Adam Brody/Keith David/Myra Lucretia Taylor/Raymond Anthony Thomas...",
       "score": "7.7",
       "commentsNumber": "26223 people commented"
     },
     {
       "name": "interest area",
       "picture": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2899514583.webp",
       "info": "2023-05-19 (Cannes Film Festival) / 2023-12-15 (USA) / Christian Fouridower / Sandra Wheeler / John Carter House / Ralph Herbert Erfurt/Freya Kreutzkamm/Max Baker/Imorgan Kugger/Stephanie Petrowicz/Ralph Zillman/Mary Rosa Tityan. ..",
       "score": "7.4",
       "commentsNumber": "24875 people commented"
     }
   ],
   "type": "multiple"
}

Comparison

Steps required for the traditional crawler to extract the information

const pendingTask = []
for (const filmHandle of filmHandleList) {
  const picturePending = filmHandle.$eval('td img', (img) => img.src)
  const namePending = filmHandle.$eval(
    'td:nth-child(2) a',
    (el) => el.innerText.split(' / ')[0]
  )
  const infoPending = filmHandle.$eval(
    'td:nth-child(2) .pl',
    (el) => el.textContent
  )
  const scorePending = filmHandle.$eval(
    'td:nth-child(2) .star .rating_nums',
    (el) => el.textContent
  )
  const commentsNumberPending = filmHandle.$eval(
    'td:nth-child(2) .star .pl',
    (el) => el.textContent?.replace(/\(|\)/g, '')
  )

  pendingTask.push([
    namePending,
    picturePending,
    infoPending,
    scorePending,
    commentsNumberPending
  ])
}

const filmInfoResult = []
let i = 0
for (const item of pendingTask) {
  Promise.all(item).then((res) => {
    const filmInfo = [
      'name',
      'picture',
      'info',
      'score',
      'commentsNumber'
    ].reduce<any>((pre, key, i) => {
      pre[key] = res[i]
      return pre
    }, {})

    filmInfoResult.push(filmInfo)

    if (pendingTask.length === ++i) {
      const filmResult = {
        elements: filmInfoResult,
        type: filmInfoResult.length > 1 ? 'multiple' : 'single'
      }

      console.log(filmResult)
    }
  })
}

It relies on fixed class names and structures, and the process is fairly cumbersome.

Steps required for the AI-assisted crawler to extract the information

const filmResult = await crawlOpenAIApp.parseElements(
  targetHTML,
  `This is a list of movies. You need to get the movie name (name), cover link (picture), introduction (info), rating (score), and number of comments (commentsNumber). Use bracketed words as attribute names`
)

A single sentence of description is all it takes.


  • Traditional crawlers have to rely on fixed class names and a series of cumbersome operations to obtain data. If the website is updated frequently, changes to class names or structure can break the crawling strategy, and the latest class names have to be looked up again and every operation updated before data can be crawled.
  • An AI-assisted crawler only needs a short description to obtain the required data more efficiently, intelligently, and conveniently. You can even pass the entire HTML to the AI and let it do the work; because whole pages are more complex, the content to extract has to be described more precisely and a large number of tokens will be consumed. However, even if later updates to the website change its class names or structure, the data can still be crawled normally, because we no longer rely on fixed class names or structures to locate and extract information; instead, the AI understands and parses the semantic information of the page and extracts the required data.

If more content is required, the traditional crawler needs more steps, while the AI-assisted crawler can do it by adding a few more sentences to the description, without worrying about whether the class names or structure of the website will change after the next update, as the sketch below shows.
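
For example, suppose the ranking page later also needs to provide the release year and the director. With the AI-assisted approach, only the instruction passed to parseElements has to grow (a sketch reusing the targetHTML from the example above; the two extra field names are only illustrative):

const filmResult = await crawlOpenAIApp.parseElements(
  targetHTML,
  `This is a list of movies. You need to get the movie name (name), cover link (picture), introduction (info), rating (score), number of comments (commentsNumber), release year (year) and director (director). Use bracketed words as attribute names`
)

console.log(filmResult)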

Summary

Traditional crawlers rely mainly on preset rules or patterns to crawl web page data, and they perform well on websites with stable structures and clear rules. However, with the rapid development of web technology and frequent updates to website structures, traditional crawlers face more and more challenges. Once the website structure changes, a traditional crawler usually needs its rules readjusted, and crawling may even fail outright, which greatly reduces its efficiency and accuracy.

In contrast, AI-assisted crawlers combine artificial intelligence technology to intelligently analyze the structure and semantics of web pages and adapt to changes in the website. Through technologies such as machine learning and natural language processing, AI-assisted crawlers can identify and learn features in web pages to more accurately locate and extract the required data. This allows AI-assisted crawlers to maintain efficient crawling capabilities when facing complex and changing website structures.

In general, traditional crawlers and AI-assisted crawlers each have their own applicable scenarios. For websites with a stable structure and clear rules, a traditional crawler may be the more economical and straightforward choice. For websites with complex structures and frequent updates, however, AI-assisted crawlers show clear advantages in flexibility and accuracy. When choosing, we need to weigh factors such as the specific crawling requirements, the characteristics of the website, and the resources we are willing to invest.

Resources

The crawlers used in the examples in this article all come from x-crawl. Whether you need a traditional crawler or an AI-assisted one, it has you covered, and it ships with many other useful features.

x-crawl

x-crawl is a flexible Node.js AI-assisted crawler library. Its flexible usage and powerful AI assistance make crawling work more efficient, intelligent, and convenient.

It consists of two parts:

  • Crawler: consists of the crawler API and various features, and works normally even without relying on AI.
  • AI: currently based on the large language models provided by OpenAI; the AI simplifies many tedious operations (see the sketch below).
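
For instance, the AI part can parse any HTML fragment on its own, not just HTML fetched by the crawler (a small sketch reusing the crawlOpenAIApp configured earlier in this article; the fragment itself is made up):

// A hand-written HTML fragment, made up for illustration
const html = `
  <ul>
    <li><a href="/movie/1">Old Fox</a><span class="score">8.1</span></li>
    <li><a href="/movie/2">Robot Dreams</a><span class="score">9.1</span></li>
  </ul>
`

crawlOpenAIApp
  .parseElements(html, 'This is a list of movies. Get the movie name (name) and rating (score).')
  .then((result) => console.log(result))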

Features

  • 🤖 AI assistance - Powerful AI assistance makes crawler work more efficient, intelligent, and convenient.
  • 🖋️ Flexible writing - A single crawling API suits multiple configurations, and each configuration method has its own advantages.
  • ⚙️ Multiple uses - Supports crawling dynamic pages, static pages, interface data, and file data.
  • ⚒️ Page control - Crawling dynamic pages supports automated operations, keyboard input, event handling, and more.
  • 👀 Device fingerprint - Zero or custom configuration to avoid fingerprinting that could identify and track us across different locations.
  • 🔥 Asynchronous and synchronous - Asynchronous and synchronous crawling modes are both supported without switching the crawling API.
  • ⏱️ Interval crawling - No interval, fixed interval, or random interval; decide whether to crawl with high concurrency.
  • 🔄 Failed retry - Customize the number of retries to avoid crawling failures caused by temporary problems (see the sketch after this list).
  • ➡️ Rotating proxy - Automatic proxy rotation on failed retries, with customizable error counts and HTTP status codes.
  • 🚀 Priority queue - A single crawling target can be given priority and crawled ahead of other targets.
  • 🧾 Crawl logging - Controllable crawling information is output as colored strings in the terminal.
  • 🦾 TypeScript - Ships with its own types and implements complete typing through generics.
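
As an illustration of the retry and interval features, a crawler application can be created with those options set (a hedged sketch: the maxRetry and intervalTime option names reflect my reading of the x-crawl documentation, so check the docs before relying on them):

import { createCrawl } from 'x-crawl'

// Retry each failed target up to 3 times and wait a random
// 2-3 seconds between crawls of successive targets.
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 3000, min: 2000 }
})

crawlApp.crawlPage('https://www.example.com').then((res) => {
  const { browser } = res.data
  browser.close()
})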

If you find x-crawl helpful, or if you simply like it, you can give the x-crawl repository a star on GitHub. Your support is the driving force behind our continuous improvement. Thank you!

x-crawl GitHub: https://github.com/coder-hxl/x-crawl

x-crawl documentation: https://coder-hxl.github.io/x-crawl/
