Ogbotemi Ogungbamila

AI-pipe: Pipeline for generating/storing embeddings from AI models to DB with data scraped from sites using custom scripts

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

What I Built

A web page to quickly create a pipeline to feed AI models data scraped from a provided webpage.

Features

Custom scripting

Custom scripts, built from the provided templates, give total control over the kind, type and form of data scraped from web pages.

Embeddings generation

The web service supports generating embeddings from OpenAI and Ollama AI models. For users without access to AI models running on a remote server, it provides a fallback through PostgresML.
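To make the two hosted options concrete, here is a minimal sketch of how a request to each embeddings endpoint could be assembled. It is not the project's actual code: the helper names `buildEmbeddingRequest` and `extractEmbedding`, the default model names and the local Ollama host are all assumptions chosen for illustration.

```javascript
// Hypothetical helper: builds the HTTP request for an embeddings call.
// The default models and the Ollama host are assumptions, not project values.
function buildEmbeddingRequest(provider, text, { apiKey, model, host } = {}) {
  if (provider === 'openai') {
    return {
      url: 'https://api.openai.com/v1/embeddings',
      init: {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({ model: model || 'text-embedding-3-small', input: text }),
      },
    };
  }
  if (provider === 'ollama') {
    return {
      url: `${host || 'http://localhost:11434'}/api/embeddings`,
      init: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model: model || 'nomic-embed-text', prompt: text }),
      },
    };
  }
  throw new Error(`Unsupported provider: ${provider}`);
}

// Reads the vector out of the provider-specific response body.
function extractEmbedding(provider, json) {
  return provider === 'openai' ? json.data[0].embedding : json.embedding;
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`, and the parsed JSON response to `extractEmbedding`. Keeping the request construction separate from the network call makes it easy to swap providers without touching the rest of the pipeline.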

Barebones RAG (Retrieval-Augmented Generation)

It also supports writing each generated embedding, along with its input prompt, to a PostgreSQL database, so the vectors can be used for semantic search, product listing and similar applications.
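As a rough illustration of the storage and search steps, the sketch below builds parameterized queries in the `{ text, values }` shape accepted by node-postgres. It assumes the pgvector extension is installed and uses a hypothetical `documents` table; neither the table name nor the column names come from the project.

```javascript
// Hypothetical schema (assumes the pgvector extension):
//   CREATE TABLE documents (id serial PRIMARY KEY, prompt text, embedding vector(1536));

// pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding) {
  const valid = Array.isArray(embedding)
    && embedding.length > 0
    && embedding.every(value => Number.isFinite(value));
  if (!valid) throw new TypeError('embedding must be a non-empty array of finite numbers');
  return `[${embedding.join(',')}]`;
}

// Stores the input prompt next to its embedding.
function insertQuery(prompt, embedding) {
  return {
    text: 'INSERT INTO documents (prompt, embedding) VALUES ($1, $2::vector)',
    values: [prompt, toVectorLiteral(embedding)],
  };
}

// Semantic search: <=> is pgvector's cosine distance operator, smallest first.
function searchQuery(queryEmbedding, limit = 5) {
  return {
    text: 'SELECT prompt FROM documents ORDER BY embedding <=> $1::vector LIMIT $2',
    values: [toVectorLiteral(queryEmbedding), limit],
  };
}
```

With a connected node-postgres client, `client.query(insertQuery(prompt, embedding))` would store a row and `client.query(searchQuery(embedding))` would return the closest prompts. Passing the vector as a bound parameter rather than concatenating it into the SQL keeps the queries safe from injection.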

Demo

How I Used Bright Data

Scraping browser

I used Puppeteer with a WebSocket URL that points to a browser provided by Bright Data to open websites, then mutate and traverse the DOM while applying custom scripts to scrape data from it.

Here is the code that handles the above:

const puppeteer = require('puppeteer-core'),
      path   = require('path'),
      fs     = require('fs'),
      both = require('../js/both'),
      file   = path.join(require('../utils/rootDir')(__dirname), './config.json'),
      config = fs.existsSync(file)&&require(file)||{...process.env};

module.exports = function(request, response) {
  let { data } = request.body, result;
  let { nodes, url } = data = JSON.parse(data),
  /**serialize the needed function in the imported object for usage in puppeteer */
      _both = { asText: both.asText.toString() };

  puppeteer.connect({
    browserWSEndpoint: config.BROWSER_WS,
  }).then(async browser => {
    try {
      const page = await browser.newPage();

      await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36');
      await page.goto(url, { waitUntil:'load', timeout:1e3*60 });

      /** everything inside this callback runs in the remote page, so only serializable arguments cross over */
      return await page.evaluate((nodes, both) => {
        /** convert serialized function string back into a function to execute it */
        both.asText = new Function(`return ${both.asText}`)();
        /**remove needless nodes from the DOM */
        document.head.remove();
        ['link', 'script', 'style', 'svg'].forEach(tag=>document.body.querySelectorAll(tag).forEach(el=>el.remove()));
        /**defined "node" - the variable present in the dynamic scripts locally to make it available in the 
          custom function context when created with new Function */
        let page = {}, node, fxns = Object.keys(nodes).map(key=>
          /**slip in the local variable - page and prepend a return keyword to make the function string work 
           * as expected when made into a function
          */
          nodes[key] = new Function(`return ${nodes[key].replace(',', ', page, ')}`)()
        );
        /** apply the functions for the nodes to retrieve data as the DOM is being traversed */
        both.asText(document.body, (_node, res)=>fxns.find(fxn=>res=fxn(node=_node, page)) && /*handle fetching media assets later here*/res || '');
        return page
      }, nodes, _both);
    } finally {
      /** always release the remote browser session, even when navigation or scraping fails */
      await browser.close().catch(()=>{});
    }
  }).then(page=>result = page)
  .catch((err, str, name)=>{
    str = err.toString(), name = err.constructor.name, result = {
      error: /^\[/.test(str) ? `${name}: A sudden network disconnect occurred` : str
    }
  }).finally( ()=> {
    response.json(result)
  })
}
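
The handler above sends `both.asText` into `page.evaluate` as a string because only serializable values can cross from Node into the page. The round trip can be tried in isolation with the sketch below, which runs in plain Node; `collapse` and `revive` are stand-in names invented for this illustration.

```javascript
// Stand-in for both.asText. It must be a function expression (or arrow function):
// method shorthand such as `collapse(text) {}` would not parse after `return`.
// It also must not depend on variables from the scope it was defined in.
const helpers = {
  collapse: function (text) { return text.replace(/\s+/g, ' ').trim(); },
};

// 1. Node side: turn the function into source text so it can be sent along.
const serialized = { collapse: helpers.collapse.toString() };

// 2. Page side (simulated here): rebuild the function from its source text.
function revive(source) {
  return new Function(`return ${source}`)();
}

const collapse = revive(serialized.collapse);
console.log(collapse('  Bright   Data \n scraping  ')); // prints "Bright Data scraping"
```

Functions rebuilt with `new Function` only see the global scope and their own parameters, which is why the handler passes `page` in explicitly instead of relying on a closure.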