DEV Community

Mohsin Rashid

RAG Web Scraping

What I Built

I built a Retrieval-Augmented Generation (RAG) system that uses Ollama's nuextract model to extract specific content from HTML documents. The system first extracts the content from the HTML, splits it into chunks, and stores the chunk embeddings in a PostgreSQL database using PgVector. For each custom query, the relevant chunks are retrieved through vector search and processed by nuextract, which returns the requested data. Combining HTML scraping with vector search makes it possible to pull precise, useful data out of complex web pages.
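The first two steps described above (extracting text from HTML and splitting it into chunks) can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the notebook: the names `TextExtractor`, `html_to_text`, and `split_into_chunks`, as well as the chunk size and overlap values, are assumptions made for this example. The original project most likely relies on LangChain loaders and splitters instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def split_into_chunks(text, chunk_size=200, overlap=40):
    """Split text into character windows that overlap by `overlap` characters.

    The default sizes are illustrative assumptions, not values from the project.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and written to the vector store. The overlap keeps a sentence that straddles a chunk boundary retrievable from either side.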

Demo

Link to GitHub

  • Ollama: I used Ollama's nuextract model to generate embeddings from the HTML content and perform scraping operations based on custom queries.
  • PgVector: This tool helped store and manage the embeddings in PostgreSQL. I used PgVector to handle vector-based search and retrieval from the HTML data stored in the database.
  • PostgreSQL: The vectorized data from HTML content was stored in a PostgreSQL database, making it easy to scale and query for relevant data.
  • Docker: I utilized Docker to run PgVector in a containerized environment, which simplified the setup and ensured a smooth deployment process.
  • LangChain: LangChain was used to build the retrieval chain, connecting the embeddings with Ollama's nuextract model for efficient query processing and data extraction.
  • Jupyter Notebook: The project is designed to be run within a Jupyter Notebook, providing a convenient and interactive way to load, process, and query the data.
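To make the retrieval step concrete, the sketch below ranks stored chunks by cosine distance to a query vector, which is the quantity PgVector's `<=>` operator computes. It is an in-memory illustration under stated assumptions, not the project's code: `toy_embed` is a hypothetical word-count stand-in for real model embeddings, and `top_k` plays the role that an `ORDER BY ... LIMIT k` query would play in PostgreSQL.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Design choice for this sketch: an all-zero vector is treated as far from everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=2):
    """rows is a list of (chunk_text, vector) pairs; return the k closest chunk texts."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vector, row[1]))
    return [chunk for chunk, _ in ranked[:k]]


# Usage: index three chunks, then retrieve the best match for a query.
vocabulary = ["price", "product", "contact", "email"]
chunks = ["product price price", "contact email", "product"]
rows = [(chunk, toy_embed(chunk, vocabulary)) for chunk in chunks]
best = top_k(toy_embed("price", vocabulary), rows, k=1)
```

In the actual stack the vectors would come from the Ollama model and the sort would run inside PostgreSQL, so only the top matches travel back to the notebook before being handed to nuextract.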

Sample HTML Content

Image description

Code Demo

Image description

Final Thoughts

This project shows how a modern LLM combined with vector-based retrieval can extract meaningful information from HTML documents. Pairing PgVector with Ollama's nuextract supports web scraping use cases that range from automated data extraction to content aggregation.

Building this project was rewarding, especially exploring how vector embeddings and retrieval-augmented generation apply to real-world tasks like web scraping. PgVector, Ollama, LangChain, and the nuextract model form a toolset that can be extended to other AI applications that need to extract content from complex documents.

This submission is eligible for the following prize categories:

  1. Open-source Models from Ollama: This project utilizes Ollama's nuextract model for extracting structured data from HTML content.
  2. Vectorizer: The use of PgVector for storing and retrieving document embeddings qualifies this project for the Vectorizer Vibe category.
