Scraping Content for LLM

Hello friends,

Welcome back to a new blog. For newcomers, I am Shrey; I write about software development and programming, and I currently run iHateReading.

The story begins a few months ago, when I was working on a new feature on the platform called Universo, which provides a collection of unique domains/websites/tools/products for developers across the globe. Meanwhile, I was also working on the Explore section of our website; both of these features need to scrape content from the web. The pipeline is roughly:

  • First, scrape content using RSS feeds
  • Store the data in a database
  • Refetch and repeat the cycle

First, I collected all the resources to build Universo, the collection of domains, and then gathered the RSS feed for each domain to fetch its RSS XML content.

If you're new to RSS: RSS feeds are standardised XML files that deliver real-time updates from websites, blogs, podcasts, and news sites directly to users through RSS readers, allowing efficient content tracking without manually visiting each site (see RSS on Wikipedia).

Once we have the RSS link, we can easily make an HTTP request to the URL to fetch the content.

const axios = require("axios");

// fetch the raw RSS XML for the given feed URL
const response = await axios.get(url.trim());

// filter the latest links from the returned XML
return response.data;
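To actually filter the latest links out of that XML, we need an XML parser. Here's a minimal sketch using fast-xml-parser; the package choice and the ten-item cutoff are my own assumptions, not something the original pipeline specifies:

const { XMLParser } = require("fast-xml-parser");

// xml is the response.data string fetched above
const parser = new XMLParser();
const feed = parser.parse(xml);

// RSS 2.0 nests entries under rss.channel.item;
// wrap in an array since a feed with a single item parses as an object
const items = [].concat(feed?.rss?.channel?.item ?? []);

// keep the latest links (a cutoff of 10 is an arbitrary choice)
const latestLinks = items.slice(0, 10).map((item) => item.link);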

Then I moved forward to scrape the content for 100+ websites using their RSS links and store the data in the database.

While fetching data using RSS feeds, I was quite shattered to find that even top websites often don't have RSS feeds. If you are running a content platform, make sure to add an RSS feed, as it can help improve your search rankings on Google.

For websites that do not have proper RSS feeds, we need to either scrape the content or use their respective APIs.

That's how scraping came to my mind as quite an interesting problem statement to work on. Scraping a URL or the web is still quite a task; most websites use robots.txt, block unwanted IPs, and do more to detect bots and prevent content scraping.

A simple scrape looks like the code below.

const { load } = require("cheerio");

const response = await axios.get(url.trim());
const data = response.data;

// load the raw HTML into cheerio for parsing
const $ = load(data);

// add an HTML parser to convert the content into markdown
return $.html();


Now that we are able to fetch the data, I'll move ahead to another problem statement.

Problem statement: *URL content into LLM-ready markdown format*

This is a problem I've faced while copy-pasting content from websites into ChatGPT; sometimes the links are not correctly formatted. A good solution is to build our own scraper that turns a URL into LLM-ready markdown, and here we go.

A few packages to use:

  • Puppeteer and Playwright for heavy websites
  • Cheerio for small static websites
  • Markdown conversion packages
  • Proxy rotation using proxy-chain
  • JSDOM and DOM-related packages
  • HTTP request packages like Axios and Undici

Then, using the above packages, I created a simple endpoint to scrape any URL and convert it into markdown.
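Here's a minimal sketch of such an endpoint, assuming Express, Axios, Cheerio, and turndown; the route name and response shape are my own choices, not the exact Scrapify implementation:

const express = require("express");
const axios = require("axios");
const { load } = require("cheerio");
const TurndownService = require("turndown");

const app = express();
const turndown = new TurndownService();

// hypothetical route: GET /scrape?url=https://example.com
app.get("/scrape", async (req, res) => {
  try {
    const { data } = await axios.get(String(req.query.url).trim());
    const $ = load(data);

    // strip obvious non-content elements before converting
    $("script, style, noscript").remove();

    const markdown = turndown.turndown($("body").html() || "");
    res.json({ markdown });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000);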

But getting the HTML and data from a URL is not so hard a task; the hardest part is parsing the HTML into markdown-ready content while taking care of unwanted text from headers, navbars, asides, sidebars, and footers.

To remove those elements, I work with class names and IDs, deleting a list of selectors from the fetched HTML.

const selectorsToRemove = [
  "header", "footer", "nav", "aside",
  ".header", ".top", ".navbar", "#header",
  ".footer", ".bottom", "#footer",
  ".sidebar", ".side", ".aside", "#sidebar",
  ".modal", ".popup", "#modal", ".overlay",
  ".ad", ".ads", ".advert", "#ad",
  ".lang-selector", ".language", "#language-selector",
  ".social", ".social-media", ".social-links", "#social",
  ".menu", ".navigation", "#nav",
  ".breadcrumbs", "#breadcrumbs",
  ".share", "#share",
  ".widget", "#widget",
  ".cookie", "#cookie",
  "script", "style", "noscript",
];


// remove each matching element (assumes a DOM context such as JSDOM or a headless browser)
selectorsToRemove.forEach((sel) => {
  document.querySelectorAll(sel).forEach((el) => el.remove());
});
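The snippet above assumes a DOM context (a headless browser or JSDOM, both in the package list earlier). If you're working with Cheerio instead, the same cleanup is a one-liner, where $ is the result of Cheerio's load(html):

// Cheerio equivalent of the DOM removal above
selectorsToRemove.forEach((sel) => $(sel).remove());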

Check out the product iHateReading Scrapify

A few things to pay attention to, along with a few learnings:

  • To handle blocked IPs, I prefer using a rotating proxy setup (see the sketch after this list)
  • Respect robots.txt when fetching content from a website
  • Headless Chrome packages like Puppeteer are a bit slow, but they are essential for heavy JS-rendered websites built with React, Next, or Svelte
  • Cheerio is still the best parser to grab HTML elements
  • Markdown parsing from HTML is still tough, but a few packages do the work well: turndown, html-to-md, and dom-to-semantic-markdown
  • Block images, heavy JS, ads, and more via Chrome settings to scrape content from the internet quickly (also shown in the sketch below)
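To illustrate the proxy rotation and resource blocking points above, here's a rough Puppeteer sketch; the upstream proxy URL is a placeholder, and proxy-chain's anonymizeProxy wraps an authenticated proxy into a local one that Chrome can use:

const puppeteer = require("puppeteer");
const proxyChain = require("proxy-chain");

async function scrapeWithProxy(url) {
  // placeholder upstream proxy; swap in one from your rotation pool
  const upstream = "http://user:pass@proxy.example.com:8000";
  const proxyUrl = await proxyChain.anonymizeProxy(upstream);

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`],
  });
  const page = await browser.newPage();

  // block images, fonts, and stylesheets to speed up scraping
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    if (["image", "font", "stylesheet"].includes(req.resourceType())) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();

  await browser.close();
  await proxyChain.closeAnonymizedProxy(proxyUrl, true);
  return html;
}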

More to come: I've added web search as well. If we can fetch content from a single URL, we can fetch content from web search engines such as Google, Bing, Yahoo, Yandex, and more. Using the same technique, I've released another feature to scrape web search content, as shown in the image below.

Scrapify website screenshot

I am thinking of releasing the API for others to use and build products with. If you want the API, do let me know in the comments section.

Have a good day, and do check out the product.

iHateReading Scrapify
