Scraping Content for LLM

Hello friends,

Welcome back to a new blog. For newcomers, I am Shrey; I write about software development and programming, and I currently run iHateReading.

The story begins a few months ago, when I was working on a new feature on the platform called Universo, which provides a collection of unique domains/websites/tools/products for developers across the globe. Meanwhile, I was also working on the Explore section of our website; both of these features need to scrape content from the web. The pipeline is roughly:

  • First, scrape content using RSS feeds
  • Store the data in a database
  • Refetch and repeat the cycle

First, I collected all the resources to build Universo, the collection of domains, and then gathered the RSS feed for each domain to fetch its RSS XML content.

If you're new to RSS: RSS feeds are standardised XML files that deliver real-time updates from websites, blogs, podcasts, and news sites directly to users through RSS readers, allowing efficient content tracking without manually visiting each site (see RSS on Wikipedia).

Once we have the RSS link, we can easily make an HTTP request to the URL to fetch the content.

const axios = require("axios");

// fetch the raw RSS XML for the given feed URL
const response = await axios.get(url.trim());

// filter the latest links from the returned XML
return response.data;
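To actually filter the latest links out of that XML, we need an XML parser. Here's a minimal sketch using fast-xml-parser; the package choice and the ten-item cutoff are my own assumptions, not something the original pipeline specifies:

const { XMLParser } = require("fast-xml-parser");

// xml is the response.data string fetched above
const parser = new XMLParser();
const feed = parser.parse(xml);

// RSS 2.0 nests entries under rss.channel.item;
// wrap in an array since a feed with a single item parses as an object
const items = [].concat(feed?.rss?.channel?.item ?? []);

// keep the latest links (a cutoff of 10 is an arbitrary choice)
const latestLinks = items.slice(0, 10).map((item) => item.link);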

Then I moved forward to scrape the content for 100+ websites using their RSS links and store the data in the database.

While fetching data using RSS feeds, I was quite shattered to find that even top websites often don't have RSS feeds. If you are running a content platform, make sure to add an RSS feed, as it can help improve your search rankings on Google.

For websites that do not have proper RSS feeds, we need to either scrape the content or use their respective APIs.

That's how scraping came to my mind as quite an interesting problem statement to work on. Scraping a URL or the web is still quite a task; most websites use robots.txt, block unwanted IPs, and do more to detect bots and prevent content scraping.

A simple scrape looks like the code below.

const { load } = require("cheerio");

const response = await axios.get(url.trim());
const data = response.data;

// load the raw HTML into cheerio for parsing
const $ = load(data);

// add an HTML parser to convert the content into markdown
return $.html();


Now that we are able to fetch the data, I'll move ahead to another problem statement.

Problem statement: *URL content into LLM-ready markdown format*

This is a problem I've faced while copy-pasting content from websites into ChatGPT; sometimes the links are not correctly formatted. A good solution is to build our own scraper that turns a URL into LLM-ready markdown, and here we go.

A few packages to use:

  • Puppeteer and Playwright for heavy websites
  • Cheerio for small static websites
  • Markdown conversion packages
  • Proxy rotation using proxy-chain
  • JSDOM and DOM-related packages
  • HTTP request packages like Axios and Undici

Then, using the above packages, I created a simple endpoint to scrape any URL and convert it into markdown.
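Here's a minimal sketch of such an endpoint, assuming Express, Axios, Cheerio, and turndown; the route name and response shape are my own choices, not the exact Scrapify implementation:

const express = require("express");
const axios = require("axios");
const { load } = require("cheerio");
const TurndownService = require("turndown");

const app = express();
const turndown = new TurndownService();

// hypothetical route: GET /scrape?url=https://example.com
app.get("/scrape", async (req, res) => {
  try {
    const { data } = await axios.get(String(req.query.url).trim());
    const $ = load(data);

    // strip obvious non-content elements before converting
    $("script, style, noscript").remove();

    const markdown = turndown.turndown($("body").html() || "");
    res.json({ markdown });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000);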

But getting the HTML and data from a URL is not so hard a task; the hardest part is parsing the HTML into markdown-ready content while taking care of unwanted text from headers, navbars, asides, sidebars, and footers.

To remove those elements, I work with class names and IDs, deleting a list of selectors from the fetched HTML.

const selectorsToRemove = [
  "header", "footer", "nav", "aside",
  ".header", ".top", ".navbar", "#header",
  ".footer", ".bottom", "#footer",
  ".sidebar", ".side", ".aside", "#sidebar",
  ".modal", ".popup", "#modal", ".overlay",
  ".ad", ".ads", ".advert", "#ad",
  ".lang-selector", ".language", "#language-selector",
  ".social", ".social-media", ".social-links", "#social",
  ".menu", ".navigation", "#nav",
  ".breadcrumbs", "#breadcrumbs",
  ".share", "#share",
  ".widget", "#widget",
  ".cookie", "#cookie",
  "script", "style", "noscript",
];


// remove each matching element (assumes a DOM context such as JSDOM or a headless browser)
selectorsToRemove.forEach((sel) => {
  document.querySelectorAll(sel).forEach((el) => el.remove());
});
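The snippet above assumes a DOM context (a headless browser or JSDOM, both in the package list earlier). If you're working with Cheerio instead, the same cleanup is a one-liner, where $ is the result of Cheerio's load(html):

// Cheerio equivalent of the DOM removal above
selectorsToRemove.forEach((sel) => $(sel).remove());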

Check out the product iHateReading Scrapify

A few things to pay attention to, along with a few learnings:

  • To handle blocked IPs, I prefer using a rotating proxy setup (see the sketch after this list)
  • Respect robots.txt when fetching content from a website
  • Headless Chrome packages like Puppeteer are a bit slow, but they are essential for heavy JS-rendered websites built with React, Next, or Svelte
  • Cheerio is still the best parser to grab HTML elements
  • Markdown parsing from HTML is still tough, but a few packages do the work well: turndown, html-to-md, and dom-to-semantic-markdown
  • Block images, heavy JS, ads, and more via Chrome settings to scrape content from the internet quickly (also shown in the sketch below)
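To illustrate the proxy rotation and resource blocking points above, here's a rough Puppeteer sketch; the upstream proxy URL is a placeholder, and proxy-chain's anonymizeProxy wraps an authenticated proxy into a local one that Chrome can use:

const puppeteer = require("puppeteer");
const proxyChain = require("proxy-chain");

async function scrapeWithProxy(url) {
  // placeholder upstream proxy; swap in one from your rotation pool
  const upstream = "http://user:pass@proxy.example.com:8000";
  const proxyUrl = await proxyChain.anonymizeProxy(upstream);

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`],
  });
  const page = await browser.newPage();

  // block images, fonts, and stylesheets to speed up scraping
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    if (["image", "font", "stylesheet"].includes(req.resourceType())) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();

  await browser.close();
  await proxyChain.closeAnonymizedProxy(proxyUrl, true);
  return html;
}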

More to come: I've added web search as well. If we can fetch content from a single URL, we can fetch content from web search engines such as Google, Bing, Yahoo, Yandex, and more. Using the same technique, I've released another feature to scrape web search content, as shown in the image below.

Scrapify website screenshot

I am thinking of releasing the API for others to use and build products with. If you want the API, do let me know in the comments section.

Have a good day, and do check out the product.

iHateReading Scrapify
