shrey vijayvargiya

Crawling and Searching

Anycrawl.dev is a website that turns any website into LLM-ready data.

If we check the GitHub repository and the package.json for the scrape endpoint, we can find the crawlee package as a dependency. Crawlee is a relatively recent open-source library that crawls websites and parses the final data. More importantly, Crawlee stores the data in a local directory using the file system module.
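Here is a minimal sketch of how Crawlee could be wired up for that kind of scrape endpoint; the start URL and the request cap are placeholders, and by default the collected data lands on disk under ./storage, which matches the file-system behaviour mentioned above.

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 20, // placeholder cap so a demo run stays small
  async requestHandler({ request, $, enqueueLinks }) {
    // Store whatever we parsed; Crawlee writes this to ./storage on disk.
    await Dataset.pushData({
      url: request.loadedUrl,
      title: $('title').text(),
    });
    // Queue further links from the same site for crawling.
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']); // placeholder start URL
```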

Moving ahead with this concept, I am considering creating our own custom in-house web scraping and search APIs. If possible, I would also make them open source, add pricing for others to use them, and launch them on RapidAPI at the same time.

This could be the new SaaS idea for iHateReading, which already works on the important things: content, tools, and APIs.

Let's see, firecrawl.dev is another good example, endorsed by Vercel. But all of these APIs are priced in USD, mostly above $10/month, which is difficult for users in India and other countries. We need pricing in INR, or at least a low USD price, so that middle-class Indian developers can access the API and build more products around it; that is why I want to make this at low cost.

A few things to take care of: rate limiting, pricing, authentication, database, networking, and servers. I know all of them, but scaling to millions of users is where my lack of developer experience dents my confidence to execute. One thing we can do is take one step at a time, following the core steps defined below:

  • Simple scrape and internet search APIs (a minimal sketch follows this list)
  • Deploy APIs for the initial few users or for personal use
  • Build products around the API
  • Optimise and Scale the API
  • Add paywall, rate limiting and authentication for others to use
  • Launch API on other platforms and iHateReading
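To make step one concrete, here is a hedged sketch of what the first scrape endpoint could look like. Express, express-rate-limit, and cheerio are my own choices here, the /scrape route name and the limits are hypothetical, and the rate limiter previews the later paywall step.

```typescript
import express from 'express';
import rateLimit from 'express-rate-limit';
import * as cheerio from 'cheerio';

const app = express();

// A crude stand-in for the later rate-limiting step: 30 requests/minute per IP.
app.use(rateLimit({ windowMs: 60_000, max: 30 }));

// Step one: a single /scrape endpoint returning the page title and visible text.
app.get('/scrape', async (req, res) => {
  const url = String(req.query.url ?? '');
  if (!url.startsWith('http')) {
    res.status(400).json({ error: 'invalid url' });
    return;
  }
  try {
    const html = await (await fetch(url)).text(); // Node 18+ global fetch
    const $ = cheerio.load(html);
    res.json({
      url,
      title: $('title').text(),
      text: $('body').text().trim().slice(0, 2000), // truncated for the demo
    });
  } catch {
    res.status(502).json({ error: 'failed to fetch target page' });
  }
});

app.listen(3000);
```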

A few other reasons why I want a custom API:

  • Travel AI itinerary planner
  • AI jobs scraper/searcher
  • AI agents, like an n8n-style automation tool
  • Aggregator platform

All such products can be built easily and at low cost. Then we can integrate the APIs into the iHateReading Universe and Jobs Portals to make sure we scrape real-time data for devs to find jobs and get the latest packages.

A few APIs are needed apart from scraping, namely dedicated platform search APIs, for example:

  • Scrape Reddit (a sketch follows this list)
  • Search Bing, Yahoo, Safari
  • Search Medium, Dev.to, Twitter, Substack, YouTube
  • Scrape Google Maps location details, jobs, etc.
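As one example, Reddit already exposes public JSON for its listings by appending .json to a subreddit URL, so the first item can start without a headless browser. This is a sketch under that assumption; the User-Agent string is made up, and Reddit's rate limits and terms still apply.

```typescript
type RedditPost = { title: string; url: string; score: number };

// Fetch top posts from a subreddit via Reddit's public JSON listing.
async function scrapeSubreddit(sub: string, limit = 10): Promise<RedditPost[]> {
  const res = await fetch(
    `https://www.reddit.com/r/${sub}/top.json?limit=${limit}`,
    { headers: { 'User-Agent': 'ihatereading-scraper/0.1' } }, // placeholder UA
  );
  const json = await res.json();
  return json.data.children.map((c: any) => ({
    title: c.data.title,
    url: c.data.url,
    score: c.data.score,
  }));
}

// Usage: console.log(await scrapeSubreddit('webdev'));
```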

Well, that is a lot of thinking for now, and it's good to stop here and execute the first few steps for our own use case.

In addition, I am reading about how to build a custom web search, because of our recent blog on the AI Web Search Agent. One problem is that the AI Web Search Agent uses a third-party API, which means it's not FREE to use after the limit, and that also means we can't offer the API for others to use.

But one thing to understand while making custom web search APIs is how Perplexity works under the hood. At first, they relied heavily on Bing search, and later moved to custom infrastructure for web search, along with indexing and scraping.

Then I moved on to making a simple Bing search API, a Google image search API, and so on. Almost every scraping or search task needs the Chromium, Playwright, and Cheerio packages; Puppeteer is slow but still used a lot.
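For example, here is a hedged Playwright sketch of a Bing search scraper; the li.b_algo selector is an assumption about Bing's result markup at the time of writing and may break when the page changes.

```typescript
import { chromium } from 'playwright';

// Scrape Bing's result page for a query and return title/url pairs.
async function bingSearch(query: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(`https://www.bing.com/search?q=${encodeURIComponent(query)}`);
  // "li.b_algo" assumes Bing's current result markup.
  const results = await page.$$eval('li.b_algo h2 a', (links) =>
    links.map((a) => ({
      title: a.textContent ?? '',
      url: (a as HTMLAnchorElement).href,
    })),
  );
  await browser.close();
  return results;
}
```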

SerpAPI, SearchAPI, the Google Search API, and the Bing Search API are the already existing options, but again, making one ourselves is the task, and sincerely, offering it at low cost is the agenda.

Well, nothing much to say

See you at the next one

iHateReading


Originally published on iHateReading
