<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ruchika Atwal</title>
    <description>The latest articles on DEV Community by Ruchika Atwal (@ruchikaatwal).</description>
    <link>https://dev.to/ruchikaatwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F664844%2F255f7f0d-2db4-401d-b123-918e2c0c2ea9.jpg</url>
      <title>DEV Community: Ruchika Atwal</title>
      <link>https://dev.to/ruchikaatwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ruchikaatwal"/>
    <language>en</language>
    <item>
      <title>Web Crawling and Scraping: Traditional Approaches vs. LLM Agents</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Wed, 18 Dec 2024 17:19:01 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/web-crawling-and-scraping-traditional-approaches-vs-llm-agents-4ldk</link>
      <guid>https://dev.to/ruchikaatwal/web-crawling-and-scraping-traditional-approaches-vs-llm-agents-4ldk</guid>
      <description>&lt;p&gt;Web crawling and scraping are essential for gathering structured data from the internet. Traditional techniques have dominated the field for years, but the rise of Large Language Models (LLMs) like OpenAI’s GPT has introduced a new paradigm. Let’s explore the differences, advantages, and drawbacks of these approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Traditional Web Crawling &amp;amp; Scraping&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How It Works:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traditional approaches rely on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code-driven frameworks like Scrapy, Beautiful Soup, and Selenium.&lt;/li&gt;
&lt;li&gt;Parsing HTML structures using CSS selectors, XPath, or regular expressions.&lt;/li&gt;
&lt;li&gt;Rule-based logic for task automation.&lt;/li&gt;
&lt;/ul&gt;
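The rule-based style above can be sketched in a few lines. This is a minimal, illustrative example using only Python's standard library (`xml.etree.ElementTree`, which supports a small XPath subset) on a hard-coded, well-formed fragment; a real project would use Scrapy or Beautiful Soup as listed, and the product markup here is invented for the demo.

```python
# Minimal sketch of rule-based extraction with an XPath subset, using only
# the standard library on a hard-coded, well-formed HTML fragment.
# Real crawlers would use Scrapy or Beautiful Soup, as noted above.
import xml.etree.ElementTree as ET

HTML = """
<html>
  <body>
    <ul id="products">
      <li><a href="/item/1">Widget</a><span class="price">9.99</span></li>
      <li><a href="/item/2">Gadget</a><span class="price">19.99</span></li>
    </ul>
  </body>
</html>
"""

def extract_products(html: str) -> list:
    """Apply fixed, rule-based XPath expressions to pull structured rows."""
    root = ET.fromstring(html)
    rows = []
    for li in root.findall(".//ul[@id='products']/li"):
        rows.append({
            "name": li.find("a").text,
            "url": li.find("a").get("href"),
            "price": float(li.find("span[@class='price']").text),
        })
    return rows

print(extract_products(HTML))
```

The hard-coded selectors are exactly what makes this approach fast on stable layouts and brittle when the markup changes.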

&lt;h3&gt;
  
  
  &lt;strong&gt;Advantages:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Efficient for predictable websites: Handles structured websites with consistent layouts.&lt;/li&gt;
&lt;li&gt;Customizability: Code can be tailored to specific needs.&lt;/li&gt;
&lt;li&gt;Cost-effective: Does not require extensive computational resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Drawbacks:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Brittle to changes: Fails when website layouts change.&lt;/li&gt;
&lt;li&gt;High development time: Requires expertise to handle edge cases (e.g., CAPTCHAs, dynamic content).&lt;/li&gt;
&lt;li&gt;Scalability issues: Struggles with large-scale, unstructured, or diverse data sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;LLM Agents for Web Crawling &amp;amp; Scraping&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How It Works:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LLM agents use natural language instructions and reasoning to interact with websites dynamically. They can infer patterns, adapt to changes, and execute tasks without hard-coded rules. Examples include tools like LangChain or Auto-GPT for multi-step workflows.&lt;/p&gt;
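The control flow described above can be sketched as follows. Note the heavy caveat: `call_llm` is a stand-in stub, not a real model client; in practice it would be a call through LangChain or a provider API. Only the shape of the interaction is shown: a plain-language goal plus the raw page go in, structured data comes out, with no hard-coded selectors.

```python
# Hedged sketch of an LLM-agent extraction step. `call_llm` is a stub that
# stands in for a real model API call; everything else shows only the
# control flow: plain-language instruction + raw HTML in, JSON out.
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g., via LangChain or an API client).
    # It just demonstrates the expected contract: prompt in, JSON text out.
    return json.dumps({"title": "Example Product", "price": 9.99})

def agent_extract(page_html: str, instruction: str) -> dict:
    """Ask the model to do the extraction instead of writing selectors."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Page HTML:\n{page_html}\n"
        "Respond with a JSON object only."
    )
    return json.loads(call_llm(prompt))

result = agent_extract("<html>...</html>", "Extract the product title and price.")
print(result)
```

Because the selectors live in the prompt rather than the code, a layout change does not require reprogramming, but it does mean every page costs a model call.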

&lt;h3&gt;
  
  
  &lt;strong&gt;Advantages:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic adaptability: LLMs adapt to layout changes without reprogramming.&lt;/li&gt;
&lt;li&gt;Reduced technical barrier: Non-experts can instruct agents with plain language.&lt;/li&gt;
&lt;li&gt;Multi-tasking: Simultaneously extract data, classify, summarize, and clean it.&lt;/li&gt;
&lt;li&gt;Intelligent decision-making: LLMs infer contextual relationships, such as prioritizing important links or understanding ambiguous data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Drawbacks:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High computational cost: LLMs are resource-intensive.&lt;/li&gt;
&lt;li&gt;Limited precision: They may misinterpret website structures or generate hallucinated results.&lt;/li&gt;
&lt;li&gt;Dependence on training data: Performance varies depending on LLM training coverage.&lt;/li&gt;
&lt;li&gt;API costs: Running LLM-based scraping incurs additional API usage fees.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  &lt;u&gt;&lt;strong&gt;When to Use Traditional Approaches vs. LLM Agents&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
    &lt;th&gt;Scenario&lt;/th&gt;
    &lt;th&gt;Traditional&lt;/th&gt;
    &lt;th&gt;LLM Agents&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td&gt;Static, well-structured sites&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
    &lt;td&gt;✘&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td&gt;Dynamic or unstructured sites&lt;/td&gt;
    &lt;td&gt;✘&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td&gt;Scalability required&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td&gt;Complex workflows (e.g., NLP)&lt;/td&gt;
    &lt;td&gt;✘&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td&gt;Cost-sensitive projects&lt;/td&gt;
    &lt;td&gt;✔&lt;/td&gt;
    &lt;td&gt;✘&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;&lt;/u&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use traditional methods for tasks requiring precision, cost-efficiency, and structure.&lt;/li&gt;
&lt;li&gt;Opt for LLM agents when dealing with dynamic, unstructured, or context-sensitive data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future lies in hybrid models, combining the predictability of traditional approaches with the adaptability of LLMs to create robust and scalable solutions.&lt;/p&gt;
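The hybrid direction mentioned above can be sketched as "rules first, model as fallback". Both extractors here are illustrative stubs: the rule is a toy regex, and `llm_fallback` stands in for a real model call.

```python
# Sketch of a hybrid pipeline: try a cheap rule-based extractor first, and
# fall back to an LLM agent only when the rules fail. Both extractors are
# illustrative stubs, not production implementations.
import re

def rule_based(html: str):
    # Cheap, deterministic rule: works only while the layout holds.
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": match.group(1)} if match else None

def llm_fallback(html: str) -> dict:
    # Stand-in for an LLM call; a real system would send the page to a model.
    return {"title": "recovered-by-llm"}

def hybrid_extract(html: str) -> dict:
    return rule_based(html) or llm_fallback(html)

print(hybrid_extract("<h1>Hello</h1>"))    # rules succeed, no model cost
print(hybrid_extract("<div>Hello</div>"))  # rules fail, LLM path taken
```

The appeal of this split is economic: the expensive, adaptive path is only paid for on the pages where the cheap path breaks.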

</description>
      <category>llmagents</category>
      <category>llm</category>
      <category>crawling</category>
    </item>
    <item>
&lt;title&gt;Web Crawling, Web Scraping, and Its Challenges&lt;/title&gt;
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Thu, 11 May 2023 15:46:40 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/web-crawling-web-scraping-and-its-challenges-3k94</link>
      <guid>https://dev.to/ruchikaatwal/web-crawling-web-scraping-and-its-challenges-3k94</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to Web crawling &amp;amp; Web scraping :
&lt;/h2&gt;

&lt;p&gt;Web crawling and web scraping are two related techniques used for extracting data from websites, but they have distinct differences in their methodology and purpose.&lt;/p&gt;

&lt;p&gt;Web crawling is the automated process of navigating through web pages using a software program called a crawler or spider. The crawler visits web pages and indexes their content, links, and metadata, which is then stored in a database for further analysis or retrieval. Web crawling is often used for web search engines, where the crawler collects data from a large number of web pages to build an index for search queries.&lt;/p&gt;

&lt;p&gt;On the other hand, web scraping is the process of extracting specific data from web pages using automated software tools. Web scraping tools can extract data from various sources, including text, images, and videos, and transform it into a structured format such as a CSV, JSON, or XML file. Web scraping is often used for data mining, market research, and content aggregation, where the goal is to gather and analyze data from multiple websites.&lt;/p&gt;

&lt;p&gt;Web scraping and web crawling have some similarities, as both techniques involve automated software tools that interact with web pages. However, web scraping is more focused on data extraction, while web crawling is more focused on indexing and navigation. Additionally, web scraping can be more targeted and specific, while web crawling is more broad and general.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;There are many examples of how web crawling and web scraping can be used in various industries and applications. Here are a few examples :&lt;/u&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E-commerce :&lt;/strong&gt; &lt;br&gt;
Web scraping can be used to extract product information from e-commerce websites such as Amazon or eBay. This data can be used for price monitoring, market analysis, or inventory management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Social media :&lt;/strong&gt; &lt;br&gt;
Web scraping can be used to collect user-generated content from social media platforms such as Twitter or Instagram. This data can be used for sentiment analysis, marketing research, or customer engagement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Financial services :&lt;/strong&gt; &lt;br&gt;
Web scraping can be used to extract financial data from stock market websites or financial news portals. This data can be used for investment analysis, risk management, or financial modelling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;News media :&lt;/strong&gt; &lt;br&gt;
Web scraping can be used to collect news articles from various news websites such as BBC or CNN. This data can be used for media monitoring, trend analysis, or content curation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Note : &lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
However, it is important to note that web scraping should be conducted ethically and legally, respecting the terms of service of the target websites and the privacy rights of the users.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;u&gt;Challenges in Web crawling :&lt;/u&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Web crawling can present several challenges that can affect the efficiency, accuracy, and legality of the crawling process. Here are some of the common challenges faced in web crawling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Website blocking :&lt;/strong&gt;&lt;br&gt;
Some websites may use technologies such as CAPTCHAs, IP blocking, or user-agent detection to prevent automated access. This can make it difficult or impossible for the crawler to access the website.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data parsing :&lt;/strong&gt;&lt;br&gt;
Web pages can contain complex and unstructured data, which can make it difficult to extract relevant information. Moreover, some websites may use dynamic or AJAX-based content, which can require advanced techniques such as JavaScript rendering or browser emulation to extract data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data quality :&lt;/strong&gt;&lt;br&gt;
Web pages can contain duplicate, incomplete, or inaccurate data, which can affect the validity and reliability of the extracted data. Moreover, some websites may use anti-scraping measures such as honeypots or fake data to mislead the crawlers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Legal and ethical issues :&lt;/strong&gt;&lt;br&gt;
Web crawling can raise legal and ethical concerns such as copyright infringement, privacy violation, or web spamming. Crawlers should respect the terms of service of the target websites, obtain permission from the website owners, and apply ethical scraping practices such as rate limiting, respectful behaviour, and user-agent identification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability and performance :&lt;/strong&gt;&lt;br&gt;
Web crawling can require significant computational resources, bandwidth, and storage capacity, especially when dealing with large or distributed websites. Moreover, web crawling can be time-sensitive, requiring real-time updates or continuous monitoring of the target websites.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;Solutions for web crawler blocking :&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;There are several solutions for website blocking that can help web crawlers to overcome access restrictions and avoid being detected as automated bots. Here are some of the common solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proxy rotation :&lt;/strong&gt;&lt;br&gt;
Web crawlers can use a pool of rotating proxies to change their IP address and avoid being detected as coming from a single source. This can also help to distribute the crawling load across multiple IP addresses and reduce the risk of being blacklisted. Various proxy service providers exist; choose one according to your use case and cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User-agent customisation :&lt;/strong&gt;&lt;br&gt;
Web crawlers can customize their user-agent string to mimic the behaviour of a real user agent, such as a web browser. This can help to avoid being detected as a bot and enable access to websites that block bots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delay and throttling :&lt;/strong&gt;&lt;br&gt;
Web crawlers can introduce a delay or a throttling mechanism between requests to simulate the behaviour of a human user and avoid triggering anti-scraping measures such as rate limiting or traffic spikes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CAPTCHA solving :&lt;/strong&gt;&lt;br&gt;
Web crawlers can use CAPTCHA solving services to automatically solve CAPTCHAs and gain access to websites that use them. However, this solution may require additional computational resources and incur additional costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser emulation :&lt;/strong&gt;&lt;br&gt;
Web crawlers can use headless browsers or browser emulators to simulate the behaviour of a real web browser and enable access to websites that use JavaScript or AJAX-based content. This can help to extract data that is not accessible through traditional web crawling techniques.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
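Three of the mitigations above (proxy rotation, a custom User-Agent, and delay/throttling) can be combined in one small sketch. The proxy addresses are placeholders and `fetch` deliberately returns the request plan instead of doing network I/O, so this only illustrates the rotation and throttling logic, not a real HTTP client.

```python
# Illustrative sketch of proxy rotation + custom User-Agent + polite delays.
# Proxy addresses are placeholders; `fetch` builds a request plan instead of
# performing real I/O, so only the anti-blocking logic is demonstrated.
import itertools
import random
import time

PROXIES = itertools.cycle(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ..."}  # browser-like UA

def fetch(url: str) -> dict:
    """Placeholder fetch: returns the request plan instead of doing I/O."""
    proxy = next(PROXIES)                  # rotate to the next proxy each request
    time.sleep(random.uniform(0.0, 0.01))  # throttle (tune per target site)
    return {"url": url, "proxy": proxy, "headers": HEADERS}

plan = [fetch(f"https://example.com/page/{i}") for i in range(4)]
print([p["proxy"] for p in plan])  # cycles back to the first proxy
```

In a real crawler the delay would be on the order of seconds, chosen to respect the target site's rate limits.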

&lt;p&gt;It is important to note that some of these solutions may have legal and ethical implications and should be used with caution, respecting the terms of service of the target websites and the privacy rights of the users. Moreover, web crawlers should always monitor their performance and adjust their strategies according to the changing environment and the feedback from the target websites.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;Useful points for crawling large websites :&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;If you are trying to crawl a large website, there are several techniques you can use to make the process more efficient. Here are some potential solutions to address the challenges of crawling large websites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use parallel processing :&lt;/strong&gt;&lt;br&gt;
One way to speed up the crawling process is to use parallel processing. This involves splitting the crawling process across multiple threads or processes, each of which can crawl a separate section of the website simultaneously. This can significantly speed up the process and reduce the overall time required to crawl the entire site.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid duplicate requests :&lt;/strong&gt;&lt;br&gt;
When crawling a large website, it is easy to accidentally send duplicate requests to the same page. This can waste time and resources, and may also cause issues with the website's server. To avoid this, you can use caching techniques to store the results of previous requests and avoid sending duplicate requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prioritize high-value pages :&lt;/strong&gt;&lt;br&gt;
Some pages on a website may be more important than others, either because they contain more valuable information or because they are more frequently visited by users. By prioritizing these high-value pages, you can ensure that they are crawled first, and that you do not waste time crawling less important pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a sitemap :&lt;/strong&gt;&lt;br&gt;
A sitemap is a file that contains a list of all the pages on a website. By using a sitemap, you can ensure that you crawl all the pages on the website in a systematic and efficient manner. This can also help you identify high-value pages and prioritize them for crawling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimize crawl settings :&lt;/strong&gt;&lt;br&gt;
Finally, it is important to optimize the crawl settings for the specific website you are crawling. This may include adjusting the crawl rate, setting crawl depth limits, and adjusting other settings to ensure that the crawling process is as efficient and effective as possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
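Two of the points above, avoiding duplicate requests and prioritizing high-value pages, can be captured in a small crawl frontier. This is a minimal sketch with invented URLs; the priority score would come from your own heuristics (e.g., sitemap hints), and parallel workers could pull from the same frontier.

```python
# Sketch of a crawl frontier: a seen-set avoids duplicate requests, and a
# heap ordered by a caller-supplied score fetches high-value pages first.
import heapq

class Frontier:
    def __init__(self):
        self._heap = []     # (negative priority, url) -> max-heap behaviour
        self._seen = set()  # every URL ever enqueued, to skip duplicates

    def add(self, url: str, priority: int = 0) -> bool:
        if url in self._seen:
            return False    # duplicate: never request the same page twice
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, url))
        return True

    def next_url(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]

f = Frontier()
f.add("https://example.com/", priority=10)      # high-value landing page
f.add("https://example.com/about", priority=1)
f.add("https://example.com/", priority=10)      # duplicate, ignored
print(f.next_url())  # highest-priority page comes out first
```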

&lt;p&gt;By using these techniques, you can make the process of crawling large websites more efficient and effective. However, it is important to remember that crawling large websites can still be a complex and challenging task, and may require a significant amount of time and resources to complete.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
&lt;title&gt;MongoDB database dump &amp;amp; restore commands - for Ubuntu&lt;/title&gt;
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Wed, 15 Feb 2023 13:35:20 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/mongo-database-dump-restore-command-for-ubuntu-1oma</link>
      <guid>https://dev.to/ruchikaatwal/mongo-database-dump-restore-command-for-ubuntu-1oma</guid>
      <description>&lt;h3&gt;
  
  
  Introduction to the NoSQL database MongoDB :
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. MongoDB is a popular NoSQL document-oriented database that was first released in 2009.
2. MongoDB is an open-source database that is designed to be flexible and scalable, making it a popular choice for many modern web applications.
3. Unlike traditional relational databases, MongoDB stores data in flexible JSON-like documents, which allows for easy and fast querying of data.
4. MongoDB is known for its fast performance and scalability, making it an ideal choice for handling large amounts of data.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it is important to back up data in MongoDB :
&lt;/h3&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Problem :&lt;/strong&gt;
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Data loss can be a serious problem for organisations that rely on MongoDB for their applications.
- Without proper backups, organisations may be at risk of losing critical data in the event of a hardware failure, software bug, or other issue.
- The process of backing up data can be time-consuming and complex, especially for organisations with large amounts of data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Highlights :&lt;/strong&gt;
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Losing important data can be a major setback for any organisation, leading to lost revenue, decreased productivity, and damaged reputation.
- Without proper backups, organisations are putting their critical data at risk, which can lead to costly downtime and lost business opportunities.
- The process of backing up data can be a challenge for organisations, especially those that lack the resources or expertise to do it effectively.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Solve :&lt;/strong&gt;
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Schedule regular backups: Establish a regular backup schedule that meets your organisation's needs and make sure it is followed consistently.
- Use a reliable backup solution: Choose a backup solution that is reliable, secure, and easy to use. Consider options like MongoDB Cloud Backup or third-party backup tools.
- Test your backups: Regularly test your backups to make sure they are working as expected and can be used to restore data in the event of a failure.
- Consider a managed backup service: If you lack the expertise or resources to manage backups in-house, consider using a managed backup service that can handle backups for you.
By implementing these solutions, organisations can protect their critical data and avoid the risk of data loss. With a reliable backup strategy in place, organisations can focus on their core business objectives, assured that their data is safe and secure.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
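One way to follow the "schedule regular backups" advice above is a cron entry that runs a compressed dump every night. This is a hypothetical example: the host, credentials, database name, and backup path are placeholders to adapt to your deployment.

```shell
# Hypothetical crontab entry: compressed mongodump of `mydb` nightly at
# 02:00. Host, credentials, and paths are placeholders; `%` is escaped
# as `\%` because cron treats a bare `%` as a newline.
0 2 * * * mongodump --host 127.0.0.1 --port 27017 --username db_username --password db_password --authenticationDatabase admin -d mydb --gzip --out /backups/mydb-$(date +\%F)
```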



&lt;h3&gt;
  
  
  Mongo dump command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodump --host your_db_ip --port your_db_port --username db_username --password db_password -d database_name -c collection_name --out /folder_path --authenticationDatabase admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mongo restore command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongorestore -d database_name -c collection_name /folder_path/database_folder_name/collection_file_name.bson --host your_db_ip --port your_db_port --username db_username --password db_password --authenticationDatabase admin
&lt;/code&gt;&lt;/pre&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mongo dump command with zip
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodump --host your_db_ip --port your_db_port --authenticationDatabase admin --username db_username --password db_password -d database_name -c collection_name --gzip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mongo restore command with zip
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongorestore --gzip --db database_name --collection collection_name ./collection_name.bson.gz --host your_db_ip --port your_db_port --authenticationDatabase admin --username db_username --password db_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
&lt;title&gt;Working with multiple Python and pip versions on one machine with virtual environments&lt;/title&gt;
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Tue, 26 Jul 2022 12:45:00 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/working-with-multiple-python-and-pip-version-on-one-machine-with-virtual-enviornment-3bao</link>
      <guid>https://dev.to/ruchikaatwal/working-with-multiple-python-and-pip-version-on-one-machine-with-virtual-enviornment-3bao</guid>
      <description>&lt;p&gt;Based on python requirment you can first instal python version you want. Just time of virtual environment creation mention python version. Below have provided example with python3.7. if you want to use create python3.8 or any other version, replace python3.7 with your python version.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create python virtual environment without pip
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3.7 &lt;span class="nt"&gt;-m&lt;/span&gt; venv myenv_python37 &lt;span class="nt"&gt;--without-pip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Activate python virtual environment you created above
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;myenv_python37/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now, install pip inside virtual environment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl https://bootstrap.pypa.io/get-pip.py | python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;If you get an error while installing pip:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ModuleNotFoundError: No module named &lt;span class="s1"&gt;'distutils.cmd'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;install the distutils package shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo apt-get install python3.7-distutils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install all requirements for Python 3.7 inside the virtual environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;After your work is done, close the virtual environment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This helps you maintain multiple Python and pip versions on one system and use the Python version and supported packages that each project requires.&lt;/li&gt;
&lt;li&gt;All packages from requirements.txt will be accessible inside the virtual environment only, not globally on your system's python3.7.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Session Vs Cookies, Stateless Vs Stateful Protocol, HTTP Session Tracking</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Sun, 17 Jul 2022 10:15:40 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/session-vs-cookies-stateless-vs-stateful-protocol-http-session-tracking-3gih</link>
      <guid>https://dev.to/ruchikaatwal/session-vs-cookies-stateless-vs-stateful-protocol-http-session-tracking-3gih</guid>
      <description>&lt;h3&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Session Vs Cookies&lt;/strong&gt; :&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;Cookies and Sessions are used to store information. Cookies are only stored on the client-side machine, while sessions get stored on the client as well as a server.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;u&gt;Session&lt;/u&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A session creates a file in a temporary directory on the server where registered session variables and their values are stored. This data will be available to all pages on the site during that visit.&lt;/li&gt;
&lt;li&gt;A session ends when the user closes the browser or leaves the site; otherwise, the server terminates it after a predetermined idle period, commonly 30 minutes.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Cookies&lt;/u&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Cookies contain pieces of information that are not inherently secure, even though they are kept on the client side. &lt;/li&gt;
&lt;li&gt;Cookies are text files stored on the client computer, kept for user-tracking purposes. The server script sends a set of cookies to the browser, for example a name, age, or identification number. The browser stores this information on the local machine for future use. The next time the browser sends a request to the web server, it includes that cookie information, and the server uses it to identify the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Stateless Vs Stateful Protocol&lt;/strong&gt; :&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;Network Protocols for web browser and servers are categorized into two types: &lt;strong&gt;Stateless Protocol&lt;/strong&gt;, and &lt;strong&gt;Stateful protocol&lt;/strong&gt;. These two protocols are differentiated on the basis of the requirement of server or server-side software to save status or session information.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Stateless Protocol&lt;/strong&gt; :&lt;/u&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;In a stateless protocol, the client sends a request to the server and the server responds based on that request alone. &lt;/li&gt;
&lt;li&gt;It does not require the server to retain session information or status about each communicating partner across multiple requests. &lt;/li&gt;
&lt;li&gt;This means that whenever a request is sent to the server, the server does not store any information about the client's request. Each time the same user sends a request, the server treats them as a new user, because no information or ID is stored on the server for validation or identification.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example of Stateless Protocol&lt;/strong&gt; :&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;HTTP (Hypertext Transfer Protocol)&lt;/li&gt;
&lt;li&gt;UDP (User Datagram Protocol)&lt;/li&gt;
&lt;li&gt;DNS (Domain Name System)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Features of Stateless Protocols&lt;/strong&gt; :&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Stateless protocols simplify the design of the server.&lt;/li&gt;
&lt;li&gt;Stateless protocols require fewer resources because the system does not need to keep track of multiple link communications and session details.&lt;/li&gt;
&lt;li&gt;In a stateless protocol, each information packet travels on its own, without reference to any other packet.&lt;/li&gt;
&lt;li&gt;Each communication in a stateless protocol is discrete and unrelated to those that precede or follow it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;u&gt;&lt;strong&gt;Stateful Protocol&lt;/strong&gt; :&lt;/u&gt;
&lt;/h4&gt;

&lt;p&gt;In a stateful protocol, if the client sends a request to the server, it expects some kind of response; if it does not get any response, it resends the request. FTP (File Transfer Protocol) and Telnet are examples of stateful protocols.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Features of Stateful Protocols&lt;/strong&gt; :&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Stateful protocols provide better performance to the client by keeping track of connection information.&lt;/li&gt;
&lt;li&gt;Stateful applications require backing storage.&lt;/li&gt;
&lt;li&gt;Stateful requests are always dependent on the server-side state.&lt;/li&gt;
&lt;li&gt;TCP sessions follow a stateful protocol because both systems maintain information about the session itself during its life.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;&lt;strong&gt;HTTP Session Tracking explanation how it works :&lt;/strong&gt;&lt;/u&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Whenever the client/user makes request to server.&lt;/li&gt;
&lt;li&gt;The request is being made through/via HTTP.&lt;/li&gt;
&lt;li&gt;HTTP stands for Hyper Text Transfer Protocol&lt;/li&gt;
&lt;li&gt;HTTP is a stateless protocol.&lt;/li&gt;
&lt;li&gt;Stateless protocol means the server does not track user, so no information is stored with server regarding client/user request.&lt;/li&gt;
&lt;li&gt;So the problem here is that&lt;/li&gt;
&lt;li&gt;Whenever the client/user makes multiple request, Same client/user is new for server.&lt;/li&gt;
&lt;li&gt;So to overcome , session and cookie came in-hand&lt;/li&gt;
&lt;li&gt;As http is stateless, to maintain state, server decided whenever client/user sent the request a ID will be sent along with request, and server though return same ID with response to keep track.&lt;/li&gt;
&lt;li&gt;After session is created ,  after sending request and response recieved, server sends key with response to browser known as cookie.&lt;/li&gt;
&lt;li&gt;So whenever the client/user sent request again browser takes cookie along with request to server.&lt;/li&gt;
&lt;li&gt;So know the server check cookie id(key),  and check for which session (client/user) does this cookie id (key) belongs to&lt;/li&gt;
&lt;li&gt;So server identifies user with cookie.&lt;/li&gt;
&lt;/ul&gt;
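The flow above can be sketched in a few lines of Python. This is an illustrative toy, not real server code: `sessions`, `handle_request`, and the visit counter are made-up names standing in for the server's session store, its request handler, and per-user state.

```python
import uuid

# Toy server-side session store: cookie ID -> per-user state (illustrative only).
sessions = {}

def handle_request(cookie_id=None):
    """Handle one request; issue a new session ID if the cookie is missing/unknown."""
    if cookie_id is None or cookie_id not in sessions:
        cookie_id = str(uuid.uuid4())        # server issues a fresh ID
        sessions[cookie_id] = {"visits": 0}  # new server-side session
    sessions[cookie_id]["visits"] += 1       # state survives across requests
    return sessions[cookie_id]["visits"], cookie_id

# First request arrives without a cookie: the server creates a session.
visits, cid = handle_request()
# The browser sends the cookie back, so the server recognizes the same user.
visits, _ = handle_request(cid)
print(visits)  # 2 -- the server remembered the first visit
```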

&lt;h6&gt;
  
  
Note: Points to remember:
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;A session is used to identify a user/client.&lt;/li&gt;
&lt;li&gt;A session is stored on the server side.&lt;/li&gt;
&lt;li&gt;A cookie is stored in the browser on the client side.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>session</category>
      <category>cookies</category>
      <category>webcookies</category>
      <category>httpsession</category>
    </item>
    <item>
      <title>Ubuntu commands</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Tue, 28 Jun 2022 15:03:42 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/ubuntu-commands-4e9n</link>
      <guid>https://dev.to/ruchikaatwal/ubuntu-commands-4e9n</guid>
      <description>&lt;ul&gt;
&lt;li&gt;The 'df' command, short for "disk free", is used to check available disk space on the system. It displays file system device names, disk blocks, total disk space, used and available space, percentage used, and mount points.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add -h to df to display disk space in a human-readable format (MB and GB).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Show only your home file system information.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /home
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change directory/Folder.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /folder_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Show just the names of files and directories existing in your present directory.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Show a list of all files and directories along with details like permissions, ownership, size, etc. (on Ubuntu, ll is a default alias for ls -alF).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Beginner's - creating virtual environment on ubuntu for python</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Mon, 27 Jun 2022 09:56:41 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/beginners-create-virtual-environment-on-ubuntu-for-python-5ap2</link>
      <guid>https://dev.to/ruchikaatwal/beginners-create-virtual-environment-on-ubuntu-for-python-5ap2</guid>
      <description>&lt;h2&gt;
  
  
  Installing virtual environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Update your system
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Install Virtual Environment package with below command :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;virtualenv &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a directory to hold your different Python virtual environments.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; ~/python-environments &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ~/python-environments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Explanation:&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mkdir&lt;/strong&gt; -&amp;gt; creates a folder/directory from the terminal&lt;br&gt;
&lt;strong&gt;cd&lt;/strong&gt; -&amp;gt; changes the current working directory&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&amp;amp;&amp;amp;&lt;/strong&gt; -&amp;gt; chains two commands on one line, running the second only if the first succeeds&lt;/p&gt;

&lt;p&gt;You can run make directory and change directory command in separate line too :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; ~/python-environments
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/python-environments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now create python virtual environment :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;virtualenv &lt;span class="nt"&gt;--python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;python3.8 env_python38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the virtual environment creation command, the terminal output will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;created virtual environment CPython3.8.10.final.0-64 &lt;span class="k"&gt;in &lt;/span&gt;80ms
  creator CPython3Posix&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/ruchika/python-environments/env_python38, &lt;span class="nv"&gt;clear&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False, &lt;span class="nv"&gt;global&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False&lt;span class="o"&gt;)&lt;/span&gt;
  seeder FromAppData&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;False, &lt;span class="nv"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;latest, &lt;span class="nv"&gt;setuptools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;latest, &lt;span class="nv"&gt;wheel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;latest, &lt;span class="nv"&gt;pkg_resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;latest, &lt;span class="nv"&gt;via&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;copy, &lt;span class="nv"&gt;app_data_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/ruchika/.local/share/virtualenv/seed-app-data/v1.0.1.debian.1&lt;span class="o"&gt;)&lt;/span&gt;
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Type the ls command to list the files and folders in the current directory:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see your newly created virtual environment directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;env_python38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Check that your environment was created with the proper Python version:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;env_python38/lib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Activate and deactivate virtual environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Activate your virtual environment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;env_python38/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After activating, you will see your virtual environment name in round brackets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;env_python38&lt;span class="o"&gt;)&lt;/span&gt; yoursystemname:~/python-environments&lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deactivate (close) your virtual environment:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ubuntu</category>
      <category>virtualenvironment</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Remove extra space from text with regex - Python</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Fri, 24 Jun 2022 09:22:59 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/remove-extra-space-from-text-with-regex-python-42hj</link>
      <guid>https://dev.to/ruchikaatwal/remove-extra-space-from-text-with-regex-python-42hj</guid>
      <description>&lt;p&gt;Easy way to remove extra spaces from text, paragraph with regex in python&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import the regular expression package
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Remove extra spaces from text.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Hi python   is a case   sensitive language.    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; +&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text : &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation for above piece of code :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;re.sub() is a function used to replace a sub-string with another sub-string.&lt;/li&gt;
&lt;li&gt;The first argument is the regular expression matching the sub-string to replace, 
i.e.: ' +'      (a space followed by +) captures one or more consecutive spaces.&lt;/li&gt;
&lt;li&gt;The second argument is the replacement, 
i.e.: ' '       (a single space that replaces the run of spaces).&lt;/li&gt;
&lt;li&gt;The third argument is the text variable you want to clean.&lt;/li&gt;
&lt;/ol&gt;
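One caveat worth adding as a sketch: ' +' collapses each run of spaces to one, but a single leading/trailing space remains. Combining re.sub with str.strip handles that, and the pattern r"\s+" also covers tabs and newlines:

```python
import re

text = "  Hi python   is a case   sensitive language.    "
# \s+ matches runs of any whitespace (spaces, tabs, newlines);
# strip() removes the single space left at each end after the substitution.
cleaned = re.sub(r"\s+", " ", text).strip()
print(cleaned)  # Hi python is a case sensitive language.
```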

</description>
      <category>regex</category>
      <category>text</category>
      <category>cleaning</category>
    </item>
    <item>
      <title>Git - Beginner's Guide</title>
      <dc:creator>Ruchika Atwal</dc:creator>
      <pubDate>Wed, 22 Jun 2022 16:23:52 +0000</pubDate>
      <link>https://dev.to/ruchikaatwal/git-beginners-guide-2fkj</link>
      <guid>https://dev.to/ruchikaatwal/git-beginners-guide-2fkj</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Git&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git is a version control system used to handle small to very large projects efficiently.&lt;/p&gt;

&lt;p&gt;Git helps in tracking changes in the source code, enabling multiple developers to work together on non-linear development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing git on ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Update packages
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;sudo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;apt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Install git
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;sudo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;apt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;git&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Confirm that you have installed Git correctly by running the following command :
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--version&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Push your first project folder to git server
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How to initialize Git on your local system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following command creates an empty Git repository (a .git directory) under the current directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;git&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Staging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After initializing Git, check which files are in the staging area and whether they have been added. Files not yet added to staging are shown in red.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add to staging
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;#### to add single file or folder&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;#### to add all file and folder&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding, run git status again to confirm whether the file has been added to staging.&lt;/p&gt;

&lt;p&gt;Then we need to commit the changes.&lt;/p&gt;

&lt;p&gt;Note: add only stages changes in a queue; they are not versioned yet. Only after a commit are they recorded in version control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;commit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"this helps to track changes readble easy,  so write what changes you made"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
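The heading for this section mentions pushing to a Git server, which the steps above stop short of. Here is a minimal sketch of the full sequence, run in a throwaway directory; the remote URL in the commented lines is a placeholder, not a real repository.

```shell
dir=$(mktemp -d)      # throwaway directory so nothing on your system is touched
cd "$dir"
git init -q                                   # initialize an empty repository
echo "hello" > README.md
git add README.md                             # stage the file
git -c user.email=you@example.com -c user.name="You" commit -q -m "first commit"
git rev-list --count HEAD                     # prints 1: one commit recorded
# Connect a remote and push (placeholder URL -- replace with your repository):
# git remote add origin https://github.com/YOUR_USER/YOUR_REPO.git
# git push -u origin main
```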



</description>
      <category>git</category>
      <category>github</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
