Ranjan Dailata

Web Scraping with Langchain and html2text

Introduction

This blog post walks through simple web scraping with open source Python packages, namely langchain and html2text.

Hands-on

First, install the langchain and html2text packages, along with playwright and beautifulsoup4.

!pip install -q langchain playwright beautifulsoup4 html2text

Here's the code snippet for the web scraping. It uses LangChain's AsyncHtmlLoader to fetch the pages and LangChain's Html2TextTransformer, which relies on the html2text package, to convert the HTML to text.

import html2text
# Newer LangChain releases expose these classes under
# langchain_community.document_loaders and langchain_community.document_transformers.
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer

async def do_webscraping(link):
    """Fetch one URL and return its text, title, and metadata, or None on failure."""
    try:
        # Fetch the raw HTML for the single URL.
        loader = AsyncHtmlLoader([link])
        docs = loader.load()

        # Convert the fetched HTML documents to plain text.
        html2text_transformer = Html2TextTransformer()
        docs_transformed = html2text_transformer.transform_documents(docs)

        # Nothing was fetched or converted.
        if not docs_transformed:
            return None

        doc = docs_transformed[0]
        metadata = doc.metadata
        return {
            'summary': doc.page_content,
            'title': metadata.get('title', ''),
            'metadata': metadata,
            # Runs html2text once more over the already-converted text.
            'clean_content': html2text.html2text(doc.page_content)
        }

    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
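
To make the HTML-to-text step more concrete, here is a minimal sketch of the idea using only Python's standard library `html.parser`. It is not how html2text or Html2TextTransformer is implemented; the class `TextExtractor`, the helper `extract_text`, and the sample HTML are hypothetical names and data chosen for illustration. The sketch keeps the visible text and the page title and skips script and style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title>, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and script/style bodies
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def extract_text(html):
    """Return the page title and the visible text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": "\n".join(parser.chunks)}


# Hypothetical sample page, used only to exercise the sketch.
sample_html = (
    "<html><head><title>Seafood in Mountain View</title>"
    "<style>body { color: red; }</style></head>"
    "<body><h1>Top picks</h1><p>Fresh oysters daily.</p>"
    "<script>console.log('ignored');</script></body></html>"
)
result = extract_text(sample_html)
print(result["title"])
print(result["text"])
```

html2text goes further than this sketch: it emits Markdown-style output for headings, links, and lists, which is why the post relies on it rather than on a hand-rolled parser.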

Now for actual usage. Assume the following are the URLs you wish to scrape.

google_search_results = ['https://www.yelp.com/search?cflt=seafood&find_loc=Mountain+View%2C+CA+94043',
 'https://www.yelp.com/search?cflt=seafood&find_loc=Mountain+View%2C+CA',
 'https://www.opentable.com/cuisine/best-seafood-restaurants-mountain-view-ca']
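
If the list comes from search results, it may contain exact duplicates or entries that are not web pages. The following is a small optional sketch, assuming you want to filter the list before scraping; the helper `keep_scrapable` and the `candidates` list are hypothetical and not part of the original post.

```python
from urllib.parse import urlsplit


def keep_scrapable(urls):
    """Keep unique http(s) URLs, preserving their original order."""
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip mailto:, relative links, and other non-web entries
        if url in seen:
            continue  # skip exact duplicates
        seen.add(url)
        kept.append(url)
    return kept


# Hypothetical input with one duplicate and one non-web entry.
candidates = [
    "https://www.yelp.com/search?cflt=seafood&find_loc=Mountain+View%2C+CA",
    "https://www.yelp.com/search?cflt=seafood&find_loc=Mountain+View%2C+CA",
    "mailto:someone@example.com",
    "https://www.opentable.com/cuisine/best-seafood-restaurants-mountain-view-ca",
]
print(keep_scrapable(candidates))
```

Only exact string matches are treated as duplicates here, so two URLs that differ in their query string are both kept.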

Here's the loop that runs the scraper on each link and collects the results.

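structured_response = []  # collects the result for each successfully scraped link
for link in google_search_results:
  print(link)
  # Top-level await works in a notebook; in a script, wrap this loop in an async function.
  response = await do_webscraping(link)
  if response is not None:
    structured_response.append(response)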
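
The loop above handles one link at a time, and `loader.load()` blocks until its page is fetched. One possible variation, sketched below under stated assumptions, runs a blocking scraper in worker threads with `asyncio.to_thread` (Python 3.9+) and gathers the results. The function `blocking_scrape` is a hypothetical stand-in that simulates latency instead of touching the network, and `scrape_all` and the example.com links are likewise illustrative names.

```python
import asyncio
import time


def blocking_scrape(link):
    """Hypothetical stand-in for a blocking scraper; returns None for a 'failed' link."""
    time.sleep(0.1)  # simulate network latency
    if "fail" in link:
        return None
    return {"title": f"Page for {link}", "summary": "..."}


async def scrape_all(links):
    """Run the blocking scraper for every link in worker threads and drop failures."""
    tasks = [asyncio.to_thread(blocking_scrape, link) for link in links]
    responses = await asyncio.gather(*tasks)  # results keep the order of `links`
    return [response for response in responses if response is not None]


links = [
    "https://example.com/a",
    "https://example.com/fail",
    "https://example.com/b",
]
structured_response = asyncio.run(scrape_all(links))
print([item["title"] for item in structured_response])
```

In a notebook, where an event loop is already running, you would write `structured_response = await scrape_all(links)` instead of calling `asyncio.run`.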

[Image: Google-Web-Scrapping]

Here's the structured response.

[Image: Google-Web-Scrapping-Structured-Response]

Top comments (3)

hil

Interesting! I didn't know about html2text before, thanks for the reference.

I did a similar experiment a while ago, but used an AI model from OpenAI to parse the data into a structured format: Web parsing with AI

Thank you for sharing.

Ranjan Dailata

@hilmanski thanks a lot for your feedback. I have published a blog post that does the same thing using an LLM - dev.to/ranjancse/google-search-wit...

hil

Thank you Ranjan! I'll read the post!