DEV Community

Oleg Kulyk

Proper AI data extraction without manual HTML parsing

In my previous article, I described the basic concept behind AI data extraction using ChatGPT. Some readers got back to me with questions and complaints that the approach was not that good.

They were entirely right.


The main issues are:

  • Improper ChatGPT AI model usage, which leads to poor extraction quality
  • Limited input size, which doesn't allow processing anything bigger than example.com

So, let's dig deeper and find out how to address those issues and implement proper AI data extraction.

1. Improper ChatGPT AI Model Usage

The first issue was the improper use of the ChatGPT AI model, leading to poor extraction quality. This mainly happened because not all pages follow the same HTML structure, "confusing the model".

The best way to overcome this is to use Chat Models from OpenAI, like GPT-3.5 or GPT-4.

This way, you can set up your HTML extraction using code similar to this:

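import openai  # legacy SDK interface (openai<1.0); reads OPENAI_API_KEY from the environment

# Ask a chat model to extract the requested fields from the page's HTML
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Extract <your parameters> from <HTML code>"},
    ]
)
# The extracted data comes back as the text of the assistant's message
print(response["choices"][0]["message"]["content"])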

You could also use more capable models and more sophisticated prompt layouts.
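One possible "more sophisticated layout" is to separate the instructions from the page content with a system message and to ask for JSON only. The sketch below is just an illustration: the helper names `build_extraction_messages` and `parse_reply`, as well as the prompt wording, are my own assumptions and not part of any library. It only builds the `messages` list and parses a reply, so it runs without an API key.

```python
import json


def build_extraction_messages(fields, html):
    """Build a chat `messages` list asking for `fields` from `html` as JSON.

    Hypothetical helper: the prompt wording is an assumption, tune it for your pages.
    """
    system = (
        "You extract structured data from HTML. "
        "Reply with a single JSON object and nothing else. "
        "Use null for any field that is not present on the page."
    )
    user = f"Fields to extract: {', '.join(fields)}\n\nHTML:\n{html}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def parse_reply(reply_text):
    """Parse the model's reply; return None if it is not valid JSON."""
    try:
        return json.loads(reply_text)
    except json.JSONDecodeError:
        return None


messages = build_extraction_messages(["title", "price"], "<h1>Example Domain</h1>")
```

The resulting list can be passed as the `messages` argument of the chat call shown earlier, and the returned message text can be handed to `parse_reply`.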

2. Input size

The second issue was the limited input size, which prevented the processing of larger pages.

To tackle this, divide the larger content into smaller, manageable chunks and then process them sequentially or in parallel. This method will allow the AI model to handle larger datasets effectively without being overwhelmed.

Or use the GPT-4 model, which has a larger context window 🙂
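The chunking idea described above could look roughly like this. It is a minimal sketch under a few assumptions: chunks are measured in characters rather than tokens, a small overlap is kept so a value split across a boundary still appears whole in one chunk, and `extract_one` is a hypothetical callback standing in for whatever function sends one chunk to the model.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text, max_chars=8000, overlap=200):
    """Split `text` into pieces of at most `max_chars` characters.

    Each chunk starts `overlap` characters before the previous one ended.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def extract_in_parallel(chunks, extract_one, max_workers=4):
    """Apply `extract_one` to every chunk in parallel, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, chunks))


chunks = split_into_chunks("abcdefghij", max_chars=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The per-chunk results still need to be merged afterwards (for example, by concatenating lists of reviews), and how to do that depends on the fields you extract.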


What's the cost?

Using powerful GPT-3.5 or GPT-4 models to extract data from web pages is an over-engineered solution, so paying for that many tokens may not be economically efficient (although it is usually still cheaper than maintaining your own parsers and paying a team of developers to support them).

So, to be cost-efficient, the solution should fit the purpose exactly.
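To get a feeling for the numbers, here is a rough back-of-the-envelope estimate. Everything in it is an assumption: roughly 4 characters per token is only a rule of thumb for English text (HTML markup may tokenize differently), and the price passed in is a placeholder, so check the current pricing of the model you actually use. Only input tokens are counted.

```python
def estimate_input_cost(num_chars, price_per_1k_tokens, chars_per_token=4.0):
    """Estimate the input cost of sending `num_chars` characters to a model.

    `chars_per_token` is a rough heuristic and `price_per_1k_tokens` is a
    placeholder value supplied by the caller, not a real price list.
    """
    tokens = num_chars / chars_per_token
    return tokens / 1000 * price_per_1k_tokens


# Hypothetical example: a 400,000-character page at a placeholder price
# of 0.002 per 1,000 input tokens.
cost = estimate_input_cost(400_000, price_per_1k_tokens=0.002)
print(f"{cost:.2f}")  # 0.20
```

Multiply this by the number of pages you scrape per day to see whether a general-purpose model fits your budget.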

AI extractor from ScrapingAnt

The Beta version of the solution is here!

The ScrapingAnt team has developed a unique AI Data Extractor technology that works with websites of any size and requires only 2 parameters: a URL and a free-form input!

The output is a JSON object whose keys are automatically converted from the input parameter names, which is near magic!

It's time to turn any website into a JSON API!

What I like most about this tool is that it fully supports nesting and types, so you don't need to struggle with loops as in other no-code tools. You just write it down:

product title, price(number), full description, reviews(list: review title, review content)

This extraction input can be sent to the ScrapingAnt AI API together with an Amazon URL such as: https://www.amazon.com/dp/B0725MVKCZ/

The output would look similar to this:

{
  "productTitle":"MREs (Meals Ready-to-Eat) Genuine U.S. Military Surplus Assorted Flavor (6-Pack)",
  "price":60,
  "fullDescription":"MREs (Meals Ready-to-Eat) Genuine U.S. Military Surplus Assorted Flavor (6-Pack). Brand: MRE. Number of Items: 1. Flavor: Assorted. Item Weight: 0.01 Ounces. Size: 6 Count (Pack of 1). Long shelf life when stored per manufacturer's directions. 2012 or newer Pack Date. Genuine US War Fighter Rations are the ultimate survivalist, Prepper & outdoor enthusiast Meal. Ideal for hunting, camping, hiking, fishing, boating, and emergency food supply. Designed for maximum endurance and nutrition with average 1250 calories per meal.",
  "reviews":[
    {
      "reviewTitle":"Will buy from Food Dude again",
      "reviewContent":"Great selection! I ordered 6 MRE and they are all different and a nice selection. I also ordered a First Strike MRE and you get what you pay for. This one is a Monster and it's packed to the hilt. Not only will I buy from this vendor again i already ordered 4 more and will be getting a case on payday. Thanks so much for the lightning fast shipping. Great Seller,Great product!"
    },
    {
      "reviewTitle":"AWESOME!",
      "reviewContent":"This is the second time I've purchased this product, and if I could leave 6 stars, I would! Not only are the meals exactly as described, and arrived in good time, but the seller communication is phenomenal. I made a special request on my second purchase and my message was answered very quickly, with my request being fulfilled perfectly. I highly recommend this seller, and the MREs themselves. They're great for camping, emergency rations, or just to have instead of cooking or eating out!"},
...]}
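Notice how the free-form names from the input ("product title", "review content") show up as camelCase keys in the output ("productTitle", "reviewContent"). The snippet below is not ScrapingAnt's implementation, just my own small illustration of that naming convention, which can be handy for predicting which keys to read from the response. The function name `to_camel_case` is hypothetical.

```python
import re


def to_camel_case(name):
    """Convert a free-form field name like 'product title' to 'productTitle'."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    if not words:
        return ""
    first, rest = words[0].lower(), words[1:]
    # Capitalize the first letter of every following word and join them
    return first + "".join(w[:1].upper() + w[1:].lower() for w in rest)


for field in ["product title", "price", "full description", "review content"]:
    print(field, "->", to_camel_case(field))
```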

This is exactly the technology I've been looking for over the past 2 years, as you no longer need to write your own HTML parsers or work with Cheerio or BeautifulSoup!

The extractor itself is available to test using the UI request generator. And it's free to try without a credit card 🙃

You can also find more details in the documentation.

I'm excited to share this technology with this community and would appreciate your comments and feedback!
