Artur Daschevici
Screenshots optimization on OpenAI tokens

Prologue

In my previous post I took the approach of extracting data from the HTML of the page, cleaning it in order to keep token usage low. Here I will look at optimizing token usage further.

Funnily enough, if you take a screenshot of just the element you are interested in, the number of tokens is drastically reduced.

Why?

When working with the OpenAI APIs, things are not exactly free. They are cheap, but depending on how often you run certain operations the cost can add up quite quickly. Fortunately, each API response includes usage metrics for the different operations, so you don't get any surprises.
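To see roughly what a run costs, you can multiply those usage counts by your model's per-token price. A small sketch, with placeholder prices (check OpenAI's pricing page for the real numbers for your model):

```javascript
// Rough cost estimate from the usage object the API returns.
// The per-1k-token prices are parameters, since pricing varies by model;
// the values used below are placeholders, not real prices.
function estimateCostUsd(usage, promptPricePer1k, completionPricePer1k) {
  return (
    (usage.prompt_tokens / 1000) * promptPricePer1k +
    (usage.completion_tokens / 1000) * completionPricePer1k
  );
}

const usage = { prompt_tokens: 799, completion_tokens: 289, total_tokens: 1088 };
console.log(estimateCostUsd(usage, 0.01, 0.03).toFixed(4)); // ~0.0167
```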

How?

So what you can do is pick a website more or less at random; for the purpose of this exercise we'll use homegate. It's a real estate listing site, the kind you might use to find a place to rent in Zürich.

Let's use puppeteer, as per usual, to load up the search results page shown in the snippet below.

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";


const urls = [
  "https://www.homegate.ch/rent/real-estate/zip-8002/matching-list?ac=2.5",
];

puppeteer.use(StealthPlugin());

async function grabSelectorScreenshot() {
  // usual browser startup:
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
  );
  for (const url of urls) {
    await page.goto(url, { waitUntil: "networkidle0" });
    // scroll-to-bottom and screenshot logic goes here (covered below)
  }
  await browser.close();
}

Once the listing search results have loaded, we want to select the list element and take a screenshot of it.

To grab a picture of a single element on the page, do the following:

// you need to grab the right selector for your use case
const element = await page.$(".ResultListPage_resultListPage_iq_V2");
if (!element) throw new Error("result list element not found");
// `hashed` is any unique identifier for the page, e.g. a hash of the url
const designatedPathPng = `./screenshots/${hashed}-list-ss.png`;
await element.screenshot({ path: designatedPathPng, type: "png" });

This looks fairly straightforward and works fine; however, if you have a look at the screenshot you will notice that some parts have not rendered.

The gotcha 😕

So... it seems there is a small conundrum with our randomly chosen website: it lazily renders parts of the page as they come into view, which means we have to scroll down and bring the bottom into view before we can grab the snapshot. Just to be clear, with this approach we rely solely on the visual rendering of the page, so there is no data available for elements that have not rendered.

To do this we will attempt to scroll down all the way to the bottom of the page and grab the search results:

    await page.evaluate(async () => {
      await new Promise((resolve) => {
        let totalHeight = 0;
        const distance = 300; // should be less than or equal to window.innerHeight
        const timer = setInterval(() => {
          const scrollHeight = document.body.scrollHeight;
          window.scrollBy(0, distance);
          totalHeight += distance;

          if (totalHeight >= scrollHeight) {
            clearInterval(timer);
            resolve();
          }
        }, 500);
      });
    });

This tiny piece of code scrolls down to the bottom of the page, and you can even delay each step so it doesn't stress the host. Just to be clear, you should always be ethical in your data collection and try not to annoy the site owners too much. It gets particularly dicey in the realm of commercial applications, where your legal basis needs to be rock solid.
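That delay can be a simple promise-based helper. This is just a sketch; the `sleep` name and the jitter parameter are my own additions, not from any library:

```javascript
// A tiny pause helper; adding some random jitter makes repeated
// requests a little gentler and less mechanical-looking.
function sleep(ms, jitterMs = 0) {
  const delay = ms + Math.random() * jitterMs;
  return new Promise((resolve) => setTimeout(resolve, delay));
}

// e.g. between page visits: await sleep(500, 250);
```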

Finishing it up with a bit of AI

After we have scrolled all the way down to the bottom of the page, we can come back to the snapshot of the element. It should now capture the full list of listings.

You can save the screenshot, send it out to OpenAI, and grab the info from the image.
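Before the image goes into the request, the saved file needs to be base64-encoded and wrapped in a data URL. A minimal sketch (the `imageToDataUrl` helper is my own, not part of any API):

```javascript
import fs from "node:fs";

// Read the saved screenshot from disk and wrap it in a data URL
// suitable for the vision request; imageType is png/jpeg.
function imageToDataUrl(path, imageType) {
  const b64 = fs.readFileSync(path).toString("base64");
  return `data:image/${imageType};base64,${b64}`;
}
```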

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// propertyInfoImage is the base64-encoded screenshot
async function run(propertyInfoImage, imageType) {
  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `Can you extract the property prices out of this
            image and send me the results? Can you send me the output as JSON?`,
          },
          {
            type: "image_url",
            image_url: {
              // imageType is png/jpeg
              url: `data:image/${imageType};base64,${propertyInfoImage}`,
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
  console.log(response.usage);
}
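One practical wrinkle: since the prompt asks for JSON, the reply in `response.choices[0].message.content` often comes back wrapped in a markdown code fence. A best-effort sketch of unwrapping it (the `parseModelJson` helper is hypothetical, not an API feature):

```javascript
// Strip an optional markdown code fence around the model's answer,
// then parse whatever is inside as JSON.
function parseModelJson(content) {
  const match = content.match(/```(?:json)?\s*([\s\S]*?)```/);
  const raw = match ? match[1] : content;
  return JSON.parse(raw.trim());
}
```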

There are a couple of interesting things about the images:

  • different image file types might yield better results or better token usage metrics
  • the number of tokens used is considerably lower than in the case of text, which is pretty interesting, e.g.:
{ prompt_tokens: 799, completion_tokens: 289, total_tokens: 1088 }

Conclusion

  • it's very surprising that using images can result in token usage that is orders of magnitude lower than using the raw text
  • chunking might be more challenging, since you need to rely on visual marker detection, which feels a bit more subjective
  • it is pretty interesting what you can do with visuals, especially when it comes to extracting data; interpreting charts, or other kinds of information encoded into images, automatically is a natural next step.
