Turn any website into a type-safe API using AI (part 1)

Not too long ago I saw a post on X/Twitter about turning any website into an API using AI.

The idea intrigued me for days: it sounded almost too good to be true. However, after experimenting with it, I can confidently say it's not only possible but also much easier to achieve than you might expect.

In this post I'll uncover the secrets 😏.

How it works

In a nutshell, the flow from an arbitrary webpage to a structured JSON object is as follows (sketched in code right after the list):

  • Scrape the webpage
  • Convert the HTML content into LLM-friendly text
  • Feed the converted data to an LLM
  • Instruct the LLM to extract the content into the provided JSON schema
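
Expressed in code, that flow looks roughly like the sketch below. Note that scrape, toLlmFriendlyText and extractWithLlm are hypothetical helper names used for illustration; the rest of this post fleshes the idea out:

import { z } from "zod";

// Hypothetical helpers, named only to illustrate the pipeline.
declare function scrape(url: string): Promise<string>;
declare function toLlmFriendlyText(html: string): string;
declare function extractWithLlm<T>(text: string, schema: z.ZodType<T>): Promise<T>;

async function websiteToJson<T>(url: string, schema: z.ZodType<T>): Promise<T> {
  const html = await scrape(url); // 1. scrape the webpage
  const text = toLlmFriendlyText(html); // 2. convert the HTML into LLM-friendly text
  return extractWithLlm(text, schema); // 3 + 4. let the LLM fill in the schema
}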

Heads up: the code snippets below are written in TypeScript. If you prefer Python (or any other programming language, for that matter) you should still be able to follow along relatively easily.

Naive approach

We'll begin with the most basic approach to this problem, using OpenAI's GPT-4o model. OpenAI recently launched Structured Outputs, which makes the JSON parsing in the final step much easier.

Let's start by defining a similar function interface to the one we saw in the tweet:

import { z } from "zod";

export type ExpandOptions = {
  source: string;
  schema: z.ZodType;
  schemaName?: string;
};

export async function expand({
  source,
  schema,
  schemaName,
}: ExpandOptions) {
  // ...
}

Next, define your data schema with Zod. We'll use a schema that resembles the example in the tweet:

const companySchema = z.object({
  name: z.string(),
  batch: z.string(),
  url: z.string(),
  industry: z.string(),
});

const companiesSchema = z.object({
  companies: z.array(companySchema),
});
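
Since the schema is a plain Zod object, you get the matching TypeScript type for free with z.infer, which is where the "type-safe" part of the title comes from:

type Companies = z.infer<typeof companiesSchema>;
// => { companies: { name: string; batch: string; url: string; industry: string }[] }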

Now we can move on to implementing the exciting bits using the openai package:

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

export async function expand({ source, schema, schemaName }: ExpandOptions) {
  // Instantiate the OpenAI client
  const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  });

  // Fetch the HTML content (in plaintext) of the target URL.
  const res = await fetch(source);
  const input = await res.text();

  // Send the input to the model and parse the output according to the schema.
  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-2024-08-06",
    temperature: 1,
    messages: [
      {
        role: "system",
        content: `You are an expert entity extractor that always maintains as much semantic
meaning as possible. You use inference or deduction whenever necessary to
supply missing or omitted data. Examine the provided HTML content and respond 
with a JSON object that matches the requested format.`,
      },
      { role: "user", content: input },
    ],
    response_format: zodResponseFormat(schema, schemaName ?? "result"), // Converts the Zod schema to a JSON schema.
  });

  // Extract the parsed output from the model's response.
  const output = completion.choices[0].message.parsed;

  if (output == null) {
    throw new Error(
      "Failed to parse the model's output according to the schema"
    );
  }

  return output;
}

Finally, call the expand function:

const companies = await expand({
  source: "https://www.ycombinator.com/companies",
  schemaName: "Companies",
  schema: companiesSchema,
});
console.log(companies);

Make sure you've set the required environment variable OPENAI_API_KEY to your OpenAI API key. In a Unix-like shell that could look like this (the key below is just a placeholder):
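
export OPENAI_API_KEY="sk-..."

Then run the example: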

npx tsx ./src/example-openai.ts

You should get the following output:

{ companies: [] }

So why didn't it work? The problem is that the page at https://www.ycombinator.com/companies relies on dynamic content: the list is empty on the initial page load and only gets filled once some JavaScript code has finished loading the data from their API. You can confirm this by inspecting the page's HTML source (Ctrl + U); none of the items from the list can be found directly in it:

(Screenshot: HTML source of https://www.ycombinator.com/companies)
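
You can double-check this from the command line as well; assuming curl and grep are available, searching the raw HTML for one of the companies should come up empty:

curl -s https://www.ycombinator.com/companies | grep -c 'Airbnb'
# prints 0, because the list is rendered client-side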

Thus, we'll need to run this JavaScript in order to render the full companies list. A regular HTTP client like fetch can't do that, so we'll add browser automation to the mix.

We'll create another function that loads the page in a real browser and extracts the rendered HTML content once the page has finished loading. We can use puppeteer to accomplish this:

import puppeteer from "puppeteer";

async function fetchHtml(source: string) {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  // Wait until the page is fully loaded.
  await page.goto(source, { waitUntil: 'networkidle0' });
  // Extract the HTML content.
  const data = await page.evaluate(() => {
    // Remove unnecessary elements from the page to reduce the size of the
    // content. This is essential to stay within OpenAI's token limits.
    for (const selector of ['script', 'style', 'link[rel="stylesheet"]', 'noscript', 'head']) {
      document.querySelectorAll(selector).forEach(el => el.remove());
    }
    // Return the rendered HTML content.
    return document.documentElement.outerHTML;
  });
  await browser.close();
  return data;
}

Now, modify your expand function as follows:

// ...
const input = await fetchHtml(source);

Finally, run the code again:

$ npx tsx ./src/example-openai.ts

{
  companies: [
    {
      name: 'Airbnb',
      batch: 'W09',
      url: 'https://www.ycombinator.com/companies/airbnb',
      industry: 'Travel, Leisure and Tourism'
    },
    {
      name: 'Amplitude',
      batch: 'W12',
      url: 'https://www.ycombinator.com/companies/amplitude',
      industry: 'B2B'
    },
    {
      name: 'Coinbase',
      batch: 'S12',
      url: 'https://www.ycombinator.com/companies/coinbase',
      industry: 'Fintech'
    },
    ...

πŸŽ‰ Congratulations, it works!

Conclusion

There's an elephant in the room: cost. Since we're sending the model a lot of tokens (useless HTML tags count towards token consumption too), this can get quite expensive. A single round-trip already cost me about $0.14:

(Screenshot: OpenAI cost analysis)

That's $14 every 100 requests! Now imagine scraping a complex site...
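
For a rough sense of where the money goes, here's a back-of-the-envelope estimate. The per-token rates below are assumptions based on GPT-4o's published pricing at the time of writing, so check OpenAI's pricing page for current numbers:

// Assumed GPT-4o rates (USD per token); verify against OpenAI's pricing page.
const USD_PER_INPUT_TOKEN = 2.5 / 1_000_000;
const USD_PER_OUTPUT_TOKEN = 10 / 1_000_000;

function estimateCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * USD_PER_INPUT_TOKEN + outputTokens * USD_PER_OUTPUT_TOKEN;
}

// A raw, unstripped HTML page can easily weigh in at ~50k tokens:
console.log(estimateCost(50_000, 2_000).toFixed(2)); // ≈ $0.15

Most of that is input tokens, which is exactly why stripping useless HTML before sending it to the model matters so much.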

I'll address cost reduction strategies, performance optimization and other challenges in part 2 of this post.

Sources

The prompt used in the example was taken (and slightly modified) from:


Hi πŸ‘‹ thanks for reading! This was my first ever post on dev.to. It ended up too long so I decided to cut it in multiple parts. If you enjoyed reading my content, consider following me on Twitter to stay in the loop ❀️.

Top comments (3)

tryhardest

is part 2 out? I had some ideas for tokens, not sure this needs 4o either?

Sleeyax

Not yet. I plan to finish and publish it later this month. Thanks for reading though! This post didn't catch much traction at first, which honestly felt a little discouraging when it came to writing the next part.

GPT-4o isn't strictly required. You can experiment with other (cheaper) models too. Llama 3.1 Instruct, for example, isn't too bad in comparison, especially if you clean your input tokens into an LLM-friendly format first.
