Imagine this Scenario
You are building a web application where you want to provide information to visitors of your website.
You don't have information about this subject, so you have to rely on other sources on the internet. You want the information to be as consistent and accurate as possible. You looked around but were not able to find an external API that provides the data that you need. Still, there are more than enough sources available online that are there for the taking, right? You just need something to get it for you and mould it into a format that you can consume.
You could just use or build a website scraper like everyone else does. This is a popular method to get data from the web and has a proven track record. But scrapers require knowledge of where to get the data and can be costly to maintain.
So what about AI? The way AI works, we won't have to tell it where to get the data — it can just "get" it for you, right..? We won't have to worry about the structure of the data; we can just ask it to provide it in our preferred format.
Let's find out! In this post, I will take you along a journey with me and try different approaches to achieve this. With every approach, we will evaluate how it worked for us. For this pet project, let's say we want to build a web application where users can search for a ski lift in a ski resort and get details about this ski lift, like the brand, capacity, etc.
So How Would That Work?
Let's define two major steps:
- Use AI to find the data for us on the web.
- Use AI to transform the data into a suitable format that our application can consume.
Requirements & Criteria
- Data Quality: The data should be consistent and accurate.
- Cost: The solution should be cost-efficient; you are prepared to pay for good, reliable data, but you don't want to pay more than you need.
- Robustness: The solution should be robust — you don't want to have to fix something whenever a data source changes or is removed.
- Practicality: How practical is this solution?
Limitations and Disclaimers
- There are many AI models out there (and more will come), each with its strengths and weaknesses. For the sake of brevity, we will use the ChatGPT AI model in this post. Results may vary with other models, and you are free to try them out yourself.
- Gathering information from other websites to use in your web application can have commercial and/or ethical challenges. If you plan to do this commercially, make sure that it is legal for you to use the information that you gathered.
- The pieces of code provided here are kept minimal for brevity.
It will be like a little adventure! We will try to solve the use case we started with. We will evaluate every step and see what we can do better. The major takeaway here is to explore options — there is no "right" or "wrong" in this article.
1. Using AI to Find the Data for Us on the Web
Asking ChatGPT Chat
Let's start simple. We want data. We'll just ask ChatGPT Chat to get the data for us! Just to see how that would work.
The ChatGPT Chat is a chatbox where you can talk directly with ChatGPT. It provides an answer to any query that you give it. It's really accessible and easy to use; you don't even need an account. Just start typing. It is a bit like magic — it can get everything we need, and we won't have to care where to look, right? We just need to make sure that ChatGPT gives us what we actually want.
So, how would that look using the ChatGPT Chat? We would come up with a prompt something like this:
Prompting
You are a helpful assistant.
You will give us accurate, concise information about a ski lift in a ski resort, which we will provide at the end of this prompt.
You will find these ski lift details: name, lift type, manufacturer, operator, capacity (people/hour), duration (mins), occupancy (persons), construction (year), lift_elevation distance (meters), lift_distance (meters), ski_resort, source_urls.
Provide only the data in a JSON object using the exact matching keys without markdown formatting, where source_urls should be an array of URLs.
Ski lift: Colomba in Val Cenis
Breaking It Down
Let's break down what the prompt does:
- We tell ChatGPT how it should act. This is necessary so ChatGPT knows how to best answer our query. We tell it to act as a helpful assistant and what we expect it to return. This is called "system prompting."
- We will give it specific details to look for. This is necessary so we know exactly the data that ChatGPT will provide, so we can consume it in our application.
- We tell ChatGPT to provide the data in JSON format. We are not interested in anything else. This saves resources and makes it easy for us to consume the data.
Result
If we run the call, presto:
{
"name": "Colomba",
"lift_type": "6-person high-speed detachable chairlift",
"manufacturer": "Doppelmayr",
"operator": "Val Cenis Ski Resort",
"capacity": 2400,
"duration": 3.83,
"occupancy": 6,
"construction": 2010,
"lift_elevation": 159,
"lift_distance": 1051,
"ski_resort": "Val Cenis – Lanslevillard/Lanslebourg/Termignon",
"source_urls": [
"https://www.skidetails.com/ski-resort/val-cenis-lanslevillardlanslebourgtermignon/ski-lifts/l91126/"
]
}
And we have our data! Wow, that was surprisingly easy! Let's evaluate this step.
Recap: ChatGPT Chat
- Data: The data is reasonably consistent because we can prompt for the properties we want. But it is hard to tell if it's accurate without fact-checking the data. We can ask ChatGPT Chat to provide source URLs to make this easier. It is possible that it will hallucinate parts of the data.
- Cost: There's a free tier, but expect to incur costs to improve the accuracy of the data and as your website grows.
- Robustness: This setup will probably always return all the properties you need in a form that you can consume in your web application.
- Practicality: It is not practical. There is no feasible way for your application to connect to the ChatGPT Chat. You won't be able to use this data in your website.
That was a promising start, right? It proved that ChatGPT can gather the data for us and provide it in the form that we want. But it's also pretty obvious that we can't use this:
There is only a user interface for the ChatGPT Chat; there is no interface for our web application.
That's exactly why OpenAI provides a ChatGPT API interface. We'll explore this in the next step.
Asking ChatGPT API
The ChatGPT API is an interface that lets you interact with ChatGPT programmatically. It accesses the same AI models and works similarly, but there are important differences. More on that later.
To access it, you will need an account and an API key. Also, in contrast to the Chat, it has no free tier but a pay-as-you-go model, charging you based on the number of tokens used.
It's relatively straightforward to communicate with the API service. If you want to make the same call as we did with the ChatGPT Chat, it would look like this:
const CHATGPT_API_KEY = `abc123`;
const CHATGPT_API_URL = `https://api.openai.com/v1/chat/completions`;
const resort = "Val Cenis"; // example inputs for the prompt
const liftName = "Colomba";
const prompt = `Can you give me details about this ski lift: ${liftName} in resort: ${resort}`;
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${CHATGPT_API_KEY}`,
};
const body = JSON.stringify({
model: "gpt-4.1-mini",
messages: [
{
role: "system",
content: `You are a helpful assistant.
You will give us accurate, concise information about a ski lift in a ski resort, which we will provide at the end of this prompt.
You will find these ski lift details: name, lift type, manufacturer, operator, capacity (people/hour), duration (mins), occupancy (persons), construction (year), lift_elevation distance (meters), lift_distance (meters), ski_resort, source_urls.
Provide only the data in a JSON object using the exact matching keys without markdown formatting, where source_urls should be an array of URLs.`,
},
{ role: "user", content: prompt },
],
});
const result = await fetch(CHATGPT_API_URL, {
method: "POST",
headers,
body,
});
Breaking It Down
- Sending the API key in the header authenticates us to ChatGPT and ensures that ChatGPT knows who to charge for this request.
- In the API, we can send messages to ChatGPT with a specified role. This tells ChatGPT how to best process the input. For example, we can prime the model with "system input" by adding the property role: "system" to the message. This is called input formatting.
- The ChatGPT API offers additional options like model selection, temperature, maximum token usage, and more (we will not explore these in this post, but I encourage you to try them out sometime).
When prompted, the ChatGPT API would return something like this:
Result
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1736778567,
"model": "gpt-4.1-mini",
"choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"name\":\"Colomba\",\"lift_type\":\"Chair lift\",\"manufacturer\":\"Poma\",\"operator\":\"Société des Remontées Mécaniques de Val Cenis\",\"capacity\":2000,\"duration\":5,\"occupancy\":4,\"construction\":1998,\"lift_elevation\":350,\"lift_distance\":950,\"ski_resort\":\"Val Cenis\",\"source_urls\":[\"https://www.skidetails.com/ski-lifts/val-cenis/colomba/\"]}"
      },
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 205,
"completion_tokens": 149,
"total_tokens": 354,
"prompt_tokens_details": { "cached_tokens": 0, "audio_tokens": 0 },
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"service_tier": "default",
"system_fingerprint": "fp_abc123"
}
Investigating the Result
So what does this tell us?
- The response provides metadata about the request, such as the model used, the time of the request, a unique identifier, etc. The key objects we care about are choices and usage.
- In the choices object, it returns the assistant message whose content is the JSON data we asked for, as a string we can parse and pass into our application. Depending on your prompt, it can return one or more results.
- In the usage object, it details the cost of this request. Every request to the API uses tokens, and there is a cost associated with these tokens depending on the AI model used.
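Getting the lift data out of the response is then just a matter of parsing the assistant message. A minimal sketch, assuming the result variable from the fetch call above:

const response = await result.json();

// The assistant's reply lives in choices[0].message.content as a JSON string
const lift = JSON.parse(response.choices[0].message.content);

console.log(lift.manufacturer); // "Poma"
console.log(response.usage.total_tokens); // tokens billed for this call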
Let's also dive a bit deeper into the API, like how costs are calculated and what models are out there. This greatly influences how useful your data is and what it will cost you.
Breakdown of Costs
We can break down the tokens into three types:
- Prompt Tokens: Represent the number of tokens in the input (the query you provide).
- Completion Tokens: Represent the number of tokens used to generate the output (the response).
- Total Tokens: The sum of prompt and completion tokens.
The total tokens used for this call are 354. The cost would depend on the model:
- GPT-4.1 Mini: $0.00032
- GPT-4.1: $0.0016
- GPT-5: $0.0017
Note: Prices are based on ChatGPT prices - checked Feb 2026.
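As a rough illustration, here is how you could estimate the cost of a call from the usage object. The per-million-token prices below are assumptions derived from the figures above; always check the current pricing page:

// Assumed prices in USD per 1M tokens - verify against the current pricing page
const PRICES = {
  "gpt-4.1-mini": { input: 0.4, output: 1.6 },
  "gpt-4.1": { input: 2.0, output: 8.0 },
};

function estimateCost(usage, model) {
  const price = PRICES[model];
  return (
    (usage.prompt_tokens * price.input +
      usage.completion_tokens * price.output) /
    1_000_000
  );
}

// Our call: 205 prompt tokens + 149 completion tokens on gpt-4.1-mini
console.log(
  estimateCost({ prompt_tokens: 205, completion_tokens: 149 }, "gpt-4.1-mini")
); // ≈ 0.00032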
These costs add up depending on the complexity of the query and the model used. Always optimise your queries to minimise token usage. It helps a lot to keep your queries concise and straightforward, and being really explicit about the output format can save you a great deal. AI models generate elaborate responses unless you tell them not to. After all, this is where the people behind AI models make money...
Breakdown of Models
ChatGPT offers different models that you can use. Which model is best for your use case depends on:
- The reasoning capacity you need.
- The cost you are willing to incur.
Common ChatGPT Models
- GPT-4.1 Mini:
  - Low to medium complexity.
  - High-volume, low-cost tasks.
  - Most cost-effective.
- GPT-4.1:
  - Moderate complexity.
  - Balanced tasks (good for general use).
  - Good balance between cost and capability.
- GPT-5:
  - High complexity, nuanced tasks.
  - Advanced reasoning, planning, and problem-solving.
  - High cost, highest accuracy.
Differences Between ChatGPT Chat and ChatGPT API
So, that proves our case. It works a bit differently, and this time costs are involved, but it proves that we can get the same data using the ChatGPT API as when we used the ChatGPT Chat.
Or did we?
There is a problem with the response. Some of you may have already noticed it. If you look carefully at the results, you will see that, while the properties are the same for both data entries, the data from the ChatGPT API is fundamentally different from the data that ChatGPT Chat provided.
const result_chatgpt_chat = {
name: "Colomba",
lift_type: "6-person high-speed detachable chairlift",
manufacturer: "Doppelmayr",
operator: "Val Cenis Ski Resort",
capacity: 2400,
duration: 3.83,
occupancy: 6,
construction: 2010,
lift_elevation: 159,
lift_distance: 1051,
ski_resort: "Val Cenis – Lanslevillard/Lanslebourg/Termignon",
source_urls: [
"https://www.skidetails.com/ski-resort/val-cenis-lanslevillardlanslebourgtermignon/ski-lifts/l91126/",
],
};
const result_chatgpt_api = {
name: "Colomba",
lift_type: "Chair lift",
manufacturer: "Poma",
operator: "Société des Remontées Mécaniques de Val Cenis",
capacity: 2000,
duration: 5,
occupancy: 4,
construction: 1998,
lift_elevation: 350,
lift_distance: 950,
ski_resort: "Val Cenis",
source_urls: ["https://www.skidetails.com/ski-lifts/val-cenis/colomba/"],
};
Wait... what? What happened? Assuming that both use the same model, the same parameters, and the same prompt, why is the response entirely different?
This puzzled me at first. I just did not understand why this happened. The first hint to what is really going on here can be found when looking at the source URLs.
const resort_url_chat =
"https://www.skidetails.com/ski-resort/val-cenis-lanslevillardlanslebourgtermignon/ski-lifts/l91126/";
const resort_url_api =
"https://www.skidetails.com/ski-lifts/val-cenis/colomba/";
By looking at the source URLs, we see that they use the same website as a source, but the URLs are very different in structure. Now, it's possible that the maintainer of the website has some interesting thoughts about SEO, but that is not the case here. One of these URLs is incorrect, and that is the root of our problem. A quick copy-paste into your browser will quickly tell you which URL is valid. And it turns out...
The URL that we got from the API is total bogus.
So if the URL is wrong, it's very likely that the other data is also wrong. The reason behind this is straightforward, but it is easy to miss!
The ChatGPT API Has No Direct Access to the Internet
Unlike the ChatGPT Chat, the API does not browse webpages by default. Any information that the ChatGPT API provides either comes from your input or is generated by the model. That explains why the data makes no sense at all.
It's all generated.
This is a really easy pitfall to get into. ChatGPT will not warn you in any way that the data might not be accurate, even if we explicitly ask for accurate data. If it cannot find the data, it will just generate it and state it as factual data. This is known as hallucinating.
Note: You can instruct ChatGPT what to do when it can't find factual data. For example, you can tell it to leave the field blank instead of making something up to fill the field. But even then, it might still happen that it returns incorrect data instead of leaving the field blank.
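For example, you could add an instruction like this to the system prompt (the exact wording is just an illustration):

If you cannot find factual data for a field, set its value to null. Do not guess, estimate, or infer values.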
Why Is the ChatGPT API Different?
Why does the ChatGPT API not have access to the internet? It seems like a really convenient functionality to have for the ChatGPT API. We know it is possible — the ChatGPT Chat has it, after all.
There are several reasons for this, but the most important one is cost.
Web browsing can be seen as a wildcard. If you are looking for something on the internet, you will get results depending on:
- How accurate your search term is.
- What information is available on the web.
While ChatGPT is very capable of finding information by browsing the web, it also charges you for any information it processes. And here's the catch: how many tokens will it need to find the data you're asking for? What if it needs to parse a page with 20k+ words to extract 20 words that are relevant to you? What if it needs to process five different webpages from different websites and aggregate this information? The amount of tokens you'll be charged to get this information will stack up quickly.
How many tokens are we willing to spend on a request? What if the token limit is reached? Can the code handle incomplete results? How can you guide ChatGPT on how/where to look for the information?
As you can see, it is far from certain what information we can expect and at what cost if we depend on information found on the web. Also, we would have to implement contingencies when we do not get the results we want.
Recap: ChatGPT API
So, where are we now?
- Data: We can't rely on the data at all. Because the API does not have access to the internet, it will fall back to its LLM and hallucinate the results.
- Cost: There will be a small cost per request depending on token and model usage.
- Robustness: This setup will still probably return all the properties you need in a form that you can consume in your web application.
- Practicality: It is not practical. Without access to the internet, we only get hallucinated data.
This kind of blows our use case out of the water, doesn't it? This was precisely the reason we wanted to use AI: to get us the information we are looking for without all that hassle. If it can't access that information to start with, we're kind of dead in the water..
Asking ChatGPT API Using the Web-Search Tool
There is something for that. It's called the web-search tool.
You can tell the ChatGPT API to perform a web search whenever it needs real-time data. This is an explicit opt-in, so you need to configure it, or ChatGPT will start guessing...
Let's dive into it a bit.
The web-search tool tells the model that it can use web_search if it needs additional information. The search results are fed back into the model so it can use them as input for your original query.
Note: Additionally, there are specific 'search-preview' AI models available, but they are very similar to using the web-search tool and have a 'probably obsolete soon' feel about them. Still, you can try a specific search-preview model if you want to.
Using the web-search tool would look something like this:
const CHATGPT_API_KEY = `abc123`;
const CHATGPT_API_URL = `https://api.openai.com/v1/responses`;
const prompt = `Can you give me details about this ski lift: ${liftName} in resort: ${resort}`;
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${CHATGPT_API_KEY}`,
};
const body = JSON.stringify({
model: "gpt-5", // or another tool-capable model you use
tools: [{ type: "web_search" }],
include: ["web_search_call.action.sources"],
input: [
{
role: "system",
content:
`You are a helpful assistant.
You will give us accurate, concise information about a ski lift in a ski resort, which we will provide at the end of this prompt.
You will find these ski lift details: name, lift type, manufacturer, operator, capacity (people/hour), duration (mins), occupancy (persons), construction (year), lift_elevation distance (meters), lift_distance (meters), ski_resort, source_urls.
Provide only the data in a JSON object using the exact matching keys without markdown formatting, where source_urls should be an array of URLs.`,
},
{ role: "user", content: prompt },
],
});
const result = await fetch(CHATGPT_API_URL, {
method: "POST",
headers,
body,
});
Breaking It Down
Nothing groundbreaking really. Besides switching to the Responses API endpoint (/v1/responses, which takes input instead of messages), we just added these lines to the previous request:
tools: [{ type: "web_search" }],
include: ["web_search_call.action.sources"],
Result
Running this call would give us a response like this:
{
"id": "resp_abc123",
"object": "response",
"created": 1736778567,
"model": "gpt-5",
"output": [
{
"type": "web_search_call",
"id": "ws_01",
"action": {
"type": "search",
"query": "Colomba chairlift Val Cenis manufacturer capacity length vertical rise year",
"sources": [
{
"url": "https://www.skiresort.info/ski-resort/val-cenis-lanslevillard-lanslebourg-termignon/ski-lifts/l91126/",
"title": "Colomba - Val Cenis lift details"
},
{
"url": "https://www.skiresort.info/ski-resort/val-cenis-lanslevillard-lanslebourg-termignon/ski-lifts/",
"title": "Ski lifts Val Cenis – list"
},
{
"url": "https://www.yumpu.com/en/document/view/56526718/doppelmayr-garaventa-annual-brochure-2011",
"title": "Doppelmayr/Garaventa Annual Brochure 2011 (La Colomba)"
}
]
}
},
{
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "{\"name\":\"Colomba\",\"lift_type\":\"6pers. High speed chairlift (detachable)\",\"manufacturer\":\"Doppelmayr\",\"operator\":null,\"capacity\":2400,\"duration\":4,\"occupancy\":6,\"construction\":2010,\"lift_elevation\":159,\"lift_distance\":1051,\"ski_resort\":\"Val Cenis\",\"source_urls\":[\"https://www.skiresort.info/ski-resort/val-cenis-lanslevillard-lanslebourg-termignon/ski-lifts/l91126/\",\"https://www.skiresort.info/ski-resort/val-cenis-lanslevillard-lanslebourg-termignon/ski-lifts/\",\"https://www.yumpu.com/en/document/view/56526718/doppelmayr-garaventa-annual-brochure-2011\"]}"
}
]
}
],
"usage": {
"input_tokens": 250,
"output_tokens": 210
}
}
Which is pretty similar to the previous response. You will notice the new 'action' object, which summarizes a web-search call. It tells you the query ChatGPT executed and the sources that the query yielded.
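Extracting the lift data works much like before. A minimal sketch, assuming the result variable from the fetch call above:

const response = await result.json();

// The assistant's answer is the output item of type "message";
// its text content is the JSON string we asked for
const message = response.output.find((item) => item.type === "message");
const lift = JSON.parse(message.content[0].text);

// The web_search_call item tells us which sources were consulted
const search = response.output.find((item) => item.type === "web_search_call");
console.log(search.action.sources.map((source) => source.url));
console.log(lift.capacity); // 2400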
Using the web-search tool is not free: expect to spend about $0.01 for every web-search call, plus the content tokens that are fed into your original query. It is hard to say how many content tokens will be used because the result is non-deterministic; expect additional tokens on a typical web search, with the amount varying by query and sources.
Recap: ChatGPT API + Web-search
So, how did the web-search tool do?
- Data: We probably get all the results we asked for.
- Cost: There will be a small cost involved every time the web-search tool is invoked along with content tokens from the search result.
- Robustness: This setup will probably work, but we have limited control on what we query for and what results we get. We can never be sure where it gets its data from and how accurate it is. We could have conflicting or incomplete data.
- Practicality: It is practical (yay!). It can get us what we need, at a minor price and the trade-off of some control.
Some thoughts about control
As previously stated, you have limited control over what the model will look for and which results it picks. We don't even know what search engine or reasoning is behind it, as this is not disclosed and can change over time. Also, you won't know whether the results are copyrighted (so it might be a legal liability).
It's a bit like a genie that fulfils all your wishes. Your wish is fulfilled, but is it what you truly wanted...?
Still, there are some things you can do to tip the balance a bit.
Tool-choice
The model will only use web-search whenever it thinks it needs to. You can use tool_choice: "required" to force the model to always perform a search if you need to.
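A sketch of what that would look like in the request body (systemPrompt stands in for the system prompt used earlier):

const body = JSON.stringify({
  model: "gpt-5",
  tools: [{ type: "web_search" }],
  tool_choice: "required", // force the model to call a tool (here: web_search) at least once
  include: ["web_search_call.action.sources"],
  input: [
    { role: "system", content: systemPrompt },
    { role: "user", content: prompt },
  ],
});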
Prompting
You cannot directly provide a search query or search options like you would with a regular search engine, but you can update your prompt and provide some direction. For example:
If you want it to look for a specific query:
When performing web search, use the following query exactly:
"Colomba chairlift Val Cenis capacity manufacturer length vertical year"
Do not broaden or rephrase the query.
If you want it to look at specific trusted (or legal) places:
rules:
- You may extract data only from these domains:
- doppelmayr.com
- leitner.com
- skiresort.info
- If information is not explicitly stated on those domains, return null.
- Do not infer, estimate, or merge values.
2. Using a Search API + ChatGPT
So we were able to
- Use AI to find the data for us on the web.
- Use AI to transform the data into a suitable format that our application can consume.
However, there are some trade-offs. We have only limited control over the web search and its results. We cannot set the engine, exact query, pagination, domain filters, etc. We don't even know whether the data we get is copyrighted. It might be necessary to have more control.
So let's explore that a bit: What if we want to have more control on where the data comes from? There's a straightforward answer for this. We could do what we always did:
Use Search Engines and feed the result into ChatGPT
Search engines are excellent at finding results. But results can be influenced by companies paying for visibility or may not contain all the data you need. It might also be copyrighted. To get legal, accurate and complete data, it's helpful to identify a reliable source that provides all the necessary information. You probably don't want to deal with collecting results from multiple sources to keep things simple and straightforward.
Getting the Data
For this post, we'll use the Brave Search API. Other search engines also offer API versions. You'll need to create an account and generate an API key to communicate with the search API. Most APIs offer a free plan with rate limits, with paid plans available if you need more traffic.
Search engines need accurate input. If you provide a vague prompt, you'll get vague results. It's important to carefully craft the prompt you give to the search API.
Let's tell the API to prioritise results from a trusted ski data website, https://my-trusty-ski-data-website.com.
Here’s how it might look:
const website = "https://my-trusty-ski-data-website.com";
const resort = "val cenis";
const liftName = "colomba";
const SEARCH_API_KEY = `abc123`;
const SEARCH_API_URL = `https://api.search.brave.com/res/v1/web/search`;
const SEARCH_API_OPTIONS = {
count: 1, // We are only interested in the first result
result_filter: "web", // We only want web results
};
const options = {
  headers: {
    Accept: "application/json",
    "X-Subscription-Token": SEARCH_API_KEY,
    "Accept-Encoding": "gzip",
  },
  method: "GET",
};
// Encode SEARCH_API_OPTIONS so that we can use them as a query string
const queryString = Object.entries(SEARCH_API_OPTIONS)
  .map(
    ([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`
  )
  .join("&");
// Prioritise our trusted website via the site: operator
const query = `site:${website} details ${resort} ${liftName}`;
const url = `${SEARCH_API_URL}?q=${encodeURIComponent(query)}&${queryString}`;
const response = await fetch(url, options);
Breaking It Down
- It makes a call to the Brave Search API to get results related to our query.
- Provides headers to include the API key and specify the result format.
- Configures the API to return only a single web result.
- Specifies the website, ski resort, and lift name in the query.
- Ensures the data is properly URL-encoded.
This query will return one URL with (hopefully) all the information we need! However, there's no guarantee that the result will be useful. It's always a good idea to tweak the search request to make it as reliable as possible.
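Reading the first hit out of the response could look like this (a sketch based on Brave's response shape; double-check the field names against the API documentation):

const searchData = await response.json();

// Web results live under web.results; we asked for only one result (count: 1)
const firstHit = searchData.web?.results?.[0];
const sourceUrl = firstHit?.url;

if (!sourceUrl) {
  throw new Error("No search results found - consider tweaking the query");
}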
Processing the Data
Once we access the URL, we get an entire webpage. While it should contain the data we need, it will also include unnecessary elements like headers, sidebars, footers, scripts, and styles. We need to extract the relevant data. Let’s see how this would look using ChatGPT.
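A minimal sketch of that step, reusing the ChatGPT API constants from earlier (sourceUrl comes from the search step above; the prompt wording is just an illustration):

// Fetch the page we found via the search API
const pageResponse = await fetch(sourceUrl);
const html = await pageResponse.text();

// Ask ChatGPT to extract the ski lift details from the raw page content
const extraction = await fetch(CHATGPT_API_URL, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${CHATGPT_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4.1-mini",
    messages: [
      {
        role: "system",
        content: `Extract these ski lift details from the provided page content: name, lift_type, manufacturer, operator, capacity, duration, occupancy, construction, lift_elevation, lift_distance, ski_resort.
Provide only the data in a JSON object using the exact matching keys without markdown formatting. Use null for values that are not stated in the content.`,
      },
      { role: "user", content: html },
    ],
  }),
});

const lift = JSON.parse((await extraction.json()).choices[0].message.content);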
Cost of Processing the Data
ChatGPT is excellent at processing data, but it consumes tokens for the input you provide. So how much data are we talking about, and what would it cost?
Let’s work with some numbers:
- The average HTML size of a webpage is 50–100 KB, excluding assets like media, CSS, and JavaScript. Let’s assume 100 KB for this calculation.
- 100 KB equals 102,400 bytes.
- 1 byte roughly equals 1 character (approximation; UTF-8 may use multiple bytes per character).
- 1 ChatGPT token equals ~4 characters in English (or HTML-like text).
This means we would need 102,400 / 4 = 25,600 tokens to feed the page into ChatGPT.
Price Estimate (input tokens only):
- GPT-4.1-mini: $0.01
- GPT-4.1: $0.051
- GPT-5: $0.032
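A quick back-of-the-envelope helper for this kind of estimate (the price is an assumed example in USD per 1M input tokens):

// Rough input-cost estimate: ~4 characters per token for English/HTML-like text
function estimateInputCost(text, pricePerMillionInputTokens) {
  const tokens = Math.ceil(text.length / 4);
  return (tokens * pricePerMillionInputTokens) / 1_000_000;
}

// A 100 KB page at an assumed $0.40 per 1M input tokens
console.log(estimateInputCost("x".repeat(102_400), 0.4)); // ≈ 0.01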
That actually doesn't sound too bad... but it adds up quickly at scale. For example, at 1,000 calls a day you would be looking at $10-50 per day.
Recap: Brave API + ChatGPT API
So, what did this bring us?
- Data: We should be able to get pretty good data, as we can tell what we want to look for and where to look.
- Cost: The costs can potentially be higher than we would like, as we feed the entire webpage to ChatGPT.
- Robustness: This setup will probably return all the properties you need in a form that you can consume in your web application.
- Practicality: It is reasonably practical but does not feel very efficient.
So we can use ChatGPT to extract the data we need from the website source code we found, but it's not really efficient yet. Also, there is only so much data that ChatGPT can process at a time, because of something called the context window.
Context Window
Keep in mind that AI models have a context window limit. The ChatGPT API is stateless, meaning it won't retain input between API calls. There are ways to work around this (e.g., resending the input with every query or choosing a model with a larger context window), but it's clear that making your input as concise as possible before providing it to ChatGPT is crucial.
Sending an entire page source into the ChatGPT API would not be my definition of concise.
Optimising Input Size
What can we do to make the input smaller and more concise? A typical HTML page contains a lot of unnecessary information for our use case, such as:
- Commented code
- HTML markup
- Inline styles
- Inline JavaScript
- Assets
Use Search Engine + Scraper and feed the result into ChatGPT
Let’s clean that up a bit!
We can use a scraper for this. Scrapers are excellent at collecting data from webpages. They are typically used to extract specific data, like prices, from webpages. However, relying on specific page structures can be unreliable, as any changes to the webpage by its maintainer can break the scraper.
There are two main approaches we can take:
- Extract the exact data we need using CSS selectors.
- Remove everything we don't need.
In this case, I would choose the second option. It’s a more robust and reliable way to reduce the size of our input without risking changes in the input breaking our scraper.
For this post, we’ll use Cheerio, but there are many other options available.
Here’s how it might look:
import * as cheerio from "cheerio";
const URL = `https://url-with-data`;
const $ = await cheerio.fromURL(URL);
$("script, style").remove(); // Remove all scripts and styles
const rawText = $("body").text(); // Get all text from the page
const data = rawText.replace(/\s+/g, " ").trim(); // Remove unnecessary whitespaces
Breaking It Down
This will give us all the text on the webpage, stripped of unnecessary elements. This approach can reduce the text size by 90% or more. For example, a 100 KB webpage would be reduced to 10 KB of text.
Cost Comparison
Let’s compare the costs of processing the reduced input:
- GPT-4.1-mini: $0.001
- GPT-4.1: $0.0051
- GPT-5: $0.0032
And so we can save 90% in cost, just by removing things we do not need.
Recap: Brave Search + Cheerio + ChatGPT API
So, what did this bring us?
- Data: We should be able to get pretty good data, as we can tell what we want to look for and where to look.
- Cost: The costs can be optimised so we don't pay more than we should or want to.
- Robustness: This setup will probably return all the properties you need in a form that you can consume in your web application.
- Practicality: It is reasonably practical, but is the extra hassle worth it..?
At this point, you might realise that we are taking more and more tasks out of ChatGPT’s hands. It feels like we’re falling back more and more on common, widely used practices. It sounds like we are building a regular scraper. With some AI sprinkles.
Why not just use the web-search tool?
So why should we bother with all this and not just use the web-search tool?
It depends. As stated before, it all comes down to how much control you need and the price you are willing to pay. Looking at the numbers above, the web-search tool is pretty pricey compared to the methods discussed: it adds $0.01 to every call you make regardless of the other tokens used, even for small inputs.
Conclusion: What Does This Mean?
Here's what I take away from this:
- AI Can Retrieve Web Data Directly: With the web-search tool, the ChatGPT API can access the internet, at some extra cost and with less control over the result. There are some things you can do to fine-tune the results up to a certain point.
- AI Excels at Data Transformation: Use AI to process and structure raw data into a usable format for your application.
- Optimise Data to Save Costs: Always optimise input/output size to reduce AI processing costs and improve efficiency.
- AI Has Limitations: Remain mindful of constraints like the context window and potential inaccuracies (hallucinations).
- AI Is a Tool, Not a Solution: The more concise/specific your query, the better your result will be. You still need to think it through to prevent getting something you don't want or can't use. Optionally combine AI with traditional methods for a more cost-effective and robust data pipeline.
- Caching: I did not discuss caching in this post, but of course it's a no-brainer to have a caching strategy whenever you work with cacheable data to optimise your code and prevent unnecessary costs.
| | ChatGPT Chat | ChatGPT API | ChatGPT API web-search | Brave Search + ChatGPT API | Brave Search + Cheerio + ChatGPT API |
|---|---|---|---|---|---|
| Accuracy | + | -- | + | ++ | ++ |
| Control | -- | - | - | + | + |
| Cost | ++ | + | - | + | ++ |
| Robustness | -- | -- | - | + | + |
| Legal risk | - | + | - | + | + |
| Complexity | ++ | + | + | - | -- |
| Practicality | -- | -- | + | + | + |
Note: "Legal risk" here means the risk of ingesting/reproducing third-party web content without clear permission or provenance. Pure hallucinations may reduce copyright risk, but increase misinformation/product liability risk.
Other Considerations
AI is evolving rapidly, with new possibilities and opportunities emerging every day. The conclusions I’ve drawn here might change in the near future — perhaps even today.
There are other options worth exploring:
- Agentic AI: Agentic AI is like telling a person to do a job for you, such as 'fetch me the cheapest flight to Annecy' or 'do all my declarations for this month'. You could also ask it to gather information for you, like we did in this post. This will probably do the job, but it might take time to yield results and will be hard to scale. It also sounds a bit like using a sledgehammer to drive in a nail.
- Self-hosting/training your AI: It gives you more control over your costs, but purchasing or running your own hardware might not necessarily be cheaper.
- Using a paid solution like roborabbit.com: This could be a viable alternative depending on your needs.