Yesterday we were able to connect our ChatGPT chatbots to the internet. This feature is still experimental, and it doesn't always work - but today we were able to sort out 90% of the bugs related to it.
In this article I will highlight some of the more important things we did to accomplish this, and share my findings with the rest of the world. Hopefully it'll be useful for others trying to accomplish the same.
We're a Hyperlambda shop of course, but you can probably translate our ideas to your programming language of choice.
Try ChatGPT with Internet Access
Before I start explaining what we did, do me a favour. Click our chat button in the bottom right corner of this page, and write the following into it:
Find me information for the following query "Does WHO declare COVID-19 to be a pandemic in 2023?"
The point is that you'll get something like the following back from it.
The difference between the above answer and what ChatGPT gives you is obvious, I presume. Our chatbot can reach out to the internet using DuckDuckGo and scrape the resulting websites, allowing ChatGPT to deal with real-time information from the internet. Notice, it will only reach out to the internet if you write your query as follows:
Find me information for the following query "QUERY_HERE"
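Exactly how Magic detects this trigger phrase isn't shown in this article, but as a hedged illustration, here is a minimal TypeScript sketch of how one might match it and pull out the quoted query. The function name and the exact pattern are my own assumptions, not Magic's code.

```typescript
// Hypothetical sketch: detect the "Find me information for the following query" trigger
// and extract the quoted query. The pattern and function name are assumptions for illustration.
const TRIGGER = /^find me information for the following query\s+"([^"]+)"/i;

function extractWebQuery(prompt: string): string | null {
  const match = prompt.trim().match(TRIGGER);
  return match ? match[1] : null;
}

// Example usage:
// extractWebQuery('Find me information for the following query "Does WHO declare COVID-19 to be a pandemic in 2023?"')
// => 'Does WHO declare COVID-19 to be a pandemic in 2023?'
```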
Now that we've got the semantics out of the way, let's look at some of the things we had to do to increase the quality of this process. Initially, only about 40 to 50 percent of our queries would succeed, and below I will explain why, and what we did to push this number beyond 90%.
Website scraping
We've probably got the best web scraping technology in the industry, and we've learned a lot using it over the last 7 months, scraping dozens of websites every single day thanks to our "get a free AI chatbot" web form. This puts us in a unique position to understand how to create high quality AI training data from websites and all sorts of other sources - And you'd be surprised by how much of "the AI problem" is good old-fashioned software development, with algorithms, architecture, composition, software design, and simple code.
If you want better AI, write better traditional code 😉
Some of our more important findings regarding website scraping are as follows.
Not all websites CAN be scraped
We try to be a "good scraping citizen". By this I mean we clearly identify our spiders as website scrapers, using unique, identifiable HTTP User-Agent headers, and we try our best to respect websites and avoid overloading them as we scrape them. More work can be done here, but at least, contrary to most others, we don't "hide" the fact that we're scraping your website.
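Identifying yourself boils down to sending an honest User-Agent header with every request. As a hedged sketch in TypeScript (the header value below is an assumption for illustration, not Magic's actual string):

```typescript
// Sketch: identify the scraper through its User-Agent header instead of
// pretending to be a browser. The bot name and URL are placeholder values.
async function politeFetch(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'ExampleScraperBot/1.0 (+https://example.com/bot-info)',
    },
  });
  return await response.text();
}
```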
However, not all websites allow themselves to be scraped. Some simply shut off all web scrapers they can identify. Others have web firewalls, preventing anything but "human beings" from accessing and scraping them - Which creates a problem for us as we try to retrieve whatever information we can find at these sites.
The way we solve this is by invoking DuckDuckGo and retrieving the top 5 hits for whatever query the user is searching for. Then we retrieve all of these in parallel, with a timeout of 10 seconds. Why the timeout? Because some sites will "block you from getting data while keeping the socket connection open", implying they will never return. The idea is that unless the site returns its HTML in less than 10 seconds, we release the HTTP connection and simply ignore that URL.
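To illustrate the timeout idea outside of Hyperlambda, here is a minimal TypeScript sketch of fetching a single URL with a hard 10-second deadline using AbortController. The function name is my own; Magic's actual implementation lives in its scraping slots.

```typescript
// Sketch: fetch a URL but give up after 10 seconds, so a site that never
// responds can't stall the whole scraping round. Illustrative only.
async function fetchWithTimeout(url: string, timeoutMs = 10_000): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.text();
  } catch {
    // Timed out, blocked, or otherwise failed - ignore this URL.
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```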
Out of 5 hits from DuckDuckGo, typically 1 or fewer will block us. Since we're fetching information in parallel, asynchronously from 5 URLs, we'll still get some information from 2 or 3 websites 98% of the time. And the process as a whole will never take more than 10 seconds due to our timeout. The timeout is crucial for us, since we don't persist data locally, but always fetch it on demand, implying 10 seconds spent scraping web pages becomes 10 extra seconds before you get your answer from ChatGPT.
Below is the primary entry point code. Even if you don't understand Hyperlambda, you should be able to understand the general idea, and possibly translate it into your programming language of choice.
/*
* Slot that searches DuckDuckGo for [max] URLs matching the [query],
* then scrapes each URL, aggregates the results, and
* returns them to the caller as a single chunk of Markdown.
*/
slots.create:magic.http.duckduckgo-and-scrape
// Sanity checking invocation.
validators.mandatory:x:@.arguments/*/query
validators.string:x:@.arguments/*/query
min:3
max:250
validators.integer:x:@.arguments/*/max
min:1
max:10
// Searching DuckDuckGo for matches.
add:x:+
get-nodes:x:@.arguments/*
signal:magic.http.duckduckgo-search
// Building our execution object that fetches all URLs simultaneously in parallel.
.exe
// Waiting for all scraping operations to return.
join
for-each:x:@signal/*/result/*
// Dynamically constructing our lambda object.
.cur
fork
.reference
try
unwrap:x:+/*
signal:magic.http.scrape-url
url:x:@.reference/*/url
semantics:bool:true
.catch
log.error:Could not scrape URL
url:x:@.reference/*/url
message:x:@.arguments/*/message
// Adding URL and title as reference to currently iterated [fork].
unwrap:x:+/*/*
add:x:@.cur/*/fork/*/.reference
.
url:x:@.dp/#/*/url
title:x:@.dp/#/*/title
// Adding current thread to above [join].
add:x:@.exe/*/join
get-nodes:x:@.cur/*
// Executing [.exe] retrieving all URLs in parallel.
eval:x:@.exe
/*
* Iterating through each above result,
* returning result to caller.
*
* Notice, we only iterate through invocations that have result, and
* did not timeout by verifying [signal] slot has children.
*/
for-each:x:@.exe/*/join/*/fork
// Verifying currently iterated node has result, containing both prompt and completion.
if
exists:x:@.dp/#/*/try/*/signal/*/*/prompt/./*/completion
.lambda
// Adding primary return lambda to [return] below.
unwrap:x:+/*/*/*
add:x:../*/return
.
.
url:x:@.dp/#/*/.reference/*/url
title:x:@.dp/#/*/.reference/*/title
snippets
// Adding [snippets] to return below.
add:x:../*/return/0/-/*/snippets
get-nodes:x:@.dp/#/*/try/*/signal/*
// Returning result of invocation to caller.
return
The basic idea is as follows (a rough TypeScript sketch of the same flow follows the list):
- Query DuckDuckGo and scrape the resulting top 5 URLs
- Create one async thread for each result, and retrieve the pages from their respective URLs, with a timeout of 10 seconds
- Wait for all threads to finish, and create an aggregated result
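For readers who don't want to parse Hyperlambda, here is a rough TypeScript sketch of the same flow, reusing the fetchWithTimeout helper from earlier. The searchDuckDuckGo and htmlToMarkdown functions are placeholders for steps Magic implements in its own slots; their names and signatures are my own assumptions.

```typescript
// Sketch of the overall flow: search, scrape in parallel with a timeout,
// and aggregate whatever came back into one result. Illustrative only.
interface ScrapedPage {
  url: string;
  markdown: string;
}

// Placeholders for steps not shown here - names and signatures are assumptions.
declare function searchDuckDuckGo(query: string, max: number): Promise<string[]>;
declare function htmlToMarkdown(html: string): string;

async function duckDuckGoAndScrape(query: string, max = 5): Promise<ScrapedPage[]> {
  const urls = await searchDuckDuckGo(query, max);

  // Fetch every URL at the same time; slow or blocking sites simply yield null.
  const pages = await Promise.all(
    urls.map(async (url) => {
      const html = await fetchWithTimeout(url, 10_000);
      return html ? { url, markdown: htmlToMarkdown(html) } : null;
    })
  );

  // Keep only the pages that actually returned something within the deadline.
  return pages.filter((page): page is ScrapedPage => page !== null);
}
```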
There is a lot more code related to this, but since Magic is Open Source, you can study its code for more details. For instance, we do a lot to create Markdown out of the resulting HTML. This significantly reduces the amount of data we're sending to ChatGPT, while also keeping hyperlinks, images, and lists in their semantic form. This is why our chatbot can display images, hyperlinks, and lists the way it does. This simple fact alone increases the quality of our chatbot by at least one order of magnitude.
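Magic implements its own HTML-to-Markdown transformation in Hyperlambda and C#. As a hedged illustration of the general technique (and one possible way to implement the htmlToMarkdown placeholder from the sketch above), something similar can be done in TypeScript with the turndown library. This is not the code Magic uses, just the idea.

```typescript
import TurndownService from 'turndown';

// Convert scraped HTML into Markdown so links, images, and lists keep their
// semantic form, while dropping most of the markup noise sent to ChatGPT.
const turndown = new TurndownService({ headingStyle: 'atx' });

function htmlToMarkdown(html: string): string {
  return turndown.turndown(html);
}
```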
We do NOT STEAL your information
One thing we do differently is that we try our best to always provide sources and references to our users, if we can fit them into the context. This implies the chatbot will typically end its explanation with something like "This information was fetched from the following URLs; abc, xyz".
This is first of all the polite thing to do, and secondly it allows our users to fact-check what our chatbots are telling them. The end result is that instead of "stealing traffic from your website", we'd probably GIVE your website additional traffic - Since users will probably want to fact-check their queries by reading the sources DuckDuckGo provides us with.
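How Magic injects those references into the prompt isn't shown in this article, but conceptually it amounts to appending the scraped URLs to the context handed to ChatGPT, so the model can cite them. A hypothetical sketch, reusing the ScrapedPage type from above:

```typescript
// Hypothetical sketch: build the context passed to ChatGPT from the scraped pages,
// keeping the source URLs so the model can reference them in its answer.
function buildContext(pages: ScrapedPage[]): string {
  const snippets = pages.map((page) => page.markdown).join('\n\n');
  const sources = pages.map((page) => page.url).join(', ');
  return `${snippets}\n\nThis information was fetched from the following URLs: ${sources}`;
}
```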
Conclusion
This is hard. I remember my former partner saying: "Why should I invest in something anybody can copy and steal?" Well, so far we're the only ones in the industry able to do what we're currently doing. We're basically "10 years ahead of the competition", and nobody is able to "copy us" - Even though I do my best to help them copy our ideas every single day, by exclusively innovating openly in the public space, and open source licensing 99% of every single line of code I write 😂
You were wrong, I was right, check. 7 billion 999 million 999 thousand and 999 more to go 😂
Top comments (6)
I am unable to access the uploaded .csv file for the chatbot process using an OpenAI API key in a Next.js project. Kindly suggest any libraries or logic to move ahead.
Not sure what you mean? Are you uploading CSV files to OpenAI?
I created "chatbot automation process" page in nextjs.In this page, I created the upload option to upload csv file. I try to implement chatbot for user and AI conversation using openai apikey based on current uploaded csv data only but I unable to get relevant response from AI for user query. The code is given below:
```jsx
import React, { useState } from 'react';
import Papa from 'papaparse';
import axios from 'axios';
const UploadPage = () => {
const [csvData, setCsvData] = useState([]);
const [chatHistory, setChatHistory] = useState([]);
const [currentQuery, setCurrentQuery] = useState('');
const handleFileUpload = (file) => {
if (file) {
const reader = new FileReader();
reader.onload = (e) => {
const text = e.target.result;
const parsedData = Papa.parse(text, { header: true }).data;
setCsvData(parsedData);
};
reader.readAsText(file);
} else {
setCsvData([]);
}
};
const findAnswerFromData = (query) => {
if (!query || csvData.length === 0) {
return null;
}
};
const fetchAIResponse = async (query) => {
const apiUrl = 'api.openai.com/v1/engines/davinci/...';
const headers = {
'Content-Type': 'application/json',
Authorization: 'Bearer YOUR_APIKEY',
};
};
const handleSendQuery = async (e) => {
e.preventDefault();
};
return (
  <input type="file" onChange={(e) => handleFileUpload(e.target.files[0])} />
);
};
export default UploadPage;
```
Did you add the last `}` to your code?
Hmm, I'm not the guy you should ask about ReactJS, or Liquid syntax for that matter. Maybe reach out to one of the devs on this particular project ...?