Yesterday we were able to connect our ChatGPT chatbots to the internet. This feature is still experimental, and it doesn't always work - but today we were able to sort out 90% of the bugs related to it.
In this article I will highlight some of the more important things we did to accomplish this, and share my findings with the rest of the world. Hopefully it'll be useful for others trying to accomplish the same.
We're a Hyperlambda shop of course, but you can probably translate our ideas to your programming language of choice.
Try ChatGPT with Internet Access
Before I start explaining what we did, do me a favour. Click our chat button in the bottom right corner of this page, and write the following into it:
Find me information for the following query "Does WHO declare COVID-19 to be a pandemic in 2023?"
The point is that you'll get something like the following back from it.
The difference between the above answer and what ChatGPT gives you is obvious, I presume. Our chatbot can reach out to the internet using DuckDuckGo and scrape the resulting websites, allowing ChatGPT to deal with real-time information from the internet. Notice, it will only reach out to the internet if you write your query as follows:
Find me information for the following query "QUERY_HERE"
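Exactly how Magic detects this trigger phrase isn't shown in this article, but as a hedged illustration, here is a minimal TypeScript sketch of how one might match it and pull out the quoted query. The function name and the exact pattern are my own assumptions, not Magic's code.

```typescript
// Hypothetical sketch: detect the "Find me information for the following query" trigger
// and extract the quoted query. The pattern and function name are assumptions for illustration.
const TRIGGER = /^find me information for the following query\s+"([^"]+)"/i;

function extractWebQuery(prompt: string): string | null {
  const match = prompt.trim().match(TRIGGER);
  return match ? match[1] : null;
}

// Example usage:
// extractWebQuery('Find me information for the following query "Does WHO declare COVID-19 to be a pandemic in 2023?"')
// => 'Does WHO declare COVID-19 to be a pandemic in 2023?'
```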
Now that we've got the semantics out of the way, let's look at some of the things we had to do to increase the quality of this process. Initially, only about 40 to 50 percent of our queries would succeed, and below I will explain why, and what we did to push this number beyond 90%.
Website scraping
We've probably got the best web scraping technology in the industry, and we've learned a lot using it over the last 7 months, scraping dozens of websites every single day thanks to our "get a free AI chatbot" web form. This puts us in a unique position to understand how to create high quality AI training data from websites and all sorts of other sources - And you'd be surprised by how much of "the AI problem" is good old-fashioned software development, with algorithms, architecture, composition, software design, and simple code.
If you want better AI, write better traditional code 😉
Some of our more important findings regarding website scraping are as follows.
Not all websites CAN be scraped
We try to be a "good scraping citizen". By this I mean we clearly identify our spiders as website scrapers, using unique, identifiable HTTP User-Agent headers, and we try our best to respect websites and avoid overloading them as we scrape them. More work can be done here, but at least, contrary to most others, we don't "hide" the fact that we're scraping your website.
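Identifying yourself boils down to sending an honest User-Agent header with every request. As a hedged sketch in TypeScript (the header value below is an assumption for illustration, not Magic's actual string):

```typescript
// Sketch: identify the scraper through its User-Agent header instead of
// pretending to be a browser. The bot name and URL are placeholder values.
async function politeFetch(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'ExampleScraperBot/1.0 (+https://example.com/bot-info)',
    },
  });
  return await response.text();
}
```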
However, not all websites allow themselves to be scraped. Some simply shut off all web scrapers they can identify. Others have web firewalls, preventing anything but "human beings" from accessing and scraping them - Which creates a problem for us as we try to retrieve whatever information we can find at these sites.
The way we solve this is by invoking DuckDuckGo and retrieving the top 5 hits for whatever query the user is searching for. Then we retrieve all of these in parallel, with a timeout of 10 seconds. Why the timeout? Because some sites will "block you from getting data while keeping the socket connection open", implying they will never return. The idea is that unless the site returns its HTML in less than 10 seconds, we release the HTTP connection and simply ignore that URL.
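To illustrate the timeout idea outside of Hyperlambda, here is a minimal TypeScript sketch of fetching a single URL with a hard 10-second deadline using AbortController. The function name is my own; Magic's actual implementation lives in its scraping slots.

```typescript
// Sketch: fetch a URL but give up after 10 seconds, so a site that never
// responds can't stall the whole scraping round. Illustrative only.
async function fetchWithTimeout(url: string, timeoutMs = 10_000): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.text();
  } catch {
    // Timed out, blocked, or otherwise failed - ignore this URL.
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```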
Out of 5 hits from DuckDuckGo, typically 1 or fewer will block us. Since we're fetching information in parallel, asynchronously from 5 URLs, we'll still get some information from 2 or 3 websites 98% of the time. And the process as a whole will never take more than 10 seconds due to our timeout. The timeout is crucial for us, since we don't persist data locally, but always fetch it on demand, implying 10 seconds spent scraping web pages becomes 10 extra seconds before you get your answer from ChatGPT.
Below is the primary entry point code. Even if you don't understand Hyperlambda, you should be able to understand the general idea, and possibly translate it into your programming language of choice.
/*
* Slot that searches DuckDuckGo for [max] URLs matching the [query],
* then scrapes each URL, aggregates the results, and
* returns them to the caller as a single chunk of Markdown.
*/
slots.create:magic.http.duckduckgo-and-scrape
// Sanity checking invocation.
validators.mandatory:x:@.arguments/*/query
validators.string:x:@.arguments/*/query
min:3
max:250
validators.integer:x:@.arguments/*/max
min:1
max:10
// Searching DuckDuckGo for matches.
add:x:+
get-nodes:x:@.arguments/*
signal:magic.http.duckduckgo-search
// Building our execution object that fetches all URLs simultaneously in parallel.
.exe
// Waiting for all scraping operations to return.
join
for-each:x:@signal/*/result/*
// Dynamically constructing our lambda object.
.cur
fork
.reference
try
unwrap:x:+/*
signal:magic.http.scrape-url
url:x:@.reference/*/url
semantics:bool:true
.catch
log.error:Could not scrape URL
url:x:@.reference/*/url
message:x:@.arguments/*/message
// Adding URL and title as reference to currently iterated [fork].
unwrap:x:+/*/*
add:x:@.cur/*/fork/*/.reference
.
url:x:@.dp/#/*/url
title:x:@.dp/#/*/title
// Adding current thread to above [join].
add:x:@.exe/*/join
get-nodes:x:@.cur/*
// Executing [.exe] retrieving all URLs in parallel.
eval:x:@.exe
/*
* Iterating through each above result,
* returning result to caller.
*
* Notice, we only iterate through invocations that have result, and
* did not timeout by verifying [signal] slot has children.
*/
for-each:x:@.exe/*/join/*/fork
// Verifying currently iterated node has result, containing both prompt and completion.
if
exists:x:@.dp/#/*/try/*/signal/*/*/prompt/./*/completion
.lambda
// Adding primary return lambda to [return] below.
unwrap:x:+/*/*/*
add:x:../*/return
.
.
url:x:@.dp/#/*/.reference/*/url
title:x:@.dp/#/*/.reference/*/title
snippets
// Adding [snippets] to return below.
add:x:../*/return/0/-/*/snippets
get-nodes:x:@.dp/#/*/try/*/signal/*
// Returning result of invocation to caller.
return
The basic idea is as follows (a rough TypeScript sketch of the same flow follows the list):
- Query DuckDuckGo and scrape the resulting top 5 URLs
- Create one async thread for each result, and retrieve the pages from their respective URLs, with a timeout of 10 seconds
- Wait for all threads to finish, and create an aggregated result
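For readers who don't want to parse Hyperlambda, here is a rough TypeScript sketch of the same flow, reusing the fetchWithTimeout helper from earlier. The searchDuckDuckGo and htmlToMarkdown functions are placeholders for steps Magic implements in its own slots; their names and signatures are my own assumptions.

```typescript
// Sketch of the overall flow: search, scrape in parallel with a timeout,
// and aggregate whatever came back into one result. Illustrative only.
interface ScrapedPage {
  url: string;
  markdown: string;
}

// Placeholders for steps not shown here - names and signatures are assumptions.
declare function searchDuckDuckGo(query: string, max: number): Promise<string[]>;
declare function htmlToMarkdown(html: string): string;

async function duckDuckGoAndScrape(query: string, max = 5): Promise<ScrapedPage[]> {
  const urls = await searchDuckDuckGo(query, max);

  // Fetch every URL at the same time; slow or blocking sites simply yield null.
  const pages = await Promise.all(
    urls.map(async (url) => {
      const html = await fetchWithTimeout(url, 10_000);
      return html ? { url, markdown: htmlToMarkdown(html) } : null;
    })
  );

  // Keep only the pages that actually returned something within the deadline.
  return pages.filter((page): page is ScrapedPage => page !== null);
}
```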
There is a lot more code related to this, but since Magic is Open Source, you can study its code for more details. For instance, we do a lot to create Markdown out of the resulting HTML. This significantly reduces the amount of data we're sending to ChatGPT, while also keeping hyperlinks, images, and lists in their semantic form. This is why our chatbot can display images, hyperlinks, and lists the way it does. This simple fact alone increases the quality of our chatbot by at least one order of magnitude.
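Magic implements its own HTML-to-Markdown transformation in Hyperlambda and C#. As a hedged illustration of the general technique (and one possible way to implement the htmlToMarkdown placeholder from the sketch above), something similar can be done in TypeScript with the turndown library. This is not the code Magic uses, just the idea.

```typescript
import TurndownService from 'turndown';

// Convert scraped HTML into Markdown so links, images, and lists keep their
// semantic form, while dropping most of the markup noise sent to ChatGPT.
const turndown = new TurndownService({ headingStyle: 'atx' });

function htmlToMarkdown(html: string): string {
  return turndown.turndown(html);
}
```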
We do NOT STEAL your information
One thing we do differently is that we try our best to always provide sources and references to our users, if we can fit them into the context. This implies the chatbot will typically end its explanation with something like "This information was fetched from the following URLs; abc, xyz".
This is first of all the polite thing to do, and secondly it allows our users to fact-check what our chatbots are telling them. The end result is that instead of "stealing traffic from your website", we'd probably GIVE your website additional traffic - Since users will probably want to fact-check their queries by reading the sources DuckDuckGo provides us with.
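How Magic injects those references into the prompt isn't shown in this article, but conceptually it amounts to appending the scraped URLs to the context handed to ChatGPT, so the model can cite them. A hypothetical sketch, reusing the ScrapedPage type from above:

```typescript
// Hypothetical sketch: build the context passed to ChatGPT from the scraped pages,
// keeping the source URLs so the model can reference them in its answer.
function buildContext(pages: ScrapedPage[]): string {
  const snippets = pages.map((page) => page.markdown).join('\n\n');
  const sources = pages.map((page) => page.url).join(', ');
  return `${snippets}\n\nThis information was fetched from the following URLs: ${sources}`;
}
```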
Conclusion
This is hard. I remember my former partner saying: "Why should I invest in something anybody can copy and steal?" Well, so far we're the only ones in the industry able to do what we're currently doing. We're basically "10 years ahead of the competition", and nobody is able to "copy us" - Even though I do my best to help them copy our ideas every single day, by exclusively innovating openly in the public space, and open source licensing 99% of every single line of code I write 😂
You were wrong, I was right, check. 7 billion 999 million 999 thousand and 999 more to go 😂
Top comments (6)
I am unable to access the uploaded .csv file for the chatbot process using an OpenAI API key in a Next.js project. Kindly suggest any libraries or logic to move ahead.
Not sure what you mean? Are you uploading CSV files to OpenAI?
I created "chatbot automation process" page in nextjs.In this page, I created the upload option to upload csv file. I try to implement chatbot for user and AI conversation using openai apikey based on current uploaded csv data only but I unable to get relevant response from AI for user query. The code is given below:
```jsx
import React, { useState } from 'react';
import Papa from 'papaparse';
import axios from 'axios';
const UploadPage = () => {
const [csvData, setCsvData] = useState([]);
const [chatHistory, setChatHistory] = useState([]);
const [currentQuery, setCurrentQuery] = useState('');
const handleFileUpload = (file) => {
if (file) {
const reader = new FileReader();
reader.onload = (e) => {
const text = e.target.result;
const parsedData = Papa.parse(text, { header: true }).data;
setCsvData(parsedData);
};
reader.readAsText(file);
} else {
setCsvData([]);
}
};
const findAnswerFromData = (query) => {
if (!query || csvData.length === 0) {
return null;
}
};
const fetchAIResponse = async (query) => {
const apiUrl = 'api.openai.com/v1/engines/davinci/...';
const headers = {
'Content-Type': 'application/json',
Authorization: 'Bearer YOUR_APIKEY',
};
};
const handleSendQuery = async (e) => {
e.preventDefault();
};
return (
  <input type="file" onChange={(e) => handleFileUpload(e.target.files[0])} />
);
};
export default UploadPage;
```
Did you add the last `}` to your code?
Hmm, I'm not the guy you should ask about ReactJS, or Liquid syntax for that matter. Maybe reach out to one of the devs on this particular project ...?