Thomas Hansen

Originally published at ainiro.io

Disrupting the AI Scene with Open Source and Open Innovation

When I discovered OpenAI on the 23rd of December 2022, I became obsessed. I hadn't had this much fun coding since I started 40 years ago. After playing with "fine-tuning" for a month, and miserably failing, I found a YouTube video by Dave Shapiro that explained how to create a Q&A chatbot using OpenAI's "embeddings" API.

Dave has since removed the video, but it explained how to use OpenAI's embeddings API and combine it with their chat/completion API to create a Q&A chatbot that knows "everything" about the problem domain. Tage's reaction best sums up my findings, as he told me one day:

OMG dad, this time you've really done it. I woke up shaking in the middle of the night because of enthusiasm, and I couldn't even sleep, so I had to take a walk for 5 kilometres outside in the middle of the night just to calm down

How a Q&A chatbot works

To understand how a Q&A chatbot based upon ChatGPT works, you can go to ChatGPT, find any article, and copy and paste it into the prompt as follows:

Answer the following QUESTION given the specified CONTEXT;

QUESTION; What is the meaning of life?

CONTEXT; [ ... content of some article explaining the meaning of life ... ]

What ChatGPT will do is answer any question you might have, while using the article's content as its "single source of truth". What we do, and everybody else that delivers ChatGPT chatbots, is create a database of "context data", typically by uploading documents and/or scraping websites. When the user asks a question, we use OpenAI's embeddings API to create a "vector" of the question.
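To make that step concrete, here's a minimal sketch in Python, assuming the pre-1.0 openai package; the model name and the embed helper are my own illustration, not code from our platform.

import openai

openai.api_key = "sk-..."  # your own API key

def embed(text):
    # Turn a question (or a context snippet) into an embedding vector.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

question_vector = embed("What is the meaning of life?")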

This vector is then used to perform a similarity search through our database, calculating "dot products", which become the "distance" between the question and the snippets in our context database. We then order the results by this distance, take the first 4 to 5 context snippets, and send them to OpenAI as the question's "context". Before you ask, yes ...

The whole process is simply "automated prompt engineering" ...

OpenAI's embeddings API, again, is incredibly good at finding similarities between questions and context data, allowing it to find the relevant data for any question you've got related context for in your database.
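Putting the pieces together, the retrieval and "automated prompt engineering" described above could look roughly like the following sketch, assuming the snippets and their embedding vectors are already loaded into memory; names such as dot, answer, snippets and top_k are mine for illustration, and this is not our actual implementation.

import openai

def dot(a, b):
    # OpenAI's embeddings are normalised, so the dot product doubles as a similarity score.
    return sum(x * y for x, y in zip(a, b))

def answer(question, snippets, top_k=5):
    # snippets is a list of (text, vector) pairs, embedded up front with the same model.
    question_vector = embed(question)  # embed() from the previous sketch

    # Score every snippet against the question ...
    scored = [(dot(question_vector, vector), text) for text, vector in snippets]

    # ... order by similarity and keep the best 4 to 5 snippets as context.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    context = "\n\n".join(text for _, text in scored[:top_k])

    prompt = (
        "Answer the following QUESTION given the specified CONTEXT;\n\n"
        f"QUESTION; {question}\n\n"
        f"CONTEXT; {context}"
    )

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

Notice that the scoring loop has to visit every single snippet for every question - which is exactly the "table scan" problem discussed next.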

The problem with AI-based semantic search

The above "dot product" is the problem. To understand why, realise you have to perform a "table scan" through your entire database, extract the embedding vector for each record, and calculate the dot product for each result originating from this process. This is a CPU intensive job, and for a context database with with 2,500 records, this would take 30 to 50 seconds in our Kubernetes cluster for our systems. This is the reason why we've not been able to deliver chatbots with more than 2,500 "snippets" previously.

However, today we fixed this problem, and during the weekend, we'll hopefully be able to deploy a solution that at least in theory allows for 10,000+ snippets, possibly even more, while returning "context data" in 0.02 seconds, instead of 5 minutes.

Ever since I realised the above process was sub-optimal, I've been periodically doing a Google search for "sqlite vector plugin". To understand the importance of this, realise that since OpenAI went viral, at least half a dozen startups have been created with the intention of building a "vector-based database". I know of at least one such database that got 30 million dollars in VC funding earlier this year. To explain why, realise the truth of the following statement ...

Whoever solves the vector database problem is destined to control the AI space, and hence the world

What's at stake

I personally believe that AI, the way it has progressed over the last year, is probably the most important thing to have occurred on Earth in the last 5 million years, seriously! When people compare AI to the internet, heavier-than-air flight, or the computer for that matter, I tend to laugh and reply with ...

AI is the most significant event to have happened since we crawled down from the trees. For 5 million years we've been the smartest species on Earth; that era ends in 2023!

So basically, if somebody can "control" innovation in this space, they own the future of mankind. The amount of power a company having such control could wield would inevitably make all previous power constructs look like "a child's playground" in comparison. We cannot let this happen, simply because if somebody "controls" the AI space, that somebody would be able to wield "God-like powers" over the rest of us!

Giving "the people" control is CRUCIAL for the above reasons!

The solution

Even Google has publicly admitted they can't keep up with open source AI innovation. This, of course, is because of open source projects such as Hugging Face. However, there was always one tiny little piece missing: "vector based database systems". A superior vector based database can easily index millions, and even billions, of database records, allowing even 14-year-old kids "to build their own Google".

As I searched for "sqlite vector plugin", I didn't find any results until a couple of weeks ago, when I found Alex' SQLite VSS plugin. The library was an amazing piece of engineering from an "idea perspective". However, as I started playing around with it, I realised it was in effect like the "Titanic": beautiful and amazing, but destined to take in water and sink to the bottom of the ocean because of what we software engineers refer to as "memory leaks".

I spent a lot of time fixing the library, to the point where it could be argued "I melted down the Titanic, cast a new boat out of its original material, ending up with a 'Battle Ship Cruiser' with perfect memory management". To put that into perspective, below you can find my pull request for Alex' amazing library.

It's a monster of a pull request, and it fixes roughly 10 to 20 memory leaks. I ran the entire code through ChatGPT before submitting my PR to Alex, and even ChatGPT couldn't find any memory leaks in it, and claimed every single function, class, and construct was "perfect according to how to correctly create an SQLite database plugin". Before my PR the thing would consume 1GB of memory in a test deployment in our Kubernetes cluster. After it, it consumed half of that, and didn't grow towards infinity and beyond. The leaks made the library useless for all practical concerns. Every single leak is now fixed - effectively making SQLite a better vector database than all the vector database systems out there 😅
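To give an idea of what querying the plugin looks like, here's a rough sketch of a vss0 virtual table used from Python; the table and column names, the extension paths, and the embedding dimension are assumptions for illustration, and the exact setup may differ from what we end up shipping.

import json
import sqlite3

connection = sqlite3.connect("context.db")
connection.enable_load_extension(True)
connection.load_extension("./vector0")  # sqlite-vss depends on the vector0 extension
connection.load_extension("./vss0")

# One-time setup: mirror the snippet embeddings into a Faiss-backed virtual table.
connection.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS vss_snippets USING vss0(embedding(1536))"
)
connection.execute(
    "INSERT INTO vss_snippets(rowid, embedding) SELECT id, embedding FROM snippets"
)
connection.commit()

# Per question: a nearest-neighbour search instead of a full table scan.
question_vector = embed("What is the meaning of life?")  # embed() from the earlier sketch
rows = connection.execute(
    "SELECT rowid, distance FROM vss_snippets "
    "WHERE vss_search(embedding, ?) LIMIT 5",
    (json.dumps(question_vector),),
).fetchall()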

Once we start using the above plugin in our Kubernetes clusters, we can basically increase "model size" from 2,000 context snippets to (probably) 100,000+ context snippets for extreme cases. This allows us to scrape websites with 10,000+ pages, and create Q&A chatbots out of them. Previously our maximum was roughly 500 web pages.

Further down the road, we might even be able to continue modifying the library to the point where we can in theory index billions of pages using this technology. This would effectively allow us "to build Google 2.0", and create chatbots with knowledge "the size of the Himalayas".

[Image: Himalaya and AI]

The future is YOURS!

Magic, our platform, is 100% open source. Anything else would be unfair considering what's at stake. Within a week, we will deploy these changes into our technology for all to use, allowing you to query databases with 10,000+ records in some 0.02 seconds to extract context data.

This allows us to create chatbots for things such as Couchbase's documentation, Microsoft's website, DEV.to for that matter, etc, etc, etc - and to deliver technology allowing you to outperform Google and Microsoft on search. Combined with the innovation that's happening in the GPT space, with initiatives such as Hugging Face, the inevitable result becomes as follows ...

The future is YOURS! Thousands of other open source software developers and I will make sure of it! 😁

Credits

  • Alex for having built an amazing semantic search plugin for SQLite! BRAVO Alex!
  • Dave Shapiro for constantly giving me good ideas for how to approach the AI space
  • Facebook research for having open source licensed an amazing vector based indexing library
  • Me for pulling all the strings together, ending up with a usable product, by "melting down the Titanic and creating a Battle Ship Cruiser" out of Alex' original work

Psst, support our work by creating your own ChatGPT chatbot, playing with it for a week, and then purchasing a commercial license. As long as you guys keep buying, I'll make sure the people get the power 😇

Psst, here's a completely unrelated YouTube video to find some inspiration ...

Top comments (2)

AriaNygard

This is seriously impressive. Great work like always!

Thomas Hansen

Thank you :)