I started this blog series with the intention of documenting all my steps as I built a functional chatbot application. However, at the time of writing this article, I've already connected a working chatbot model served on a Flask backend to a React frontend. It ended up only taking a couple of days.
My familiarity with React and the client-server model made wiring the chatbot model and the frontend together fairly easy. The logic behind the chatbot itself, however, I never bothered to understand. This is the biggest pitfall of following tutorials - you get caught up in seeing a finished product as soon as possible. Having GitHub Copilot autocomplete most of the code did not help matters either.
In the days since, I've had time to reflect on the whole process. Although this article was initially meant to cover all the steps involved in building the chatbot model, doing so would have defeated the purpose of going through all this. It is very important to separate the theory from the implementation. So we're starting off with the theory. I promise it'll be anything but boring!
The basics of NLP
Natural Language Processing - NLP for short - is at the core of an AI-powered chatbot. A functioning chatbot application relies on the model's ability to comprehend human language and its intricacies.
Now, computers think and talk in numbers, you see. So how do you translate between human and machine language? This is where we get into some of the core concepts behind NLP (basically all the new material I've come to discover while trying to break down how the chatbot functions).
Stemming vs Lemmatization
Words are complicated things. Their meaning can change based on context. Take a word away from a sentence and nothing could change. Or everything could. As humans, we've become attuned to distinguishing between all the possible variations without a second thought.
But how can you teach a machine all these things? The short answer is - you don't. Instead, you teach it to act based on patterns. You help it understand how to talk to you. To respond to you based on the nature of your input. After all, that is the singular purpose for which a chatbot is built, is it not?
Now if that is the case, we're free to abstract away all those fine details that make language as intricate as it is. Can't we reduce "generous" and "generosity" to, say, "gener", if all we're deriving from the word is the context it adds to a sentence? Provided we also deal with the possible confusion with words like "general", we indeed could. This is at the heart of "stemming" - you chop a word down to a root form, called its "stem".
Lemmatization does basically the same thing. However, there is a key difference. Rather than explain it myself, I'd like to point you to this answer on Stack Overflow:
That last point is very important. Based on your use case, you're free to go with one or the other. I'll go over the implementation when we get there.
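To make the difference concrete, here's a minimal sketch in plain Python - no NLP library involved. The suffix rules and the tiny lemma dictionary are made up purely for illustration; real libraries like NLTK ship far more sophisticated versions of both:

```python
# Naive stemmer: chops known suffixes off a word. The output need not be
# a real word - only a shared "root" that related words collapse to.
SUFFIXES = ("osity", "ous", "ing", "ed", "s")

def stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Naive lemmatizer: looks the word up in a dictionary of known forms,
# so the output is always a real dictionary word - the key difference.
LEMMAS = {"better": "good", "went": "go", "mice": "mouse"}

def lemmatize(word: str) -> str:
    word = word.lower()
    return LEMMAS.get(word, word)

print(stem("generous"))    # "gener" - not a real word
print(stem("generosity"))  # "gener" - same stem, so the two words match
print(lemmatize("went"))   # "go" - an actual dictionary word
```

Notice that the stemmer happily produces non-words, while the lemmatizer can only map forms it has seen before - which is exactly the speed-versus-accuracy trade-off the answer above describes.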
Bag of Words
So say we've done our lemmatization (or stemming) using our sample data. Our model would now have an idea of what to look for. A vocabulary. Do note that all it knows is what you've trained it on!
But a chatbot will need to understand complete sentences. These sentences will likely contain multiple words that are not part of its current vocabulary.
How do we deal with this? Again, remember that the goal is to get the model to assign context to sentences in some way or another - nothing more and nothing less. We need a way to do this based on the limited vocabulary it has. This is where the concept of "Bag of Words" (BoW) kicks in.
As much as I'd prefer to leave the implementation details to a future post, it'd be rather hard to dive into the concept of BoW without doing so.
A chatbot application - like the majority of NLP algorithms - performs what is known as classification. And that means exactly what it sounds like. You group an input into one of several possible classes (in our case, "contexts") - classes that are prepared in advance. For example, we can build a chatbot that understands solely the contexts "greeting" and "farewell". Note that these contexts are purely arbitrary. Knowing this detail, we can move on to how a Bag of Words becomes intuitive in the scheme of things.
So you have two things - your vocabulary and your classes, i.e. contexts. As we're dealing with supervised learning in the case of NLP (we train the bot explicitly on how to understand its world), we also have the data we train it with (a set of sentences, each labelled with its context). As an example, say our vocabulary contains the lemmas "hi", "bye", "fly" and "bear". Say our classes are "greeting" and "farewell".
Given our labelled data, we take each sentence, count the occurrences of the vocabulary words that appear in it (i.e. we weight our parameters, the words), and pair those counts with the label defined in the training data. Each sentence essentially boils down to a "bag of words".
For example, say we have the sentence "Hi! How are you this morning?" in our training data, with a label of "greeting". We notice that it contains one word from our vocabulary - "hi". So the sentence gets the parameter "hi" with a frequency of 1, the class "greeting" set to 1 i.e. "true", and the class "farewell" set to 0 i.e. "false". Essentially, we're reducing sentences to numbers while trying to retain a sense of context.
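The walkthrough above can be sketched in a few lines of Python, using the toy vocabulary and classes from this section:

```python
import re

VOCAB = ["hi", "bye", "fly", "bear"]   # the lemmas our model knows
CLASSES = ["greeting", "farewell"]     # the contexts it can assign

def bag_of_words(sentence: str) -> list[int]:
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tokens.count(word) for word in VOCAB]

def encode_label(label: str) -> list[int]:
    """One-hot encode the context: 1 for the matching class, 0 elsewhere."""
    return [1 if cls == label else 0 for cls in CLASSES]

features = bag_of_words("Hi! How are you this morning?")
target = encode_label("greeting")
print(features)  # [1, 0, 0, 0] - only "hi" appears, once
print(target)    # [1, 0]       - "greeting" true, "farewell" false
```

Every word outside the vocabulary ("how", "morning", ...) simply vanishes - which is exactly why the model only knows what you've trained it on.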
Confused? You might want to read all that again. Or take a look at this super helpful video:
What is a corpus?
Admittedly, knowing what a corpus is isn't critical to understanding how NLP or a chatbot works. However, it's one of those terms that tends to come up a lot in projects like this (and one of those fancy words you can casually throw around to hint to someone that you REALLY know your stuff - regardless of whether you actually do or not!).
Essentially, a corpus is your labelled text (or audio) data - the data you ultimately use to train your model. The term can also extend to the bodies of text used to lemmatize your words and sentences. Basically, any labelled body of text (or audio).
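For a chatbot like ours, the corpus is often stored as a file of "intents" - sentences grouped under the context they belong to. The tags and patterns below are invented for illustration, not taken from any particular dataset:

```python
# A tiny, hypothetical corpus: labelled sentences grouped by context ("tag").
corpus = {
    "intents": [
        {"tag": "greeting", "patterns": ["Hi!", "Hello there", "Good morning"]},
        {"tag": "farewell", "patterns": ["Bye!", "See you later"]},
    ]
}

# Flatten it into (sentence, label) pairs - the form a training loop consumes.
labelled = [
    (pattern, intent["tag"])
    for intent in corpus["intents"]
    for pattern in intent["patterns"]
]
print(labelled[0])  # ('Hi!', 'greeting')
print(len(labelled))  # 5 labelled sentences in total
```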
So I've dissected, as best I understand it, all the theory essential to understanding how an NLP-based chatbot works under the hood. In the following post, we can (finally) dive into the implementation and coding specifics. As always, feel free to let me know your thoughts on the article. And since a majority of this was written based on my own understanding of matters, please do correct me if I'm off anywhere!