Swahili closed domain text retrieval chatbot

INTRODUCTION

In recent times, conversational chatbots have seen a significant increase in adoption due to the growing demand for accurate and sophisticated responses. While we won't delve into the theoretical concept of chatbots here, it's important to recognize that developing a chatbot requires careful consideration of its type. Chatbots can be categorized based on various criteria:

1. Scope of Data Source:

a) Open Domain Chatbot
This type draws on a data source covering a wide range of topics, which can be sourced locally or from the internet.

b) Closed Domain Chatbot
This type restricts itself to a data source covering a single topic or domain, again either local or online.

2. Means of Producing Responses:

a) Self-Generating Response Chatbot
This chatbot automatically generates responses based on the knowledge it has acquired from open or closed domain data sources.

b) Retrieved Response Chatbot
These chatbots provide responses by retrieving pre-existing information from open or closed domain data sources stored either locally or online.

TYPES OF CHATBOTS

Considering the above categories, we can identify five general types of chatbots:

  1. Open Domain Self-Generating Response Chatbot
  2. Open Domain Retrieved Response Chatbot
  3. Closed Domain Self-Generating Response Chatbot
  4. Closed Domain Retrieved Response Chatbot
  5. Hybrid Chatbot

In this guide, we will focus on developing a basic closed domain retrieved response Swahili chatbot. Its data source covers computer knowledge and is accessible both locally (as a text file) and online (via a web link). The implementation involves several libraries, each serving a distinct purpose in the chatbot's functionality.

IMPLEMENTATION

Developing a closed domain retrieved response Swahili chatbot involves orchestrating several processes, from data preprocessing to response generation. By leveraging libraries such as NLTK and scikit-learn, together with techniques like TF-IDF and cosine similarity, the chatbot can deliver relevant, contextually appropriate responses to user input. The steps below provide a solid foundation for anyone interested in building retrieval chatbots tailored to a specific domain. Let's dive into the implementation details.

Step 01: Installing libraries


pip install pandas
pip install nltk
pip install newspaper3k
pip install scikit-learn


Step 02: Importing Libraries and Modules

Several essential libraries are imported to support different aspects of the chatbot's development, including NumPy, Pandas, NLTK, scikit-learn, and Newspaper3k. Warnings are also suppressed to keep the console output clean.


# Standard library
import io
import random
import re
import string
import warnings

# Third-party libraries
import numpy as np
import pandas as pd
import nltk
from newspaper import Article
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')  # suppress library warnings for cleaner output


Step 03: Downloading necessary NLTK modules and data sources

To give the chatbot access to the linguistic resources and models it needs for tokenization, lemmatization, and multilingual WordNet lookups, the following NLTK modules and data sources must be downloaded. They contribute directly to the quality and accuracy of the chatbot's language processing.

nltk.download('popular', quiet=True)   # bundle of commonly used NLTK corpora and models
nltk.download('punkt', quiet=True)     # sentence and word tokenizer models
nltk.download('wordnet', quiet=True)   # WordNet lexical database, used by the lemmatizer
nltk.download('omw-1.4', quiet=True)   # Open Multilingual Wordnet data

Step 04: Fetching data from the data source

Depending on whether the source is local or online, the raw data is fetched accordingly. This raw data will be the foundation of the chatbot's responses.

# Read the local corpus and lowercase it for consistent matching
with open('tarakilishi.txt', 'r', errors='ignore') as file:
    raw = file.read().lower()

The chatbot's data can also come from an online source, such as a webpage. The data source used by this chatbot concerns computer knowledge: for local use, the data was copied into a text file, while for an online source, the link to the web page is passed to Newspaper3k, as shown below.


# Fetch an article from the web and extract its text
article = Article('https://simple.wikipedia.org/wiki/Light')
article.download()          # retrieve the page
article.parse()             # extract the main article body
article.nlp()               # optional: computes keywords/summary, not needed for .text
raw = article.text.lower()  # lowercase for consistency with the local source


You may choose whichever data source suits you, local or online. For convenience, both paths can be wrapped in a single helper, sketched below.
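The following is a minimal sketch of such a helper; the load_corpus name and its single source argument are illustrative additions, not part of the original code.

def load_corpus(source):
    # Return lowercased raw text from a local .txt path or an http(s) URL
    if source.startswith('http'):
        article = Article(source)
        article.download()
        article.parse()
        return article.text.lower()
    with open(source, 'r', errors='ignore') as f:
        return f.read().lower()

raw = load_corpus('tarakilishi.txt')  # or a URL such as the Wikipedia link above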

Step 05: Tokenizing data from the data source

The fetched data is tokenized into sentences and words using NLTK's tokenization functions. This is crucial for processing and analyzing the text effectively.


sent_tokens = nltk.sent_tokenize(raw)   # list of sentences: the retrieval candidates
word_tokens = nltk.word_tokenize(raw)   # list of individual words

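For intuition, here is what the two tokenizers produce on a made-up two-sentence Swahili string (illustrative input, not taken from the actual corpus):

sample = "kompyuta ni kifaa cha kielektroniki. hutumika kufanya kazi nyingi."
print(nltk.sent_tokenize(sample))
# ['kompyuta ni kifaa cha kielektroniki.', 'hutumika kufanya kazi nyingi.']
print(nltk.word_tokenize(sample)[:4])
# ['kompyuta', 'ni', 'kifaa', 'cha']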

Step 06: Lemmatization and Normalization

The data tokens undergo lemmatization, a process that reduces words to their base form. Additionally, the text is normalized by removing punctuation, making it more amenable to processing.


lemmer = nltk.stem.WordNetLemmatizer()

def LemmatizeTokens(tokens):
    # Reduce each token to its base (dictionary) form
    return [lemmer.lemmatize(token) for token in tokens]

def LemmatizationNormalize(text):
    # Lowercase the text, strip punctuation, tokenize, then lemmatize
    remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
    return LemmatizeTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

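One caveat worth knowing: WordNetLemmatizer is built for English, so Swahili tokens generally pass through unchanged, and lemmatization mainly benefits English loanwords in the corpus. A quick illustrative check:

print(LemmatizationNormalize("Computers are useful devices!"))
# ['computer', 'are', 'useful', 'device']
print(LemmatizationNormalize("kompyuta ni kifaa"))
# ['kompyuta', 'ni', 'kifaa'] - no English lemma found, tokens unchanged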

Step 07: User Input and Bot Responses Processing

The chatbot is programmed to recognize user greetings and provide appropriate responses. This adds a touch of personalization to the interactions.


# The tuple is truncated here; extend it with more greeting variants as needed
USER_GREETING_INPUTS = ("habari", "habari za sahizi", ...)
BOT_GREETING_RESPONSES = ["nzuri", "poa", "salama", "kheri"]

def greeting(sentence):
    # Return a random greeting if any word in the sentence is a known greeting, else None
    for word in sentence.split():
        if word.lower() in USER_GREETING_INPUTS:
            return random.choice(BOT_GREETING_RESPONSES)

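Since greeting() implicitly returns None for anything that is not a greeting, the main loop can use it as a simple gate. Illustrative calls (the exact reply is a random pick):

print(greeting("habari yako"))       # e.g. 'poa'
print(greeting("kompyuta ni nini"))  # None, so the query falls through to retrieval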

Step 08: Stop Words Processing

Stop words, common words that carry little standalone meaning, are loaded for Swahili and later excluded from the TF-IDF vectors. This improves the quality of the responses generated.


# Load Swahili stop words from a CSV file with a 'StopWords' column
data_file = pd.read_csv('Common Swahili Stop-words.csv')
swahili_stop_words = list(data_file['StopWords'])

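If you don't have the CSV at hand, a small hand-picked list of common Swahili function words can serve as a stand-in; the sample below is illustrative, not the full list used by the original project:

swahili_stop_words = ["na", "ya", "wa", "za", "la", "kwa", "ni", "katika", "cha"]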

Step 09: Bot Response Generation Processing

The heart of the chatbot's functionality lies in its response generation mechanism. The chatbot leverages TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity to determine the most relevant response based on the user input.


def response(user_response):
    bot_response = ''
    # Temporarily add the user's query so it is vectorized together with the corpus
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemmatizationNormalize, stop_words=swahili_stop_words)
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Cosine similarity between the query (last row) and every sentence
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The highest score is the query matching itself, so take the second-highest
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # No lexical overlap with the corpus at all
        bot_response = bot_response + "Samahani sijakuelewa, unaweza rudia tena!!"
        return bot_response
    else:
        bot_response = bot_response + sent_tokens[idx]
        return bot_response

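To see why the second-highest similarity is the right pick, here is the same pattern on a toy corpus (hypothetical sentences, with a plain TfidfVectorizer for brevity):

docs = ["kompyuta ni kifaa cha kielektroniki",
        "panya hutumika kuingiza taarifa",
        "kompyuta hutumia umeme"]
query = "kompyuta ni nini"
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs + [query])
sims = cosine_similarity(tfidf[-1], tfidf)  # last row is the query
best = sims.argsort()[0][-2]                # [-1] is the query itself (score 1.0)
print(docs[best])                           # 'kompyuta ni kifaa cha kielektroniki'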

Step 10: Maintaining The Conversation Flow

The chatbot maintains a continuous conversation loop, allowing users to interact seamlessly. User inputs are processed, and appropriate responses are generated and displayed.


flag = True
print("BOT: Habari, tuzungumze kuhusu tarakilishi/kompyuta, iwapo huhitaji kuendelea na mazungumzo, sema inatosha")
while flag:
    user_response = input("YOU: ").lower()
    if user_response != 'inatosha':
        if user_response == 'asante' or user_response == 'asante pia':
            flag = False
            print("BOT: usijali")
        else:
            if greeting(user_response) is not None:
                print("BOT: " + greeting(user_response))
            else:
                print("BOT: ", end="")
                print(response(user_response))
                # Undo the temporary append done inside response()
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("BOT: asante, karibu tena!!")

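A typical session then looks something like this (illustrative transcript; the retrieved sentence depends on your corpus):

BOT: Habari, tuzungumze kuhusu tarakilishi/kompyuta, iwapo huhitaji kuendelea na mazungumzo, sema inatosha
YOU: habari
BOT: poa
YOU: inatosha
BOT: asante, karibu tena!!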

CONCLUSION

This breakdown has walked through the purpose and functionality of each section of the code. The result is a basic closed domain text retrieval chatbot that responds in Swahili to user input related to computer knowledge.

That's all! The source code can be found here.

Happy Coding!!

Do you have a project 🚀 that you want me to assist you with? 🤝😊 wilbertmisingo@gmail.com
Have a question or want to be the first to know about my posts?
Follow ✅ me on Twitter/X 𝕏
Follow ✅ me on LinkedIn 💼
