DEV Community

Sakib Mahmud

Building Nirmol: A Bangla Offensive Language Detection API and Dataset

Nirmol is an open-source API, available on GitHub, that detects offensive words, profanity, and slang in Bangla (Bengali) and Banglish text. Before Nirmol was created, there was no good API or other solution that I could find for detecting swearing or bad words in the Bengali language.

Motivation

Although such APIs are available for English and other popular languages, I could not find any comparable solution for Bengali, even after a lot of searching. I needed one mainly for a platform I am building with other developers.

On the platform we are building, a Bengali-speaking person can fill in their data, and once it is published it can be seen by anyone on the internet. Mischievous users can post profanity, which can cause considerable damage.

My job was to design the whole system properly and to find microservice-based solutions for smaller problems like this one. In the end, I was able to build a solution myself.

Design and Development

There are several ways to approach this problem: we could filter Bengali words using artificial intelligence, or we could keep a collection of negative or bad words and filter text against it.

Our platform is already very complex and, at this early stage, we do not have much computing power, so we could not agree on using artificial intelligence. Besides, none of the existing Bengali AI models I looked at can properly detect Bangla bad words or swearing.

I created a form, shared it with my friends and acquaintances through social media, and collected a list of commonly used Bengali bad words from them. I also combined several previously published datasets into a dataset of my own. Building this dataset was not easy.

Then I generated a JSON file from the dataset and wrote an Express.js app that receives a word or sentence and checks whether its words appear in the JSON file.
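The lookup idea can be sketched roughly as follows. This is a minimal illustration, not the code from the Nirmol repository: the inline word list stands in for the JSON file, and the names `badWords`, `splitWords`, and `findBadWords` are hypothetical.

```javascript
// Hypothetical stand-in for the word list loaded from the JSON file.
const badWords = new Set(["কুত্তা"]);

// Split a sentence on whitespace and drop empty fragments.
function splitWords(sentence) {
  return sentence.trim().split(/\s+/).filter(Boolean);
}

// Return every word of the sentence that appears in the word list.
function findBadWords(sentence, wordSet = badWords) {
  return splitWords(sentence).filter((word) => wordSet.has(word));
}
```

A `Set` keeps each lookup fast even when the list grows large. Note that this sketch matches exact words only, so punctuation attached to a word or an alternative spelling would slip through.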

Installation

You can download the dataset from the GitHub repository or through the direct dataset link. You can also use this dataset for ML and AI model training.

Nirmol API is based on:

  1. Node.js
  2. Express.js

npm packages used:

  1. body-parser
  2. cors
  3. fs
  4. nodemon

Run Nirmol locally

Step 1: Clone the Nirmol repository

git clone https://github.com/Sigmakib2/Nirmol.git

Step 2: Go to the Nirmol directory

cd Nirmol

Step 3: Install node modules

npm install

Step 4: Start the project

npm start

Then open your web browser and navigate to http://localhost:3000. You should see "Cannot GET /" on the page, which is Express's default response when no route matches the path. To test the API, enter a word or sentence after the '/', for example "http://localhost:3000/hello world".
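The same GET request can be made from code. The sketch below assumes Node.js 18 or later (for the built-in `fetch`) and assumes the server accepts a percent-encoded sentence in the path, which is what a browser sends when you type a space. The helper names `buildNirmolUrl` and `checkSentence` are hypothetical.

```javascript
// Local address from the steps above.
const BASE_URL = "http://localhost:3000";

// Build the request URL, percent-encoding spaces and non-ASCII characters.
function buildNirmolUrl(sentence, baseUrl = BASE_URL) {
  return `${baseUrl}/${encodeURIComponent(sentence)}`;
}

// Send the GET request and parse the JSON response.
async function checkSentence(sentence) {
  const response = await fetch(buildNirmolUrl(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (requires the local server to be running):
// checkSentence("hello world").then(console.log);
```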

API Response

The API endpoint analyzes a sentence for offensive/slang words and provides additional information about the sentence.

For example, here is the response to a GET request:


{
  "bad_sentence": true,
  "bad_word_list": [
    "কুত্তা"
  ],
  "normal_words": [
    "একটি",
    "গালি",
    "বা",
    "খারাপ",
    "শব্দ"
  ],
  "badness": "16.67%"
}

You can also use the POST method to get a response. This feature was added by Tasnim Anas.

For a POST request, the endpoint is "http://localhost:3000/" and you have to send the payload in the request body like this:

{
  "sentence": "Your sentence here..."
}
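A POST request with this payload might look like the sketch below. It assumes Node.js 18 or later for the built-in `fetch`, and it assumes the server expects a JSON body with a `Content-Type: application/json` header. The helper names `buildPostOptions` and `checkSentenceViaPost` are hypothetical.

```javascript
// Build the fetch options for a POST request carrying the sentence as JSON.
function buildPostOptions(sentence) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sentence }),
  };
}

// Send the POST request to the local endpoint and parse the JSON response.
async function checkSentenceViaPost(sentence, endpoint = "http://localhost:3000/") {
  const response = await fetch(endpoint, buildPostOptions(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (requires the local server to be running):
// checkSentenceViaPost("Your sentence here...").then(console.log);
```

Sending the sentence in the body avoids URL-encoding issues and length limits that can affect long sentences in a GET path.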

Here's what the response means:

  1. bad_sentence: A boolean (true or false) indicating whether the sentence contains any offensive/bad/slang words.
  2. bad_word_list: Lists the offensive/bad/slang words found in the sentence.
  3. normal_words: Lists the words in the sentence that are not considered offensive/bad/slang.
  4. badness: The proportion of offensive/bad/slang words in the sentence, as a percentage. In the example above, 1 of 6 words is flagged, which gives 16.67%.
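The badness value in the example response is consistent with dividing the number of bad words by the total number of words. The sketch below reproduces that arithmetic. The function name `badnessPercent` and the handling of an empty sentence are assumptions, not taken from the Nirmol source.

```javascript
// Compute the share of bad words as a percentage string with two decimals.
function badnessPercent(badCount, normalCount) {
  const total = badCount + normalCount;
  // Assumption: an empty sentence is reported as 0.00%.
  if (total === 0) return "0.00%";
  return `${((badCount / total) * 100).toFixed(2)}%`;
}

// One bad word and five normal words, as in the example response.
console.log(badnessPercent(1, 5)); // prints "16.67%"
```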

Use Cases

Here are some use cases for this API:

  1. Content moderation: Bangla websites often host user-generated content such as comments, forum posts, or user profiles. This API can be integrated into these platforms to automatically detect and filter out inappropriate language, thus maintaining a clean and respectful environment for users.
  2. Social media platforms: Social media platforms that support Bangla language content can use this API to automatically flag or filter out offensive or inappropriate content in user posts, comments, and messages, helping to maintain a positive and safe community for users.
  3. E-commerce platforms: E-commerce websites serving the Bangla-speaking community can utilize this API to ensure that product reviews and comments remain free from offensive language, ensuring a positive shopping experience for customers.
  4. Educational platforms: Educational websites and software applications targeting Bangla-speaking users can use this API to monitor and filter user-generated content in discussion forums, chatrooms, or collaborative projects, promoting a respectful and constructive learning environment.
  5. Parental control software: Parental control software can leverage this API to monitor and filter out inappropriate content in Bangla language websites and applications, helping parents protect their children from exposure to harmful or offensive material online.
  6. Chat applications: Bangla language chat applications can integrate this API to automatically detect and filter out offensive language in user messages, helping to maintain a friendly and respectful communication environment among users.
  7. Customer support platforms: Customer support platforms serving Bangla-speaking customers can use this API to monitor and filter out abusive or inappropriate language in customer inquiries and support tickets, ensuring a professional and respectful interaction with users.
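For any of these integrations, the calling platform still has to decide what to do with a response. The sketch below shows one possible policy that rejects text once its badness exceeds a threshold. The function name `shouldBlock` and the threshold are hypothetical choices for illustration. Only the `bad_sentence` and `badness` fields come from the response format described earlier.

```javascript
// Decide whether to reject a piece of text based on a Nirmol-style response.
// maxBadnessPercent is a hypothetical policy knob; 0 blocks any flagged text.
function shouldBlock(result, maxBadnessPercent = 0) {
  if (!result.bad_sentence) return false;
  // parseFloat reads the leading number and ignores the trailing "%".
  const badness = parseFloat(result.badness);
  // If the percentage cannot be parsed, err on the side of blocking.
  return Number.isNaN(badness) || badness > maxBadnessPercent;
}
```

A strict platform might keep the default of 0, while a chat application might tolerate a higher value or mask the words in `bad_word_list` instead of rejecting the whole message.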
