Victor Isaac Oshimua

Introduction To Natural Language Processing With Python: Language Detection As A Use Case

Have you ever wondered about the mechanism behind language translation tools and how they translate so many languages with such precision? Or how grammar correction tools understand your language and spot errors in your writing?

Personally, I have been puzzled by how all of this works. The magic behind this incredible feat lies in a branch of artificial intelligence called Natural Language Processing (NLP).
NLP is applied in many areas of our lives, wherever language is involved. In this article, you will gain an understanding of NLP and how it can be applied to language detection.

Prerequisites

To get the most out of this article, you should:

  • Be familiar with machine learning concepts
  • Have a basic knowledge of Python programming
  • Have a basic understanding of data science and machine learning frameworks, such as pandas, NumPy, and Scikit-Learn

What is NLP?

Before we delve into the essence of NLP, understand that natural language encompasses the methods humans use to communicate with one another, including both written and spoken forms.

Natural Language Processing is a subfield of artificial intelligence that involves programming computers to comprehend, process, and analyse human natural language data. Unlike other artificial intelligence problems that involve tabular and image data, NLP typically deals with training data in the form of text and audio. In most cases, the goal is to classify and analyse strings of text.

Techniques Used in NLP

Natural Language Processing encompasses a wide range of techniques and methods to process and analyse human language data. Here are some of the key techniques and approaches used in NLP:

Tokenization

Tokenization is the process of breaking down a text or a sequence of characters into smaller units called tokens. These tokens are typically words, subwords, or even individual characters, depending on the specific use case. Tokenization is a fundamental step in NLP and underpins feature extraction: text data needs to be transformed into a numerical format that machine learning models can process, and tokens serve as the basic features for those models. Here is a simple way to perform tokenization with a whitespace split:

def tokenize_text(text):
    """Tokenize a text into words using whitespace as a delimiter."""
    tokens = text.split()
    return tokens

text = "This is an example sentence for tokenization."
tokens = tokenize_text(text)
print(tokens)

Result:

['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization.']
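
Note that a plain whitespace split keeps punctuation attached to words: the last token above is 'tokenization.' with its trailing period. Library tokenizers handle this for you. As a minimal sketch, NLTK's word_tokenize (which we also use later in this article) splits punctuation into separate tokens:

import nltk
from nltk import word_tokenize

nltk.download('punkt')  # Punkt models required by word_tokenize

print(word_tokenize("This is an example sentence for tokenization."))
# The period becomes its own token: ['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']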

Bag Of Words

The Bag of Words (BoW) is a common technique used in NLP. It represents text data as a collection of individual words or tokens, disregarding their order or structure within the text. BoW is a fundamental method for feature extraction from text data, making it suitable for a wide range of NLP tasks, including text classification, sentiment analysis, and document retrieval.

To implement BoW, you can use scikit-learn's CountVectorizer, a powerful tool for converting a collection of text documents into a matrix of word counts.
Here's an example of how to use CountVectorizer to implement BoW in Python:

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the BoW representation
X = vectorizer.fit_transform(corpus)

# Get the vocabulary (unique words)
vocab = vectorizer.get_feature_names_out()

# Convert the BoW representation to a dense array for readability
dense_array = X.toarray()

# Display the BoW representation
print("BoW representation:")
print(dense_array)

# Display the vocabulary
print("Vocabulary:")
print(vocab)

Result:

BoW representation:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
Vocabulary:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

Stemming

Stemming is an important NLP technique used to reduce words to their base or root form. For instance, words like 'connection,' 'connectivity,' and 'connected' can be stemmed to their root form, which is 'connect.'

This technique plays a vital role in NLP as it helps eliminate redundancy in words, making NLP models more efficient and robust.

A typical application of stemming can be observed in search engines. When a user searches for words like 'love,' 'lovely,' or 'lover,' stemming ensures that the search engine returns similar results.
Here is how to perform stemming using the Snowball Stemmer in Python:

from nltk.stem import SnowballStemmer

# Create a Snowball Stemmer for the English language
stemmer = SnowballStemmer('english')

# Example words to be stemmed
words = ['connection', 'connectivity', 'connected']

# Stem the words
stemmed_words = [stemmer.stem(word) for word in words]

# Print the stemmed words
print(stemmed_words)

Result:

['connect', 'connect', 'connect']

Stop Word Removal

Stop word removal is a data preprocessing technique in NLP used to eliminate the most common words in text data. These words typically include pronouns, prepositions, and conjunctions, which often carry little meaningful information for NLP models.

Not every NLP task requires stop word removal, but text classification tasks such as spam detection often benefit from it as a preprocessing step.
Here's how to remove stop words using NLTK (Natural Language Toolkit) in Python:

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

# Sample text
text = "This is an example sentence with some stop words."

# Tokenize the text
words = nltk.word_tokenize(text)

# Remove stop words
filtered_words = [word for word in words if word.lower() not in stopwords.words('english')]

# Join the filtered words back into a sentence
filtered_text = ' '.join(filtered_words)

# Print the filtered text
print(filtered_text)

Result:

example sentence stop words .

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a crucial NLP task that involves identifying and categorising named entities, such as names of persons, organisations, locations, dates, and more, within text data. NER helps in extracting structured information from unstructured text.
Here's how to perform NER using NLTK (Natural Language Toolkit):

# Download necessary NLTK resources
import nltk
nltk.download('punkt')  # Tokenizer
nltk.download('maxent_ne_chunker')  # Named Entity Chunker
nltk.download('words')  # Word corpus
nltk.download('averaged_perceptron_tagger')  # POS Tagger

from nltk import word_tokenize, pos_tag, ne_chunk

# Input text
text = "Apple Inc. is a leading tech company based in Cupertino, California."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform part-of-speech tagging on the tokens
tagged = pos_tag(tokens)

# Perform Named Entity Recognition (NER) using the ne_chunk function
named_entities = ne_chunk(tagged)

# Print named entities
for entity in named_entities:
    if isinstance(entity, tuple):
        # If it's a tuple, print the word and its POS tag
        print(entity[0], entity[1])
    else:
        # If it's a named entity, print the words and the entity label
        print(" ".join([word for word, tag in entity]), entity.label())

Result:

(screenshot: each token with its POS tag, plus the detected named entities such as Apple Inc., Cupertino, and California with their entity labels)

Language detection with NLP

Language detection is an NLP task that involves identifying the language of a given text. It is applied in spelling and grammar correction applications like Grammarly, as well as in next-word predictions on keyboards.

In this project, we will utilise various NLP techniques to construct a language detection model.
The objective is to create a machine learning classifier capable of identifying the language of a text from its features.

The data used in this project was obtained from Kaggle. It contains 10,337 texts in 17 different languages:
English, Malayalam, Hindi, Tamil, Portuguese, French, Dutch, Spanish, Greek, Russian, Danish, Italian, Turkish, Swedish, Arabic, German, and Kannada.

In summary, we will use NLP techniques to process this data and train a machine learning classification model on it. The model will learn to map text features to the 'Language' label, allowing it to detect the language a given text belongs to.
Join me as we embark on a step-by-step journey to build this project.

Step 1: Read and observe the data

First, we need to download the data from Kaggle, which is stored in CSV format. To handle and manipulate the data, we will use pandas.

# Importing the pandas library and aliasing it as 'pd'
import pandas as pd

# Reading a CSV file named "Language Detection.csv" into a DataFrame
df = pd.read_csv("Language Detection.csv")

# Shuffling the rows of the DataFrame with a specific random state (42)
df = df.sample(frac=1, random_state=42)

# Resetting the index of the DataFrame and dropping the old index
df = df.reset_index(drop=True)

# Displaying the first few rows of the DataFrame
df.head()

Result:

(screenshot: the first few rows of the DataFrame, showing the Text and Language columns)

Now that we've loaded the data and had a glimpse of its structure, let's proceed to explore the languages represented in the dataset.

# Count the occurrences of each language in the "Language" column
language_counts = df["Language"].value_counts()
print(language_counts)


Result:

(screenshot: the number of texts for each of the 17 languages in the dataset)

Step 2: Data preparation

The next step is to prepare the data so that it can be used as input for the machine learning algorithm.

# Importing numpy for numerical operations
import numpy as np

# Importing CountVectorizer to create bag of word features
from sklearn.feature_extraction.text import CountVectorizer

# Importing train_test_split for splitting the data
from sklearn.model_selection import train_test_split

# Creating numpy arrays for the "Text" and "Language" columns
x = np.array(df["Text"])
y = np.array(df["Language"])

# Initializing a CountVectorizer to create bag of word features
cv = CountVectorizer()

# Transforming the text data into a sparse matrix of token counts
X = cv.fit_transform(x)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)



Step 3: Building the model with the Multinomial Naïve Bayes algorithm

The Multinomial Naïve Bayes algorithm is a probabilistic classification algorithm commonly used in NLP and text classification tasks due to its simplicity and effectiveness. It's specifically designed for discrete data, making it well-suited for text data.

To learn more about this algorithm, check out this helpful tutorial by Analytics Vidhya.

# Import the Multinomial Naïve Bayes classifier from scikit-learn
from sklearn.naive_bayes import MultinomialNB

# Create an instance of the Multinomial Naïve Bayes model
model = MultinomialNB()

# Fit (train) the model using the training data
model.fit(X_train, y_train)
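
Because Multinomial Naïve Bayes is a probabilistic classifier, the trained model can report how confident it is in each language rather than just its top prediction. As a minimal sketch (assuming the model, vectorizer, and test split from the steps above), scikit-learn's predict_proba returns one probability per class:

# Inspect the class probabilities for a single test example
probabilities = model.predict_proba(X_test[:1])

# Pair each language label with its predicted probability
for language, probability in zip(model.classes_, probabilities[0]):
    print(f"{language}: {probability:.4f}")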

After building the model, we have to evaluate its performance. There are various evaluation metrics; let's use the accuracy score to assess how well the model does on the test data.

# Calculate the accuracy score of the model on the test data
accuracy_score = model.score(X_test, y_test)

# Print the model's accuracy score
print(f"The model accuracy score is: {accuracy_score}")


Result:

(screenshot: an accuracy score of approximately 0.97)

The model achieved an accuracy score of 97%, which means it predicts the correct language for roughly 97% of the instances in the test dataset. This high accuracy score demonstrates the effectiveness of the model we've built.
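
Accuracy alone can hide weaknesses on individual languages, especially since some languages have far fewer examples in the dataset than others. As an optional, hedged check (assuming the same model and test split as above), scikit-learn's classification_report shows precision, recall, and F1-score broken down by language:

from sklearn.metrics import classification_report

# Predict the language for every text in the test set
y_pred = model.predict(X_test)

# Print precision, recall, and F1-score for each language
print(classification_report(y_test, y_pred))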

Step 4: Use the model to detect languages

Let's create a program that will prompt the user to enter a text, and the model will detect the language of the text.

# Create a loop to continuously ask for user input
while True:
    user = input("Enter a Text (or 'end' to exit): ")  # Prompt the user to enter text

    # Check if the user wants to exit the loop
    if user.lower() == 'end':
        break  # Exit the loop if user enters 'end'

    data = cv.transform([user]).toarray()  # Transform user input into numerical data
    output = model.predict(data)  # Use the model to predict the language
    print(output)  # Print the predicted language

I tried these texts on the program: "I love my mum", "Позже континенты воссоединились, образовав Паннотию, которая распалась около", "Buenas tardes", and "Ich liebe dich".
Here are the results:

(screenshot: the predicted language for each input text)

Conclusion

In this journey through language detection with machine learning and natural language processing, we've explored the fascinating world of automated language identification. We began with the fundamental concept of language detection and its real-world applications, from spell checkers to predictive text input.

Our step-by-step guide has led us through the process of building a robust language detection model. We've seen how to collect, preprocess, and prepare textual data for machine learning. We even created a user-friendly program that lets you input text and have its language automatically detected.

The highlight of this project was achieving a remarkable accuracy score of 97%. This means that our model excels at correctly identifying languages for a vast majority of text inputs. It's a testament to the power of natural language processing and machine learning when applied effectively.

As we conclude this journey, it's essential to recognise the broader implications of language detection. Beyond its practical applications, language detection is a testament to the incredible progress we've made in the field of artificial intelligence. It's a reminder that the boundaries of what we can achieve with technology continue to expand.

I encourage you to take this knowledge further, explore more advanced techniques, and apply language detection to your own projects. Whether you're building multilingual applications, enhancing user experiences, or delving deeper into the world of natural language processing, the possibilities are endless.

Thank you for joining us on this educational and enlightening adventure. As you continue your exploration of the ever-evolving landscape of technology and language, remember that the journey itself is as valuable as the destination.

Happy coding!

Source code
