<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NEUROTECH AFRICA</title>
    <description>The latest articles on DEV Community by NEUROTECH AFRICA (@neurotech_africa).</description>
    <link>https://dev.to/neurotech_africa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5628%2F67456138-0eec-4808-baf8-540aa5250776.jpg</url>
      <title>DEV Community: NEUROTECH AFRICA</title>
      <link>https://dev.to/neurotech_africa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neurotech_africa"/>
    <language>en</language>
    <item>
      <title>NLP Communities for Data Professionals to Join</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Wed, 30 Nov 2022 13:03:25 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/nlp-communities-for-data-professionals-to-join-18do</link>
      <guid>https://dev.to/neurotech_africa/nlp-communities-for-data-professionals-to-join-18do</guid>
      <description>&lt;p&gt;Are you a data professional, engineer, or aspiring person to grow in NLP fields?&lt;/p&gt;

&lt;p&gt;Yes, this is for you.&lt;/p&gt;

&lt;p&gt;One of the best methods to stay current with all the newest technologies and tools connected to NLP in the tech industry is to join NLP communities.&lt;/p&gt;

&lt;p&gt;Tech communities keep enthusiasts updated and motivated, whether they are growing an idea, building impactful tools, or driving a project to success. Even if you already work in the industry, having a channel to meet other folks in the same field helps you improve your expertise and, over time, exposes you to new tools. I bring this up because it is one of the strategies I have been using for more than four years.&lt;/p&gt;

&lt;p&gt;From that experience, I came across many people asking about NLP communities where they could engage and grow their expertise, so I decided to put every piece together and share it with folks out there.&lt;/p&gt;

&lt;p&gt;I hope this will help a lot of folks looking for these communities. Stay with me as we explore them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Masakhane:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.masakhane.io/"&gt;Masakhane&lt;/a&gt; pushing to build datasets and tools to facilitate Natural Language Processing in African languages and pose new research problems to enrich the NLP  research landscape. A research effort originally for &lt;a href="http://translate.masakhane.io/"&gt;Machine translation&lt;/a&gt; focused on African languages that are open-source, continent-wide, and distributed online. It aimed to build a community of Natural Language Processing researchers, connect and grow it, spurring and sharing further research to enable language preservation, tool building, and increasing its global visibility and relevance.&lt;/p&gt;

&lt;p&gt;You can join Masakhane slack community workspace through  👉  &lt;strong&gt;&lt;a href="https://masakhane-nlp.slack.com/join/shared_invite/enQtODM3ODA3ODE0ODIwLTAyYzg3M2E3Nzg4Y2I3NzgxNDg4MmNlZDE4OTBjMzBjMjg4NTcxMWZlYTg3ZDljMTU4M2FjOTk3MDVjOWM2NGM#/shared-invite/email"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can join the Masakhane mail list group through  👉  &lt;strong&gt;&lt;a href="https://groupc/"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;NeuralSpace Community:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A group of NLP enthusiasts led by the NeuralSpace company, with the mission to create a platform that helps bridge the massive language gap that is prevalent around the world and prevents many people from accessing vital services or education.&lt;/p&gt;

&lt;p&gt;They use Slack as a channel for exchanging information and for organizing NLP events in collaboration with experts from Meta AI, NeuralSpace, LoResMT, and Masakhane.&lt;/p&gt;

&lt;p&gt;You can join the NeuralSpace Slack community workspace through  👉  &lt;strong&gt;&lt;a href="https://neuralspacecommunity.slack.com/ssb/redirect"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hugging Face Community:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A place where a broad community of data scientists, researchers, and ML engineers can come together to share ideas, get support, and contribute to open-source projects.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hugging Face is a community and data science platform that provides tools that enable users to build, train and deploy ML models based on open-source code and technologies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is one of the most awesome communities I have ever encountered in the NLP space; each day people share cutting-edge tools that are essential to the NLP ecosystem. Everyone can exchange and examine models and datasets at the Hugging Face central hub. In order to democratize AI for everyone, they aspire to become the location with the largest collection of models and datasets.&lt;/p&gt;

&lt;p&gt;You can join the Hugging Face community through  👉  &lt;strong&gt;&lt;a href="https://huggingface.co/"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Spark NLP:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Slack group for developers and Spark NLP users that helps newcomers get started solving common NLP use cases and exchange ideas on best NLP practices. This community was built on the grounds of knowledge and communication management.&lt;/p&gt;

&lt;p&gt;You can join the Spark NLP community slack workspace through  👉  &lt;strong&gt;&lt;a href="https://app.slack.com/client/T9BRVC9AT/setup-people"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Lanfrica Community:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Lanfrica aims to mitigate the difficulty of discovering African language resources by creating a centralized hub. They organize a series of talks to highlight and showcase language technology efforts (research, projects, software, applications, datasets, models, initiatives, etc.) geared towards under-represented languages around the world.&lt;/p&gt;

&lt;p&gt;Lanfrica is equally interested in efforts targeting (or that can be transferred to) low-resource languages (these are languages with not much data, societal/research efforts or technologies, and recognition) and endangered languages.&lt;/p&gt;

&lt;p&gt;You can join the Lanfrica community mailing list through  👉  &lt;strong&gt;&lt;a href="https://lanfrica.com/mailing-list/subscribe"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can join the Lanfrica community slack workspace through  👉  &lt;strong&gt;&lt;a href="https://lanfrica.slack.com/join/shared_invite/zt-12x0oo6i8-tZ182NK~aUXroVE5tgRNaw#/shared-invite/email"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Other DS &amp;amp; ML Communities:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kaggle:&lt;/strong&gt; a well-known data science competition platform. It boasts a community of over 5 million users, where you can compete and share datasets and projects. Inside Kaggle you’ll find all the code and data you need to do your data science work. Use over 50,000 public &lt;a href="https://www.kaggle.com/datasets"&gt;datasets&lt;/a&gt; and 400,000 public &lt;a href="https://www.kaggle.com/kernels"&gt;notebooks&lt;/a&gt; to conquer any analysis in no time. The thing I like best about Kaggle is its &lt;a href="https://www.kaggle.com/learn"&gt;well-structured and interactive learning&lt;/a&gt; environment, which lets even beginners start their journey in data science and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zindi Africa:&lt;/strong&gt; this platform played an essential role in my career. I am not saying it dumped everything into my head, but I worked through a lot of its challenges to improve my data science understanding.&lt;/p&gt;

&lt;p&gt;Zindi hosts the largest community of African data scientists, working to solve the world’s most pressing challenges using machine learning and Artificial Intelligence.&lt;/p&gt;

&lt;p&gt;You can join the Zindi community through  👉  &lt;strong&gt;&lt;a href="https://zindi.africa/?referralCode%3D4WtlJO"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Driven Data:&lt;/strong&gt; works on projects at the intersection of data science and social impact, in areas like international development, health, education, research and conservation, and public services. They focus on giving more organizations access to the capabilities of data science and on engaging more data scientists with social challenges where their skills can make a difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataTalks:&lt;/strong&gt; another awesome community whose &lt;a href="https://datatalks.club/events.html"&gt;events&lt;/a&gt; and training programs I like to join. &lt;a href="https://datatalks.club/"&gt;DataTalks&lt;/a&gt; is the place to talk about data, a global online community of data enthusiasts. They also post their events on YouTube through their &lt;a href="https://www.youtube.com/@DataTalksClub"&gt;channel&lt;/a&gt;, which is a very resourceful platform for data professional growth.&lt;/p&gt;

&lt;p&gt;You can join the DataTalks community slack workspace through  👉  &lt;strong&gt;&lt;a href="https://datatalks.club/slack.html"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLOps Community:&lt;/strong&gt; a great community for learning how to take machine learning models into production. It fills the swiftly growing need to share real-world Machine Learning Operations best practices from engineers in the field.&lt;/p&gt;

&lt;p&gt;The MLOps community hosts weekly talks and fireside chats about everything to do with the new space emerging around DevOps for machine learning, also known as MLOps or Machine Learning Operations.&lt;/p&gt;

&lt;p&gt;Curious to dig more about this awesome community?&lt;/p&gt;

&lt;p&gt;You can join the MLOps community slack workspace through  👉 &lt;strong&gt;&lt;a href="https://home.mlops.community/"&gt;Join here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When it comes to the advancement of AI, the open-source community is becoming more and more significant. Sharing information and resources in order to advance together is where the future is headed, because no firm, not even the tech giants, will be able to "solve AI" on their own!&lt;/p&gt;

&lt;p&gt;I hope this article sparked new thoughts about the machine learning space. Please spread the love by sharing it with others on social media.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M1JuqBZS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0wznc5nyxlai97tlaug7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M1JuqBZS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0wznc5nyxlai97tlaug7.jpg" alt="Image description" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>community</category>
    </item>
    <item>
      <title>Understanding How to Evaluate Textual Problems</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Tue, 13 Sep 2022 09:47:52 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/understanding-how-to-evaluate-textual-problems-32md</link>
      <guid>https://dev.to/neurotech_africa/understanding-how-to-evaluate-textual-problems-32md</guid>
      <description>&lt;p&gt;As a data professional, building models is a common topic what differs is just what that model is for? models, should solve certain challenges? then after we consider measuring the quality and performance of these models using &lt;a href="https://deepai.org/machine-learning-glossary-and-terms/evaluation-metrics"&gt;evaluation metrics&lt;/a&gt; and these are essential to confirm something concerning built models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Evaluation metrics are used to measure the quality of the statistical or machine learning model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/evaluation-metrics-for-textual-problems/"&gt;Neurotech Africa&lt;/a&gt; blog.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Need for evaluation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The aim of building AI solutions is to apply them to real-world challenges. Mind you, our real world is complicated, so how do we decide which model to use and when? That is where evaluation metrics come into play.&lt;/p&gt;

&lt;p&gt;A failure to justify why you are choosing a certain model over others, or why a certain model is good or not, indicates you are not aware of what you are solving or of the model you built.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"When you can measure what you are speaking of and express it in numbers, you know that on which you are discussing. But when you cannot measure it and express it in numbers, your knowledge is of a very meager and unsatisfactory kind." ~ Lord Kelvin&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Today let's get a sense of the metrics used in Natural Language Processing challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Textual Evaluation Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the Natural Language Processing (NLP) field, it is difficult to measure the performance of models across different tasks. Challenges with fixed labels are easier to evaluate, but in many NLP tasks the ground truth, and hence the result, can vary.&lt;/p&gt;

&lt;p&gt;We have lots of downstream tasks such as text or sentiment analysis, language generation, question answering, text summarization, text recognition, and translation.&lt;/p&gt;

&lt;p&gt;It is possible for biases to creep into models through the dataset or the evaluation criteria. It is therefore necessary to establish standard performance benchmarks for NLP tasks. These performance metrics give us an indication of which model is better for which task.&lt;/p&gt;

&lt;p&gt;Let's jump right in to discuss some of the textual evaluation metrics  😊&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; a common metric in &lt;a href="https://en.wikipedia.org/wiki/Sentiment_analysis"&gt;sentiment analysis&lt;/a&gt; and &lt;a href="https://blog.neurotech.africa/swahili-text-classification-using-transformers/"&gt;classification&lt;/a&gt;. It is not always the best choice, but it denotes the fraction of predictions the model gets right out of the total predictions it makes. It is best used when the output variable is categorical or discrete, for example, how often a sentiment classification algorithm is correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confusion Matrix:&lt;/strong&gt; also used in &lt;a href="https://blog.neurotech.africa/swahili-text-classification-using-transformers/"&gt;classification&lt;/a&gt; challenges. It provides a clear report on the model's predictions across categories, and from this visualization the following questions can be answered:&lt;/p&gt;

&lt;p&gt;What percentage of the positive class is actually positive? (Precision)&lt;/p&gt;

&lt;p&gt;What percentage of the positive class gets captured by the model? (Recall)&lt;/p&gt;

&lt;p&gt;What percentage of predictions are correct? (Accuracy)&lt;/p&gt;

&lt;p&gt;Also, we can consider Precision and Recall are complementary metrics that have an inverse relationship. If both are of interest to us then we’d use the F1 score to combine precision and recall into a single metric.&lt;/p&gt;
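&lt;p&gt;As a minimal sketch of how these questions translate into numbers, here is a Python example; the &lt;em&gt;tp&lt;/em&gt;, &lt;em&gt;fp&lt;/em&gt;, &lt;em&gt;fn&lt;/em&gt;, and &lt;em&gt;tn&lt;/em&gt; confusion-matrix counts below are made-up values for illustration:&lt;/p&gt;

```python
# Hypothetical counts from a binary sentiment classifier's confusion
# matrix (tp, fp, fn, tn are made-up numbers for illustration).
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)   # what share of predicted positives is actually positive?
recall = tp / (tp + fn)      # what share of actual positives gets captured?
accuracy = (tp + tn) / (tp + fp + fn + tn)   # what share of predictions is correct?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(accuracy, 3), round(f1, 3))
# 0.8 0.889 0.85 0.842
```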

&lt;p&gt;&lt;strong&gt;Perplexity:&lt;/strong&gt; a probabilistic measure used to evaluate exactly how confused our model is. It’s typically used to evaluate &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/language-modeling"&gt;language models&lt;/a&gt;, but it can also be used in dialog generation tasks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A language model scores how similar machine-generated text is to text a human would write: given the previous w tokens, it assigns a probability to the (w+1)-th token. The lower the perplexity, the better the model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Find this article about the perplexity evaluation metric, and take your time to explore &lt;em&gt;&lt;a href="https://towardsdatascience.com/perplexity-in-language-models-87a196019a94"&gt;Perplexity in Language Models&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
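&lt;p&gt;As a minimal sketch of the calculation (the per-token probabilities below are made up for illustration), perplexity is the exponential of the average negative log-probability the model assigns to each token:&lt;/p&gt;

```python
import math

# Hypothetical probabilities a language model assigned to each token
# of a 5-token sentence (made-up numbers for illustration).
token_probs = [0.2, 0.1, 0.5, 0.25, 0.05]

# Cross-entropy: average negative log-probability per token (in nats)
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the cross-entropy; lower is better
perplexity = math.exp(cross_entropy)
print(round(perplexity, 2))  # 6.03
```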

&lt;p&gt;&lt;strong&gt;Bits-per-character (BPC) and bits-per-word:&lt;/strong&gt; other metrics often used for language-model evaluation. BPC measures exactly the quantity it is named after: the average number of bits needed to encode one character.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;if the language is translated into binary digits (0 or 1) in the most efficient way, the entropy is the average number of binary digits required per letter of the original language.&lt;/em&gt;" ~ Shannon&lt;/p&gt;

&lt;p&gt;Entropy is the average number of bits needed per character. The reason some language models report both the cross-entropy loss and BPC is purely technical.&lt;/p&gt;

&lt;p&gt;In practice, if everyone uses a different base, it is hard to compare results across models. For the sake of consistency, when we report entropy or cross-entropy, we report the values in bits.&lt;/p&gt;
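&lt;p&gt;As a tiny illustration of the unit conversion (the loss value below is an assumption): a cross-entropy reported in nats, as deep learning frameworks usually do, becomes bits-per-character when divided by the natural log of 2:&lt;/p&gt;

```python
import math

# A hypothetical character-level cross-entropy loss reported in nats;
# the value is an assumption for illustration.
loss_nats_per_char = 1.2

# Expressing the same loss in bits gives bits-per-character (BPC):
bpc = loss_nats_per_char / math.log(2)
print(round(bpc, 3))  # 1.731
```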

&lt;p&gt;Mind you, BPC is specific to character-level language models. When we have word-level language models, the quantity is called bits-per-word (BPW): the average number of bits required to encode a word.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General Language Understanding Evaluation (GLUE):&lt;/strong&gt; this is a multi-task benchmark based on different types of tasks rather than evaluating a single task. As language models are increasingly being used for the purposes of transfer learning to other NLP tasks, the intrinsic evaluation of a language model is less important than its performance on downstream tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Super General Language Understanding Evaluation (SuperGLUE):&lt;/strong&gt; methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. SuperGLUE is an improved version of the GLUE benchmark with a new set of more difficult language understanding tasks and improved resources, introduced after performance on GLUE came close to the level of non-expert humans.&lt;/p&gt;

&lt;p&gt;It comprises new ways to test creative approaches on a range of difficult NLP tasks, including sample-efficient, transfer, multitask, and self-supervised learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BiLingual Evaluation Understudy (BLEU):&lt;/strong&gt; commonly used in &lt;a href="https://blog.neurotech.africa/understanding-the-concept-of-machine-translation/"&gt;machine translation&lt;/a&gt; and &lt;a href="https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/"&gt;caption generation&lt;/a&gt;. Since manual labeling by professional translators is very expensive, this metric compares a candidate translation (&lt;em&gt;by machine&lt;/em&gt;) to one or more reference translations (&lt;em&gt;by a human being&lt;/em&gt;). The output lies in the range 0-1, where a score closer to 1 indicates a good-quality translation.&lt;/p&gt;

&lt;p&gt;The calculation of BLEU involves the concept of n-gram precision and sentence brevity penalty.&lt;/p&gt;

&lt;p&gt;This metric has some drawbacks: it doesn’t consider meaning, it doesn’t directly consider sentence structure, and it doesn’t handle morphologically rich languages well.&lt;/p&gt;

&lt;p&gt;Rachael Tatman wrote an amazing article about BLEU just take your time to read it &lt;a href="https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213"&gt;here&lt;/a&gt;.&lt;/p&gt;
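&lt;p&gt;To make the n-gram precision and brevity penalty concrete, here is a toy sentence-level BLEU sketch. It is a simplified illustration, not a full implementation: real BLEU implementations (for example in NLTK) use up to 4-grams and add smoothing for zero matches, while this version stops at bigrams:&lt;/p&gt;

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: clipped n-gram precisions combined by a
    geometric mean, times a brevity penalty. Real implementations add
    smoothing and default to 4-grams."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty: penalize candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat is on the mat"), 3))  # 0.707
```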

&lt;p&gt;&lt;strong&gt;Self-BLEU:&lt;/strong&gt; a smart use of the traditional BLEU metric for capturing and quantifying diversity in the generated text.&lt;/p&gt;

&lt;p&gt;The lower the value of the self-bleu score, the higher the diversity in the generated text. Long text generation tasks like story generation, news generation, etc could be a good fit to keep an eye on such metrics, helping evaluate the redundancy and monotonicity in the model. This metric can be complemented with other text generation evaluation metrics that account for the goodness and relevance of the generated text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metric for Evaluation of Translation with Explicit ORdering(METEOR):&lt;/strong&gt; Precision-based metric to measure the quality of the generated text. Sort of a more robust BLEU. Allows synonyms and stemmed words to be matched with the reference word. Mainly used in machine translation.&lt;/p&gt;

&lt;p&gt;METEOR addresses two of BLEU's drawbacks: not taking recall into account, and only allowing exact 𝑛-gram matching. METEOR first performs exact word matching, followed by stemmed-word matching, and finally synonym and paraphrase matching; it then computes the F-score using this relaxed matching strategy.&lt;/p&gt;

&lt;p&gt;Because METEOR only considers unigram matches, as opposed to 𝑛-gram matches, it seeks to reward longer contiguous matches using a penalty term known as the &lt;strong&gt;fragmentation penalty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BERTScore:&lt;/strong&gt; this is an automatic evaluation metric used for testing the goodness of text generation systems. Unlike existing popular methods that compute token-level syntactical similarity, BERTScore focuses on computing semantic similarity between tokens of reference and hypothesis.&lt;/p&gt;

&lt;p&gt;Using contextualized embeddings from Bidirectional Encoder Representations from Transformers (BERT), BERTScore computes the cosine similarity of each hypothesis token 𝑗 with each token 𝑖 in the reference sentence. It uses a greedy matching approach instead of a time-consuming best-case matching approach and then computes the F1 measure.&lt;/p&gt;

&lt;p&gt;BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Character Error Rate (CER):&lt;/strong&gt;  this is a common metric of the performance of an automatic speech recognition system. This value indicates the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.&lt;/p&gt;

&lt;p&gt;Tasks where CER can be applied to measure performance include speech recognition, optical character recognition (OCR), and handwriting recognition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word Error Rate (WER):&lt;/strong&gt; this is a common performance metric mainly used for speech recognition, optical character recognition (OCR), and handwriting recognition.&lt;/p&gt;

&lt;p&gt;When recognizing speech and transcribing it into text, some words may be left out or misinterpreted. WER compares the predicted output and the reference transcript word by word to figure out the number of differences between them.&lt;/p&gt;

&lt;p&gt;There are three types of errors considered when computing WER:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Insertions:&lt;/em&gt; when the predicted output contains additional words that are not present in the transcript (for example, &lt;em&gt;SAT&lt;/em&gt; becomes &lt;em&gt;essay tea&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Substitutions:&lt;/em&gt; when the predicted output contains some misinterpreted words that replace words in the transcript (for example, &lt;em&gt;noose&lt;/em&gt; is transcribed as &lt;em&gt;moose&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Deletions:&lt;/em&gt; when the predicted output doesn’t contain words that are present in the transcript (for example, &lt;em&gt;turn it around&lt;/em&gt; becomes &lt;em&gt;turn around&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For understanding let's consider the following reference transcript and predicted output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reference transcript: “&lt;em&gt;Understanding textual evaluation metrics is awesome for a data professional&lt;/em&gt;”.&lt;/li&gt;
&lt;li&gt;Predicted output: “&lt;em&gt;Understanding textual metrics is great for a data professional&lt;/em&gt;”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, the predicted output has one deletion (the word “&lt;em&gt;evaluation&lt;/em&gt;” disappears) and one substitution (“&lt;em&gt;awesome&lt;/em&gt;” becomes “&lt;em&gt;great&lt;/em&gt;”).&lt;/p&gt;

&lt;p&gt;So, what is the Word Error Rate of this transcription? Basically, WER is the number of errors divided by the number of words in the reference transcript.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;WER = (num inserted + num deleted + num substituted) / num words in the reference&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thus, in our example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;WER = (0 + 1 + 1) / 10 = 0.2&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lower WER often indicates that the Automated Speech Recognition (ASR) software is more accurate in recognizing speech. A higher WER, then, often indicates lower ASR accuracy.&lt;/p&gt;

&lt;p&gt;The drawback is that it assumes the impact of different errors is the same, while in practice an insertion error may have a bigger impact than a deletion. Another limitation is that this metric cannot distinguish a substitution error from a combined deletion and insertion error.&lt;/p&gt;
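&lt;p&gt;The WER formula above can be sketched as a word-level edit-distance (Levenshtein) computation; on the example sentences earlier it reproduces the value 0.2:&lt;/p&gt;

```python
def wer(reference, hypothesis):
    """Word Error Rate = word-level edit distance (substitutions,
    deletions, insertions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "Understanding textual evaluation metrics is awesome for a data professional"
predicted = "Understanding textual metrics is great for a data professional"
print(wer(reference, predicted))  # 0.2
```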

&lt;p&gt;&lt;strong&gt;Recall-Oriented Understudy for Gisting Evaluation (ROUGE):&lt;/strong&gt; recall-based, unlike BLEU, which is precision-based. The ROUGE metric includes a set of variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. ROUGE-N is similar to BLEU-N in counting the 𝑛-gram matches between the hypothesis and the reference.&lt;/p&gt;

&lt;p&gt;This is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference (or a set of references), typically a human-produced summary or translation.&lt;/p&gt;

&lt;p&gt;Mind you, ROUGE is especially relevant in summarization tasks, where it’s important to evaluate how many of the reference words a model can recall (recall = true positives as a share of true positives plus false negatives).&lt;/p&gt;

&lt;p&gt;Feel free to check out the python package &lt;em&gt;&lt;a href="https://pypi.org/project/rouge/"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
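&lt;p&gt;As a minimal sketch of the recall-oriented idea, here is a simplified ROUGE-1 over unigrams (not the full metric family that the package above implements):&lt;/p&gt;

```python
from collections import Counter

def rouge1(hypothesis, reference):
    """Simplified ROUGE-1: unigram overlap scored as recall,
    precision, and F1. The real metric family also covers
    ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum(min(hyp[w], ref[w]) for w in ref)
    recall = overlap / sum(ref.values())      # share of reference words recalled
    precision = overlap / sum(hyp.values())
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return recall, precision, f1

print(rouge1("the cat sat", "the cat sat on the mat"))  # recall 0.5, precision 1.0
```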

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Understanding which performance measure to use, and which is best for the problem at hand, helps validate that the solution meets the needs of the particular challenge.&lt;/p&gt;

&lt;p&gt;The challenge with NLP solutions is measuring their performance across various tasks. For other machine learning tasks it is easier to measure performance because the cost function or evaluation criteria are well defined, giving a clear picture of what is to be evaluated.&lt;/p&gt;

&lt;p&gt;One more reason for this is that labels are well defined in other tasks, but in NLP tasks the ground truth can vary a lot. Coming up with the best model depends on various factors, but the evaluation metric is an essential one to consider, depending on the nature of the task you are solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;References:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://thegradient.pub/understanding-evaluation-metrics-for-language-models/"&gt;Evaluation Metrics for Language Modeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213"&gt;Evaluating Text Output in NLP: BLEU at your own risk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2006.14799.pdf"&gt;Evaluation of Text Generation: A Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aclanthology.org/2021.triton-1.6.pdf"&gt;Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1904.09675"&gt;Evaluating Text Generation with BERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.paperspace.com/automated-metrics-for-evaluating-generated-text/"&gt;Automated metrics for evaluating the quality of text generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Future of Customer Service: What You Need to Know About Conversational AI</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Fri, 09 Sep 2022 13:27:13 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/the-future-of-customer-service-what-you-need-to-know-about-conversational-ai-33nb</link>
      <guid>https://dev.to/neurotech_africa/the-future-of-customer-service-what-you-need-to-know-about-conversational-ai-33nb</guid>
      <description>&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/the-future-of-customer-service-what-you-need-to-know-about-conversational-ai/"&gt;Neurotech Africa blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today’s consumers are more informed, connected, and demanding than ever before. As a result, brands that fail to meet their high standards face an uphill battle. 74% of consumers will not recommend a brand again after a negative experience. Moreover, 90% of customers expect to be able to communicate directly with a company through chat or messaging as if they were friends. Conversational AI has the potential to revolutionize the customer service experience by making it more personal and accessible for end users. This blog post breaks down the whys and hows of conversational AI in customer service, so keep reading to learn more.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Conversational AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Conversational AI, built on natural language processing, is the ability of machines to understand human language and respond accordingly. Natural language processing is key to implementing conversational interfaces: interfaces that allow people to communicate with computers through spoken language and written text as if they were having a conversation with another person. A conversational interface has two main parts: an automated system (e.g. an IVR or an SMS-based solution) that detects and responds to user inputs, and a natural language processing (NLP) component that analyses and understands the user input. The NLP component transforms the input into a machine-readable format and then triggers an appropriate response from the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--raweIVdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2A7E76P5SryLng4orn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--raweIVdk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2A7E76P5SryLng4orn.jpg" alt="https://miro.medium.com/max/700/0*7E76P5SryLng4orn.jpg" width="700" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is Customer Service Important?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Customer service is a key aspect of any customer-facing business. It can be the difference between capturing a new customer and losing an existing one. It’s no wonder that the customer experience is the top priority for brands. According to a recent study, 69% of customers would pay more for a better experience. That’s why so many companies are turning to customer service AI. Customer service AI brings conversational interfaces, a technology that’s been around since the 1960s. More recently, it’s become increasingly important in the fields of commerce, health care, transportation, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How will Conversational AI Change Customer Service?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The rise of human-machine communication will transform customer service in the following ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased accessibility&lt;/strong&gt;: Human customer service will become more accessible to everyone thanks to the rise of AI-powered virtual assistants. Meanwhile, AI customer service agents will be able to handle more requests from more people simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better quality of service&lt;/strong&gt;: High-quality, personalized service delivered by AI agents will boost customer satisfaction and retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better customer satisfaction&lt;/strong&gt;: Satisfied customers generate more revenue for businesses than unhappy customers. AI customer service agents can increase customer satisfaction across the board.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved customer retention&lt;/strong&gt;: Businesses can retain customers by providing an exceptional customer service experience. AI can help businesses do just that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved AI-human collaboration:&lt;/strong&gt; Businesses will unlock new levels of productivity by bringing AI and human agents together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better customer retention through personalized messaging&lt;/strong&gt;: AI agents will be able to deliver highly personalized messages to customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations of Conversational AI in Customer Service&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While conversational AI is poised to revolutionize the customer service experience, there are some limitations that we must account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer expectations:&lt;/strong&gt; Customers have high standards and will be disappointed if AI falls short of their expectations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data privacy and security:&lt;/strong&gt; Businesses must protect the privacy of their customer’s data. AI poses a particular concern in this regard, as hackers can use AI to take over machines and systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cultural shifts in customer service:&lt;/strong&gt; AI may not be a good fit for every culture, and businesses may have to adjust their strategies accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human resources:&lt;/strong&gt; The implementation of AI may mean fewer human agents, which may pose problems for businesses that depend on human customer service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical limitations:&lt;/strong&gt; While the promise of AI is great, the technology is not yet advanced enough to meet all of our expectations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shifting customer service strategies:&lt;/strong&gt; Customer service strategies may shift in the coming years, rendering today’s AI technologies obsolete.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Achieve Success with Conversational AI in Customer Service&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Success with conversational AI in customer service starts with a strategic plan for implementation. Companies should consider the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LbDIrz6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2AfBGttCdBm2R3xSn_.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LbDIrz6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2AfBGttCdBm2R3xSn_.jpg" alt="https://miro.medium.com/max/700/0*fBGttCdBm2R3xSn_.jpg" width="700" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Defining the customer experience:&lt;/strong&gt; Companies must define their customer experience strategy, including how AI agents fit into that strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building a strategy for AI:&lt;/strong&gt; Companies should decide what type of AI to implement and how that AI will work within their strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hiring the right talent:&lt;/strong&gt; Companies must hire the right people to implement their AI strategy. This includes both AI agents and human agents that will collaborate with them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investing in the right technology:&lt;/strong&gt; Companies must choose the right technology that supports their AI strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing and training:&lt;/strong&gt; Companies must ensure that AI works as intended before launching it to customers. They must also train their human and AI agents to work together.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The future of customer service is bright, but businesses must act soon to take advantage of the benefits of conversational AI. Companies must prepare by defining their strategy, investing in the right technology, and hiring the right talent. They must also consider the limitations of AI and have a plan for overcoming them. Finally, businesses must act quickly before the benefits of conversational AI are claimed by others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HDZ3Uxm8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tuybc5sjf3hmy2nk2dyz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HDZ3Uxm8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tuybc5sjf3hmy2nk2dyz.jpg" alt="Image description" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Redefining Customer Engagement as Digital Bank</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Fri, 09 Sep 2022 13:14:53 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/redefining-customer-engagement-as-digital-bank-2d72</link>
      <guid>https://dev.to/neurotech_africa/redefining-customer-engagement-as-digital-bank-2d72</guid>
      <description>&lt;p&gt;This article means a lot to digital banks on how they can use conversational Artificial intelligence to acquire, engage and retain customers.&lt;/p&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/redefining-customer-engagement-as-digital-bank/"&gt;Neurotech Africa&lt;/a&gt; blog.&lt;/p&gt;

&lt;p&gt;Wow! So the title of this article brought you here. Great to hear that.&lt;/p&gt;

&lt;p&gt;I will be sharing with you my understanding of how deeply people interact with digital banks powered by artificial intelligence technology. Without further ado, let me outline the topic in simple words.&lt;/p&gt;

&lt;p&gt;But you already have something in mind, right?&lt;/p&gt;

&lt;p&gt;Customer engagement is the means by which a company creates a relationship with its customer base to foster brand loyalty and awareness. This can be accomplished via marketing campaigns, new content created for and posted to websites, and outreach via social media and mobile and wearable devices, among other methods.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Customer engagement is the ongoing interactions between company and customer, offered by the company, chosen by the customer.” by Paul Greenberg&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wonderful, now we are clear on the topic. When you let your customers choose how they’d like to engage with you, you’ll be more likely to uncover the types of interactions they find valuable. By making it easier for customers to engage in ways they find valuable, you’ll strengthen their emotional investment in your digital bank.&lt;/p&gt;

&lt;p&gt;Disruptive innovation in financial services is growing rapidly. The challenges facing this industry keep professionals brainstorming the right way to address them with existing technologies. In modern banking, artificial intelligence has taken on an important and distinguished series of roles, from security automation and loan automation to customer engagement processes.&lt;/p&gt;

&lt;p&gt;Companies with well-defined data strategies have realized the great role played by this technology to bring value to their products. The journey of handling customers differs from one organization to another depending on culture, strategies, goals, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why should digital banks care about redefining customer engagement?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In digital banking, customers are king: the interaction between the bank and people is the business, and the customers one bank needs are the same customers its competitors need. Redefining the interaction between your customers and the bank is important for providing the good customer service a successful business depends on. With the advent of digital, the scope of good customer service has extended from providing timely and high-quality products and/or services to providing an experience that delivers value outside the original sale.&lt;/p&gt;

&lt;p&gt;As the banking world has become more crowded, there’s been an overwhelming focus on clicks, conversions, and acquisition costs.&lt;/p&gt;

&lt;p&gt;However, these acquisition strategies alone won’t be enough to grow your business sustainably. Finding ways to engage with your customers in between purchases strengthens their emotional connection to your brand, helping you retain the customers you already have while sustainably growing your business.&lt;/p&gt;

&lt;p&gt;In fact, 95 percent of the revenue banks generate relies on effective customer engagement, through interest on loans and fees associated with their services.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.constellationr.com/blog-news/research-summary-why-live-engagement-marketing-supercharges-event-marketing"&gt;Constellation Research&lt;/a&gt; on customer engagement, companies that have improved engagement increase cross-sell by 22 percent, drive up-sell revenue from 13 percent to 51 percent, and also increase order sizes from 5 percent to 85 percent.&lt;/p&gt;

&lt;p&gt;The statistics show the impact of engaging your customers and how significantly revenue can increase.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;About conversational Artificial intelligence&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Conversational AI involves three concepts: artificial intelligence, human language, and automation. We can define it as the type of artificial intelligence that enables consumers to interact with computer applications the way they would with other humans.&lt;/p&gt;

&lt;p&gt;The best conversational AI solutions provide remarkable support for businesses. Think about the last time you communicated with a business online and received the answer to your question within seconds, all with little effort. This is conversational AI doing powerful work seamlessly and efficiently. The bonus? A conversational AI solution knows when to notify and transfer the customer to a live agent, all within the same conversation stream, when the situation warrants it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conversational Artificial Intelligence for Customer Engagement in Digital Banking&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The process of acquiring, engaging, and retaining customers can be boosted with technologies like conversational artificial intelligence. The technology alone does not cover the whole process, and other factors must be considered alongside it; here are the main cases for digital banks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increasing customer attraction through social platforms:&lt;/strong&gt; making digital banks’ services easier to access on social platforms like WhatsApp and Telegram boosts engagement and goes a long way in keeping customers engaged over time. Conversational AI makes this kind of engagement easier to handle by holding natural conversations with your customers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managing payments and transactions:&lt;/strong&gt; people regularly have to clear bills, pay businesses, shop online, or perform other online transactions, and conversational AI can help the user make and track these payments. Clearing payments is often urgent and time-bound, and switching platforms to complete a transaction is inconvenient. With an omnichannel conversational AI, your customers can make payments right where they are and avoid any delays!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommending new services:&lt;/strong&gt; with conversational AI, digital banks can simplify the process of selecting the right services or products for specific customers based on their day-to-day interactions. Meeting user expectations is a great win, and this can improve engagement with your bank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addressing frequently asked questions (FAQs):&lt;/strong&gt; with conversational AI, handling repetitive questions becomes easier. Instead of calling an agent or scrolling through a long website page, customers can type or speak and get an answer to a query instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead generation:&lt;/strong&gt; conversational AI solutions have no match when interaction comes into play. They can interact with customers for the first time and understand their needs and the sentiment behind the conversation. This very human interaction can help digital banks acquire new customers and collect their details, which are then passed to the sales team to take the conversation forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driving referral campaigns with existing customers:&lt;/strong&gt; with conversational AI, driving engagement doesn’t have to be solely between your customers and your brand; it can also happen between customers. Empowering your best customers to easily share your brand with their friends and family not only helps you acquire new customers but also engages the customers you have.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does &lt;a href="https://www.neurotech.africa/#contact"&gt;Neurotech’s&lt;/a&gt; conversational AI solution redefine customer engagement for digital banks?&lt;/p&gt;

&lt;p&gt;We offer customer support solutions that let businesses engage customers with a personalized experience at every touchpoint, across any digital channel, through our internal engine called &lt;a href="https://sarufi.io/"&gt;Sarufi&lt;/a&gt;. We care about the memorable experiences that happen when customers are free to speak naturally. Our conversational solutions (chatbots) understand customers and provide seamless customer support across multiple platforms, enabling you to offer a more personalized, contextual service, reduce call center overload, and ensure reliable customer support 24/7; you can explore more from &lt;a href="https://blog.neurotech.africa/how-can-neurotech-transform-your-business-with-conversational-ai/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ILOrD12C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2AOr7TTBGbnJci0NPN.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ILOrD12C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/0%2AOr7TTBGbnJci0NPN.jpg" alt="https://miro.medium.com/max/700/0*Or7TTBGbnJci0NPN.jpg" width="700" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can reach out for a demo of our banking conversational AI solution &lt;a href="https://www.neurotech.africa/#contact"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Don’t confuse technology with business strategy. Rely on your own strategies, which can then be boosted with technology like artificial intelligence.&lt;/p&gt;

&lt;p&gt;Great customer experiences across every channel are an imperative that digital banks cannot ignore. While the availability of digital footprints has made it possible to deliver pronounced mobile and digital experiences, digital banks need to ensure that the customer at the physical branch is not deprived of the same seamless, immersive experience that the digital-native or millennial customer is accustomed to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---YafQApo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hb7cuw37taf5cl3nn4a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---YafQApo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hb7cuw37taf5cl3nn4a.jpg" alt="Image description" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Filter Swahili SMS by categories using machine learning.</title>
      <dc:creator>Elia</dc:creator>
      <pubDate>Wed, 10 Aug 2022 09:22:42 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/filter-swahili-sms-by-categories-using-machine-learning-2ca6</link>
      <guid>https://dev.to/neurotech_africa/filter-swahili-sms-by-categories-using-machine-learning-2ca6</guid>
      <description>&lt;p&gt;When you hear "&lt;strong&gt;ding&lt;/strong&gt;" you almost fall over running to your phone in the hopes of seeing the long-awaited SMS and then sadly discover it's a promotional message from an &lt;strong&gt;&lt;em&gt;XYZ&lt;/em&gt;&lt;/strong&gt; brand. This can really be annoying, many of these promotional and spam SMS continue to clog up our inboxes and get worse with time, stealing our precious time and attention.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What can we learn from Gmail?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The problem is not very new; it also exists with email, and one approach that providers like Gmail adopted, which has worked very well, is grouping emails into categories depending on their intention, which can be &lt;em&gt;promotional,&lt;/em&gt; &lt;em&gt;social,&lt;/em&gt; or &lt;em&gt;primary&lt;/em&gt;, while also filtering out fraudulent emails (spam).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can we replicate the Gmail approach for SMS? If yes, how?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The meat of this article is centered around answering that question: we are going to learn how to classify SMS messages into categories according to their intention. Now you might be asking yourself, &lt;em&gt;how does one determine and classify the intention of an SMS&lt;/em&gt;? We are going to train a machine learning model that learns the similarities within each category and then uses its generalized learned model to group new SMS into categories.&lt;/p&gt;
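
&lt;p&gt;As a rough intuition for what such a model learns, here is a toy sketch in Python: it counts which words appear in each category's training messages and assigns a new SMS to the category whose vocabulary it overlaps most. The two categories and example messages are made up for illustration; the real model in this article is trained on the full annotated dataset described below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

def train(examples):
    """examples is a list of (text, label) pairs; build per-label word counts."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def predict(model, text):
    """Score each label by how often the message's words appear in its counts."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

# Hypothetical two-category training set for illustration
model = train([
    ("umeshinda bonasi ya promo leo", "PROMOTIONAL"),
    ("ofa maalum ya promo kwa wateja wapya", "PROMOTIONAL"),
    ("umepokea malipo kiasi cha shilingi 5000", "TRANSACTION"),
    ("muamala wako wa malipo umekamilika", "TRANSACTION"),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;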

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Collection and Annotations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first step was sourcing, collecting, and annotating the SMS data that would be used to train our machine learning model. Data collection was done using the &lt;a href="https://play.google.com/store/apps/details?id=com.jerryzigo.smsbackup"&gt;SMS backup&lt;/a&gt; application with multiple individual contributors, and the app's output was well-organized JSON data of SMS and their details, as shown in the example snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7126"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TIGO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tigo inakutakia maadhimisho mema ya siku ya Muungano."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1619430394016"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"errorCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"locked"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"messageDirection"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INCOMING"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"messageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SMS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"replyPathPresent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"seen"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"serviceCenter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"+2557********"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tigo inakutakia maadhimisho mema ya siku ya Muungano."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"threadId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"492"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then annotated our data into distinct categories based on the context and intention of the text messages; these are the categories we came up with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Promotional&lt;/li&gt;
&lt;li&gt;Notification&lt;/li&gt;
&lt;li&gt;Transaction&lt;/li&gt;
&lt;li&gt;Sports Betting&lt;/li&gt;
&lt;li&gt;Michezo ya Bahati Nasibu (General gambling SMS)&lt;/li&gt;
&lt;li&gt;Survey&lt;/li&gt;
&lt;li&gt;Verification&lt;/li&gt;
&lt;li&gt;Informational&lt;/li&gt;
&lt;li&gt;Personal&lt;/li&gt;
&lt;li&gt;SPAM&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We then exported our data into CSV format, ready for crunching. &lt;em&gt;Where is the data?&lt;/em&gt; We won't be able to share it for now because some of the SMS contain identifiable personal information; we are currently working on cleaning the data and ensuring it is of good quality, and we will then share it through our &lt;a href="https://github.com/neurotech-HQ"&gt;Github repository&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Here we go&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that you have a bit of background about the data we are going to use to train our model, let's get our hands dirty. We'll break the task down into three steps: &lt;em&gt;data preprocessing&lt;/em&gt;, &lt;em&gt;training the machine learning model,&lt;/em&gt; and &lt;em&gt;model evaluation&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Preprocessing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Data preprocessing&lt;/em&gt; is a way of converting raw data into a format that can be easily parsed by a machine learning model. We need to preprocess our datasets to easily train our model. But first, let's read and view the structure of our datasets with the help of the &lt;a href="https://pandas.pydata.org/"&gt;Pandas library&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'./raw sms data/data.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xZEhYU7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/dataset-view1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xZEhYU7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/dataset-view1.png" alt="https://blog.neurotech.africa/content/images/2022/08/dataset-view1.png" width="880" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, we have several columns in our dataset. Let's start by exploring the &lt;em&gt;messageDirection&lt;/em&gt; values our data has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"messageDirection"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Output: INCOMING    3384  "There are 3384 incoming messages"
#         OUTGOING      62
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we know the collected data consists of both &lt;strong&gt;&lt;em&gt;OUTGOING&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;INCOMING&lt;/em&gt;&lt;/strong&gt; SMS, but from the very nature of our task, our primary interest lies in the incoming messages only; therefore we need to keep only the rows whose messageDirection is &lt;em&gt;INCOMING&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;incoming_sms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"messageDirection"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"INCOMING"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;interested_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;'address'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Examining Label distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examining the distribution of labels is crucial because it can reveal how well your model is likely to perform on each label.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hwgTAI8B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/label-distribution-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hwgTAI8B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/label-distribution-1.png" alt="https://blog.neurotech.africa/content/images/2022/08/label-distribution-1.png" width="880" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, most of our messages are labeled "NOTIFICATION", while "SPAM" messages are the fewest, which means our dataset is imbalanced.&lt;/p&gt;
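
&lt;p&gt;A quick way to quantify that imbalance is to compare the largest and smallest label counts. The miniature frame below is made up for illustration; on the real data you would run the same two lines on &lt;em&gt;interested_data&lt;/em&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Made-up miniature frame standing in for interested_data
sample = pd.DataFrame({
    "label": ["NOTIFICATION"] * 5 + ["PROMOTIONAL"] * 3 + ["SPAM"] * 1,
})

counts = sample["label"].value_counts()
imbalance_ratio = counts.max() / counts.min()  # 5.0 here: heavily skewed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;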

&lt;p&gt;Let's also remove duplicate messages from our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interested_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interested_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dirty_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cleaned_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;used_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dirty_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;used_texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;cleaned_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;used_texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Only Filter out interested_data whose id is in ids
&lt;/span&gt;&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interested_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;interested_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: 1920
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our data has been reduced from &lt;strong&gt;3384&lt;/strong&gt; to &lt;strong&gt;1920&lt;/strong&gt; rows, which means almost &lt;strong&gt;43%&lt;/strong&gt; of our dataset consisted of duplicates. Still, this is an acceptable amount of data to train our model.&lt;/p&gt;
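&lt;p&gt;As a side note, the deduplication loop above can also be expressed with pandas' built-in &lt;code&gt;drop_duplicates&lt;/code&gt;. A minimal sketch on a hypothetical three-row frame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frame with one duplicated text
interested_data = pd.DataFrame({
    "address": ["A", "B", "C"],
    "text": ["hello", "hello", "offer"],
    "label": ["NOTIFICATION", "NOTIFICATION", "SPAM"],
})

# Keep the first occurrence of each text, mirroring the loop above
cleaned_incoming_sms = interested_data.drop_duplicates(subset="text", keep="first")
print(len(cleaned_incoming_sms))  # 2
```

&lt;p&gt;&lt;code&gt;subset="text"&lt;/code&gt; deduplicates on the message body only, and &lt;code&gt;keep="first"&lt;/code&gt; matches the loop's behavior of keeping the earliest row.&lt;/p&gt;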

&lt;p&gt;Now let's get a good look at our data by visualizing it with a &lt;a href="https://pypi.org/project/wordcloud/"&gt;wordcloud&lt;/a&gt;. But before that, we need to remove a few stopwords. Then we can see how often certain words appear in the texts for each category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Removing stop words
&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"na"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ya"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wa"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kwa"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"pia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kisha"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"au"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
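&lt;p&gt;Before rendering the wordcloud, a plain word-frequency count already hints at which words will dominate each cloud. A standard-library-only sketch on a few hypothetical cleaned messages:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical cleaned messages, stand-ins for cleaned_incoming_sms['text']
messages = [
    "salio yako imepokelewa",
    "ofa maalum leo tu",
    "salio yako limeisha",
]

# Count word frequencies across all messages
word_counts = Counter(word for text in messages for word in text.split())
print(word_counts.most_common(3))
```

&lt;p&gt;Feeding these frequencies (grouped by label) into the wordcloud library is what produces the per-category clouds shown below.&lt;/p&gt;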



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WpMGTxIH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/Classes-export.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WpMGTxIH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/Classes-export.png" alt="https://blog.neurotech.africa/content/images/2022/08/Classes-export.png" width="880" height="629"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F94xGwM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/dataset-view2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F94xGwM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/dataset-view2.png" alt="https://blog.neurotech.africa/content/images/2022/08/dataset-view2.png" width="576" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above result of our &lt;em&gt;cleaned-incoming-sms&lt;/em&gt; is not particularly clean. We need to put in some extra effort.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make all of them lowercase.&lt;/li&gt;
&lt;li&gt;Remove all non-alphanumeric characters such as ",", "+", "%", "!", ":".&lt;/li&gt;
&lt;li&gt;Remove all numbers in the text messages.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="c1"&gt;# Clean the texts
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# remove all non-alphanumeric characters
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;#convert text to lower-case
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'[‘’“”…,]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'[()]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'[^a-zA-Z]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;' +'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of our texts are clean now, so we can start training our model.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Training Machine Learning Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We are going to use the &lt;a href="https://scikit-learn.org/stable/index.html"&gt;Scikit-learn library&lt;/a&gt; to provide us all useful tools to train our model. Let's import our required tools and train our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cleaned_incoming_sms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lowercase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our dataset is not very large, the model will finish training in a short time. Once training finishes, we can check its score.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: 0.9380530973451328
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, our model scores about &lt;strong&gt;94%&lt;/strong&gt; on the test data, which is quite good.&lt;/p&gt;
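&lt;p&gt;Keep in mind that on an imbalanced dataset like ours, overall accuracy can hide poor performance on rare labels. A per-class recall check makes this visible; the sketch below uses hypothetical toy labels rather than our actual predictions:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical true labels and predictions, for illustration only
y_true = ["SPAM", "SPAM", "NOTIFICATION", "NOTIFICATION", "NOTIFICATION", "PROMOTIONAL"]
y_pred = ["NOTIFICATION", "SPAM", "NOTIFICATION", "NOTIFICATION", "NOTIFICATION", "NOTIFICATION"]

# Recall per class: of all true members of a class, how many did we find?
totals, hits = defaultdict(int), defaultdict(int)
for true, pred in zip(y_true, y_pred):
    totals[true] += 1
    if true == pred:
        hits[true] += 1

recall = {label: hits[label] / totals[label] for label in totals}
print(recall)  # {'SPAM': 0.5, 'NOTIFICATION': 1.0, 'PROMOTIONAL': 0.0}
```

&lt;p&gt;Here overall accuracy would be 67%, yet recall on "PROMOTIONAL" is zero, which is exactly the kind of gap a single accuracy score can mask.&lt;/p&gt;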

&lt;h3&gt;
  
  
  &lt;strong&gt;Testing our model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's save our model for later use; we will then load it in another file and test it with some new messages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;
&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'./pipeline.pkl'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Before we test our model with new messages, we have to remember to pass them through the &lt;strong&gt;&lt;code&gt;clean_text&lt;/code&gt;&lt;/strong&gt; function to preprocess the text (remove non-alphanumeric characters, numbers, etc.) before feeding it to our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'./pipeline.pkl'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'test_data.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Prediction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MJl93TRM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/prediction.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MJl93TRM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/prediction.png" alt="https://blog.neurotech.africa/content/images/2022/08/prediction.png" width="880" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We tested our model with 14 messages it had never seen before. As you can see from the result above, most of the messages in the test data were "SPAM". But the model missed most of them, since there were only a few spam messages available for training.&lt;/p&gt;

&lt;p&gt;The model also didn't perform well on the "PROMOTIONAL" label, because removing duplicated messages from our dataset changed the label distribution considerably.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LqQnkkfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/label-distribution-modified.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LqQnkkfI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/label-distribution-modified.png" alt="https://blog.neurotech.africa/content/images/2022/08/label-distribution-modified.png" width="880" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Label distribution after removing duplicates&lt;/em&gt;&lt;/p&gt;
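&lt;p&gt;One possible mitigation, shown here as a sketch on a hypothetical toy corpus rather than our actual training set, is to pass &lt;code&gt;class_weight="balanced"&lt;/code&gt; to the classifier so that rare labels are weighted more heavily during training:&lt;/p&gt;

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy corpus; real training would reuse x_train / y_train
texts = ["win free prize now", "claim your free cash", "meeting at noon",
         "see you tomorrow", "free jackpot winner", "lunch next week"]
labels = ["SPAM", "SPAM", "NOTIFICATION", "NOTIFICATION", "SPAM", "NOTIFICATION"]

# class_weight='balanced' reweights each class inversely to its frequency
weighted_pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True),
    RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=42),
)
weighted_pipeline.fit(texts, labels)
print(weighted_pipeline.predict(["free prize cash"])[0])
```

&lt;p&gt;This doesn't add any data, but it pushes the model to pay more attention to under-represented labels such as "SPAM" and "PROMOTIONAL".&lt;/p&gt;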

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Any model's performance is strongly influenced by the quantity and quality of its training data. We couldn't access large datasets, but by spending more time thoroughly cleaning our training data, we can attempt to improve the accuracy of our model. Furthermore, we can tune some parameters before training, or experiment with alternative machine learning classifiers such as Decision Tree or SVM, to achieve the best results and improve the performance of our model.&lt;/p&gt;
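&lt;p&gt;For instance, alternative classifiers can be compared with cross-validation before committing to one. The sketch below uses a hypothetical toy corpus; a real comparison would reuse our cleaned dataset:&lt;/p&gt;

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical toy corpus, balanced so 2-fold stratified CV works
texts = ["win free prize", "free cash offer", "claim jackpot now", "free winner prize",
         "meeting at noon", "see you tomorrow", "lunch next week", "call me later"]
labels = ["SPAM"] * 4 + ["NOTIFICATION"] * 4

# Score each candidate classifier with the same TF-IDF features
for clf in (DecisionTreeClassifier(random_state=42), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(model, texts, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```

&lt;p&gt;Because the vectorizer sits inside the pipeline, each fold fits TF-IDF only on its own training split, avoiding leakage into the validation split.&lt;/p&gt;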

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Thank you.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>4 Popular Natural Language Processing Techniques</title>
      <dc:creator>Elia</dc:creator>
      <pubDate>Wed, 10 Aug 2022 08:55:11 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/4-popular-natural-language-processing-techniques-5g63</link>
      <guid>https://dev.to/neurotech_africa/4-popular-natural-language-processing-techniques-5g63</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Source: Wikipedia&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is most likely that you have used NLP in one way or another. If you have ever messaged a business and received an &lt;em&gt;immediate reply&lt;/em&gt;, it was probably NLP at work. Or perhaps you have just gotten home from work, filled your cup with coffee, and asked &lt;a href="https://www.apple.com/siri/"&gt;Siri&lt;/a&gt; to play some relaxing seaside sounds. Without a doubt, you use NLP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human language&lt;/strong&gt; is very complex, filled with &lt;em&gt;sarcasm&lt;/em&gt;, &lt;em&gt;idioms&lt;/em&gt;, &lt;em&gt;metaphors&lt;/em&gt;, and &lt;em&gt;grammar&lt;/em&gt; to mention a few. All of these make it difficult for computers to easily grasp the intended meaning of a certain sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Take an example of a sarcastic conversation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;John is sewing clothes with his eyes closed.&lt;br&gt;
&lt;strong&gt;Martin:&lt;/strong&gt; John, what are you doing? You're going to hurt yourself.&lt;br&gt;
&lt;strong&gt;John:&lt;/strong&gt; No, I won't. (After a few moments, John accidentally pricks himself with the needle.)&lt;br&gt;
&lt;strong&gt;Martin:&lt;/strong&gt; Well, what a surprise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With Natural Language Processing (NLP) techniques we can break human texts and sentences down and process them so that computers can understand what's happening. In this article, we are going to learn, with examples, about the most common techniques and how they're applied. We will look at:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sentiment Analysis&lt;/li&gt;
&lt;li&gt;Text Classification&lt;/li&gt;
&lt;li&gt;Text Summarization&lt;/li&gt;
&lt;li&gt;Named Entity Recognition&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Sentiment Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most businesses want to know what their customers' feedback is concerning their services or products. But you might be facing millions of pieces of customer feedback, and analyzing all of it is painful and tedious, even if you are offered a large sum of money to accomplish it. Sentiment analysis can be useful in this situation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sentiment Analysis is a natural language processing technique used to determine whether textual data carries a positive, negative, or neutral sentiment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Businesses even use sentiment analysis to determine whether a customer's comment indicates any interest in the product or service. Sentiment analysis can also be extended to examine the mood of the text data (sad, furious, or excited).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8tPbu6o6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sentiment-analysis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8tPbu6o6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sentiment-analysis.jpg" alt="Sentiment analysis image" width="689" height="517"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: revechat.com&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Use case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To accomplish this, let's use the Hugging Face transformers library. We are going to use a pre-trained model from the &lt;a href="https://huggingface.co/models"&gt;Hugging Face model hub&lt;/a&gt; called &lt;a href="https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english"&gt;"distilbert-base-uncased-finetuned-sst-2-english"&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# let's first install transformers library&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;transformers

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the library is installed, completing the task is quite simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="n"&gt;analyser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sentiment-analysis"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code will import the library and use a default pre-trained model to perform sentiment analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_comment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"The product is very useful. It have helped me alot."&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: [{'label': 'POSITIVE', 'score': 0.9997726082801819}]
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows that the sentiment of the user's comment is &lt;strong&gt;POSITIVE&lt;/strong&gt; and the model is &lt;strong&gt;99.9772%&lt;/strong&gt; sure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Text Classification&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Text classification also known as text categorization is a natural language processing technique which analyses textual data and assigns them to a predefined category.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.cisco.com/c/en_in/products/security/email-security/what-is-spam.html"&gt;Spam emails&lt;/a&gt; occasionally arrive in your mailbox. When you click on one of these links, your computer may become infected with malware. Therefore, practically all email service providers employ this NLP technique to classify or categorize the email as either spam or not.&lt;/p&gt;

&lt;p&gt;To effectively categorize your incoming emails, text classifiers are trained using a lot of spam and non-spam email data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's try to create a simple text classifier to classify whether the text we input is spam. We are going to use &lt;a href="https://textblob.readthedocs.io/en/dev/"&gt;the TextBlob library&lt;/a&gt; to achieve this.&lt;/p&gt;

&lt;p&gt;Let's create some training data to train our own classifier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Congratulation you won a your prize'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'URGENT You have won a 1 week FREE membership in our 100000 Prize Jackpot'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'SIX chances to win CASH From 100 to 20000 pounds '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'WINNER As a valued network customer you have been selected to receive 900 prize reward'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Free entry in 2 a weekly competition to win FA Cup final tickets 21st May 2005. Text FA to 87121 to receive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'I do not like this restaurant'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'I am tired of this stuff.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"I can't deal with this"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'he is my sworn enemy!'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'my boss is horrible.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'This job is bad'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'no-spam'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's import our classifier from the TextBlob library and train it on the training data we created. We are going to use the &lt;a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;NaiveBayesClassifier&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;textblob.classifiers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;

&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NaiveBayesClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After training completes (which should take only a few seconds, depending on your machine), we can feed in a new message to see how it works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Congratulation you won a free prize of 20000 dollars and Iphone 13"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: 'spam'
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our simple model correctly identified our message as "spam," which it is.&lt;/p&gt;
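&lt;p&gt;Under the hood, a Naive Bayes classifier scores each class by combining a class prior with smoothed per-word likelihoods and picks the highest-scoring class. Here is a minimal, self-contained sketch of that idea in plain Python; the toy data and add-one smoothing are illustrative assumptions, not TextBlob's actual implementation:&lt;/p&gt;

```python
import math
from collections import Counter

# Toy training data echoing the example above (abbreviated, hypothetical)
train = [
    ("congratulation you won a prize", "spam"),
    ("winner you have been selected to receive a prize reward", "spam"),
    ("free entry to win cash", "spam"),
    ("i do not like this restaurant", "no-spam"),
    ("i am tired of this stuff", "no-spam"),
    ("my boss is horrible", "no-spam"),
]

def train_nb(data):
    """Collect per-class word counts, document counts, and the vocabulary."""
    word_counts = {}          # class -> Counter of word frequencies
    doc_counts = Counter()    # class -> number of training documents
    vocab = set()
    for text, label in data:
        doc_counts[label] += 1
        counter = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counter[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def classify_nb(text, word_counts, doc_counts, vocab):
    """Return the class with the highest posterior log-probability."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counter in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)   # class prior
        total_words = sum(counter.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words don't zero out the score
            score += math.log((counter[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, doc_counts, vocab = train_nb(train)
print(classify_nb("you won a free cash prize", word_counts, doc_counts, vocab))
# → spam
```

&lt;p&gt;Libraries like TextBlob wrap exactly this kind of counting behind a friendlier API, along with tokenization and feature extraction.&lt;/p&gt;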

&lt;h2&gt;
  
  
  &lt;strong&gt;Text Summarization&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Text summarization is a natural language processing technique for producing a shorter version of a long piece of text.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine you are half asleep when your boss sends you a message telling you to read a specific document. When you check, the document is ten pages long. For you, text summarization might be a ground-breaking concept.&lt;/p&gt;

&lt;p&gt;Most text summarization models are extractive: they pull the most crucial sentences out of a document and stitch them into the final text. Some models, however, are abstractive: they try to restate the meaning of the lengthy text in their own words.&lt;/p&gt;
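&lt;p&gt;The extractive approach can be sketched with a simple frequency heuristic: score each sentence by the total frequency of its words across the document and keep the top-scoring sentences in their original order. This is only an illustration of the idea, not what the transformers pipeline used below actually does:&lt;/p&gt;

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score sentences by summed word frequency; keep the top n in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    scores = [
        (sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
        for i, s in enumerate(sentences)
    ]
    # Highest score first, then restore document order for readability
    top = sorted(sorted(scores, reverse=True)[:n_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

doc = ("The Solar System is the gravitationally bound system of the Sun. "
       "Most of its mass is in the Sun. "
       "Comets are small icy bodies.")
print(extractive_summary(doc, n_sentences=1))
```

&lt;p&gt;Modern abstractive models replace this word counting with a learned sequence-to-sequence transformer, which is what the pipeline below downloads for you.&lt;/p&gt;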

&lt;h3&gt;
  
  
  &lt;strong&gt;Use case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For this, we'll also make use of the transformers library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we are going to use the &lt;strong&gt;"summarization"&lt;/strong&gt; pipeline to summarize our long text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"summarization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# If the you don't have the summarization model in your machine, It will be downloaded from the internet.
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lengthy text can then be copied and pasted from anywhere for summarization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;long_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
    The Solar System is the gravitationally bound system of the Sun and the objects that orbit it. It formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud. The vast majority (99.86%) of the system's mass is in the Sun, with most of the remaining mass contained in the planet Jupiter. The four inner system planets—Mercury, Venus, Earth and Mars—are terrestrial planets, being composed primarily of rock and metal. The four giant planets of the outer system are substantially larger and more massive than the terrestrials. The two largest, Jupiter and Saturn, are gas giants, being composed mainly of hydrogen and helium; the next two, Uranus and Neptune, are ice giants, being composed mostly of volatile substances with relatively high melting points compared with hydrogen and helium, such as water, ammonia, and methane. All eight planets have nearly circular orbits that lie near the plane of Earth's orbit, called the ecliptic.
"""&lt;/span&gt;

&lt;span class="c1"&gt;# You can set an optional parameter of max_length to maximum number of words you want to be outputted
&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: [{'summary_text': " The Solar System formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud . The vast majority (99.86%) of the system's mass is in the Sun, with most of the remaining mass contained in the planet Jupiter . The four inner system planets are terrestrial planets, being composed primarily of rock and metal ."}]
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Named Entity Recognition (NER)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Have you ever heard fanciful tales about how a particular firm listens in on all calls, chats, and online interactions to see what people are saying about it? Well, &lt;strong&gt;"if"&lt;/strong&gt; this is true, one of their strategies might be named entity recognition: an NLP technique that identifies and classifies named entities in text data.&lt;/p&gt;

&lt;p&gt;Named entities are simply real-world objects such as people, organizations, locations, and products. A NER model might identify ‘Dar-es-salaam’ as a location or ‘Michael’ as a person's name.&lt;/p&gt;
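&lt;p&gt;The simplest possible illustration of NER is a gazetteer lookup, where a dictionary maps known names to entity types. Real NER models, like the spaCy model used below, learn to use sentence context instead of fixed lists; the word list here is a made-up example:&lt;/p&gt;

```python
# Hypothetical gazetteer; real NER systems learn entities from context
GAZETTEER = {
    "dar-es-salaam": "LOC",
    "michael": "PERSON",
    "neurotech": "ORG",
}

def toy_ner(text):
    """Return (token, label) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.replace(",", " ").split():
        label = GAZETTEER.get(token.lower())
        if label:
            entities.append((token, label))
    return entities

print(toy_ner("Michael flew to Dar-es-salaam"))
# → [('Michael', 'PERSON'), ('Dar-es-salaam', 'LOC')]
```

&lt;p&gt;A lookup like this fails on unseen names and ambiguous words ("Amazon" the company vs. the river), which is exactly why statistical models that read the surrounding context are the standard approach.&lt;/p&gt;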

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WWnjvPgl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/ner.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WWnjvPgl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/ner.jpg" alt="https://blog.neurotech.africa/content/images/2022/07/ner.jpg" width="880" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: shaip.com&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We will use the spaCy library for this task. We need to install it and download a pre-trained English model to help us achieve our task faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; spacy

&lt;span class="c"&gt;# Then downloading the model&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; spacy download en_core_web_sm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are going to import the spacy library and then load the model we downloaded so we can perform our task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;spacy&lt;/span&gt;

&lt;span class="n"&gt;nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"en_core_web_sm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After loading our model, we can simply input our text and spaCy will give us the named entities present in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"The ISIS has claimed responsibility for a suicide bomb blast in the Tunisian capital earlier this week."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ent"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: ISIS ORG
#         Tunisian NORP
#         earlier this week DATE
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows different entities detected by spaCy with their respective labels.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; If you don't understand the meaning of a label abbreviation in spaCy, you can use &lt;strong&gt;spacy.explain()&lt;/strong&gt; to look it up.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's say you didn't understand the meaning of an abbreviation "ORG"
&lt;/span&gt;
&lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ORG"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output: 'Companies, agencies, institutions, etc.'
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The good news is that it's simple to get started with these techniques nowadays. Large language models like &lt;strong&gt;Google's LaMDA&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://beta.openai.com/examples"&gt;GPT-3&lt;/a&gt;&lt;/strong&gt; are available to aid in NLP tasks. You can easily build helpful natural language processing projects with tools like &lt;strong&gt;&lt;a href="https://spacy.io/"&gt;spaCy&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://huggingface.co/"&gt;Hugging Face&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How is conversational AI impacting the finance industry?</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Tue, 09 Aug 2022 07:21:18 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/how-is-conversational-ai-impacting-the-finance-industry-agh</link>
      <guid>https://dev.to/neurotech_africa/how-is-conversational-ai-impacting-the-finance-industry-agh</guid>
      <description>&lt;p&gt;This article was originally published in the &lt;a href="https://blog.neurotech.africa/how-is-conversational-ai-imapacting-the-finance-industry/"&gt;neurotech Africa&lt;/a&gt; blog.&lt;/p&gt;

&lt;p&gt;The evolution of technology continues to spread across multiple industries, and the finance industry can't be left behind: as one of the most important segments of the economy, it is experiencing immense transformation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About Finance industry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The finance sector is wide, constituting at least 20% of the global economy, and its impact on economic growth is significant.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to the finance and development department of the International Monetary Fund, financial services are the processes by which consumers or businesses acquire financial goods. For example, a payment system provider offers a financial service when it accepts and transfers funds between payers and recipients. This includes accounts settled through credit and debit cards, checks, and electronic funds transfers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In developing countries, fin-tech firms are gaining prominence, aided by the rise of digital public goods and currencies. Migrating to online and mobile services will remain a priority for financial firms on the way to a cashless economy, and financial services companies such as banks, tax and accounting services, and insurers will need to compete with emergent financial firms. Building services in-house can be the best option for large companies, but not for every company or every solution; for small to medium microfinance institutions, the best way to migrate is by outsourcing to companies with the talent to build solutions that meet the demands of the digital economy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conversational AI use cases in the Finance industry:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Conversational AI is one of the essential boosts in the finance sector, from sales and marketing to customer service. Conversational AI solutions allow customer service to be managed both quickly and efficiently; the key advantage of this technology is that it acts as a listening channel and gives you a better understanding of your customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KQVg9R6N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KQVg9R6N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit01.png" alt="https://blog.neurotech.africa/content/images/2022/08/benefit01.png" width="828" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a collective way of understanding which products are performing better and how customers view your services: what they like, what they don't, their feedback, and their suggestions. All of these pieces of information have the potential to improve your business by personalizing services, and recommendations of new services or products will help customers get better service based on what they already use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IQDY0WGH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IQDY0WGH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit0.png" alt="https://blog.neurotech.africa/content/images/2022/08/benefit0.png" width="853" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, conversational AI solutions help businesses to reduce operational costs by improving the efficiency of their service, minimizing human error, and resolving customer queries quicker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2VwkrEI6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2VwkrEI6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/08/benefit02.png" alt="https://blog.neurotech.africa/content/images/2022/08/benefit02.png" width="877" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of conversational AI solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the use cases of conversational AI in the finance industry?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage payments and transactions:- On a regular basis, people have to clear bills, pay businesses, shop online, or perform other kinds of online transactions. A conversational AI can help the user make and track these payments. Clearing payments can often be urgent and time-bound, and in such cases switching platforms to complete transactions can be inconvenient. But with an omnichannel conversational AI, your customers can make payments right where they are and avoid any delays!&lt;/li&gt;
&lt;li&gt;Lead generation:- Conversational AI solutions have no match when it comes to interaction. They can interact with customers for the first time and understand their needs and the sentiment behind the conversation. This very human interaction can help banks acquire new customers and collect their details, which are then handed to the sales team to take the conversation forward.&lt;/li&gt;
&lt;li&gt;Resolve common and repetitive inquiries:- Some repetitive activities are really boring; there are questions that most of your users ask frequently, such as "How do I restore an unsuccessful transaction?", "What are the steps to follow to get a loan?", or "What is the status of my loan application?". Instead of customers going through a long list of frequently asked questions, a conversational AI solution can handle these with an instant reply.&lt;/li&gt;
&lt;li&gt;Easy document collection and sharing:- Assume your customer wants to apply for a new loan but keeps getting sent back from the bank each time because of new inconsistencies in verification. Very annoying, right? Nobody is happy in this situation, neither your customer nor you, yet it is a pretty common scene in a bank, mostly because of a lack of knowledge and awareness on the customer's side. Form filling, document collection, and verification are therefore common conversational AI use cases in banking and insurance.&lt;/li&gt;
&lt;li&gt;Locate the nearest service providers:- This may include ATMs, agents, and branches. Assume you're new in a city and need to find a certain bank branch or an ATM: instead of asking multiple people, a conversational AI solution with the geolocation of all your outlets can make it easier for your customers to navigate to the nearest service provider.&lt;/li&gt;
&lt;li&gt;Feedback collection:- Customers are happy to give feedback and reviews when their hard-earned money and other services are taken care of by the bank or insurance company. Instead of using long survey forms, banks can now integrate chatbots on their websites and apps to collect this feedback and these reviews.&lt;/li&gt;
&lt;li&gt;Handling suspicious activities:- Security and data privacy concern every business, but for banks and financial organizations their reputation relies on them. Conversational AI solutions can effectively monitor and recognize the warning signs of fraudulent activity and issue alerts directly to the customer and the bank.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conversational AI solution by Neurotech&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.neurotech.africa/"&gt;Neurotech&lt;/a&gt; we are an AI company that builds &lt;a href="https://www.neurotech.africa/#services"&gt;solutions&lt;/a&gt; for businesses currently we do develop conversational AI for business needs which are controlled by our internal engine goes by the name &lt;a href="https://sarufi.io/#_"&gt;Sarufi&lt;/a&gt;. We offer custom solutions to fit various business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our conversational AI solutions can provide seamless customer support across multiple platforms, enabling you to offer a more personalized, contextual service to customers; you can explore more &lt;a href="https://blog.neurotech.africa/how-can-neurotech-transform-your-business-with-conversational-ai/"&gt;here&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt; Our solutions are developed to understand the contextual meaning of an interaction or conversation with the targeted audience, and our custom chatbots can be deployed on social media platforms like WhatsApp, Facebook, Instagram, and Telegram, depending on what our customers need.&lt;/p&gt;

&lt;p&gt;Currently, our solutions work in two languages only, Swahili and English. They can help your business with customer support, locating nearby service providers, and saving on labor costs: you can pay a smaller support team fair wages without being stretched to support a large staff, while increasing revenue and building opportunities with every customer interaction 😊.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Businesses should focus on business needs, not on technology; the aim, after all, is to earn more and make sure things are moving in the right direction. Technology is advancing at a rapid pace, which can be confusing. Executives and business professionals may find that their decisions lag behind rapidly growing technologies like conversational AI, blockchain, and data analysis. This misunderstanding may lead to consuming non-actionable insights into your business operations, which can be stressful and overwhelming for your customers.&lt;/p&gt;

&lt;p&gt;To avoid that mistake, and to avoid overloading customers with unnecessary information, executives should get close to technology experts to understand clearly what can be solved based on their needs, using only actionable insights. This will help avoid unnecessary costs and false expectations about something that can't work for your business.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.neurotech.africa/#contact"&gt;Get in touch&lt;/a&gt; with Neurotech’s team to discover how you can benefit from our conversational solutions to boost your business, we do consultations on best practices for using data insights to address your business needs.&lt;/p&gt;

&lt;p&gt;Identify the needs of your business and build solutions for them. Don't implement something simply because ABC company has implemented it; do something with real potential for your business's growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kA9NRsI3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3rilv2ydy6hsgnk0rn6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kA9NRsI3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3rilv2ydy6hsgnk0rn6.jpg" alt="Image description" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>The cause of a decision in Swahili social media sentiments</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Tue, 09 Aug 2022 07:07:00 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/the-cause-of-a-decision-in-swahili-social-media-sentiments-2jhp</link>
      <guid>https://dev.to/neurotech_africa/the-cause-of-a-decision-in-swahili-social-media-sentiments-2jhp</guid>
      <description>&lt;p&gt;This article was originally published in the &lt;a href="https://blog.neurotech.africa/the-cause-of-the-decision-in-swahili-social-media-sentiment/"&gt;neurotech Africa&lt;/a&gt; blog.&lt;/p&gt;

&lt;p&gt;As a data professional one of the best practices is to be accountable for the solutions at hand, by understanding how the model you have built is performing and predicting the results. I came across Swahili social media sentiments and since I'm a Swahili speaker I was curious to understand the cause of decisions in Swahili sentiment analysis using machine learning algorithms.&lt;/p&gt;

&lt;p&gt;In today's article, I will walk you through building a machine learning model for Swahili social media sentiment classification, with the interpretability of each prediction of our final model provided by &lt;a href="https://github.com/marcotcr/lime"&gt;Local Interpretable Model-Agnostic Explanations&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why Should I Trust You?” Explaining the Predictions of Any Classifier ~ Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kiswahili is a lingua franca spoken by up to 150 million people across East Africa. It is an official language in Tanzania, DRC, Kenya, and Uganda. On social media, Swahili speakers tend to express themselves in their own local dialect.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Building Swahili social media sentiment classifier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis relies on multiple word senses and cultural knowledge and can be influenced by age, gender, and socio-economic status. In today's task, I will be using datasets from Twitter originally hosted at &lt;a href="https://zindi.africa/competitions/swahili-social-media-sentiment-analysis-challenge"&gt;Google Natural language processing hack series&lt;/a&gt; by zindi Africa, with the aim of classifying whether a Swahili sentence is of positive, negative, or neutral sentiment.&lt;/p&gt;

&lt;p&gt;The dataset contains three columns which are &lt;code&gt;id&lt;/code&gt; as the unique ID of a unique Swahili tweet, &lt;code&gt;tweets&lt;/code&gt; containing the actual text of the Swahili tweet, and &lt;code&gt;labels&lt;/code&gt; the label of the Swahili tweet, either negative(-1), neutral(0), positive(1) with 2263 observations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WtXDEqFl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sw-head.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WtXDEqFl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sw-head.png" alt="https://blog.neurotech.africa/content/images/2022/07/sw-head.png" width="671" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How about label distribution?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RzGttkrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/class-dist.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RzGttkrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/class-dist.png" alt="https://blog.neurotech.africa/content/images/2022/07/class-dist.png" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of the tweets collected are neutral, which shows that our labels are imbalanced.&lt;/p&gt;
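&lt;p&gt;A quick way to quantify that imbalance is to count the labels. The counts below are illustrative, not the actual dataset's; on the real data you would pass &lt;code&gt;df['labels']&lt;/code&gt; to the counter:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical label values; with the real dataset this would be df['labels']
labels = [0, 0, 0, 1, -1, 0, 1, 0, -1, 0]

counts = Counter(labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"label {label:>2}: {n} tweets ({n / total:.0%})")
```

&lt;p&gt;Knowing the exact class proportions matters later: with a skewed label distribution, accuracy alone can be misleading, so per-class metrics such as precision and recall are a safer guide.&lt;/p&gt;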

&lt;p&gt;Let's work on preprocessing the dataset to make everything ready for building our final machine learning model. This will involve a range of text-cleaning steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;removing non-alphanumeric text.&lt;/li&gt;
&lt;li&gt;removing stopwords.&lt;/li&gt;
&lt;li&gt;converting all tweets into lowercase.&lt;/li&gt;
&lt;li&gt;removing punctuation, links, emojis, and white spaces.&lt;/li&gt;
&lt;li&gt;tokenizing the text into individual words.&lt;/li&gt;
&lt;li&gt;finally, appending all clean tweets to a new column named &lt;code&gt;clean_tweets&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Point to note: &lt;code&gt;nltk&lt;/code&gt; doesn't include Swahili stopwords, so you have to create your own list and apply it to the tweets. I created a &lt;a href="https://github.com/Neurotech-HQ/Cause-of-decision-in-Swahili-sentiments/blob/main/data/Common%20Swahili%20Stop-words.csv"&gt;CSV&lt;/a&gt; file with a couple of Swahili stopwords like na, kwa, kama, lakini, ya, yake, etc., which I will apply here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make things smooth let's just use one function to perform all of the tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_tweets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;'''
        function to clean tweet column, make it ready for transformation and modeling
    '''&lt;/span&gt;
    &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;#convert text to lower-case
&lt;/span&gt;    &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'[‘’“”…,]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# remove punctuation
&lt;/span&gt;    &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'[()]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# remove parenthesis
&lt;/span&gt;    &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[^a-zA-Z]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#remove numbers and keep text/alphabet only
&lt;/span&gt;    &lt;span class="n"&gt;tweet_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;clean_tweets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweet_list&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;swstopwords&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# remove stop words
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_tweets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_tweets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;function to clean the tweet column and make it ready for transformation and modeling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now the tweets are clean and ready for further processing.&lt;/p&gt;
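&lt;p&gt;For intuition, here is a minimal, self-contained sketch of the same cleaning steps. It uses a plain &lt;code&gt;str.split&lt;/code&gt; in place of &lt;code&gt;nltk.word_tokenize&lt;/code&gt;, and the tiny stop-word list and sample tweet are hypothetical stand-ins for the article's &lt;code&gt;swstopwords&lt;/code&gt; and data:&lt;/p&gt;

```python
import re

# Hypothetical Swahili stop words; the article uses a larger list (swstopwords)
SW_STOPWORDS = {"na", "ya", "wa", "kwa", "ni"}

def clean_tweet_sketch(tweet):
    """Mirror the article's cleaning steps, with a plain split as tokenizer."""
    tweet = tweet.lower()                      # convert text to lower-case
    tweet = re.sub(r"[‘’“”…,]", "", tweet)     # remove punctuation
    tweet = re.sub(r"[()]", "", tweet)         # remove parentheses
    tweet = re.sub(r"[^a-zA-Z]", " ", tweet)   # keep alphabetic characters only
    tokens = [w for w in tweet.split() if w not in SW_STOPWORDS]
    return " ".join(tokens)

print(clean_tweet_sketch("Habari ya leo, (2022) na karibu!"))  # habari leo karibu
```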

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jLFHrwEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sw-clean-tweets.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jLFHrwEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/sw-clean-tweets.png" alt="https://blog.neurotech.africa/content/images/2022/07/sw-clean-tweets.png" width="880" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;datasets after applying the clean_tweet function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to analyze the Swahili tweets by looking at polarity and subjectivity. But wait! What do polarity and subjectivity mean?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Polarity is the expression that determines the sentimental aspect of an opinion. In textual data, the result of sentiment analysis can be determined for each entity in a sentence, paragraph, or document. The sentiment polarity can be positive, negative, or neutral, and is usually defined as a float that ranges from -1 (entirely negative) to 1 (entirely positive).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sentiment polarity for an element defines the orientation of the expressed sentiment, i.e., it determines if the text expresses the positive, negative or neutral sentiment of the user about the entity in consideration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Subjectivity is the measure of how factual the text is, ranging from 0 (pure fact) to 1 (pure opinion).&lt;/p&gt;
&lt;/blockquote&gt;
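&lt;p&gt;As a quick sketch of how these polarity scores are usually read back into labels, here is a small hypothetical helper; treating a score of exactly 0.0 as neutral is an illustrative choice, not something the article prescribes:&lt;/p&gt;

```python
def polarity_label(score):
    """Map a polarity float in [-1, 1] to a sentiment label.

    Treating exactly 0.0 as neutral is an illustrative assumption;
    real pipelines often use a small band around zero instead.
    """
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity_label(0.8), polarity_label(-0.3), polarity_label(0.0))
# positive negative neutral
```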

&lt;p&gt;I will be using &lt;code&gt;textblob&lt;/code&gt; to analyze the tweets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;polarity_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;'''
        This function takes in a text data and returns the polarity of the text
        Polarity is float which lies in the range of [-1,0,1] where 1 means positive statement, 0 means positive statement
        and -1 means a negative statement
    '''&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;polarity&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subjectivity_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;'''
      This function takes in text data and returns the subjectivity of the text.
      Subjective sentences generally refer to personal opinion,
      emotion or judgment whereas objective refers to factual information.
      Subjectivity is also a float which lies in the range of [0,1].
  '''&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subjectivity&lt;/span&gt;

  &lt;span class="c1"&gt;#apply above functions to the data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'polarity_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;polarity_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'subjectivity_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subjectivity_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;polarity score&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now let's aggregate the overall polarity and subjectivity of the entire dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The overall polarity of the tweet data is 0.01&lt;/p&gt;

&lt;p&gt;The overall subjectivity of the tweet data is 0.03&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An overall polarity close to zero indicates that the tweets are fairly neutral, and the very low subjectivity suggests they are largely factual.&lt;/p&gt;
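&lt;p&gt;The overall figures are simply the means of the two score columns. A sketch with a hypothetical toy DataFrame standing in for the real tweet data:&lt;/p&gt;

```python
import pandas as pd

# Toy scores standing in for the article's polarity/subjectivity columns;
# the real values come from applying the TextBlob helpers to clean_tweets
df = pd.DataFrame({
    "polarity_score": [0.1, -0.05, 0.0, -0.02],
    "subjectivity_score": [0.0, 0.1, 0.02, 0.0],
})

overall_polarity = df["polarity_score"].mean()
overall_subjectivity = df["subjectivity_score"].mean()
print(f"The overall polarity of the tweet data is {overall_polarity:.2f}")
print(f"The overall subjectivity of the tweet data is {overall_subjectivity:.2f}")
```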

&lt;p&gt;Let's visualize the polarity and subjectivity distributions of each class independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# visualization
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;subplot_titles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Polarity Score Distribution-Negative"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Subjectivity Score Distribution-Negative"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                   &lt;span class="s"&gt;"Polarity Score Distribution-Neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Subjectivity Score Distribution-Neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                   &lt;span class="s"&gt;'Polarity Score Distribution-Positive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'Subjectivity Score Distribution-Positive'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;x_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Frequency'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'polarity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'subjectivity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'polarity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'subjectivity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'polarity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'subjectivity_score'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"colab"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now here we go,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wEAJUAFT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wEAJUAFT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot.png" alt="https://blog.neurotech.africa/content/images/2022/07/newplot.png" width="880" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;distribution of each class on polarity and subjectivity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In terms of subjectivity, all three classes look similar and no significant difference can be stated, but the polarity of the negative class differs from the positive and neutral classes in terms of skewness.&lt;/p&gt;
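&lt;p&gt;The skewness observation can also be checked numerically with pandas rather than by eye. A sketch with hypothetical per-class scores, using the article's label encoding (-1 negative, 0 neutral, 1 positive):&lt;/p&gt;

```python
import pandas as pd

# Toy polarity scores per class; Labels follows the article's encoding
# (-1 negative, 0 neutral, 1 positive)
df = pd.DataFrame({
    "Labels":         [-1, -1, -1, 0, 0, 0, 1, 1, 1],
    "polarity_score": [-0.9, -0.1, -0.1, 0.0, 0.1, -0.1, 0.1, 0.1, 0.9],
})

# Sample skewness of the polarity distribution within each class
skew_by_class = df.groupby("Labels")["polarity_score"].skew()
print(skew_by_class)
```

A negative value for the -1 class and a positive value for the 1 class would confirm the asymmetry seen in the histograms.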

&lt;p&gt;Let's try to understand the content by visualizing the most frequently used words across all classes, and then look at each class independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;word_freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;word_freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'index'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xaxis_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Word"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yaxis_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Top 20 most Frequent words in across entire tweet data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"colab"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kSKlqh3U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot--1-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kSKlqh3U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot--1-.png" alt="https://blog.neurotech.africa/content/images/2022/07/newplot--1-.png" width="880" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;habari&lt;/code&gt;, &lt;code&gt;leo&lt;/code&gt;, &lt;code&gt;siku&lt;/code&gt;, and &lt;code&gt;namba&lt;/code&gt; are the most frequent words in the overall tweet content.&lt;br&gt;
&lt;/p&gt;
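&lt;p&gt;As an aside, the &lt;code&gt;str.split(expand=True).stack().value_counts()&lt;/code&gt; chain used above is equivalent to a standard-library &lt;code&gt;collections.Counter&lt;/code&gt;. A sketch with hypothetical cleaned tweets standing in for &lt;code&gt;df['clean_tweets']&lt;/code&gt;:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for df['clean_tweets']
clean_tweets = [
    "habari leo habari",
    "siku njema leo",
    "namba habari",
]

# Equivalent of the str.split(expand=True).stack().value_counts() chain
word_freq = Counter(word for tweet in clean_tweets for word in tweet.split())
print(word_freq.most_common(2))  # [('habari', 3), ('leo', 2)]
```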

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Negative Tweets Word Frequency
&lt;/span&gt;&lt;span class="n"&gt;word_freq_neg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;word_freq_neg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'index'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Neutral Tweets Word Frequency
&lt;/span&gt;&lt;span class="n"&gt;word_freq_neut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;word_freq_neut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'index'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Positive Tweets Word Frequency
&lt;/span&gt;&lt;span class="n"&gt;word_freq_pos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'clean_tweets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;word_freq_pos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'index'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;subplot_titles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Top 20 most frequent words-Negative"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Top 20 most frequent words-Neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Top 20 most frequent words-Positive"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;x_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Word"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neut&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_neut&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;go&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Word'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;word_freq_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Count'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"colab"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--svlRm5r6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot--2-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--svlRm5r6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/newplot--2-.png" alt="https://blog.neurotech.africa/content/images/2022/07/newplot--2-.png" width="880" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Across the negative-class tweets, the most frequently used words are &lt;code&gt;watu&lt;/code&gt;, &lt;code&gt;leo&lt;/code&gt;, and &lt;code&gt;siku&lt;/code&gt;; across the neutral-class tweets, they are &lt;code&gt;habari&lt;/code&gt;, &lt;code&gt;kazi&lt;/code&gt;, and &lt;code&gt;mtu&lt;/code&gt;; and across the positive-class tweets, they are &lt;code&gt;habari&lt;/code&gt;, &lt;code&gt;leo&lt;/code&gt;, and &lt;code&gt;asante&lt;/code&gt;.&lt;/p&gt;
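&lt;p&gt;The &lt;code&gt;word_freq_neg&lt;/code&gt;, &lt;code&gt;word_freq_neut&lt;/code&gt;, and &lt;code&gt;word_freq_pos&lt;/code&gt; frames plotted above can be produced with a simple token count. Here is a minimal sketch; the &lt;code&gt;word_freq&lt;/code&gt; helper and the sample tweets are illustrative rather than the article's exact code, and only the &lt;code&gt;Word&lt;/code&gt;/&lt;code&gt;Count&lt;/code&gt; column names are taken from the plotting code:&lt;/p&gt;

```python
from collections import Counter

import pandas as pd

def word_freq(tweets):
    """Count word occurrences across cleaned tweets, most frequent first."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return pd.DataFrame(counts.most_common(), columns=["Word", "Count"])

# toy tweets standing in for the negative-class subset
word_freq_neg = word_freq(["watu leo siku", "watu leo", "watu"])
print(word_freq_neg)
```

&lt;p&gt;Sorting by count up front is what lets the plotting code simply slice &lt;code&gt;.iloc[0:20]&lt;/code&gt; to get the top-20 words per class.&lt;/p&gt;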

&lt;p&gt;Let's prepare our final dataset for modeling by splitting it into two groups, train and test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# data split
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"clean_tweets"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Labels"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;

&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I make our data pipeline ready for training Swahili sentiment models by defining &lt;code&gt;TfidfVectorizer&lt;/code&gt; as the vectorizer and &lt;code&gt;LogisticRegression&lt;/code&gt; as the algorithm for building our model. Using the initialized pipeline, I train the classifier on the training set of tweets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# instantiating model pipeline
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# training model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great! We have trained our classifier for Swahili social media sentiments, and now it's time to evaluate the model's performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Classification Report"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;target_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Negative"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"Positive"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the classification report, the performance is not very good: our model has about 60% accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qZY5TYTc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/classification-report.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qZY5TYTc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/classification-report.png" alt="https://blog.neurotech.africa/content/images/2022/07/classification-report.png" width="701" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Results Interpretability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It's time to understand what drives our classifier's decisions. We bring in LIME to help interpret each prediction of our model; for clarity, let's filter out three kinds of predictions (negative, neutral, and positive).&lt;/p&gt;

&lt;p&gt;The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.&lt;/p&gt;

&lt;p&gt;I predict probabilities with the LogisticRegression classifier instead of hard 0 or 1 labels, simply because LIME requires a model that produces a probability score for each class in order to explain the cause of a decision.&lt;/p&gt;
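&lt;p&gt;As a quick sketch of what that means in code, a fitted pipeline like ours exposes &lt;code&gt;predict_proba&lt;/code&gt;, which returns one probability per class; these are the scores LIME perturbs to build its explanation. The toy corpus and labels below are made up for illustration:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny stand-in corpus; the real model is trained on the tweet splits above
texts = ["habari njema asante", "polisi watu mbaya", "walimu kazi shule",
         "asante sana habari", "mbaya sana polisi", "kazi walimu leo"]
labels = [2, 0, 1, 2, 0, 1]  # 0 = Negative, 1 = Neutral, 2 = Positive

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# LIME consumes these per-class probabilities, not hard 0/1 predictions
probs = model.predict_proba(["asante habari"])
print(probs.shape)  # (1, 3): one row, one column per class
```

&lt;p&gt;With a setup like this, LIME's &lt;code&gt;LimeTextExplainer.explain_instance&lt;/code&gt; is handed &lt;code&gt;model.predict_proba&lt;/code&gt; as its classifier function.&lt;/p&gt;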

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JQHuQhFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime-00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JQHuQhFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime-00.png" alt="https://blog.neurotech.africa/content/images/2022/07/temp-lime-00.png" width="880" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we go: the observation above shows that the probability of the positive class is higher (0.47) than the other classes, and the decision is driven by the words &lt;code&gt;serikali&lt;/code&gt;, &lt;code&gt;mwisho&lt;/code&gt;, and &lt;code&gt;vyema&lt;/code&gt;, which echoes our earlier visualization of the most frequent words in the positive class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a08D6MGg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a08D6MGg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime-01.png" alt="https://blog.neurotech.africa/content/images/2022/07/temp-lime-01.png" width="880" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above observation shows that the probability of the neutral class is higher (0.72) than the other two classes, and the decision is driven by the words &lt;code&gt;walimu&lt;/code&gt;, &lt;code&gt;walikuwa&lt;/code&gt;, and &lt;code&gt;mwanzoni&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dpx2p333--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime02_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dpx2p333--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/temp-lime02_.png" alt="https://blog.neurotech.africa/content/images/2022/07/temp-lime02_.png" width="880" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above observation shows that all three classes carry comparable weight, but due to the high weighting of the word &lt;code&gt;polisi&lt;/code&gt; the tweet was predicted as negative.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can companies benefit from customer sentiment analysis?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis helps businesses understand their customers and get an overview of what's good and what's lacking. This can improve marketing and operations strategy based on customer sentiments.&lt;/p&gt;

&lt;p&gt;Deep insights from sentiment can capture what specifically people don’t like about a service, product, or policy; after the business has taken steps to fix the issue or improve a process, it can also track how that has improved customer satisfaction. Insights from customer sentiments can also differentiate between feedback that is merely frequent and feedback that actually influences satisfaction scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Understanding the cause of the decision of individual predictions from classifiers is important for data professionals. Having explanations lets you make an informed decision about how much you trust the prediction or the model as a whole, and provides insights that can be used to improve the model.&lt;/p&gt;

&lt;p&gt;The complete code used in this article can be found on the GitHub &lt;a href="https://github.com/Neurotech-HQ/Cause-of-decision-in-Swahili-sentiments"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yjfnx6Ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wml0uak86qdh133yfty2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yjfnx6Ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wml0uak86qdh133yfty2.jpg" alt="Image description" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How conversational AI is transforming the Insurance industry</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Tue, 26 Jul 2022 05:51:00 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/how-conversational-ai-is-transforming-the-insurance-industry-22kd</link>
      <guid>https://dev.to/neurotech_africa/how-conversational-ai-is-transforming-the-insurance-industry-22kd</guid>
<description>&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/how-can-conversational-ai-impact-the-insurance-industry/" rel="noopener noreferrer"&gt;Neurotech Africa blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Technology continues to evolve every day from different angles. In this blog post I will explain how leveraging the power of conversational AI can make a difference in the insurance industry.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About the Insurance sector:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The insurance sector is made up of companies that offer risk management in the form of insurance contracts. The basic concept of insurance is that one party, the insurer, will guarantee payment for an uncertain future event. Meanwhile, another party, the insured or the policyholder, pays a smaller premium to the insurer in exchange for that protection on that uncertain future occurrence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to the 2020 Tanzania insurance report, “the Tanzania insurance sector is growing steadily, with 30 insurance companies and 112 insurance brokers currently active in the market (2014 TIRA data)”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These statistics show that the contribution of insurance to the national gross domestic product remains very limited, leaving plenty of room for further growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Digital transformation in the Insurance industry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Digital_transformation" rel="noopener noreferrer"&gt;Digital transformation&lt;/a&gt; varies across multiple industries, but the worth truth is that 70% of digital transformation fails in the sense that they don’t meet their objectives, this is based on studies from International Data Group. The fun fact is that a company or an industry can’t be fully digital transformed at once but better be staged. May begin with system operational to employees to be aware of what transformation is capable of and how their contribution can improve the whole process of adopting digital transformation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.neurotech.africa%2Fcontent%2Fimages%2F2022%2F07%2Fdigital-insurance.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.neurotech.africa%2Fcontent%2Fimages%2F2022%2F07%2Fdigital-insurance.svg" alt="https://blog.neurotech.africa/content/images/2022/07/digital-insurance.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source: tibco.com&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The insurance industry is among the oldest financial businesses in the world. The industry tends to stay traditional and is slow to change; however, new technology trends have been impacting the insurance marketplace, creating extreme competition. The most telling experience was during Covid-19, when insurance companies found themselves in the middle of the storm: operations had to run remotely while insurers were fielding calls about changing coverage, answering questions about business interruption policies, and continuing to pay claims for life, health, and disability insurance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The need for accelerating digital transformation in the insurance industry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Digital transformation will help the insurance industry solve some of its challenges and improve its business strategies. Let me highlight some of the potential benefits of accelerating digital transformation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer experience: spend enough time getting to know the customer and figuring out what they want and how to respond to it. This part of the process should occur throughout the customer lifecycle, from prospecting until the moment of withdrawal.&lt;/li&gt;
&lt;li&gt;Value generation through data: data-driven decision-making is essential for insurance companies. Understanding how you want to use data to create value is important, and this understanding should run from executives to frontline employees. By doing that, it becomes possible to determine the various uses of data.&lt;/li&gt;
&lt;li&gt;Ecosystem development: redesigning insurance strategies involves tasks like measuring, controlling, and assessing risks, all of which are being transformed by the digital environment, and the leading insurance market players are aware of this. Understanding the ecosystem means knowing how strategies can be adapted to regions or branches depending on the scenario, rather than reusing them just because they perform well in the city or elsewhere.&lt;/li&gt;
&lt;li&gt;Margin management: the digital transformation of the insurance business can do one of two things: either reduce costs or increase them. Either way, it hinges on making the right decisions and then adopting new technologies to create business models based on those decisions.&lt;/li&gt;
&lt;li&gt;Multichannel strategy: using several channels means that your brand utilizes two or more marketing methods to share your content and messaging across several platforms. In simple terms, a multichannel strategy makes it easier for consumers to complete their transactions and interact with your brand through the most suitable platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conversational AI use-case in the insurance industry&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manage internal operations: by automating and speeding up repetitive tasks, employees can focus on more complicated work and on developing their skills to improve operations.&lt;/li&gt;
&lt;li&gt;Customer awareness and education: conversational AI can bring customers closer by educating them on how the process works, its benefits, and available offers, and can compare and suggest the optimal policy, from multiple carriers, based on the customer’s profile and inputs. It also drives engagement and interaction with customers, whether through websites or social media platforms.&lt;/li&gt;
&lt;li&gt;Risk evaluation: leveraging conversational AI can improve how overwhelming volumes of data are handled to assess risks with high accuracy, yield better insights, customize plans, and support better decisions.&lt;/li&gt;
&lt;li&gt;Claims management: this involves claim processing and payment assistance. Conversational AI can be trained to address your customers’ insurance claims and follow up with them on existing ones, and it can also automate payment processes according to customer preferences.&lt;/li&gt;
&lt;li&gt;Customer feedback and reviews: most customers tend to share feedback immediately after service, and rarely afterwards. Most studies suggest that customers are more likely to respond over live chat than email, and that they feel more confident contacting a business through messages rather than calls.&lt;/li&gt;
&lt;li&gt;Fraud prevention: insurance firms must take care of customer data privacy and security. Conversational AI is efficient at monitoring and detecting warning signs of fraudulent activity and can alert both the insurer and the customer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About Neurotech’s conversational AI solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.neurotech.africa/" rel="noopener noreferrer"&gt;Neurotech&lt;/a&gt; we are an AI company that builds &lt;a href="https://www.neurotech.africa/#services" rel="noopener noreferrer"&gt;solutions&lt;/a&gt; for businesses currently we do develop conversational AI for business needs which are controlled by our internal engine goes by the name &lt;a href="https://sarufi.io/#_" rel="noopener noreferrer"&gt;Sarufi&lt;/a&gt;. We offer custom solutions to fit various business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our conversational AI solutions can provide seamless customer support across multiple platforms, enabling you to offer a more personalized, contextual service to customers; you can explore more &lt;a href="https://blog.neurotech.africa/how-can-neurotech-transform-your-business-with-conversational-ai/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Our solutions are developed to understand the contextual meaning of the interaction or conversation with targeted audiences, and our custom chatbots can be deployed on social media platforms like WhatsApp, Facebook, Instagram, and Telegram, depending on what our customers need.&lt;/p&gt;

&lt;p&gt;Currently, our solutions work in two languages only, Swahili and English. They can help your business with customer support, save on labor costs by paying a smaller support team fair wages rather than stretching to maintain a large staff, increase revenue, and build opportunities with every customer interaction 😊.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Is there a need to customize the customer experience in the insurance industry? Absolutely. Innovation in the insurance industry with conversational AI, transforming the entire cycle of processes such as claims, can help improve awareness and educate a large population at lower cost and effort. Conversational AI will ensure faster settlements and optimized customer experiences, leading to improved risk evaluation with technologies like machine learning and artificial intelligence in making appropriate decisions, ensuring personalized and customized customer service and experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.neurotech.africa%2Fcontent%2Fimages%2F2022%2F07%2Fthankyou-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.neurotech.africa%2Fcontent%2Fimages%2F2022%2F07%2Fthankyou-1.jpg" alt="https://blog.neurotech.africa/content/images/2022/07/thankyou-1.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How can neurotech Africa transform your business with Conversational AI</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Sun, 17 Jul 2022 20:09:33 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/how-can-neurotech-africa-transform-your-business-with-conversational-ai-13p4</link>
      <guid>https://dev.to/neurotech_africa/how-can-neurotech-africa-transform-your-business-with-conversational-ai-13p4</guid>
      <description>&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/how-can-neurotech-transform-your-business-with-conversational-ai/"&gt;neurotech&lt;/a&gt; blog post&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does Neurotech use Conversational AI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We build custom conversational solutions to help businesses improve their customer experiences and services with our internal tool, which goes by the name &lt;a href="https://sarufi.io/"&gt;Sarufi&lt;/a&gt;. The best thing about our solution is that we use Natural Language Processing to provide a more conversational approach to customer service and a deeper understanding of the context of what people say, depending on the business's industry.&lt;/p&gt;

&lt;p&gt;Our approaches differ per use case, depending on the customer's specifications. With our conversational AI solutions, you can get incredibly intelligent control of your business's market without needing to invest the time, money, and resources to build the solutions with an internal team.&lt;/p&gt;

&lt;p&gt;Our solutions can be deployed across a range of platforms, starting with your website, if you have one, and social platforms like WhatsApp, Telegram, Instagram, and Facebook Messenger; this depends on where the client prefers to host their business. At &lt;a href="https://www.neurotech.africa/#"&gt;Neurotech&lt;/a&gt;, we offer full support for our solutions from our talented team to make sure that our clients' businesses benefit from what we offer. This helps ensure you’re getting the most value out of a conversational AI solution for your business.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can Neurotech transform your business?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our experts and Sarufi engine provide fast and easy deployment of solutions. With our solution, we transform everything into a custom experience that will help your business to save costs and increase revenue, understand what is missing from your product's service, and keep in touch with your customers.&lt;/p&gt;

&lt;p&gt;Through user interaction with your business, you will get a better sense of what works and what is not working, without extreme effort.&lt;/p&gt;

&lt;p&gt;This is a more comfortable transformation simply because the service is available 24/7 without paying any additional costs to employees, and customers are able to start a conversation in their natural language. This can be achieved through a couple of steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our team of experts works with the client to determine the requirements and the most efficient way the conversational experience will be integrated into the business.&lt;/li&gt;
&lt;li&gt;Then, we build the solution, training models to act upon the inputs provided by consumers, with continuous review of the results.&lt;/li&gt;
&lt;li&gt;Finally, we deploy the solution and offer support and consulting services to our clients.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the benefits of conversational AI solutions?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personalize customer experience:-&lt;/strong&gt; Businesses can provide a more personalized experience to both existing customers and potential clients by using conversational AI (such as chatbots) to create a deeper level of interactivity and familiarity with the brand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improving marketing experience:-&lt;/strong&gt; Conversational AI helps improve marketing by creating a better experience for each customer based on their needs and desires. A more convenient mode of communication, combining various functionalities, makes it easy for customers to engage across multiple channels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective:-&lt;/strong&gt; Depending on their learnings and training techniques, they reduce the requirement of human resources to answer customer queries. They are also proficient in handling multiple chats simultaneously with accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Operations beyond borders:-&lt;/strong&gt; Expand business outreach to the potential population.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-evolving platforms from experience:-&lt;/strong&gt; Conversational AI learns from experience. The more it interacts with human beings, the more quickly its intelligence improves. It also learns from any existing data, such as customer databases and previous customer interactions. Clever conversational interfaces learn from their mistakes just as human beings do: they take note of what questions customers ask and what kinds of responses seem informative, and they try new approaches until they find one that is both effective and efficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insights driven:-&lt;/strong&gt; Conversational solutions make effective use of analytics, which essentially helps in gleaning data and information from outside the organization. A mix of both internal and external data can be a great advantage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round-the-clock support:-&lt;/strong&gt; Conversational AI can provide real-time customer assistance. This means that businesses can address customer queries and complaints as they occur, significantly improving customer satisfaction. Provide 24/7 client support, so existing and potential customers can try and solve their problems after work hours and on weekends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast-paced communication:-&lt;/strong&gt; Conversational AI can help businesses provide quicker and more efficient customer service, because chatbots can handle a large number of customer inquiries simultaneously. They can also route customers to the right agent, which reduces wait times, and they work 24/7/365, a huge advantage for businesses. Properly programmed chatbots are always polite, and their behavior does not depend on mood.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conversational AI solutions are not perceived as a replacement for humans but as human augmentation, making the business easier to reach both internally and externally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.neurotech.africa/#contact"&gt;Get in touch&lt;/a&gt; with Neurotech’s team to discover how you can benefit from our conversational solutions to boost your business, the time is now to leverage benefits from Artificial intelligence Technology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6LodlsLa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/thankyou.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6LodlsLa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/07/thankyou.jpg" alt="https://blog.neurotech.africa/content/images/2022/07/thankyou.jpg" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Potentials of conversational AI for businesses</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Thu, 14 Jul 2022 20:15:30 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/potentials-of-conversational-ai-for-businesses-5c79</link>
      <guid>https://dev.to/neurotech_africa/potentials-of-conversational-ai-for-businesses-5c79</guid>
      <description>&lt;p&gt;This article was originally published on the &lt;a href="https://blog.neurotech.africa/potentials-about-conversational-ai-for-businesses/"&gt;Neurotech&lt;/a&gt; blog post.&lt;/p&gt;

&lt;p&gt;When speaking about the evolution of technology, you can't skip artificial intelligence, simply because we interact with it in our day-to-day activities, often without even knowing it. If you own a smartphone, laptop, smartwatch, or desktop, you already interact with artificial intelligence or use it to accomplish tasks: Google Search, your camera, meeting platforms like Zoom, Google Sheets, Microsoft Cortana, Apple Siri, Google Assistant, Google Maps, Apple Maps, Google Lens, social media feeds, and so on. The scope of artificial intelligence has expanded and evolved over time, so it is time to think about how you can leverage this technology to improve your business's revenue. In this article I will highlight the potential of conversational artificial intelligence for businesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About conversational AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Conversational AI involves three concepts: artificial intelligence, human language, and automation. We can define it as the type of artificial intelligence that enables consumers to interact with computer applications the way they would with other humans. Conversational AI has primarily taken the form of advanced chatbots that, in contrast with conventional chatbots, combine natural language processing with traditional software, voice assistants, or an interactive voice recognition system to help customers through a spoken or typed conversation interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h6tfXXjE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.appmaster.io/api/_files/ooRtJGmcZqEaSfTL468d8U/download/" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h6tfXXjE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.appmaster.io/api/_files/ooRtJGmcZqEaSfTL468d8U/download/" alt="conversational chatbot" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does conversational AI work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Conversational AI involves three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural language processing&lt;/li&gt;
&lt;li&gt;Algorithm Training and Machine Learning&lt;/li&gt;
&lt;li&gt;Sentiment Analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through the conversational interface, a user provides input either by voice or by text. Text-based input requires &lt;a href="https://en.wikipedia.org/wiki/Natural-language_understanding"&gt;NLU&lt;/a&gt; to understand the contextual meaning of the input, while speech-based input requires &lt;a href="https://usabilitygeek.com/automatic-speech-recognition-asr-software-an-introduction/"&gt;ASR&lt;/a&gt; to parse the audio into language tokens that can be analyzed. The system then returns the best response to the user, depending on how it has been trained and programmed to perform its tasks.&lt;/p&gt;
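&lt;p&gt;The flow above can be sketched as a toy, keyword-based pipeline. This is purely illustrative (the intents, keywords, and responses here are invented for the example); a real conversational AI relies on trained NLU models rather than keyword rules:&lt;/p&gt;

```python
# Toy conversational pipeline: user input -> crude "understanding" -> response.
# Illustrative only; production NLU uses trained language models.

INTENTS = {
    "balance": ["balance", "account"],
    "hours": ["open", "hours", "close"],
}

RESPONSES = {
    "balance": "Your balance is available in the app.",
    "hours": "We are open 9am-5pm, Monday to Friday.",
    "fallback": "Sorry, I did not understand. Could you rephrase?",
}

def understand(text):
    """Stand-in for NLU: map keywords in the text to an intent."""
    tokens = [t.strip("?!.,") for t in text.lower().split()]
    for intent, keywords in INTENTS.items():
        if any(k in tokens for k in keywords):
            return intent
    return "fallback"

def respond(text):
    return RESPONSES[understand(text)]

print(respond("What is my account balance?"))
```

&lt;p&gt;A speech front end would simply add an ASR step that turns audio into the text fed to the same pipeline.&lt;/p&gt;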

&lt;h3&gt;
  
  
  &lt;strong&gt;Use cases of conversational AI:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer service:&lt;/strong&gt; conversational AI has made a strong impact in this industry by automating customer support activities to improve access and reduce costs: travel booking, FAQs, helping customers pay bills, and handling complaints. Conversational AI is also well suited to running surveys with your customers, to understand how they feel about what you provide or about a new product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retail industry:&lt;/strong&gt; from lead generation, lead qualification, and lead nurturing to 24/7 concierge service, faster order fulfillment, and amplified marketing messages, much of this can be done with conversational AI. In retail, things can go further with product recommendations and multichannel integrations that follow your customers to the platforms they love to use, like WhatsApp, Facebook, Instagram, and TikTok. Above all, it means being able to serve your customers any time they want service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance and Banking industry:&lt;/strong&gt; Conversational AI has greatly helped banking and financial services reduce operating costs, automate functions, and improve the overall customer experience. It can access and analyze users’ spending patterns or bank accounts to help them decide how to spend their money, resolve customer queries by automating repetitive processes that typically take a human agent much longer, and, through an AI bot, help with checking balances, detecting fraudulent transactions, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health industry:&lt;/strong&gt; Conversational AI is being used across the health industry to automate the scheduling of hospital appointments, helping patients manage their appointments and paperwork. In Cognitive Behavioral Therapy, conversational AI creates an immersive way to manage anxiety and other mental health issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sales and Marketing industry:&lt;/strong&gt; most consumers prefer self-service technology for shopping experiences over human sales agents. Conversational AI generates and nurtures leads, optimizes the sales cycle, and gets and updates data instantly while maintaining accuracy through conversational automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are more use cases of conversational AI than the few I just mentioned; you can explore more of them &lt;a href="https://www.chatcompose.com/conversationalai.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the impacts of conversational AI on your business?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Customer retention&lt;/li&gt;
&lt;li&gt;Customer personalization&lt;/li&gt;
&lt;li&gt;Get customer feedback in a seamless manner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this helps boost revenue and reduce costs through more accurate and timely marketing efforts, while ensuring a seamless and pleasant experience for your customers. It is not enough to have chatbots on your website as a customer support solution; businesses need intelligent chatbots with natural language processing and understanding for the best customer support experience.&lt;/p&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;Why Neurotech’s conversational AI solutions are best for your business&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.neurotech.africa/"&gt;Neurotech&lt;/a&gt; is an AI company that builds &lt;a href="https://www.neurotech.africa/#services"&gt;solutions&lt;/a&gt; for businesses. We currently develop conversational AI for business needs, powered by our internal engine, which goes by the name &lt;a href="https://sarufi.io/#_"&gt;Sarufi&lt;/a&gt;. We offer custom solutions to fit various business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our conversational AI solutions are built to understand the contextual meaning of a conversation with the targeted audience. Our custom chatbots can be deployed on social media platforms like WhatsApp, Facebook, Instagram, and Telegram, depending on what our customers need. Currently, our solutions work in two languages, Swahili and English. They can help your business with customer support, increase revenue, and build opportunities with every customer interaction 😊.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now think of your business: it is not too late to get closer to your customers using conversational AI, by automating FAQs and the repetitive tasks your staff has to chase. The truth is that conversational AI continues to evolve, making itself indispensable to industries such as finance, online marketing, healthcare, real estate, customer support, retail, and more. Don't worry, we have &lt;a href="https://sarufi.io/#_"&gt;Sarufi&lt;/a&gt; for your business needs. If you are interested in a discussion with &lt;a href="https://www.neurotech.africa/#contact"&gt;Neurotech&lt;/a&gt;, don't hesitate to reach out; we consult on what would be best for your business challenges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n8WK959P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://t3.ftcdn.net/jpg/04/48/13/40/240_F_448134055_3ygLHIrGKhm176wZnoRvDaY1iqljzVdZ.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n8WK959P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://t3.ftcdn.net/jpg/04/48/13/40/240_F_448134055_3ygLHIrGKhm176wZnoRvDaY1iqljzVdZ.jpg" alt="thank you" width="390" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>GET STARTED WITH TOPIC MODELLING USING GENSIM IN NLP</title>
      <dc:creator>Anthony Mipawa</dc:creator>
      <pubDate>Wed, 25 May 2022 03:49:49 +0000</pubDate>
      <link>https://dev.to/neurotech_africa/get-started-with-topic-modelling-using-gensim-in-nlp-1b4g</link>
      <guid>https://dev.to/neurotech_africa/get-started-with-topic-modelling-using-gensim-in-nlp-1b4g</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;INTRODUCTION&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As one application of NLP, &lt;strong&gt;topic modeling&lt;/strong&gt; is used in many business areas to easily scan a series of documents, find groups of words (topics) within them, and automatically &lt;strong&gt;cluster&lt;/strong&gt; word groupings, saving time and reducing costs.&lt;/p&gt;

&lt;p&gt;In this article, you're going to learn how to implement topic modeling with &lt;strong&gt;Gensim&lt;/strong&gt;. I hope you will enjoy it; let's get started.&lt;/p&gt;

&lt;p&gt;Have you ever wondered how hard it is to process 100,000 documents that contain 1,000 words each? That means 100,000 * 1,000 = 100,000,000 word occurrences to work through across all documents. This would be hard, time-consuming, and memory-consuming if done manually. That's where &lt;strong&gt;topic modeling&lt;/strong&gt; comes into play, as it allows you to achieve all of that programmatically, and that's what you're going to learn in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;WHAT IS TOPIC MODELLING?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Topic Modelling&lt;/strong&gt; can be defined as an unsupervised statistical classification method that applies techniques such as the Latent Dirichlet Allocation (LDA) topic model to discover the topics present in a set of documents and to recognize the words that make up those topics. This saves time and provides an efficient way to understand documents based on their topics.&lt;/p&gt;

&lt;p&gt;Topic modeling has many &lt;strong&gt;applications&lt;/strong&gt;, ranging from sentiment analysis to recommendation systems. Consider the diagram below for other applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lpGgpa3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/topic_modelling.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lpGgpa3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/topic_modelling.png" alt="https://blog.neurotech.africa/content/images/2022/02/topic_modelling.png" width="773" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;applications of topic modeling -&lt;a href="https://medium.com/@fatmafatma/industrial-applications-of-topic-model-100e48a15ce4"&gt;source&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that you have a clear understanding of what topic modeling means, let's see how to achieve it with Gensim. But wait, someone asked: what is &lt;strong&gt;Gensim&lt;/strong&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;WHAT IS GENSIM?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Well, Gensim is short for &lt;strong&gt;generate similar&lt;/strong&gt;, that is, &lt;strong&gt;Gen&lt;/strong&gt; from &lt;em&gt;generate&lt;/em&gt; and &lt;strong&gt;sim&lt;/strong&gt; from &lt;em&gt;similar&lt;/em&gt;. It is an open-source, fully specialized Python library written by &lt;strong&gt;Radim Rehurek&lt;/strong&gt; to represent document vectors as efficiently (computer-wise) and painlessly (human-wise) as possible.&lt;/p&gt;

&lt;p&gt;Gensim is designed for topic modeling tasks, extracting semantic topics from documents. Gensim is your tool when you want to process large chunks of textual data; internally it uses algorithms like &lt;em&gt;Word2Vec&lt;/em&gt;, &lt;em&gt;FastText&lt;/em&gt;, &lt;em&gt;Latent Semantic Indexing&lt;/em&gt; (LSI, LSA, LsiModel), and &lt;strong&gt;Latent Dirichlet Allocation&lt;/strong&gt; (LDA, LdaModel).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JX3sIa4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/gensim_history-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JX3sIa4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/gensim_history-1.png" alt="https://blog.neurotech.africa/content/images/2022/02/gensim_history-1.png" width="785" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gensim history - source &lt;a href="https://radimrehurek.com/"&gt;Radim Rehurek&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;WHY GENSIM?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It has efficient implementations of various vector space algorithms, as mentioned above.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also provides similarity queries for documents in their semantic representation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It provides I/O wrappers and converters around several popular data formats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gensim is very fast, thanks to the design of its data access and its implementation of numerical processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;HOW TO USE GENSIM FOR TOPIC MODELLING IN NLP.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We have come to the meat of our article, so grab a cup of coffee and a fun playlist, with a Jupyter Notebook open and ready for hands-on work. Let's start.&lt;/p&gt;

&lt;p&gt;In this section, we'll see a practical implementation of &lt;strong&gt;Gensim&lt;/strong&gt; for topic modelling using the &lt;strong&gt;Latent Dirichlet Allocation&lt;/strong&gt; (LDA) topic model.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First we have to install &lt;a href="https://radimrehurek.com/gensim/"&gt;the gensim library&lt;/a&gt; in a Jupyter notebook to be able to use it in our project; consider the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! pip install --upgrade gensim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Loading the datasets and importing important libraries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We are going to use an open-source dataset containing millions of news headlines sourced from the reputable Australian news agency &lt;a href="http://www.abc.net.au/"&gt;ABC&lt;/a&gt; (Australian Broadcasting Corporation).&lt;/p&gt;

&lt;p&gt;The dataset contains two columns, publish_date and headline_text, with millions of headlines.&lt;/p&gt;

&lt;p&gt;Consider the below code for importing the required libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#importing library
import pandas as pd #loading dataframe
import numpy as np  #for mathematical calculations

import matplotlib.pyplot as plt #visualization
import seaborn as sns #visualization
import zipfile #for extracting the zip file datasets

import gensim #library for topic modelling
from gensim.models import LdaMulticore
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk   #natural language toolkit for preprocessing the text data

from nltk.stem import WordNetLemmatizer   #lemmatizes using WordNet's built-in morphy function; returns the input word unchanged if it cannot be found in WordNet

from nltk.stem import SnowballStemmer #used for stemming in NLP
from nltk.stem.porter import * #porter stemming

from wordcloud import WordCloud #visualization techniques for #frequently repeated texts

nltk.download('wordnet')  #database of words in more than 200 #languages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IvcCub_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IvcCub_b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture1-1.png" alt="https://blog.neurotech.africa/content/images/2022/02/capture1-1.png" width="631" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have managed to install &lt;strong&gt;Gensim&lt;/strong&gt; and import the supporting libraries into our working environment. Consider the code below for installing the other libraries, if they are not yet installed in your Jupyter notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! pip install nltk       #installing nltk library
! pip install wordcloud  #installing wordcloud library
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After successfully importing the above libraries, let's now extract the zipped dataset into a folder named data_for_Topic_modelling, as shown in the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Extracting the Datasets
with zipfile.ZipFile("./abcnews-date-text.csv.zip") as file_zip:
    file_zip.extractall("./data_for_Topic_modelling")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nice, we have successfully unzipped the data using the zipfile library that we imported above, remember? Now let's load the data into a variable called &lt;em&gt;data&lt;/em&gt;. Since the dataset has millions of headlines, for this tutorial we are going to take 500,000 rows of the ABC headline news using Python slicing.&lt;/p&gt;

&lt;p&gt;consider the code below for doing that;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#loading the data
#Here we have taken 500,000 rows of our dataset for implementation

data=pd.read_csv("./data_for_Topic_modelling/abcnews-date-text.csv")
data=data[:500000] #500000 rows taken
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EDA and processing the data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nice. Now that the data is in our variable named data, as shown above, we have to check what it looks like. EDA means exploratory data analysis, and we will also do some processing to make sure the dataset is ready for the algorithm to be trained on.&lt;/p&gt;

&lt;p&gt;In the code below, we use the &lt;em&gt;.head()&lt;/em&gt; function, which prints the first five rows of the dataset. This helps us see the structure of the data and confirms that it consists of texts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Checking the first columns
data.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RXIYaOeS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RXIYaOeS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture2.png" alt="https://blog.neurotech.africa/content/images/2022/02/capture2.png" width="582" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we check the &lt;em&gt;shape&lt;/em&gt; of the dataset and confirm that we have the number of rows we selected when loading the data, so we are ready to go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#checking the shape
#as you see there are 500000 the headline news as the rows we selected above.

data.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8pfvOxBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8pfvOxBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture3.png" alt="https://blog.neurotech.africa/content/images/2022/02/capture3.png" width="224" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we have to delete the &lt;strong&gt;publish_date&lt;/strong&gt; column from the dataset using the keyword &lt;strong&gt;del&lt;/strong&gt;, as shown in the code below. &lt;strong&gt;Why?&lt;/strong&gt; Because we don't need it: our main focus is to model topics from the many headlines in the document, so we keep only the headline_text column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Deleting the publish data column since we want only headline_text #columns.

del data['publish_date']

#confirm deletion
data.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k0SC3czz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k0SC3czz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/capture5.png" alt="https://blog.neurotech.africa/content/images/2022/02/capture5.png" width="442" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we are left with our important column, headline_text, as seen above. Here we use a &lt;strong&gt;word cloud&lt;/strong&gt; to look at the most frequently appearing words in the headline_text column, which builds more understanding of the dataset. Consider the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#word cloud visualization for the headline_text
wc = WordCloud(
    background_color='black',
    max_words = 100,
    random_state = 42,
    max_font_size=110
    )
wc.generate(' '.join(data['headline_text']))
plt.figure(figsize=(50,7))
plt.imshow(wc)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KaZqOS0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KaZqOS0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/c6.png" alt="https://blog.neurotech.africa/content/images/2022/02/c6.png" width="761" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After visualizing the data, we process it, starting with &lt;strong&gt;stemming&lt;/strong&gt;, which is simply the process of reducing a word to its word &lt;strong&gt;stem&lt;/strong&gt; by removing affixes (suffixes and prefixes), or to the root form of the word known as the &lt;strong&gt;lemma&lt;/strong&gt; (for example, &lt;strong&gt;cared&lt;/strong&gt; to &lt;strong&gt;care&lt;/strong&gt;). Here we are using the &lt;em&gt;SnowballStemmer&lt;/em&gt; algorithm that we imported from &lt;strong&gt;&lt;a href="https://www.nltk.org/"&gt;nltk&lt;/a&gt;&lt;/strong&gt;, remember?&lt;/p&gt;

&lt;p&gt;Consider the function below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#function to perform the pre processing steps on the  dataset
#stemming

stemmer = SnowballStemmer("english")
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we continue with &lt;strong&gt;tokenizing&lt;/strong&gt; and &lt;strong&gt;lemmatizing:&lt;/strong&gt; we split the large texts in headline_text into lists of smaller words (tokenization), and append each word lemmatized by the &lt;strong&gt;lemmatize_stemming&lt;/strong&gt; function above to the result list, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tokenize and lemmatize

def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) &amp;gt; 3:
            #Apply lemmatize_stemming on the token, then add to the results list
            result.append(lemmatize_stemming(token))
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the above steps, we simply apply the &lt;strong&gt;preprocess()&lt;/strong&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#calling the preprocess function above
processed_docs = data['headline_text'].map(preprocess)
processed_docs[:10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uX3zMw4d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uX3zMw4d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image.png" alt="https://blog.neurotech.africa/content/images/2022/02/image.png" width="507" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, create a dictionary from 'processed_docs' using gensim.corpora, which records the number of times each word appears in the training set, and name it dictionary; consider the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; dictionary = gensim.corpora.Dictionary(processed_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, with the dictionary from the code above, we implement the &lt;strong&gt;bag-of-words model&lt;/strong&gt; (BoW). &lt;strong&gt;BoW&lt;/strong&gt; is simply a representation of text that records the occurrence of words within the specified documents; it keeps only the word counts and discards everything else, such as the order or structure of the document. We will also pick a sample document, called document_num, and assign it the value 4310.&lt;/p&gt;

&lt;p&gt;Note: you can pick any sample document of your own.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create the Bag-of-words(BoW) model for each document
document_num = 4310
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking the bag-of-words corpus for our sample document, which is a list of (token_id, token_count) pairs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bow_corpus[document_num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yjH73YRP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yjH73YRP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-2.png" alt="https://blog.neurotech.africa/content/images/2022/02/image-2.png" width="662" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling using LDA (Latent Dirichlet Allocation) from bags of words above&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have come to the final part: using LDA, specifically Gensim's LdaMulticore, which trains faster by spreading the work across multiple worker processes, to create our first topic model and save it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Modelling part
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=10,
                                       id2word = dictionary,
                                       passes = 2,
                                       workers=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each topic, we will explore the words occurring in that topic and their relative weight&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Here it should give you a ten topics as example shown below image
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S8rq5VhD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S8rq5VhD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-6.png" alt="https://blog.neurotech.africa/content/images/2022/02/image-6.png" width="743" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's finish with a performance check: using the LDA bag-of-words model, we find which topics the test document we selected earlier belongs to. Consider the code below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Our test document is document number 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--06K8aiSZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--06K8aiSZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.neurotech.africa/content/images/2022/02/image-5.png" alt="https://blog.neurotech.africa/content/images/2022/02/image-5.png" width="880" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations if you have made it to the end of this article! As shown above, we have implemented a working topic model with LDA from the Gensim library, using bags of words to model the topics present in a dataset of 500,000 news headlines. The full code and dataset can be found &lt;strong&gt;&lt;a href="https://github.com/sarufi-io/Topic-Modelling-With-Gensim"&gt;here&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Relationship Between Neurotech and Natural Language Processing(NLP)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Natural Language Processing is a powerful tool for solving business challenges and supporting the digital transformation of companies and startups. &lt;a href="https://sarufi.io/"&gt;Sarufi&lt;/a&gt; and &lt;a href="https://www.neurotech.africa/#services"&gt;Neurotech&lt;/a&gt; offer high-standard conversational AI (chatbot) solutions. Improve your business experience today with NLP &lt;a href="https://sarufi.io/solutions"&gt;solutions&lt;/a&gt; built by experienced technical experts.&lt;/p&gt;

&lt;p&gt;Hope you found this article useful; sharing is caring.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>gensim</category>
      <category>topicmodeling</category>
      <category>getstarted</category>
    </item>
  </channel>
</rss>
