DEV Community: Debapriya Das

Nyaya-GPT: Building Smarter Legal AI with ReAct + RAG

Debapriya Das — Wed, 23 Oct 2024 15:34:07 +0000

Let’s dive into some advanced AI concepts with a real-world problem like building a smart legal assistant, Nyaya-GPT. This project helps users to query legal documents like the Indian Constitution and Bharatiya Nyaya Sanhita (BNS) with precision. We'll explore concepts like ReAct, RAG (Retrieval-Augmented Generation) and Vector Databases and how they work together to push the boundaries of simple fact retrieval.

Problem Statement and Use Case

Legal documents are vast, complex, and difficult to navigate without specialized knowledge. A traditional chatbot using basic retrieval systems can get overwhelmed with legal jargon or fail to pull the most relevant information from large documents. Nyaya-GPT solves this by combining ReAct + RAG to create a chatbot that not only retrieves legal facts but also reasons through them, offering more nuanced responses.

RAG (Retrieval-Augmented Generation): What Is It?

At its core, RAG combines retrieval with generation. In simpler terms, it looks up the most relevant information in a database or document (retrieval) and uses an LLM (Large Language Model) to formulate the response (generation).

However, basic RAG setups have limitations:

Only factual: They retrieve facts using semantic search from a vector database without any kind of reasoning, which makes them okay for straight-up questions but lacking nuance.
No deeper understanding: Naive RAG doesn’t really "understand" the query or refine the answers. It’s like asking a librarian for a book on a topic—they give you a book, but you still need to read and understand it yourself.
No memory: These systems don’t remember previous queries or correct mistakes—they answer in a single shot, leaving no room for back-and-forth conversation.

Why ReAct + RAG?

Now, this is where ReAct (Reason + Act) steps in to supercharge our naive RAG. Instead of just pulling up facts, ReAct allows the model to reason through a query and then act to retrieve relevant info in multiple steps. It uses a "think before you act" kind of approach, where the agent breaks down the query, performs actions (like retrieving data), and refines the answer before responding.

Here’s why ReAct + RAG is superior:

Query Understanding: It doesn’t just do a blind search—it thinks about what you're asking. If the first attempt isn't great, it revises its actions.
Multi-step Reasoning: Rather than fetching a single fact, it performs multiple steps to ensure the answer is accurate and contextually appropriate.
Error Handling and Memory: This loop allows the system to handle mistakes and track the conversation, leading to better results over time. For example, if the prompt contains a typo or is vague or incomplete, the reasoning loop will try to work around it as far as the model's capability allows.
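To make the thought-action cycle concrete, here is a minimal sketch of a ReAct-style loop in plain Python. It is not the Nyaya-GPT implementation: scripted_policy is a hypothetical stand-in for the LLM, and lookup_constitution is a toy tool that replaces the real vector search. The point is only to show how an observation from one step can change the next action.

```python
from typing import Callable, Dict, List, Tuple


def lookup_constitution(query: str) -> str:
    """Toy tool (hypothetical): keyword lookup standing in for a vector-store query."""
    notes = {
        "article 21": "Article 21: Protection of life and personal liberty.",
        "article 14": "Article 14: Equality before law.",
    }
    for key, text in notes.items():
        if key in query.lower():
            return text
    return "No match found."


def scripted_policy(question: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, action_input)."""
    if not observations:
        return ("I should look this up.", "lookup_constitution", question)
    last = observations[-1]
    if last == "No match found." and len(observations) < 2:
        # Revise the action input after a failed attempt (here: fix one known typo).
        fixed = question.replace("artcle", "article")
        return ("The lookup failed, so I will retry with a corrected query.",
                "lookup_constitution", fixed)
    return ("I have what I need.", "finish", last)


def react_loop(question: str,
               policy: Callable[[str, List[str]], Tuple[str, str, str]],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 4) -> str:
    """Alternate between reasoning (policy) and acting (tools) until 'finish'."""
    observations: List[str] = []
    for _ in range(max_steps):
        thought, action, action_input = policy(question, observations)
        if action == "finish":
            return action_input
        observations.append(tools[action](action_input))
    return "Stopped: step limit reached."


tools = {"lookup_constitution": lookup_constitution}
answer = react_loop("What does artcle 21 say?", scripted_policy, tools)
print(answer)
```

The first lookup fails because of the typo, the policy sees that observation, corrects the query, and the second lookup succeeds. A real agent delegates that decision to the LLM instead of a hard-coded rule.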

The Role of a Vector Database

To make retrieval smarter, Nyaya-GPT uses a vector database. But what exactly is a vector database?

Instead of storing data as simple text, vector databases store it as embeddings—numerical representations of meaning in the text. For Nyaya-GPT, this means breaking down the legal documents into chunks and storing them as vectors. When you ask a question, the system converts your query into a vector and searches for semantically similar vectors from the stored chunks.

Why is this important?

Efficient Semantic Search: A vector database helps the system understand the meaning behind words, not just match keywords.
Scalability: As new legal documents are added, the system can handle larger datasets efficiently.
Relevance: It retrieves the most relevant chunks of information, which are then used by the LLM to craft a detailed response.

For instance, if you ask Nyaya-GPT about “fundamental rights,” it doesn’t just look for exact keywords—it searches for related legal concepts and sections, thanks to the vector database.
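As a rough illustration of what semantic search over vectors means, the sketch below ranks a few stored chunks by cosine similarity to a query vector. The three-dimensional vectors are hand-made for the example (real embedding models produce hundreds of dimensions), and the in-memory store dictionary is an assumption that stands in for FAISS.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings"; the values are illustrative only.
store: Dict[str, List[float]] = {
    "Right to equality": [0.9, 0.1, 0.0],
    "Right to freedom of speech": [0.8, 0.3, 0.1],
    "Procedure for money bills": [0.1, 0.2, 0.9],
}


def search(query_vector: List[float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored chunks most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Pretend this is the embedding of the query "fundamental rights".
query_vector = [0.85, 0.2, 0.05]
top = search(query_vector)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

Neither of the two top results contains the words "fundamental rights"; they rank highest only because their vectors point in a similar direction to the query vector.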

Workflow of Nyaya-GPT

Here’s how everything comes together in Nyaya-GPT’s workflow:

User Query: A user submits a legal question like, “What are the key provisions of the Indian Constitution?”
ReAct Loop: The system analyzes the question and determines whether it requires retrieval or reasoning.
RAG & Vector Database: It fetches relevant legal text from the FAISS (Facebook AI Similarity Search) vector database, using semantic search based on embeddings.
Thought-Action Cycle: The agent reasons through the query and refines the result using the information retrieved.
Final Answer: The system synthesizes the retrieved information into a detailed, accurate response.

Code Walkthrough: How Nyaya-GPT Implements ReAct + RAG

Let’s dive into some code snippets to see how Nyaya-GPT brings these concepts to life.

Step 1: Agent Creation Using ReAct and Tools

The heart of Nyaya-GPT is the ReAct agent, which handles both reasoning and tool invocation. Below is the key function from agent.py:

from langchain_groq import ChatGroq
from langchain.agents import create_react_agent, AgentExecutor
from tools.pdf_query_tools import indian_constitution_pdf_query, indian_laws_pdf_query

def agent(query: str):
    # Groq-hosted Llama 3 model that drives the reasoning loop
    llm = ChatGroq(model="llama3-8b-8192")
    # Retrieval tools the agent may call during the loop
    tools = [indian_constitution_pdf_query, indian_laws_pdf_query]
    # get_prompt_template() is not shown in this excerpt; it supplies the ReAct prompt
    prompt_template = get_prompt_template()

    # Named react_agent so it does not shadow this function's own name
    react_agent = create_react_agent(llm, tools, prompt_template)

    agent_executor = AgentExecutor(agent=react_agent, tools=tools, verbose=False, handle_parsing_errors=True)
    result = agent_executor.invoke({"input": query})

    return result["output"]

This function sets up the LLM, along with the necessary tools for interacting with the vector database. When a query is received, the agent invokes the ReAct loop, reasoning through the query and determining if it needs to retrieve any documents using the tools.
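The prompt returned by get_prompt_template() is not shown here. As a rough sketch of what a ReAct prompt usually contains, here is a hypothetical template text. The wording and the helper name get_prompt_text are my own assumptions, not the project's actual prompt. LangChain's create_react_agent expects the prompt to expose the tools, tool_names, and agent_scratchpad variables, alongside the input key passed to invoke.

```python
# Hypothetical ReAct prompt text; the real Nyaya-GPT prompt may differ.
REACT_TEMPLATE = """Answer the following legal question as best you can.
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: think about what to do next
Action: the action to take, one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (Thought/Action/Action Input/Observation can repeat several times)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {input}
Thought:{agent_scratchpad}"""


def get_prompt_text() -> str:
    """Return the raw template text (hypothetical helper for this sketch)."""
    return REACT_TEMPLATE


# Fill the placeholders by hand to see what the model would receive.
filled = get_prompt_text().format(
    tools="indian_constitution_pdf_query: search the Constitution",
    tool_names="indian_constitution_pdf_query",
    input="What is Article 21?",
    agent_scratchpad="",
)
print(filled)
```

In a real project this text would be wrapped in a LangChain prompt object (for example with PromptTemplate.from_template) before being passed to create_react_agent, and the agent would fill agent_scratchpad with the previous thoughts and observations on every step.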

Step 2: PDF Query Tools and Vector Search

Nyaya-GPT uses FAISS to store document embeddings and perform semantic search. In pdf_query_tools.py, this function loads the Indian Constitution as a vector database and retrieves relevant sections based on the query:

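from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from PyPDF2 import PdfReader

def indian_constitution_pdf_query(query: str) -> str:
    embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

    try:
        # Reuse the saved index if one exists. Recent langchain_community releases
        # may also require allow_dangerous_deserialization=True here.
        db = FAISS.load_local("db/faiss_index_constitution", embeddings_model)
    except Exception:
        # No usable index yet: extract the PDF text, chunk it, embed it, and save it
        reader = PdfReader("tools/data/constitution.pdf")
        pages = (page.extract_text() for page in reader.pages)
        raw_text = ''.join(text for text in pages if text)

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=400)
        texts = text_splitter.split_text(raw_text)
        db = FAISS.from_texts(texts, embeddings_model)
        db.save_local("db/faiss_index_constitution")

    # k has to be passed through search_kwargs to limit the results to four chunks
    retriever = db.as_retriever(search_kwargs={"k": 4})
    docs = retriever.invoke(query)
    # Join the chunk texts so the function returns a string, as annotated
    return "\n\n".join(doc.page_content for doc in docs)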

This snippet demonstrates how the Indian Constitution PDF is loaded, processed into text chunks, and embedded into the FAISS database. When a user queries Nyaya-GPT, the system searches through these embeddings to find the most relevant text.
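With chunk_size=800 and chunk_overlap=400, consecutive chunks share roughly half of their characters, so a sentence cut at one chunk boundary is likely to appear whole in the neighbouring chunk. The sketch below shows the idea with a simplified fixed-width sliding window. It is not RecursiveCharacterTextSplitter's actual algorithm, which also tries to split on separators such as paragraph breaks, and the function name is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Small numbers so the overlap is easy to see; the article uses 800 and 400.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=8, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

A larger overlap improves the chance that a relevant passage survives intact in at least one chunk, at the cost of storing and embedding more text.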

Step 3: Streamlit Interface

The Streamlit app serves as the front end for interacting with Nyaya-GPT. Users can input their queries directly into the interface, which then calls the agent function to retrieve and display answers.

import streamlit as st
from agent import agent

st.title("Nyaya-GPT⚖️")

if "store" not in st.session_state:
    st.session_state.store = []

store = st.session_state.store

if prompt := st.chat_input("What is your query?"):
    st.chat_message("user").markdown(prompt)
    response = agent(prompt)
    store.append(response)
    # agent() returns the answer as a plain string, so render it directly
    st.chat_message("assistant").markdown(response)

This interface provides a simple yet effective way to interact with the chatbot, allowing users to query legal documents and receive answers.

Check out this article for a detailed walkthrough on building a chatbot using LLMs powered by Groq and Streamlit: https://dev.to/debapriyadas/create-an-end-to-end-personalised-ai-chatbot-using-llama-31-and-streamlitpowered-by-groq-api-3i32

Conclusion

Nyaya-GPT demonstrates the power of combining RAG and ReAct to build a sophisticated legal assistant capable of answering complex legal queries. By leveraging FAISS vector databases for efficient retrieval and ensuring that the model reasons through its responses, the system offers a more reliable and scalable solution than traditional approaches.

For more information and to access the code, check out the repository:

Nyaya-GPT GitHub Repository

Additional Resources:

This combination of structured reasoning, powerful retrieval, and an intuitive user interface creates an efficient legal research tool that can assist users in navigating complex laws with ease.
Edit the documents and the codebase to create your own personalized assistant.

Create an end-to-end personalised AI chatbot🤖 using Llama-3.1🦙 and Streamlit powered by Groq API

Debapriya Das — Sun, 08 Sep 2024 20:15:53 +0000

In this tutorial, we'll build and deploy a personalised AI-powered chat application using Streamlit and the latest AI model llama-3.1-8b-instant. We'll use Groq for faster inference. Also we are going to deploy it for free!
We'll take you through the code, explaining each section and providing useful tips for customization.

Getting Started

First sign in to https://groq.com/ and click start building

Click on Create API key then create a new key, copy it and keep it somewhere safe.

Now install the necessary libraries.
Create a requirements.txt file and paste this into it:



groq==0.9.0
streamlit==1.37.0
python-dotenv

Install these using



pip install -r requirements.txt

Let's create our main.py file and import the required libraries:



import os
from dotenv import dotenv_values
import streamlit as st
from groq import Groq

We'll use streamlit for building the chat interface, dotenv for handling environment variables, and groq for fast inference from the AI model.

Configuring the Page

Let's set up the page configuration using Streamlit:



st.set_page_config(
    page_title="The Tech Buddy ",
    page_icon="",
    layout="centered",
)

This will give our chat application a professional look and feel.

Handling Environment Variables

We'll use environment variables to store sensitive information like API keys, as well as the application-specific prompts.
In your root folder create a .env file like this:



GROQ_API_KEY='YOUR_GROQ_API_KEY'

INITIAL_RESPONSE="Enter what you want to show as the first response of your bot, example: Hello! my friend I am a painter from 70's. Whatsup?"

CHAT_CONTEXT="Enter how you want to personalize your chatbot, example: You are a painter from the 70's and you respond with sentences full of painting references.(This is for the system)"

INITIAL_MSG="Enter the first message from the assistant to initiate the chat history, example: Hey there! I know everything about painting, ask me anything.(This is for the assistant)"

This part is crucial to personalize your application as per your need. So play with it and explore.

Now configure these environment variables in our Python file:



try:
    secrets = dotenv_values(".env")  # for dev env
    GROQ_API_KEY = secrets["GROQ_API_KEY"]
except:
    secrets = st.secrets  # for streamlit deployment
    GROQ_API_KEY = secrets["GROQ_API_KEY"]

# Save the API key to environment variable
os.environ["GROQ_API_KEY"] = GROQ_API_KEY

INITIAL_RESPONSE = secrets["INITIAL_RESPONSE"]
INITIAL_MSG = secrets["INITIAL_MSG"]
CHAT_CONTEXT = secrets["CHAT_CONTEXT"]

In the try block, we read the environment variables from the .env file so that we can run and test the app locally.
When we deploy with Streamlit, however, the .env file is not available. In that case we store our secrets in Streamlit and read them through st.secrets, which behaves like a Python dict, just as dotenv_values(".env") does. So after deployment, the except block is executed.

Initializing the Chat Application

Let's set up the chat history and initialize the AI model:

Copy your favourite AI-model's Model ID from https://console.groq.com/docs/models:

I used llama-3.1-8b-instant for my project.

Initialize your model:

# Initialize the chat history if present as Streamlit session
if "chat_history" not in st.session_state:
    st.session_state.chat_history = [
        {"role": "assistant",
         "content": INITIAL_RESPONSE
         },
    ]

client = Groq()

We'll store the chat history in the st.session_state object, which allows us to persist data across script reruns within a user's session.

Displaying the Chat Application

Let's create the chat interface using Streamlit:


# Page title
st.title("Hey Buddy!")
st.caption("Let's go back in time...")

# Display chat history
for message in st.session_state.chat_history:
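    # Use each message's own role, and pick the avatar to match it
    avatar = '🤖' if message["role"] == "assistant" else '🗨️'
    with st.chat_message(message["role"], avatar=avatar):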
        st.markdown(message["content"])

We'll use the st.chat_message function to display each message in the chat history.

User Input Field

Let's create a text input field for the user to enter their question:



user_prompt = st.chat_input("Let's chat!")

When the user submits their prompt, we'll append it to the chat history and generate a response from the AI model.

Generating a Response from the AI Model

Let's create a response from the AI model using the Groq library:



def parse_groq_stream(stream):
    for chunk in stream:
        if chunk.choices:
            if chunk.choices[0].delta.content is not None:
                yield chunk.choices[0].delta.content

if user_prompt:
    with st.chat_message("user", avatar="🗨️"):
        st.markdown(user_prompt)
    st.session_state.chat_history.append(
        {"role": "user", "content": user_prompt})

    # System prompt and opening assistant message, followed by the conversation so far
    messages = [
        {"role": "system", "content": CHAT_CONTEXT
         },
        {"role": "assistant", "content": INITIAL_MSG},
        *st.session_state.chat_history
    ]

    # Stream the reply inside an assistant chat message container
    with st.chat_message("assistant", avatar='🤖'):
        stream = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=messages,
            stream=True  # for streaming the message
        )
        response = st.write_stream(parse_groq_stream(stream))
    st.session_state.chat_history.append(
        {"role": "assistant", "content": response})

We use the client.chat.completions.create() method to generate a stream, parse it into the actual response from the AI model, and then append that response to the chat history.
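To see what parse_groq_stream does without calling the API, the sketch below feeds it a fake stream. The fake_chunk helper is a hypothetical stand-in that only mimics the attributes the parser reads (choices, delta, content); it is not a real Groq object.

```python
from types import SimpleNamespace


def parse_groq_stream(stream):
    """Yield only the text pieces from a stream of chat-completion chunks."""
    for chunk in stream:
        if chunk.choices:
            content = chunk.choices[0].delta.content
            if content is not None:
                yield content


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, built for this example only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Hello"),
    fake_chunk(", "),
    SimpleNamespace(choices=[]),  # a chunk with no choices is skipped
    fake_chunk("world"),
    fake_chunk(None),             # a chunk with no text is skipped
]

reply = "".join(parse_groq_stream(fake_stream))
print(reply)
```

Because the parser is a generator, st.write_stream can render each piece as soon as it arrives instead of waiting for the full reply.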

Run it locally

Congratulations! You've built a personalised AI-powered chat application using Streamlit, Groq, and a llama-3.1-8b-instant model.

Here is the whole main.py file:



import os
from dotenv import dotenv_values
import streamlit as st
from groq import Groq


def parse_groq_stream(stream):
    for chunk in stream:
        if chunk.choices:
            if chunk.choices[0].delta.content is not None:
                yield chunk.choices[0].delta.content


# streamlit page configuration
st.set_page_config(
    page_title="The 70's Painter",
    page_icon="🎨",
    layout="centered",
)


try:
    secrets = dotenv_values(".env")  # for dev env
    GROQ_API_KEY = secrets["GROQ_API_KEY"]
except:
    secrets = st.secrets  # for streamlit deployment
    GROQ_API_KEY = secrets["GROQ_API_KEY"]

# save the api_key to environment variable
os.environ["GROQ_API_KEY"] = GROQ_API_KEY

INITIAL_RESPONSE = secrets["INITIAL_RESPONSE"]
INITIAL_MSG = secrets["INITIAL_MSG"]
CHAT_CONTEXT = secrets["CHAT_CONTEXT"]


client = Groq()

# initialize the chat history if present as streamlit session
if "chat_history" not in st.session_state:
    # print("message not in chat session")
    st.session_state.chat_history = [
        {"role": "assistant",
         "content": INITIAL_RESPONSE
         },
    ]

# page title
st.title("Hey Buddy!")
st.caption("Let's go back in time...")
# the messages in chat_history are stored as {"role": "user/assistant", "content": "msg"}
# display chat history
for message in st.session_state.chat_history:
    # print("message in chat session")
    # use each message's own role, and pick the avatar to match it
    avatar = '🤖' if message["role"] == "assistant" else '🗨️'
    with st.chat_message(message["role"], avatar=avatar):
        st.markdown(message["content"])


# user input field
user_prompt = st.chat_input("Ask me")

if user_prompt:
    # st.chat_message("user").markdown
    with st.chat_message("user", avatar="🗨️"):
        st.markdown(user_prompt)
    st.session_state.chat_history.append(
        {"role": "user", "content": user_prompt})

    # get a response from the LLM
    messages = [
        {"role": "system", "content": CHAT_CONTEXT
         },
        {"role": "assistant", "content": INITIAL_MSG},
        *st.session_state.chat_history
    ]

    # Display assistant response in chat message container
    with st.chat_message("assistant", avatar='🤖'):
        stream = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=messages,
            stream=True  # for streaming the message
        )
        response = st.write_stream(parse_groq_stream(stream))
    st.session_state.chat_history.append(
        {"role": "assistant", "content": response})

To run it locally enter the following command in your terminal:



streamlit run main.py

Deployment

We are now all set to deploy our app.
First upload the codebase in a GitHub repository.
Then click here to sign in to your streamlit account and go to My Apps section:

Click on Create app at the upper right corner.
Click on first option:
Locate your GitHub repository:
Locate your main.py file:
Create a custom URL for your deployed app (optional):
Click on additional settings and paste everything from your .env file (this is the st.secrets):
Click on deploy:

Congrats! You have successfully deployed your own personalised AI app for free.

Conclusion

This tutorial should give you a solid foundation for creating engaging chat interfaces in your future projects.

Additional Tips

Make sure to handle errors and exceptions properly to provide a smooth user experience.
Customize the chat interface to fit your application's theme and design.
Experiment with different AI models and fine-tune the chat application to improve its accuracy and performance.
I hope this tutorial has been helpful! If you have any questions or need further clarification on any section, feel free to ask in the comments below.

Resources

Click here to check out my implementation of a personalized DSA instructor app: https://the-tech-buddy.streamlit.app/
Source code: https://github.com/Debapriya-source/llama-3.1-chatbot

How to Run Llama-3.1🦙 Locally Using Python🐍 and Hugging Face 🤗

Debapriya Das — Tue, 30 Jul 2024 17:44:21 +0000

Introduction

Llama🦙 (Large Language Model Meta AI) 3.1 is a powerful AI model developed by Meta AI that has gained significant attention in the natural language processing (NLP) community. At the time of writing, it is among the most capable openly available LLMs. In this blog, I will guide you through cloning the Llama 3.1 model from Hugging Face🤗 and running it on your local machine using Python, after which you can integrate it into any AI project.

Prerequisites

Python 3.8 or higher installed on your local machine
Hugging Face Transformers library installed (pip install transformers)
Git installed on your local machine
A Hugging Face account

Step 1: Get access to the model

Click here https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct to open the official hugging face repository of Meta's Llama-3.1-8B-Instruct (you can use other llama 3.1 models in the same way).

At the beginning you should be seeing this:

Submit the form below to get access to the model

Once you see "You have been granted access to this model", you are good to go...

Step 2: Create an ACCESS_TOKEN

Go to "Settings" (bottom right corner of the image below):

Go to "Access Tokens" and click "Create new token" (upper right corner of the image):

Give read and write permissions and select the repo as shown:

Copy the token and keep it somewhere safe and secure, as you will need it later. (Note: once you have copied it, you cannot view it again, so if you lose the token you will have to create a new one.)

Step 3: Clone the LLaMA 3.1 Model

Now run the following command on your favorite terminal.
The ACCESS_TOKEN is the one you copied and the <huggingface-user-name> is the username of your hugging face account.



git clone https://<huggingface-user-name>:<ACCESS_TOKEN>@huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

This can take a lot of time depending on your internet speed.

Step 4: Install Required Libraries

Once the cloning is done, go to the cloned folder and install all the dependencies from requirements.txt. (You can create a virtual environment using conda (recommended) or virtualenv.)
You can find the requirements file in my GitHub repository, linked in the Resources section below.

Using conda:



cd Meta-Llama-3.1-8B-Instruct
conda install --yes --file requirements.txt

Using pip:



cd Meta-Llama-3.1-8B-Instruct
pip install -r requirements.txt

Step 5: Run the Llama 3.1 Model

Create a new Python file (e.g., test.py) and paste the location of the model repository you just cloned as the model_id (such as, "D:\\Codes\\NLP\\Meta-Llama-3.1-8B-Instruct"). Here is an example:



import transformers
import torch

## Here you paste your cloned repos location
model_id = "D:\\Codes\\NLP\\Meta-Llama-3.1-8B-Instruct" 

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

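With device_map="auto", the model is placed on a GPU automatically when one is available. You can set device_map="cuda" if you want to force it onto the GPU.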
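The final print shows the whole last message rather than just its text. The sketch below uses a mocked result to show how to pull out only the reply. It assumes the pipeline returns the conversation as a list of role/content dictionaries when it is given chat messages, and the assistant text here is made up for the example.

```python
# Mocked pipeline result (assumed shape); the assistant text is invented for illustration.
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "I'm an AI assistant."},
        ]
    }
]

# The last entry of the conversation is the newly generated assistant message.
last_message = mock_outputs[0]["generated_text"][-1]
reply_text = last_message["content"]

print(last_message["role"])
print(reply_text)
```

With the real pipeline, the same indexing applies to outputs instead of mock_outputs.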

Step 6: Run the Python Script



python test.py

Output

Issues you can face

OSError: [WinError 126] fbgemm.dll
- To solve this error make sure you have Visual Studio installed.
  - In case you don't have it, click here and install it.
  - Then restart the computer.
If there are still errors related to PyTorch versions, use Anaconda or Miniconda to set up a new environment with a suitable Python version and dependencies.
If you face any other issue or error, feel free to comment below.

Resources

For more details on llama 3.1 check out: https://ai.meta.com/blog/meta-llama-3-1/

My implementation https://github.com/Debapriya-source/llama-3.1-8B-Instruct.git

Conclusion

In this blog, we have successfully cloned the LLaMA-3.1-8B-Instruct model from Hugging Face and run it on our local machine using Python. You can now experiment with the model by modifying the prompt, adjusting hyperparameters, or integrating it into your upcoming projects. Happy coding!

Exploring Word Embedding Techniques Based on Count or Frequency: A Practical Guide

Debapriya Das — Sun, 21 Jul 2024 16:31:23 +0000

In the rapidly evolving field of Natural Language Processing (NLP), word embeddings are essential for converting text into numerical representations that algorithms can process. This article delves into three primary word embedding techniques based on count or frequency: One-Hot Encoding, Bag of Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF), with practical Python implementations using scikit-learn.

1. One-Hot Encoding

Overview

One-Hot Encoding is a fundamental technique where each word in the vocabulary is represented as a binary vector. In this representation, each word is assigned a unique vector with a single high (1) value and the rest low (0).

Example

For a vocabulary of ["cat", "dog", "mouse"], the one-hot vectors would be:

"cat": [1, 0, 0]
"dog": [0, 1, 0]
"mouse": [0, 0, 1]
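Before reaching for a library, it can help to see that the mapping above is just "put a 1 at the word's index". Here is a small plain-Python sketch of that idea; the one_hot helper is a name I made up for this example.

```python
from typing import Dict, List


def one_hot(vocab: List[str]) -> Dict[str, List[int]]:
    """Map each word to a vector with a single 1 at the word's position in vocab."""
    size = len(vocab)
    return {
        word: [1 if position == index else 0 for position in range(size)]
        for index, word in enumerate(vocab)
    }


vectors = one_hot(["cat", "dog", "mouse"])
for word, vector in vectors.items():
    print(word, vector)
```

Every vector has the same length as the vocabulary, which is why this representation grows quickly as the vocabulary gets larger.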

Code Example

Here’s how you can implement One-Hot Encoding using scikit-learn:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample vocabulary
vocab = np.array(["cat", "dog", "mouse"]).reshape(-1, 1)

# Initialize OneHotEncoder (scikit-learn 1.2+ uses sparse_output; older versions used sparse)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the vocabulary
onehot_encoded = encoder.fit_transform(vocab)

# Print the one-hot encoded vectors
print(onehot_encoded)

Output

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Advantages

Simple and easy to implement.
Suitable for small datasets.

Disadvantages

High dimensionality for large vocabularies.
Does not capture semantic relationships between words.

Use Cases

Basic text classification tasks.
Simple NLP applications where semantic context is not crucial.

2. Bag of Words (BoW)

Overview

Bag of Words represents text by the frequency of words, disregarding grammar and word order. This technique constructs a vocabulary of known words and counts their occurrences in the text.

Example

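For the sentences "The cat sat on the mat" and "The dog lay on the mat", with the alphabetically sorted vocabulary ["cat", "dog", "lay", "mat", "on", "sat", "the"], the BoW representation would be:

"The cat sat on the mat": [1, 0, 0, 1, 1, 1, 2]
"The dog lay on the mat": [0, 1, 1, 1, 1, 0, 2]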
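The counting itself can be sketched in a few lines of plain Python. This version uses two different toy sentences, lowercases the text, and splits on whitespace, which is a simplification compared with a real tokenizer; the bag_of_words helper is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def bag_of_words(sentences: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted vocabulary and one count vector per sentence."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this sentence.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["The cat saw the dog", "The dog saw the bird"])
print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so word order inside the sentence is lost and only the counts remain.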

Code Example

Here’s how you can implement Bag of Words using scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Output

['cat' 'dog' 'lay' 'mat' 'on' 'sat' 'the']
[[1 0 0 1 1 1 2]
 [0 1 1 1 1 0 2]]

Advantages

Simple and effective for various tasks.
Suitable for text classification and document similarity.

Disadvantages

High dimensionality with large vocabularies.
Loses semantic and contextual information.

Use Cases

Document classification.
Spam detection and sentiment analysis.

3. Term Frequency-Inverse Document Frequency (TF-IDF)

Overview

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).

Formula

Term Frequency (TF): Measures how frequently a word appears in a document.

Formula: $\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$

Inverse Document Frequency (IDF): Measures how important a word is across multiple documents.

Formula: $\text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)$

TF-IDF: Product of TF and IDF.

Formula: $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$

Example

For a document set with "The cat sat on the mat" and "The dog lay on the mat":

"mat" may have a lower weight if it appears frequently in many documents.
"cat" and "dog" would have higher weights as they appear less frequently.
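The formulas can be checked by hand with a few lines of plain Python. This sketch applies them exactly as written above, using the natural logarithm (an assumption, since the formula does not fix the base) and simple whitespace tokenization. It also assumes the term occurs in at least one document, otherwise the IDF would divide by zero.

```python
import math
from typing import List


def tf(term: str, doc: List[str]) -> float:
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)


def idf(term: str, docs: List[List[str]]) -> float:
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term: str, doc: List[str], docs: List[List[str]]) -> float:
    return tf(term, doc) * idf(term, docs)


sentences = ["The cat sat on the mat", "The dog lay on the mat"]
docs = [sentence.lower().split() for sentence in sentences]

for term in ("cat", "mat", "the"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Here "cat" gets a positive weight of log(2)/6 because it appears in only one of the two documents, while "mat" and "the" get zero because they appear in both. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from these.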

Code Example

Here’s how you can implement TF-IDF using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences
sentences = ["The cat sat on the mat", "The dog lay on the mat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Output (values rounded to four decimal places)

['cat' 'dog' 'lay' 'mat' 'on' 'sat' 'the']
[[0.4455 0.     0.     0.317  0.317  0.4455 0.634 ]
 [0.     0.4455 0.4455 0.317  0.317  0.     0.634 ]]

Advantages

Highlights important words while reducing the weight of frequently occurring but less informative words.
Effective in filtering out common words and emphasizing unique terms.

Disadvantages

Still results in high-dimensional sparse matrices.
Does not capture semantic relationships between words.

Use Cases

Information retrieval and search engines.
Document clustering and topic modeling.

Conclusion

Count or frequency-based word embedding techniques like One-Hot Encoding, Bag of Words, and TF-IDF are foundational methods in NLP. While they are straightforward to implement and useful for various text processing tasks, they have limitations in capturing semantic relationships and handling large vocabularies. As the field advances, more sophisticated embedding techniques are emerging, offering richer and more nuanced representations of textual data.

I hope this article provides a clear and professional overview of count or frequency-based word embedding techniques with practical implementations. Happy learning! 🚀

Here is the notebook containing the examples: https://github.com/Debapriya-source/NLP-Codes/blob/main/word_embedding.ipynb

Exploring Text Preprocessing Techniques in Natural Language Processing

Debapriya Das — Thu, 18 Jul 2024 12:26:39 +0000


As developers and data enthusiasts, diving into Natural Language Processing (NLP) opens up a world of possibilities in understanding and extracting insights from textual data. In this article, we'll explore foundational techniques in text preprocessing that form the backbone of NLP applications.

Basic Terminologies in NLP

Before delving into techniques, let's grasp some fundamental terms:

Corpus: A collection of texts used for language analysis. It could range from news articles to social media posts.
Documents: Individual units within a corpus, like a single article or tweet.
Vocabulary: Unique words in a corpus, critical for understanding language diversity.
Words: Basic units of language, each with its own meaning and context.

Let's load a corpus and view its vocabulary using NLTK:

import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')
nltk.download('punkt')

# Load a corpus
corpus = gutenberg.words('austen-emma.txt')

# Display the first 10 words
print(corpus[:10])

# Create a vocabulary
vocabulary = set(corpus)
print(f"Vocabulary size: {len(vocabulary)}")
print(list(vocabulary)[:10])

Tokenization

Tokenization breaks down text into meaningful units, such as words or sentences:

Word Tokenization: Splits text into words. Example: "NLP is fascinating" becomes ["NLP", "is", "fascinating"].
Sentence Tokenization: Splits text into sentences. Example: "NLP is fascinating. It has many applications." becomes ["NLP is fascinating.", "It has many applications."].
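To get a feel for what a tokenizer does, here is a rough regex-based sketch using only the standard library. It is deliberately naive and is not how NLTK tokenizes: it will mis-split abbreviations such as "Dr." and does not handle many edge cases. Both function names are made up for this example.

```python
import re
from typing import List


def simple_word_tokenize(text: str) -> List[str]:
    """Split into runs of letters/digits/apostrophes, or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


def simple_sent_tokenize(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


text = "NLP is fascinating. It has many applications."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
```

Unlike the word-tokenization example above, this sketch keeps the full stops as separate tokens, which is a common choice because punctuation often carries useful signal.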

Here's how you can tokenize text using NLTK:

from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "NLP is fascinating. It has many applications."

# Word Tokenization
word_tokens = word_tokenize(text)
print(f"Word Tokens: {word_tokens}")

# Sentence Tokenization
sent_tokens = sent_tokenize(text)
print(f"Sentence Tokens: {sent_tokens}")

Stemming Techniques

Stemming reduces words to their root form, simplifying analysis:

Porter Stemmer: Converts "running" to "run".
Lancaster Stemmer: More aggressive, converting "happiness" to "happy".
Snowball Stemmer: Supports multiple languages, akin to Porter.

Here’s an example of stemming in action using NLTK:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Sample words
words = ["running", "jumps", "easily", "happiness"]

# Porter Stemmer
porter = PorterStemmer()
print("Porter Stemmer Results:", [porter.stem(word) for word in words])

# Lancaster Stemmer
lancaster = LancasterStemmer()
print("Lancaster Stemmer Results:", [lancaster.stem(word) for word in words])

# Snowball Stemmer
snowball = SnowballStemmer(language='english')
print("Snowball Stemmer Results:", [snowball.stem(word) for word in words])

Conclusion

Text preprocessing lays the groundwork for effective NLP applications. By understanding and applying these techniques, developers can harness the power of textual data to drive insights and innovation in various domains.

Start your NLP journey today and explore the endless possibilities of language understanding!

Ready to transform text into insights? Let's dive into #NLP and #TextProcessing together! 🚀💬