<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishi Agrawal</title>
    <description>The latest articles on DEV Community by Rishi Agrawal (@rishiagrawal2609).</description>
    <link>https://dev.to/rishiagrawal2609</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F767398%2F611971d7-0f8d-41a4-958c-c41159f3999c.jpeg</url>
      <title>DEV Community: Rishi Agrawal</title>
      <link>https://dev.to/rishiagrawal2609</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rishiagrawal2609"/>
    <language>en</language>
    <item>
      <title>NLP: Deep dive Term Frequency</title>
      <dc:creator>Rishi Agrawal</dc:creator>
      <pubDate>Sat, 17 May 2025 11:38:02 +0000</pubDate>
      <link>https://dev.to/rishiagrawal2609/nlp-deep-dive-term-frequency-b2c</link>
      <guid>https://dev.to/rishiagrawal2609/nlp-deep-dive-term-frequency-b2c</guid>
      <description>&lt;h2&gt;
  
  
  What is NLP?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;NLP&lt;/strong&gt; stands for &lt;strong&gt;Natural Language Processing&lt;/strong&gt;. It is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vjm938tksnzq2x1gep8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vjm938tksnzq2x1gep8.png" alt="NLP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might say: computers understand only 0s and 1s, so how can they understand human language? If you had the same question, let's dive into how we make a computer understand what natural language means.&lt;/p&gt;

&lt;h2&gt;
  
  
  All Words Are Just Numbers
&lt;/h2&gt;

&lt;p&gt;Yes, the title is correct—&lt;strong&gt;all words are just numbers&lt;/strong&gt; when it comes to how computers understand language. In Natural Language Processing (NLP), every word is transformed into a numerical representation, typically a vector. These vectors exist in a high-dimensional space known as &lt;strong&gt;vector space&lt;/strong&gt;, where words with similar meanings are located closer to one another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80q0h4oyfqgmmlyagp95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80q0h4oyfqgmmlyagp95.png" alt="Vector Space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This mathematical transformation allows the computer to process and analyze human language efficiently. Instead of understanding words the way humans do, the computer compares these vectors, calculates distances, and identifies patterns based on proximity. &lt;/p&gt;

&lt;p&gt;When a task such as prediction, translation, or classification is required, the computer doesn't think in words—it simply locates the vector that is most similar or "nearest" to the input and produces the corresponding output. The rich complexity of human language is distilled down into numerical patterns that machines can rapidly process, making NLP both a fascinating and powerful field in artificial intelligence.&lt;/p&gt;
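&lt;p&gt;To make "closer in vector space" concrete, here is a tiny sketch. The 2-D vectors below are made up purely for illustration (real word embeddings have hundreds of dimensions), and "closeness" is measured with cosine similarity:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-D "embeddings" -- invented values, for illustration only
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.82],
    "banana": [0.1, 0.9],
}

# Similar words score closer to 1 than dissimilar ones
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["banana"]))
```

&lt;p&gt;The similarity between "king" and "queen" comes out higher than between "king" and "banana" -- that is all "nearest vector" means in practice.&lt;/p&gt;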

&lt;p&gt;There are various methods for achieving this; here we will look at one of the fundamentals: the TF (Term Frequency) algorithm, which is used to measure the importance of a word in a document.&lt;/p&gt;

&lt;p&gt;Let's explore this mathematically, and then we will implement it in Python. Why maths? The reason is below!&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75cfnrmid0ps5o5ew3ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75cfnrmid0ps5o5ew3ui.png" alt="Maths meme"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Term Frequency
&lt;/h3&gt;

&lt;p&gt;Term frequency refers to the number of occurrences of a word in a document. Let's see this with a sample:&lt;/p&gt;

&lt;p&gt;"There was an amazing community event happened in Chennai last week and everyone loved all the talks and described it as one of the greatest community events of the entire month. There were 4 talks."&lt;/p&gt;

&lt;p&gt;The statement above is our sample corpus.&lt;/p&gt;

&lt;p&gt;Term Frequency is expressed mathematically as:&lt;/p&gt;

&lt;p&gt;
TF(x, d) = (Number of times the word x has occurred in document d) / (Total number of words in document d)
&lt;/p&gt;
 

&lt;p&gt;So, let's pick one word from the corpus: "community"&lt;/p&gt;

&lt;p&gt;Number of times "community" has occurred = 2&lt;br&gt;
Total number of words in the document = 35&lt;/p&gt;

&lt;p&gt;TF(community, d) = 2/35 ≈ 0.057&lt;/p&gt;

&lt;p&gt;To implement this in Python, here is the code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60porpxbt3jvt3yt6u5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60porpxbt3jvt3yt6u5a.png" alt="Coding Time"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Basic Implementation
&lt;/h3&gt;

&lt;p&gt;To implement this in vanilla Python, we first need a counter function that counts how many times a given word appears in the document, then a function that finds the total number of words, and finally the TF formula itself. Let's do this step by step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Word Counter function
&lt;/span&gt;
&lt;span class="c1"&gt;# Input (Document:str) -&amp;gt; Counter Function -&amp;gt; Output (Number of times the word is repeated)
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wordCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordToCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
    &lt;span class="n"&gt;docLower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Converting to all lower case
&lt;/span&gt;    &lt;span class="n"&gt;wordDict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="c1"&gt;# initalizing the return dict
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-.,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;docNoPunc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docLower&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Repalcing all the punctuation so that it will not interfere.
&lt;/span&gt;    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docNoPunc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Splitting the sentence into an array with the breaking point as a space.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;wordDict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wordDict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Counter logic, counting the words
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wordDict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;wordToCount&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# returning count of the word
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Total number of words Counter
#Input(Document:str) -&amp;gt; TotalWordsCounter Function -&amp;gt; no of words:int
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;totalWordsCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Splitting the sentence into an array with the breaking point as a space.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Returning the length of the array that is equal to number of words.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Implementing the TF formula: 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;numberWordRepeat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wordCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;totalWordsInDoc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;totalWordsCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;fnumberWordRepeat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;totalWordsInDoc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
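&lt;p&gt;As a quick cross-check of the functions above, the same term-frequency computation can be sketched with Python's standard library &lt;code&gt;collections.Counter&lt;/code&gt; and a regex tokenizer (a minimal alternative sketch, not a replacement for the code above):&lt;/p&gt;

```python
import re
from collections import Counter

def term_frequency(word, document):
    # Lowercase alphanumeric tokens; the regex drops punctuation for us
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    counts = Counter(tokens)
    return counts[word.lower()] / len(tokens)

doc = ("There was an amazing community event happened in Chennai last week "
       "and everyone loved all the talks and described it as one of the "
       "greatest community events of the entire month. There were 4 talks.")

# "community" occurs 2 times out of 35 words
print(round(term_frequency("community", doc), 3))  # prints 0.057
```

&lt;p&gt;This reproduces the hand calculation from earlier: 2/35 ≈ 0.057.&lt;/p&gt;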



&lt;p&gt;For the detailed code, please refer &lt;a href="https://github.com/rishiagrawal2609/blogs/blob/main/Term_Frequency.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next article, we will look at IDF (Inverse Document Frequency) to complete the foundation for the TF-IDF algorithm.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Unlock LLMs: SaaS vs. Local Solutions &amp; Crafting Custom LLM for Swagger - Series Intro</title>
      <dc:creator>Rishi Agrawal</dc:creator>
      <pubDate>Sun, 24 Mar 2024 01:14:24 +0000</pubDate>
      <link>https://dev.to/rishiagrawal2609/unlock-llms-saas-vs-local-solutions-crafting-custom-llm-for-swagger-series-intro-1l8m</link>
      <guid>https://dev.to/rishiagrawal2609/unlock-llms-saas-vs-local-solutions-crafting-custom-llm-for-swagger-series-intro-1l8m</guid>
      <description>&lt;p&gt;
As I embark on my journey into the realm of Large Language Models (LLMs), I'm discovering fascinating applications that redefine how we work. From leveraging AI like GitHub Copilot for coding to harnessing ChatGPT for email composition, the possibilities seem endless. However, I'm also intrigued by the limitations posed by these solutions being Software-as-a-Service (SaaS) products, lacking full control.

In this blog, I delve into a topic often overlooked: Swagger API documentation. Join me as I explore the potential of local setups and document my journey. As a newcomer to the world of LLMs, I seek to uncover practical applications and share insights along the way.
&lt;/p&gt;

&lt;p&gt;
Join us in a groundbreaking series as we delve into the world of Large Language Models (LLMs), examining both Software-as-a-Service (SaaS) solutions and local setups. Together, we'll compare their capabilities and uncover the potential of crafting our own local LLM for a unique purpose: Swagger documentation. This uncharted territory promises to revolutionize how we document APIs. Don't miss out on this pioneering exploration!
&lt;/p&gt;

&lt;p&gt;
My goal with this series is to provide a path to understanding LLMs and to help software engineers use them to their full potential in their projects.
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbn2x8xbbw0ss4kmo66d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbn2x8xbbw0ss4kmo66d.png" alt="Big Brain - generated using canva" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is Machine Learning, and why do we need it in the first place?&lt;/em&gt;&lt;br&gt;
As our dependency on technology grows, we have started to use machine learning to predict and classify things, automating processes and reducing the need for human intervention. Nearly any natural process can be represented as a function that takes some input and produces some output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Functions Describes the world"&lt;br&gt;
 Quote from the introduction to Thomas Garrity's "Mathematical Maturity"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And since computers are good at crunching numbers, we can use machine learning to approximate almost any function, given appropriate data to train on.&lt;/p&gt;

&lt;p&gt;Let me define machine learning in layman's terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Machine learning is like teaching a computer to learn from examples rather than programming it with specific instructions. Just like how we learn from experiences, machine learning algorithms analyze data to recognize patterns and make predictions or decisions. Imagine you're teaching a child to differentiate between animals. You show them various animals like cat, dogs, and cows, explaining their unique features like appearance, sound they make, diet, etc. Over time, the child learns to identify each animal correctly without explicit instructions. Similarly, in machine learning, algorithms learn from data to perform tasks such as recognizing spam emails, recommending movies, or even driving cars. It's about enabling computers to learn and improve from data, making them more intelligent and adaptable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Different areas of machine learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistical Machine Learning&lt;/li&gt;
&lt;li&gt;Deep learning (Neural Networks)&lt;/li&gt;
&lt;li&gt;Reinforcement Learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning is a sub-domain of AI, which also covers fields like computer vision and NLP. I started learning AI in my freshman year of my undergraduate degree, and the first project I built was a clone of AlexNet.&lt;/p&gt;

&lt;p&gt;As Jensen Huang, CEO of Nvidia, said at GTC 2024: in 2012 we gave a model a 32x32-pixel image as input and used to get one word back as an answer, and the potential was clear. Today, we give that one word/vector to an AI model and it generates millions of pixels back; that's the age of Generative AI we are heading towards.&lt;/p&gt;

&lt;p&gt;I hope you enjoy the series; share it with friends who want to build a solid understanding of LLMs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tentative Articles (titles might differ)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Exploring the realm of LLMs - what and why?&lt;/li&gt;
&lt;li&gt;LLM SaaS offerings - OpenAI, cloud, and Hugging Face models&lt;/li&gt;
&lt;li&gt;LLM local - deploying an LLM in a local environment&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>mistral7b</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
  </channel>
</rss>
