What is NLP?
NLP stands for Natural Language Processing. It is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
You might say, "Computers understand only 0's and 1's, so how can they understand human language?" If you had the same question, let's dive into how we make a computer understand what natural language means.
All words are just Numbers.
Yes, the title is correct—all words are just numbers when it comes to how computers understand language. In Natural Language Processing (NLP), every word is transformed into a numerical representation, typically a vector. These vectors exist in a high-dimensional space known as vector space, where words with similar meanings are located closer to one another.
This mathematical transformation allows the computer to process and analyze human language efficiently. Instead of understanding words the way humans do, the computer compares these vectors, calculates distances, and identifies patterns based on proximity.
When a task such as prediction, translation, or classification is required, the computer doesn't think in words—it simply locates the vector that is most similar or "nearest" to the input and produces the corresponding output. The rich complexity of human language is distilled down into numerical patterns that machines can rapidly process, making NLP both a fascinating and powerful field in artificial intelligence.
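To make the "nearest vector" idea concrete, here is a minimal sketch using tiny hand-made 3-dimensional vectors (real systems learn vectors with hundreds of dimensions from data; these numbers are purely illustrative) and cosine similarity, a common way to measure how close two word vectors are:

```python
import math

# Toy vectors, invented for illustration only -- real word vectors
# are learned from large text corpora, not written by hand.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))       # dot product
    norm_a = math.sqrt(sum(x * x for x in a))    # length of a
    norm_b = math.sqrt(sum(x * x for x in b))    # length of b
    return dot / (norm_a * norm_b)

# Words with related meanings end up closer in vector space:
print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much smaller
```

"king" and "queen" score near 1 (very similar), while "king" and "apple" score much lower: this proximity is what the computer actually compares instead of "understanding" words.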
There are various methods for this, and in this post we are going to cover one of the fundamentals: the TF (Term Frequency) algorithm, which is used for measuring the importance of a word within a document.
Let's explore this mathematically and then implement it in Python. Why maths? The reason is below!
Term Frequency
Term frequency refers to the number of occurrences of each word in a document. Let's see this with a sample:
"There was an amazing community event happened in Chennai last week and everyone loved all the talks and described it as one of the greatest community events of the entire month. There were 4 talks."
Above statement is the sample corpus.
Term Frequency is expressed mathematically as:

TF(t, d) = (number of times term t appears in document d) / (total number of words in document d)
So, let's pick one word from the corpus: "community"

Number of times "community" occurs = 2
Total number of words in the document = 35

TF(community, d) = 2/35 ≈ 0.057
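As a quick sanity check of this arithmetic, Python's built-in collections.Counter can reproduce the counts:

```python
from collections import Counter

document = ("There was an amazing community event happened in Chennai last week "
            "and everyone loved all the talks and described it as one of the "
            "greatest community events of the entire month. There were 4 talks.")

# Lowercase, strip the period, then split on whitespace
words = document.lower().replace(".", " ").split()
counts = Counter(words)

print(counts["community"])                                    # 2
print(len(document.split()))                                  # 35
print(round(counts["community"] / len(document.split()), 3))  # 0.057
```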
To implement this in Python, here is the code:
Basic Implementation
To implement this with vanilla Python, we first need a counter function that counts how many times a word appears in the given document, then a function that finds the total number of words, and finally the TF formula itself. Let's do this step by step:
# Word counter function
# Input (wordToCount:str, document:str) -> wordCounter -> number of times the word is repeated
def wordCounter(wordToCount: str, document: str) -> int:
    docNoPunc = document.lower()  # converting everything to lower case
    for char in "-.,\n!?;:":
        docNoPunc = docNoPunc.replace(char, " ")  # replacing all punctuation so it will not interfere
    words = docNoPunc.split()  # splitting the text into a list of words on whitespace
    wordDict = {}  # initializing the counts dictionary
    for word in words:
        wordDict[word] = wordDict.get(word, 0) + 1  # counter logic, counting the words
    return wordDict.get(wordToCount, 0)  # returning the count of the word (0 if absent)
# Total number of words counter
# Input (document:str) -> totalWordsCounter -> number of words:int
def totalWordsCounter(document: str) -> int:
    words = document.split()  # splitting the text into a list of words on whitespace
    return len(words)  # the length of the list equals the number of words
# Implementing the TF formula:
def tf(x: str, d: str) -> float:
    numberWordRepeat = wordCounter(x, d)
    totalWordsInDoc = totalWordsCounter(d)
    return round(numberWordRepeat / totalWordsInDoc, 4)  # TF rounded to 4 decimal places
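Putting the pieces together, here is a compact, self-contained sketch of the same logic run on the sample corpus (the function bodies are repeated here, slightly condensed, so the snippet runs on its own):

```python
def wordCounter(wordToCount: str, document: str) -> int:
    docNoPunc = document.lower()
    for char in "-.,\n!?;:":
        docNoPunc = docNoPunc.replace(char, " ")  # strip punctuation
    counts = {}
    for word in docNoPunc.split():
        counts[word] = counts.get(word, 0) + 1
    return counts.get(wordToCount, 0)

def tf(x: str, d: str) -> float:
    return round(wordCounter(x, d) / len(d.split()), 4)

doc = ("There was an amazing community event happened in Chennai last week "
       "and everyone loved all the talks and described it as one of the "
       "greatest community events of the entire month. There were 4 talks.")

print(tf("community", doc))  # 0.0571, matching the hand calculation of 2/35
```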
For detailed code please refer: Here
In the next post, we will look at IDF (Inverse Document Frequency) to complete the foundations of the TF-IDF algorithm.