DEV Community

Sirisha Chiruvolu


Finding Similarity Scores Between Text in Natural Language Processing.

For the past couple of years, I have been toying with the problem of finding similarity between texts. I have tried many algorithms, and cosine similarity stands out. The algorithm is based on vector algebra: the basic idea is to convert each text into a vector representation. First, we need to find the magnitude of each vector, which is typically called its norm.
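Step 1:
Compute the cosine similarity of the two vectors. Start with the Euclidean (L2) norm of each vector:
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
Where:
$x = (x_1, x_2, \ldots, x_n)$
$x_i$ are the vector components
$n$ is the number of dimensions
Next, find the dot product of the two vectors:
$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$
In a right triangle, $\cos\theta$ is the adjacent side divided by the hypotenuse. For two vectors, the cosine of the angle between them is the dot product divided by the product of the norms:
$$\cos\theta = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$$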

Dividing by the norms removes the effect of magnitude, so the score measures only how closely the two vectors point in the same direction.

When cos θ = 0, the vectors are perpendicular and the texts are dissimilar; when cos θ = 1, the vectors are aligned and the texts are similar.
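To make the formula concrete, here is a minimal Python sketch of cosine similarity between two texts. The post does not say how the text is turned into a vector, so the bag-of-words counting used here (lowercase, split on whitespace) is an assumption, and the function name `cosine_similarity` is only illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    # Assumed vectorization (not from the post): lowercase, split on
    # whitespace, and count how often each word occurs.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product: only words present in both texts contribute.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    # Euclidean (L2) norm of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("the cat sat on the mat", "the cat sat"))
    print(cosine_similarity("apple banana", "cherry date"))
```

Texts with no words in common score 0, and texts with identical word counts score 1 (up to floating-point rounding). Because the counts are never negative, the score stays between 0 and 1 for this vectorization.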
Step 2:
The next step is to customize the logic for finding the best match using dynamic programming (DP). In simpler cases, we can take the candidate with the maximum cosine similarity as the best-aligned string. However, if certain characters need to be interpreted in a custom way, we can take the top-scoring candidates and run the DP over them.
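The post does not spell out which DP is used, so the following is only one possible reading: a standard edit-distance DP in which selected character pairs are treated as interchangeable, run over a short list of candidates that are assumed to be the top picks by cosine similarity. The names `weighted_edit_distance`, `best_match`, and the `equivalent` parameter are hypothetical.

```python
def weighted_edit_distance(a: str, b: str, equivalent=frozenset()) -> int:
    """Edit distance where character pairs in `equivalent` cost nothing.

    `equivalent` is a hypothetical set of (char, char) pairs that should be
    interpreted as the same character, for example ("0", "O").
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            same = ca == cb or (ca, cb) in equivalent or (cb, ca) in equivalent
            curr.append(min(
                prev[j] + 1,                       # delete ca
                curr[j - 1] + 1,                   # insert cb
                prev[j - 1] + (0 if same else 1),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_match(query: str, candidates, equivalent=frozenset()) -> str:
    """Return the candidate with the smallest custom edit distance.

    `candidates` is assumed to already be the top picks by cosine similarity.
    """
    return min(candidates,
               key=lambda c: weighted_edit_distance(query, c, equivalent))


if __name__ == "__main__":
    rules = {("0", "O")}  # illustrative rule: read zero as the letter O
    print(best_match("R0AD", ["READ", "ROAD"], rules))
```

With the illustrative rule in place, "ROAD" is at distance 0 from "R0AD" while "READ" is at distance 1, so the DP breaks a tie that plain edit distance would leave unresolved.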
Step 3:
We can then compute a confidence score between the generated string and the original string using ROUGE ("Recall-Oriented Understudy for Gisting Evaluation"), which counts how many n-grams appear in both strings and reports an F1 score.
Recall = (number of overlapping words or n-grams) / (total words or n-grams in the original text)
Precision = (number of overlapping words or n-grams) / (total words or n-grams in the generated text)
F1 score = 2 * (precision * recall) / (precision + recall)
Precision tells us how much of the generated text also appears in the original, so high precision means the generated text matches the original closely. Recall tells us how much of the original text was found in the generated text. It is important to find the right balance between the two, and the F1 score, which is their harmonic mean, gives a single metric for the quality of the match.
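As a rough illustration of these three quantities, here is a small Python sketch of an n-gram overlap score in the style of ROUGE-N. It is not the official ROUGE implementation: the whitespace tokenization and the function names `ngrams` and `rouge_n` are assumptions made for this example.

```python
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Count the word n-grams in a text (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(original: str, generated: str, n: int = 1) -> dict:
    """Precision, recall, and F1 based on n-gram overlap."""
    ref = ngrams(original, n)
    gen = ngrams(generated, n)

    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated n-gram is only credited as often as both texts have it.
    overlap = sum((ref & gen).values())

    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(gen.values()) if gen else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_n("the cat sat on the mat", "the cat sat", n=1))
    print(rouge_n("the cat sat on the mat", "the cat sat", n=2))
```

In the unigram case above, every generated word appears in the original (precision 1.0), but only three of the six original words are recovered (recall 0.5), which is the kind of imbalance the F1 score is meant to expose.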
I have also experimented with counting regex patterns and wildcard characters through reward and penalty counts when computing the match score, and found good results.
