<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sirisha Chiruvolu</title>
    <description>The latest articles on DEV Community by Sirisha Chiruvolu (@sirisha_chiruvolu_f5136d5).</description>
    <link>https://dev.to/sirisha_chiruvolu_f5136d5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3953230%2F4b61542b-146d-470d-8416-216b7b9ab50e.png</url>
      <title>DEV Community: Sirisha Chiruvolu</title>
      <link>https://dev.to/sirisha_chiruvolu_f5136d5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sirisha_chiruvolu_f5136d5"/>
    <language>en</language>
    <item>
      <title>Finding Similarity Scores Between Text in Natural Language Processing.</title>
      <dc:creator>Sirisha Chiruvolu</dc:creator>
      <pubDate>Mon, 08 Jun 2026 23:55:55 +0000</pubDate>
      <link>https://dev.to/sirisha_chiruvolu_f5136d5/finding-similarity-scores-between-text-in-natural-language-processing-239n</link>
      <guid>https://dev.to/sirisha_chiruvolu_f5136d5/finding-similarity-scores-between-text-in-natural-language-processing-239n</guid>
<description>&lt;p&gt;For the past couple of years, I have been working on the problem of measuring similarity between texts. I have tried many algorithms, and cosine similarity stands out. The algorithm is based on vector algebra: the basic idea is to convert each text into a vector representation. First, we need to find the magnitude of each vector, which is usually called its norm.&lt;br&gt;
Step 1:&lt;br&gt;
$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$&lt;br&gt;
Where:&lt;br&gt;
$x = (x_1, x_2, \dots, x_n)$&lt;br&gt;
$x_i$ are the vector components&lt;br&gt;
$n$ is the number of dimensions&lt;br&gt;
Step 2:&lt;br&gt;
Find the dot product of the two vectors, $x \cdot y$, and divide it by the product of their norms:&lt;br&gt;
$\cos\theta = \frac{\text{adjacent side}}{\text{hypotenuse}} = \frac{x \cdot y^{T}}{\|x\|_2 \, \|y\|_2}$&lt;/p&gt;

&lt;p&gt;This gives a directional similarity, with the effect of magnitude eliminated.&lt;/p&gt;
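&lt;p&gt;As a minimal sketch of the norm, dot product, and cosine calculation described above, here is a small Python example. It assumes a simple bag-of-words count vector as the text representation (the vectorization method is an assumption, not something specified here), and the names text_to_vector, l2_norm, and cosine_similarity are illustrative.&lt;/p&gt;

```python
import math
from collections import Counter


def text_to_vector(text):
    """Turn a text into a bag-of-words count vector (an assumed, simple representation)."""
    return Counter(text.lower().split())


def l2_norm(vec):
    """Magnitude of the vector: square root of the sum of squared components."""
    return math.sqrt(sum(value * value for value in vec.values()))


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    denom = l2_norm(a) * l2_norm(b)
    # Guard against empty texts, whose norm is zero.
    return dot / denom if denom else 0.0


x = text_to_vector("the cat sat on the mat")
y = text_to_vector("the cat lay on the mat")
print(cosine_similarity(x, y))
```

&lt;p&gt;Repeating a text twice doubles the magnitude of its vector but leaves the direction unchanged, so its score against the original text stays at approximately 1.&lt;/p&gt;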

&lt;p&gt;When cos θ = 0, the vectors are perpendicular and the texts are dissimilar; when cos θ = 1, the vectors are aligned and the texts are similar.&lt;br&gt;
Step 3:&lt;br&gt;
The next step is to customize the logic for finding the best match using dynamic programming. In simpler cases, we can take the maximum cosine similarity to find the best-aligned strings. However, if we customize how certain characters should be interpreted, we can take the top picks by cosine similarity and run dynamic programming (DP) over them.&lt;br&gt;
Step 4:&lt;br&gt;
We can then compute a confidence score between the generated string and the original string using ROUGE (“Recall-Oriented Understudy for Gisting Evaluation”), which counts how many n-grams appear in both strings and gives an F1 score.&lt;br&gt;
Recall = (overlapping words or n-grams) / (total words or n-grams in the original text)&lt;br&gt;
Precision = (overlapping words or n-grams) / (total words or n-grams in the generated text)&lt;br&gt;
F1 score = 2 * (precision * recall) / (precision + recall)&lt;br&gt;
We need to focus on precision so that the generated text matches the original text as closely as possible, while recall indicates how many of the positive matches are found. It is important to find the right balance between the two so that the F1 score gives a reliable measure of the match.&lt;br&gt;
I have also experimented with counting regex patterns and wildcard characters using reward and penalty counts while computing the score, and found good results.&lt;/p&gt;
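&lt;p&gt;The recall, precision, and F1 formulas above can be sketched in Python as follows. This is a simplified ROUGE-N style calculation under a few assumptions: whitespace tokenization, lowercasing, and clipped n-gram counts. The function names ngrams and rouge_n are illustrative and do not come from any particular library.&lt;/p&gt;

```python
from collections import Counter


def ngrams(text, n=1):
    """Return a multiset of word n-grams for the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def rouge_n(original, generated, n=1):
    """ROUGE-N style recall, precision, and F1 from overlapping n-grams."""
    ref = ngrams(original, n)
    gen = ngrams(generated, n)
    # Clipped overlap: an n-gram counts at most as often as it appears in both texts.
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(gen.values()) if gen else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_n("the cat sat on the mat", "the cat sat on a mat")
print(recall, precision, f1)
```

&lt;p&gt;In this example, five of the six unigrams overlap in each direction, so recall, precision, and F1 all come out to 5/6.&lt;/p&gt;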

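&lt;p&gt;The dynamic-programming matching step is described only in general terms, so the following is one possible sketch rather than the exact logic used. It assumes that "customizing how characters are interpreted" means treating certain character pairs as equivalent, and it uses a standard edit-distance DP over the top candidates by cosine score. The names aligned_cost, best_match, equivalent, and top_k are all hypothetical.&lt;/p&gt;

```python
def aligned_cost(a, b, equivalent=None):
    """Edit distance via dynamic programming, where character pairs listed as
    equivalent (a hypothetical customization) substitute at zero cost."""
    equivalent = equivalent or set()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            same = ca == cb or (ca, cb) in equivalent or (cb, ca) in equivalent
            curr.append(min(
                prev[j] + 1,                       # delete ca
                curr[j - 1] + 1,                   # insert cb
                prev[j - 1] + (0 if same else 1),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_match(query, scored_candidates, top_k=3, equivalent=None):
    """Keep the top_k candidates by cosine score, then pick the lowest DP cost."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_k]
    return min(top, key=lambda pair: aligned_cost(query, pair[0], equivalent))[0]


# Candidates are (text, cosine score) pairs; the scores here are made up for illustration.
candidates = [("code", 0.90), ("cone", 0.95), ("zzzz", 0.10)]
print(best_match("c0de", candidates, top_k=2, equivalent={("0", "o")}))
```

&lt;p&gt;Here "cone" has the higher cosine score, but once "0" and "o" are treated as equivalent, "code" aligns with "c0de" at zero cost and is chosen instead.&lt;/p&gt;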
</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
