<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Evgenii Perminov</title>
    <description>The latest articles on DEV Community by Evgenii Perminov (@evgeniiperminov).</description>
    <link>https://dev.to/evgeniiperminov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1973401%2F3bc0834c-aae8-4342-9fbb-14588e5533f9.jpg</url>
      <title>DEV Community: Evgenii Perminov</title>
      <link>https://dev.to/evgeniiperminov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/evgeniiperminov"/>
    <language>en</language>
    <item>
      <title>Embeddings clustering with Agglomerative Hierarchical Clustering (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:42:16 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/embeddings-clustering-with-agglomerative-hierarchical-clustering-messy-folder-reorganizer-ai-520k</link>
      <guid>https://dev.to/evgeniiperminov/embeddings-clustering-with-agglomerative-hierarchical-clustering-messy-folder-reorganizer-ai-520k</guid>
      <description>&lt;h1&gt;
  
  
  Adding RAG and ML to Messy-Folder-Reorganizer-AI
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why ML Methods for Clustering
&lt;/h2&gt;

&lt;p&gt;As we discovered in previous articles, every LLM has a limited context window, so we cannot send hundreds of file names to an LLM and ask it to create folder names for all of them. On the other hand, sending a separate request for each file is not only inefficient and redundant; it also discards the global context.&lt;/p&gt;

&lt;p&gt;For example, if you have files like &lt;code&gt;bill_for_electricity.pdf&lt;/code&gt; and &lt;code&gt;bill_for_leasing.docx&lt;/code&gt;, you don’t want to end up with folder names like &lt;code&gt;bills&lt;/code&gt; for the first and &lt;code&gt;documents&lt;/code&gt; for the second. These results are technically valid, but they’re disconnected. &lt;strong&gt;We need to group related files together first&lt;/strong&gt;, and the best way to do that is by clustering their embeddings.&lt;br&gt;
For &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I chose agglomerative hierarchical clustering, and in this article I'll explain that choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Selecting a Clustering Method
&lt;/h2&gt;

&lt;p&gt;There are many clustering algorithms out there, but not all are suitable for the nature of embeddings. We're working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-dimensional vectors&lt;/strong&gt; (e.g., 384, 768, or more dimensions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relatively small datasets&lt;/strong&gt; (e.g., a few hundred or thousand files).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a comparison of a few clustering options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K-Means&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, simple, widely used&lt;/td&gt;
&lt;td&gt;Requires choosing &lt;code&gt;k&lt;/code&gt;, assumes spherical clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBSCAN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects arbitrary shapes, noise handling&lt;/td&gt;
&lt;td&gt;Sensitive to parameters, poor with high dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HDBSCAN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Improved DBSCAN, handles hierarchy&lt;/td&gt;
&lt;td&gt;Slower, more complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agglomerative&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No need for &lt;code&gt;k&lt;/code&gt;, builds hierarchy, flexible distances&lt;/td&gt;
&lt;td&gt;Slower, high memory use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Agglomerative hierarchical clustering&lt;/strong&gt; is a strong fit because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn’t require you to predefine the number of clusters.&lt;/li&gt;
&lt;li&gt;Works well with custom distance metrics (like cosine).&lt;/li&gt;
&lt;li&gt;Builds a dendrogram that can be explored at different levels of granularity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Agglomerative Clustering Preparations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Input: Embedding Matrix
&lt;/h3&gt;

&lt;p&gt;We assume an input matrix of shape &lt;strong&gt;M x N&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;M&lt;/code&gt;: Number of files (embeddings).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;N&lt;/code&gt;: Dimensionality of the embeddings (depends on the model used).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Building a Normalized Matrix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What is normalization?&lt;/strong&gt;&lt;br&gt;
Normalization ensures that all vectors are of unit length, which is especially important when using cosine distance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why normalize?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents length from affecting similarity.&lt;/li&gt;
&lt;li&gt;Ensures cosine distance reflects angular difference only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For vector (x), normalize it as:&lt;/p&gt;

&lt;p&gt;x̂ = x / ||x||&lt;/p&gt;

&lt;p&gt;Where ||x|| is the Euclidean norm (i.e., the square root of the sum of squares of the elements of x).&lt;/p&gt;
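&lt;p&gt;As a minimal sketch of this step (Python for illustration; the project itself is written in Rust), row-wise L2 normalization of the embedding matrix looks like:&lt;/p&gt;

```python
import math

def l2_normalize(vector):
    # Euclidean norm: square root of the sum of squared components
    norm = math.sqrt(sum(x * x for x in vector))
    # Guard against the zero vector
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def normalize_matrix(matrix):
    # Normalize each embedding (row) independently
    return [l2_normalize(row) for row in matrix]

embeddings = [[3.0, 4.0], [1.0, 1.0]]
print(normalize_matrix(embeddings))  # first row becomes [0.6, 0.8]
```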

&lt;h3&gt;
  
  
  Building the Distance Matrix Using Cosine Distance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why cosine distance?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures &lt;strong&gt;semantic similarity&lt;/strong&gt; better in high-dimensional embedding spaces.&lt;/li&gt;
&lt;li&gt;More stable than Euclidean in high dimensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does it help with the curse of dimensionality?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To some extent, yes. While no method fully escapes the curse, &lt;strong&gt;cosine similarity&lt;/strong&gt; is more robust than Euclidean for textual or semantic data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Given two normalized vectors (x) and (y):&lt;/p&gt;

&lt;p&gt;cosine_similarity(x, y) = (x · y) / (‖x‖ · ‖y‖)&lt;/p&gt;

&lt;p&gt;cosine_distance(x, y) = 1 - cosine_similarity(x, y)&lt;/p&gt;
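&lt;p&gt;Because the rows are already unit length, cosine distance reduces to one minus the dot product. A small illustrative sketch (Python, not the project's Rust code) of building the pairwise distance matrix:&lt;/p&gt;

```python
def cosine_distance_matrix(normalized):
    # For unit-length vectors, cosine similarity is just the dot product,
    # so cosine distance is 1.0 - dot(x, y).
    n = len(normalized)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(normalized[i], normalized[j]))
            dist[i][j] = dist[j][i] = 1.0 - dot
    return dist

rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(cosine_distance_matrix(rows))  # orthogonal rows get distance 1.0
```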




&lt;h2&gt;
  
  
  Agglomerative Clustering Algorithm
&lt;/h2&gt;

&lt;p&gt;Once we have the distance matrix, the agglomerative process begins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start&lt;/strong&gt;: Treat each embedding as its own cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt;: Find the two closest clusters using the selected linkage method:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single&lt;/strong&gt;: Minimum distance between points across clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete&lt;/strong&gt;: Maximum distance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average&lt;/strong&gt;: Mean distance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ward&lt;/strong&gt;: Minimizes variance (works only with Euclidean distance).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: Merge the next closest pair until one cluster remains or a distance threshold is reached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cut the dendrogram&lt;/strong&gt;: Decide how many clusters to extract based on height (distance) or desired granularity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This method gives you &lt;strong&gt;interpretable, connected groupings&lt;/strong&gt;—a critical step before folder naming or generating structured representations.&lt;/p&gt;
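&lt;p&gt;The steps above can be sketched as follows (a simplified Python illustration using average linkage and a distance threshold; the project's actual implementation is in Rust):&lt;/p&gt;

```python
def average_linkage(dist, a, b):
    # Mean pairwise distance between the members of clusters a and b
    total = sum(dist[i][j] for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerative(dist, threshold):
    # Start: every point is its own cluster
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: average_linkage(
            dist, clusters[p[0]], clusters[p[1]]))
        if average_linkage(dist, clusters[i], clusters[j]) > threshold:
            break  # cut the dendrogram here
        # Merge: replace the pair with their union
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Two tight groups far apart: expect two clusters at threshold 0.5
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
print(agglomerative(dist, 0.5))
```

&lt;p&gt;Recomputing the linkage for every pair keeps the sketch short; real implementations cache and incrementally update the distance matrix instead.&lt;/p&gt;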




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;If you're interested, you can check out the Rust implementation&lt;br&gt;
&lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/blob/main/src/ml/agglomerative_clustering.rs" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>cli</category>
    </item>
    <item>
      <title>Making Embeddings Understand Files and Folders with Simple Sentences (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:42:08 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/making-embeddings-understand-files-and-folders-with-simple-sentences-messy-folder-reorganizer-ai-mjg</link>
      <guid>https://dev.to/evgeniiperminov/making-embeddings-understand-files-and-folders-with-simple-sentences-messy-folder-reorganizer-ai-mjg</guid>
      <description>&lt;h1&gt;
  
  
  Do Embeddings Need Context? A Practical Look at File-to-Folder Matching
&lt;/h1&gt;

&lt;p&gt;When building smart systems that classify or match content — such as automatically sorting files into folders — embeddings are a powerful tool. But how well do they work with minimal input? And does adding natural language context make a difference?&lt;/p&gt;

&lt;p&gt;While developing &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I found that adding &lt;strong&gt;contextual phrasing&lt;/strong&gt; to file and folder names significantly improved the performance of embedding models, and in this article I'll share those findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Case: Matching Files to Valid Folder Names
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test A: Using Only File and Folder Names
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score     |
|-------------------------|-------------|-----------|
| crack.exe               | apps        | 0.5147713 |
| lovecraft novels.txt    | books       | 0.5832841 |
| police report.docx      | docs        | 0.6303186 |
| database admin.pkg      | docs        | 0.5538312 |
| invoice from google.pdf | docs        | 0.5381457 |
| meme.png                | images      | 0.6993392 |
| funny cat.jpg           | images      | 0.5511819 |
| lord of the ring.avi    | movies      | 0.5454072 |
| harry potter.mpeg4      | movies      | 0.5410566 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test B: Adding Natural Language Context
&lt;/h3&gt;

&lt;p&gt;Each string was framed like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"This is a file name: {file_name}"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"This is a folder name: {folder_name}"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
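&lt;p&gt;In code, this framing is just string templating applied before each embedding request (a hypothetical Python sketch; the helper names are mine, not the CLI's):&lt;/p&gt;

```python
def frame_file(file_name):
    # Wrap the raw name in a natural-language role description
    return f"This is a file name: {file_name}"

def frame_folder(folder_name):
    return f"This is a folder name: {folder_name}"

print(frame_file("invoice from google.pdf"))
```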

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name                       | Folder Name | Score    |
|--------------------------------|-------------|-----------|
| crack.exe                      | apps        | 0.6714907 |
| lovecraft novels.txt           | books       | 0.7517922 |
| database admin.pkg             | dest        | 0.7194574 |
| police report.docx             | docs        | 0.7456068 |
| invoice from google.pdf        | docs        | 0.7141885 |
| meme.png                       | images      | 0.7737676 |
| funny cat.jpg                  | images      | 0.7438067 |
| harry potter.mpeg4             | movies      | 0.7156760 |
| lord of the ring.avi           | movies      | 0.6718528 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Observations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scores were consistently higher&lt;/strong&gt; across the board when context was added.&lt;/li&gt;
&lt;li&gt;The model &lt;strong&gt;made more accurate matches&lt;/strong&gt;, such as correctly associating &lt;code&gt;database admin.pkg&lt;/code&gt; with &lt;code&gt;dest&lt;/code&gt; instead of &lt;code&gt;books&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;This suggests that &lt;strong&gt;embeddings perform better with structured, semantic context&lt;/strong&gt;, not just bare tokens.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Test Case: Only Some Files Have Valid Matches
&lt;/h2&gt;

&lt;p&gt;Now let's delete the movies and images folders and observe how the matching behavior changes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test A: Using Only File and Folder Names
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score      |
|-------------------------|-------------|------------|
| hobbit.fb2              | apps        | 0.55056566 |
| crack.exe               | apps        | 0.5147713  |
| lovecraft novels.txt    | books       | 0.57081085 |
| police report.docx      | docs        | 0.6303186  |
| meme.png                | docs        | 0.58589196 |
| database admin.pkg      | docs        | 0.5538312  |
| invoice from google.pdf | docs        | 0.5381457  |
| lord of the ring.avi    | docs        | 0.492918   |
| funny cat.jpg           | docs        | 0.45956808 |
| harry potter.mpeg4      | docs        | 0.45733657 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test B: Adding Natural Language Context
&lt;/h3&gt;

&lt;p&gt;The same context-generation pattern was used as in the previous test case.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score      |
|-------------------------|-------------|------------|
| crack.exe               | apps        | 0.6714907  |
| lovecraft novels.txt    | books       | 0.72899115 |
| database admin.pkg      | dest        | 0.7194574  |
| meme.png                | dest        | 0.68507683 |
| funny cat.jpg           | dest        | 0.6797525  |
| lord of the ring.avi    | dest        | 0.5323342  |
| police report.docx      | docs        | 0.7456068  |
| invoice from google.pdf | docs        | 0.71418846 |
| hobbit.fb2              | docs        | 0.6780642  |
| harry potter.mpeg4      | docs        | 0.5984984  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Observations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In Test A, files like &lt;code&gt;meme.png&lt;/code&gt;, &lt;code&gt;funny cat.jpg&lt;/code&gt;, and &lt;code&gt;lord of the ring.avi&lt;/code&gt; were incorrectly matched to the &lt;code&gt;docs&lt;/code&gt; folder. In Test B, they landed in the more appropriate &lt;code&gt;dest&lt;/code&gt; folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are still some mismatches. For example, &lt;code&gt;hobbit.fb2&lt;/code&gt; was matched with &lt;code&gt;docs&lt;/code&gt; instead of &lt;code&gt;books&lt;/code&gt;, likely due to the less common &lt;code&gt;.fb2&lt;/code&gt; format, and &lt;code&gt;harry potter.mpeg4&lt;/code&gt; also matched &lt;code&gt;docs&lt;/code&gt;, though with a relatively low score.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Does This Happen?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Context Gives Structure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Embedding models are trained on natural language. So when we provide structured inputs like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This is a file name: invoice from google.pdf”&lt;br&gt;&lt;br&gt;
“This is a folder name: docs”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;...the model better understands the &lt;strong&gt;semantic role&lt;/strong&gt; of each string. It knows these aren't just tokens — they are &lt;em&gt;types of things&lt;/em&gt;, which makes embeddings more aligned.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. &lt;strong&gt;It’s Not Just Word Overlap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes, phrases like &lt;code&gt;"this is a file name"&lt;/code&gt; and &lt;code&gt;"this is a folder name"&lt;/code&gt; are similar. But if word overlap were the only reason for higher scores, all scores would rise evenly — regardless of actual content.&lt;/p&gt;

&lt;p&gt;Instead, we're seeing better matching. That means the model is using &lt;strong&gt;true context&lt;/strong&gt; to judge compatibility — a sign that semantic meaning is being used, not just lexical similarity.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;strong&gt;Raw Strings Without Context Can Be Misleading&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A folder named &lt;code&gt;docs&lt;/code&gt; or &lt;code&gt;my-pc&lt;/code&gt; is vague. A file named &lt;code&gt;database admin.pkg&lt;/code&gt; is even more so. Embeddings of such raw strings might be overly similar due to lack of semantic separation. &lt;/p&gt;

&lt;p&gt;Adding even a light wrapper like &lt;code&gt;"This is a file name..."&lt;/code&gt; or &lt;code&gt;"This is a folder name..."&lt;/code&gt; gives the model &lt;strong&gt;clearer context and role assignment&lt;/strong&gt;, helping it avoid false positives and improve semantic accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings require context to be effective&lt;/strong&gt;, especially for classification or matching tasks.&lt;/li&gt;
&lt;li&gt;Providing &lt;strong&gt;natural-language-like structure&lt;/strong&gt; (even just a short prefix) significantly improves performance.&lt;/li&gt;
&lt;li&gt;It’s not just about higher scores — it’s about &lt;strong&gt;better semantics and more accurate results&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building tools that rely on embeddings, especially for classification, recommendation, or clustering — &lt;strong&gt;don't be afraid to add a little helpful context.&lt;/strong&gt; It goes a long way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Cosine Similarity Helped My CLI Decide Where Files Belong (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:41:57 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/how-cosine-similarity-helped-my-cli-decide-where-files-belong-messy-folder-reorganizer-ai-fm3</link>
      <guid>https://dev.to/evgeniiperminov/how-cosine-similarity-helped-my-cli-decide-where-files-belong-messy-folder-reorganizer-ai-fm3</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In version 0.2 of &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I used the Qdrant vector database to search for similar vectors. This was necessary to determine which folder a file should go into based on its embedding. Because of this, I needed to revisit different distance/similarity metrics and choose the most appropriate one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right Vector Similarity Metric in Qdrant
&lt;/h2&gt;

&lt;p&gt;Qdrant supports the following distance/similarity metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dot Product&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cosine Similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Euclidean Distance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manhattan Distance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Distance/Similarity Formulas
&lt;/h3&gt;

&lt;p&gt;Let &lt;strong&gt;x&lt;/strong&gt; and &lt;strong&gt;y&lt;/strong&gt; be two vectors of dimensionality &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cosine Similarity
&lt;/h4&gt;

&lt;p&gt;cosine(x, y) = (x · y) / (‖x‖ · ‖y‖)&lt;/p&gt;

&lt;h4&gt;
  
  
  Dot Product
&lt;/h4&gt;

&lt;p&gt;dot(x, y) = Σ (xᵢ * yᵢ)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ If vectors are normalized to unit length, then: &lt;code&gt;cosine(x, y) = dot(x, y)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Euclidean Distance
&lt;/h4&gt;

&lt;p&gt;euclidean(x, y) = sqrt(Σ (xᵢ - yᵢ)²)&lt;/p&gt;

&lt;h4&gt;
  
  
  Manhattan Distance (L1)
&lt;/h4&gt;

&lt;p&gt;manhattan(x, y) = Σ |xᵢ - yᵢ|&lt;/p&gt;

&lt;p&gt;When working with high-dimensional vectors (e.g., 1024 dimensions, as in the &lt;strong&gt;mxbai-embed-large:latest&lt;/strong&gt; Ollama model) that have &lt;strong&gt;small magnitudes&lt;/strong&gt;, &lt;strong&gt;Cosine Similarity&lt;/strong&gt; is often the best choice — especially for embeddings.&lt;/p&gt;
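&lt;p&gt;For reference, the four metrics can be written out directly (a plain-Python sketch for illustration):&lt;/p&gt;

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# For unit-length vectors, cosine equals the dot product
x, y = [1.0, 0.0], [0.0, 1.0]
print(cosine(x, y), dot(x, y))  # both 0.0 for orthogonal unit vectors
```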




&lt;h2&gt;
  
  
  Why Cosine Similarity is a Good Choice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Focuses on orientation, not magnitude
&lt;/h3&gt;

&lt;p&gt;Cosine similarity measures the angle between vectors. It tells you how similar their directions are, regardless of vector length. This is useful when comparing embeddings, where absolute length may not be meaningful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in normalization
&lt;/h3&gt;

&lt;p&gt;Cosine similarity is equivalent to the dot product of &lt;strong&gt;L2-normalized vectors&lt;/strong&gt;, which helps reduce the effect of the "curse of dimensionality."&lt;/p&gt;

&lt;h3&gt;
  
  
  Great for semantic embeddings
&lt;/h3&gt;

&lt;p&gt;Cosine similarity works very well when vectors represent meaning or context. Many models (e.g., OpenAI, BERT, Sentence Transformers) are trained with cosine similarity in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Efficient
&lt;/h3&gt;

&lt;p&gt;Can be computed quickly even in high dimensions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cosine Similarity in Detail
&lt;/h2&gt;

&lt;p&gt;Imagine two arrows (vectors) starting from the origin in a multi-dimensional space. Cosine similarity measures the &lt;strong&gt;angle between them&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If they point in &lt;strong&gt;exactly the same direction&lt;/strong&gt;, similarity = &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If they are &lt;strong&gt;completely opposite&lt;/strong&gt;, similarity = &lt;code&gt;-1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If they are &lt;strong&gt;orthogonal&lt;/strong&gt; (90° apart), similarity = &lt;code&gt;0.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The closer the angle is to zero, the more similar the vectors are.&lt;/p&gt;




&lt;h3&gt;
  
  
  Formula
&lt;/h3&gt;

&lt;p&gt;Given two vectors &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;B&lt;/strong&gt;, cosine similarity is calculated as: &lt;/p&gt;

&lt;p&gt;cos(θ) = (A · B) / (||A|| * ||B||)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;A · B&lt;/code&gt; is the dot product of the vectors
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;||A||&lt;/code&gt; and &lt;code&gt;||B||&lt;/code&gt; are the magnitudes (lengths) of the vectors&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Let's take two simple 2D vectors:&lt;/p&gt;

&lt;p&gt;A = [1, 2]&lt;br&gt;
B = [2, 3]&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Dot Product:
&lt;/h4&gt;

&lt;p&gt;A · B = (1 * 2) + (2 * 3) = 2 + 6 = 8&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Magnitudes:
&lt;/h4&gt;

&lt;p&gt;||A|| = √(1² + 2²) = √5 ≈ 2.236&lt;br&gt;
||B|| = √(2² + 3²) = √13 ≈ 3.606&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Cosine Similarity:
&lt;/h4&gt;

&lt;p&gt;cos(θ) = 8 / (2.236 * 3.606) ≈ 8 / 8.063 ≈ 0.992&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: 0.992&lt;/strong&gt; — Very high similarity!&lt;/p&gt;
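&lt;p&gt;The arithmetic above can be verified in a few lines of plain Python:&lt;/p&gt;

```python
import math

a, b = [1.0, 2.0], [2.0, 3.0]
dot = sum(x * y for x, y in zip(a, b))     # 1*2 + 2*3 = 8.0
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(5)
norm_b = math.sqrt(sum(x * x for x in b))  # sqrt(13)
similarity = dot / (norm_a * norm_b)
print(round(similarity, 3))
```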




&lt;h2&gt;
  
  
  In the Context of the CLI
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;messy-folder-reorganizer-ai&lt;/code&gt;, embeddings represent file and folder names. Cosine similarity allows the CLI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find files with similar meaning or content
&lt;/li&gt;
&lt;li&gt;Group files together
&lt;/li&gt;
&lt;li&gt;Match files to folder "themes" based on vector similarity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Adding RAG and ML to AI files reorganization CLI (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:41:36 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/adding-rag-and-ml-to-ai-files-reorganization-cli-messy-folder-reorganizer-ai-1d3</link>
      <guid>https://dev.to/evgeniiperminov/adding-rag-and-ml-to-ai-files-reorganization-cli-messy-folder-reorganizer-ai-1d3</guid>
      <description>&lt;p&gt;A month ago, I created the first naive version of a CLI tool for AI-powered file reorganization in Rust — &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;. It sent file names and paths to Ollama and asked the LLM to generate new paths for each file. This worked fine for a small number of files, but once the count exceeded around 50, the LLM context filled up quickly.&lt;/p&gt;

&lt;p&gt;So, I decided to improve the entire workflow by integrating RAG (Retrieval-Augmented Generation).&lt;/p&gt;




&lt;h2&gt;
  
  
  Version 0.2 Workflow Updates
&lt;/h2&gt;

&lt;p&gt;Here’s how adding RAG and a bit of ML helped improve the file reorganization flow in the CLI:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Custom Source and Destination Paths
&lt;/h3&gt;

&lt;p&gt;First, I allowed users to specify different paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;source path&lt;/strong&gt; where files are located.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;destination path&lt;/strong&gt; where files will be moved.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Adding RAG with Qdrant
&lt;/h3&gt;

&lt;p&gt;Next, I introduced RAG into the system. As a vector database, I chose &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; — an open-source, easy-to-run local vector store.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Currently, users need to manually download and launch Qdrant. Automatic setup is planned for future versions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core of RAG is generating embeddings from text. Here's the step-by-step:&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Embedding Folder and File Names
&lt;/h3&gt;

&lt;p&gt;The CLI sends destination folder names and source file names to an Ollama embedding model. The model returns an embedding (vector) for each name.&lt;/p&gt;

&lt;h4&gt;
  
  
  Contextualizing the Input
&lt;/h4&gt;

&lt;p&gt;Instead of sending raw names, I added context like:&lt;br&gt;&lt;br&gt;
&lt;code&gt;"This is a folder name: {folder_name}"&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A more detailed explanation will be in the next article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Embedding Model Selection
&lt;/h4&gt;

&lt;p&gt;Different models return vectors of different dimensions. I used the &lt;strong&gt;mxbai-embed-large:latest&lt;/strong&gt; model from Ollama, which produces 1024-dimensional vectors. It performed well for most use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Storing Folder Embeddings in Qdrant
&lt;/h3&gt;

&lt;p&gt;Each destination folder's embedding is stored in Qdrant, with the original folder name included as payload metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Matching Files to Closest Folders
&lt;/h3&gt;

&lt;p&gt;For each source file embedding, the CLI searches Qdrant for the closest destination folder vector.&lt;br&gt;&lt;br&gt;
Qdrant returns the most similar match along with a similarity score.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More about similarity measures and why I picked a particular one will be covered in the third article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Threshold-Based Filtering
&lt;/h3&gt;

&lt;p&gt;The CLI compares each similarity score to a configurable threshold (set via config files). If no suitable match is found, the file is filtered out and sent to an additional step — &lt;strong&gt;clustering&lt;/strong&gt; and &lt;strong&gt;folder name generation via LLM&lt;/strong&gt;.&lt;/p&gt;
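&lt;p&gt;A sketch of this filtering step (illustrative Python with hypothetical names; the real CLI is written in Rust and reads its threshold from config files):&lt;/p&gt;

```python
def split_by_threshold(matches, threshold):
    # matches: list of (file_name, best_folder, similarity_score) tuples
    matched, unmatched = [], []
    for file_name, folder, score in matches:
        if score >= threshold:
            matched.append((file_name, folder))
        else:
            unmatched.append(file_name)  # goes on to clustering + LLM naming
    return matched, unmatched

matches = [("police report.docx", "docs", 0.75), ("hobbit.fb2", "docs", 0.40)]
print(split_by_threshold(matches, 0.6))
```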

&lt;h3&gt;
  
  
  7. Clustering Unmatched Files
&lt;/h3&gt;

&lt;p&gt;Since LLMs struggle with large input contexts, we split unmatched files into clusters using machine learning — specifically &lt;strong&gt;agglomerative hierarchical clustering&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More details about clustering are in the fourth article in this series.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  8. Naming Clusters via LLM
&lt;/h3&gt;

&lt;p&gt;Once clustering is complete, we end up with small, manageable groups of files. For each cluster, we send a prompt to the LLM to generate a suitable folder name.&lt;/p&gt;

&lt;p&gt;After some LLM thinking time, we receive the missing folder names and can show the user a preview of the proposed file reorganization.&lt;/p&gt;
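&lt;p&gt;As an illustration, building one compact request per cluster might look like this (the prompt wording and helper name are my assumptions, not the CLI's actual template):&lt;/p&gt;

```python
def cluster_prompt(file_names):
    # One short prompt per cluster keeps the LLM context small
    listing = "\n".join("- " + name for name in file_names)
    return "Suggest a short folder name for these related files:\n" + listing

print(cluster_prompt(["bill_for_electricity.pdf", "bill_for_leasing.docx"]))
```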

&lt;h3&gt;
  
  
  9. Applying the Changes
&lt;/h3&gt;

&lt;p&gt;If the user is happy with the proposed structure, they can confirm it. The CLI will then move the files to their new paths accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the upcoming articles, I’ll dive into some of the more technical and interesting parts of the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to choose a similarity search method.&lt;/li&gt;
&lt;li&gt;Ways to improve embeddings for files and folders.&lt;/li&gt;
&lt;li&gt;Selecting and preparing data for clustering.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>cli</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
    <item>
      <title>How I Built a Local LLM-Powered File Reorganizer with Rust</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Wed, 19 Feb 2025 15:20:29 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/how-i-built-a-local-llm-powered-file-reorganizer-in-rust-1bip</link>
      <guid>https://dev.to/evgeniiperminov/how-i-built-a-local-llm-powered-file-reorganizer-in-rust-1bip</guid>
      <description>&lt;h1&gt;
  
  
  Introduction: Diving (Back) Into Rust
&lt;/h1&gt;

&lt;p&gt;Some time ago, I decided to dive into Rust &lt;strong&gt;once again&lt;/strong&gt;—this must be my &lt;em&gt;nth&lt;/em&gt; attempt. I’d tried learning it before, but each time I either got swamped by the borrow checker or got sidetracked by other projects. This time, I wanted a small, &lt;em&gt;practical&lt;/em&gt; project to force myself to stick with Rust. The result is &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/tree/main" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, a command-line tool for file organization powered by a local LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inspiration: A Bloated Downloads Folder
&lt;/h2&gt;

&lt;p&gt;The main motivation was my messy &lt;strong&gt;Downloads&lt;/strong&gt; folder, which often ballooned to hundreds of files—images, documents, installers—essentially chaos. Instead of manually sorting through them, I thought, “Why not let an AI propose a structure?”&lt;/p&gt;




&lt;h2&gt;
  
  
  Discovering Local LLMs
&lt;/h2&gt;

&lt;p&gt;While brainstorming, I stumbled upon the possibility of running LLMs &lt;strong&gt;locally&lt;/strong&gt; with tools like Ollama and other self-hosted frameworks. I loved the idea of &lt;strong&gt;not sending&lt;/strong&gt; my data to some cloud service. So I decided to build a Rust-based CLI that &lt;strong&gt;queries&lt;/strong&gt; a local LLM server for suggestions on how to reorganize my folders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges: LLM &amp;amp; Large Folders
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Model:&lt;/strong&gt; I started using &lt;code&gt;llama3.2:1b&lt;/code&gt;, but the responses didn’t follow prompt instructions well, so I switched to &lt;strong&gt;deepseek-r1&lt;/strong&gt;, which performed much better.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Limits:&lt;/strong&gt; When testing on folders with many files, the model began forgetting the beginning of the prompt and stopped following instructions properly. Increasing &lt;code&gt;num_ctx&lt;/code&gt; (which defines the model’s context size) helped partially, but the model still struggles with &lt;strong&gt;100+ files&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Possible Solutions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batching Requests:&lt;/strong&gt; Split the file list into smaller chunks and send multiple prompts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other Ideas:&lt;/strong&gt; If you’re an LLM expert—especially with local models like Ollama—I’d love advice on how to handle larger sets without hitting memory or context limits.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
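&lt;p&gt;The batching idea above can be sketched as a simple chunking helper; the batch size here is arbitrary and would in practice be tuned against the model’s &lt;code&gt;num_ctx&lt;/code&gt;:&lt;/p&gt;

```rust
// Sketch of the batching workaround: split the file list into chunks
// and prompt the model once per chunk, so no single prompt overflows
// the context window.
fn batch_files(files: &[String], batch_size: usize) -> Vec<Vec<String>> {
    // `max(1)` guards against a zero batch size, which would panic.
    files.chunks(batch_size.max(1)).map(|c| c.to_vec()).collect()
}

fn main() {
    let files: Vec<String> = (1..=10).map(|i| format!("file_{i}.txt")).collect();
    let batches = batch_files(&files, 4);
    println!("{} batches", batches.len()); // chunks of 4 + 4 + 2 files
}
```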




&lt;h2&gt;
  
  
  CLI Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configurable Model:&lt;/strong&gt; Specify the local LLM endpoint, model name, or other model options.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable Prompts:&lt;/strong&gt; Tweak the AI prompt to fine-tune how the model interprets your folder’s contents.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirmation Prompt:&lt;/strong&gt; The tool shows you the proposed structure and asks for confirmation before reorganizing any files.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rust Community:&lt;/strong&gt; I’d love code feedback — best practices, performance tips, or suggestions on how to structure the CLI.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Gurus:&lt;/strong&gt; Any advice on optimizing local model inference for large file sets or advanced chunking strategies would be invaluable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project has been a great way to re-learn some Rust features and experiment with local AI solutions. While it works decently for medium-sized folders, there’s plenty of room to grow. If this concept resonates with you—maybe your Downloads folder is as messy as mine—give it a try, open an issue, or contribute a pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Feel free to reach out on the &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/issues" rel="noopener noreferrer"&gt;GitHub issues page&lt;/a&gt;, or drop me a note if you have any thoughts, suggestions, or just want to talk about Rust and AI!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
