<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Evgenii Perminov</title>
    <description>The latest articles on DEV Community by Evgenii Perminov (@evgeniiperminov).</description>
    <link>https://dev.to/evgeniiperminov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1973401%2F3bc0834c-aae8-4342-9fbb-14588e5533f9.jpg</url>
      <title>DEV Community: Evgenii Perminov</title>
      <link>https://dev.to/evgeniiperminov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/evgeniiperminov"/>
    <language>en</language>
    <item>
      <title>Embeddings clustering with Agglomerative Hierarchical Clustering (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:42:16 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/embeddings-clustering-with-agglomerative-hierarchical-clustering-messy-folder-reorganizer-ai-520k</link>
      <guid>https://dev.to/evgeniiperminov/embeddings-clustering-with-agglomerative-hierarchical-clustering-messy-folder-reorganizer-ai-520k</guid>
      <description>&lt;h1&gt;
  
  
  Adding RAG and ML to Messy-Folder-Reorganizer-AI
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why ML Methods for Clustering
&lt;/h2&gt;

&lt;p&gt;As we discovered in previous articles, every LLM has a limited context window, so we cannot send hundreds of file names to an LLM and ask it to create folder names for all of them. On the other hand, sending a separate request for each file is not only inefficient and redundant; it also discards the global context.&lt;/p&gt;

&lt;p&gt;For example, if you have files like &lt;code&gt;bill_for_electricity.pdf&lt;/code&gt; and &lt;code&gt;bill_for_leasing.docx&lt;/code&gt;, you don’t want to end up with folder names like &lt;code&gt;bills&lt;/code&gt; for the first and &lt;code&gt;documents&lt;/code&gt; for the second. These results are technically valid, but they’re disconnected. &lt;strong&gt;We need to group related files together first&lt;/strong&gt;, and the best way to do that is by clustering their embeddings.&lt;br&gt;
For &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I chose agglomerative hierarchical clustering, and in this article I'll explain that choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Selecting a Clustering Method
&lt;/h2&gt;

&lt;p&gt;There are many clustering algorithms out there, but not all are suitable for the nature of embeddings. We're working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-dimensional vectors&lt;/strong&gt; (e.g., 384, 768, or more dimensions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relatively small datasets&lt;/strong&gt; (e.g., a few hundred or thousand files).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a comparison of a few clustering options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K-Means&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, simple, widely used&lt;/td&gt;
&lt;td&gt;Requires choosing &lt;code&gt;k&lt;/code&gt;, assumes spherical clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBSCAN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects arbitrary shapes, noise handling&lt;/td&gt;
&lt;td&gt;Sensitive to parameters, poor with high dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HDBSCAN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Improved DBSCAN, handles hierarchy&lt;/td&gt;
&lt;td&gt;Slower, more complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agglomerative&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No need for &lt;code&gt;k&lt;/code&gt;, builds hierarchy, flexible distances&lt;/td&gt;
&lt;td&gt;Slower, high memory use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Agglomerative hierarchical clustering&lt;/strong&gt; is a strong fit because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn’t require you to predefine the number of clusters.&lt;/li&gt;
&lt;li&gt;Works well with custom distance metrics (like cosine).&lt;/li&gt;
&lt;li&gt;Builds a dendrogram that can be explored at different levels of granularity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Agglomerative Clustering Preparations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Input: Embedding Matrix
&lt;/h3&gt;

&lt;p&gt;We assume an input matrix of shape &lt;strong&gt;M x N&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;M&lt;/code&gt;: Number of files (embeddings).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;N&lt;/code&gt;: Dimensionality of the embeddings (depends on the model used).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Building a Normalized Matrix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What is normalization?&lt;/strong&gt;&lt;br&gt;
Normalization ensures that all vectors are of unit length, which is especially important when using cosine distance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why normalize?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents length from affecting similarity.&lt;/li&gt;
&lt;li&gt;Ensures cosine distance reflects angular difference only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For vector (x), normalize it as:&lt;/p&gt;

&lt;p&gt;x̂ = x / ||x||&lt;/p&gt;

&lt;p&gt;Where ||x|| is the Euclidean norm (i.e., the square root of the sum of squares of the elements of x).&lt;/p&gt;
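&lt;p&gt;As a minimal sketch of this step (Python for illustration; the project itself is written in Rust), row-wise L2 normalization of the embedding matrix looks like:&lt;/p&gt;

```python
import math

def l2_normalize(vector):
    # Euclidean norm: square root of the sum of squared components
    norm = math.sqrt(sum(x * x for x in vector))
    # Guard against the zero vector
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def normalize_matrix(matrix):
    # Normalize each embedding (row) independently
    return [l2_normalize(row) for row in matrix]

embeddings = [[3.0, 4.0], [1.0, 1.0]]
print(normalize_matrix(embeddings))  # first row becomes [0.6, 0.8]
```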

&lt;h3&gt;
  
  
  Building the Distance Matrix Using Cosine Distance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why cosine distance?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures &lt;strong&gt;semantic similarity&lt;/strong&gt; better in high-dimensional embedding spaces.&lt;/li&gt;
&lt;li&gt;More stable than Euclidean in high dimensions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does it help with the curse of dimensionality?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To some extent, yes. While no method fully escapes the curse, &lt;strong&gt;cosine similarity&lt;/strong&gt; is more robust than Euclidean for textual or semantic data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Given two normalized vectors (x) and (y):&lt;/p&gt;

&lt;p&gt;cosine_similarity(x, y) = (x · y) / (‖x‖ · ‖y‖)&lt;/p&gt;

&lt;p&gt;cosine_distance(x, y) = 1 - cosine_similarity(x, y)&lt;/p&gt;
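&lt;p&gt;Because the rows are already unit length, cosine distance reduces to one minus the dot product. A small illustrative sketch (Python, not the project's Rust code) of building the pairwise distance matrix:&lt;/p&gt;

```python
def cosine_distance_matrix(normalized):
    # For unit-length vectors, cosine similarity is just the dot product,
    # so cosine distance is 1.0 - dot(x, y).
    n = len(normalized)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(normalized[i], normalized[j]))
            dist[i][j] = dist[j][i] = 1.0 - dot
    return dist

rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(cosine_distance_matrix(rows))  # orthogonal rows get distance 1.0
```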




&lt;h2&gt;
  
  
  Agglomerative Clustering Algorithm
&lt;/h2&gt;

&lt;p&gt;Once we have the distance matrix, the agglomerative process begins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start&lt;/strong&gt;: Treat each embedding as its own cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt;: Find the two closest clusters using the selected linkage method:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single&lt;/strong&gt;: Minimum distance between points across clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete&lt;/strong&gt;: Maximum distance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average&lt;/strong&gt;: Mean distance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ward&lt;/strong&gt;: Minimizes variance (works only with Euclidean distance).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: Merge the next closest pair until one cluster remains or a distance threshold is reached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cut the dendrogram&lt;/strong&gt;: Decide how many clusters to extract based on height (distance) or desired granularity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This method gives you &lt;strong&gt;interpretable, connected groupings&lt;/strong&gt;—a critical step before folder naming or generating structured representations.&lt;/p&gt;
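&lt;p&gt;The steps above can be sketched as follows (a simplified Python illustration using average linkage and a distance threshold; the project's actual implementation is in Rust):&lt;/p&gt;

```python
def average_linkage(dist, a, b):
    # Mean pairwise distance between the members of clusters a and b
    total = sum(dist[i][j] for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerative(dist, threshold):
    # Start: every point is its own cluster
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: average_linkage(
            dist, clusters[p[0]], clusters[p[1]]))
        if average_linkage(dist, clusters[i], clusters[j]) > threshold:
            break  # cut the dendrogram here
        # Merge: replace the pair with their union
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Two tight groups far apart: expect two clusters at threshold 0.5
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
print(agglomerative(dist, 0.5))
```

&lt;p&gt;Recomputing the linkage for every pair keeps the sketch short; real implementations cache and incrementally update the distance matrix instead.&lt;/p&gt;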




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;If you're interested, you can check out the Rust implementation&lt;br&gt;
&lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/blob/main/src/ml/agglomerative_clustering.rs" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>cli</category>
    </item>
    <item>
      <title>Making Embeddings Understand Files and Folders with Simple Sentences (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:42:08 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/making-embeddings-understand-files-and-folders-with-simple-sentences-messy-folder-reorganizer-ai-mjg</link>
      <guid>https://dev.to/evgeniiperminov/making-embeddings-understand-files-and-folders-with-simple-sentences-messy-folder-reorganizer-ai-mjg</guid>
      <description>&lt;h1&gt;
  
  
  Do Embeddings Need Context? A Practical Look at File-to-Folder Matching
&lt;/h1&gt;

&lt;p&gt;When building smart systems that classify or match content — such as automatically sorting files into folders — embeddings are a powerful tool. But how well do they work with minimal input? And does adding natural language context make a difference?&lt;/p&gt;

&lt;p&gt;While developing &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I found that adding &lt;strong&gt;contextual phrasing&lt;/strong&gt; to file and folder names significantly improved the performance of embedding models, and in this article I'll share those findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Case: Matching Files to Valid Folder Names
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test A: Using Only File and Folder Names
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score     |
|-------------------------|-------------|-----------|
| crack.exe               | apps        | 0.5147713 |
| lovecraft novels.txt    | books       | 0.5832841 |
| police report.docx      | docs        | 0.6303186 |
| database admin.pkg      | docs        | 0.5538312 |
| invoice from google.pdf | docs        | 0.5381457 |
| meme.png                | images      | 0.6993392 |
| funny cat.jpg           | images      | 0.5511819 |
| lord of the ring.avi    | movies      | 0.5454072 |
| harry potter.mpeg4      | movies      | 0.5410566 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test B: Adding Natural Language Context
&lt;/h3&gt;

&lt;p&gt;Each string was framed like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"This is a file name: {file_name}"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"This is a folder name: {folder_name}"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
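&lt;p&gt;In code, this framing is just string templating applied before each embedding request (a hypothetical Python sketch; the helper names are mine, not the CLI's):&lt;/p&gt;

```python
def frame_file(file_name):
    # Wrap the raw name in a natural-language role description
    return f"This is a file name: {file_name}"

def frame_folder(folder_name):
    return f"This is a folder name: {folder_name}"

print(frame_file("invoice from google.pdf"))
```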

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name                       | Folder Name | Score    |
|--------------------------------|-------------|-----------|
| crack.exe                      | apps        | 0.6714907 |
| lovecraft novels.txt           | books       | 0.7517922 |
| database admin.pkg             | dest        | 0.7194574 |
| police report.docx             | docs        | 0.7456068 |
| invoice from google.pdf        | docs        | 0.7141885 |
| meme.png                       | images      | 0.7737676 |
| funny cat.jpg                  | images      | 0.7438067 |
| harry potter.mpeg4             | movies      | 0.7156760 |
| lord of the ring.avi           | movies      | 0.6718528 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Observations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scores were consistently higher&lt;/strong&gt; across the board when context was added.&lt;/li&gt;
&lt;li&gt;The model &lt;strong&gt;made more accurate matches&lt;/strong&gt;, such as correctly associating &lt;code&gt;database admin.pkg&lt;/code&gt; with &lt;code&gt;dest&lt;/code&gt; instead of &lt;code&gt;books&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;This suggests that &lt;strong&gt;embeddings perform better with structured, semantic context&lt;/strong&gt;, not just bare tokens.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Test Case: Only Some Files Have Valid Matches
&lt;/h2&gt;

&lt;p&gt;Now let's delete the movies and images folders and observe how the matching behavior changes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test A: Using Only File and Folder Names
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score      |
|-------------------------|-------------|------------|
| hobbit.fb2              | apps        | 0.55056566 |
| crack.exe               | apps        | 0.5147713  |
| lovecraft novels.txt    | books       | 0.57081085 |
| police report.docx      | docs        | 0.6303186  |
| meme.png                | docs        | 0.58589196 |
| database admin.pkg      | docs        | 0.5538312  |
| invoice from google.pdf | docs        | 0.5381457  |
| lord of the ring.avi    | docs        | 0.492918   |
| funny cat.jpg           | docs        | 0.45956808 |
| harry potter.mpeg4      | docs        | 0.45733657 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test B: Adding Natural Language Context
&lt;/h3&gt;

&lt;p&gt;The same context-generation pattern was used as in the previous test case.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| File Name               | Folder Name | Score      |
|-------------------------|-------------|------------|
| crack.exe               | apps        | 0.6714907  |
| lovecraft novels.txt    | books       | 0.72899115 |
| database admin.pkg      | dest        | 0.7194574  |
| meme.png                | dest        | 0.68507683 |
| funny cat.jpg           | dest        | 0.6797525  |
| lord of the ring.avi    | dest        | 0.5323342  |
| police report.docx      | docs        | 0.7456068  |
| invoice from google.pdf | docs        | 0.71418846 |
| hobbit.fb2              | docs        | 0.6780642  |
| harry potter.mpeg4      | docs        | 0.5984984  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Observations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In Test A, files like &lt;code&gt;meme.png&lt;/code&gt;, &lt;code&gt;funny cat.jpg&lt;/code&gt;, and &lt;code&gt;lord of the ring.avi&lt;/code&gt; were incorrectly matched to the &lt;code&gt;docs&lt;/code&gt; folder. In Test B, they landed in the more appropriate &lt;code&gt;dest&lt;/code&gt; folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are still some mismatches. For example, &lt;code&gt;hobbit.fb2&lt;/code&gt; was matched with &lt;code&gt;docs&lt;/code&gt; instead of &lt;code&gt;books&lt;/code&gt;, likely due to the less common &lt;code&gt;.fb2&lt;/code&gt; format, and &lt;code&gt;harry potter.mpeg4&lt;/code&gt; also matched &lt;code&gt;docs&lt;/code&gt;, though with a relatively low score.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Does This Happen?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Context Gives Structure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Embedding models are trained on natural language. So when we provide structured inputs like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This is a file name: invoice from google.pdf”&lt;br&gt;&lt;br&gt;
“This is a folder name: docs”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;...the model better understands the &lt;strong&gt;semantic role&lt;/strong&gt; of each string. It knows these aren't just tokens — they are &lt;em&gt;types of things&lt;/em&gt;, which makes embeddings more aligned.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. &lt;strong&gt;It’s Not Just Word Overlap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes, phrases like &lt;code&gt;"this is a file name"&lt;/code&gt; and &lt;code&gt;"this is a folder name"&lt;/code&gt; are similar. But if word overlap were the only reason for higher scores, all scores would rise evenly — regardless of actual content.&lt;/p&gt;

&lt;p&gt;Instead, we're seeing better matching. That means the model is using &lt;strong&gt;true context&lt;/strong&gt; to judge compatibility — a sign that semantic meaning is being used, not just lexical similarity.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;strong&gt;Raw Strings Without Context Can Be Misleading&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A folder named &lt;code&gt;docs&lt;/code&gt; or &lt;code&gt;my-pc&lt;/code&gt; is vague. A file named &lt;code&gt;database admin.pkg&lt;/code&gt; is even more so. Embeddings of such raw strings might be overly similar due to lack of semantic separation. &lt;/p&gt;

&lt;p&gt;Adding even a light wrapper like &lt;code&gt;"This is a file name..."&lt;/code&gt; or &lt;code&gt;"This is a folder name..."&lt;/code&gt; gives the model &lt;strong&gt;clearer context and role assignment&lt;/strong&gt;, helping it avoid false positives and improve semantic accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings require context to be effective&lt;/strong&gt;, especially for classification or matching tasks.&lt;/li&gt;
&lt;li&gt;Providing &lt;strong&gt;natural-language-like structure&lt;/strong&gt; (even just a short prefix) significantly improves performance.&lt;/li&gt;
&lt;li&gt;It’s not just about higher scores — it’s about &lt;strong&gt;better semantics and more accurate results&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building tools that rely on embeddings, especially for classification, recommendation, or clustering — &lt;strong&gt;don't be afraid to add a little helpful context.&lt;/strong&gt; It goes a long way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Cosine Similarity Helped My CLI Decide Where Files Belong (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:41:57 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/how-cosine-similarity-helped-my-cli-decide-where-files-belong-messy-folder-reorganizer-ai-fm3</link>
      <guid>https://dev.to/evgeniiperminov/how-cosine-similarity-helped-my-cli-decide-where-files-belong-messy-folder-reorganizer-ai-fm3</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In version 0.2 of &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, I used the Qdrant vector database to search for similar vectors. This was necessary to determine which folder a file should go into based on its embedding. Because of this, I needed to revisit different distance/similarity metrics and choose the most appropriate one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right Vector Similarity Metric in Qdrant
&lt;/h2&gt;

&lt;p&gt;Qdrant supports the following distance/similarity metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dot Product&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cosine Similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Euclidean Distance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manhattan Distance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Distance/Similarity Formulas
&lt;/h3&gt;

&lt;p&gt;Let &lt;strong&gt;x&lt;/strong&gt; and &lt;strong&gt;y&lt;/strong&gt; be two vectors of dimensionality &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cosine Similarity
&lt;/h4&gt;

&lt;p&gt;cosine(x, y) = (x · y) / (‖x‖ · ‖y‖)&lt;/p&gt;

&lt;h4&gt;
  
  
  Dot Product
&lt;/h4&gt;

&lt;p&gt;dot(x, y) = Σ (xᵢ * yᵢ)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ If vectors are normalized to unit length, then: &lt;code&gt;cosine(x, y) = dot(x, y)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Euclidean Distance
&lt;/h4&gt;

&lt;p&gt;euclidean(x, y) = sqrt(Σ (xᵢ - yᵢ)²)&lt;/p&gt;

&lt;h4&gt;
  
  
  Manhattan Distance (L1)
&lt;/h4&gt;

&lt;p&gt;manhattan(x, y) = Σ |xᵢ - yᵢ|&lt;/p&gt;

&lt;p&gt;When working with high-dimensional vectors (e.g., 1024 dimensions, as in the &lt;strong&gt;mxbai-embed-large:latest&lt;/strong&gt; Ollama model) that have &lt;strong&gt;small magnitudes&lt;/strong&gt;, &lt;strong&gt;Cosine Similarity&lt;/strong&gt; is often the best choice — especially for embeddings.&lt;/p&gt;
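&lt;p&gt;For reference, the four metrics can be written out directly (a plain-Python sketch for illustration):&lt;/p&gt;

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# For unit-length vectors, cosine equals the dot product
x, y = [1.0, 0.0], [0.0, 1.0]
print(cosine(x, y), dot(x, y))  # both 0.0 for orthogonal unit vectors
```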




&lt;h2&gt;
  
  
  Why Cosine Similarity is a Good Choice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Focuses on orientation, not magnitude
&lt;/h3&gt;

&lt;p&gt;Cosine similarity measures the angle between vectors. It tells you how similar their directions are, regardless of vector length. This is useful when comparing embeddings, where absolute length may not be meaningful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in normalization
&lt;/h3&gt;

&lt;p&gt;Cosine similarity is equivalent to the dot product of &lt;strong&gt;L2-normalized vectors&lt;/strong&gt;, which helps reduce the effect of the "curse of dimensionality."&lt;/p&gt;

&lt;h3&gt;
  
  
  Great for semantic embeddings
&lt;/h3&gt;

&lt;p&gt;Cosine similarity works very well when vectors represent meaning or context. Many models (e.g., OpenAI, BERT, Sentence Transformers) are trained with cosine similarity in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Efficient
&lt;/h3&gt;

&lt;p&gt;Can be computed quickly even in high dimensions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cosine Similarity in Detail
&lt;/h2&gt;

&lt;p&gt;Imagine two arrows (vectors) starting from the origin in a multi-dimensional space. Cosine similarity measures the &lt;strong&gt;angle between them&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If they point in &lt;strong&gt;exactly the same direction&lt;/strong&gt;, similarity = &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If they are &lt;strong&gt;completely opposite&lt;/strong&gt;, similarity = &lt;code&gt;-1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If they are &lt;strong&gt;orthogonal&lt;/strong&gt; (90° apart), similarity = &lt;code&gt;0.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The closer the angle is to zero, the more similar the vectors are.&lt;/p&gt;




&lt;h3&gt;
  
  
  Formula
&lt;/h3&gt;

&lt;p&gt;Given two vectors &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;B&lt;/strong&gt;, cosine similarity is calculated as: &lt;/p&gt;

&lt;p&gt;cos(θ) = (A · B) / (||A|| * ||B||)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;A · B&lt;/code&gt; is the dot product of the vectors
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;||A||&lt;/code&gt; and &lt;code&gt;||B||&lt;/code&gt; are the magnitudes (lengths) of the vectors&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Let's take two simple 2D vectors:&lt;/p&gt;

&lt;p&gt;A = [1, 2]&lt;br&gt;
B = [2, 3]&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Dot Product:
&lt;/h4&gt;

&lt;p&gt;A · B = (1 * 2) + (2 * 3) = 2 + 6 = 8&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Magnitudes:
&lt;/h4&gt;

&lt;p&gt;||A|| = √(1² + 2²) = √5 ≈ 2.236&lt;br&gt;
||B|| = √(2² + 3²) = √13 ≈ 3.606&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Cosine Similarity:
&lt;/h4&gt;

&lt;p&gt;cos(θ) = 8 / (2.236 * 3.606) ≈ 8 / 8.063 ≈ 0.992&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: 0.992&lt;/strong&gt; — Very high similarity!&lt;/p&gt;
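&lt;p&gt;The arithmetic above can be verified in a few lines of plain Python:&lt;/p&gt;

```python
import math

a, b = [1.0, 2.0], [2.0, 3.0]
dot = sum(x * y for x, y in zip(a, b))     # 1*2 + 2*3 = 8.0
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(5)
norm_b = math.sqrt(sum(x * x for x in b))  # sqrt(13)
similarity = dot / (norm_a * norm_b)
print(round(similarity, 3))
```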




&lt;h2&gt;
  
  
  In the Context of the CLI
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;messy-folder-reorganizer-ai&lt;/code&gt;, embeddings represent file and folder names. Cosine similarity allows the CLI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find files with similar meaning or content
&lt;/li&gt;
&lt;li&gt;Group files together
&lt;/li&gt;
&lt;li&gt;Match files to folder "themes" based on vector similarity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Adding RAG and ML to AI files reorganization CLI (messy-folder-reorganizer-ai)</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Fri, 28 Mar 2025 15:41:36 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/adding-rag-and-ml-to-ai-files-reorganization-cli-messy-folder-reorganizer-ai-1d3</link>
      <guid>https://dev.to/evgeniiperminov/adding-rag-and-ml-to-ai-files-reorganization-cli-messy-folder-reorganizer-ai-1d3</guid>
      <description>&lt;p&gt;A month ago, I created the first naive version of a CLI tool for AI-powered file reorganization in Rust — &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;. It sent file names and paths to Ollama and asked the LLM to generate new paths for each file. This worked fine for a small number of files, but once the count exceeded around 50, the LLM context filled up quickly.&lt;/p&gt;

&lt;p&gt;So, I decided to improve the entire workflow by integrating RAG (Retrieval-Augmented Generation).&lt;/p&gt;




&lt;h2&gt;
  
  
  Version 0.2 Workflow Updates
&lt;/h2&gt;

&lt;p&gt;Here’s how adding RAG and a bit of ML helped improve the file reorganization flow in the CLI:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Custom Source and Destination Paths
&lt;/h3&gt;

&lt;p&gt;First, I allowed users to specify different paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;source path&lt;/strong&gt; where files are located.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;destination path&lt;/strong&gt; where files will be moved.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Adding RAG with Qdrant
&lt;/h3&gt;

&lt;p&gt;Next, I introduced RAG into the system. As a vector database, I chose &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; — an open-source, easy-to-run local vector store.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Currently, users need to manually download and launch Qdrant. Automatic setup is planned for future versions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core of RAG is generating embeddings from text. Here's the step-by-step:&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Embedding Folder and File Names
&lt;/h3&gt;

&lt;p&gt;The CLI sends destination folder names and source file names to an Ollama embedding model. The model returns an embedding (vector) for each name.&lt;/p&gt;

&lt;h4&gt;
  
  
  Contextualizing the Input
&lt;/h4&gt;

&lt;p&gt;Instead of sending raw names, I added context like:&lt;br&gt;&lt;br&gt;
&lt;code&gt;"This is a folder name: {folder_name}"&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A more detailed explanation will be in the next article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Embedding Model Selection
&lt;/h4&gt;

&lt;p&gt;Different models return vectors of different dimensions. I used the &lt;strong&gt;mxbai-embed-large:latest&lt;/strong&gt; model from Ollama, which produces 1024-dimensional vectors. It performed well for most use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Storing Folder Embeddings in Qdrant
&lt;/h3&gt;

&lt;p&gt;Each destination folder's embedding is stored in Qdrant, with the original folder name included as payload metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Matching Files to Closest Folders
&lt;/h3&gt;

&lt;p&gt;For each source file embedding, the CLI searches Qdrant for the closest destination folder vector.&lt;br&gt;&lt;br&gt;
Qdrant returns the most similar match along with a similarity score.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More about similarity measures and why I picked a particular one will be covered in the third article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Threshold-Based Filtering
&lt;/h3&gt;

&lt;p&gt;The CLI compares each similarity score to a configurable threshold (set via config files). If no suitable match is found, the file is filtered out and sent to an additional step — &lt;strong&gt;clustering&lt;/strong&gt; and &lt;strong&gt;folder name generation via LLM&lt;/strong&gt;.&lt;/p&gt;
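&lt;p&gt;A sketch of this filtering step (illustrative Python with hypothetical names; the real CLI is written in Rust and reads its threshold from config files):&lt;/p&gt;

```python
def split_by_threshold(matches, threshold):
    # matches: list of (file_name, best_folder, similarity_score) tuples
    matched, unmatched = [], []
    for file_name, folder, score in matches:
        if score >= threshold:
            matched.append((file_name, folder))
        else:
            unmatched.append(file_name)  # goes on to clustering + LLM naming
    return matched, unmatched

matches = [("police report.docx", "docs", 0.75), ("hobbit.fb2", "docs", 0.40)]
print(split_by_threshold(matches, 0.6))
```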

&lt;h3&gt;
  
  
  7. Clustering Unmatched Files
&lt;/h3&gt;

&lt;p&gt;Since LLMs struggle with large input contexts, we split unmatched files into clusters using machine learning — specifically &lt;strong&gt;agglomerative hierarchical clustering&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More details about clustering are in the fourth article in this series.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  8. Naming Clusters via LLM
&lt;/h3&gt;

&lt;p&gt;Once clustering is complete, we end up with small, manageable groups of files. For each cluster, we send a prompt to the LLM to generate a suitable folder name.&lt;/p&gt;

&lt;p&gt;After some LLM thinking time, we receive the missing folder names and can show the user a preview of the proposed file reorganization.&lt;/p&gt;
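&lt;p&gt;As an illustration, building one compact request per cluster might look like this (the prompt wording and helper name are my assumptions, not the CLI's actual template):&lt;/p&gt;

```python
def cluster_prompt(file_names):
    # One short prompt per cluster keeps the LLM context small
    listing = "\n".join("- " + name for name in file_names)
    return "Suggest a short folder name for these related files:\n" + listing

print(cluster_prompt(["bill_for_electricity.pdf", "bill_for_leasing.docx"]))
```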

&lt;h3&gt;
  
  
  9. Applying the Changes
&lt;/h3&gt;

&lt;p&gt;If the user is happy with the proposed structure, they can confirm it. The CLI will then move the files to their new paths accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the upcoming articles, I’ll dive into some of the more technical and interesting parts of the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to choose a similarity search method.&lt;/li&gt;
&lt;li&gt;Ways to improve embeddings for files and folders.&lt;/li&gt;
&lt;li&gt;Selecting and preparing data for clustering.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;p&gt;I’d really appreciate any feedback — positive or critical — on the project, the codebase, the article series, or the general approach used in the CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for Reading!&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feel free to reach out here or connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/PerminovEugene" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/eugene-perminov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or just drop me a note if you want to chat about Rust, AI, or creative ways to clean up messy folders!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>cli</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
    <item>
      <title>How I Built a Local LLM-Powered File Reorganizer with Rust</title>
      <dc:creator>Evgenii Perminov</dc:creator>
      <pubDate>Wed, 19 Feb 2025 15:20:29 +0000</pubDate>
      <link>https://dev.to/evgeniiperminov/how-i-built-a-local-llm-powered-file-reorganizer-in-rust-1bip</link>
      <guid>https://dev.to/evgeniiperminov/how-i-built-a-local-llm-powered-file-reorganizer-in-rust-1bip</guid>
      <description>&lt;h1&gt;
  
  
  Introduction: Diving (Back) Into Rust
&lt;/h1&gt;

&lt;p&gt;Some time ago, I decided to dive into Rust &lt;strong&gt;once again&lt;/strong&gt;—this must be my &lt;em&gt;nth&lt;/em&gt; attempt. I’d tried learning it before, but each time I either got swamped by the borrow checker or got sidetracked by other projects. This time, I wanted a small, &lt;em&gt;practical&lt;/em&gt; project to force myself to stick with Rust. The result is &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/tree/main" rel="noopener noreferrer"&gt;messy-folder-reorganizer-ai&lt;/a&gt;, a command-line tool for file organization powered by a local LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inspiration: A Bloated Downloads Folder
&lt;/h2&gt;

&lt;p&gt;The main motivation was my messy &lt;strong&gt;Downloads&lt;/strong&gt; folder, which often ballooned to hundreds of files—images, documents, installers—essentially chaos. Instead of manually sorting through them, I thought, “Why not let an AI propose a structure?”&lt;/p&gt;




&lt;h2&gt;
  
  
  Discovering Local LLMs
&lt;/h2&gt;

&lt;p&gt;While brainstorming, I stumbled upon the possibility of running LLMs &lt;strong&gt;locally&lt;/strong&gt; with tools like Ollama and other self-hosted frameworks. I loved the idea of &lt;strong&gt;not sending&lt;/strong&gt; my data to some cloud service. So I decided to build a Rust-based CLI that &lt;strong&gt;queries&lt;/strong&gt; a local LLM server for suggestions on how to reorganize my folders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges: LLM &amp;amp; Large Folders
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Model:&lt;/strong&gt; I started using &lt;code&gt;llama3.2:1b&lt;/code&gt;, but the responses didn’t follow prompt instructions well, so I switched to &lt;strong&gt;deepseek-r1&lt;/strong&gt;, which performed much better.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Limits:&lt;/strong&gt; When testing on folders with many files, the model began forgetting the beginning of the prompt and stopped following instructions properly. Increasing &lt;code&gt;num_ctx&lt;/code&gt; (which defines the model’s context size) helped partially, but the model still struggles with &lt;strong&gt;100+ files&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Possible Solutions:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batching Requests:&lt;/strong&gt; Split the file list into smaller chunks and send multiple prompts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other Ideas:&lt;/strong&gt; If you’re an LLM expert—especially with local models like Ollama—I’d love advice on how to handle larger sets without hitting memory or context limits.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
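&lt;p&gt;The batching idea above can be sketched as a simple chunking helper; the batch size here is arbitrary and would in practice be tuned against the model’s &lt;code&gt;num_ctx&lt;/code&gt;:&lt;/p&gt;

```rust
// Sketch of the batching workaround: split the file list into chunks
// and prompt the model once per chunk, so no single prompt overflows
// the context window.
fn batch_files(files: &[String], batch_size: usize) -> Vec<Vec<String>> {
    // `max(1)` guards against a zero batch size, which would panic.
    files.chunks(batch_size.max(1)).map(|c| c.to_vec()).collect()
}

fn main() {
    let files: Vec<String> = (1..=10).map(|i| format!("file_{i}.txt")).collect();
    let batches = batch_files(&files, 4);
    println!("{} batches", batches.len()); // chunks of 4 + 4 + 2 files
}
```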




&lt;h2&gt;
  
  
  CLI Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configurable Model:&lt;/strong&gt; Specify the local LLM endpoint, model name, or other model options.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable Prompts:&lt;/strong&gt; Tweak the AI prompt to fine-tune how the model interprets your folder’s contents.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirmation Prompt:&lt;/strong&gt; The tool shows you the proposed structure and asks for confirmation before reorganizing any files.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Looking for Feedback
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rust Community:&lt;/strong&gt; I’d love code feedback — best practices, performance tips, or suggestions on how to structure the CLI.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Gurus:&lt;/strong&gt; Any advice on optimizing local model inference for large file sets or advanced chunking strategies would be invaluable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project has been a great way to re-learn some Rust features and experiment with local AI solutions. While it works decently for medium-sized folders, there’s plenty of room to grow. If this concept resonates with you—maybe your Downloads folder is as messy as mine—give it a try, open an issue, or contribute a pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Feel free to reach out on the &lt;a href="https://github.com/PerminovEugene/messy-folder-reorganizer-ai/issues" rel="noopener noreferrer"&gt;GitHub issues page&lt;/a&gt;, or drop me a note if you have any thoughts, suggestions, or just want to talk about Rust and AI!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>cli</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
