Sahil Singh

Posted on Feb 8 • Edited on Mar 5 • Originally published at getglueapp.com

How We Cluster 4,000 Files Into Features Using Louvain Community Detection

#devtools #programming #architecture #ai

When a developer asks "which files handle authentication?" they expect a precise answer. Not a keyword search. Not a directory listing. An actual traced answer: these 14 files, across 4 directories, form the authentication feature. Here's how they connect.

Building this requires solving a graph clustering problem. Here's exactly how we do it.

The Problem

A codebase is a graph. Files import other files. Functions call other functions. Types reference other types. This creates a dense dependency network.

The question is: which clusters of files form coherent features? "Authentication" isn't a directory — it's files spread across controllers, services, middleware, types, and utils. The grouping is structural (based on dependencies), not spatial (based on file paths).

Why Directory Structure Fails

Most codebases organize files by technical layer:

src/
  controllers/
    authController.ts
    billingController.ts
  services/
    authService.ts
    billingService.ts
  middleware/
    authMiddleware.ts
  models/
    User.ts
    Subscription.ts

This tells you about architecture. It doesn't tell you about features. The authentication feature is authController.ts + authService.ts + authMiddleware.ts + User.ts + parts of the session configuration. These files are in 4 different directories, but they form one cohesive unit.

The Graph Approach

Step 1: Build the Dependency Graph

We extract every import, function call, and type reference in the codebase. Each file becomes a node. Each dependency becomes an edge. The edge weight reflects the strength of the connection:

Import: weight 1.0 (direct dependency)
Function call: weight 0.8 (runtime dependency)
Type reference: weight 0.5 (structural dependency)
Test file to source: weight 0.3 (test dependency)

For a typical 4,000-file codebase, this produces a graph with 4,000 nodes and 15,000-40,000 edges.
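As a rough illustration of Step 1, here is a minimal sketch that turns extracted dependency records into a weighted graph. The weights come from the list above; the record format `(source, target, kind)`, the kind names, the choice to sum weights when two files are connected in several ways, and the choice to treat the graph as undirected are all assumptions made for this example, not a description of Glue's internals.

```python
from collections import defaultdict

# Weights per dependency kind, taken from the list above.
# The key names are hypothetical labels for this sketch.
EDGE_WEIGHTS = {
    "import": 1.0,
    "call": 0.8,
    "type_ref": 0.5,
    "test": 0.3,
}


def build_graph(dependencies):
    """Return {file: {neighbor: weight}} as a symmetric weighted graph.

    `dependencies` is an iterable of (source_file, target_file, kind) tuples.
    Repeated pairs have their weights summed (an assumption of this sketch).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for source, target, kind in dependencies:
        if source == target:
            continue  # ignore self-references
        weight = EDGE_WEIGHTS[kind]
        graph[source][target] += weight
        graph[target][source] += weight
    return {node: dict(neighbors) for node, neighbors in graph.items()}


deps = [
    ("authController.ts", "authService.ts", "import"),
    ("authController.ts", "authService.ts", "call"),
    ("authService.ts", "User.ts", "type_ref"),
    ("authService.test.ts", "authService.ts", "test"),
]
graph = build_graph(deps)
```

In this toy input, authController.ts both imports and calls authService.ts, so the two weights add up on a single edge.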

Step 2: Louvain Community Detection

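Louvain is a greedy modularity optimization algorithm. Modularity scores a partition by comparing the edge weight inside each community with the weight you would expect from a random graph with the same node degrees, so a high score means dense internal connections and sparse external ones. In our case: files that depend heavily on each other get grouped together.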
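To make "modularity" concrete, here is a small sketch of the standard weighted modularity score for a given partition. The graph representation (`{node: {neighbor: weight}}`, symmetric) and the toy graph of two triangles joined by one edge are assumptions for illustration only.

```python
def modularity(graph, community_of):
    """Weighted modularity Q of a partition.

    graph: {node: {neighbor: weight}}, symmetric (undirected).
    community_of: {node: community_id}.
    """
    # Sum of all weighted degrees, i.e. twice the total edge weight.
    two_m = sum(sum(nbrs.values()) for nbrs in graph.values())
    internal = {}    # per community: weight with both endpoints inside (counted twice)
    degree_sum = {}  # per community: sum of weighted degrees
    for node, nbrs in graph.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0.0) + sum(nbrs.values())
        for nbr, w in nbrs.items():
            if community_of[nbr] == c:
                internal[c] = internal.get(c, 0.0) + w
    return sum(
        internal.get(c, 0.0) / two_m - (degree_sum[c] / two_m) ** 2
        for c in degree_sum
    )


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
graph = {}
for u, v in edges:
    graph.setdefault(u, {})[v] = 1.0
    graph.setdefault(v, {})[u] = 1.0

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in graph}
q_split = modularity(graph, split)    # 5/14, about 0.357
q_merged = modularity(graph, merged)  # 0.0
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the signal Louvain climbs toward.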

The algorithm works in two phases:

Phase 1 (Local): Each node starts in its own community. For each node, try moving it to each neighbor's community. Accept the move that gives the largest positive modularity gain; if no move improves modularity, the node stays where it is. Repeat over all nodes until no move improves modularity.

Phase 2 (Aggregation): Collapse each community into a single node. Build a new graph where the edge weight between two super-nodes is the sum of the edge weights between their constituent nodes, and edges inside a community become a self-loop on its super-node. Go back to Phase 1.

This repeats until modularity stabilizes. The result: a hierarchical clustering of files into features.
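The two phases can be sketched in plain Python as follows. This is a simplified teaching version, not our production code: it visits nodes in dictionary order (real implementations usually shuffle), has no resolution parameter, and assumes the same symmetric `{node: {neighbor: weight}}` representation used for illustration here. The function names are hypothetical.

```python
def local_move(graph):
    """Phase 1: greedily move nodes between communities until no move helps."""
    degree = {n: sum(nbrs.values()) for n, nbrs in graph.items()}
    two_m = sum(degree.values())
    community_of = {n: n for n in graph}  # every node starts alone
    if two_m == 0:
        return community_of               # no edges, nothing to merge
    total = dict(degree)                  # sum of degrees per community
    improved = True
    while improved:
        improved = False
        for node in graph:
            current = community_of[node]
            # Weight from this node to each neighboring community.
            links = {}
            for nbr, w in graph[node].items():
                if nbr != node:
                    c = community_of[nbr]
                    links[c] = links.get(c, 0.0) + w
            total[current] -= degree[node]  # take the node out first
            # Quantity proportional to the modularity gain of joining c.
            best = current
            best_gain = links.get(current, 0.0) - total[current] * degree[node] / two_m
            for c, k_in in links.items():
                gain = k_in - total[c] * degree[node] / two_m
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            total[best] += degree[node]
            if best != current:
                community_of[node] = best
                improved = True
    return community_of


def aggregate(graph, community_of):
    """Phase 2: collapse each community into one super-node.

    Weights inside a community accumulate on the super-node's self-loop entry.
    """
    collapsed = {}
    for node, nbrs in graph.items():
        row = collapsed.setdefault(community_of[node], {})
        for nbr, w in nbrs.items():
            cv = community_of[nbr]
            row[cv] = row.get(cv, 0.0) + w
    return collapsed


def louvain(graph):
    """Return one {node: community_id} partition per level, finest first."""
    levels = []
    membership = {n: n for n in graph}  # original node -> current super-node
    while True:
        community_of = local_move(graph)
        if len(set(community_of.values())) == len(graph):
            break  # nothing merged in this pass, so modularity has stabilized
        membership = {n: community_of[s] for n, s in membership.items()}
        levels.append(dict(membership))
        graph = aggregate(graph, community_of)
    return levels


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
graph = {}
for u, v in edges:
    graph.setdefault(u, {})[v] = 1.0
    graph.setdefault(v, {})[u] = 1.0

levels = louvain(graph)
final = levels[-1]
```

On this toy graph the two triangles end up as two communities after a single level. On a real dependency graph, each entry in `levels` is a coarser grouping than the one before it, which is where the hierarchy comes from.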

Step 3: Feature Labeling

Louvain gives us clusters, not names. We label features by analyzing the cluster contents:

Extract the most common domain terms from file names and function names in the cluster
Identify the "entry point" files (most imported by other clusters) — these define the feature's public interface
Use the entry point names and domain terms to generate a feature label

Cluster containing authController.ts, authService.ts, authMiddleware.ts, sessionManager.ts, User.ts → "Authentication & Session Management"
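The first labeling step, pulling domain terms out of file names, might look roughly like this. The tokenizing regex and the `LAYER_WORDS` stop list are assumptions for the sketch; turning "auth" and "session" into a readable label like the one above is a separate step not shown here.

```python
import re
from collections import Counter

# Hypothetical stop list: layer names that say nothing about the feature.
LAYER_WORDS = {"controller", "service", "middleware", "manager",
               "store", "utils", "helpers", "index"}


def domain_terms(file_names, top_n=3):
    """Return the most common lowercase terms across camelCase file names."""
    counts = Counter()
    for name in file_names:
        # Drop directories and everything after the first dot.
        stem = name.rsplit("/", 1)[-1].split(".", 1)[0]
        # Split camelCase / PascalCase into words, keeping acronyms together.
        tokens = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", stem)
        counts.update(t.lower() for t in tokens if t.lower() not in LAYER_WORDS)
    return counts.most_common(top_n)


cluster = ["authController.ts", "authService.ts", "authMiddleware.ts",
           "sessionManager.ts", "User.ts"]
terms = domain_terms(cluster)
```

For the example cluster, "auth" comes out on top with three occurrences, followed by "session" and "user" with one each.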

Step 4: Hierarchy Detection

Features have sub-features. Louvain's hierarchical output captures this naturally:

Authentication & Session Management
  ├── OAuth Integration (oauth.ts, providers/*.ts)
  ├── Session Handling (sessionManager.ts, sessionStore.ts)
  └── User Management (User.ts, userService.ts, permissions.ts)

Real-World Results

On Glue's own codebase (~1,200 files):

23 features detected at the top level
67 sub-features at the second level
92% accuracy when validated against our team's mental model of feature boundaries
Processing time: 4.2 seconds for the full clustering

On a client codebase (~4,000 files, Node.js monolith):

41 features detected
134 sub-features
Revealed 3 unexpected dependencies between features the team thought were independent
Processing time: 11.8 seconds

Edge Cases and Refinements

Utility Files

Files like utils.ts or helpers.ts connect to everything. They create noise in the clustering. We handle this by:

Detecting high-degree nodes (imported by >30% of the codebase)
Reducing their edge weights by a dampening factor
Allowing them to be assigned to the cluster they're most strongly connected to
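A sketch of the first two steps, under stated assumptions: `imported_by` maps each file to the set of files that import it, the 30% threshold comes from the list above, and the dampening factor of 0.2 is a made-up value for illustration. The damped graph can then go through clustering as usual, which covers the third step.

```python
def dampen_hubs(graph, imported_by, threshold=0.30, factor=0.2):
    """Scale down every edge that touches a high-degree utility file.

    graph: {file: {neighbor: weight}}, symmetric.
    imported_by: {file: set of files that import it}.
    factor: hypothetical dampening factor, not a value from the article.
    Returns (damped_graph, hubs).
    """
    n_files = len(graph)
    hubs = {
        f for f, importers in imported_by.items()
        if len(importers) > threshold * n_files
    }
    damped = {
        node: {
            nbr: w * factor if (node in hubs or nbr in hubs) else w
            for nbr, w in nbrs.items()
        }
        for node, nbrs in graph.items()
    }
    return damped, hubs


files = ["a.ts", "b.ts", "c.ts", "d.ts"]
graph = {f: {"utils.ts": 1.0} for f in files}
graph["utils.ts"] = {f: 1.0 for f in files}
graph["a.ts"]["b.ts"] = 1.0
graph["b.ts"]["a.ts"] = 1.0
imported_by = {"utils.ts": set(files), "b.ts": {"a.ts"}}

damped, hubs = dampen_hubs(graph, imported_by)
```

Here utils.ts is imported by four of five files, so its edges shrink while the a.ts to b.ts edge keeps its full weight.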

Test Files

Test files mirror source files structurally but shouldn't dominate clustering. We assign them to their source file's cluster and give test-to-source edges a reduced weight (the 0.3 weight listed in Step 1).
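One possible reading of this rule, sketched below: cluster the source files first, then attach each test file to whichever cluster it has the most edge weight into. The two-pass approach, the `test_edges` structure, and the function name are assumptions for this example.

```python
def assign_tests(test_edges, cluster_of):
    """Attach each test file to the cluster it is most strongly connected to.

    test_edges: {test_file: {source_file: weight}}.
    cluster_of: {source_file: cluster_id}, from clustering source files only.
    """
    assigned = {}
    for test_file, sources in test_edges.items():
        totals = {}
        for source, weight in sources.items():
            c = cluster_of[source]
            totals[c] = totals.get(c, 0.0) + weight
        if totals:  # a test with no source edges stays unassigned
            assigned[test_file] = max(totals, key=totals.get)
    return assigned


cluster_of = {
    "authService.ts": "auth",
    "User.ts": "auth",
    "billingService.ts": "billing",
}
test_edges = {
    "auth.test.ts": {
        "authService.ts": 0.3,
        "User.ts": 0.3,
        "billingService.ts": 0.3,
    },
}
assigned = assign_tests(test_edges, cluster_of)
```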

Configuration Files

Config files (next.config.ts, .env, tsconfig.json) are excluded from clustering — they're infrastructure, not features.

Why This Matters for Developers

Feature clustering powers everything else in Glue:

"What files handle authentication?" → query the authentication cluster
"What depends on authentication?" → check cross-cluster edges pointing into the auth cluster
"What's the blast radius of changing this auth function?" → trace dependencies within and across clusters
"Which features are most complex?" → measure intra-cluster density and cross-cluster coupling

It turns a bag of 4,000 files into a structured, queryable knowledge graph. That's the foundation that makes every other intelligence feature possible.
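As an illustration of how such queries reduce to simple lookups once clusters exist, here is a sketch of the "what depends on this feature?" query. It assumes a directed `imports` map (importer to imported files) and a `cluster_of` map; both names and the sample data are hypothetical.

```python
def dependents_of(cluster, imports, cluster_of):
    """Files outside `cluster` that import at least one file inside it."""
    return sorted({
        importer
        for importer, targets in imports.items()
        if cluster_of[importer] != cluster
        and any(cluster_of[t] == cluster for t in targets)
    })


imports = {
    "authController.ts": ["authService.ts"],
    "authService.ts": ["User.ts"],
    "billingService.ts": ["authService.ts", "Subscription.ts"],
}
cluster_of = {
    "authController.ts": "auth",
    "authService.ts": "auth",
    "User.ts": "auth",
    "billingService.ts": "billing",
    "Subscription.ts": "billing",
}
auth_dependents = dependents_of("auth", imports, cluster_of)
```

Only billingService.ts shows up, because it is the one file outside the auth cluster with an edge pointing into it.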

Originally published on glue.tools. Glue is the pre-code intelligence platform — paste a ticket, get a battle plan.
