tumf

Posted on • Originally published at blog.tumf.dev

llm-tldr: Answering "Where is the authentication?" with 100ms - Accuracy and Limitations of Semantic Search

Originally published on 2026-01-19
Original article (Japanese): llm-tldr: 「認証はどこ?」に100msで答えるセマンティック検索の精度と限界

Have you ever searched through a large codebase wondering, "Where is the authentication feature implemented?" Tools like grep and find can't help if you don't know the function name, and IDE symbol searches require exact matches.

llm-tldr is a code analysis tool that features semantic search capabilities, allowing you to search for code using natural language queries. It can identify functions based on vague keywords like "authentication" or "PDF generation" by analyzing actual behavior. In this article, I will report on the accuracy and limitations based on tests conducted in a Next.js project consisting of 269 files.

What is llm-tldr?

llm-tldr is a tool designed to efficiently provide codebase information to LLMs (large language models). Instead of passing the entire codebase, it extracts structured information, claiming a 95% reduction in tokens and 155 times faster processing.

Key features:

  • 5-layer code analysis architecture: Analyzes using five layers: AST, call graph, control flow, data flow, and program dependency.
  • Semantic search: Searches for functions using natural language queries (the focus of this article).
  • Support for 16 languages: Including TypeScript, Python, JavaScript, Go, Rust, Java, and more.
  • MCP integration: Works with Claude Code / Claude Desktop (MCP).

Performance Metrics Claimed by the Official Documentation

| Metric | Raw Code | TLDR | Improvement |
| --- | --- | --- | --- |
| Tokens per function context | 21,000 | 175 | 99% reduction |
| Tokens per codebase summary | 104,000 | 12,000 | 89% reduction |
| Query latency | 30 seconds | 100ms | 300 times faster |

We will verify how well these numbers can be reproduced in actual projects.

How Semantic Search Works

The semantic search in llm-tldr encodes each function into a 1024-dimensional vector built from the following information:

  1. Signature + docstring (Layer 1: AST)
  2. Call relationships (Layer 2: call graph)
  3. Complexity metrics (Layer 3: control flow)
  4. Data flow patterns (Layer 4: data flow)
  5. Dependencies (Layer 5: program dependency)
  6. The first approximately 10 lines of code

The embedding model used is bge-large-en-v1.5 (1.3GB), and for vector search, FAISS (Facebook AI Similarity Search) is employed.

Differences from grep

Traditional tools like grep or ripgrep rely on string matching, so they find nothing unless a function name or comment literally contains "authentication." llm-tldr, on the other hand, vectorizes functions based on their actual behavior (which functions they call, what data they handle), so it does not depend on comments or variable names.
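For example, a literal search comes up empty when the code never spells out the word, while the semantic query can still land on it (the directories below are illustrative):

```bash
# String matching: finds nothing unless the word appears literally in the code
rg -l "authentication" components/ lib/

# Semantic search: matches on behavior and call patterns instead
tldr semantic search "authentication" --path . --k 5
```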

Testing Environment

The environment used for testing in an actual project is as follows:

  • Project: system-planner (TypeScript/Next.js project)
  • Size: 269 files, 517 edges (function call relationships)
  • Language Composition: 257 TypeScript files, 12 JavaScript files
  • llm-tldr Version: 1.5.2

Setup: The Initial Traps

Following the official documentation to the letter leads you into a couple of initial traps.

Trap 1: Semantic Index Not Created

```bash
# Executing as per the official documentation
tldr warm .
```

Result: 0 code units, and the semantic index was not created.

Cause:

  1. The embedding model (bge-large-en-v1.5, 1.3GB) needs to be downloaded.
  2. Language specification (--lang) is mandatory.

Solution:

```bash
# Explicitly specify the language
tldr semantic index . --lang typescript
```

The first execution takes a few minutes to download the 1.3GB model. After that, 516 code units were indexed.

Trap 2: Path Aliases Not Resolved

Path aliases commonly used in Next.js projects, such as @/lib/..., cannot be correctly resolved by llm-tldr.

```bash
# Searching for importers
tldr importers "@/lib/supabase/server" .
```

Result: "importers": [] (empty)

This is because llm-tldr does not read the paths setting from tsconfig.json. For now, the only workaround is to search by the file's absolute path instead of the alias.
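For example, querying by the resolved path on disk (the file path below is illustrative; only the importers command itself comes from the docs):

```bash
# Query by the file's real path rather than the tsconfig alias
tldr importers "$(pwd)/lib/supabase/server.ts" .
```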

Validating the Power of Semantic Search

I confirmed the accuracy with three queries.

Validation 1: Searching for Authentication Features

```bash
tldr semantic search "authentication and login" --path . --k 5
```

Result:

```json
[
  {
    "name": "LoginForm",
    "file": "components/auth/LoginForm.tsx",
    "score": 0.6509
  },
  {
    "name": "LoginPage",
    "file": "app/auth/login/page.tsx",
    "score": 0.6506
  },
  {
    "name": "UpdatePasswordForm",
    "file": "components/auth/UpdatePasswordForm.tsx",
    "score": 0.6165
  }
]
```

Evaluation: ✅ As expected. Components related to login ranked highly. A score around 0.65 is reasonable for semantic search similarity.
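Since the output is plain JSON, it pipes straight into jq. A small sketch, assuming the array above is exactly what the CLI prints to stdout (the 0.6 cutoff is arbitrary):

```bash
# Keep only hits above a similarity threshold
tldr semantic search "authentication and login" --path . --k 10 \
  | jq '[.[] | select(.score > 0.6)]'
```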

Validation 2: Accuracy of Japanese Queries

```bash
# Query in Japanese: "database connection to Supabase"
tldr semantic search "Supabaseへのデータベース接続" --path . --k 5
```

Result: Database-related scripts were appropriately detected.

Evaluation: ✅ Japanese queries work. Even though bge-large-en-v1.5 is an English-focused model, this Japanese query still surfaced relevant results.

Validation 3: Cross-Functional Search

```bash
tldr semantic search "PDF generation and export" --path . --k 5
```

Result:

```json
[
  {
    "name": "generateEstimatePDF",
    "file": "lib/utils/pdf-generator.ts",
    "score": 0.6862
  },
  {
    "name": "convertMarkdownToDocx",
    "file": "lib/utils/specification-export.ts",
    "score": 0.6582
  }
]
```

Evaluation: ✅ Exceeds expectations. It detected not only PDF generation but also related document conversion functions, indicating that data flow and call graph information are effective.

Call Graph Expansion Feature

```bash
tldr semantic search "chat message handling" --path . --k 3 --expand
```

By adding the --expand option, the results include function call relationships (calls, called_by, related). This is useful for impact analysis during refactoring.
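As a sketch, assuming --expand adds calls / called_by / related keys to each hit in the same JSON array format as the earlier results, you can pull out just the graph fields:

```bash
# Extract only the call-graph fields from the expanded results
tldr semantic search "chat message handling" --path . --k 3 --expand \
  | jq '.[] | {name, calls, called_by, related}'
```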

Features That Did Not Work

Some features that I expected to work did not function correctly.

Failure 1: Impact Analysis

```bash
tldr impact fetchVendors .
```

Result: Function 'fetchVendors' not found in call graph

Possible Causes:

  • An exact match of the function name may be required.
  • The call graph index may be incomplete.
  • TypeScript-specific issues (e.g., around type definitions).

Failure 2: Architecture Analysis

```bash
tldr arch .
```

Result: All items were empty.

```json
{
  "entry_layer": [],
  "leaf_layer": [],
  "middle_layer_count": 0,
  "circular_dependencies": []
}
```

Possible Causes: It may not support the specific structure of Next.js projects (App Router, Server Components, etc.).

Performance: The Power of Daemon Mode

llm-tldr has a mode that starts a daemon in the background.

```bash
tldr daemon start
```

Effects:

  • Queries are accelerated to 100ms (as claimed).
  • In-memory caching makes subsequent searches extremely fast.
  • Communication via Unix Socket.

In practice, the first query took about 2 seconds, while subsequent queries returned in roughly 100ms, matching the officially claimed latency.
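You can reproduce the cold/warm difference by wrapping the same query in time:

```bash
tldr daemon start

# First query pays the warm-up cost (about 2 seconds here)
time tldr semantic search "authentication and login" --path . --k 5

# Repeat query: served from the daemon's in-memory cache (about 100ms)
time tldr semantic search "authentication and login" --path . --k 5
```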

Practicality Evaluation

Comparison with Serena (Conclusion Only)

Serena is a tool based on the Language Server Protocol (LSP) that excels in "accurate editing and refactoring" using type information and reference resolution. In contrast, llm-tldr is strong in "finding features when you don't know the name" through static analysis with Tree-sitter + vector search.

These two tools complement each other rather than compete, and I believe the following practical distinctions can be made:

  • Serena: Accurately trace known symbols / safely rename.
  • llm-tldr: Narrow down using natural language like "Where is authentication?" / gain an overview of unfamiliar codebases.

✅ Highly Effective Cases

  1. Exploring Large Codebases

    • Ideal for searching for features like "Where is authentication?" or "Where is PDF generation?" in unfamiliar projects.
    • More intuitive than traditional grep or find.
  2. Impact Investigation Before Refactoring

    • Understand dependencies between functions using the call graph.
    • Identify the scope of changes with the --expand option.
  3. Providing Context to LLMs

    • The 95% reduction in tokens lets even large projects fit within a context window.
    • Structured JSON output makes it easier for LLMs to understand.

🤔 Limited Effectiveness Cases

  1. Debugging

    • Program slicing with tldr slice is theoretically effective but was not validated this time.
    • Data flow analysis (tldr dfg) is similar.
  2. Architecture Analysis

    • If tldr arch worked correctly, it would be useful, but results were not obtained this time.

❌ Unsuitable Cases

  1. Simple Edits in Single Files

    • High overhead.
    • IDE features are sufficient.
  2. Situations Requiring Real-Time Responses

    • Index updates are necessary (after every 20 file changes).
    • Continuous notification settings (like Git hooks) are required.

Reality of Initial Costs

The initial costs of implementing llm-tldr cannot be ignored.

Required Resources

  • Disk Space: 1.3GB (bge-large-en-v1.5 model) + index files.
  • Initial Setup Time:
    • Model download: A few minutes (depending on connection speed).
    • Index creation: About 2 minutes for a medium-sized project (269 files).
  • Memory: A few hundred MB as a resident process when using daemon mode.

Ongoing Maintenance

The index does not update automatically, so maintenance is required using one of the following methods:

Method 1: Git Hook (Recommended)

Add to .git/hooks/post-commit:

```bash
#!/bin/bash
# Notify the daemon about every file changed in the latest commit
git diff --name-only HEAD~1 | xargs -I{} tldr daemon notify {} --project .
```

After changing 20 files, the daemon will automatically rebuild the semantic index.
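One detail that is easy to miss: the hook file must be executable, or Git will silently skip it:

```bash
chmod +x .git/hooks/post-commit
```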

Method 2: Manual Update

```bash
tldr warm .  # Full rebuild
```

MCP Integration: Collaboration with Claude Code

llm-tldr can operate as an MCP (Model Context Protocol) server, integrating with Claude Code / Claude Desktop.

Setting Up in Claude Code

.claude/settings.json:

```json
{
  "mcpServers": {
    "tldr": {
      "command": "tldr-mcp",
      "args": ["--project", "."]
    }
  }
}
```

This allows Claude to automatically use llm-tldr to understand the codebase. In practice, when I tested it, Claude answered the question, "Where is the authentication feature in this project?" based on the semantic search results from llm-tldr.
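To sanity-check the server outside Claude, you can run the same command the config invokes (stdio transport is my assumption):

```bash
# Launch the MCP server manually with the same arguments as the config
tldr-mcp --project .
```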

Conclusion: Insights from Two Hours of Testing

llm-tldr's semantic search handled natural language queries more effectively than I expected. It is particularly practical in the following cases:

Recommended Use Cases:

  • Large projects with over 100,000 lines of code.
  • Microservice architectures.
  • Analysis of legacy code.
  • AI-assisted development (integration with Claude Code, etc.).

Points to Note:

  • The initial setup cost (1.3GB download) is unavoidable.
  • Some features (arch, impact) may not work as expected.
  • There are challenges in resolving path aliases.
  • Index updates require settings like Git hooks.

Personally, I feel llm-tldr is well worth adopting in large projects as a tool that changes the experience of exploring a codebase. The combination with Claude Code hints at new possibilities for AI-assisted development.

I recommend trying it out on a small-scale project if you're interested.
