DEV Community

soy
soy

Posted on • Originally published at media.patentllm.org

Fast Searching 4 Million Patent Records with FTS5

Introduction: The Limitations of LIKE Search

When searching for "battery" in PatentLLM's patent database (4 million records), results were slow to appear. A full table scan using LIKE '%battery%' was not practical in terms of speed.

There were three main problems with LIKE search:

  • Performance degradation due to full table scans
  • Inability to handle variations in word endings (battery / batteries)
  • Difficulty with complex searches like "battery OR lithium NOT sodium"

What is FTS5?

FTS5 is a full-text search engine built into SQLite. It pre-builds an inverted index, allowing it to instantly retrieve relevant rows from keywords.

Key benefits:

  • Fast searching with an inverted index
  • BM25 ranking: Automatic scoring based on relevance to search terms
  • Native support for complex searches: battery OR lithium NOT sodium can be written in a single line

Implementation Steps

Building the FTS5 Table

-- FTS5仮想テーブルを作成
CREATE VIRTUAL TABLE cases_fts USING fts5(
    raw_case_name,
    summary,
    analysis_json
);

-- 既存データを投入
INSERT INTO cases_fts SELECT raw_case_name, summary, analysis_json FROM merged_cases;
Enter fullscreen mode Exit fullscreen mode

Comparing Search Queries

-- 旧:LIKE検索(全件スキャン・ランキングなし)
SELECT * FROM merged_cases
WHERE raw_case_name LIKE '%battery%'
   OR summary LIKE '%battery%';

-- 新:FTS5検索(インデックス利用・BM25ランキング付き)
SELECT *, rank FROM cases_fts
WHERE cases_fts MATCH 'battery'
ORDER BY rank;
Enter fullscreen mode Exit fullscreen mode

FTS5 eliminates the need for full table scans, allowing searches on 1.73 million records to return instantly.

Query Tuning for MCP Integration

When connecting to PatentLLM's DB via the MCP server from Claude Code to perform patent analysis, we encountered an issue where "0 DB hits for all hypotheses."

Cause: Keyword Design Mismatch

Keywords generated by the Gemini API often became long phrases connected by AND, leading to cases where they didn't match the actual data.

-- 問題例:広すぎるクエリ(汎用語で大量ヒット)
SELECT COUNT(*) FROM cases_fts WHERE cases_fts MATCH 'patent';

-- 問題例:厳しすぎるクエリ(AND連結で0件)
SELECT COUNT(*) FROM cases_fts
WHERE cases_fts MATCH 'retrieval AND augmented AND generation AND pipeline';
Enter fullscreen mode Exit fullscreen mode

Solution: Manual Keyword Prioritization Logic

# Geminiに全て委任する代わりに、手動キーワードを優先
if 'fts_keywords' in hypotheses[hypothesis]:
    query = hypotheses[hypothesis]['fts_keywords']
else:
    query = analyze_query(hypothesis)
Enter fullscreen mode Exit fullscreen mode

Three-Stage Keyword Design Process

  1. Measure the number of hits with general keywords (check with COUNT(*)).
  2. Refine using FTS5's complex search syntax (combining OR / AND / NOT).
  3. Fine-tune to achieve the target number of results (10-30 items).

Summary

  • FTS5 can be immediately introduced with existing SQLite data.
  • Achieves millisecond-level search even with 1.73 million records.
  • When integrating with MCP, pay attention to keyword granularity. Always validate automatically generated queries against actual data.

Top comments (0)