Garudust

Chat With Your Documents Using Garudust Agent — No Vector Database Required

Most RAG tutorials start the same way: "First, install a vector database…" Then come the embedding models, the chunking strategies, the similarity thresholds. By the time you can ask a question about a PDF, you've deployed three services and written 200 lines of boilerplate.

Garudust Agent takes a different path. RAG is built in — backed by SQLite FTS5 with a trigram tokenizer. No vector database. No embedding API calls. Drop a PDF (or TXT, CSV, Markdown, JSON) into the conversation and start asking questions in seconds.


How It Works

When you ingest a document, Garudust:

  1. Extracts text (native PDF parser, no external tools)
  2. Splits it into chunks (≤ 800 chars, paragraph-aware)
  3. Indexes chunks into an FTS5 virtual table with tokenize = 'trigram'
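
Step 2 can be sketched in a few lines of Python. This is only an illustration of paragraph-aware chunking under the 800-character limit, not Garudust's actual splitter: the name chunk_text, the greedy packing strategy, and the hard split for oversized paragraphs are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # A paragraph longer than the limit is cut into fixed-size pieces.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate      # still fits, keep packing
            else:
                chunks.append(current)   # close the full chunk
                current = piece          # start a new one
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 500 + "\n\n" + "B" * 500 + "\n\n" + "C" * 100
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 602]
```

Packing whole paragraphs keeps related sentences together, so a matching chunk is more likely to carry enough context for the LLM to answer from.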

When you ask a question, doc_search runs a full-text query against the index and feeds the top matching chunks to the LLM as context. That's the whole pipeline — one SQLite file at ~/.garudust/state.db.
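
To make the index-and-search step concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and helper functions are hypothetical, and the sample chunks are made up for the demo; Garudust's real schema in state.db may look different. The trigram tokenizer needs SQLite 3.34 or newer.

```python
import sqlite3

# In-memory database for the demo; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks "
    "USING fts5(path UNINDEXED, body, tokenize = 'trigram')"
)


def index_chunks(path: str, chunks: list[str]) -> None:
    """Store each chunk as one row in the full-text index."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO chunks (path, body) VALUES (?, ?)",
            [(path, chunk) for chunk in chunks],
        )


def search(query: str, limit: int = 5) -> list[str]:
    """Return matching chunk bodies, best match first (FTS5 rank is BM25)."""
    rows = conn.execute(
        "SELECT body FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [body for (body,) in rows]


# Made-up sample chunks standing in for an ingested handbook.
index_chunks("employee-handbook.pdf", [
    "Remote work is permitted up to 3 days per week.",
    "Annual leave accrues at 1.5 days per month.",
    "Expense reports are due by the 5th of each month.",
])

context = search('"remote work"')
print(context)  # ['Remote work is permitted up to 3 days per week.']
```

The returned chunk bodies are what would be passed to the LLM as context alongside the question.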

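Because the trigram tokenizer indexes overlapping three-character sequences rather than words, it works on any language, including Thai, Chinese, and Japanese, without any tokenizer configuration. One caveat that comes from FTS5 itself: a search term shorter than three characters matches nothing in a trigram index.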
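
A rough picture of what a trigram tokenizer produces, written in plain Python. The trigrams function is an illustrative stand-in, not SQLite's implementation; it lower-cases the text because FTS5's trigram tokenizer is case-insensitive by default.

```python
def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters, lower-cased."""
    folded = text.lower()
    return [folded[i:i + 3] for i in range(len(folded) - 2)]


english = trigrams("Policy")
japanese = trigrams("就業規則")
too_short = trigrams("休暇")

print(english)    # ['pol', 'oli', 'lic', 'icy']
print(japanese)   # ['就業規', '業規則']
print(too_short)  # []
```

No word boundaries are needed, which is why scripts written without spaces index the same way as English. The last line also shows the flip side: a two-character term yields no trigrams at all.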


Setup

RAG is enabled by default. The only thing you need to configure is which directories the agent is allowed to read from:

# ~/.garudust/config.yaml
security:
  allowed_read_paths:
    - /home/you/documents
    - /data/company-docs

That's it. If you want to turn RAG off entirely:

disabled_toolsets: [rag]

Your First Ingestion

Start the CLI:

garudust

Then tell the agent to ingest a file:

You: ingest /home/you/documents/employee-handbook.pdf

Agent: Indexed employee-handbook.pdf — 47 chunks ready for search.
       Preview: "This handbook outlines the policies and procedures for all employees…"

Now ask anything:

You: What is the remote work policy?

Agent: According to the employee handbook, remote work is permitted up to 3 days per week
       for roles that do not require on-site presence. Employees must notify their manager
       at least 24 hours in advance and maintain availability during core hours (10am–4pm).

The Four RAG Tools

Tool        What it does
doc_ingest  Extract and index a file (PDF, TXT, CSV, MD, JSON…)
doc_search  Full-text search across all ingested documents
doc_list    List all indexed documents with chunk count and timestamp
doc_forget  Remove one file or clear the entire index

You never call these directly — the agent decides when to use them based on your question. But knowing they exist helps you understand what's happening.

Re-ingesting a file

If a document changes, just ingest it again. The old index for that path is replaced automatically.
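
One way such a replace-on-reingest could work is to delete the old rows for a path and insert the new ones in the same transaction. The sketch below assumes a hypothetical chunks table keyed by path; it is not taken from Garudust's source.

```python
import sqlite3

# Hypothetical schema for the demo (trigram needs SQLite 3.34 or newer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks "
    "USING fts5(path UNINDEXED, body, tokenize = 'trigram')"
)


def reingest(path: str, chunks: list[str]) -> None:
    """Replace every indexed chunk for path in a single transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks (path, body) VALUES (?, ?)",
            [(path, chunk) for chunk in chunks],
        )


def chunk_count(path: str) -> int:
    """Count the chunks currently indexed for path."""
    row = conn.execute(
        "SELECT count(*) FROM chunks WHERE path = ?", (path,)
    ).fetchone()
    return row[0]


reingest("/docs/sop.md", ["old step one", "old step two", "old step three"])
reingest("/docs/sop.md", ["new single step"])
print(chunk_count("/docs/sop.md"))  # 1
```

Doing both statements in one transaction means a search never sees a half-replaced document.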

Forgetting a document

You: Remove the Q1 report from the index.
Agent: Document removed from index.

Or clear everything:

You: Clear all indexed documents.
Agent: Removed 5 document(s) from index.

FTS5 Query Syntax

doc_search supports full FTS5 syntax, which the agent uses automatically when your question benefits from it:

Syntax         Example
AND (default)  remote work policy
Phrase         "annual leave"
OR             vacation OR leave
NOT            policy NOT contractor
Prefix         terminat*
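
To see how these operators behave, you can try them against a throwaway FTS5 table from Python. The docs table and the four sample rows are invented for this demo and need SQLite 3.34 or newer for the trigram tokenizer. The last query is written without the trailing * because a trigram index matches substrings, so a bare terminat already finds both "Termination" and "terminate".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'trigram')")
conn.executemany("INSERT INTO docs (rowid, body) VALUES (?, ?)", [
    (1, "Remote work policy: employees may work from home up to 3 days per week."),
    (2, "Annual leave accrues monthly. Vacation requests need manager approval."),
    (3, "Contractor policy: contractors are not eligible for annual leave."),
    (4, "Termination clause: either party may terminate with 30 days notice."),
])


def match(query: str) -> list[int]:
    """Return the rowids of all rows matching an FTS5 query string."""
    rows = conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [rowid for (rowid,) in rows]


print(match("remote work policy"))     # [1]     implicit AND
print(match('"annual leave"'))         # [2, 3]  exact phrase
print(match("vacation OR leave"))      # [2, 3]
print(match("policy NOT contractor"))  # [1]
print(match("terminat"))               # [4]     substring match
```

Matching is case-insensitive by default, which is why "Annual leave" in row 2 satisfies the lower-case phrase query.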

You don't need to write FTS5 queries yourself — the agent figures this out. But if you want to guide it:

You: Search for "termination clause" in the contract documents.

Real-World Use Cases

1. Company Knowledge Base

Ingest your onboarding docs, SOPs, and internal wikis. New team members can ask questions in plain language instead of searching through Confluence.

You: ingest /docs/sop-release-process.md
You: What approvals are needed before a hotfix can go to production?

2. Contract and Legal Review

You: ingest /legal/vendor-agreement-2025.pdf
You: Does this contract include a limitation of liability clause? What is the cap?

3. Log Analysis

Ingest a log file and ask questions without writing grep patterns:

You: ingest /var/log/app/error.log
You: Which service caused the most errors in the last hour?
You: Are there any database connection timeouts?

4. Codebase Documentation

You: ingest /project/docs/api-reference.md
You: What parameters does the /auth/refresh endpoint accept?

Ingesting Files Sent via Telegram or LINE

If you're running garudust-server with a messaging platform, users can send files directly to the bot. Attachments are automatically saved to a temporary path and can be ingested on request:

User sends: quarterly-report.pdf (attached)

Agent: I received your file. Would you like me to index it for search?

User: yes

Agent: Indexed quarterly-report.pdf — 83 chunks ready.
       Preview: "Q1 2025 Financial Summary — Total Revenue: $4.2M…"

User: What was the gross margin for Q1?
Agent: According to the report, gross margin for Q1 2025 was 61.3%,
       up from 58.9% in Q4 2024.

Platform attachments (files from Telegram, LINE, Discord, etc.) are always allowed regardless of allowed_read_paths, since they're written to /tmp/garudust_* by the platform adapter.
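
The rule described here, combined with allowed_read_paths from the Setup section, amounts to a small path check. The sketch below shows one plausible shape for it; is_read_allowed and its parameters are hypothetical, and Garudust's real check may differ (for example, in how it treats symlinks).

```python
import posixpath
from pathlib import PurePosixPath


def is_read_allowed(
    path: str,
    allowed_roots: list[str],
    attachment_prefix: str = "/tmp/garudust_",
) -> bool:
    """Allow platform attachments, or files under an allowed directory."""
    # Collapse "." and ".." lexically so "../" tricks cannot escape a root.
    # This does not resolve symlinks; a real check would need to.
    normalized = posixpath.normpath(path)
    if normalized.startswith(attachment_prefix):
        return True
    candidate = PurePosixPath(normalized)
    return any(candidate.is_relative_to(PurePosixPath(root)) for root in allowed_roots)


roots = ["/home/you/documents", "/data/company-docs"]
print(is_read_allowed("/home/you/documents/employee-handbook.pdf", roots))  # True
print(is_read_allowed("/home/you/documents/../.ssh/id_rsa", roots))         # False
print(is_read_allowed("/tmp/garudust_ab12/quarterly-report.pdf", roots))    # True
print(is_read_allowed("/etc/passwd", roots))                                # False
```

Normalizing before comparing is the important part: a plain string prefix test on the raw path would accept the second example.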


Multiple Documents at Once

You can ingest multiple files and search across all of them in the same session:

You: ingest /docs/policy-2024.pdf
You: ingest /docs/policy-2025.pdf
You: What changed in the travel expense policy between 2024 and 2025?

The agent searches both documents and synthesizes the differences.

Check what's indexed at any time:

You: What documents have you indexed?

Agent: 2 documents indexed:
       - policy-2024.pdf | 34 chunks | ingested 2025-05-21 09:14
       - policy-2025.pdf | 38 chunks | ingested 2025-05-21 09:15

Limitations

  • No semantic search: FTS5 does keyword/trigram matching, not embedding similarity. If the document says "annual leave" and you ask about "vacation days," the index itself will not connect the two. The agent bridges the gap with its language understanding, so results depend on the LLM's reasoning.
  • Session-scoped by default: the index persists in state.db, but searches are scoped to the current conversation key. To query the same files in a new session, ingest them again.
  • Text only: images, tables, and charts inside PDFs are not extracted.

Summary

Aspect       Garudust RAG                  Vector DB approach
Setup        One config line               Vector DB + embedding API
Storage      SQLite (single file)          Separate service
Languages    Any (trigram)                 Depends on embedding model
Cost         Zero (no embedding calls)     Per-token embedding cost
Search type  FTS5 keyword + LLM reasoning  Semantic similarity

Garudust's RAG won't replace a purpose-built vector search pipeline for large-scale production retrieval. But for a developer who wants to ask questions about their documents right now — without running a second service — it's the fastest path from PDF to answer.


Garudust Agent: GitHub · Releases