A zero-config command-line tool for retrieval-augmented generation — index a folder, ask questions, get cited answers. Works locally with Ollama or with cloud APIs.
Every time I wanted to ask questions about a set of documents, I'd write the same 100 lines of boilerplate: load docs, chunk them, embed them, store in a vector DB, retrieve, generate. I got tired of it. So I built a CLI tool that does it in two commands.
The Problem
RAG prototyping has too much ceremony. You have a folder of PDFs, Markdown files, maybe some text notes. You want to ask questions about them. Simple enough in theory.
In practice, you're wiring up document loaders, picking a chunking strategy, initializing an embedding provider, setting up a vector store, writing retrieval logic, and then finally getting to the part you actually care about: generating an answer. And you do this every single time you start a new project or want to test a new document set.
Existing solutions sit at the extremes. Full frameworks like LangChain and LlamaIndex are powerful, but they're heavy. You pull in a framework with dozens of abstractions just to ask a question about a folder. On the other end, tutorial notebooks are disposable. They work once, for one demo, and you throw them away.
I wanted something in the middle. A CLI that's zero-config for the common case, configurable when you need it, and built from pieces I can reuse in other projects. No framework dependencies. No notebook rot. Just a tool that does one thing well.
What I Built
rag-cli-tool gives you two commands:
```shell
rag-cli index ./my-docs/
rag-cli ask "What is the refund policy?"
```
That's it. Point it at a folder, it indexes everything. Ask a question, it answers from your documents. Supported formats include PDF, Markdown, plain text, and DOCX.
Under the hood, the pipeline is straightforward. index loads documents from the directory, splits them into overlapping chunks using a recursive text splitter, generates embeddings, and stores everything in a local ChromaDB instance. ask embeds your question, retrieves the most similar chunks, and generates an answer using only the retrieved context -- strict RAG, no hallucination from external knowledge.
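To make the index/ask flow concrete, here is a minimal sketch of the same pipeline shape in plain Python. The names (`chunk_text`, `retrieve`) and the naive cosine ranking are illustrative, not the tool's actual internals, which use a recursive splitter and ChromaDB:

```python
# Illustrative sketch of the index/ask pipeline: chunk with overlap,
# store vectors, rank stored chunks by similarity to the query.
from math import sqrt

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows -- the core of `index`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], store: list[dict], k: int = 3) -> list[dict]:
    """The core of `ask`: return the k stored chunks most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]
```

The retrieved chunks are then concatenated into the prompt context, which is all "strict RAG" means here: the model answers from those chunks or not at all.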
The tech stack is deliberately boring. ChromaDB for the vector store because it runs locally with zero setup -- no Docker, no server, just a directory. Typer for the CLI framework because it gives you type-checked arguments and auto-generated help for free. Rich for terminal output because progress bars and formatted answers make the tool pleasant to use. Pydantic Settings for configuration because environment variables and .env files are the right answer for CLI tools.
You can run it fully local with Ollama (no API keys needed) or use cloud providers:
```shell
# Local -- no API keys
RAG_CLI_MODEL=ollama:llama3.2 RAG_CLI_EMBEDDING_MODEL=ollama:nomic-embed-text \
rag-cli ask "What are the payment terms?"

# Cloud -- Anthropic + OpenAI
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
rag-cli ask "What are the payment terms?"
```
Architecture -- Built for Reuse
This is where rag-cli-tool diverges from a typical weekend project. The repository contains three independent packages, not one monolith:
```
src/
├── rag_cli/    # CLI interface (Typer + Rich)
├── llm_core/   # LLM abstraction layer (providers, config, retry)
└── rag_core/   # RAG pipeline (loaders, chunking, embeddings, retrieval)
```
llm_core handles everything related to calling language models. It defines a provider interface, implements Anthropic and Ollama adapters, and includes retry logic with exponential backoff. It knows nothing about RAG, documents, or CLI output.
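The shape of that layer looks roughly like the sketch below -- an abstract provider plus a retry wrapper with exponential backoff. The names and signatures are illustrative, not copied from llm_core:

```python
# Sketch of a provider interface and retry-with-backoff helper,
# the two responsibilities described for llm_core.
import time
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Each provider (Anthropic, Ollama, ...) implements this one method."""

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Wrap fn so transient failures are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the error
                time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return wrapper
```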
rag_core handles the RAG pipeline: loading documents, chunking text, generating embeddings, storing vectors, and retrieving results. It depends on llm_core for embedding providers but has no opinion about how you present results to users.
rag_cli is the thin layer that wires everything together. It handles argument parsing, progress bars, and formatted output. The actual logic is a few lines of glue code.
The reason for this separation is practical, not academic. I build AI projects regularly. The next one might be a web app, a Slack bot, or an API service. When that happens, I don't want to extract RAG logic from a CLI tool. I want to import rag_core and start building. Same for llm_core -- provider switching, retry logic, and configuration management are problems I solve once.
Every major component has an abstract base class. BaseLLMProvider, BaseEmbedder, BaseChunker, BaseRetriever, BaseVectorStore. Today I have one implementation of each. Tomorrow I can add a GraphRAG retriever or a Pinecone vector store without touching existing code. The abstractions aren't speculative -- they're the minimum interface each component needs to be swappable.
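What "swappable" means in practice: any backend that satisfies the base interface drops in without changing callers. A toy illustration (the class bodies are mine, only the pattern is from the project):

```python
# Illustration of the swappable-store pattern: a Chroma- or
# Pinecone-backed class would subclass the same base with the same shape.
from abc import ABC, abstractmethod

class BaseVectorStore(ABC):
    @abstractmethod
    def add(self, ids, vectors, metadatas): ...

    @abstractmethod
    def query(self, vector, k: int = 4): ...

class InMemoryVectorStore(BaseVectorStore):
    """Toy backend used here only to show the interface contract."""
    def __init__(self):
        self._rows = []

    def add(self, ids, vectors, metadatas):
        self._rows.extend(zip(ids, vectors, metadatas))

    def query(self, vector, k: int = 4):
        # Rank rows by squared Euclidean distance to the query vector.
        def dist(row):
            return sum((a - b) ** 2 for a, b in zip(vector, row[1]))
        return sorted(self._rows, key=dist)[:k]
```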
All three packages ship with tests -- 37 in total, covering providers, configuration, chunking, embeddings, retrieval, and vector store operations.
Design Decisions
Four decisions shaped the project, each with a specific reason:
ChromaDB over FAISS or Pinecone. FAISS requires numpy gymnastics for persistence and doesn't store metadata natively. Pinecone requires an account and network access. ChromaDB gives you a local, persistent vector store with metadata filtering in one line: ChromaStore(persist_dir=path). For a CLI tool that should work offline, this was the only real choice.
Typer over Click. Click is battle-tested, but Typer gives you type annotations as your argument definitions. No decorators for each option, no callback functions. You write a normal Python function with type hints, and Typer generates the CLI. The help text writes itself.
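The pattern looks like this -- a sketch with stub command bodies, not the tool's real commands, but the same Typer idiom: type hints become arguments and options, docstrings become help text:

```python
# Typer turns plain typed functions into CLI commands; no per-option
# decorators, no callbacks. Command bodies here are illustrative stubs.
import typer

app = typer.Typer(help="Minimal rag-cli-style interface.")

@app.command()
def index(path: str, chunk_size: int = 500):
    """Index every supported document under PATH."""
    typer.echo(f"indexing {path} with chunk_size={chunk_size}")

@app.command()
def ask(question: str, top_k: int = 4):
    """Answer QUESTION from the indexed documents."""
    typer.echo(f"asking: {question} (top_k={top_k})")
```

Running `rag-cli index --help` then shows `--chunk-size` with its default, generated entirely from the signature.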
Pydantic Settings for configuration. CLI tools need to read config from environment variables and .env files. Pydantic Settings does both, with validation, default values, and type coercion. One class definition replaces a dozen os.getenv() calls with fallback logic.
Provider routing via model string prefix. Instead of separate config fields for provider selection, the model string does double duty: claude-3-5-sonnet-latest routes to Anthropic, ollama:llama3.2 routes to Ollama. One config field, zero ambiguity. This pattern scales to any number of providers without config proliferation.
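The dispatch logic is a few lines at most. A sketch of one possible reading (the `openai:` prefix is my assumption for illustration; the post only confirms the Anthropic default and the `ollama:` prefix):

```python
# Prefix-based provider routing: one model string carries both the
# provider and the model name. Unprefixed strings default to Anthropic.
def resolve_provider(model: str) -> tuple[str, str]:
    """Map a model string to (provider, model_name) by its prefix."""
    if model.startswith("ollama:"):
        return "ollama", model.split(":", 1)[1]
    if model.startswith("openai:"):  # assumed prefix, for illustration
        return "openai", model.split(":", 1)[1]
    return "anthropic", model
```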
What I Learned
The 80/20 of RAG tooling surprised me. I expected the infrastructure -- vector stores, embedding APIs, retrieval logic -- to consume most of the development time. Instead, chunking decisions dominated. How big should chunks be? How much overlap? Which separators produce coherent boundaries? The pipeline code was straightforward; the tuning was where the real work happened.
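To show why separators matter, here is a toy recursive splitter (not the one in rag_core): it prefers paragraph breaks, falls back to sentence boundaries, and only hard-cuts as a last resort, so chunk boundaries tend to land on coherent units:

```python
# Toy recursive splitter: try coarse separators first, recurse to
# finer ones, hard-cut only when no separator is left.
def split_recursive(text: str, max_len: int, separators=("\n\n", ". ")) -> list[str]:
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # greedily pack parts into the current chunk
        else:
            if current:
                chunks.append(current)
            if len(part) > max_len:
                chunks.extend(split_recursive(part, max_len, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks
```

Every knob here -- `max_len`, the separator order, overlap if you add it -- changes retrieval quality, and none of them have universally right values. That tuning is where the time went.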
CLI-first development forces good API design. When your first consumer is a command-line interface, you can't hide behind web framework magic. Every input is explicit, every output is visible. This discipline produced cleaner interfaces in llm_core and rag_core than I would have gotten starting with a web app.
I intentionally shipped without several features: chat mode with conversation history, benchmarking against different chunking strategies, a web UI, and support for more vector stores. These are all reasonable features. They're also scope creep for a v0.1. The foundation is solid, the abstractions are in place, and each of those features is an afternoon of work because the architecture supports extension.
Try It
The best developer tools solve your own problems first. rag-cli-tool started as "I'm tired of writing this boilerplate" and turned into reusable building blocks for my entire AI project portfolio. If you work with documents and want a fast way to prototype RAG pipelines, give it a try.
```shell
# Install from PyPI
pip install rag-cli-tool

# Or from source
git clone https://github.com/LukaszGrochal/rag-cli-tool
cd rag-cli-tool
pip install -e .

# With Ollama (free, local)
ollama pull llama3.2 && ollama pull nomic-embed-text
rag-cli index ./sample-docs/
rag-cli ask "What is the refund policy?"
```
PyPI: https://pypi.org/project/rag-cli-tool/
GitHub: https://github.com/LukaszGrochal/rag-cli-tool
Tags: python, cli, rag, ai, developer-tools