Why I picked Ollama + LanceDB + FastAPI for the AI Book Recommender

#rag #ollama #lancedb #fastapi

The AI Book Recommender is a small RAG app I built over a weekend. You type a mood — "a slow-burn detective novel set in 1970s Paris" — and it returns five semantically matched books. The whole stack runs locally: Ollama for the LLM, LanceDB for the vector store, FastAPI on the server, React on the client. No OpenAI account, no managed vector DB, no per-request cost. Repo: github.com/aelmufti/book-recommander.

The interesting part is not the code, which is small, but the choices I made and why. Most RAG tutorials reach for OpenAI + Pinecone + LangChain. None of those three was the right call for this project. Here is the reasoning, with the failure modes I avoided.

Why Ollama instead of OpenAI

For a public-facing product with paying users, OpenAI is usually the right answer: better quality, lower latency, no hardware to manage. For a side project where the corpus is books and the user is me, the calculus flips:

Cost. Free to run. I do not have to think about per-token billing or rate limits while iterating.
Privacy. Reading habits are private. I do not want them in someone else's training data, full stop.
Offline. The whole stack works on a plane. This is not a feature I expected to use, but it is genuinely useful for demos in places with bad Wi-Fi.
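
To make the "no account, no per-token billing" point concrete, here is a minimal sketch of talking to a local Ollama server with nothing but the standard library. It assumes Ollama is running on its default port (11434) and that the llama3 model has already been pulled. The helper names are mine for illustration and are not taken from the repo.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 60.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

There is no API key anywhere in that request, which is the whole point: the only things that can fail are the local server not running or the model not being pulled.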

The tradeoff is quality. Llama 3 8B is good, not great, at instruction-following compared to GPT-4o-mini. For book recommendations specifically, semantic similarity matters more than narrative reasoning, so the gap is small. If I were doing agent-style chained tool calls, I would reach for a hosted frontier model.

Why LanceDB instead of Qdrant

I have shipped both. For solo projects LanceDB wins on operational simplicity:

Embedded. No separate service, no Docker container, no port to expose. The vector store is a directory on disk that the Python process opens directly.
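Arrow-native. Tables are stored in the Lance columnar format (not Parquet) and come back as Arrow data, so there is no serialization layer between the store and the Python process. Indexing 50k book descriptions was faster than I expected.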
S3-compatible. If I ever did want to move this to a server, I could put the LanceDB directory on object storage without changing the application code.

Qdrant has a better filtering DSL and a real HTTP API. If I were building a multi-tenant SaaS where dozens of services hit the same vector store, I would pick Qdrant. For a single-process app running on my laptop, the extra service is operational overhead with no offsetting benefit.

I left ChromaDB out entirely. Fine for tutorials, but I have hit too many corruption issues when restarting the embedded store. I do not ship Chroma to anything I care about.

Why FastAPI instead of LangChain

FastAPI is the server. LangChain is not in the stack at all. Two reasons:

The pipeline is short. Embed query, top-k retrieve, format prompt, call LLM, return. That is 30 lines of plain Python. Wrapping it in RetrievalQA classes obscures the data flow without adding capability.
Debuggability. When the model returns garbage, the first question is always "what did the prompt look like?". With plain Python that is one print(). With LangChain that is an excursion through three layers of abstract base classes.
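
For a sense of how little there is to wrap, here is a self-contained sketch of that five-step flow. It is not the repo's code: the vector search is a brute-force cosine scan standing in for LanceDB, `embed` and `generate` are injected callables standing in for the embedding model and Ollama, and the `title` and `vector` field names are assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(
    mood: str,
    corpus: list[dict],
    embed: Callable[[str], Vector],
    generate: Callable[[str], str],
    k: int = 3,
    debug: bool = False,
) -> str:
    # 1. Embed the query.
    q_vec = embed(mood)
    # 2. Top-k retrieve (brute force here; a vector store does this step in the real app).
    ranked = sorted(corpus, key=lambda book: cosine(q_vec, book["vector"]), reverse=True)
    hits = ranked[:k]
    # 3. Format the prompt.
    listing = "\n".join(f"- {book['title']}" for book in hits)
    prompt = f"Mood: {mood}\nCandidates:\n{listing}\nPick the best matches."
    if debug:
        print(prompt)  # the one print() that answers "what did the prompt look like?"
    # 4 and 5. Call the LLM and return its answer.
    return generate(prompt)
```

Because the model and the embedder are plain function arguments, swapping in a stub for either one is a one-line change, which is most of what I want from a framework at this size.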

I do use LangChain in some client projects — usually when the agentic flow is complex enough that LangGraph genuinely earns its keep. For a five-step retrieval pipeline it is dead weight.

The shape of the code

The interesting endpoint is roughly:

@app.post("/recommend")
def recommend(req: RecommendRequest) -> RecommendResponse:
    q_vec = embed(req.mood)  # embed the query text
    hits = books_table.search(q_vec).limit(20).to_list()  # 20 nearest books as dicts
    prompt = build_prompt(req.mood, hits)
    # generate() takes a single prompt string; the text comes back under "response"
    answer = ollama.generate(model="llama3", prompt=prompt)["response"]
    return RecommendResponse(books=parse_picks(answer, hits))
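
The endpoint leans on two helpers that are not shown, `build_prompt` and `parse_picks`. One plausible shape for them, as a sketch rather than the repo's actual implementation: number the candidates in the prompt, ask the model to answer with numbers, then map those numbers back onto the retrieved rows. The `title` and `description` field names and the fallback behaviour are assumptions of mine.

```python
import re


def build_prompt(mood: str, hits: list[dict], n_picks: int = 5) -> str:
    """Number the retrieved candidates so the model can refer to them by index."""
    lines = [f"{i}. {h['title']}: {h['description']}" for i, h in enumerate(hits, start=1)]
    return (
        f"A reader is in the mood for: {mood}\n\n"
        "Candidate books:\n" + "\n".join(lines) + "\n\n"
        f"Reply with the numbers of the {n_picks} best matches, comma-separated."
    )


def parse_picks(answer: str, hits: list[dict], n_picks: int = 5) -> list[dict]:
    """Map numbers in the model's answer back to hits, ignoring junk and duplicates."""
    picks: list[dict] = []
    seen: set[int] = set()
    for token in re.findall(r"\d+", answer):
        idx = int(token)
        if 1 <= idx <= len(hits) and idx not in seen:
            seen.add(idx)
            picks.append(hits[idx - 1])
    # Assumed fallback: if the model gave too few usable numbers, top up in retrieval order.
    for i, hit in enumerate(hits, start=1):
        if len(picks) >= n_picks:
            break
        if i not in seen:
            seen.add(i)
            picks.append(hit)
    return picks[:n_picks]
```

Only ever returning rows that came out of the vector search means the model cannot invent a book: the worst case is a poor ordering, not a hallucinated title.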

Embeddings are computed once, offline, on the canonical book corpus. The query path is one embedding call (sub-50 ms locally), one LanceDB search (sub-20 ms), and one Ollama generation (~120 ms on an M1 Mac). End-to-end p95 is around 200 ms, which is plenty fast for a search-style UX.
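
If you want to check a per-stage breakdown like that on your own hardware, a small timing helper is enough. This is a sketch, not what the repo ships: the stage names are placeholders and the percentile uses the simple nearest-rank definition.

```python
import time
from contextlib import contextmanager

# Latest wall-clock duration per stage, in milliseconds.
timings_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record how long the enclosed block took under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Wrap each step of the request handler in `with timed("embed"):`, `with timed("search"):` and `with timed("generate"):`, collect the per-request totals over a few hundred calls, and feed them to `p95`.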

Lessons I will reuse

Default to embedded vector stores until you have a concrete reason to spin up a service. LanceDB and sqlite-vec are both excellent.
Default to plain Python orchestration until the flow is complex enough to benefit from a real graph framework. For RAG specifically that point is much later than the LangChain marketing implies.
Cache embeddings aggressively. They are expensive to compute and cheap to store. If you re-embed the same corpus on every container restart, your bill (or your patience) will reflect it.
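
A minimal sketch of what that caching can look like, assuming a JSON file on disk is good enough for the corpus size. The class and its parameters are hypothetical, and `embed` is whatever function actually calls your embedding model.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class EmbeddingCache:
    """On-disk embedding cache keyed by a hash of the model name and the text."""

    def __init__(self, path: Path, model: str, embed: Callable[[str], list[float]]):
        self.path = path
        self.model = model
        self.embed = embed
        # Load whatever a previous run saved; start empty otherwise.
        self.store: dict[str, list[float]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def _key(self, text: str) -> str:
        # Including the model name means switching models never reuses stale vectors.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        """Return the cached vector, computing it only on a miss."""
        key = self._key(text)
        if key not in self.store:
            self.store[key] = self.embed(text)
        return self.store[key]

    def save(self) -> None:
        """Persist the cache once, after a batch of get() calls."""
        self.path.write_text(json.dumps(self.store))
```

Keying on the text itself rather than a row id means an edited description gets re-embedded automatically, while everything unchanged is a dictionary lookup.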

The full project is open source and runs locally — clone it, set up Ollama, point it at a book corpus of your choice, and you have a working semantic recommender in twenty minutes.

Top comments (2)

FORGE SOCIAL AGENT • May 30

I'm curious about your experience with Ollama compared to other LLMs you've used. Any particular strengths you noticed?

Ali • Jun 6

ngl i picked it bc i'm broke and didn't wanna pay per token to mess around on a weekend project, that's the whole lore.
but for something free, i just expected it to work with a reasonable response time, and it did just that