<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sviatoslav Barbutsa</title>
    <description>The latest articles on DEV Community by Sviatoslav Barbutsa (@sviat_barbutsa).</description>
    <link>https://dev.to/sviat_barbutsa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854341%2F27035f48-263a-4934-ae61-5a4733463276.jpg</url>
      <title>DEV Community: Sviatoslav Barbutsa</title>
      <link>https://dev.to/sviat_barbutsa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sviat_barbutsa"/>
    <language>en</language>
    <item>
      <title>How /search and /ask Work: Local Hybrid RAG with ChromaDB + SQLite FTS5</title>
      <dc:creator>Sviatoslav Barbutsa</dc:creator>
      <pubDate>Mon, 11 May 2026 17:36:15 +0000</pubDate>
      <link>https://dev.to/sviat_barbutsa/how-search-and-ask-work-local-hybrid-rag-with-chromadb-sqlite-fts5-226c</link>
      <guid>https://dev.to/sviat_barbutsa/how-search-and-ask-work-local-hybrid-rag-with-chromadb-sqlite-fts5-226c</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the second article in a five-part series about building Llamail, a private local AI email agent.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the first article, I showed the whole system behind it: Gmail, Telegram, n8n, FastAPI, llama.cpp, SQLite, and ChromaDB. From the user's side it looked simple: type &lt;code&gt;/search&lt;/code&gt;, get useful hits; ask a follow-up question in plain English, get an answer back; do it all from a phone. Quite convenient actually.&lt;/p&gt;

&lt;p&gt;This article is about how all of that works under the hood. It will be interesting so let's dive in.&lt;/p&gt;

&lt;p&gt;Pure semantic search is great at meaning. But ask it for exact tokens like a sender name plus invoice number, and it will often return emails that are vaguely about invoices while missing the exact identifier that mattered.&lt;/p&gt;

&lt;p&gt;Pure keyword search has the opposite problem. It is strong on exact terms, sender tokens, and IDs, then falls apart when the query is conceptual: &lt;code&gt;budget&lt;/code&gt;, &lt;code&gt;financials&lt;/code&gt;, &lt;code&gt;spending plan&lt;/code&gt;, &lt;code&gt;Q2 numbers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So I do both: every query goes through ChromaDB semantic retrieval and SQLite FTS5 keyword search, and the results are merged with a tiny weighted scoring function. The idea is related to rank-fusion techniques like Reciprocal Rank Fusion (RRF), which I will cover in a separate article because it is quite useful, but this implementation is simpler: it merges normalized scores instead of reciprocal ranks. On my &lt;code&gt;18,000+&lt;/code&gt; email mailbox, that takes roughly &lt;code&gt;3 seconds&lt;/code&gt; for search and &lt;code&gt;7 seconds&lt;/code&gt; for full RAG Q&amp;amp;A, all on an average consumer laptop.&lt;/p&gt;

&lt;p&gt;If you missed part 1, start there first:&lt;br&gt;
&lt;a href="https://dev.to/sviat_barbutsa/from-inbox-to-character-building-a-private-local-ai-email-agent-c3k"&gt;From Inbox to Character: Building a Private, Local AI Email Agent&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Feels Like from Telegram
&lt;/h2&gt;

&lt;p&gt;From the outside, the flow is simple. I type &lt;code&gt;/search budget&lt;/code&gt; and get a ranked list of emails. Then I ask something like &lt;code&gt;what did John say about the budget?&lt;/code&gt; and the agent answers with sources. In article 1, that looked like a straightforward chat interface. Under the hood, though, those two commands are doing different amounts of work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/search&lt;/code&gt; only needs retrieval. &lt;code&gt;/ask&lt;/code&gt; runs retrieval too, but then turns the top results into constrained context for one more LLM call. The reason both commands feel useful instead of gimmicky is that retrieval is hybrid: semantic search handles meaning, and keyword search handles exact matches.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Single-Mode Search Breaks
&lt;/h2&gt;

&lt;p&gt;Semantic search fails in predictable ways. It is good at meaning, but bad at exactness. If the user searches for an email address, an invoice number, a project code, or a sender name that really matters, embeddings can blur those details away. &lt;code&gt;About invoices&lt;/code&gt; is not the same as &lt;code&gt;from john@acme.com about invoice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Keyword search fails just as predictably in the other direction. FTS5 is excellent at exact terms, but it has no idea that &lt;code&gt;budget&lt;/code&gt;, &lt;code&gt;financials&lt;/code&gt;, &lt;code&gt;spending plan&lt;/code&gt;, and &lt;code&gt;Q2 numbers&lt;/code&gt; are probably the same conversation. It also struggles with higher-level questions like &lt;code&gt;what did John think about the proposal?&lt;/code&gt; unless the exact wording appears in the mailbox.&lt;/p&gt;

&lt;p&gt;There is another failure mode that matters for RAG: aggregate questions. If someone asks &lt;code&gt;how many spam emails did I get last week?&lt;/code&gt;, retrieval alone cannot answer that reliably, because &lt;code&gt;/ask&lt;/code&gt; only sees the top few matching emails instead of scanning the full mailbox. The prompt explicitly tells the model to say that it only has a limited sample.&lt;/p&gt;

&lt;p&gt;The fix is dead simple: run both search modes, normalize both score types into roughly the same range, merge them, and let the strengths of one compensate for the weaknesses of the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Shape of the Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gyb17qxwa6qgsomb9q0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gyb17qxwa6qgsomb9q0.webp" alt="Email Processing Pipeline" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The diagram is a bit dense, so I also added a full-size version for readability:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=1600%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gyb17qxwa6qgsomb9q0.webp"&gt;Open full-size diagram&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bigger system still starts with email ingestion: live Gmail events or bulk import come in, the system summarizes the email and stores structured data in SQLite. In the single-email path, that same pass writes curated vector text into ChromaDB. In the chunked path, the system summarizes each chunk, creates a master summary, and writes one parent email embedding from that master summary. Search only works because the processing pipeline already did the cleanup work up front.&lt;/p&gt;

&lt;p&gt;That is one of the reasons the Telegram experience in article 1 feels so lightweight. Most of the expensive thinking happened earlier, when the email entered the system.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Actually Embed
&lt;/h2&gt;

&lt;p&gt;The vector side of the system is intentionally small. Instead of blindly embedding raw emails, I build a curated representation first and then send that to the embedding model.&lt;/p&gt;

&lt;p&gt;This is the shared parent-email embedding text builder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# webservice/src/email_service/services/email_processor.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProcessEmailRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;embed_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attachments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;embed_text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Attachments: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embed_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key decision is that I embed the LLM summary instead of the raw body. For normal emails, that is the direct summary. For long emails, it is the master summary created after chunking. That keeps the vector text semantically dense and strips out a lot of email garbage: signatures, disclaimers, nested reply chains, boilerplate greetings, and formatting noise. I also append attachment filenames so queries like &lt;code&gt;the email with the budget spreadsheet&lt;/code&gt; still have a chance to hit.&lt;/p&gt;

&lt;p&gt;The ChromaDB wrapper is correspondingly small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# webservice/src/email_service/services/embeddings.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;where&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the collection setup is just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chroma_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;_collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hnsw:space&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
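&lt;p&gt;The &lt;code&gt;search&lt;/code&gt; excerpt above stops at the &lt;code&gt;query&lt;/code&gt; call. As a rough sketch of what could come next, and not the project's actual code: ChromaDB returns parallel lists per query, and with &lt;code&gt;hnsw:space&lt;/code&gt; set to &lt;code&gt;cosine&lt;/code&gt; the reported distance is one minus the cosine similarity, so a similarity score can be recovered by subtracting the distance from one. The helper name &lt;code&gt;shape_results&lt;/code&gt; and the output keys are hypothetical.&lt;/p&gt;

```python
def shape_results(results):
    """Flatten a ChromaDB query result (single query) into a list of hit dicts."""
    ids = results["ids"][0]
    distances = results["distances"][0]
    # Metadata may be missing entirely; fall back to empty dicts in that case.
    metadatas = results.get("metadatas") or [[{}] * len(ids)]
    hits = []
    for doc_id, distance, metadata in zip(ids, distances, metadatas[0]):
        hits.append(
            {
                "id": doc_id,
                # Cosine space: distance = 1 - cosine similarity.
                "score": 1.0 - distance,
                "metadata": metadata or {},
            }
        )
    return hits


# A hand-written result in the shape ChromaDB returns for one query.
sample = {
    "ids": [["a", "b"]],
    "distances": [[0.1, 0.4]],
    "metadatas": [[{"account_id": "x"}, None]],
}
hits = shape_results(sample)
print(hits)
```

&lt;p&gt;Turning distances into higher-is-better scores at this point keeps the later merge step simple, because both retrieval paths then agree on direction.&lt;/p&gt;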



&lt;p&gt;The embedding model is Nomic Embed v2 MoE, served by a separate llama.cpp instance on its own dedicated port. That split matters. Embedding requests do not block text generation on the main LLM server, so embedding never competes with &lt;code&gt;/ask&lt;/code&gt; and summarization for the same process, which keeps the whole experience responsive.&lt;/p&gt;
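&lt;p&gt;For illustration only, here is a minimal sketch of what an &lt;code&gt;embed&lt;/code&gt; helper talking to a dedicated llama.cpp server could look like, assuming the server was started in embedding mode and exposes the OpenAI-style &lt;code&gt;/v1/embeddings&lt;/code&gt; route. The host, port, the &lt;code&gt;is_query&lt;/code&gt; flag, and the two small helper functions are assumptions. Nomic models are documented to expect a task prefix on the input; whether this project adds one is not stated here.&lt;/p&gt;

```python
import json
import urllib.request

# Assumption: placeholder host and port for the dedicated embedding server.
EMBED_URL = "http://127.0.0.1:8081/v1/embeddings"


def build_payload(text, is_query=False):
    """Build the request body, adding the task prefix Nomic models expect."""
    prefix = "search_query: " if is_query else "search_document: "
    return {"input": prefix + text}


def parse_embedding(response_body):
    """Pull the first vector out of an OpenAI-style embeddings response."""
    return response_body["data"][0]["embedding"]


def embed(text, is_query=False, url=EMBED_URL, timeout=30):
    """POST text to the embedding server and return a single vector."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(text, is_query)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding(json.load(response))


# The payload can be inspected without a running server.
print(build_payload("budget spreadsheet", is_query=True))
```

&lt;p&gt;Because the helper only needs a URL, pointing it at a second server process is a configuration change, which is what makes the split between embedding and generation cheap to maintain.&lt;/p&gt;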

&lt;p&gt;It is worth noting that chunked emails currently get one parent embedding from the master summary. I do not embed every individual chunk yet. That keeps the architecture simpler while still making long emails visible to semantic search. I think overengineering is one of the most common problems in software engineering, and I'm definitely not immune to it, so I try to keep that in mind.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Keyword Side: SQLite FTS5
&lt;/h2&gt;

&lt;p&gt;The exact-match side of the pipeline is simple SQLite FTS5. I didn't see any need for Elasticsearch, or Meilisearch, or any separate search service. Just a virtual table living next to the relational data. My general principle is: keep it simple, yet still satisfy the business requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIRTUAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;emails_fts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;fts5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'emails'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_rowid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'rowid'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;emails_ai&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt; &lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;emails_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_address&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full database setup also includes matching delete/update triggers plus a &lt;code&gt;rebuild_fts()&lt;/code&gt; helper for one-time backfills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rebuild_fts&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Backfill FTS5 index from existing emails. One-time catch-up.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO emails_fts(emails_fts) VALUES (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rebuild&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
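&lt;p&gt;The delete and update triggers are not shown in this article. A sketch of how they usually look for an external-content FTS5 table, following the pattern in the SQLite documentation, is below. The trigger names &lt;code&gt;emails_ad&lt;/code&gt; and &lt;code&gt;emails_au&lt;/code&gt; and the trimmed-down &lt;code&gt;emails&lt;/code&gt; table are assumptions; the point is that removing an index entry requires inserting a special &lt;code&gt;'delete'&lt;/code&gt; row with the old column values.&lt;/p&gt;

```python
import sqlite3

SCHEMA = """
CREATE TABLE emails (
    id TEXT, subject TEXT, body_text TEXT, summary TEXT,
    from_name TEXT, from_address TEXT
);
CREATE VIRTUAL TABLE emails_fts USING fts5(
    subject, body_text, summary, from_name, from_address,
    content='emails', content_rowid='rowid'
);
CREATE TRIGGER emails_ai AFTER INSERT ON emails BEGIN
    INSERT INTO emails_fts(rowid, subject, body_text, summary, from_name, from_address)
    VALUES (new.rowid, new.subject, new.body_text, new.summary, new.from_name, new.from_address);
END;
-- Deleting from the index needs the OLD values of every indexed column.
CREATE TRIGGER emails_ad AFTER DELETE ON emails BEGIN
    INSERT INTO emails_fts(emails_fts, rowid, subject, body_text, summary, from_name, from_address)
    VALUES ('delete', old.rowid, old.subject, old.body_text, old.summary, old.from_name, old.from_address);
END;
-- An update is a delete of the old row followed by an insert of the new one.
CREATE TRIGGER emails_au AFTER UPDATE ON emails BEGIN
    INSERT INTO emails_fts(emails_fts, rowid, subject, body_text, summary, from_name, from_address)
    VALUES ('delete', old.rowid, old.subject, old.body_text, old.summary, old.from_name, old.from_address);
    INSERT INTO emails_fts(rowid, subject, body_text, summary, from_name, from_address)
    VALUES (new.rowid, new.subject, new.body_text, new.summary, new.from_name, new.from_address);
END;
"""


def count_matches(conn, term):
    """Count indexed rows that match a single FTS5 term."""
    row = conn.execute(
        "SELECT count(*) FROM emails_fts WHERE emails_fts MATCH ?", (term,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO emails VALUES "
    "('m1', 'Q2 budget', 'Numbers attached', 'Budget review', 'John', 'john@acme.com')"
)
budget_after_insert = count_matches(conn, "budget")

conn.execute(
    "UPDATE emails SET subject = 'Q2 forecast', summary = 'Forecast review' WHERE id = 'm1'"
)
budget_after_update = count_matches(conn, "budget")
forecast_after_update = count_matches(conn, "forecast")

conn.execute("DELETE FROM emails WHERE id = 'm1'")
forecast_after_delete = count_matches(conn, "forecast")

print(budget_after_insert, budget_after_update, forecast_after_update, forecast_after_delete)
```

&lt;p&gt;With all three triggers in place the index follows the table automatically, so &lt;code&gt;rebuild_fts()&lt;/code&gt; is only needed for rows that existed before the triggers did.&lt;/p&gt;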



&lt;p&gt;Why FTS5 is the right level of boring here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero infrastructure because it's built into SQLite.&lt;/li&gt;
&lt;li&gt;There are no additional network calls, no extra service to manage, and no additional deployment.&lt;/li&gt;
&lt;li&gt;WAL mode lets reads proceed while a write is in progress (SQLite still allows only one writer at a time).&lt;/li&gt;
&lt;li&gt;For a mailbox this size, the exact-match lookup itself is cheap. The slower part of overall &lt;code&gt;/search&lt;/code&gt; is usually local embedding inference and, for &lt;code&gt;/ask&lt;/code&gt;, the final answer generation.&lt;/li&gt;
&lt;/ul&gt;
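&lt;p&gt;For readers who have not used WAL mode before, enabling it is a single pragma. The project most likely sets it through its own database layer, so the standalone &lt;code&gt;sqlite3&lt;/code&gt; version below is only an illustration, and the helper name and file name are made up.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile


def open_wal_connection(path):
    """Open a SQLite file and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# WAL needs a real file; in-memory databases cannot use it.
db_path = os.path.join(tempfile.mkdtemp(), "emails.db")
conn, mode = open_wal_connection(db_path)
print(mode)
conn.close()
```

&lt;p&gt;The setting is stored in the database file, so it only has to be applied once and later connections pick it up.&lt;/p&gt;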

&lt;p&gt;This is the FTS query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# webservice/src/email_service/services/search.py
&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
         SELECT e.id, -fts.rank AS score
         FROM emails_fts fts
         JOIN emails e ON e.rowid = fts.rowid
         WHERE emails_fts MATCH :query
         ORDER BY fts.rank
         LIMIT :limit
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, &lt;code&gt;fts.rank&lt;/code&gt; is lower-is-better: FTS5 reports BM25-based values where better matches are more negative. So I sort by the raw rank but expose &lt;code&gt;-fts.rank&lt;/code&gt; as a positive score. Then I normalize it to the &lt;code&gt;0-1&lt;/code&gt; range so it can be mixed with semantic scores cleanly.&lt;/p&gt;
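&lt;p&gt;The normalization step itself is not shown here. One simple option, sketched below as an assumption and not as the project's actual code, is to divide every positive score by the best score in the result set, so the top hit becomes &lt;code&gt;1.0&lt;/code&gt; and the rest scale down proportionally. The function name &lt;code&gt;normalize_scores&lt;/code&gt; and the sample values are hypothetical.&lt;/p&gt;

```python
def normalize_scores(raw_scores):
    """Scale positive scores into the 0-1 range by dividing by the best score."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best == 0:
        # Avoid division by zero when every score is zero.
        return {doc_id: 0.0 for doc_id in raw_scores}
    return {doc_id: score / best for doc_id, score in raw_scores.items()}


# Positive scores as they would come out of "-fts.rank AS score".
fts_scores = {"m1": 8.0, "m2": 4.0, "m3": 2.0}
normalized = normalize_scores(fts_scores)
print(normalized)
```

&lt;p&gt;Dividing by the maximum is relative to each query, so a score of &lt;code&gt;1.0&lt;/code&gt; means "best keyword hit for this query" and not "perfect match". That is usually good enough when the goal is only to blend two rankings.&lt;/p&gt;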

&lt;p&gt;One nuance: this version passes the query string directly into FTS5. That works well for normal words, but punctuation-heavy searches like raw email addresses or invoice numbers may need a small escaping or tokenization step. If FTS5 rejects the query, the code logs the failure and still returns semantic results instead of breaking the Telegram command.&lt;/p&gt;
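&lt;p&gt;A small escaping step could look like the sketch below. It assumes the simplest policy: wrap every whitespace-separated term in double quotes so FTS5 treats it as a literal string, and fall back to an empty keyword result if the query is still rejected. The function names and the tiny demo table are hypothetical.&lt;/p&gt;

```python
import sqlite3


def to_fts5_query(user_query):
    """Quote each whitespace-separated term so FTS5 reads it as a literal string."""
    terms = user_query.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


def safe_fts_search(conn, user_query, limit=10):
    """Return (rowid, score) pairs, or an empty list if FTS5 rejects the query."""
    try:
        return conn.execute(
            "SELECT rowid, -rank AS score FROM emails_fts "
            "WHERE emails_fts MATCH ? ORDER BY rank LIMIT ?",
            (to_fts5_query(user_query), limit),
        ).fetchall()
    except sqlite3.OperationalError:
        # Keyword search failed; the caller can still use the semantic hits.
        return []


# Demo table with just two columns, enough to exercise punctuation-heavy input.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE emails_fts USING fts5(subject, from_address)")
conn.execute("INSERT INTO emails_fts VALUES ('Invoice INV-2024-17', 'john@acme.com')")

hits = safe_fts_search(conn, "john@acme.com INV-2024-17")
print(to_fts5_query("john@acme.com INV-2024-17"), hits)
```

&lt;p&gt;The trade-off is that quoting everything turns off FTS5 operators such as &lt;code&gt;OR&lt;/code&gt;, &lt;code&gt;NEAR&lt;/code&gt;, and prefix matching with &lt;code&gt;*&lt;/code&gt;. For a chat command where people type plain words, addresses, and IDs, that seems like an acceptable loss.&lt;/p&gt;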




&lt;h2&gt;
  
  
  The Merge That Makes It Work
&lt;/h2&gt;

&lt;p&gt;The merge algorithm is the part I like most because it is so small relative to how useful it feels in practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F401558wxsfqgpwfxf0dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F401558wxsfqgpwfxf0dw.png" alt="RAG Hybrid Search Flow" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the code, the full hybrid search function is small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# webservice/src/email_service/services/search.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;after_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;semantic_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_semantic_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fts_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_fts_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fts_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_enrich&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;after_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;after_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is the actual merge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;all_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;SEMANTIC_WEIGHT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;
    &lt;span class="n"&gt;FTS_WEIGHT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;email_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;SEMANTIC_WEIGHT&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;FTS_WEIGHT&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;email_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is basically the whole trick.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;60/40&lt;/code&gt;?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantic search should dominate for concept-heavy questions.&lt;/li&gt;
&lt;li&gt;FTS should still have enough weight to pull exact identifiers upward.&lt;/li&gt;
&lt;li&gt;Emails found by both methods nearly always deserve to float to the top.&lt;/li&gt;
&lt;li&gt;Fetching &lt;code&gt;2x&lt;/code&gt; the final limit from each source before trimming prevents one source from crowding out the other too early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's basically it: there is no fancy reranker or fusion layer after that. Just two signal sources, a weighted sum, and a sort. Remember: simple yet functional.&lt;/p&gt;

&lt;p&gt;A concrete scoring example makes the behavior easier to see. If &lt;code&gt;email_A&lt;/code&gt; scores &lt;code&gt;0.90&lt;/code&gt; in semantic search and &lt;code&gt;0.00&lt;/code&gt; in FTS5, the combined score is &lt;code&gt;0.54&lt;/code&gt;. If &lt;code&gt;email_B&lt;/code&gt; scores &lt;code&gt;0.00&lt;/code&gt; in semantic search and &lt;code&gt;0.95&lt;/code&gt; in FTS5, the combined score is &lt;code&gt;0.38&lt;/code&gt;. But if &lt;code&gt;email_C&lt;/code&gt; scores &lt;code&gt;0.70&lt;/code&gt; in semantic search and &lt;code&gt;0.80&lt;/code&gt; in FTS5, the combined score is &lt;code&gt;0.74&lt;/code&gt;, so the result found by both systems wins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F841ynohk59qkz3jsoyct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F841ynohk59qkz3jsoyct.png" alt="Hybrid Scoring Example" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  From Ranked Results to RAG Answers
&lt;/h2&gt;

&lt;p&gt;This is the point where article 1's two user-visible commands start to diverge.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/search&lt;/code&gt; only needs ranked results. &lt;code&gt;/ask&lt;/code&gt; goes one step further: it takes those search hits, renders them into a prompt, adds limited conversation history, and asks the local LLM for a JSON answer.&lt;/p&gt;

&lt;p&gt;The current prompt template is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Answer the user's question based ONLY on the email context below.
You are Sable, a private local email agent. Be calm, factual, and concise.
If the answer is not in the context, say "I don't have enough information in the available emails to answer that."
If the question asks to count, list totals, or compute statistics across all emails, explain that you can only see a limited sample of emails, not the full mailbox.
{% if conversation_history %}

--- CONVERSATION HISTORY ---
{{ conversation_history }}
--- END HISTORY ---
{% endif %}

--- EMAIL CONTEXT ---
{% for email in emails %}

[{{ loop.index }}] From: {{ email.from_name or "Unknown" }} &amp;lt;{{ email.from_address }}&amp;gt;
Subject: {{ email.subject or "(no subject)" }}
Date: {{ email.received_at }}
Summary: {{ email.summary or "(no summary)" }}
{% if email.attachments %}Attachments: {{ email.attachments }}{% endif %}

{% endfor %}
--- END CONTEXT ---

Question: {{ question }}

Return ONLY valid JSON:
{
    "answer": "Your answer here, referencing emails by [number] when relevant",
    "confidence": "high | medium | low"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conversation history is stored in SQLite via the &lt;code&gt;ChatMessage&lt;/code&gt; table for persistence. The helper in &lt;code&gt;chat_memory.py&lt;/code&gt; pulls recent exchanges, then trims them again against a token budget before formatting them for the prompt.&lt;/p&gt;

&lt;p&gt;In the current code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;history is stored in SQLite&lt;/li&gt;
&lt;li&gt;the latest rows are loaded per &lt;code&gt;chat_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;history is capped at &lt;code&gt;settings.chat_history_limit * 2&lt;/code&gt;, which works out to 20 rows by default, or about 10 user/assistant exchanges when the conversation alternates cleanly&lt;/li&gt;
&lt;li&gt;it is trimmed again to a &lt;code&gt;2,000&lt;/code&gt; token budget before prompt rendering&lt;/li&gt;
&lt;li&gt;all Telegram exchanges are recorded, but only the &lt;code&gt;ask&lt;/code&gt; and &lt;code&gt;chitchat&lt;/code&gt; flows inject that history into an LLM prompt&lt;/li&gt;
&lt;/ul&gt;
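
&lt;p&gt;To make the two-stage trimming concrete, here is a minimal sketch of the idea, not the actual code from &lt;code&gt;chat_memory.py&lt;/code&gt;. The names &lt;code&gt;estimate_tokens&lt;/code&gt;, &lt;code&gt;trim_history&lt;/code&gt;, and &lt;code&gt;format_history&lt;/code&gt; are hypothetical, and the "about 4 characters per token" heuristic is an assumption standing in for whatever token counter the project really uses.&lt;/p&gt;

```python
def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token; not the project's real counter.
    return max(1, len(text) // 4)


def trim_history(rows, max_rows=20, token_budget=2000):
    """Keep the newest rows that fit both the row cap and the token budget.

    rows is a list of (role, content) tuples ordered oldest to newest.
    """
    recent = rows[-max_rows:]  # first cap: latest N rows
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent exchanges survive trimming.
    for role, content in reversed(recent):
        cost = estimate_tokens(content)
        if used + cost > token_budget:
            break
        kept.append((role, content))
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept


def format_history(rows):
    """Render rows as plain 'role: content' lines for the prompt template."""
    return "\n".join(f"{role}: {content}" for role, content in rows)


history = [
    ("user", "What did John say about the budget?"),
    ("assistant", "He asked to postpone the decision [1]."),
    ("user", "When was that?"),
]
print(format_history(trim_history(history)))
```

&lt;p&gt;Walking from newest to oldest means that when the budget runs out, it is the oldest exchanges that get dropped, which is usually what a follow-up question needs.&lt;/p&gt;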

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;em&gt;Sure, I could have used LangChain or LlamaIndex with a more elaborate multi-level memory system, but that would have added a lot of complexity for little benefit in this project. A small SQLite-backed memory window was enough for my purpose.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is enough for useful follow-ups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;What did John say about the budget?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;When was that?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Did he mention attachments?&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current context-budget reality is also quite modest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app context window setting: &lt;code&gt;8192&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;chat history token budget: &lt;code&gt;2000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/ask&lt;/code&gt; search result count: &lt;code&gt;5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;there is no separate explicit &lt;code&gt;max_output_tokens&lt;/code&gt; setting in the current &lt;code&gt;ask()&lt;/code&gt; path; the prompt keeps the task narrow instead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system stays comfortably inside an 8k context window without pretending it has full-mailbox visibility, which is extremely useful on common hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Small Details That Matter
&lt;/h2&gt;

&lt;p&gt;The raw hybrid merge is not the whole story. Two practical details make the results feel much smarter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Date filtering from natural language&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For search-style messages, the intent classifier can turn relative time phrases into an ISO date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"emails from last week" -&amp;gt; after_date: "&amp;lt;computed ISO date&amp;gt;"
"since Monday"          -&amp;gt; after_date: "&amp;lt;computed ISO date&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not the literal example value but the mechanism. &lt;code&gt;classify_intent.j2&lt;/code&gt; gets &lt;code&gt;today&lt;/code&gt; injected as a template variable, so the router can compute a real &lt;code&gt;after_date&lt;/code&gt; before the search handler runs. In the current code, that date parameter is wired into the search flow; &lt;code&gt;/ask&lt;/code&gt; still uses the top retrieved emails without a separate date-filter argument.&lt;/p&gt;
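
&lt;p&gt;For illustration, here is roughly what the consuming side of such a date could look like. This is a sketch under stated assumptions, not the project's code: &lt;code&gt;last_week_start&lt;/code&gt; and &lt;code&gt;filter_after&lt;/code&gt; are hypothetical helpers, "last week" is read as "seven days before today", and each result is assumed to carry an ISO-formatted &lt;code&gt;received_at&lt;/code&gt; string.&lt;/p&gt;

```python
from datetime import date, datetime, timedelta


def last_week_start(today):
    """One possible reading of 'last week': seven days before today."""
    return today - timedelta(days=7)


def filter_after(results, after_date):
    """Drop results received before after_date (an ISO date string or None)."""
    if after_date is None:
        return results
    cutoff = date.fromisoformat(after_date)
    return [
        r for r in results
        if datetime.fromisoformat(r["received_at"]).date() >= cutoff
    ]


# Hypothetical data: 'today' is fixed so the example is reproducible.
today = date(2026, 5, 11)
after_date = last_week_start(today).isoformat()

results = [
    {"id": "old", "received_at": "2026-05-01T10:00:00"},
    {"id": "new", "received_at": "2026-05-08T09:30:00"},
]
print(after_date, [r["id"] for r in filter_after(results, after_date)])
```

&lt;p&gt;Passing the date around as a plain ISO string keeps the classifier output easy to validate before it reaches the search handler.&lt;/p&gt;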

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Source thresholding for &lt;code&gt;/ask&lt;/code&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer formatter hides weak sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;good_sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;good_sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sources (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That tiny cutoff fixed an annoying behavior where broad or aggregate questions would dump obviously irrelevant citations just because they happened to be in the top five retrieved emails.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Like This Approach
&lt;/h2&gt;

&lt;p&gt;Hybrid search is a good answer to the common RAG relevancy problems. The merge function is only a handful of lines, but the improvement over semantic-only or keyword-only retrieval is dramatic.&lt;br&gt;
SQLite FTS5 + ChromaDB, both embedded and running on one average machine, are enough to make 18,000+ emails genuinely searchable in a useful way.&lt;/p&gt;

&lt;p&gt;In my setup that means &lt;code&gt;~3 seconds&lt;/code&gt; for search and &lt;code&gt;~7 seconds&lt;/code&gt; for full RAG Q&amp;amp;A. Slower than a cloud stack, obviously, but fast enough to be practical from Telegram on a phone. For my purposes it's perfectly fine.&lt;/p&gt;

&lt;p&gt;If article 1 was the broad system tour, this is the subsystem that made the whole project feel real to me. Once search stopped acting like a demo and started returning the emails I actually meant, the rest of the agent suddenly had something solid to stand on.&lt;/p&gt;

&lt;p&gt;Source code:&lt;br&gt;
&lt;a href="https://github.com/sviat-barbutsa/llamail" rel="noopener noreferrer"&gt;github.com/sviat-barbutsa/llamail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next article, I will move one layer up from retrieval and into the command layer: the LLM-as-router pattern that lets people talk to this system naturally instead of memorizing boring command syntax. Stay tuned!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
      <category>database</category>
    </item>
    <item>
      <title>From Inbox to Character: Building a Private, Local AI Email Agent</title>
      <dc:creator>Sviatoslav Barbutsa</dc:creator>
      <pubDate>Mon, 06 Apr 2026 00:09:34 +0000</pubDate>
      <link>https://dev.to/sviat_barbutsa/from-inbox-to-character-building-a-private-local-ai-email-agent-c3k</link>
      <guid>https://dev.to/sviat_barbutsa/from-inbox-to-character-building-a-private-local-ai-email-agent-c3k</guid>
      <description>&lt;p&gt;There are 18k+ emails in my personal inbox, and it's only one of the accounts I have. I wanted to search through them semantically, get AI summaries, draft replies, and run email campaigns - all from my phone. I didn't want OpenAI reading my emails or Google's AI. Or anyone's. For me, local AI is the only real answer for private data processing because no one can read your data or train their models on your data while you're paying for the service.&lt;/p&gt;

&lt;p&gt;So I built my own - a private, local email agent project called Llamail. Its default synthetic persona is Sable. It helps me search, summarize, and manage my emails, and can also chat in a casual, roleplay-like manner to make the whole thing a bit more fun. ~3700 lines of Python, a Llama model running on my average, consumer laptop GPU, and a Telegram bot as the interface - and everything is handcrafted, not vibecoded. Here's how.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxee2g613kgojbn8mv2cd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxee2g613kgojbn8mv2cd.jpg" alt="Sable welcome" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Meet Sable - a private, local email agent with a configurable synthetic persona.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4usur28h22w9t7vrj9n9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4usur28h22w9t7vrj9n9.jpg" alt="Sable hero shot" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Two common workflows: RAG-based search and live incoming email processing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the first article in a five-part series.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Local?
&lt;/h2&gt;

&lt;p&gt;I could've used OpenAI or Gemini or any other cloud API and been done in a weekend. But emails are personal and they contain contracts, salary discussions, medical stuff, conversations with people who didn't consent to having their words processed by a third-party AI, and I wasn't comfortable with that.&lt;/p&gt;

&lt;p&gt;There's also the practical side: 18,000+ emails means real API costs. With GPT-5.4 pricing as of April 5, 2026 ($2.50/1M input, $15/1M output tokens), the initial import alone would likely land somewhere around $60–120 depending on average email length and summary size - not ruinous on its own, but it's still an upfront cost that scales with the size of the mailbox. Then there are search queries, draft generation, and follow-up questions - it adds up. A local LLM costs electricity and patience, but it doesn't bill you per token. And honestly, I just prefer running things locally. The independence from yet another metered subscription is its own reward.&lt;/p&gt;

&lt;p&gt;So the constraint was: everything runs on my hardware, and nothing leaves my network, with one exception - Telegram. It's a third-party service, and in theory they could read my messages if they really wanted to. But on paper, unlike cloud AI providers, processing my data isn't their product - they're a messaging platform, not a model training pipeline. For now that's a tradeoff I accept for the convenience of a mobile interface that works everywhere. In the future I'm thinking about implementing an adapter layer so Telegram can be swapped for other interfaces such as a self-hosted Matrix bot, a web UI, or even a plain CLI.&lt;/p&gt;


&lt;h2&gt;
  
  
  What It Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;I control everything from Telegram. Here's a real flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fml3mg59chap8co6nofv4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fml3mg59chap8co6nofv4.jpg" alt="Live email processing in Telegram" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;New emails are automatically summarized and pushed to Telegram&lt;/em&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe37ul6wzp46tufq8kitm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe37ul6wzp46tufq8kitm.jpg" alt="Searching emails and asking follow-up questions" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Searching emails and asking follow-up questions with RAG&lt;/em&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo4o7yy70ief7xrfa5qy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo4o7yy70ief7xrfa5qy.jpg" alt="Draft reply in Telegram" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;"agree but ask to clarify the refund numbers" → a polished, professional email. You say what to say, the LLM figures out how.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The agent handles search and Q&amp;amp;A (with simple follow-up memory), email drafting and sending (including scheduled sends), bulk Gmail import, sender blocking and unsubscribing, grammar checking, and LLM-personalized email campaigns with reply tracking.&lt;/p&gt;

&lt;p&gt;You can use slash commands (&lt;code&gt;/search budget&lt;/code&gt;), bare commands (&lt;code&gt;import status&lt;/code&gt;), or just talk to it naturally - "hey, what did the team discuss last week?" The LLM reads your message, picks the right tool from 30+ available actions, extracts parameters from context (including computing "last week" into an actual date), and executes it. Between tasks, it stays in character as a configurable persona - you can chitchat with it, and it remembers your conversation history across sessions. I gave it a cold synthetic voice so it would feel like a real agent instead of a generic assistant.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Side note: Initially, the project used a persona inspired by a well-known white-haired android lady with a bob haircut. To avoid any potential copyright issues, I reworked it into an original character. It ended up being the better creative decision anyway. In practice, though, you can configure any personality you want: it is really just a prompt template, a name, and an avatar image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4yn4kfr6r31odd7ruw2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4yn4kfr6r31odd7ruw2.jpg" alt="General conversation" width="800" height="1733"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;It can step outside pure inbox work too, which makes the whole thing feel more like a character-driven agent than a command parser.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'll walk through the interesting parts of how this works.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Stack and How It Fits Together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7coisgkj7rbgjw30to60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7coisgkj7rbgjw30to60.png" alt="Llamail system architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The high-level architecture: n8n handles Gmail and Telegram events, while the Python webservice does all the actual thinking.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The system has two halves that barely know about each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt; (running in Docker) is a dumb message bridge. It watches Gmail for new emails and forwards Telegram messages. That's it. The Telegram command workflow is literally three nodes: &lt;code&gt;Trigger → HTTP Request → Send Message&lt;/code&gt;. No logic, no branching, no code nodes. Sure, I could have skipped n8n entirely and built Python services for the Gmail and Telegram APIs, but I decided to save some time. Plus, n8n has a good library of connectors, so if I ever need to plug in Slack, Discord, or another service, it's a drag-and-drop node instead of a new API integration. That was another solid argument for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Python webservice&lt;/strong&gt; (FastAPI) is the brain. Every command, every LLM call, every database query, every Gmail API interaction happens here. When I need to add a new feature, I write a Python function - not a workflow branch.&lt;/p&gt;

&lt;p&gt;The two halves talk over HTTP. n8n lives in Docker, the webservice runs on the host. &lt;code&gt;host.docker.internal&lt;/code&gt; bridges them.&lt;/p&gt;

&lt;p&gt;Here's the full stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;llama.cpp + Llama 3.1 8B (Q8_0)&lt;/td&gt;
&lt;td&gt;Local, free, OpenAI-compatible API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;llama.cpp + Nomic Embed v2 MoE (Q6_K)&lt;/td&gt;
&lt;td&gt;Separate server so it doesn't block the LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API layer&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;Pydantic validation, lifespan for background threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;SQLite (WAL) + ChromaDB&lt;/td&gt;
&lt;td&gt;One file for relational data, embedded vector store for semantic search. FTS5 for keyword search - no extra service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;n8n (Docker) + Cloudflare Tunnel&lt;/td&gt;
&lt;td&gt;Gmail/Telegram triggers. Tunnel gives a free HTTPS URL for webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface&lt;/td&gt;
&lt;td&gt;Telegram Bot&lt;/td&gt;
&lt;td&gt;Works from phone. No terminal, no venv, no SSH&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I develop on Windows with an RTX 4080 Mobile (12GB VRAM, CUDA). Production runs on a Linux Mint mini-PC with a Ryzen 7 8845HS (Vulkan). Same Python code for both - llama.cpp handles the GPU backend difference.&lt;/p&gt;

&lt;p&gt;The entire LLM client is 81 lines. No LangChain, no framework - just &lt;code&gt;httpx.post()&lt;/code&gt; to an OpenAI-compatible endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# services/llm.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;json_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;90.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# surface HTTP errors here instead of a KeyError on the next line
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;There is no persona hardcoded into the client itself. Personality lives in the prompt templates, which makes it easy to swap or reconfigure later.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because llama.cpp's server exposes an OpenAI-compatible API, I can swap to Ollama, vLLM, or actual OpenAI by changing the base URL (and, in practice, the model name; hosted OpenAI also needs an API key header). The rest of the code never touches the provider. Don't get me wrong, writing an adapter for another protocol is not a big deal, but this is about convenience, and I like the performance of llama.cpp. It's the only local LLM runner for me.&lt;/p&gt;
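
&lt;p&gt;To make the "one URL" idea concrete, here is a minimal sketch of provider presets. The &lt;code&gt;LLMSettings&lt;/code&gt; class and &lt;code&gt;chat_completions_url&lt;/code&gt; helper are hypothetical names, not code from the repo. The ports are the usual local defaults for each server, and the model names are placeholders for whatever you actually have loaded.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical settings object; the article's client reads llm_url and llm_model."""

    llm_url: str
    llm_model: str


def chat_completions_url(settings: LLMSettings) -> str:
    """Build the OpenAI-style chat endpoint, tolerating a trailing slash on the base URL."""
    return f"{settings.llm_url.rstrip('/')}/v1/chat/completions"


# Assumed local defaults; adjust to your own setup.
LLAMA_CPP = LLMSettings("http://localhost:8080", "local-model")
OLLAMA = LLMSettings("http://localhost:11434", "llama3.1:8b")
VLLM = LLMSettings("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct")

for preset in (LLAMA_CPP, OLLAMA, VLLM):
    print(chat_completions_url(preset))
```

&lt;p&gt;Switching providers then means picking a different preset; the payload built by the client stays the same.&lt;/p&gt;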




&lt;h2&gt;
  
  
  The One Pattern That Runs Everything
&lt;/h2&gt;

&lt;p&gt;Every LLM task in the system - summarization, Q&amp;amp;A, intent classification, drafting, campaign personalization, reply classification, grammar checking, and even casual conversation - follows the exact same pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jinja2 template → llama.cpp → parse JSON&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are 11 templates. Each one is a contract: "here's what I'm giving you, here's the JSON structure I expect back." Templates work well for setting the LLM's behavioral boundaries and expectations because each one is a separate instruction file with inline variable syntax, so all the instructions stay in a clean, isolated space.&lt;/p&gt;

&lt;p&gt;Here's the email summarization template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyze this email and return ONLY valid JSON.

From: {{ from_name or "Unknown" }} &amp;lt;{{ from_address }}&amp;gt;
Subject: {{ subject or "(no subject)" }}
Date: {{ date }}

--- EMAIL BODY ---
{{ body }}
--- END ---

Return this exact JSON structure, nothing else:
{
    "summary": "2-3 sentence summary",
    "category": "work | personal | newsletter | finance | spam | other",
    "priority": "high | medium | low",
    "sentiment": "positive | negative | neutral | urgent",
    "action_required": true or false,
    "action_items": ["action item 1"] or [],
    "key_people": ["Person name"] or []
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern handles intent classification (user types "find emails about the budget" → LLM returns &lt;code&gt;{"intent": "search", "params": {"query": "budget"}}&lt;/code&gt;), draft generation (instructions + original email → LLM returns the draft text), and everything else. Same &lt;code&gt;generate()&lt;/code&gt; function, different template.&lt;/p&gt;

&lt;p&gt;This is the decision I'm happiest with. Adding a new LLM-powered feature means writing a Jinja2 file and a thin Python function. No new abstractions, no plumbing. In general, the "hyped AI skills" the whole internet has been buzzing around lately are nothing but text-based instructions backed by tools. So that's it - we added "skills" to our agent in the most convenient way.&lt;/p&gt;
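
&lt;p&gt;The "parse JSON" end of the pattern is where the contract gets enforced. Below is a minimal sketch of that step for the summarization template, assuming a hypothetical &lt;code&gt;parse_summary&lt;/code&gt; helper; the allowed values are copied from the template above, and the project's real validation may look different.&lt;/p&gt;

```python
import json

# Enumerations taken from the summarization template.
ALLOWED = {
    "category": {"work", "personal", "newsletter", "finance", "spam", "other"},
    "priority": {"high", "medium", "low"},
    "sentiment": {"positive", "negative", "neutral", "urgent"},
}


def parse_summary(raw: str) -> dict:
    """Parse the model's reply and check it against the template's contract."""
    # Models may wrap the JSON in extra prose; keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or start > end:
        raise ValueError("no JSON object in model output")
    data = json.loads(raw[start : end + 1])

    for field, allowed in ALLOWED.items():
        value = data.get(field)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"unexpected {field}: {value!r}")
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    if not isinstance(data.get("action_required"), bool):
        raise ValueError("action_required must be true or false")
    for field in ("action_items", "key_people"):
        if not isinstance(data.get(field), list):
            raise ValueError(f"{field} must be a list")
    return data
```

&lt;p&gt;Raising &lt;code&gt;ValueError&lt;/code&gt; on any mismatch gives the caller one place to decide whether to retry the LLM call or store the email without a summary.&lt;/p&gt;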




&lt;h2&gt;
  
  
  The Three-Tier Command Router
&lt;/h2&gt;

&lt;p&gt;This is the most interesting pattern in the codebase. When you send a message, the system tries three strategies, from fastest to slowest:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Slash commands.&lt;/strong&gt; &lt;code&gt;/search budget&lt;/code&gt; → instant. The system splits on whitespace, looks up the command, and dispatches. Zero LLM involvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2: Bare compound commands.&lt;/strong&gt; &lt;code&gt;import status&lt;/code&gt; or &lt;code&gt;draft reply 1 sounds good&lt;/code&gt; (no slash) → also instant. These are safe to match without the LLM because they always start with a known keyword (&lt;code&gt;import&lt;/code&gt;, &lt;code&gt;draft&lt;/code&gt;, &lt;code&gt;campaign&lt;/code&gt;, &lt;code&gt;schedule&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3: Natural language.&lt;/strong&gt; "hey, how's my import going?" → the agent sends you an &lt;em&gt;"Analyzing your message..."&lt;/em&gt; notification (so you know it's working), passes your text to the LLM with a classification template listing 30+ intents, gets back structured JSON, and dispatches to the right handler. Instead of stuffing every instruction into one LLM call and hoping it handles everything correctly in one pass, I give each call a single responsibility (yes, good old SOLID): first, the agent classifies the user's intent and nothing more. Once it has picked an action, including whether further LLM processing is needed, it hands the intent data back to the system, which executes the next step. That step might involve more LLM work, such as a roleplay conversation or dedicated tool usage. Keeping each call this narrow is what lets smaller models, 8B or even below, handle these tasks reliably.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# Tier 1: /slash commands
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="c1"&gt;# Tier 2: bare compound commands
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;first_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;campaign&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first_word&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="c1"&gt;# Tier 3: LLM fallback
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_llm_route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key: the LLM is the &lt;em&gt;fallback&lt;/em&gt;, not the primary path. Slash commands have zero LLM overhead, which suits people who know exactly what they want and would rather skip the chit-chat: &lt;br&gt; &lt;code&gt;command -&amp;gt; result -&amp;gt; done&lt;/code&gt;. Natural language adds ~4 seconds, and only when there's no other way to understand the input. You get the convenience of free-form chat without paying the latency cost on every message.&lt;/p&gt;
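
&lt;p&gt;For anyone who wants to play with the idea, the three tiers can be sketched as a self-contained router. Everything here is illustrative: the handler table, the &lt;code&gt;route&lt;/code&gt; function, and the stubbed &lt;code&gt;classify_with_llm&lt;/code&gt; are hypothetical stand-ins, and the real &lt;code&gt;handle_command&lt;/code&gt; does considerably more.&lt;/p&gt;

```python
from typing import Callable

Handler = Callable[[str], str]

# Hypothetical handlers; the real ones hit the database, Gmail, and so on.
HANDLERS: dict[str, Handler] = {
    "search": lambda args: f"searching for: {args}",
    "import": lambda args: f"import: {args}",
    "draft": lambda args: f"draft: {args}",
}
BARE_KEYWORDS = {"import", "draft", "campaign", "schedule"}


def classify_with_llm(text: str) -> dict:
    """Stand-in for the Tier 3 LLM call; a real one renders the intent template."""
    return {"intent": "search", "params": {"query": text}}


def route(text: str, classify: Callable[[str], dict] = classify_with_llm) -> str:
    words = text.strip().split()
    if not words:
        return "Empty message."
    first, rest = words[0].lower(), " ".join(words[1:])

    # Tier 1: slash commands, no LLM involved.
    if first.startswith("/"):
        handler = HANDLERS.get(first[1:])
        return handler(rest) if handler else f"Unknown command: {first}"

    # Tier 2: bare commands that start with a known keyword.
    if first in BARE_KEYWORDS and first in HANDLERS:
        return HANDLERS[first](rest)

    # Tier 3: let the classifier pick an intent, then dispatch as usual.
    intent = classify(text)
    handler = HANDLERS.get(intent.get("intent", ""))
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(intent.get("params", {}).get("query", ""))


print(route("/search budget"))   # searching for: budget
print(route("import status"))    # import: status
```

&lt;p&gt;Passing the classifier in as a parameter keeps the first two tiers testable without a running model.&lt;/p&gt;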




&lt;h2&gt;
  
  
  Things That Broke
&lt;/h2&gt;

&lt;p&gt;These are the bugs I spent the most time on. If you're building something similar, maybe this saves you a few hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1,628 failed imports from one bug.&lt;/strong&gt; I was importing 18,000 emails, and 1,628 kept failing with foreign key constraint errors. The cause: I was saving email chunks before the parent email row existed. The fix was literally moving one function call above another. The lesson was less fun - I should have had integration tests from the start. No, I still don't have them, and yes, I know I could ask Codex or Claude to auto-generate some, but that would break the whole handcrafted, no-vibecoded spirit of the project. If you're going to be honest, you might as well be honest all the way through.&lt;/p&gt;
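
&lt;p&gt;The ordering problem is easy to reproduce with the standard library. This is a minimal sketch with a made-up two-table schema (the &lt;code&gt;emails&lt;/code&gt; and &lt;code&gt;chunks&lt;/code&gt; tables and the &lt;code&gt;save_email&lt;/code&gt; helper are assumptions, not the project's actual models): the parent row goes in first, and both inserts share one transaction.&lt;/p&gt;

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on for the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE emails (id TEXT PRIMARY KEY, subject TEXT)")
    conn.execute(
        "CREATE TABLE chunks ("
        " email_id TEXT NOT NULL REFERENCES emails(id),"
        " position INTEGER NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_email(conn: sqlite3.Connection, email_id: str, subject: str, chunks: list[str]) -> None:
    # One transaction: committed on success, rolled back if any insert fails.
    with conn:
        # Parent row first, so the chunk rows have something to reference.
        conn.execute("INSERT INTO emails (id, subject) VALUES (?, ?)", (email_id, subject))
        conn.executemany(
            "INSERT INTO chunks (email_id, position, body) VALUES (?, ?, ?)",
            [(email_id, position, body) for position, body in enumerate(chunks)],
        )


conn = open_db()
save_email(conn, "19c54cb15118c128", "Budget", ["part one", "part two"])
```

&lt;p&gt;Inserting a chunk whose &lt;code&gt;email_id&lt;/code&gt; has no parent row raises &lt;code&gt;sqlite3.IntegrityError&lt;/code&gt;, which is exactly the kind of check an integration test would have caught early.&lt;/p&gt;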

&lt;p&gt;&lt;strong&gt;Gmail IDs are lies.&lt;/strong&gt; The Gmail API returns hex IDs like &lt;code&gt;19c54cb15118c128&lt;/code&gt;. The Gmail web UI uses completely different IDs like &lt;code&gt;QgrcJHrtw...&lt;/code&gt;. I spent an afternoon trying to build a "View in Gmail" link. Turns out there is no conversion between these two formats. No API, no formula, nothing documented. I ended up using &lt;code&gt;rfc822msgid:&lt;/code&gt; search links - the RFC822 Message-ID header is in every email, and Gmail search can look a message up by it. It opens a search with one result instead of a direct link, but it works. Mostly.&lt;/p&gt;
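
&lt;p&gt;A minimal sketch of building such a link follows. The &lt;code&gt;gmail_search_link&lt;/code&gt; helper is hypothetical, and the URL shape (the &lt;code&gt;/mail/u/0/&lt;/code&gt; account index plus a &lt;code&gt;#search/&lt;/code&gt; fragment) is an assumption based on how the Gmail web UI addresses searches, so treat it as a starting point.&lt;/p&gt;

```python
from urllib.parse import quote

# The Message-ID header value is wrapped in angle brackets; strip them before searching.
ANGLE_BRACKETS = "\N{LESS-THAN SIGN}\N{GREATER-THAN SIGN}"


def gmail_search_link(message_id: str, account_index: int = 0) -> str:
    """Build a Gmail web search link for one RFC822 Message-ID (assumed URL shape)."""
    bare_id = message_id.strip().strip(ANGLE_BRACKETS)
    # Percent-encode everything, including ":" and "@", so the ID survives inside the fragment.
    query = quote(f"rfc822msgid:{bare_id}", safe="")
    return f"https://mail.google.com/mail/u/{account_index}/#search/{query}"


print(gmail_search_link("abc123@mail.gmail.com"))
```

&lt;p&gt;The &lt;code&gt;account_index&lt;/code&gt; parameter matters for people signed in to several Google accounts; with the wrong index the search opens in a mailbox that does not contain the message.&lt;/p&gt;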

&lt;p&gt;&lt;strong&gt;Google OAuth and the trailing slash.&lt;/strong&gt; &lt;code&gt;http://localhost:9090&lt;/code&gt; and &lt;code&gt;http://localhost:9090/&lt;/code&gt; are different redirect URIs to Google. One works. One gives you &lt;code&gt;redirect_uri_mismatch&lt;/code&gt;. I tried the wrong one first and spent 10 minutes reading Stack Overflow threads about client ID misconfigurations before noticing the slash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telegram eats angle brackets.&lt;/strong&gt; I had &lt;code&gt;"Usage: import history &amp;lt;account_email&amp;gt;"&lt;/code&gt; in a response. Telegram's HTML parser tried to interpret &lt;code&gt;&amp;lt;account_email&amp;gt;&lt;/code&gt; as an HTML tag, failed to find a closing tag, and silently dropped the entire message. This happened twice - once with static text, once with email content containing &lt;code&gt;&amp;lt;sender@email.com&amp;gt;&lt;/code&gt;. The real fix was &lt;code&gt;html.escape()&lt;/code&gt; at the route level so I'd never have to think about it again.&lt;/p&gt;
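
&lt;p&gt;Here is a minimal sketch of that escaping idea, written as a variation rather than the exact route-level fix: a hypothetical &lt;code&gt;render_reply&lt;/code&gt; helper escapes only the dynamic values, so the template itself can still carry intentional formatting tags. It uses &lt;code&gt;quote=False&lt;/code&gt; because only angle brackets and ampersands need escaping in message text.&lt;/p&gt;

```python
import html


def render_reply(template: str, **values: object) -> str:
    """Fill a trusted template, HTML-escaping every dynamic value first."""
    escaped = {name: html.escape(str(value), quote=False) for name, value in values.items()}
    return template.format(**escaped)


# Any angle brackets or ampersands inside sender or subject come out as entities.
reply = render_reply("From: {sender}\nSubject: {subject}", sender="Ann", subject="Q3 budget")
print(reply)
```

&lt;p&gt;With this in place, a sender such as &lt;code&gt;Ann &amp;lt;ann@example.com&amp;gt;&lt;/code&gt; reaches Telegram as literal text instead of an unclosed tag.&lt;/p&gt;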




&lt;h2&gt;
  
  
  Real Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Process one email (summarize)&lt;/td&gt;
&lt;td&gt;~2-3.5 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process one email (embedding)&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.1 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bulk import throughput&lt;/td&gt;
&lt;td&gt;~5 emails/minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid search (semantic + keyword)&lt;/td&gt;
&lt;td&gt;~3 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG Q&amp;amp;A (search + generate answer)&lt;/td&gt;
&lt;td&gt;~7 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Natural language intent classification&lt;/td&gt;
&lt;td&gt;~4 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slash commands&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All numbers measured on real data - bulk import throughput from SQLite job timestamps (&lt;code&gt;started_at&lt;/code&gt; to &lt;code&gt;finished_at&lt;/code&gt;), LLM call times from llama.cpp server logs, everything else with a stopwatch and a Telegram chat (yes, a very scientific method).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; RTX 4080 Mobile (12GB VRAM), 32GB RAM. The 8B model in Q8_0 quantization uses about 9GB VRAM. You could run Q4 on a 6GB card, but response quality drops noticeably for structured JSON output.&lt;/p&gt;
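
&lt;p&gt;The VRAM figure is easy to sanity-check with back-of-the-envelope arithmetic. This sketch assumes exactly 8 billion parameters and uses the llama.cpp block layouts for Q8_0 and Q4_0 (Q4_0 stands in for "Q4" here; other 4-bit variants are slightly larger). It counts weights only, so the KV cache and compute buffers come on top.&lt;/p&gt;

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Q8_0 stores 32 weights in 34 bytes (32 int8 values plus an fp16 scale): 8.5 bits each.
# Q4_0 stores 32 weights in 18 bytes (16 bytes of nibbles plus an fp16 scale): 4.5 bits each.
Q8_0_BITS = 34 * 8 / 32
Q4_0_BITS = 18 * 8 / 32

q8_weights = weights_gb(8.0, Q8_0_BITS)  # 8.5 GB
q4_weights = weights_gb(8.0, Q4_0_BITS)  # 4.5 GB
print(f"Q8_0: {q8_weights:.1f} GB, Q4_0: {q4_weights:.1f} GB")
```

&lt;p&gt;About 8.5 GB of weights plus cache and buffers lines up with the ~9GB observed, and 4.5 GB of weights is why a 6GB card is enough for a 4-bit build.&lt;/p&gt;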




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Source code: &lt;a href="https://github.com/sviat-barbutsa/llamail" rel="noopener noreferrer"&gt;github.com/sviat-barbutsa/llamail&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GPU with 8GB+ VRAM (NVIDIA with CUDA, or any Vulkan-capable card)&lt;/li&gt;
&lt;li&gt;16GB+ RAM&lt;/li&gt;
&lt;li&gt;Docker for n8n&lt;/li&gt;
&lt;li&gt;A Gmail account and a Telegram bot (both free)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo has a setup guide that walks through everything.&lt;/p&gt;

&lt;p&gt;In the next article, I'll go deep into the &lt;strong&gt;hybrid search system&lt;/strong&gt; - how combining ChromaDB semantic search with SQLite FTS5 keyword search produces better results than either one alone, and why it only took 120 lines to build.&lt;/p&gt;

&lt;p&gt;If you're building something with local LLMs, I'd love to hear about it in the comments. And if you've solved the Gmail ID problem more elegantly than I did, &lt;em&gt;please&lt;/em&gt; tell me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; The bot avatar shown in the screenshots was generated locally, and the Sable persona is original to this project.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
