
Alain Airom

Embedded Intelligence: How SQLite-vec Delivers Fast, Local Vector Search for AI.

An experience with SQLite as a vector database

Introduction - Embedded Intelligence: Building a Local RAG Stack with SQLite-vec, Ollama, and Granite

I recently stumbled upon a fascinating article presenting SQLite-vec, a simple yet powerful extension that turns the ubiquitous SQLite database into a functional vector store. As someone who loves diving deep into code and testing new implementations, I immediately decided this was worth a try on my own machine.
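
To get a feel for what the extension does before diving into the full app, here is a minimal sketch of the core idea. It assumes the sqlite-vec Python bindings are installed and that your Python's SQLite build allows extension loading (on macOS you may need pysqlite3 instead, as covered in the troubleshooting section further down); the table name and vectors are purely illustrative.

import struct
import sqlite3
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)               # registers the vec0 virtual table module
db.enable_load_extension(False)

# A tiny 4-dimensional index, purely for illustration
db.execute("CREATE VIRTUAL TABLE demo_vectors USING vec0(embedding float[4])")

vectors = {1: [0.1, 0.2, 0.3, 0.4], 2: [0.9, 0.8, 0.7, 0.6]}
for rowid, vec in vectors.items():
    db.execute(
        "INSERT INTO demo_vectors(rowid, embedding) VALUES (?, ?)",
        (rowid, struct.pack("4f", *vec)),  # vectors are stored as raw float32 bytes
    )

# K-nearest-neighbour search: MATCH takes the query vector, k limits the result count
query = struct.pack("4f", *[0.1, 0.2, 0.3, 0.35])
rows = db.execute(
    "SELECT rowid, distance FROM demo_vectors WHERE embedding MATCH ? AND k = 2 ORDER BY distance",
    (query,),
).fetchall()
print(rows)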

The creator of the extension, Alex Garcia, provides excellent samples across several programming languages, which worked flawlessly right out of the box. Inspired by the clear Python demonstration available on his GitHub repository, I decided to build a complete, end-to-end Retrieval-Augmented Generation (RAG) sample application.

For this project, I chose to lean on my local AI setup, leveraging Ollama as the engine to manage and run my Large Language Models. Specifically, I integrated a local Granite LLM for the final answer generation and a corresponding Granite embedding model to handle the vector creation. The goal was to prove that you can build a robust, entirely local, and highly efficient RAG pipeline using familiar tools and a serverless database.
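
Before wiring everything together, it is worth confirming that Ollama is reachable and that both Granite models have been pulled. The small sketch below does that via Ollama's model listing endpoint; the URL and model tags reflect my setup and may differ on yours.

import requests

OLLAMA_URL = "http://localhost:11434"
REQUIRED_MODELS = {"granite-embedding:278m", "granite3.3:8b"}

# /api/tags lists the models currently available to the local Ollama server
resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()

installed = {m["name"] for m in resp.json().get("models", [])}
missing = REQUIRED_MODELS - installed

if missing:
    print("Missing models, pull them first:", ", ".join(sorted(missing)))
else:
    print("Ollama is up and both Granite models are available.")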

Test, Implementation, and Results

The resulting application, rag_ollama_sqlite.py, implements this standard RAG workflow in four distinct phases. The initial Ingestion phase sets up the database, calculates 768-dimension embeddings for the sample documents using the Granite embedding model via Ollama, and stores both the original text and vectors into the sqlite-vec index. When a user submits a question, the Retrieval phase embeds that query and uses sqlite-vec's efficient vector search to fetch the top three most relevant document chunks based on vector distance. Finally, the Generation phase sends these retrieved chunks, along with the original question, to the powerful local Granite 3.3 LLM, which synthesizes the context-grounded final answer. The entire process is capped off by generating a comprehensive Markdown report detailing the full execution and sources used.

Note that the sample app below prints its results to the console only; it does not generate the Markdown report.

The implementation is provided below.

  • Prepare the environment (and troubleshooting 🥵)
###
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
###
#
# pip install sqlite3
# pip install pysqlite3
# these installations did not work in my environment, so I applied the plan B below
####
#### TROUBLESHOOTING (using Stack Overflow and web search)
brew --prefix sqlite
# Set the path to the Homebrew SQLite installation
SQLITE_PATH=$(brew --prefix sqlite)
# Use CFLAGS and LDFLAGS to point the compiler to the correct headers and libraries
# The LDFLAGS link against 'sqlite3' in the new path.
# The CFLAGS point to the header files.
CFLAGS="-I${SQLITE_PATH}/include" LDFLAGS="-L${SQLITE_PATH}/lib" pip install pysqlite3

pip cache purge
brew --prefix sqlite

SQLITE_PATH=$(brew --prefix sqlite)
CPPFLAGS="-I${SQLITE_PATH}/include" LDFLAGS="-L${SQLITE_PATH}/lib" pip install pysqlite3 --no-cache-dir
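
The script itself also needs the sqlite-vec bindings and requests, which I installed with pip (the PyPI package names I used were sqlite-vec and requests). Once pysqlite3 is built against the Homebrew SQLite, a quick sanity check along these lines confirms that extension loading actually works; the script name and messages are illustrative.

# sanity_check.py - minimal sketch to confirm extension loading works
try:
    import pysqlite3 as sqlite3   # prefer the freshly built pysqlite3
except ImportError:
    import sqlite3                # fall back to the standard library module

import sqlite_vec

print("SQLite library version:", sqlite3.sqlite_version)

conn = sqlite3.connect(":memory:")
if not hasattr(conn, "enable_load_extension"):
    raise SystemExit("This SQLite build cannot load extensions - rebuild pysqlite3.")

conn.enable_load_extension(True)
sqlite_vec.load(conn)             # loads the vec0 extension into this connection
conn.enable_load_extension(False)

print("sqlite-vec version:", conn.execute("SELECT vec_version()").fetchone()[0])
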
  • The code implementation described earlier 👇
import sqlite_vec
import requests
import json
import struct
import os
import datetime
from typing import List, Tuple, Dict

# Try to import pysqlite3 for extension loading support, fallback to standard sqlite3
try:
    # We rename pysqlite3 to sqlite3 so the rest of the code works seamlessly
    import pysqlite3 as sqlite3
    print("[INFO] Using pysqlite3 for robust extension loading.")
except ImportError:
    print("[WARNING] pysqlite3 not found. Falling back to standard sqlite3. Extension loading may fail.")
    import sqlite3

# --- 1. CONFIGURATION ---

# Ollama details
OLLAMA_URL = "http://localhost:11434"

# Granite Models (ensure these are downloaded via `ollama pull granite-embedding:278m` and `ollama pull granite3.3:8b`)
EMBEDDING_MODEL = "granite-embedding:278m"
GENERATION_MODEL = "granite3.3:8b"

# SQLite-vec configuration
DB_NAME = "granite_rag.db"
EMBEDDING_DIMENSION = 768  # Matches granite-embedding:278m
TOP_K = 3 # Number of relevant chunks to retrieve

# --- 2. UTILITY FUNCTIONS ---

def serialize_f32(vector: List[float]) -> bytes:
    """Serializes a list of floats (vector) into a compact raw bytes format
    required by sqlite-vec.
    """
    return struct.pack(f"{len(vector)}f", *vector)

def ollama_api_call(endpoint: str, payload: Dict):
    """Generic function to handle synchronous POST requests to Ollama API."""
    try:
        url = f"{OLLAMA_URL}{endpoint}"
        headers = {'Content-Type': 'application/json'}

        response = requests.post(url, headers=headers, data=json.dumps(payload))
        response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)

        return response.json()
    except requests.exceptions.ConnectionError:
        print(f"\n[ERROR] Could not connect to Ollama at {OLLAMA_URL}.")
        print("Please ensure Ollama is running.")
        return None
    except requests.exceptions.HTTPError as e:
        print(f"\n[ERROR] Ollama API request failed: {e}")
        print(f"Check if model '{payload.get('model')}' is installed (e.g., `ollama pull {payload.get('model')}`).")
        return None
    except Exception as e:
        print(f"\n[ERROR] An unexpected error occurred during API call: {e}")
        return None

def get_embedding(text: str) -> List[float]:
    """Generates an embedding vector for a given text using the specified Ollama model."""
    print(f"  > Generating embedding for text snippet...")
    payload = {
        "model": EMBEDDING_MODEL,
        "prompt": text,
        "options": {"num_ctx": 4096}
    }
    result = ollama_api_call("/api/embeddings", payload)

    if result and 'embedding' in result:
        return result['embedding']
    return []

# NOTE: this sample intentionally writes no Markdown report; results go to the console only.

# --- 3. DATABASE SETUP AND INGESTION ---

def setup_db() -> sqlite3.Connection:
    """Initializes the SQLite database and loads the sqlite-vec extension."""
    # We keep the db removal for a clean run every time
    if os.path.exists(DB_NAME):
        os.remove(DB_NAME)

    db = sqlite3.connect(DB_NAME)

    if hasattr(db, 'enable_load_extension'):
        db.enable_load_extension(True)
    else:
        print("[ERROR] SQLite connection does not support extension loading. Ensure pysqlite3 is installed.")
        pass 

    try:
        sqlite_vec.load(db)
    except sqlite3.OperationalError as e:
        print(f"[ERROR] Could not load sqlite_vec extension: {e}")
        print("This usually means the underlying SQLite library (pysqlite3) was not correctly compiled with extension support.")
        return None
    except Exception as e:
        print(f"[ERROR] An unexpected error occurred while loading sqlite_vec: {e}")
        return None

    if hasattr(db, 'enable_load_extension'):
        db.enable_load_extension(False)

    print(f"Database {DB_NAME} initialized with sqlite-vec (version: {db.execute('select vec_version()').fetchone()[0]})")

    # Create a table to store the text content and a virtual table for the vector index
    db.execute("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            content TEXT
        );
    """)

    # Create the virtual vector table, referencing the document content
    db.execute(f"""
        CREATE VIRTUAL TABLE vec_index USING vec0(
            embedding float[{EMBEDDING_DIMENSION}],
            content_id int
        );
    """)

    return db

def ingest_documents(db: sqlite3.Connection, documents: List[str]):
    """Embeds documents and inserts them into both the documents and vector index tables."""
    print("\n--- INGESTION PHASE: Embedding documents and storing in SQLite-vec ---")

    for i, doc_content in enumerate(documents):
        print(f"Processing Document {i+1}/{len(documents)}...")

        # 1. Store the text content
        doc_id = db.execute("INSERT INTO documents (content) VALUES (?) RETURNING id", (doc_content,)).fetchone()[0]

        # 2. Generate embedding using Ollama/Granite
        vector = get_embedding(doc_content)

        if not vector or len(vector) != EMBEDDING_DIMENSION:
            print(f"[ERROR] Skipping Document {i+1}: Invalid embedding generated. Expected dim {EMBEDDING_DIMENSION}, got {len(vector)}")
            continue

        # 3. Serialize vector and insert into the virtual index table
        serialized_vector = serialize_f32(vector)
        db.execute(
            "INSERT INTO vec_index(rowid, embedding, content_id) VALUES (?, ?, ?)",
            [doc_id, serialized_vector, doc_id]
        )
        print(f"  > Document {doc_id} successfully stored and indexed.")

    db.commit()
    print("Ingestion complete.")

# --- 4. RAG IMPLEMENTATION ---

def retrieve_context(db: sqlite3.Connection, query_vector: List[float]) -> str:
    """
    Performs vector search (RAG Retrieval step) and returns aggregated context
    only (no markdown summary).
    """
    print("\n--- RETRIEVAL PHASE: Querying SQLite-vec for relevant context ---")

    # 1. Serialize the query vector
    serialized_query = serialize_f32(query_vector)

    # Use the explicit k=? syntax for reliable constraint handling.
    retrieval_query = f"""
        SELECT
            T1.content,
            T2.distance
        FROM documents AS T1
        INNER JOIN vec_index AS T2 ON T1.id = T2.content_id
        WHERE T2.embedding MATCH ? AND T2.k = ?
        ORDER BY T2.distance ASC
    """

    results = db.execute(retrieval_query, [serialized_query, TOP_K]).fetchall()

    if not results:
        print("[WARNING] No relevant documents found.")
        return "No relevant context found."

    # 3. Format context for the LLM and print retrieval log
    context_chunks = []

    print(f"Found {len(results)} relevant chunks:") # Print header to console

    for content, distance in results:
        # Console log (build the snippet first: backslashes inside f-string
        # expressions are a SyntaxError on Python versions before 3.12)
        snippet = content[:50].replace("\n", " ")
        log_line = f'  > Distance: {distance:.4f} | Chunk: "{snippet}..."'
        print(log_line)

        context_chunks.append(content)

    return "\n---\n".join(context_chunks) # Return only the raw context string

def generate_answer(query: str, context: str) -> str:
    """Generates the final answer (RAG Generation step) using the Ollama LLM."""
    print("\n--- GENERATION PHASE: Asking Ollama/Granite 3.3 for the final answer ---")

    system_prompt = (
        "You are an expert Q&A system. Your task is to answer the user's question "
        "only based on the provided CONTEXT. If the context does not contain the answer, "
        "state clearly that the answer is not available in the context. "
        "Be concise and professional."
    )

    full_prompt = f"CONTEXT:\n{context}\n\nQUESTION: {query}"

    payload = {
        "model": GENERATION_MODEL,
        "prompt": full_prompt,
        "system": system_prompt,
        "stream": False,
        "options": {"temperature": 0.2}
    }

    result = ollama_api_call("/api/generate", payload)

    if result and 'response' in result:
        return result['response'].strip()
    return "Failed to generate a response from the LLM."


# --- 5. MAIN EXECUTION ---

def main():
    """Main function to run the RAG application."""
    print("Starting Granite/SQLite-vec RAG Application.")

    # 1. Setup Database
    db_conn = setup_db()
    if db_conn is None:
        print("Application terminated due to database setup failure.")
        return

    # 2. Define Sample Knowledge Base (simulating corporate or project documents)
    knowledge_base = [
        "The project codename 'Monolith' was initiated in Q1 2025 to streamline our internal document storage systems.",
        "Team Lead for Monolith is Alex Varga, and the primary required technology is SQLite-vec for efficient vector indexing.",
        "Granite 3.3 models were selected for all RAG and summarization tasks due to their Apache 2.0 license and strong 128K context window.",
        "The current budget allocated for Q2 2025 for Monolith development is $150,000, excluding cloud compute costs.",
        "Final deployment of the Monolith project is scheduled for the end of Q4 2025.",
    ]

    # 3. Ingest Data
    ingest_documents(db_conn, knowledge_base)

    # 4. Define and Process Query
    user_query = "Who is the team lead for Monolith, and what is the primary technology used?"

    # RAG Step 1: Embed Query
    query_vector = get_embedding(user_query)
    if not query_vector:
        print("Application terminated due to failure to embed query.")
        return

    # RAG Step 2: Retrieve Context
    # Now retrieves context string only
    context = retrieve_context(db_conn, query_vector) 

    # RAG Step 3: Generate Answer
    final_answer = generate_answer(user_query, context)

    # 5. Output Result to Console (only)

    # --- Console Output ---
    console_output = "\n" + "="*80
    console_output += f"\nUSER QUESTION: {user_query}"
    console_output += "\n\nFINAL ANSWER:\n"
    console_output += final_answer
    console_output += "\n" + "="*80
    print(console_output)


    # Clean up database
    db_conn.close()
    os.remove(DB_NAME)
    print(f"\nCleanup successful. Deleted {DB_NAME}.")


if __name__ == "__main__":
    main()

  • And the output 🎶
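
(The screenshot of the run is not reproduced here, but the console output follows the shape below; the final answer text is whatever granite3.3:8b synthesizes from the retrieved chunks on a given run, which in this case should name Alex Varga and SQLite-vec.)

================================================================================
USER QUESTION: Who is the team lead for Monolith, and what is the primary technology used?

FINAL ANSWER:
<answer generated by granite3.3:8b from the retrieved context>
================================================================================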


We nailed it 🍀

Conclusion

Ultimately, this sample code demonstrates a significant shift in how we approach RAG architecture. By combining the local, embedded power of SQLite-vec for vector management, the flexibility of Ollama as an LLM runtime, and the intelligence of the Granite models for both embedding and generation, we achieve a high-performance RAG pipeline that is completely self-contained. This approach minimizes infrastructure complexity and proves that powerful, context-aware AI is entirely achievable right on your desktop, making advanced LLM applications more accessible than ever.

Links
