Alireza Razinejad

Posted on Jun 14

How to Build a High-Performance RAG Pipeline with Ollama, Python and TypeScript

#python #ai #cryptocurrency #architecture

The TL;DR

If you need to spin up a local, privacy-first AI agent that can query your own internal documents without sending data to third-party APIs, this guide covers the exact architecture using TypeScript, Python, and Ollama.
Time to complete: ~15 minutes.
Prerequisites: Python 3.10+ or Node.js installed, basic familiarity with embeddings.

The Problem: API Costs & Data Privacy

When building production-ready LLM features, relying solely on cloud providers introduces two major friction points: variable API latency and data compliance bottlenecks.

By shifting the embedding generation and model inference locally, we completely bypass network overhead and keep sensitive data securely inside our infrastructure.

The Architecture

Here is how the data flows through our system:

Ingestion: Parse local documents (Markdown/PDF).
Chunking: Break text into digestible tokens.
Embeddings: Generate vectors using a local model.
Retrieval: Query a vector store for semantic matches.
Generation: Pass context to the LLM for the final answer.

Step-by-Step Implementation

1. Setting Up the Local Environment

First, ensure you have Ollama running locally and pull the required models. Open your terminal and run:

# Pull the LLM
ollama pull llama3

# Pull the embedding model explicitly
ollama pull nomic-embed-text

2. Initializing the Project

Choose your preferred language environment to house the orchestration logic.

TypeScript

// index.ts
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' });

async function generateLocalEmbedding(text: string): Promise<number[]> {
  const response = await ollama.embeddings({
    model: 'nomic-embed-text',
    prompt: text,
  });
  return response.embedding;
}

Python

First, install the official client: pip install ollama

# orchestrator.py
import asyncio
from ollama import AsyncClient

# Initialize the asynchronous local client
client = AsyncClient(host='http://127.0.0.1:11434')

async def generate_local_embedding(text: str) -> list[float]:
    response = await client.embed(
        model='nomic-embed-text',
        input=text
    )
    # The client returns a list of embedding arrays inside 'embeddings'
    return response['embeddings'][0]

3. Handling the Semantic Search

When querying the local vector array, we calculate the similarity score to find the most relevant document chunks.

TypeScript

function cosineSimilarity(vecA: number[], vecB: number[]): number {
  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dotProduct / (normA * normB);
}

Python

import math

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))

    if not norm_a or not norm_b:
        return 0.0  # Prevent division by zero

    return dot_product / (norm_a * norm_b)

Performance Gotchas to Avoid

Memory Allocation: Running local models demands high RAM. Ensure you limit concurrent embedding generations to prevent the runtime from crashing.
Chunk Overlap: When chunking text, always implement an overlap (e.g., 500 characters chunk size with a 50-character overlap) so context isn't split across arbitrary boundaries.

Conclusion & Next Steps

Building local agentic workflows gives you complete control over your data lifecycle and cuts API bills down to zero.

What's next? Try swapping out the in-memory array for a persistent vector database like Chroma or Milvus.

Let me know in the comments below: Are you running your LLMs locally or sticking to cloud APIs for production?

This tutorial on Building Local RAG with Ollama provides an excellent visual look at parsing document chunks and handling embedding shapes using the official python libraries we integrated into the text.

Top comments (2)

Lucas Mand • Jun 15

I appreciate the focus on local first AI infrastructure design.
Many teams underestimate latency introduced by external inference providers.
Using Ollama significantly simplifies local model management workflows.
Persistent vector storage becomes essential as document volumes grow.

Alireza Razinejad • Jun 23

I am truly you liked this post, and thank you for your comment. I will try to post more about concepts around the local first AI infrastructure design.