Veduis

Posted on • Originally published at veduis.com

LLM Token Cost Optimization: Cutting Your API Bills Without Cutting Quality

Traditional search matches keywords. Users must know the exact words in the documents they seek. Vector search matches meaning. Users describe what they are looking for in natural language, and the system finds semantically similar content even when keywords differ. "Car trouble" finds documents about "automotive repair" and "engine problems."

Vector search powers modern semantic search, recommendation systems, and retrieval-augmented generation (RAG) for LLMs. It converts text, images, or other content into high-dimensional vectors (embeddings) that capture semantic meaning. It then searches for vectors most similar to a query vector using specialized algorithms that find approximate nearest neighbors efficiently at scale.

I have built vector search systems for knowledge bases, product catalogs, and content recommendation. I have learned that embedding model selection dramatically impacts quality, that similarity metric choice affects results, and that approximate nearest neighbor (ANN) algorithms enable search at million-document scale. This guide covers the patterns that work: understanding embeddings and their properties, similarity metrics and when to use each, ANN algorithms that make vector search practical, vector database selection, and building complete semantic search pipelines.

Understanding Embeddings

What Are Embeddings

Embeddings are dense numerical vectors that represent semantic meaning. Similar items have similar vectors.

Text: "The quick brown fox"
Embedding: [0.12, -0.45, 0.89, ..., 0.34]  # typically 384 to 3072 dimensions

Text: "A fast brown animal"
Embedding: [0.14, -0.42, 0.87, ..., 0.31]  # Similar vector

Text: "Quantum computing"
Embedding: [-0.78, 0.23, -0.12, ..., 0.91]  # Very different vector

Properties of Good Embeddings

Semantic similarity: Similar meaning = close in vector space

Linear relationships: Analogies often work approximately as vector arithmetic (a property best known from word embeddings such as word2vec)

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
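To make the arithmetic concrete, here is a minimal sketch using hand-made three-dimensional toy vectors. The vectors, the word list, and the `analogy` helper are illustrative assumptions; real embeddings are learned, have hundreds of dimensions, and only satisfy such analogies approximately.

```python
import math

# Hand-made toy vectors with axes (royalty, male, female). Purely illustrative.
vectors = {
    "king":    [1.0, 1.0, 0.0],
    "queen":   [1.0, 0.0, 1.0],
    "man":     [0.0, 1.0, 0.0],
    "woman":   [0.0, 0.0, 1.0],
    "prince":  [0.8, 1.0, 0.0],
    "duchess": [0.6, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```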

Dense representation: All dimensions have values (vs sparse one-hot encoding)

Similar meanings produce vectors that cluster together in high-dimensional space.

Embedding Models

| Model | Dimensions | Best For | Provider |
|---|---|---|---|
| text-embedding-3-small | 1536 | General purpose | OpenAI |
| text-embedding-3-large | 3072 | High accuracy | OpenAI |
| text-embedding-ada-002 | 1536 | Legacy | OpenAI |
| sentence-transformers/all-MiniLM | 384 | Cost-effective | Open source |
| sentence-transformers/all-mpnet-base | 768 | Balanced | Open source |
| voyage-2 | 1024 | High quality | Voyage AI |
| Cohere embed | 1024 | Multilingual | Cohere |

Selection criteria:

  • Quality: Benchmark on your specific data
  • Cost: API costs or compute for self-hosted
  • Dimensions: Higher dimensions often improve accuracy but increase storage, memory, and search cost
  • Latency: Model inference time
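The storage side of the dimensions tradeoff is simple arithmetic. The sketch below assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead and metadata; `index_size_gb` is a hypothetical helper name.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB, excluding index structures and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    size = index_size_gb(1_000_000, dims)
    print(f"{dims:>4} dims: {size:.2f} GB per million vectors")
```

Under these assumptions, one million 1536-dimensional vectors take roughly 6 GB before any index overhead, which is why a smaller model can be the better choice when quality is comparable on your data.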

Generating Embeddings

# OpenAI
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # Replace newlines, which can affect results
    text = text.replace("\n", " ")

    response = client.embeddings.create(
        input=text,
        model=model
    )

    return response.data[0].embedding

# Batch processing for efficiency (the async variant needs the async client)
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def get_embeddings_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await async_client.embeddings.create(input=batch, model="text-embedding-3-small")
        embeddings.extend([item.embedding for item in response.data])

    return embeddings

# Open source (sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def get_embedding(text: str) -> list[float]:
    embedding = model.encode(text)
    return embedding.tolist()

# Batch processing
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)

For teams evaluating whether to run embeddings locally or via API, cost and latency tradeoffs matter. Our guide to locally run AI breaks down when self-hosting saves money versus using managed embedding APIs.

Similarity Metrics

Cosine Similarity

Measures angle between vectors, ignoring magnitude.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Ranges from -1 (opposite) to 1 (identical)
# Most common for text embeddings

When to use:

  • Text embeddings (most models produce normalized vectors)
  • When direction matters more than magnitude
  • Most common default choice

Euclidean Distance

Straight-line distance between vectors.

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return np.linalg.norm(a - b)

# Lower = more similar

When to use:

  • When magnitude matters
  • Computer vision embeddings
  • When vectors are not normalized

Dot Product

Simple sum of element-wise products.

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b)

When to use:

  • When vectors are normalized (equivalent to cosine)
  • Fast computation
  • Some vector databases optimize for this

Metric Selection Guide

| Metric | Use When | Range |
|---|---|---|
| Cosine | Text search, normalized vectors | [-1, 1] |
| Euclidean | Magnitude matters, vision | [0, ∞) |
| Dot Product | Normalized vectors, speed | (-∞, ∞) |
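The three metrics are closely related once vectors are scaled to unit length. The sketch below uses plain Python and two arbitrary example vectors to check that dot product equals cosine similarity on normalized vectors, and that squared Euclidean distance equals 2 - 2 × cosine, so all three produce the same ranking in that case.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two arbitrary example vectors, scaled to unit length
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

print(f"cosine      = {cosine(a, b):.4f}")
print(f"dot product = {dot(a, b):.4f}")              # same as cosine on unit vectors
print(f"euclidean^2 = {euclidean(a, b) ** 2:.4f}")    # equals 2 - 2 * cosine
print(f"2 - 2*cos   = {2 - 2 * cosine(a, b):.4f}")
```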

Approximate Nearest Neighbor (ANN) Algorithms

Why ANN

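Exact nearest neighbor search compares the query against every stored vector, so each query costs O(n · d) for n vectors of d dimensions. At million-document scale, that is often too slow for interactive queries.

ANN algorithms trade a small loss in recall for large speed gains, often several orders of magnitude on large indexes.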
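For reference, this is what the exact baseline looks like: a brute-force scan that scores every stored vector. It is a minimal sketch on random data; `exact_top_k` is a hypothetical helper, and the dataset size and dimensionality are arbitrary.

```python
import heapq
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k=3):
    """Exact nearest neighbors: one similarity computation per stored vector."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (similarity, index), best first

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]

# Querying with a stored vector should return that vector first
results = exact_top_k(vectors[42], vectors, k=3)
for similarity, index in results:
    print(index, round(similarity, 4))
```

This scan is also useful later as ground truth when measuring the recall of an ANN index.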

Graph-based ANN algorithms such as HNSW build hierarchical graphs that navigate toward the most similar vectors without scanning every entry.

HNSW (Hierarchical Navigable Small World)

Graph-based algorithm. Most popular for production.

Building:
1. Insert points into layered graph structure
2. Each layer is a proximity graph
3. Top layer is sparse, lower layers denser

Searching:
1. Start at the entry point in the top layer
2. Greedy walk to the neighbor closest to the query
3. Drop to the next layer down when a local minimum is reached
4. Repeat until the bottom layer, then return the closest candidates found there

Characteristics:

  • High recall (typically >95%)
  • Fast queries (milliseconds)
  • Memory intensive (stores graph)
  • Good for million-scale datasets
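The greedy walk at the core of HNSW can be sketched on a single layer. This is a deliberately simplified illustration, not a full HNSW implementation: it builds one flat k-nearest-neighbor graph by brute force and omits the layer hierarchy, the candidate list (`ef`), and the neighbor selection heuristics. All function names are hypothetical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_graph(points, m=8):
    """Link each point to its m nearest neighbors (brute force, for illustration only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        graph[i] = others[:m]
    return graph

def greedy_search(query, points, graph, entry=0):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current = entry
    current_dist = dist(query, points[current])
    while True:
        best = min(graph[current], key=lambda j: dist(query, points[j]))
        best_dist = dist(query, points[best])
        if best_dist >= current_dist:
            return current, current_dist  # local minimum reached
        current, current_dist = best, best_dist

random.seed(1)
points = [[random.random(), random.random()] for _ in range(200)]
graph = build_knn_graph(points, m=8)
query = [0.5, 0.5]
found, found_dist = greedy_search(query, points, graph, entry=0)
print(found, round(found_dist, 4))
```

A single greedy walk can stop at a local minimum that is not the true nearest neighbor. The sparse upper layers and wider candidate lists in real HNSW exist largely to reduce that risk.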

IVF (Inverted File Index)

Clustering-based approach.

Building:
1. Cluster vectors into N groups (Voronoi cells)
2. Store cluster centroids

Searching:
1. Find nearest cluster centroids to query
2. Search only vectors in those clusters
3. Refine with exact search on candidates

Characteristics:

  • Tunable speed/accuracy tradeoff
  • Memory efficient
  • Good for billion-scale datasets
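The build and search steps above can be sketched in plain Python. This is a toy version under stated assumptions: a basic k-means with a fixed iteration count stands in for production clustering, and `nprobe` (the number of clusters scanned per query) is the speed/accuracy knob. Function names are hypothetical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, n_clusters, iterations=10, seed=0):
    """Basic k-means; returns the cluster centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    for _ in range(iterations):
        buckets = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(points, centroids):
    """Inverted lists: centroid index -> ids of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for i, p in enumerate(points):
        nearest = min(lists, key=lambda c: dist(p, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, points, centroids, lists, k=5, nprobe=2):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: dist(query, points[i]))[:k]

random.seed(0)
points = [[random.gauss(0, 1) for _ in range(4)] for _ in range(300)]
centroids = kmeans(points, n_clusters=8)
lists = build_ivf(points, centroids)
query = [0.1, -0.2, 0.3, 0.0]

approx = ivf_search(query, points, centroids, lists, k=5, nprobe=2)
exact = ivf_search(query, points, centroids, lists, k=5, nprobe=len(centroids))
print("approximate:", approx)
print("exact:      ", exact)
```

Probing every cluster degenerates into an exact scan, which is what makes `nprobe` a direct recall dial: raise it until recall is acceptable, then stop.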

LSH (Locality Sensitive Hashing)

Hash-based approach.

Building:
1. Create multiple hash functions
2. Similar vectors hash to same buckets

Searching:
1. Hash query vector
2. Check all vectors in matching buckets
3. Refine with exact similarity

Characteristics:

  • Very fast
  • Lower recall than HNSW
  • Good for very large datasets where recall can be lower
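One common LSH family for cosine similarity uses random hyperplanes: each hyperplane contributes one bit depending on which side of it the vector falls. The sketch below is a minimal single-table version with hypothetical function names; production setups typically use several tables to raise recall.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_bits, dim, seed=0):
    """Each random hyperplane acts as one hash function producing one bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """Bit tuple recording which side of each hyperplane the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )

def build_table(vectors, hyperplanes):
    """Bucket vector ids by signature."""
    table = defaultdict(list)
    for i, v in enumerate(vectors):
        table[signature(v, hyperplanes)].append(i)
    return table

def candidates(query, table, hyperplanes):
    """Ids sharing the query's bucket; refine these with an exact similarity."""
    return table.get(signature(query, hyperplanes), [])

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(500)]
planes = make_hyperplanes(n_bits=6, dim=16)
table = build_table(vectors, planes)

bucket = candidates(vectors[7], table, planes)
print(len(bucket), "candidates instead of", len(vectors))
```

More bits mean smaller buckets and faster queries, but a higher chance that a true neighbor lands in a different bucket.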

PQ (Product Quantization)

Compression technique often combined with other indexes.

Compress vectors by:
1. Split vector into sub-vectors
2. Quantize each sub-vector to a codebook
3. Store codes instead of full vectors

Enables:
- 10-20x memory reduction
- Faster distance computation
- Slight accuracy loss
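The compression steps can be sketched as follows. This toy version makes two simplifying assumptions: codebooks are built by sampling training sub-vectors rather than running k-means per subspace, and the vector length is divisible by the number of sub-vectors. All names are hypothetical.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vector) // m
    return [vector[i * step:(i + 1) * step] for i in range(m)]

def train_codebooks(vectors, m, k, seed=0):
    """Toy codebooks: sample k training sub-vectors per subspace (real PQ runs k-means)."""
    rng = random.Random(seed)
    subspaces = list(zip(*(split(v, m) for v in vectors)))
    return [rng.sample(list(sub), k) for sub in subspaces]

def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codebook entry."""
    subs = split(vector, len(codebooks))
    return [
        min(range(len(cb)), key=lambda c: dist2(sub, cb[c]))
        for sub, cb in zip(subs, codebooks)
    ]

def decode(codes, codebooks):
    """Approximate reconstruction by concatenating the selected codebook entries."""
    return [x for code, cb in zip(codes, codebooks) for x in cb[code]]

random.seed(0)
dim, m, k = 16, 4, 16
training = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
codebooks = train_codebooks(training, m=m, k=k)

codes = encode(training[0], codebooks)
approx = decode(codes, codebooks)

raw_bytes = dim * 4  # float32 storage
pq_bytes = m * 1     # one byte per code, since k <= 256
print(codes, f"compression: {raw_bytes / pq_bytes:.0f}x")
```

The compression ratio depends only on the dimension, the number of sub-vectors, and the codebook size, so it can be worked out before training anything.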

Vector Databases

Pinecone

Pinecone is a managed vector database.

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,  # Matches embedding model
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embedding,
            "metadata": {"source": "article1", "category": "tech"}
        }
    ]
)

# Query
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "tech"}},
    include_metadata=True
)

Weaviate

Weaviate is an open-source vector database with a managed option.

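# Note: this example uses the v3 Python client API (weaviate.Client);
# the v4 client uses a different connection and collections API.
import weaviate

client = weaviate.Client("http://localhost:8080")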

# Create schema
class_obj = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False
        }
    },
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"]}
    ]
}

client.schema.create_class(class_obj)

# Insert (automatically vectorized)
client.data_object.create(
    data_object={
        "title": "Vector Search Guide",
        "content": "Content here...",
        "category": "tech"
    },
    class_name="Article"
)

# Query
result = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["semantic search"]})  # Auto-vectorized
    .with_limit(10)
    .do()
)

pgvector

pgvector is a PostgreSQL extension that adds vector similarity search to existing Postgres databases.

-- Enable extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)
);

-- Create index
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Insert
INSERT INTO documents (content, embedding)
VALUES ('text here', '[0.1, 0.2, ...]');

-- Query (query_embedding stands for a vector parameter bound by the application)
SELECT content, 1 - (embedding <=> query_embedding) AS cosine_similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;

# Using with SQLAlchemy
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class Document(Base):
    __tablename__ = 'documents'

    id = Column(Integer, primary_key=True)
    content = Column(String)
    embedding = Column(Vector(1536))

# Query (assumes an open SQLAlchemy session and a query_embedding list of floats)
docs = session.query(Document).order_by(
    Document.embedding.cosine_distance(query_embedding)
).limit(10).all()

Selection Guide

| Database | Best For | Deployment |
|---|---|---|
| Pinecone | Managed, easy start | SaaS |
| Weaviate | Flexibility, features | Self-hosted or SaaS |
| pgvector | Existing Postgres | Self-hosted |
| Milvus | High scale, hybrid search | Self-hosted or SaaS |
| Chroma | Local development, simplicity | Embedded |
| Qdrant | Rust-based, fast | Self-hosted or SaaS |

Each vector database optimizes for different deployment models and scale requirements.

Building a Semantic Search Pipeline

Architecture

Documents
    |
    v
Chunking → Embedding → Vector DB
                              |
Query → Embedding → Similarity Search → Reranking → Results

Chunking Strategy

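def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Chunk text with overlap for context preservation"""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        # Try to end at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.7:  # Only if found in the last 30%
                chunk = chunk[:last_period + 1]
                end = start + len(chunk)

        chunks.append(chunk.strip())

        if end >= len(text):
            break  # Final chunk reached; avoid emitting a redundant overlap-only chunk

        # Overlap for context, but always move forward to avoid an infinite loop
        start = max(end - overlap, start + 1)

    return chunks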

# Semantic chunking with embeddings
def semantic_chunk(text: str, similarity_threshold: float = 0.8) -> list[str]:
    """Chunk based on semantic similarity between adjacent sentences"""
    sentences = [s for s in text.split('. ') if s.strip()]
    if not sentences:
        return []

    # Embed each sentence once instead of re-embedding the previous one on every step
    embeddings = [np.array(get_embedding(s)) for s in sentences]

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        if similarity > similarity_threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append('. '.join(current_chunk))

    return chunks

Hybrid Search

Combine vector similarity with keyword matching.

def hybrid_search(query: str, vector_weight: float = 0.7, rrf_k: int = 60) -> list[dict]:
    """Combine BM25 and vector search with weighted Reciprocal Rank Fusion"""
    # vector_db, keyword_index, and get_document are placeholders for your own clients

    # Vector search
    query_embedding = get_embedding(query)
    vector_results = vector_db.search(query_embedding, k=100)

    # Keyword search
    keyword_results = keyword_index.search(query, k=100)

    # Reciprocal Rank Fusion: each list contributes weight / (rrf_k + rank), with rank starting at 1
    scores = {}

    for rank, result in enumerate(vector_results, start=1):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + vector_weight / (rrf_k + rank)

    for rank, result in enumerate(keyword_results, start=1):
        doc_id = result['id']
        scores[doc_id] = scores.get(doc_id, 0) + (1 - vector_weight) / (rrf_k + rank)

    # Sort by fused score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    return [get_document(doc_id) for doc_id, _ in ranked[:10]]
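To see how the fusion behaves without any search backends, here is a standalone sketch of weighted Reciprocal Rank Fusion over plain lists of document ids. The function name, the example ids, and the weights are illustrative assumptions; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of ids; each list contributes weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]   # ranked by vector similarity
keyword_hits = ["doc1", "doc9", "doc3"]  # ranked by keyword score

fused = reciprocal_rank_fusion([vector_hits, keyword_hits], weights=[0.7, 0.3])
print(fused)  # ['doc3', 'doc1', 'doc7', 'doc9']
```

Because only ranks are used, the two result lists never need their raw scores normalized against each other, which is the main practical appeal of rank fusion over score blending.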

Reranking

Initial retrieval (fast, approximate) → Reranking (slower, accurate).

from sentence_transformers import CrossEncoder

# Load the cross-encoder once at import time, not on every query
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def search_with_reranking(query: str) -> list[dict]:
    # Initial retrieval (ANN); vector_db is a placeholder for your vector store client
    query_embedding = get_embedding(query)
    candidates = vector_db.search(query_embedding, k=100)

    # Rerank with cross-encoder (more accurate)
    pairs = [[query, candidate['text']] for candidate in candidates]
    scores = reranker.predict(pairs)

    # Sort by reranker score
    for candidate, score in zip(candidates, scores):
        candidate['rerank_score'] = float(score)

    reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)

    return reranked[:10]

Filtering and Metadata

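# Pinecone with metadata filter
# Range operators such as $gte compare numbers, so created_at is assumed to be
# stored as a Unix timestamp (1704067200 = 2024-01-01 00:00:00 UTC)
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": {"$eq": "documentation"},
        "created_at": {"$gte": 1704067200},
        "$or": [
            {"author": {"$eq": "team-a"}},
            {"author": {"$eq": "team-b"}}
        ]
    }
)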
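To reason about what such a filter selects, it can help to evaluate it against plain dictionaries. The sketch below interprets a small subset of this Mongo-style filter syntax in memory. It is an illustration of the semantics only, not how any vector database implements filtering, and the `matches` helper and sample records are hypothetical.

```python
OPERATORS = {
    "$eq": lambda value, target: value == target,
    "$gte": lambda value, target: value is not None and value >= target,
    "$in": lambda value, target: value in target,
}

def matches(metadata: dict, flt: dict) -> bool:
    """Evaluate a filter against one metadata dict; top-level keys are ANDed."""
    for key, condition in flt.items():
        if key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        else:
            value = metadata.get(key)
            if not all(OPERATORS[op](value, target) for op, target in condition.items()):
                return False
    return True

docs = [
    {"id": "a", "category": "documentation", "created_at": 1706745600, "author": "team-a"},
    {"id": "b", "category": "documentation", "created_at": 1672531200, "author": "team-a"},
    {"id": "c", "category": "blog", "created_at": 1706745600, "author": "team-b"},
    {"id": "d", "category": "documentation", "created_at": 1706745600, "author": "team-c"},
]

flt = {
    "category": {"$eq": "documentation"},
    "created_at": {"$gte": 1704067200},  # 2024-01-01 as a Unix timestamp
    "$or": [
        {"author": {"$eq": "team-a"}},
        {"author": {"$eq": "team-b"}},
    ],
}

kept = [d["id"] for d in docs if matches(d, flt)]
print(kept)  # ['a']
```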

Performance Optimization

Index Tuning

# HNSW parameters
index_params = {
    "M": 16,        # Max connections per node (higher = more accurate, more memory)
    "efConstruction": 200,  # Size of dynamic candidate list during construction
    "ef": 100       # Size of dynamic candidate list during search
}

# Tradeoffs:
# M: 8-64 (default 16). Higher = better recall, more memory
# efConstruction: 64-512. Higher = better index quality, slower build
# ef: 16-512. Higher = better recall, slower queries

Batch Operations

# Batch embedding (much faster)
texts = [doc['content'] for doc in documents]
embeddings = model.encode(texts, batch_size=64)

# Batch upsert
vectors_to_upsert = [
    {
        "id": doc['id'],
        "values": embedding.tolist(),
        "metadata": {"source": doc['source']}
    }
    for doc, embedding in zip(documents, embeddings)
]

# Upsert in batches
for i in range(0, len(vectors_to_upsert), 100):
    batch = vectors_to_upsert[i:i+100]
    index.upsert(vectors=batch)

Caching

from functools import lru_cache

@lru_cache(maxsize=10000)
def get_cached_embedding(text: str) -> tuple[float, ...]:
    """Cache embeddings keyed by the input text (lru_cache hashes the arguments)"""
    embedding = get_embedding(text)
    return tuple(embedding)  # Immutable, so callers cannot mutate the cached value

Evaluation

Metrics

import numpy as np

def evaluate_search(queries: list[dict]) -> dict:
    """
    queries: [{"query": str, "relevant_ids": [str]}]
    search() is a placeholder for your own search function.
    """
    results = {
        'recall@10': [],
        'precision@10': [],
        'mrr': []
    }

    for q in queries:
        search_results = search(q['query'], k=10)
        retrieved_ids = [r['id'] for r in search_results]
        relevant_ids = set(q['relevant_ids'])
        hits = len(relevant_ids & set(retrieved_ids))

        # Recall@10 (guard against queries with no labeled relevant documents)
        results['recall@10'].append(hits / len(relevant_ids) if relevant_ids else 0.0)

        # Precision@10 (guard against empty result lists)
        results['precision@10'].append(hits / len(retrieved_ids) if retrieved_ids else 0.0)

        # MRR: reciprocal rank of the first relevant result, 0 if none is retrieved
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                results['mrr'].append(1 / rank)
                break
        else:
            results['mrr'].append(0)

    return {
        'recall@10': np.mean(results['recall@10']),
        'precision@10': np.mean(results['precision@10']),
        'mrr': np.mean(results['mrr'])
    }
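nDCG, a rank-aware metric often reported alongside recall and MRR, can be sketched as follows. This version assumes binary relevance labels (a document is either relevant or not); `dcg` and `ndcg_at_k` are hypothetical helper names, and graded relevance would need per-document gain values instead.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k; 1.0 means all relevant documents are ranked first."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in retrieved_ids[:k]]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k(["a", "b", "c"], {"a", "b"}))   # relevant documents ranked first
print(ndcg_at_k(["x", "a"], {"a"}))             # relevant document pushed to rank 2
```

Unlike recall@10, this score drops when a relevant document is retrieved but ranked low, which makes it a useful check on reranking changes.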

Common Pitfalls

Pitfall 1: Wrong Embedding Model

Using general embeddings for domain-specific content. Use domain-tuned models.

Pitfall 2: Poor Chunking

Chunks that break semantic coherence. Use overlap and semantic boundaries.

Pitfall 3: No Metadata Filtering

Searching across all documents when users need filtered results. Index metadata.

Pitfall 4: Ignoring Exact Matches

Relying only on vectors when users search for specific IDs or names. Use hybrid search.

Pitfall 5: Wrong Similarity Metric

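Using a metric that does not match how the embedding model was trained, or Euclidean distance on unnormalized embeddings whose magnitude carries no meaning. On unit-length vectors, cosine, dot product, and Euclidean distance rank results identically; for text, cosine is the usual default.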

Pitfall 6: Not Monitoring Quality

Search quality degrades over time without measurement. Evaluate regularly.

Conclusion

Vector search enables semantic understanding at scale. Choose embedding models that match your domain and quality requirements. Select similarity metrics appropriate for your embeddings. Use ANN algorithms for production-scale performance. Consider vector databases based on your operational requirements.

Build complete pipelines with proper chunking, hybrid search for best results, and reranking for precision. Monitor quality metrics and iterate.

Vector search is foundational technology for modern AI applications. Master it to build systems that understand user intent, not just match keywords. For teams building production AI systems, understanding how to cut LLM costs without sacrificing quality and protecting against prompt injection attacks are the next logical steps after getting search working.

