Yuuichi Eguchi

I built two high-performance Python libraries for production AI: LLM log analytics and vector similarity search

Hello everyone,

I'm excited to share two Python libraries I've been working on recently: llmlog_engine and mini_faiss. Both tackle performance-critical problems in production AI systems with C++ implementations under the hood while providing clean, Pythonic APIs.

For context, I've been building LLM-powered applications in production, and two recurring bottlenecks kept appearing. First, analyzing application logs to understand model behavior, error rates, and latency patterns was painfully slow with pandas alone. Second, for similarity search on embeddings in retrieval systems, full FAISS felt like overkill at smaller dataset sizes, yet pure NumPy was too slow.

I explored existing solutions but found a gap, so I built two libraries to fill it: llmlog_engine is a lightweight, embedded analytics engine designed specifically for LLM logs, and mini_faiss is a minimal vector search library that's easier to understand and integrate than full FAISS but significantly faster than NumPy.

Both libraries share the same philosophy: solve one problem exceptionally well with minimal dependencies and maximum performance.

What My Projects Do

llmlog_engine: Columnar Analytics for LLM Logs

A specialized embedded database for analyzing LLM application logs stored as JSONL.

Core capabilities:

  • Fast JSONL ingestion into columnar storage format
  • Efficient filtering on numeric and string columns
  • Group-by aggregations (COUNT, SUM, AVG, MIN, MAX)
  • Dictionary encoding for low-cardinality strings (model names, routes)
  • SIMD-friendly memory layout for performance
  • pandas DataFrame integration

Performance:

  • 6.8x faster than pure Python on 100k rows
  • Benchmark: Filter by model + latency, group by route, compute 6 metrics
    • Pure Python: 0.82s
    • C++ Engine: 0.12s

mini_faiss: Lightweight Vector Similarity Search

A focused, high-performance library for similarity search in dense embeddings.

Core capabilities:

  • SIMD-accelerated distance computation (L2 and inner product)
  • NumPy-friendly API with clean type signatures
  • ~1500 lines of readable C++ code
  • Support for both Euclidean and cosine similarity
  • Heap-based top-k selection

Performance:

  • ~7x faster than pure NumPy on typical workloads
  • Benchmark: 100k vectors, 768 dimensions
    • mini_faiss: 0.067s
    • NumPy: 0.48s

Architecture Philosophy

Both libraries follow the same design pattern:

  1. Core logic in C++17: Performance-critical operations using modern C++
  2. Python bindings via pybind11: Zero-copy data transfer with NumPy
  3. Minimal dependencies: No heavy frameworks or complex build chains
  4. Columnar/SIMD-friendly layouts: Data structures optimized for CPU cache
  5. Type safety: Strict validation at Python/C++ boundary

This approach delivers near-native performance while maintaining Python's developer experience.
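As a concrete (hypothetical) example of point 5, here's the kind of validation a thin Python wrapper can perform before an array crosses the pybind11 boundary. The function is illustrative, not either library's actual API:

import numpy as np

def _validate_matrix(x: np.ndarray, dim: int) -> np.ndarray:
    """Validate dtype, shape, and layout before crossing into C++."""
    if x.ndim != 2 or x.shape[1] != dim:
        raise ValueError(f"expected shape (n, {dim}), got {x.shape}")
    if x.dtype != np.float32:
        raise TypeError(f"expected float32, got {x.dtype}")
    # Zero-copy handoff via pybind11 requires a C-contiguous buffer;
    # this copies only if the input isn't already contiguous.
    return np.ascontiguousarray(x)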

Syntax Examples

llmlog_engine

Load and analyze logs:

from llmlog_engine import LogStore

# Load JSONL logs
store = LogStore.from_jsonl("production_logs.jsonl")

# Analyze slow responses by model
slow_by_model = (store.query()
    .filter(min_latency_ms=500)
    .aggregate(
        by=["model"],
        metrics={
            "count": "count",
            "avg_latency": "avg(latency_ms)",
            "max_latency": "max(latency_ms)"
        }
    ))

print(slow_by_model)  # Returns pandas DataFrame

Error analysis:

# Analyze error rates by model and route
errors = (store.query()
    .filter(status="error")
    .aggregate(
        by=["model", "route"],
        metrics={"count": "count"}
    ))

Combined filters:

# Filter by multiple conditions (AND logic)
result = (store.query()
    .filter(
        model="gpt-4.1",
        min_latency_ms=1000,
        route="chat"
    )
    .aggregate(
        by=["model"],
        metrics={"avg_tokens": "avg(tokens_output)"}
    ))

Expected JSONL format:

{"ts": "2024-01-01T12:00:00Z", "model": "gpt-4.1", "latency_ms": 423, "tokens_input": 100, "tokens_output": 921, "route": "chat", "status": "ok"}
{"ts": "2024-01-01T12:00:15Z", "model": "gpt-4.1-mini", "latency_ms": 152, "tokens_input": 50, "tokens_output": 214, "route": "rag", "status": "ok"}
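If you're wondering how to produce logs in this shape, a few lines of standard-library Python suffice. The helper below is a hypothetical example, not part of the library; only the field names come from the sample above:

import json
from datetime import datetime, timezone

def log_llm_call(path, model, latency_ms, tokens_in, tokens_out, route, status="ok"):
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model": model,
        "latency_ms": latency_ms,
        "tokens_input": tokens_in,
        "tokens_output": tokens_out,
        "route": route,
        "status": status,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # one JSON object per line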

mini_faiss

Basic similarity search:

import numpy as np
from mini_faiss import IndexFlatL2

# Create index for 768-dimensional vectors
d = 768
index = IndexFlatL2(d)

# Add vectors to index
xb = np.random.randn(10000, d).astype("float32")
index.add(xb)

# Search for nearest neighbors
xq = np.random.randn(5, d).astype("float32")
distances, indices = index.search(xq, k=10)

print(distances.shape)  # (5, 10) - 5 queries, 10 neighbors each
print(indices.shape)    # (5, 10)

Cosine similarity search:

from mini_faiss import IndexFlatIP

# Create inner product index
index = IndexFlatIP(d=768)

# Normalize database vectors for cosine similarity
xb = np.random.randn(10000, 768).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
index.add(xb)

# Queries must be normalized the same way
xq = np.random.randn(5, 768).astype("float32")
xq_normalized = xq / np.linalg.norm(xq, axis=1, keepdims=True)

scores, indices = index.search(xq_normalized, k=10)
# With inner product, higher scores mean more similar

Implementation Highlights

llmlog_engine

Columnar storage with dictionary encoding:

  • String columns (model, route, status) mapped to int32 IDs
  • Numeric columns stored as contiguous arrays
  • Filtering operates on compact integer representations

Query execution:

  1. Build boolean mask from filter predicates (AND logic)
  2. Group matching rows by specified columns
  3. Compute aggregations only on filtered rows
  4. Return pandas DataFrame

Example internal representation:

Column: model       [0, 1, 0, 2, 0, ...] (int32 IDs)
Column: latency_ms  [423, 1203, 512, ...] (int32)
Dictionary: model   {0: "gpt-4.1-mini", 1: "gpt-4.1", 2: "gpt-4-turbo"}
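Putting the pieces together, here is a pure-NumPy model of that pipeline. This is a sketch of the idea, not the engine's actual C++ internals:

import numpy as np

# Dictionary-encoded string column: each model name becomes an int32 ID
models = np.array([0, 1, 0, 2, 0], dtype=np.int32)
model_dict = {0: "gpt-4.1-mini", 1: "gpt-4.1", 2: "gpt-4-turbo"}

latency_ms = np.array([423, 1203, 512, 980, 77], dtype=np.int32)

# Step 1: build a boolean mask from the filter predicates (AND logic)
mask = latency_ms >= 500

# Steps 2-3: group the surviving rows and aggregate only those rows
for model_id in np.unique(models[mask]):
    group = latency_ms[mask & (models == model_id)]
    print(model_dict[model_id], len(group), group.mean(), group.max())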

mini_faiss

Distance computation:

  • L2 via the expansion ||q - x||^2 = ||q||^2 - 2*(q·x) + ||x||^2 for each database vector x
  • Precomputes database norms for efficiency
  • Vectorizable loops enable SIMD auto-vectorization
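For intuition, here is that expansion written out in NumPy. This is a sketch of the idea, not the library's C++ code; the payoff is that the database norms are computed once, at add time:

import numpy as np

def l2_sqr_expanded(q, db, db_norms):
    # ||q - x||^2 = ||q||^2 - 2*(q . x) + ||x||^2, with db_norms precomputed
    q_norms = (q ** 2).sum(axis=1, keepdims=True)    # (nq, 1)
    dots = q @ db.T                                  # (nq, n)
    return q_norms - 2.0 * dots + db_norms[None, :]  # (nq, n)

db = np.random.randn(1000, 64).astype("float32")
db_norms = (db ** 2).sum(axis=1)                     # once, when vectors are added
q = np.random.randn(3, 64).astype("float32")
dists = l2_sqr_expanded(q, db, db_norms)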

Top-k selection:

  • Heap-based algorithm: O(N log k) per query
  • Efficient for typical case where k << N
  • Separate implementations for min (L2) and max (inner product)
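A rough Python equivalent of the selection step, using heapq (illustrative only; for L2, a bounded max-heap keeps the k smallest distances seen so far):

import heapq

def top_k_smallest(distances, k):
    """Keep the k smallest distances with an O(N log k) bounded heap."""
    heap = []  # max-heap via negated values: heap[0] tracks the worst kept
    for idx, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))  # evict the current worst
    return sorted((-nd, idx) for nd, idx in heap)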

Row-major storage:

data = [v_0[0], v_0[1], ..., v_0[d-1],
        v_1[0], v_1[1], ..., v_1[d-1],
        ...]

Cache-friendly for batch distance computation.
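For reference, this is exactly the layout of a C-contiguous float32 NumPy array of shape (n, d), which is what makes the zero-copy handoff to C++ possible:

import numpy as np

xb = np.arange(6, dtype=np.float32).reshape(2, 3)  # two 3-d vectors
assert xb.flags["C_CONTIGUOUS"]
print(xb.ravel())  # [v_0[0] v_0[1] v_0[2] v_1[0] v_1[1] v_1[2]]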

Installation

Both libraries use standard Python packaging:

# llmlog_engine
git clone https://github.com/yuuichieguchi/llmlog_engine.git
cd llmlog_engine
pip install -e .

# mini_faiss
git clone https://github.com/yuuichieguchi/mini_faiss.git
cd mini_faiss
pip install .

Requirements:

  • Python 3.8+
  • C++17 compiler (GCC, Clang, MSVC)
  • CMake 3.15+
  • pybind11 (installed via pip)

Use Cases

llmlog_engine

  • Monitor LLM application health in production
  • Analyze latency patterns by model and endpoint
  • Track error rates and failure modes
  • Debug performance regressions
  • Generate usage reports for cost analysis

mini_faiss

  • Dense retrieval for RAG systems
  • Document similarity search
  • Image search using vision model embeddings
  • Recommendation systems (nearest neighbor recommendations)
  • Prototyping before scaling to full FAISS

Known Limitations

llmlog_engine

  • In-memory only (no persistence yet)
  • Single-threaded query execution
  • No complex expressions or nested objects
  • No distributed processing

mini_faiss

  • Brute force search only (no approximate methods)
  • Append-only index (no deletion/updates)
  • Fixed vector dimension per index
  • Single machine, memory-limited (~1M vectors at 768d ≈ 3GB)
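For context on that last number, the raw buffer size is easy to estimate:

n, d = 1_000_000, 768
bytes_needed = n * d * 4   # float32 = 4 bytes
print(bytes_needed / 1e9)  # ~3.07 GB for the raw vectors alone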

Both libraries prioritize simplicity and correctness in V1. Advanced features (parallel execution, approximate search, compression) can be added without breaking APIs.

Target Audience

These libraries are for Python developers who:

  • Need better performance than pure Python/NumPy
  • Want minimal dependencies and simple APIs
  • Prefer understanding their dependencies (both are <2000 lines of C++)
  • Are building small to medium-scale systems
  • Value type safety and clean abstractions

I'm actively using both in production, so they're battle-tested against real workloads.

Comparison to Alternatives

llmlog_engine vs. pandas/DuckDB:

  • More specialized: purpose-built for LLM log schema
  • Faster for common queries on columnar data
  • Simpler: no SQL, just Python method chaining
  • Embedded: no external process or server

mini_faiss vs. FAISS/NumPy:

  • Simpler than FAISS: easier to understand, modify, debug
  • Faster than NumPy: SIMD acceleration, optimized layout
  • Smaller scope: does one thing well (exact search)
  • Better for learning: clean, readable implementation

Future Roadmap

llmlog_engine

  • Memory-mapped on-disk format
  • Parallel query execution
  • SIMD micro-optimizations
  • Timestamp range filters
  • Compression for numeric columns

mini_faiss

  • Approximate search methods (IVF, PQ, HNSW)
  • GPU acceleration (CUDA/Metal)
  • Index serialization (save/load)
  • Multi-threaded search
  • Custom distance functions

Feedback Welcome

I'd love to hear:

  • Does this solve problems you're facing?
  • What features would make these more useful?
  • Any bugs or edge cases I should handle?
  • Performance bottlenecks in your use cases?

Both projects are MIT licensed and contributions are welcome!

llmlog_engine: https://github.com/yuuichieguchi/llmlog_engine

mini_faiss: https://github.com/yuuichieguchi/mini_faiss

Thanks for reading!