<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Janea Systems</title>
    <description>The latest articles on DEV Community by Janea Systems (@janeasystems).</description>
    <link>https://dev.to/janeasystems</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11177%2Fd46133ae-eb2b-4d4f-bd9b-855e60a87635.png</url>
      <title>DEV Community: Janea Systems</title>
      <link>https://dev.to/janeasystems</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/janeasystems"/>
    <language>en</language>
    <item>
      <title>How to Build a Vector Database Using Amazon S3 Vectors</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 30 Jul 2025 07:35:31 +0000</pubDate>
      <link>https://dev.to/janeasystems/how-to-build-a-vector-database-using-amazon-s3-vectors-2kdl</link>
      <guid>https://dev.to/janeasystems/how-to-build-a-vector-database-using-amazon-s3-vectors-2kdl</guid>
      <description>&lt;p&gt;&lt;em&gt;And Say Goodbye to Expensive SaaS Pricing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s an estimated price comparison for storing 1 billion and 10 billion vectors using the most common SaaS vector databases. These numbers are pulled directly from each provider’s pricing calculator. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8exmvfv2yg4ydcx1w1cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8exmvfv2yg4ydcx1w1cp.png" alt="Assumptions: 768 dimensions, 32-bit values per vector" width="549" height="138"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;Why Are Vector Databases So Expensive?&lt;/h1&gt;

&lt;p&gt;I’ve covered this before in two articles — &lt;a href="https://medium.com/@benedetto73/desire-for-structure-read-sql-9464766ec509" rel="noopener noreferrer"&gt;Desire for Structure (read: “SQL”)&lt;/a&gt; and &lt;a href="https://medium.com/@benedetto73/beyond-the-art-of-database-indexing-436b14dcf987" rel="noopener noreferrer"&gt;(Beyond) The Art of Database Indexing&lt;/a&gt;. Traditional indexing starts to fall apart at scale — what we used to call “Big Data.” &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxjgohhihw4f5nmdvqkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxjgohhihw4f5nmdvqkn.png" alt=" " width="400" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want the short version? &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Vector databases are expensive because they rely on powerful, always-on hardware. They keep in-memory indexes fresh and caches hot — and that costs money.&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;Old Tricks, New Vector World&lt;/h1&gt;

&lt;p&gt;For tabular or log data, we decoupled compute and storage a long time ago: store data cheaply in S3 or MinIO, and spin up compute (like Spark or Presto) only when needed. &lt;/p&gt;

&lt;p&gt;Amazon has now extended this model to vector embeddings with Amazon S3 Vectors. [Quick dive &lt;a href="https://medium.com/@benedetto73/s3-vectors-changing-how-we-think-about-vector-embeddings-b04f8af2e3cd" rel="noopener noreferrer"&gt;here&lt;/a&gt;.] &lt;/p&gt;

&lt;p&gt;S3 Vectors lets you store huge volumes of vector data at low cost and run similarity searches in under a second — ideal for batch workloads and analytics. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can we ditch expensive Vector DBs now?&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Not quite. S3 Vectors doesn’t offer the low latency needed for real-time use cases — like fraud detection, recommendation engines, or chatbots that require sub-100 ms responses. &lt;/p&gt;

&lt;p&gt;Instead, think of S3 Vectors as the durable, budget-friendly foundation. You’ll still need a lightweight layer on top to meet real-time latency requirements. &lt;/p&gt;

&lt;h1&gt;Level 1: Download and Run&lt;/h1&gt;

&lt;p&gt;Let’s start simple: use open-source software out of the box, no code changes, just run it. &lt;/p&gt;

&lt;p&gt;We lower similarity search latency by using a classic Computer Science technique — indexes — and storing data in RAM (which is fast but expensive). &lt;/p&gt;

&lt;h2&gt;Product Quantization (PQ): Fast, Memory-Efficient Search&lt;/h2&gt;

&lt;p&gt;Performing exact distance calculations (cosine, Euclidean) on billions of 768-dimensional vectors is too slow and compute-heavy. &lt;/p&gt;

&lt;p&gt;Product Quantization (PQ) helps by compressing vectors into compact forms. This makes searches 10–100× faster — with minimal accuracy loss. &lt;/p&gt;

&lt;h2&gt;How PQ Works&lt;/h2&gt;

&lt;p&gt;PQ splits each high-dimensional vector into smaller chunks (e.g., groups of 8 dimensions), then maps each chunk to the closest centroid in a precomputed codebook. Only the centroid IDs are stored. &lt;/p&gt;

&lt;p&gt;At query time, instead of comparing against billions of raw vectors, the system compares to ~256 centroids per chunk — massively reducing compute time. &lt;/p&gt;

&lt;p&gt;For most NLP workloads, PQ delivers excellent recall while cutting memory and compute costs. &lt;/p&gt;
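&lt;p&gt;Here is a toy numpy sketch of that encode/decode cycle. The codebooks are random stand-ins (real ones are trained with k-means over representative data), and all names are illustrative:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 768, 96          # vector dims, number of sub-vectors (8 dims each)
K = 256                 # centroids per sub-space, so each chunk fits in 1 byte
SUB = D // M

# Toy "codebooks": in practice these come from k-means over training data.
codebooks = rng.normal(size=(M, K, SUB)).astype(np.float32)

def pq_encode(v):
    """Map each 8-dim chunk to the ID of its nearest centroid."""
    chunks = v.reshape(M, SUB)
    # (M, K) squared distances from each chunk to each of its K centroids
    d = ((codebooks - chunks[:, None, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1).astype(np.uint8)   # M bytes instead of D floats

def pq_decode(codes):
    """Approximate reconstruction from the stored centroid IDs."""
    return codebooks[np.arange(M), codes].reshape(D)

v = rng.normal(size=D).astype(np.float32)
codes = pq_encode(v)       # 96 bytes vs 3072 bytes raw: a 32x compression
approx = pq_decode(codes)
print(codes.shape, approx.shape)
```

At query time, the same trick applies per chunk: distances to the 256 centroids are precomputed once, then each candidate's score is just a table lookup per chunk.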

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh0tf0n5lzh33kx0ti9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh0tf0n5lzh33kx0ti9s.png" alt=" " width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Tool Selection&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FAISS&lt;/strong&gt; — originally developed by Facebook AI Research — is the go-to library for efficient similarity search and clustering of dense vectors. It’s widely adopted for high-performance vector indexing at scale. But I recommend &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;JECQ&lt;/a&gt;, a drop-in replacement with 6× lower memory usage. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: I created JECQ. That said, it works well. But use FAISS if you prefer.&lt;/em&gt; &lt;/p&gt;

&lt;h2&gt;In-Memory Cache&lt;/h2&gt;

&lt;p&gt;You can cache a subset of raw vectors in RAM using tools like Redis or Valkey, depending on your licensing needs. &lt;/p&gt;

&lt;p&gt;For 10 billion vectors (~30TB in S3), storing just 1% in RAM (about 300GB) can make a big difference. Pricey, but manageable. &lt;/p&gt;
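&lt;p&gt;The back-of-envelope math behind those figures, assuming 768-dimension float32 vectors as in the pricing table above (the article rounds down slightly):&lt;/p&gt;

```python
n_vectors = 10_000_000_000      # 10 billion
dims, bytes_per_dim = 768, 4    # 768 dimensions, float32

raw_bytes = n_vectors * dims * bytes_per_dim
print(f"raw vectors: {raw_bytes / 1e12:.1f} TB")        # about 30.7 TB in S3
print(f"1% hot set:  {raw_bytes * 0.01 / 1e9:.0f} GB")  # about 307 GB in RAM
```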

&lt;h1&gt;Level 1 Architecture&lt;/h1&gt;

&lt;p&gt;Let’s walk through the architecture: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jsotg9tntv2grw9neg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jsotg9tntv2grw9neg3.png" alt="Image: AI-generated" width="400" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A router service handles incoming similarity search requests. It’s stateless and can scale horizontally. &lt;/li&gt;
&lt;li&gt;Each node loads a copy of the JECQ (or FAISS) index in memory. &lt;/li&gt;
&lt;li&gt;The router uses JECQ to find candidate vector IDs. &lt;/li&gt;
&lt;li&gt;It then checks Redis for the raw vectors:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache hit:&lt;/strong&gt; Redis returns the vectors. The router re-ranks and returns results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache miss:&lt;/strong&gt; The router pulls the vectors from S3, re-ranks, and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
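&lt;p&gt;The hit/miss flow above can be sketched in a few lines of Python. This is a toy stand-in with hypothetical names: plain dicts play the roles of Redis and S3, and &lt;code&gt;candidate_ids&lt;/code&gt; stands in for the JECQ/FAISS lookup:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
# "S3": the full raw-vector store; "Redis": a small hot subset in RAM
store = {f"vec:{i}": rng.normal(size=8).astype(np.float32) for i in range(100)}
cache = {key: store[key] for key in list(store)[:10]}

def candidate_ids(query, k=20):
    # Stand-in for the JECQ/FAISS index: in production this is an ANN lookup.
    return list(store)[:k]

def fetch_raw(vec_id):
    if vec_id in cache:           # cache hit: served from RAM
        return cache[vec_id]
    vec = store[vec_id]           # cache miss: pull from S3 ...
    cache[vec_id] = vec           # ... and populate the cache for next time
    return vec

def search(query, k=5):
    cands = candidate_ids(query)
    raw = {cid: fetch_raw(cid) for cid in cands}
    # Re-rank candidates by exact distance to the query, then return the top k
    ranked = sorted(raw, key=lambda cid: float(np.linalg.norm(raw[cid] - query)))
    return ranked[:k]

print(search(rng.normal(size=8).astype(np.float32)))
```

The router itself holds no state beyond the cache, which is why it can scale horizontally.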

&lt;h1&gt;Level 2: Teaser&lt;/h1&gt;

&lt;p&gt;Level 1 works fine for datasets up to ~1 billion vectors or demo workloads. But if you want 10–100 ms P95 latency at multi-billion scale, you’ll need more: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local raw vectors on NVMe:&lt;/strong&gt; A middle layer (5–10% of raw size, ~1.5TB) between RAM and S3 to avoid frequent S3 fetches. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg0sihlvptis6zovtcce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg0sihlvptis6zovtcce.png" alt="Data layers hierarchy" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hierarchical data layers:&lt;/strong&gt; JECQ + Redis/NVMe integration enables local posting list retrieval, turning 100 ms S3 reads into 2–5 ms NVMe reads. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index sharding:&lt;/strong&gt; Splits PQ clusters across nodes and avoids duplicating 100GB+ compressed data per node. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced cache management:&lt;/strong&gt; Store frequent queries, support MFU/LFU/LRU caching strategies, and pre-load data based on user behavior. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggressive S3 Vectors indexing:&lt;/strong&gt; Each query hits just one index. A single S3 bucket can hold 10K indexes, each with ≤50M vectors. Smart indexing helps reduce latency significantly. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this requires solid engineering chops — but it's necessary if you want to build a cost-effective vector database with 10–100 ms latency on top of S3 Vectors. &lt;/p&gt;

&lt;p&gt;Stay tuned for Level 2. &lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>S3 Vectors: Changing How We Think About Vector Embeddings</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 16 Jul 2025 16:11:26 +0000</pubDate>
      <link>https://dev.to/janeasystems/s3-vectors-changing-how-we-think-about-vector-embeddings-55nm</link>
      <guid>https://dev.to/janeasystems/s3-vectors-changing-how-we-think-about-vector-embeddings-55nm</guid>
      <description>&lt;p&gt;Inserting and maintaining data in a relational database is expensive. Every write must update one or more indexes (data structures such as B-trees) that accelerate reads at the cost of extra CPU, memory, and I/O. On a single node, tables start to struggle once they pass a few terabytes. Distributed SQL and NoSQL systems push that limit, but the fundamental write amplification costs remain.&lt;/p&gt;

&lt;h2&gt;Object Storage&lt;/h2&gt;

&lt;p&gt;To escape those costs, teams began landing raw data in cloud object stores like Amazon S3. Instead of hot indexes, query engines (Spark, Athena, Trino) rely on partition pruning and lightweight statistics. This led to dramatically lower storage bills and petabyte-scale datasets on commodity hardware.&lt;/p&gt;

&lt;h2&gt;Vector Embeddings&lt;/h2&gt;

&lt;p&gt;AI and LLM workloads now emit vector embeddings – hundreds or thousands of dimensions per record. Answering “Which vectors are nearest to this one?” in real time is tricky:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-dimensional data breaks classic data structures.&lt;/li&gt;
&lt;li&gt;We lean on approximate nearest neighbor (ANN) algorithms such as HNSW or IVFPQ.&lt;/li&gt;
&lt;li&gt;Queries often combine a distance threshold with metadata filters.&lt;/li&gt;
&lt;li&gt;Recall, precision, and latency form a three-way tradeoff.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Amazon S3 Vectors: A Game-Changer&lt;/h2&gt;

&lt;p&gt;Announced yesterday, Amazon S3 Vectors brings vector-aware storage classes to S3. Each vector table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores vectors of fixed dimensionality, compressed on write. Not possible with traditional S3.&lt;/li&gt;
&lt;li&gt;Supports ANN search with simultaneous filters on metadata. Immensely faster than S3.&lt;/li&gt;
&lt;li&gt;Delivers sub-second latency: great for batch, a bit slow for interactive UX.&lt;/li&gt;
&lt;/ul&gt;
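&lt;p&gt;S3 Vectors is exposed through the AWS SDK. Below is a minimal sketch of what a similarity query might look like; the client name, operation, and request fields follow the launch announcement but should be treated as assumptions, and names like &lt;code&gt;my-vector-bucket&lt;/code&gt; and &lt;code&gt;docs-index&lt;/code&gt; are hypothetical:&lt;/p&gt;

```python
# Sketch only: request shape per the S3 Vectors launch material, not verified API.
def build_query(bucket, index, embedding, top_k=10):
    """Assemble a query_vectors request for an ANN similarity search."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": embedding},  # raw float32 query embedding
        "topK": top_k,
    }

req = build_query("my-vector-bucket", "docs-index", [0.1] * 768)

# In a real deployment (boto3 installed, AWS credentials configured):
# import boto3
# client = boto3.client("s3vectors")
# resp = client.query_vectors(**req)   # sub-second ANN search over one index
print(req["topK"], len(req["queryVector"]["float32"]))
```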

&lt;h2&gt;Closing the Latency Gap with In-Memory Caching&lt;/h2&gt;

&lt;p&gt;Janea Systems’ background is deeply rooted in working with in-memory, low-latency caches. Our track record includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are the creators of Memurai, the official Redis for Windows, trusted by developers for its performance and reliability.&lt;/li&gt;
&lt;li&gt;We are active contributors to Valkey, a rapidly evolving open-source fork of Redis, pushing the boundaries of in-memory data stores.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 Vectors offers powerful storage and batch processing, but leaves room for improvement in interactive scenarios. The next logical step, then, is to place a high-performance cache on top of it.&lt;/p&gt;

&lt;h2&gt;The Future&lt;/h2&gt;

&lt;p&gt;We are excited about the possibilities Amazon S3 Vectors unlocks. The upcoming articles will cover how to effectively integrate Redis, Valkey, or Memurai with the S3 Vector service to achieve optimal performance for your AI/LLM workloads. Also, we will explore the new AWS service and its implications for modern data architectures in detail. Stay tuned!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>awsbigdata</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>JECQ: Smart, Open-Source Compression for FAISS Users—6x Compression Ratio, 85% Accuracy</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 09 Jul 2025 12:19:58 +0000</pubDate>
      <link>https://dev.to/janeasystems/jecq-smart-open-source-compression-for-faiss-users-6x-compression-ratio-85-accuracy-5a5n</link>
      <guid>https://dev.to/janeasystems/jecq-smart-open-source-compression-for-faiss-users-6x-compression-ratio-85-accuracy-5a5n</guid>
      <description>&lt;p&gt;Hi everyone — I'm Benedetto Proietti, Head of Architecture at Janea Systems. I spend most of my time working on performance-critical systems and solving interesting problems at the intersection of machine learning, distributed systems, and open-source technologies. This post is about a recent project I ideated and directed: &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;JECQ&lt;/a&gt;, an innovative, open-source, compression solution built specifically for FAISS users. I’ll walk you through the thinking behind it, how it works, and how it can help reduce memory usage without compromising too much on accuracy.&lt;/p&gt;




&lt;p&gt;Ever wonder how it takes just milliseconds to search something on Google, despite hundreds of billions of webpages in existence? The answer is Google’s index. By the company’s own admission, that index weighs in at over 100,000,000 gigabytes. That’s roughly 95 petabytes.&lt;/p&gt;

&lt;p&gt;Now, imagine if you could shrink that index by a factor of six.&lt;/p&gt;

&lt;p&gt;That’s exactly what Janea Systems did for vector embeddings—the index of artificial intelligence.&lt;/p&gt;

&lt;p&gt;Read on to learn what vector embeddings are, why compressing them matters, how it’s been done until now, and how Janea Systems’ solution pushes it to a whole new level.&lt;/p&gt;

&lt;h2&gt;The Data Explosion: From Social Media to Large Language Models&lt;/h2&gt;

&lt;p&gt;The arrival of Facebook in 2004 marked the beginning of the social media era. Today, Facebook has over 3 billion monthly active users worldwide. In 2016, TikTok introduced the world to short-form video, and now has more than a billion monthly users.&lt;/p&gt;

&lt;p&gt;And in late 2022, ChatGPT came along.&lt;/p&gt;

&lt;p&gt;Every one of these inventions led to an explosion of data being generated and processed online. With Facebook, it was posts and photos. TikTok flooded the web with billions of 30-second dance videos.&lt;/p&gt;

&lt;p&gt;When data starts flowing by the millions, companies look for ways to cut storage costs with compression. Facebook compresses the photos we upload to it. TikTok does the same with videos.&lt;/p&gt;

&lt;p&gt;What about large language models? Is there anything to compress there?&lt;/p&gt;

&lt;p&gt;The answer is yes: vector embeddings.&lt;/p&gt;

&lt;h2&gt;Vector Embeddings: The Language of Modern AI&lt;/h2&gt;

&lt;p&gt;Think of vector embeddings as the DNA of meaning inside a language model. When you type something like “Hi, how are you?”, the model converts that phrase into embeddings—a set of vectors that capture how it relates to other phrases. These embeddings help the model process the input and figure out how likely different words are to come next. This allows the model to know the right response to “Hi, how are you?” is “I’m good, and you?” instead of “That’s not something you’d ask a cucumber.”&lt;/p&gt;

&lt;p&gt;The principle behind vector embeddings also underpins a process called “similarity search.” Here, embeddings represent larger units of meaning—like entire documents—powering use cases like retrieval-augmented generation (RAG), recommendation engines, and more.&lt;/p&gt;
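&lt;p&gt;At its core, similarity search means comparing the directions of embedding vectors, usually with cosine similarity. A toy sketch with made-up 4-dimensional “embeddings” (real models emit hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical toy embeddings: similar phrases point in similar directions.
greeting = [0.9, 0.1, 0.0, 0.2]
reply    = [0.8, 0.2, 0.1, 0.3]
cucumber = [0.0, 0.9, 0.8, 0.1]

print(cosine(greeting, reply) > cosine(greeting, cucumber))  # True
```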

&lt;p&gt;It should be pretty clear by now that vector embeddings are central not just to how generative AI works, but to a wide range of AI applications across industries.&lt;/p&gt;

&lt;h2&gt;The Hidden Costs of High-Dimensional Data: Why Vector Compression is Crucial&lt;/h2&gt;

&lt;p&gt;The problem is that vector embeddings take up space. The faster and more accurate we want an AI system to be, the more vector embeddings it needs, and the more space it takes to store them. But this isn’t just a storage cost problem: the bigger the embeddings, the more bandwidth they consume on the PCIe and memory buses. It’s also an issue for edge AI devices, which lack constant internet access, so their AI models need to run efficiently within the limited space they have onboard.&lt;/p&gt;

&lt;p&gt;That's why it makes sense to push compression even further, even though embeddings are already compressed today. Squeezing even another 10% out of the footprint can mean real savings, and a much better user experience for IoT devices running generative AI.&lt;/p&gt;

&lt;p&gt;At Janea Systems, we saw this opportunity and built an advanced C++ library based on FAISS.&lt;/p&gt;

&lt;p&gt;FAISS—short for Facebook AI Similarity Search—is Meta’s open-source library for fast vector similarity search, offering an 8.5x speedup over earlier solutions. Our library takes it further by optimizing the storage and retrieval of large-scale vector embeddings in FAISS—cutting storage costs and boosting AI performance on IoT and edge devices.&lt;/p&gt;

&lt;h2&gt;The Industry Standard: A Look at Product Quantization (PQ)&lt;/h2&gt;

&lt;p&gt;Vector embeddings are stored in a specialized data structure called a vector index. The index lets AI systems quickly find and retrieve the closest vectors to any input (e.g., a user question) and match it with accurate output.&lt;/p&gt;

&lt;p&gt;A major constraint for vector indexes is space. The more vectors you store—and the higher their dimensionality—the more memory or disk you need. This isn’t just a storage problem; it affects whether the index fits in RAM, whether queries run fast, and whether the system can operate on edge devices.&lt;/p&gt;

&lt;p&gt;Then there’s the question of accuracy. If you store vectors without compression, you get the most accurate results possible. But the process is slow, resource-intensive, and often impractical at scale. The alternative is to apply compression, which saves space and speeds things up, but sacrifices accuracy.&lt;/p&gt;

&lt;p&gt;The most common way to manage this trade-off is a method called Product Quantization (PQ) (Fig. 1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnx6y49adn6d8sm57v1q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnx6y49adn6d8sm57v1q.webp" alt="Diagram showing how an embedding is split into subspaces, each mapped to codebooks, and converted into a quantized vector using centroid indices." width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Fig. 1: PQ’s uniform compression across subspaces&lt;/p&gt;

&lt;p&gt;PQ works by splitting each vector into equal-sized subspaces. It’s efficient, hardware-friendly, and the standard in vector search systems like FAISS.&lt;/p&gt;

&lt;p&gt;But because PQ treats every subspace identically, it’s like compressing every video frame in the same way and to the same size—whether it’s entirely black or full of detail. This approach keeps things simple and efficient but misses the opportunity to increase compression on a case-by-case basis.&lt;/p&gt;

&lt;p&gt;At Janea, we realized that vector dimensions vary in value—much like video frames vary in resolution and detail. This means we can adjust the aggressiveness of compression (or, more precisely, quantization) based on how relevant each dimension is, without affecting overall accuracy.&lt;/p&gt;

&lt;h2&gt;Solution: JECQ - Intelligent, Dimension-Aware Compression for FAISS&lt;/h2&gt;

&lt;p&gt;To strike the right balance between memory efficiency and accuracy, engineers at Janea Systems have developed JECQ, a novel, open-source compression algorithm available on GitHub that varies compression by the statistical relevance of each dimension.&lt;/p&gt;

&lt;p&gt;In this approach, the distances between quantized values become irregular, reflecting each dimension's complexity.&lt;/p&gt;

&lt;h2&gt;How Does JECQ Work?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The algorithm starts by determining the isotropy of each dimension based on the eigenvalues of the covariance matrix. (In the future, the analysis will also cover sparsity and information density.)&lt;/li&gt;
&lt;li&gt;It then classifies each dimension into one of three categories: low relevance, medium relevance, and high relevance.&lt;/li&gt;
&lt;li&gt;Dimensions with low relevance are discarded, with very little loss in accuracy.&lt;/li&gt;
&lt;li&gt;Medium-relevance dimensions are quantized using just one bit, again with minimal impact on accuracy.&lt;/li&gt;
&lt;li&gt;High-relevance dimensions undergo the standard product quantization.&lt;/li&gt;
&lt;li&gt;Compressed vectors are stored in a custom, compact format accessible via a lightweight API.&lt;/li&gt;
&lt;/ul&gt;
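&lt;p&gt;As a rough illustration of the classification step, here is a toy sketch that uses per-dimension variance as a stand-in for JECQ’s eigenvalue analysis; the thresholds, quartile cutoffs, and data are made up:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 8-dim embeddings whose dimensions differ wildly in spread
scales = np.array([0.01, 0.02, 0.5, 0.6, 0.7, 3.0, 4.0, 5.0])
X = rng.normal(size=(1000, 8)) * scales

# Stand-in relevance score: per-dimension variance (JECQ uses eigenvalues
# of the covariance matrix; quartile thresholds here are illustrative).
var = X.var(axis=0)
lo, hi = np.quantile(var, [0.25, 0.75])

discard = np.less(var, lo)                             # low: dropped entirely
one_bit = np.logical_and(var >= lo, np.less(var, hi))  # medium: 1 bit each
full_pq = var >= hi                                    # high: standard PQ

print(int(discard.sum()), int(one_bit.sum()), int(full_pq.sum()))
```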

&lt;p&gt;The solution is compatible with existing vector databases and ANN frameworks, including FAISS.&lt;/p&gt;

&lt;h2&gt;What Are the Benefits and Best Use Cases for JECQ?&lt;/h2&gt;

&lt;p&gt;Early tests show the memory footprint reduced by 6x while retaining 84.6% accuracy relative to uncompressed vectors. Figure 2 compares the memory footprint of an index before quantization, with product quantization (PQ), and with JECQ. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1hkdu76inhu2t14q61.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1hkdu76inhu2t14q61.webp" alt="Bar chart comparing memory usage: No Quantization (highest), PQ (lower), and JECQ (lowest), with the title " width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Fig. 2: Memory footprint before quantization, with PQ, and with JECQ&lt;/p&gt;

&lt;p&gt;We expect this will lower cloud and on-prem storage costs for enterprise AI search, enhance Edge AI performance by fitting more embeddings per device for RAG or semantic search, and reduce the storage footprint of historical embeddings.&lt;/p&gt;

&lt;h2&gt;What Are JECQ’s License and Features?&lt;/h2&gt;

&lt;p&gt;JECQ is out on &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, available under the MIT license. It ships with an optimizer that takes a representative data sample or user-provided data and generates an optimized parameter set. Users can then fine-tune this by adjusting the objective function to balance their preferred accuracy–performance trade-off.&lt;/p&gt;




&lt;p&gt;We're planning to share more tools, experiments, and lessons learned from our work in open-source, AI infrastructure, and performance engineering. If this kind of stuff interests you, stay tuned — more to come soon.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
