Hi everyone — I'm Benedetto Proietti, Head of Architecture at Janea Systems. I spend most of my time working on performance-critical systems and solving interesting problems at the intersection of machine learning, distributed systems, and open-source technologies. This post is about a recent project I conceived and directed: JECQ, an innovative open-source compression solution built specifically for FAISS users. I'll walk you through the thinking behind it, how it works, and how it can help reduce memory usage without compromising too much on accuracy.
Ever wonder how it takes just milliseconds to search something on Google, despite hundreds of billions of webpages in existence? The answer is Google’s index. By the company’s own admission, that index weighs in at over 100,000,000 gigabytes. That’s roughly 95 petabytes.
Now, imagine if you could shrink that index by a factor of six.
That’s exactly what Janea Systems did for vector embeddings—the index of artificial intelligence.
Read on to learn what vector embeddings are, why compressing them matters, how it’s been done until now, and how Janea Systems’ solution pushes it to a whole new level.
The Data Explosion: From Social Media to Large Language Models
The arrival of Facebook in 2004 marked the beginning of the social media era. Today, Facebook has over 3 billion monthly active users worldwide. In 2016, TikTok introduced the world to short-form video, and now has more than a billion monthly users.
And in late 2022, ChatGPT came along.
Every one of these inventions led to an explosion of data being generated and processed online. With Facebook, it was posts and photos. TikTok flooded the web with billions of 30-second dance videos.
When files start flowing in by the millions, companies look for ways to cut storage costs with compression. Facebook compresses the photos we upload to it. TikTok does the same with videos.
What about large language models? Is there anything to compress there?
The answer is yes: vector embeddings.
Vector Embeddings: The Language of Modern AI
Think of vector embeddings as the DNA of meaning inside a language model. When you type something like “Hi, how are you?”, the model converts that phrase into embeddings: a set of vectors that capture how it relates to other phrases. These embeddings help the model process the input and figure out how likely different words are to come next. This is how the model knows the right response to “Hi, how are you?” is “I’m good, and you?” rather than “That’s not something you’d ask a cucumber”.
The principle behind vector embeddings also underpins a process called “similarity search.” Here, embeddings represent larger units of meaning—like entire documents—powering use cases like retrieval-augmented generation (RAG), recommendation engines, and more.
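To make this concrete, here is a minimal, hedged sketch of similarity search in Python with NumPy. The toy documents, their four-dimensional vectors, and the query embedding are all made up for illustration; a real model would produce embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 4-dimensional "embeddings" for three documents (illustrative values only).
docs = {
    "refund policy":  np.array([0.9, 0.1, 0.0, 0.2]),
    "shipping times": np.array([0.1, 0.8, 0.3, 0.0]),
    "greeting":       np.array([0.0, 0.2, 0.9, 0.4]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend this is the embedding of "how do I get my money back?"
query = np.array([0.85, 0.15, 0.05, 0.1])

# Rank documents by similarity to the query and take the best match.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # -> refund policy
```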
It should be pretty clear by now that vector embeddings are central not just to how generative AI works, but to a wide range of AI applications across industries.
The Hidden Costs of High-Dimensional Data: Why Vector Compression is Crucial
The problem is that vector embeddings take up space. And the faster and more accurate we want an AI system to be, the more vector embeddings it needs, and the more space it takes to store them. But this isn't just a storage-cost problem: the bigger the embeddings, the more bandwidth they consume on the PCIe bus and the memory bus. It's also an issue for edge AI devices: without constant internet access, their AI models need to run efficiently within the limited space they have onboard.
That's why it makes sense to push compression even further, even though embeddings are already being compressed today. Squeezing another 10% out of the footprint can mean real savings and a much better user experience for IoT devices running generative AI.
At Janea Systems, we saw this opportunity and built an advanced C++ library based on FAISS.
FAISS—short for Facebook AI Similarity Search—is Meta’s open-source library for fast vector similarity search, offering an 8.5x speedup over earlier solutions. Our library takes it further by optimizing the storage and retrieval of large-scale vector embeddings in FAISS—cutting storage costs and boosting AI performance on IoT and edge devices.
The Industry Standard: A Look at Product Quantization (PQ)
Vector embeddings are stored in a specialized data structure called a vector index. The index lets AI systems quickly find and retrieve the vectors closest to any input (e.g., a user question) and match it with an accurate output.
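For readers who haven't used a vector index before, here is a minimal sketch using FAISS's Python bindings. The dimensionality, dataset size, and random vectors are placeholders; real embeddings would come from a model.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                             # embedding dimensionality (placeholder)
xb = np.random.rand(100_000, d).astype("float32")   # embeddings to store
xq = np.random.rand(5, d).astype("float32")         # query embeddings

index = faiss.IndexFlatL2(d)   # exact, uncompressed index
index.add(xb)                  # store every vector as-is

# For each query, retrieve the ids and distances of the 4 closest vectors.
distances, ids = index.search(xq, 4)
print(ids)  # row i lists the stored vectors nearest to query i
```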
A major constraint for vector indexes is space. The more vectors you store—and the higher their dimensionality—the more memory or disk you need. This isn’t just a storage problem; it affects whether the index fits in RAM, whether queries run fast, and whether the system can operate on edge devices.
Then there’s the question of accuracy. If you store vectors without compression, you get the most accurate results possible. But the process is slow, resource-intensive, and often impractical at scale. The alternative is to apply compression, which saves space and speeds things up, but sacrifices accuracy.
The most common way to manage this trade-off is a method called Product Quantization (PQ) (Fig. 1).
Fig. 1: PQ’s uniform compression across subspaces
PQ works by splitting each vector into equal-sized subspaces and encoding each sub-vector with a short code from a learned codebook. It’s efficient, hardware-friendly, and the standard in vector search systems like FAISS.
But because PQ treats every subspace identically, it’s like compressing every video frame in the same way and to the same size, whether it’s entirely black or full of detail. This approach keeps things simple and efficient but misses the opportunity to increase compression on a case-by-case basis.
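Here is a hedged sketch of what plain PQ looks like through FAISS's Python API. With the illustrative numbers below, every vector is cut into 16 equal sub-vectors and each sub-vector is encoded with 8 bits, so every vector shrinks to 16 bytes regardless of how informative its individual dimensions are.

```python
import numpy as np
import faiss

d, M, nbits = 128, 16, 8                            # 16 equal subspaces, 8 bits each (illustrative)
xb = np.random.rand(20_000, d).astype("float32")    # placeholder training/storage data

pq_index = faiss.IndexPQ(d, M, nbits)
pq_index.train(xb)   # learn a small codebook per subspace via k-means
pq_index.add(xb)     # each vector is stored as M one-byte codes

# 16 bytes per vector instead of d * 4 = 512 bytes of raw float32:
# a 32x reduction, applied uniformly to every subspace.
print(pq_index.pq.code_size)  # -> 16
```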
At Janea, we realized that vector dimensions vary in how much information they carry—much like video frames vary in resolution and detail. This means we can adjust the aggressiveness of compression (or, more precisely, quantization) based on how relevant each dimension is, with little effect on overall accuracy.
Solution: JECQ - Intelligent, Dimension-Aware Compression for FAISS
To strike the right balance between memory efficiency and accuracy, engineers at Janea Systems have developed JECQ, a novel, open-source compression algorithm available on GitHub that varies compression by the statistical relevance of each dimension.
In this approach, the distances between quantized values become irregular, reflecting each dimension's complexity.
How does JECQ work?
The algorithm starts by determining the isotropy of each dimension based on the eigenvalues of the covariance matrix. In the future, the analysis will also cover sparsity and information density.
The algorithm then classifies each dimension into one of three relevance categories: low, medium, or high. Each tier is handled differently, as sketched after these steps.
Dimensions with low relevance are discarded, with very little loss in accuracy.
Medium-relevance dimensions are quantized using just one bit, again with minimal impact on accuracy.
High-relevance dimensions undergo the standard product quantization.
Compressed vectors are stored in a custom, compact format accessible via a lightweight API.
The solution is compatible with existing vector databases and ANN frameworks, including FAISS.
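The snippet below is a simplified NumPy sketch of this three-tier idea, not JECQ's actual implementation. It scores each dimension by its variance (a crude stand-in for the eigenvalue-based isotropy analysis), drops the low-relevance tier, quantizes the medium tier to one bit per value, and passes the high-relevance tier through for standard PQ. The thresholds and data are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embeddings: 64 dimensions with very different spreads, so the tiers differ.
X = rng.normal(size=(10_000, 64)) * rng.uniform(0.01, 2.0, size=64)

# Score each dimension. Per-dimension variance is used here as a simple proxy;
# JECQ itself analyzes eigenvalues of the covariance matrix.
scores = np.var(X, axis=0)

# Arbitrary placeholder thresholds splitting dimensions into three tiers.
low_t, high_t = np.quantile(scores, [0.3, 0.7])
low    = scores < low_t                             # tier 1: discard entirely
medium = (scores >= low_t) & (scores < high_t)      # tier 2: keep one bit per value
high   = scores >= high_t                           # tier 3: standard PQ

means = X[:, medium].mean(axis=0)

def compress(v):
    # Medium-relevance dims become a single sign bit relative to the mean;
    # high-relevance dims are returned untouched for PQ encoding downstream.
    one_bit = (v[medium] > means).astype(np.uint8)
    return one_bit, v[high]

bits, pq_input = compress(X[0])
print(f"{low.sum()} dims dropped, {medium.sum()} dims at 1 bit, {high.sum()} dims to PQ")
```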
What are the benefits and best use cases for JECQ?
Early tests show the memory footprint reduced by 6x while retaining 84.6% of the accuracy achieved with uncompressed vectors. Figure 2 compares the memory footprint of an index before quantization, with product quantization (PQ), and with JECQ.
Fig. 2: Memory footprint before quantization, with PQ, and with JECQ
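As a hedged back-of-envelope illustration (the corpus size and dimensionality below are assumptions for the sake of arithmetic, not numbers from our tests), a 6x reduction at scale looks like this:

```python
# Illustrative assumption: 100 million embeddings of 768 float32 dimensions.
num_vectors = 100_000_000
bytes_per_vector = 768 * 4                        # float32 is 4 bytes per dimension

raw_gb = num_vectors * bytes_per_vector / 1e9
print(f"uncompressed index: ~{raw_gb:,.0f} GB")       # ~307 GB
print(f"with a 6x reduction: ~{raw_gb / 6:,.0f} GB")  # ~51 GB
```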
We expect this will lower cloud and on-prem storage costs for enterprise AI search, enhance Edge AI performance by fitting more embeddings per device for RAG or semantic search, and reduce the storage footprint of historical embeddings.
What Are JECQ’s License and Features?
JECQ is out on GitHub, available under the MIT license. It ships with an optimizer that takes a representative data sample or user-provided data and generates an optimized parameter set. Users can then fine-tune this by adjusting the objective function to balance their preferred accuracy–performance trade-off.
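JECQ's optimizer has its own interface, documented in the GitHub repository; purely to illustrate what adjusting an objective function to balance accuracy against performance means in practice, here is a generic, hypothetical scoring function. Neither the function name nor the weighting below is part of JECQ's API.

```python
def tradeoff_score(recall, memory_bytes, baseline_bytes, alpha=0.5):
    """Hypothetical objective, not JECQ's actual API.

    alpha close to 1.0 favors accuracy; alpha close to 0.0 favors compression.
    """
    compression_gain = 1.0 - memory_bytes / baseline_bytes
    return alpha * recall + (1.0 - alpha) * compression_gain

# Example: a configuration keeping 84.6% accuracy at one sixth of the memory.
print(tradeoff_score(recall=0.846, memory_bytes=1.0, baseline_bytes=6.0))
```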
We're planning to share more tools, experiments, and lessons learned from our work in open-source, AI infrastructure, and performance engineering. If this kind of stuff interests you, stay tuned — more to come soon.