<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Benedetto Proietti</title>
    <description>The latest articles on DEV Community by Benedetto Proietti (@benedetto_proietti).</description>
    <link>https://dev.to/benedetto_proietti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1805651%2Fae294aac-f4ec-48a3-a847-ba4525054fb0.jpg</url>
      <title>DEV Community: Benedetto Proietti</title>
      <link>https://dev.to/benedetto_proietti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benedetto_proietti"/>
    <language>en</language>
    <item>
      <title>How to Build a Vector Database Using Amazon S3 Vectors</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 30 Jul 2025 07:35:31 +0000</pubDate>
      <link>https://dev.to/janeasystems/how-to-build-a-vector-database-using-amazon-s3-vectors-2kdl</link>
      <guid>https://dev.to/janeasystems/how-to-build-a-vector-database-using-amazon-s3-vectors-2kdl</guid>
      <description>&lt;p&gt;&lt;em&gt;And Say Goodbye to Expensive SaaS Pricing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s an estimated price comparison for storing 1 billion and 10 billion vectors using the most common SaaS vector databases. These numbers are pulled directly from each provider’s pricing calculator. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8exmvfv2yg4ydcx1w1cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8exmvfv2yg4ydcx1w1cp.png" alt="Assumptions: 768 dimensions, 32-bit values per vector" width="549" height="138"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h1&gt;
  
  
  Why Are Vector Databases So Expensive?
&lt;/h1&gt;

&lt;p&gt;I’ve covered this before in two articles — &lt;a href="https://medium.com/@benedetto73/desire-for-structure-read-sql-9464766ec509" rel="noopener noreferrer"&gt;Desire for Structure (read: “SQL”)&lt;/a&gt; and &lt;a href="https://medium.com/@benedetto73/beyond-the-art-of-database-indexing-436b14dcf987" rel="noopener noreferrer"&gt;(Beyond) The Art of Database Indexing&lt;/a&gt;. Traditional indexing starts to fall apart at scale — what we used to call “Big Data.” &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxjgohhihw4f5nmdvqkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxjgohhihw4f5nmdvqkn.png" alt=" " width="400" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want the short version? &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Vector databases are expensive because they rely on powerful, always-on hardware. They keep in-memory indexes fresh and caches hot — and that costs money.&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Old Tricks, New Vector World
&lt;/h1&gt;

&lt;p&gt;For tabular or log data, we decoupled compute and storage a long time ago: store data cheaply in S3 or MinIO, and spin up compute (like Spark or Presto) only when needed. &lt;/p&gt;

&lt;p&gt;Amazon has now extended this model to vector embeddings with Amazon S3 Vectors. [Quick dive &lt;a href="https://medium.com/@benedetto73/s3-vectors-changing-how-we-think-about-vector-embeddings-b04f8af2e3cd" rel="noopener noreferrer"&gt;here&lt;/a&gt;.] &lt;/p&gt;

&lt;p&gt;S3 Vectors lets you store huge volumes of vector data at low cost and run similarity searches in under a second — ideal for batch workloads and analytics. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can we ditch expensive Vector DBs now?&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Not quite. S3 Vectors doesn’t offer the low latency needed for real-time use cases — like fraud detection, recommendation engines, or chatbots that require sub-100 ms responses. &lt;/p&gt;

&lt;p&gt;Instead, think of S3 Vectors as the durable, budget-friendly foundation. You’ll still need a lightweight layer on top to meet real-time latency requirements. &lt;/p&gt;

&lt;h1&gt;
  
  
  Level 1: Download and Run
&lt;/h1&gt;

&lt;p&gt;Let’s start simple: use open-source software out of the box, no code changes, just run it. &lt;/p&gt;

&lt;p&gt;We lower similarity search latency by using a classic Computer Science technique — indexes — and storing data in RAM (which is fast but expensive). &lt;/p&gt;

&lt;h2&gt;
  
  
  Product Quantization (PQ): Fast, Memory-Efficient Search
&lt;/h2&gt;

&lt;p&gt;Performing exact distance calculations (cosine, Euclidean) on billions of 768-dimensional vectors is too slow and compute-heavy. &lt;/p&gt;

&lt;p&gt;Product Quantization (PQ) helps by compressing vectors into compact forms. This makes searches 10–100× faster — with minimal accuracy loss. &lt;/p&gt;

&lt;h2&gt;
  
  
  How PQ Works
&lt;/h2&gt;

&lt;p&gt;PQ splits each high-dimensional vector into smaller chunks (e.g., groups of 8 dimensions), then maps each chunk to the closest centroid in a precomputed codebook. Only the centroid IDs are stored. &lt;/p&gt;

&lt;p&gt;At query time, instead of comparing against billions of raw vectors, the system compares to ~256 centroids per chunk — massively reducing compute time. &lt;/p&gt;

&lt;p&gt;For most NLP workloads, PQ delivers excellent recall while cutting memory and compute costs. &lt;/p&gt;
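&lt;p&gt;Here is a minimal numpy sketch of that encode/lookup math. The codebooks below are random toys (real systems train them with k-means), and all sizes and names are illustrative, not the FAISS implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, M = 768, 96          # a 768-dim vector split into 96 chunks of 8 dims
SUB = DIM // M            # 8 dimensions per chunk
K = 256                   # centroids per chunk -> each chunk stored as 1 byte

# Toy codebooks; in practice these come from k-means over training vectors.
codebooks = rng.normal(size=(M, K, SUB)).astype(np.float32)

def pq_encode(v):
    """Map each 8-dim chunk to the ID of its nearest centroid."""
    chunks = v.reshape(M, SUB)
    # squared distance of each chunk to the 256 centroids of its codebook
    d = ((codebooks - chunks[:, None, :]) ** 2).sum(axis=2)   # (M, K)
    return d.argmin(axis=1).astype(np.uint8)   # 96 bytes instead of 3072

def pq_distance(q, code):
    """Approximate squared distance via per-chunk lookup tables."""
    chunks = q.reshape(M, SUB)
    table = ((codebooks - chunks[:, None, :]) ** 2).sum(axis=2)  # (M, K)
    return table[np.arange(M), code].sum()

v = rng.normal(size=DIM).astype(np.float32)
code = pq_encode(v)
print(code.nbytes, "bytes per vector, vs", v.nbytes, "raw")
```

At query time only the small `table` is built once per query; every stored vector then costs 96 table lookups instead of a 768-dimensional distance computation.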

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh0tf0n5lzh33kx0ti9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh0tf0n5lzh33kx0ti9s.png" alt=" " width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Selection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FAISS&lt;/strong&gt; — originally developed by Facebook AI Research — is the go-to library for efficient similarity search and clustering of dense vectors. It’s widely adopted for high-performance vector indexing at scale. But I recommend &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;JECQ&lt;/a&gt;, a drop-in replacement with 6× lower memory usage. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: I created JECQ. That said, it works well. But use FAISS if you prefer.&lt;/em&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  In-Memory Cache
&lt;/h2&gt;

&lt;p&gt;You can cache a subset of raw vectors in RAM using tools like Redis or Valkey, depending on your licensing needs. &lt;/p&gt;

&lt;p&gt;For 10 billion vectors (~30TB in S3), storing just 1% in RAM (about 300GB) can make a big difference. Pricey, but manageable. &lt;/p&gt;
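&lt;p&gt;The back-of-envelope math behind those figures, assuming 768 dimensions and 32-bit floats as in the pricing table above:&lt;/p&gt;

```python
vectors = 10_000_000_000
bytes_per_vector = 768 * 4                     # 768 dims, 32-bit floats
total_tb = vectors * bytes_per_vector / 1e12   # raw footprint in S3
cache_gb = vectors * 0.01 * bytes_per_vector / 1e9  # 1% of it in RAM
print(f"raw data: ~{total_tb:.0f} TB in S3, 1% cache: ~{cache_gb:.0f} GB of RAM")
```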

&lt;h1&gt;
  
  
  Level 1 Architecture
&lt;/h1&gt;

&lt;p&gt;Let’s walk through the architecture: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jsotg9tntv2grw9neg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jsotg9tntv2grw9neg3.png" alt="Image: AI-generated" width="400" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A router service handles incoming similarity search requests. It’s stateless and can scale horizontally. &lt;/li&gt;
&lt;li&gt;Each node loads a copy of the JECQ (or FAISS) index in memory. &lt;/li&gt;
&lt;li&gt;The router uses JECQ to find candidate vector IDs. &lt;/li&gt;
&lt;li&gt;It then checks Redis for raw vectors:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache hit:&lt;/strong&gt; Redis returns vectors. Router re-ranks and returns results. &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache miss:&lt;/strong&gt; Router pulls vectors from S3, re-ranks, and returns results. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
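&lt;p&gt;The flow above fits in a few lines of Python. The &lt;code&gt;index&lt;/code&gt;, &lt;code&gt;cache&lt;/code&gt;, and &lt;code&gt;s3&lt;/code&gt; objects here are illustrative stand-ins, not the real JECQ, Redis, or S3 client APIs:&lt;/p&gt;

```python
import numpy as np

def search(query, index, cache, s3, top_k=10, n_candidates=100):
    """Stateless router logic: ANN candidates, cache-first fetch, exact re-rank."""
    candidate_ids = index.search(query, n_candidates)  # JECQ/FAISS candidates

    raw, misses = {}, []
    for vid in candidate_ids:
        vec = cache.get(vid)              # Redis/Valkey lookup
        if vec is not None:
            raw[vid] = vec                # cache hit
        else:
            misses.append(vid)            # cache miss

    for vid, vec in s3.fetch(misses).items():  # bulk-read misses from S3
        raw[vid] = vec
        cache.set(vid, vec)               # warm the cache for next time

    # Exact re-rank of the small candidate set with full-precision vectors.
    scored = sorted(raw.items(), key=lambda kv: np.linalg.norm(kv[1] - query))
    return [vid for vid, _ in scored[:top_k]]
```

Because the router holds no state of its own, any number of replicas can run behind a load balancer, which is what makes the horizontal scaling in the diagram cheap.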

&lt;h1&gt;
  
  
  Level 2: Teaser
&lt;/h1&gt;

&lt;p&gt;Level 1 works fine for datasets up to ~1 billion vectors or demo workloads. But if you want 10–100 ms P95 latency at multi-billion scale, you’ll need more: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local raw vectors on NVMe:&lt;/strong&gt; A middle layer (5–10% of raw size, ~1.5TB) between RAM and S3 to avoid frequent S3 fetches. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg0sihlvptis6zovtcce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg0sihlvptis6zovtcce.png" alt="Data layers hierarchy" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hierarchical data layers:&lt;/strong&gt; JECQ + Redis/NVMe integration enables local posting list retrieval, turning 100 ms S3 reads into 2–5 ms NVMe reads. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index sharding:&lt;/strong&gt; Splits PQ clusters across nodes and avoids duplicating 100GB+ compressed data per node. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced cache management:&lt;/strong&gt; Store frequent queries, support MFU/LFU/LRU caching strategies, and pre-load data based on user behavior. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggressive S3 Vectors indexing:&lt;/strong&gt; Each query hits just one index. A single S3 bucket can hold 10K indexes, each with ≤50M vectors. Smart indexing helps reduce latency significantly. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this requires solid engineering chops — but it's necessary if you want to build a cost-effective vector database with 10–100 ms latency on top of S3 Vectors. &lt;/p&gt;

&lt;p&gt;Stay tuned for Level 2. &lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>S3 Vectors: Changing How We Think About Vector Embeddings</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 16 Jul 2025 16:11:26 +0000</pubDate>
      <link>https://dev.to/janeasystems/s3-vectors-changing-how-we-think-about-vector-embeddings-55nm</link>
      <guid>https://dev.to/janeasystems/s3-vectors-changing-how-we-think-about-vector-embeddings-55nm</guid>
      <description>&lt;p&gt;Inserting and maintaining data in a relational database is expensive. Every write must update one or more indexes (data structures such as B-trees) that accelerate reads at the cost of extra CPU, memory, and I/O. On a single node, tables start to struggle once they pass a few terabytes. Distributed SQL and NoSQL systems push that limit, but the fundamental write amplification costs remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Object Storage
&lt;/h2&gt;

&lt;p&gt;To escape those costs, teams began landing raw data in cloud object stores like Amazon S3. Instead of hot indexes, query engines (Spark, Athena, Trino) rely on partition pruning and lightweight statistics. This led to dramatically lower storage bills and petabyte-scale datasets on commodity hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Embeddings
&lt;/h2&gt;

&lt;p&gt;AI and LLM workloads now emit vector embeddings – hundreds or thousands of dimensions per record. Answering “Which vectors are nearest to this one?” in real time is tricky:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-dimensional data breaks classic data structures.&lt;/li&gt;
&lt;li&gt;We lean on approximate nearest neighbor (ANN) algorithms such as HNSW or IVFPQ.&lt;/li&gt;
&lt;li&gt;Queries often combine a distance threshold with metadata filters.&lt;/li&gt;
&lt;li&gt;Recall, precision, and latency form a three-way tradeoff.&lt;/li&gt;
&lt;/ul&gt;
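&lt;p&gt;A tiny numpy illustration of that recall tradeoff: exact brute-force top-k versus a deliberately crude approximation that scans only a random 20% subset. (The subset scan is just a stand-in for a real ANN index like HNSW or IVFPQ, which prunes the search space far more intelligently.)&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
db = rng.normal(size=(10_000, 64)).astype(np.float32)
q = rng.normal(size=64).astype(np.float32)
k = 10

# Exact search: scan everything (prohibitive at billions of vectors).
exact = np.argsort(np.linalg.norm(db - q, axis=1))[:k]

# "Approximate" search: probe only 20% of the data (cheaper, lossy).
subset = rng.choice(len(db), size=2_000, replace=False)
approx = subset[np.argsort(np.linalg.norm(db[subset] - q, axis=1))[:k]]

recall = len(set(exact) & set(approx)) / k
print(f"recall@{k}: {recall:.0%}")   # less compute, imperfect recall
```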

&lt;h2&gt;
  
  
  Amazon S3 Vector: A Game-Changer
&lt;/h2&gt;

&lt;p&gt;Announced yesterday, Amazon S3 Vectors brings vector-aware storage classes to S3. Each vector table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores vectors of fixed dimensionality, compressed on write. Not possible with traditional S3.&lt;/li&gt;
&lt;li&gt;Supports ANN search with simultaneous filters on metadata. Immensely faster than S3.&lt;/li&gt;
&lt;li&gt;Delivers sub-second latency: great for batch, a bit slow for interactive UX.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing the Latency Gap with In-Memory Caching
&lt;/h2&gt;

&lt;p&gt;Janea Systems’ background is deeply rooted in working with in-memory, low-latency caches. Our track record includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are the creators of Memurai, the official Redis for Windows, trusted by developers for its performance and reliability.&lt;/li&gt;
&lt;li&gt;We are active contributors to Valkey, a rapidly evolving open-source fork of Redis, pushing the boundaries of in-memory data stores.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 Vectors offers powerful storage and batch processing, but leaves room for improvement in interactive scenarios. Given those characteristics, the next logical step is to strategically layer a high-performance cache on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future
&lt;/h2&gt;

&lt;p&gt;We are excited about the possibilities Amazon S3 Vectors unlocks. The upcoming articles will cover how to effectively integrate Redis, Valkey, or Memurai with the S3 Vector service to achieve optimal performance for your AI/LLM workloads. Also, we will explore the new AWS service and its implications for modern data architectures in detail. Stay tuned!&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>awsbigdata</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>JECQ: Smart, Open-Source Compression for FAISS Users—6x Compression Ratio, 85% Accuracy</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Wed, 09 Jul 2025 12:19:58 +0000</pubDate>
      <link>https://dev.to/janeasystems/jecq-smart-open-source-compression-for-faiss-users-6x-compression-ratio-85-accuracy-5a5n</link>
      <guid>https://dev.to/janeasystems/jecq-smart-open-source-compression-for-faiss-users-6x-compression-ratio-85-accuracy-5a5n</guid>
      <description>&lt;p&gt;Hi everyone — I'm Benedetto Proietti, Head of Architecture at Janea Systems. I spend most of my time working on performance-critical systems and solving interesting problems at the intersection of machine learning, distributed systems, and open-source technologies. This post is about a recent project I ideated and directed: &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;JECQ&lt;/a&gt;, an innovative, open-source, compression solution built specifically for FAISS users. I’ll walk you through the thinking behind it, how it works, and how it can help reduce memory usage without compromising too much on accuracy.&lt;/p&gt;




&lt;p&gt;Ever wonder how it takes just milliseconds to search something on Google, despite hundreds of billions of webpages in existence? The answer is Google’s index. By the company’s own admission, that index weighs in at over 100,000,000 gigabytes. That’s roughly 95 petabytes.&lt;/p&gt;

&lt;p&gt;Now, imagine if you could shrink that index by a factor of six.&lt;/p&gt;

&lt;p&gt;That’s exactly what Janea Systems did for vector embeddings—the index of artificial intelligence.&lt;/p&gt;

&lt;p&gt;Read on to learn what vector embeddings are, why compressing them matters, how it’s been done until now, and how Janea Systems’ solution pushes it to a whole new level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Explosion: From Social Media to Large Language Models
&lt;/h2&gt;

&lt;p&gt;The arrival of Facebook in 2004 marked the beginning of the social media era. Today, Facebook has over 3 billion monthly active users worldwide. In 2016, TikTok introduced the world to short-form video, and now has more than a billion monthly users.&lt;/p&gt;

&lt;p&gt;And in late 2022, ChatGPT came along.&lt;/p&gt;

&lt;p&gt;Every one of these inventions led to an explosion of data being generated and processed online. With Facebook, it was posts and photos. TikTok flooded the web with billions of 30-second dance videos.&lt;/p&gt;

&lt;p&gt;When data starts flowing by the millions, companies look for ways to cut storage costs with compression. Facebook compresses the photos we upload to it. TikTok does the same with videos.&lt;/p&gt;

&lt;p&gt;What about large language models? Is there anything to compress there?&lt;/p&gt;

&lt;p&gt;The answer is yes: vector embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Embeddings: The Language of Modern AI
&lt;/h2&gt;

&lt;p&gt;Think of vector embeddings as the DNA of meaning inside a language model. When you type something like “Hi, how are you?”, the model converts that phrase into embeddings—a set of vectors that capture how it relates to other phrases. These embeddings help the model process the input and figure out how likely different words are to come next. This allows the model to know the right response to “Hi, how are you?” is “I’m good, and you?” instead of “That’s not something you’d ask a cucumber.”&lt;/p&gt;

&lt;p&gt;The principle behind vector embeddings also underpins a process called “similarity search.” Here, embeddings represent larger units of meaning—like entire documents—powering use cases like retrieval-augmented generation (RAG), recommendation engines, and more.&lt;/p&gt;

&lt;p&gt;It should be pretty clear by now that vector embeddings are central not just to how generative AI works, but to a wide range of AI applications across industries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs of High-Dimensional Data: Why Vector Compression is Crucial
&lt;/h2&gt;

&lt;p&gt;The problem is that vector embeddings take up space. And the faster and more accurate we want an AI system to be, the more vector embeddings it needs, and the more space is required to store them. But this isn't just a storage-cost problem: the bigger the embeddings, the more bandwidth they consume on the PCIe bus and the memory bus. It's also an issue for edge AI devices, which lack constant internet access and therefore need their models to run efficiently within the limited space they have onboard.&lt;/p&gt;

&lt;p&gt;That's why it makes sense to look for ways to push compression even further - despite the fact that embeddings are already being compressed today. Squeezing even another 10% out of the footprint can mean real savings, and a much better user experience for IoT devices running generative AI.&lt;/p&gt;

&lt;p&gt;At Janea Systems, we saw this opportunity and built an advanced C++ library based on FAISS.&lt;/p&gt;

&lt;p&gt;FAISS—short for Facebook AI Similarity Search—is Meta’s open-source library for fast vector similarity search, offering an 8.5x speedup over earlier solutions. Our library takes it further by optimizing the storage and retrieval of large-scale vector embeddings in FAISS—cutting storage costs and boosting AI performance on IoT and edge devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Industry Standard: A Look at Product Quantization (PQ)
&lt;/h2&gt;

&lt;p&gt;Vector embeddings are stored in a specialized data structure called a vector index. The index lets AI systems quickly find and retrieve the vectors closest to any input (e.g., a user question) and match it with an accurate output.&lt;/p&gt;

&lt;p&gt;A major constraint for vector indexes is space. The more vectors you store—and the higher their dimensionality—the more memory or disk you need. This isn’t just a storage problem; it affects whether the index fits in RAM, whether queries run fast, and whether the system can operate on edge devices.&lt;/p&gt;

&lt;p&gt;Then there’s the question of accuracy. If you store vectors without compression, you get the most accurate results possible. But the process is slow, resource-intensive, and often impractical at scale. The alternative is to apply compression, which saves space and speeds things up, but sacrifices accuracy.&lt;/p&gt;

&lt;p&gt;The most common way to manage this trade-off is a method called Product Quantization (PQ) (Fig. 1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnx6y49adn6d8sm57v1q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnx6y49adn6d8sm57v1q.webp" alt="Diagram showing how an embedding is split into subspaces, each mapped to codebooks, and converted into a quantized vector using centroid indices." width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Fig. 1: PQ’s uniform compression across subspaces&lt;/p&gt;

&lt;p&gt;PQ works by splitting each vector into equal-sized subspaces. It’s efficient, hardware-friendly, and the standard in vector search systems like FAISS.&lt;/p&gt;

&lt;p&gt;But because each subspace in PQ is equal, it’s like compressing every video frame in the same way and to the same size—whether it’s entirely black or full of detail. This approach keeps things simple and efficient but misses the opportunity to increase compression on a case-by-case basis.&lt;/p&gt;

&lt;p&gt;At Janea, we realized that vector dimensions vary in value—much like video frames vary in resolution and detail. This means we can adjust the aggressiveness of compression (or, more precisely, quantization) based on how relevant each dimension is, without affecting overall accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: JECQ - Intelligent, Dimension-Aware Compression for FAISS
&lt;/h2&gt;

&lt;p&gt;To strike the right balance between memory efficiency and accuracy, engineers at Janea Systems have developed JECQ, a novel, open-source compression algorithm available on GitHub that varies compression by the statistical relevance of each dimension.&lt;/p&gt;

&lt;p&gt;In this approach, the distances between quantized values become irregular, reflecting each dimension's complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does JECQ work?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The algorithm starts by determining the isotropy of each dimension based on the eigenvalues of the covariance matrix. (In the future, the analysis will also cover sparsity and information density.)&lt;/li&gt;
&lt;li&gt;It then classifies each dimension into one of three categories: low relevance, medium relevance, and high relevance.&lt;/li&gt;
&lt;li&gt;Dimensions with low relevance are discarded, with very little loss in accuracy.&lt;/li&gt;
&lt;li&gt;Medium-relevance dimensions are quantized using just one bit, again with minimal impact on accuracy.&lt;/li&gt;
&lt;li&gt;High-relevance dimensions undergo the standard product quantization.&lt;/li&gt;
&lt;li&gt;Compressed vectors are stored in a custom, compact format accessible via a lightweight API.&lt;/li&gt;
&lt;/ul&gt;
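&lt;p&gt;A simplified sketch of that per-dimension triage. Per-dimension variance stands in for the eigenvalue analysis, and the thresholds are made up for illustration; this is not the actual JECQ implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 5_000, 32
# Synthetic training vectors whose dimensions carry very different variance.
scales = np.concatenate([np.full(8, 0.01), np.full(8, 0.5), np.full(16, 2.0)])
X = rng.normal(size=(n, d)) * scales

variance = X.var(axis=0)       # relevance proxy (illustrative threshold values)
low = variance < 0.05          # discard: near-constant dimensions
high = variance > 1.0          # keep: full product quantization
medium = ~low & ~high          # quantize to a single sign bit

def compress(v):
    """Per-dimension treatment: drop, 1-bit sign, or pass through to PQ."""
    return {
        "one_bit": (v[medium] > 0).astype(np.uint8),  # 1 bit per dimension
        "pq_input": v[high],                          # handed to standard PQ
    }

parts = compress(X[0])
print(low.sum(), "dropped,", medium.sum(), "1-bit,", high.sum(), "PQ dims")
```

The payoff is that storage is spent where the signal is: near-constant dimensions cost nothing, mid-relevance ones cost one bit, and only the genuinely informative dimensions pay full PQ cost.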

&lt;p&gt;The solution is compatible with existing vector databases and ANN frameworks, including FAISS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the benefits and best use cases for JECQ?
&lt;/h2&gt;

&lt;p&gt;Early tests show the memory footprint reduced by 6x while retaining 84.6% of the accuracy of uncompressed vectors. Figure 2 compares the memory footprint of an index before quantization, with product quantization (PQ), and with JECQ. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1hkdu76inhu2t14q61.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t1hkdu76inhu2t14q61.webp" alt="Bar chart comparing memory usage: No Quantization (highest), PQ (lower), and JECQ (lowest), with the title " width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
Fig. 2: Memory footprint before quantization, with PQ, and with JECQ&lt;/p&gt;

&lt;p&gt;We expect this will lower cloud and on-prem storage costs for enterprise AI search, enhance Edge AI performance by fitting more embeddings per device for RAG or semantic search, and reduce the storage footprint of historical embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are JECQ’s License and Features?
&lt;/h2&gt;

&lt;p&gt;JECQ is out on &lt;a href="https://github.com/JaneaSystems/jecq" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, available under the MIT license. It ships with an optimizer that takes a representative data sample or user-provided data and generates an optimized parameter set. Users can then fine-tune this by adjusting the objective function to balance their preferred accuracy–performance trade-off.&lt;/p&gt;




&lt;p&gt;We're planning to share more tools, experiments, and lessons learned from our work in open-source, AI infrastructure, and performance engineering. If this kind of stuff interests you, stay tuned — more to come soon.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>(Beyond) The Art of Database Indexing</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Thu, 03 Apr 2025 16:10:36 +0000</pubDate>
      <link>https://dev.to/benedetto_proietti/beyond-the-art-of-database-indexing-5fd7</link>
      <guid>https://dev.to/benedetto_proietti/beyond-the-art-of-database-indexing-5fd7</guid>
      <description>&lt;p&gt;&lt;em&gt;and Why Indexes Alone Won’t Save You&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Executive Summary
&lt;/h1&gt;

&lt;p&gt;Modern databases handle vast amounts of data — often spanning tens of terabytes or even petabytes — and relying solely on indexing is insufficient for optimal performance. Misunderstanding this can lead to degraded performance, increased operational costs, and poor user experience. This article highlights why indexing alone isn’t the solution and provides practical alternatives and strategies for scalable, efficient data management.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Indexes Alone Are Insufficient
&lt;/h1&gt;

&lt;p&gt;Indexes improve read performance but introduce overhead during writes, slowing transactional operations. Businesses frequently face performance degradation due to excessive indexing, resulting in increased operational costs and diminished customer satisfaction.&lt;/p&gt;

&lt;h1&gt;
  
  
  Technical Basics
&lt;/h1&gt;

&lt;p&gt;Data resides in three main storage systems, each with distinct performance characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RAM (volatile memory)&lt;/strong&gt;: Extremely fast, ideal for real-time data processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVMe drives/SSDs&lt;/strong&gt;: Fast, persistent storage at moderate cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rotating hard drives&lt;/strong&gt;: Cost-effective but significantly slower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8kjjhwmn634cifeqb2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8kjjhwmn634cifeqb2z.png" alt=" " width="640" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The efficiency of your software solution (SQL, NoSQL, or caching systems) significantly depends on your choice of storage and how effectively the software manages data contention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  When Indexing Hurts More Than It Helps
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High-write Workloads&lt;/strong&gt;: Frequent data updates make indexing costly and inefficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytics Queries&lt;/strong&gt;: Full-table scans on specialized storage can be more efficient than indexed lookups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex Joins and Aggregations&lt;/strong&gt;: Excessive indexing increases complexity and resource use, potentially harming performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Strategic Alternatives to Optimize Performance
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1️⃣ Separate Analytics from OLTP Systems
&lt;/h2&gt;

&lt;p&gt;Operational (OLTP) and analytical workloads should never compete for resources within the same database. Operational queries demand instant responses, while analytics queries, analyzing large historical data sets, consume extensive resources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Actionable Step&lt;/strong&gt;: Evaluate your database architecture to segregate operational databases from analytical workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadwfhb2w2ofkr5wbo1ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadwfhb2w2ofkr5wbo1ff.png" alt=" " width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2️⃣ Separate Read and Write Operations
&lt;/h2&gt;

&lt;p&gt;Using database read replicas improves scalability and resource efficiency by minimizing bottlenecks. However, it can introduce slight latency or data staleness.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Actionable Step&lt;/strong&gt;: Consider implementing read replicas to reduce load on your primary database, but assess potential data staleness impacts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Possible Challenge!&lt;/strong&gt; Replicas introduce latency while writes propagate, or some staleness in the data. That can be acceptable in most (not all) scenarios. Choose wisely!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnpybp8jsh77xfj69fo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnpybp8jsh77xfj69fo0.png" alt=" " width="640" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3️⃣ Implement Caching with Redis, Valkey, or Memurai
&lt;/h2&gt;

&lt;p&gt;Caching frequently accessed data significantly enhances application responsiveness by reducing database queries. This results in lower latency, reduced database load, and improved scalability. The first cache to find tremendous popularity was Redis (for Linux). Then its unwise licensing choices led to the recent growth of Valkey, a fully open-source alternative to Redis. For Windows, there’s Memurai: a fully supported and enterprise-ready partner of Redis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F361uqjwpezb51axf8wf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F361uqjwpezb51axf8wf6.png" alt="Popular caches summary" width="640" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Actionable Step&lt;/strong&gt;: Consider adding read-path code that checks the cache first, and queries the DB only on a cache miss.&lt;/p&gt;
&lt;/blockquote&gt;
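&lt;p&gt;The cache-aside read path above can be sketched as follows. A plain dict stands in for a Redis/Valkey/Memurai client and sqlite3 for the database; a real cache client would also set a TTL on each key:&lt;/p&gt;

```python
import sqlite3

cache = {}  # stand-in for a Redis/Valkey/Memurai client

def get_user(conn, user_id):
    """Cache-aside read: check the cache first, hit the DB only on a miss."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]          # cache hit: no DB round-trip
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is not None:
        cache[key] = row[0]        # a real cache would also set an expiry (TTL)
        return row[0]
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
print(get_user(conn, 1))  # miss: reads the DB, fills the cache
print(get_user(conn, 1))  # hit: served from the cache
```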

&lt;h2&gt;
  
  
  4️⃣ Shard Data to Distribute Load
&lt;/h2&gt;

&lt;p&gt;Sharding partitions data across multiple databases, distributing load and preventing bottlenecks. While traditional databases (RDBMS) require manual sharding efforts, solutions like Valkey offer built-in sharding capabilities, simplifying scaling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Actionable Step&lt;/strong&gt;: Consider sharding if your current database struggles with heavy loads or data volumes exceeding manageable thresholds.&lt;/p&gt;
&lt;/blockquote&gt;
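&lt;p&gt;A minimal sketch of hash-based shard routing, the core of manual sharding. Dicts stand in for four separate database instances; the key scheme is made up for illustration:&lt;/p&gt;

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for 4 separate databases

def shard_for(key):
    # A stable hash, so every application node agrees on where a key lives.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("customer:42", {"last_name": "Smith"})
print(get("customer:42"))  # routed to the same shard it was written to
```

&lt;p&gt;Note the catch: changing NUM_SHARDS remaps most keys, which is why production systems use consistent hashing or a lookup table instead of a bare modulo.&lt;/p&gt;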

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s2z5pvk4fuv7dtr6jrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s2z5pvk4fuv7dtr6jrt.png" alt=" " width="640" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5️⃣ Leverage Data Lakes for Historical Data
&lt;/h2&gt;

&lt;p&gt;Storing historical data in data lakes optimizes analytics, lowers storage costs, and maintains regulatory compliance without taxing operational databases.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Actionable Step&lt;/strong&gt;: Implement clear data-retention policies and regularly move older, less frequently accessed data to data lakes to enhance database performance.&lt;/p&gt;
&lt;/blockquote&gt;
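&lt;p&gt;A toy sketch of such a retention job, assuming a fictional &lt;code&gt;events&lt;/code&gt; table. A local CSV file stands in for Parquet files in an S3 data lake, and the cutoff is passed in explicitly:&lt;/p&gt;

```python
import csv
import sqlite3

def archive_old_rows(conn, csv_path, cutoff):
    """Move rows at or before `cutoff` out of the hot DB into cold storage."""
    old_rows = conn.execute(
        "SELECT id, created, payload FROM events WHERE ? >= created", (cutoff,)
    ).fetchall()
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerows(old_rows)          # "data lake" stand-in
    conn.execute("DELETE FROM events WHERE ? >= created", (cutoff,))
    conn.commit()
    return len(old_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created TEXT, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "2022-01-05", "old"), (2, "2025-06-01", "recent")])
moved = archive_old_rows(conn, "cold_events.csv", cutoff="2024-12-31")
print(moved)  # only recent data stays hot
```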

&lt;h2&gt;
  
  
  6️⃣ Summary
&lt;/h2&gt;

&lt;p&gt;Providing generic advice is inherently challenging. Please consider the recommendations in this table as guidelines rather than absolute rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9xp6vm7x9ra0kfip99t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9xp6vm7x9ra0kfip99t.png" alt=" " width="640" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Indexes alone cannot support modern database demands at large scale. Strategic decisions regarding database infrastructure directly influence your organization’s agility, costs, customer satisfaction, and competitive positioning. Making informed choices now prevents costly performance degradation later.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you dealing with SQL scalability headaches?&lt;br&gt;
If your data is growing and you’re unsure how to scale without breaking the bank, let’s talk. I help teams with architecture and modernization strategies — reach out if you need a second opinion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PS: Check out my informal podcast “&lt;a href="https://www.youtube.com/@proiex" rel="noopener noreferrer"&gt;PROIEX — Tech Experiences&lt;/a&gt;”!&lt;/p&gt;

</description>
      <category>schema</category>
      <category>nosql</category>
      <category>sql</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Desire for Structure (read “SQL”)</title>
      <dc:creator>Benedetto Proietti</dc:creator>
      <pubDate>Thu, 03 Apr 2025 14:58:58 +0000</pubDate>
      <link>https://dev.to/benedetto_proietti/desire-for-structure-read-sql-ofh</link>
      <guid>https://dev.to/benedetto_proietti/desire-for-structure-read-sql-ofh</guid>
      <description>&lt;p&gt;&lt;em&gt;Or obsession for control?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s admit it: we love the idea of running SQL queries on ALL our data! Not just a preference — it feels like an obsession. No matter how old or how rarely accessed, we cling to the idea that everything must remain instantly queryable.&lt;/p&gt;

&lt;p&gt;It feels so simple, until it’s not. And then bad things happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The database grows too large, and queries slow to a crawl.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index sizes explode, eating up CPU and memory just to keep up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schema changes become a nightmare, locking tables and causing downtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The cost of scaling up SQL infrastructure skyrockets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suddenly, our beautiful, structured world starts crumbling under its own weight. What seemed like an easy decision — “let’s just store everything in SQL” — becomes a scaling bottleneck that forces us to rethink our approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  But why does this happen?
&lt;/h3&gt;

&lt;p&gt;Relational Databases are powerful because they provide structure, indexing, and queryability — elements that make it easy for users to analyze and manipulate data.&lt;/p&gt;

&lt;p&gt;For engineers, analysts, and business users, SQL offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A universal way to query data — the language is standardized and widely understood.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ad-hoc queryability — you can ask complex questions without predefining reports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data consistency — enforced schemas and constraints prevent data corruption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Indexing for performance — the ability to speed up searches through optimized indexes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these benefits, many organizations enforce SQL as the default, even when other approaches may be more suitable.&lt;/p&gt;

&lt;p&gt;But what happens when this need for structure becomes a liability rather than an asset?&lt;/p&gt;

&lt;p&gt;We will look a bit into the problem space, and then we will explore (not exhaustively!) the solution space.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. The Problem space: When Structure Becomes a Limitation
&lt;/h1&gt;

&lt;h2&gt;
  
  
  2.1. The cost of our need for Structure
&lt;/h2&gt;

&lt;p&gt;While SQL provides clarity, it also introduces constraints. Let’s look at the most important ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema
&lt;/h3&gt;

&lt;p&gt;A schema requires data to be organized into a set of tables. Each table needs a set of columns (names and types) and a primary-key column. You can also introduce relationships between tables (that, incidentally, is why they are called “relational databases”), but we are not going to cover them here.&lt;/p&gt;

&lt;p&gt;Schema changes are delicate operations. They must be deployed before application changes to ensure compatibility. But even when done correctly, they introduce risks — what if your production data has unexpected edge cases your test environment didn’t? What if rollback becomes a nightmare?&lt;/p&gt;

&lt;p&gt;Oh wait! Do we push the DB schema changes first, or the application changes first? Schema changes first, of course! Then just pray that the application you are pushing expects exactly that schema. Yes, you tested in the test environment… which is always perfectly identical to the production environment… right?&lt;/p&gt;

&lt;p&gt;Well… not always.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indices
&lt;/h3&gt;

&lt;p&gt;A dear friend of mine and accomplished engineer used to say that “writing Databases is the art of writing indices”.&lt;/p&gt;

&lt;p&gt;What is an index? An index is a data structure that helps find rows quickly when you run a search.&lt;/p&gt;

&lt;p&gt;You put the data into a table as it comes. Now you want to search for all the orders for Mr. Smith. If you indexed the LastName column, the index keeps an acceleration data structure (typically some kind of tree) that can quickly find all the rows where the last name equals “Smith”.&lt;/p&gt;

&lt;p&gt;Nice right? Yah.&lt;/p&gt;
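&lt;p&gt;The idea fits in a few lines. A dict of row positions stands in for the tree a real engine would use, and the toy table is made up:&lt;/p&gt;

```python
from collections import defaultdict

# Toy "table": rows stored in arrival order.
orders = [
    {"order_id": 1, "last_name": "Smith", "total": 40},
    {"order_id": 2, "last_name": "Jones", "total": 25},
    {"order_id": 3, "last_name": "Smith", "total": 60},
]

# The "index": maps an indexed value to the positions of rows holding it.
# A real engine uses a B-tree, and must update it on every single write.
last_name_index = defaultdict(list)
for pos, row in enumerate(orders):
    last_name_index[row["last_name"]].append(pos)

# A "query" now touches only the matching rows instead of scanning them all.
smith_orders = [orders[pos] for pos in last_name_index["Smith"]]
print([o["order_id"] for o in smith_orders])
```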

&lt;p&gt;You know what it takes? Let’s list what it takes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fast memory (RAM) to hold the acceleration data structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time: the index must be locked while it is being searched, so that concurrent mutations do not corrupt it. Yes, these locks can be avoided or reduced most of the time, but… you get the gist. Let’s stop here for now.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now what happens if you need more data?&lt;/p&gt;

&lt;p&gt;You need more memory? Sure, but there’s a limit to the physical memory on a single machine.&lt;/p&gt;

&lt;p&gt;Let’s go distributed, let’s use a cluster. Now do the mental exercise of imagining how to lock a distributed index data structure across a cluster of 4 nodes. Difficult, but possible.&lt;/p&gt;

&lt;p&gt;But what if your data is really, REALLY huge? You might need 400 nodes. Or 4,000. Or 40,000. It’s impossible. Why?&lt;/p&gt;

&lt;p&gt;Because indexes improve query speed — but they also introduce contention. Updating an index requires locking portions of it, slowing down concurrent writes. In a single-node setup, this is manageable. But in a distributed database spanning hundreds of nodes, keeping indexes in sync becomes a nightmare. This is why most distributed databases avoid global indexes or use eventually consistent indexing approaches. Without careful design, a single overloaded index update can throttle an entire system, creating cascading slowdowns.&lt;/p&gt;

&lt;p&gt;And probably spending so much time and engineering effort in calibrating indexes does not align with your core business.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions so far
&lt;/h3&gt;

&lt;p&gt;That’s why, at some scale, you need to give up and think about other ways of storing your data.&lt;/p&gt;

&lt;p&gt;In general, the first place you store data should not be a relational SQL database (an RDBMS, to use the formal term). There are exceptions for specific cases, for example where credit cards or money are involved.&lt;/p&gt;

&lt;p&gt;If you are convinced about that, please continue reading. Otherwise please stop reading. There is no value for you to continue.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Exploring the solution space
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3.1. Hot/Cold
&lt;/h2&gt;

&lt;p&gt;First of all, not all data needs to be hot. Or indexed.&lt;/p&gt;

&lt;p&gt;In most use cases, queries need last week’s data, or last month’s data. Why, then, are you keeping 3 years’ worth of data in an expensive, indexed RDBMS?&lt;/p&gt;

&lt;p&gt;1 week out of 3 years is less than one percent of the entire data (0.6%, to be precise).&lt;br&gt;
1 month out of 3 years is about 2.8% of the entire data.&lt;/p&gt;

&lt;p&gt;Maybe it makes sense to place “hot” data in a fast (and costly) database (or even a cache), and the cold data in cheaper (and slower) storage. It will cost a bit of latency to access the cold data, but… you will save a ton of money.&lt;/p&gt;

&lt;p&gt;So why do companies still keep years of data in expensive RDBMS instances? Often, it’s habit — ‘we might need it someday’ — or a lack of proper data lifecycle planning. But in reality, only a fraction of the data needs to remain hot.&lt;/p&gt;
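&lt;p&gt;The hot/cold split can be hidden behind a tiny routing function. Dicts stand in for the two stores, the 30-day threshold is made up, and “today” is pinned for reproducibility:&lt;/p&gt;

```python
import datetime

HOT_DAYS = 30  # hypothetical threshold: roughly the last month stays hot

hot_store = {"2025-07-20": ["ride-1"]}     # stand-in for an indexed RDBMS
cold_store = {"2023-02-11": ["ride-999"]}  # stand-in for cheap object storage

def rides_on(day, today=datetime.date(2025, 7, 30)):
    """Route a lookup: recent dates hit the fast store, old dates the cheap one."""
    cutoff = (today - datetime.timedelta(days=HOT_DAYS)).isoformat()
    if day >= cutoff:
        return hot_store.get(day, [])   # fast path, low latency
    return cold_store.get(day, [])      # slower, but far cheaper per GB

print(rides_on("2025-07-20"))  # served from the hot store
print(rides_on("2023-02-11"))  # served from cold storage
```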

&lt;h3&gt;
  
  
  A real-world Example
&lt;/h3&gt;

&lt;p&gt;A real-world example of this concept is NYC Taxi ride data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Total dataset: ~1.1 billion rides per year (~550GB of storage).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Last 3 years of data: ~3.3 billion rides (~1.6 TB of storage).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Last month’s data: ~70 million rides (~35GB of storage).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Last week’s data: ~18 million rides (~9GB of storage).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhybj944jllf4djgmrgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhybj944jllf4djgmrgk.png" alt=" " width="640" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2. Columns
&lt;/h2&gt;

&lt;p&gt;I have often seen relational databases with hundreds — sometimes even thousands — of columns, storing an incredible amount of data. Yet, most of these columns are rarely queried. They can’t be discarded because they hold valuable information, but they still come at a cost.&lt;/p&gt;

&lt;p&gt;Every column, whether used frequently or not, adds overhead in disk storage, CPU, and memory — even when it’s not indexed. And that brings us to another issue: we can’t index all hundreds of columns for obvious reasons. As a result, we end up in a paradoxical situation:&lt;/p&gt;

&lt;p&gt;We pay for an expensive and often slow relational database system.&lt;/p&gt;

&lt;p&gt;Yet, hundreds of columns remain unindexed, making queries inefficient.&lt;/p&gt;

&lt;p&gt;This raises an important question: Are we truly benefiting from keeping all columns “hot” and queryable, or are we just paying for an illusion of accessibility?&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Big Data: an introduction
&lt;/h1&gt;

&lt;p&gt;The term Big Data is thrown around a lot, but how big is “big”?&lt;/p&gt;

&lt;p&gt;Here’s my personal definition: Big Data is an amount of data that cannot fit in a small cluster — but it still needs to be queryable.&lt;/p&gt;

&lt;p&gt;A modern approach to Big Data consists of several components, and the key principle is separating compute from storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.1. Storage: The Data Lake
&lt;/h2&gt;

&lt;p&gt;Big Data is typically stored in a distributed object storage system, which I loosely call a Data Lake.&lt;/p&gt;

&lt;p&gt;Here’s my informal definition of a Data Lake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Virtually infinite capacity — There is no practical storage limit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Massively parallel — It can handle many simultaneous operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High throughput — Huge input/output bandwidth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High latency — It is not optimized for low-latency access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple REST API — Accessible through standard cloud APIs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.2. Compute: Processing Data from the Data Lake
&lt;/h2&gt;

&lt;p&gt;From the Data Lake, data branches into separate workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analytics workloads → Data warehouses, batch processing, OLAP queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Online applications → Real-time transactional workloads (OLTP), NoSQL solutions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But before going further, let me clarify something crucial:&lt;/p&gt;

&lt;h2&gt;
  
  
  4.3. Analytics &amp;amp; Online Processing Should Be Completely Separate
&lt;/h2&gt;

&lt;p&gt;They should not share the same datastore, cluster, or even availability zone.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analytics deals with historical data — end-of-day sales reports, customer behavior insights, trend analysis, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Online processing serves live production workloads — API responses, real-time transactions, and user interactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mixing the two is a huge risk. Allowing analytical queries to run on your live production system is an enterprise disaster waiting to happen. You risk:&lt;/p&gt;

&lt;p&gt;❌ Slow application performance — Customers may experience delays.&lt;/p&gt;

&lt;p&gt;❌ Production downtime — A heavy analytical query could lock tables or exhaust resources.&lt;/p&gt;

&lt;p&gt;❌ Jeopardizing business operations — A reporting query should never interfere with live transactions.&lt;/p&gt;

&lt;p&gt;🚨 DO NOT DO IT. Keep them separate.&lt;/p&gt;

&lt;p&gt;Besides, each requires different technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Online processing → Needs an RDBMS (SQL or NoSQL). These systems are designed for transactional speed and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics → Requires Big Data tools to efficiently process vast datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.4. Analytics Compute: Processing Big Data
&lt;/h2&gt;

&lt;p&gt;There are many ways to query and process Big Data. A full discussion would take an entire book, so here’s a short list of key technologies:&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark
&lt;/h3&gt;

&lt;p&gt;✅ The de facto distributed compute engine for Big Data.&lt;/p&gt;

&lt;p&gt;✅ Has been around since 2014, gaining widespread adoption.&lt;/p&gt;

&lt;p&gt;✅ Processes petabytes of data with relative ease.&lt;/p&gt;

&lt;p&gt;⚠️ Not without its issues, but still one of the easiest ways to run large-scale distributed queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flink
&lt;/h3&gt;

&lt;p&gt;✅ A newer competitor to Spark, with fresher architecture.&lt;/p&gt;

&lt;p&gt;✅ Focuses on real-time and batch processing.&lt;/p&gt;

&lt;p&gt;⚠️ Similar trade-offs as Spark but gaining traction.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Athena
&lt;/h3&gt;

&lt;p&gt;✅ Serverless querying of S3 data.&lt;/p&gt;

&lt;p&gt;✅ Pay-as-you-go pricing (great for occasional queries).&lt;/p&gt;

&lt;p&gt;⚠️ Can become expensive at scale if queries are frequent.&lt;/p&gt;

&lt;h3&gt;
  
  
  ClickHouse
&lt;/h3&gt;

&lt;p&gt;✅ The cool new kid in town — Incredibly fast analytics.&lt;/p&gt;

&lt;p&gt;✅ Can store data internally or query data directly from S3.&lt;/p&gt;

&lt;p&gt;✅ Supports indexes, making queries much faster than traditional Data Lakes.&lt;/p&gt;

&lt;p&gt;✅ Available as fully managed or self-hosted.&lt;/p&gt;

&lt;p&gt;🔥 An excellent choice for high-performance analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Redshift (or another DataWarehouse)
&lt;/h3&gt;

&lt;p&gt;This is a crossover scenario. Redshift is a semi-managed RDBMS, but with a beefy architecture, a columnar approach (why is columnar better? That’s a question for another article), a lot of cache, and a cluster architecture that makes it super fast.&lt;/p&gt;

&lt;p&gt;Redshift offers fast, scalable columnar storage, making it great for heavy analytical queries. But it comes at a cost — both financial and operational. While Redshift excels for structured reporting, it’s not ideal for ad-hoc exploratory analysis, where a more flexible solution like Athena or ClickHouse might be better.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.5. Final Thoughts on BigData
&lt;/h2&gt;

&lt;p&gt;Big Data requires rethinking storage and compute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Store everything in a Data Lake (cheap, scalable storage).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compute happens separately, using the right tool for the job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Never mix analytics with production systems — that’s asking for trouble.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Big Data isn’t about blindly throwing everything into SQL — it’s about choosing the right technologies for scale, cost, and performance.&lt;/p&gt;

&lt;p&gt;The diagram (source: AWS documentation) depicts a serverless, “data lake centric” analytics architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  5. Small Data: Flexibility &amp;amp; Fewer Constraints
&lt;/h1&gt;

&lt;p&gt;When dealing with small datasets, storage flexibility is far greater than in the Big Data world.&lt;/p&gt;

&lt;p&gt;Unlike Big Data, where storage and compute must be carefully architected, small data is forgiving.&lt;/p&gt;

&lt;p&gt;You don’t have to worry about distributed storage, multi-cluster orchestration, or separating compute from storage.&lt;/p&gt;

&lt;p&gt;Instead, you get to focus on what really matters: choosing the right tool for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1. Where Can You Store Small Data?
&lt;/h2&gt;

&lt;p&gt;With small datasets, you have many options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SQL databases — Classic, structured, reliable. Perfect for transactional workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NoSQL databases — Flexible, schema-free, and great for hierarchical or document-based data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flat files (CSV, JSON, Parquet) — Easy to use, easy to share. Works great for logs, configs, and lightweight processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In-memory databases (Redis, Memcached) — Blazing fast, ideal for caching and ephemeral data storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedded databases (SQLite, DuckDB) — Self-contained, no external dependencies, excellent for local processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these options has trade-offs, but the beauty of small data is that you don’t need a complex architecture.&lt;/p&gt;

&lt;p&gt;You pick what works and move forward — no need for extensive capacity planning or complex scaling strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.2. The Cost vs. Performance Trade-Off
&lt;/h2&gt;

&lt;p&gt;With small data, performance and cost concerns are less critical compared to Big Data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scaling is rarely an issue — One machine, or a few, are usually enough.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disk and memory overhead is minimal — A few gigabytes of data can fit comfortably on SSDs or even in memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cold vs. hot data isn’t a problem — All data is small enough to be ‘hot’ by default.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there’s still an important question to ask:&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3. Structure vs Flexibility
&lt;/h2&gt;

&lt;p&gt;With small datasets, the temptation is to throw everything into SQL, because it just works. But does it always make sense?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If your data is deeply hierarchical (e.g., JSON, XML), would a document store be a better fit?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your access patterns are key-value based, why not use an in-memory store like Redis?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your data is static, would a simple CSV or JSON file suffice instead of a full-blown database?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL forces structure, which is great when you need consistency, transactions, and relational integrity.&lt;/p&gt;

&lt;p&gt;But flexibility is often more important when dealing with small, isolated datasets that don’t require complex relationships.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.4. Over-engineering
&lt;/h2&gt;

&lt;p&gt;I’ve seen teams deploy Kafka clusters to process just a few thousand messages per day — a classic case of overengineering. Sometimes, a simple cron job writing to an SQLite database does the job just fine.&lt;/p&gt;
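&lt;p&gt;For scale, the entire “replacement for Kafka” can look like this: a script invoked by cron that appends a batch of messages to SQLite. The schema and paths are made up for illustration:&lt;/p&gt;

```python
import sqlite3

# Run by cron (e.g. every 5 minutes); at a few thousand messages per day,
# this comfortably replaces a message-queue cluster.
def ingest(db_path, messages):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS messages (body TEXT)")
    conn.executemany("INSERT INTO messages (body) VALUES (?)",
                     [(m,) for m in messages])
    conn.commit()
    total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    conn.close()
    return total

print(ingest(":memory:", ["order placed", "order shipped"]))
```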

&lt;h2&gt;
  
  
  5.5. Final Thoughts
&lt;/h2&gt;

&lt;p&gt;With Big Data, you need careful storage and compute separation just to make the system work.&lt;/p&gt;

&lt;p&gt;With small data, you don’t. You have options, and you should take advantage of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Don’t default to SQL just because it’s familiar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Think about your access patterns first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pick the simplest tool that solves the problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After all, flexibility is the true advantage of small data. 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  5.6. Big or Small Data
&lt;/h2&gt;

&lt;p&gt;Many workloads labeled as ‘Big Data’ are simply poorly optimized small data.&lt;br&gt;
The industry loves the term, but not all “Big Data problems” are truly big. Sometimes, what appears to be a scaling issue is just a lack of proper indexing, partitioning, or query optimization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your system struggles to process a few terabytes of structured data, you might not need Hadoop or Spark — you may just need better schema design and indexing.&lt;/li&gt;
&lt;li&gt;If your pipeline relies on daily batch jobs to process logs, a simpler event-driven system with proper aggregation could be more efficient than a full-blown Big Data stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before reaching for complex distributed computing tools, ask: do I have a Big Data problem, or do I have a poorly optimized Small Data problem?&lt;/p&gt;

&lt;h2&gt;
  
  
  5.7. Conclusion
&lt;/h2&gt;

&lt;p&gt;SQL databases are a great tool — but not the only tool. The human desire for structure often leads us to over-rely on relational databases, even when they introduce inefficiencies.&lt;/p&gt;

&lt;p&gt;The best architecture balances structure with flexibility. Before defaulting to SQL, ask yourself:&lt;/p&gt;

&lt;p&gt;Does all this data need to be hot and indexed?&lt;br&gt;
Is SQL the right tool, or am I forcing structure where it’s not needed?&lt;br&gt;
Can I reduce costs and improve performance with a better data lifecycle strategy?&lt;/p&gt;

&lt;p&gt;Are you dealing with SQL scalability headaches?&lt;br&gt;
If your data is growing and you’re unsure how to scale without breaking the bank, let’s talk. I help teams with architecture and modernization strategies — reach out if you need a second opinion.&lt;/p&gt;

&lt;p&gt;PS: Feel free to also visit my informal podcast “&lt;a href="https://www.youtube.com/@proiex" rel="noopener noreferrer"&gt;PROIEX — Tech Experiences&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>schema</category>
      <category>nosql</category>
      <category>sql</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
