
Dr Hernani Costa

Originally published at insights.firstaimovers.com

Vector Databases: The $10M Architecture Decision for LLM Apps

Your AI chatbot's "memory" isn't magic—it's your database architecture. And choosing wrong costs enterprises millions in latency, scaling failures, and missed retrieval accuracy.

Ever wonder how your AI chatbot seems to "remember" facts or search your documents? It's not magic - it's the database. Today's AI-powered apps (from customer support bots to coding assistants) need to fetch information by meaning, not just exact matches. This has sparked a seismic shift in the database world. Gone are the days of simply choosing SQL vs. NoSQL. Now we're adding vector search, hybrid queries, and LLM integration into the mix. Let's dive into how this new landscape is unfolding - and what it means for your next project.


The AI Shake-Up in Databases

Remember when ChatGPT burst onto the scene in late 2022? That moment changed things not just for AI users but for the back-end tech as well. Suddenly, developers had to handle embeddings (those dense vector representations of text/images) and run queries like "find documents related to X" that traditional databases never optimized for. How would a standard SQL query find the most semantically similar record among millions? It wouldn't - at least not without some serious new tricks.

The result? A boom in new approaches. Vector databases emerged, and traditional databases started evolving fast. Industry analysts took note: Gartner predicts that by 2026, over 30% of enterprises will be using vector databases, up from virtually none in 2023. Even the popular DB-Engines ranking added "Vector DBMS" as a new category in mid-2023. The data stack is changing to keep up with AI's demands.


Traditional Databases Are Learning New Tricks

Before you toss out your trusty PostgreSQL or MongoDB, here's some good news: your familiar databases are adding AI capabilities too. The old guard is learning new tricks, letting you integrate AI features without a complete rip-and-replace:

  • PostgreSQL + pgvector: That rock-solid Postgres you've used for years can now store embeddings and do similarity search using the open-source pgvector extension. It's as if Postgres got a semantic index on the side. All major cloud providers support this: Google Cloud's AlloyDB and Cloud SQL, Amazon Aurora PostgreSQL, and Azure Database for PostgreSQL all offer managed Postgres with pgvector enabled. In fact, Google announced support for pgvector in its databases back in 2023, allowing exact or approximate nearest-neighbor search right inside Postgres.

  • What if you use MySQL? Good news there too. Oracle's MySQL HeatWave service introduced an integrated Vector Store to ingest documents and generate embeddings in SQL. And on Google Cloud, you can now do similarity searches in Cloud SQL for MySQL as well. It uses Google's ScaNN library under the hood for fast kNN and ANN search, so you can store vectors alongside your data and query them with SQL - no extra system needed. (This feature rolled out in early 2024 and is now in public preview.)

  • MongoDB Atlas: MongoDB users aren't left behind. Mongo's managed platform Atlas has a Lucene-powered search index that now supports vector search in addition to good old text search. This means you can store your JSON documents and their embeddings in one place, then do semantic queries across them. No separate vector database required. The capability (called Atlas Vector Search) became generally available in late 2023, letting developers query data by meaning without bolting on new infrastructure.

  • Cassandra / DataStax Astra: Even Apache Cassandra, known for scalable key-value and wide-column storage, joined the party. DataStax (which offers Astra DB, a Cassandra-based cloud service) launched vector search in 2023. They found that many customers wanted to use Cassandra as a vector store for AI apps, so now you can run similarity queries on embeddings in a distributed NoSQL database. In one week of preview, over 1,000 teams tried it out - a testament to the demand.

  • Redis: The in-memory speed demon Redis has transformed from a simple cache to a multi-model database, and AI use cases are front and center. Redis added a vector similarity search module in 2023, letting you find the nearest vectors (using algorithms like HNSW) with lightning speed. And Redis isn't stopping there - in 2025, the original creator of Redis came back to introduce a new native vector data type ("vector sets") for even better performance. Imagine doing real-time recommendations and semantic cache lookups with sub-millisecond latency; that's where Redis is headed.
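To make the "vectors inside your existing database" idea concrete, here's a minimal pure-Python sketch of the exact nearest-neighbor ranking that an unindexed pgvector `ORDER BY embedding <-> query` performs. The table name in the SQL comment and the toy data are illustrative assumptions, not from any vendor's docs:

```python
import math

# Illustrative Postgres equivalent with the pgvector extension (the table
# name "docs" is hypothetical):
#   SELECT id FROM docs ORDER BY embedding <-> '[1.0, 0.0]' LIMIT 2;

def l2_distance(a, b):
    """Euclidean distance - the metric behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(rows, query, k):
    """Exact k-nearest-neighbor scan, like an unindexed pgvector query."""
    ranked = sorted(rows, key=lambda row: l2_distance(row[1], query))
    return [row_id for row_id, _ in ranked[:k]]

rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(nearest(rows, [1.0, 0.0], 2))  # ['a', 'c']
```

With an HNSW or IVF index in place, the database avoids this full scan, which is what makes the same query fast at millions of rows.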

Why are these traditional databases investing so much in AI features? Because it makes sense: if you can handle AI workloads in the same database as your transactional data, you simplify your architecture. You don't have to maintain two different systems and copy data around. Of course, there are limits, which we'll get to, but it's exciting to see relational and NoSQL databases now handle vectors, embeddings, and similarity search natively. They're becoming multi-model: mixing structured, unstructured, and vector data in one engine.


Meet the New Kids on the Block: Vector Databases

Meanwhile, a whole new breed of databases has emerged specifically for AI and LLM applications: vector databases. These systems are purpose-built for one thing: efficiently storing and querying vectors (embeddings) at scale. If traditional databases are adding vector search as a new feature, vector databases start with it as the core design.

So what makes a vector DB special? In a word: speed. They're optimized to search through millions or billions of high-dimensional vectors in milliseconds using clever algorithms and indexes. Instead of B-trees and hash indexes, you'll hear about ANN (Approximate Nearest Neighbor) indexes like HNSW graphs, IVF (inverted file) indexes, and product quantization. These are sophisticated data structures that quickly find "the nearest" vectors by distance, trading a tiny bit of accuracy for huge gains in speed. For example, the HNSW algorithm (Hierarchical Navigable Small World) builds a graph of vectors that lets the database zoom into the relevant region of the vector space without scanning everything. The result? What would be a needle-in-a-haystack search with SQL becomes a sub-second operation with a vector DB.

Vector databases also typically support metadata filtering and hybrid queries. This is important in real applications: you often want to ask for similar items with some conditions. Maybe "find me articles semantically related to this query, but only from 2021 and in the Finance category." Specialized vector stores can handle that by storing metadata with each vector and applying filters alongside similarity search. In effect, they bridge structured data and unstructured semantics - something that's cumbersome to do manually with two different systems.

To paint a clearer picture, think of what an AI-focused search might involve. If you search a vector database for documents similar to a query, it will return a list of IDs and similarity scores. But you might then say "only show those where `department = 'Engineering'` and `date >= 2023`." A good vector database can apply those filters either during the search or just after, giving you results that meet both the semantic similarity criteria and the structured conditions. This combo of vector + metadata query is a game-changer for building things like enterprise search and retrieval-augmented generation (RAG) pipelines.
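The filter-plus-similarity combination above can be sketched in a few lines. This is a toy pre-filter implementation with made-up field names and data; production vector databases typically apply the filter inside the index rather than scanning first:

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical documents: structured metadata plus an embedding.
docs = [
    {"id": 1, "dept": "Engineering", "year": 2023, "emb": [0.9, 0.1]},
    {"id": 2, "dept": "Finance",     "year": 2024, "emb": [0.8, 0.2]},
    {"id": 3, "dept": "Engineering", "year": 2022, "emb": [0.7, 0.3]},
]

def search(query_emb, dept, min_year, k=5):
    # Structured filter first, then rank the survivors by similarity.
    candidates = [d for d in docs if d["dept"] == dept and d["year"] >= min_year]
    ranked = sorted(candidates, key=lambda d: cosine(d["emb"], query_emb), reverse=True)
    return ranked[:k]

print([d["id"] for d in search([1.0, 0.0], "Engineering", 2023)])  # [1]
```

Note that filtering before vs. during the ANN search is a real design choice: pre-filtering can starve the index of candidates, which is why good engines integrate the two steps.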

Another feature of many vector databases is horizontal scalability and high-dimensional support. They're designed to distribute huge embedding collections across clusters of machines and still query them efficiently. Need to index 10 billion vectors generated from encyclopedias or image datasets? Companies like Pinecone and Zilliz (Milvus) advertise that as a use case. These systems often include auto-sharding, GPU acceleration, and other tricks to keep search snappy even as data grows. It's no wonder venture capital poured in - in 2023, we saw vector DB startups raising serious money (for instance, Pinecone's $100M Series B at a $750M valuation and Weaviate's $50M round) to scale out this technology.

In summary, vector databases aim to be the semantic memory for AI apps. They store the vectors that represent meanings, and they retrieve what's relevant based on proximity in vector space. And they do it better than anything else at the moment. If your app's success depends on quickly finding similar items (texts, images, user behavior patterns, etc.), a vector DB is a strong candidate.


Bridging Two Worlds: Hybrid Search

We've talked about keyword vs. vector search as if they're separate, but modern AI applications often use both together. This is the concept of hybrid search - combining traditional lexical search with vector similarity search in one query. Why do that? Because each approach has strengths, and the best results often come from a blend.

Think about a search on an e-commerce site. A query like "plumbing fittings for CPVC pipes" is very specific; the exact keywords "CPVC" and "fittings" matter a lot. A lexical engine (your classic inverted index using TF-IDF or BM25) excels at this - it will find documents containing those words. Now consider a fuzzier query like "a cozy place to curl up by the fire." Those words are more abstract - a strictly keyword-based engine might miss relevant items (like it might not know to return "snug cabin in winter" because none of those words match). A vector search, however, can interpret the concept of the query (cozy, fire, warmth) and find semantically related results.

By combining the two, we cover all bases. In practice, hybrid search engines will do something like: run the keyword search and the vector search, then merge or rerank the results. If a result is highly relevant textually and semantically, it gets a boost. If something only matches on keywords or only on concept, it can still surface, but lower down. This way, the user is more likely to get what they need, whether their query was precise or poetic.
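One common merge strategy is reciprocal rank fusion (RRF), used by several engines for exactly this keyword-plus-vector blending. Here's a minimal sketch; `k=60` is the constant from the original RRF paper, and the document IDs are made up:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # e.g. a BM25 ranking
vector_hits  = ["d1", "d5", "d3"]   # e.g. a cosine-similarity ranking

print(rrf([keyword_hits, vector_hits]))  # ['d1', 'd3', 'd5', 'd7']
```

Documents that rank well in both lists ("d1", "d3") float to the top, while single-method matches still surface lower down - exactly the behavior described above.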

Many platforms now support this natively. For example, Weaviate (an open-source vector DB) has a hybrid mode that blends BM25 and vector scores in a single query. You literally can ask it a question, and it will use both methods under the hood to give a better answer. Likewise, Elasticsearch and Amazon's OpenSearch (search engines long used for text analytics) introduced dense vector fields and kNN search. You can index both the text and an embedding for each document, then during query time, do a combined search. OpenSearch even offers out-of-the-box rank fusion algorithms and improved them in 2024 to speed up hybrid queries by up to 4x. Even Redis suggests a mix: use its vector similarity search to handle the semantic matching, while still using traditional filtering or secondary indexes for exact matches (like tags, dates, etc.). It's all about using the right tool for each part of the user's intent.

This hybrid trend reflects a practical reality: in real-world scenarios, sometimes you need exact keyword matches ("error code 5007") and other times you need semantic understanding ("something's not working, what do I do?"). The most robust systems deliver both. By combining approaches, search can handle niche, precise queries and broad exploratory questions in one go. As developers, we don't have to choose one or the other - we can orchestrate both and get the best of each.


Embeddings and RAG: Giving LLMs a Brain

By now, you might be wondering how all this ties back to large language models. This is where Retrieval-Augmented Generation (RAG) comes in. RAG is a fancy term for a simple but powerful idea: before an LLM answers a question, give it some relevant data from an external source (retrieved by a search), so it can generate a more accurate answer. It's like giving the model a brief open-book exam - it still writes the answer, but you hand it the right reference pages first.

And guess what powers that retrieval step? Yep, usually a vector database or a semantic search engine. Here's the typical RAG workflow in action:

  1. Embed the query: The user asks a question (in natural language). The system converts that question into an embedding vector using an encoder model (for example, OpenAI's text-embedding model or a locally hosted transformer).

  2. Vector search: That query vector is sent to a vector database, which contains embeddings of all your knowledge documents (say, your company's wikis, PDFs, transcripts, etc.). The DB performs a similarity search to fetch the top N chunks of text that are most relevant to the query vector.

  3. Retrieve context: The raw text of those top chunks is pulled from the database (or stored alongside the vectors) - these are potentially useful facts or answers related to the question.

  4. Augment the prompt: Now, the original question plus the retrieved snippets are combined to form an augmented prompt for the LLM. Essentially, you ask the LLM: "Using this information, answer the question...".

  5. Generate answer: The LLM (which could be GPT-4, or Llama 3, or any model you choose) processes this prompt and produces a response. Because it has the relevant documents in context, it's far more likely to be correct and specific, and less likely to "hallucinate" an answer.
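The five steps above can be wired together in a short skeleton. Everything here is a toy stand-in - the character-counting "embedder," the in-memory index, and the fake LLM would all be real model and database calls in practice:

```python
def embed(text):
    # Toy "embedding": counts of two letters. A real system would call an
    # encoder model (OpenAI, a local transformer, etc.) here.
    return [text.lower().count("a"), text.lower().count("e")]

KNOWLEDGE = ["apple harvest season", "engine repair guide"]
INDEX = [(chunk, embed(chunk)) for chunk in KNOWLEDGE]  # ingest: index the docs

def retrieve(query, n=1):
    q = embed(query)                                    # step 1: embed the query
    dist = lambda v: sum((x - y) ** 2 for x, y in zip(v, q))
    ranked = sorted(INDEX, key=lambda item: dist(item[1]))
    return [chunk for chunk, _ in ranked[:n]]           # steps 2-3: search + fetch

def fake_llm(prompt):
    # Stand-in for GPT-4, Llama 3, or any model you choose.
    return f"(answer grounded in: {prompt.splitlines()[0]})"

def answer(query):
    context = retrieve(query)                           # step 4: augment the prompt
    prompt = f"Using this information: {context}\nAnswer: {query}"
    return fake_llm(prompt)                             # step 5: generate

print(answer("when are apples harvested?"))
```

Swapping the stubs for a real encoder, a vector database client, and an LLM API turns this skeleton into the standard RAG architecture.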

This pattern has quickly become the go-to architecture for LLM applications that need up-to-date or proprietary information. Rather than try to stuff an entire knowledge base into the LLM via fine-tuning (which is expensive and static), RAG lets the model remain mostly generic but intelligently fetch facts on the fly. By late 2023, we saw a shift - many companies realized that RAG can often deliver what fine-tuning promised, but more cheaply and with real-time flexibility. Need your AI assistant to know about this week's internal memo? Embed the memo and store it; RAG can retrieve it when needed, whereas a fine-tuned model from last month wouldn't have it.

Enterprise adoption of RAG surged through 2024. One survey noted that a majority of organizations working with LLMs started using retrieval-augmentation to feed models their private data, rather than trying to cram it all into the model itself. It's just pragmatic: LLMs are powerful generative engines, but they don't inherently know anything beyond their training data cutoff. RAG provides the missing pieces just in time. It's like a memory lookup for the AI.

Of course, you need a solid vector store (or search index) to make RAG work well. If the retrieval step brings back irrelevant info, the LLM will still give a bad answer (just with more context words!). So the choice of your database or search system for embeddings directly affects the quality of your AI answers. This is why the whole "LLM stack" often includes a vector DB component - it's the knowledge hub that the LLM queries during conversation.

There's an ongoing evolution here, too: as context window sizes of LLMs grow (some new models can take in tens of thousands of tokens), one might ask, "Do we even need retrieval?" The consensus so far: Yes. Large context helps, but dumping a whole wiki into a prompt isn't efficient or reliable. It's usually better to use retrieval to pick the most relevant bits for the prompt. Long context and RAG aren't mutually exclusive either - they complement each other. We may see hybrids where a vector DB fetches some info and a long-context model handles a larger chunk of it, but the principle of focused retrieval remains valuable.


Designing Your AI-Era Architecture

So, how should you piece these components together in your own projects? The answer will depend on your specific use case, but there are a few guiding points:

  • One size doesn't fit all: Despite vendor claims, no single database currently excels at everything. You might use PostgreSQL for transactions, but Pinecone for similarity search on billions of records - that's okay. Many teams adopt a polyglot approach: keep using a relational/NoSQL DB for what it's best at, and introduce a vector DB for the new semantic workload. For instance, your app could store user profiles and app state in MongoDB, but query a Weaviate cluster for recommendations or document search. This does add complexity (multiple systems to maintain, data sync concerns), but it's often worth it for performance.

  • Start simple, scale as needed: If your vector search needs are modest - say a few thousand embeddings - you might not need a separate vector database at all. It could be perfectly fine (and simpler) to use an extension in your existing database (like pgvector in Postgres) or even a lightweight local solution. As your data grows into the millions and your query latency needs tighten, that's when a dedicated vector store starts to shine. In other words, don't over-engineer from day one. You can prototype quickly with what you have, prove the value, then scale out with specialized tools when necessary.

  • Consider ANN vs exact search: Different systems use different approaches to similarity search. Some (like an unindexed pgvector search) do exact kNN search - 100% accurate, but slower on large sets. Others use ANN - approximate search that's much faster thanks to tuned index algorithms. If you require absolute precision (maybe in a scientific domain), an approximate result might be unacceptable. But in most AI applications, ANN is preferred because it's dramatically faster, and the slight loss in precision is negligible for user outcomes. Be aware of what your chosen solution uses under the hood. The good news is that many vector DBs let you configure this (you can often choose the index type or tune the accuracy/speed trade-off). The key is to match the approach to your app's needs.

  • Data freshness and pipelines: Think about how new data will flow into your AI system. If you add or update records, how quickly do their embeddings get generated and indexed? Some databases (or their surrounding tooling) can auto-update embeddings via triggers or background jobs. For example, if using MySQL HeatWave's vector store, it can ingest raw documents and create vectors internally. In other setups, you might need to have a separate embedding service (perhaps using Hugging Face or OpenAI APIs) that processes new data, then upserts vectors into the DB. Design your pipeline so that your vector index doesn't become stale as your main data changes.

  • Latency vs. cost trade-offs: It's worth noting that vector searches can be memory-intensive. Those ANN indexes often live in RAM for speed, and querying them might bypass some of the caching layers that traditional queries benefit from. This means you should plan capacity for that - more memory, maybe GPUs if using heavy-duty libraries, etc. Cloud vector DB services will charge for the performance you need. Sometimes using a slightly smaller embedding (e.g., 384 dimensions instead of 1024) can cut costs and improve speed with minimal impact on quality. It's a new kind of optimization puzzle for architects: balancing embedding size, index type, hardware, and required latency. Keep an eye on metrics and be ready to tune.

  • Security and privacy: Don't forget that those embeddings represent your data, too. There was even a Gartner note about the risk of vector databases "leaking" information, because an embedding can be decoded to reveal some original data points if someone malicious gets hold of it. Treat your vector store with the same security as the source data. Use encryption at rest, access controls, and possibly techniques like vector encryption or private retrieval if you're in a sensitive domain. Also, if using third-party API services to generate embeddings (like OpenAI), consider the privacy of sending data to those endpoints, or use their self-hosted alternatives.
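The memory point in the latency/cost bullet above is easy to quantify. The figures below cover raw float32 vector storage only; the claim that ANN graph structures (HNSW links, etc.) add further overhead on top is an assumption stated in the comment, and actual overhead varies by index type and settings:

```python
def vector_ram_gb(n_vectors, dims, bytes_per_float=4):
    """Raw float32 storage for an embedding collection, in GiB.
    ANN index structures typically add overhead on top of this figure."""
    return n_vectors * dims * bytes_per_float / 1024**3

for dims in (384, 1024):
    gb = vector_ram_gb(10_000_000, dims)
    print(f"10M vectors at {dims} dims: {gb:.1f} GB of raw vector data")
```

Ten million 1024-dimension vectors already need roughly 38 GB before any index overhead, versus about 14 GB at 384 dimensions - which is why embedding size is a first-order cost lever.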

Finally, architecture diagrams for LLM applications now almost always include: a data ingestion pipeline, a vector store, an LLM service, and an orchestration layer. The vector store sits in the "knowledge" layer, serving relevant chunks to the LLM on demand. The orchestration layer (using something like LangChain or custom code) handles the sequence: take user query -> retrieve from vector DB -> call LLM -> maybe follow-up actions. It's useful to separate these concerns in your design. A16Z (Andreessen Horowitz) published a reference architecture showing this stack: data pipelines feeding a vector DB, which the LLM queries, etc., alongside other components like caches and safety filters. Studying such blueprints can help you not reinvent the wheel.

Speaking of not reinventing the wheel, there are tools to help.


Tools and Platforms to Know

The ecosystem for databases and related tools in LLM applications is rich and growing. Here's a rundown of some notable players and technologies:

  • Traditional databases with vector support: Many established databases now include built-in or add-on features for vector search. We've discussed PostgreSQL with pgvector (supported on all major cloud platforms), MySQL HeatWave's vector store, and MongoDB Atlas's vector search. Additionally, Microsoft's Azure Cognitive Search (a search service, not a SQL DB) offers vector search capabilities, while Azure's Postgres and MySQL services provide similar features through extensions. Oracle has also integrated vector queries into Oracle Database, particularly via Oracle Cloud. And don't forget Redis - with the RediSearch module and the upcoming Redis 8 vector data type, it's positioning itself as a vector database for real-time applications. These options allow you to explore AI without needing to introduce completely new databases - perfect if you're extending an existing stack.

  • Purpose-built vector databases: If you need serious vector performance at scale, these are the specialized systems to consider. Pinecone is a popular fully managed vector DB service—you simply push your vectors to their cloud and query via API, while they handle the indexing and scaling behind the scenes. Weaviate is an open-source vector database (also offered as a managed service by Semi Technologies) known for its flexible modules (you can plug in different ML models) and hybrid search capabilities. Milvus (backed by Zilliz) is another major open-source vector DB designed for billion-scale vectors, often used in analytics and multimedia search. Qdrant is an emerging open-source project focusing on simplicity and performance (it also offers a cloud service). Chroma has gained recognition in the LLM developer community as an easy-to-use embedding store that you can run locally or within your app—great for prototyping and small applications (the team behind it is now offering a cloud version as well). There are others, too (like Vespa, Vald, Annoy, etc.), but the key is that these databases were built from the ground up for vector similarity search. They often come with client libraries, REST APIs, and integrations with ML frameworks. If your application revolves heavily around semantic search and retrieval, using one of these can save you a lot of low-level work and likely improve performance.

  • Search and analytics platforms: On the other hand, search engines and analytics databases are also merging into this space. Elasticsearch, which has long been used for text search and logging, introduced dense vector fields and an approximate kNN search API. This allows for the combination of vector queries with traditional queries, making it a natural choice if you already use the ELK stack for search. OpenSearch (Amazon's open-source fork of Elastic) not only boasts similar capabilities but is also marketed as a vector database platform. It supports multiple ANN algorithms (HNSW, IVF) and distance metrics, and AWS has integrated it deeply into their ecosystem (e.g., zero-ETL from Aurora to OpenSearch for vector search use cases). Azure Cognitive Search now supports vector embeddings as well, enabling semantic search on your indexed documents with a simple configuration change. Furthermore, Google Cloud's Vertex AI Matching Engine (recently rebranded as Vertex AI Vector Search) is a fully managed service for vector similarity search at extreme scale - it's essentially the technology Google uses internally (ScaNN) made available on GCP and capable of handling billions of vectors with low latency. These platforms are excellent if you seek an end-to-end managed solution or wish to combine vector search with other analytics. For example, OpenSearch can perform aggregations and hybrid queries that mix keyword and vector logic, which is particularly useful for e-commerce or logging scenarios.

  • Orchestration and middleware: Connecting your databases and LLMs can be challenging, but libraries and frameworks have emerged to simplify the process. LangChain is one of the most popular Python (and JS) libraries for building LLM-driven applications. It offers useful abstractions for implementing RAG: you can integrate a vector store (supporting everything from Pinecone to Chroma to an Elastic index), and LangChain will manage the retrieval of documents and the construction of prompts for the LLM. Additionally, it provides tools to handle conversation memory (which may utilize a database like Redis for caching dialog history with embeddings). LlamaIndex (formerly GPT Index) is another framework that assists in creating indices of your data (vectors, keywords, etc.) and querying them with LLMs in a consistent manner. The idea is to start with one backend and swap it out as necessary - for example, using a simple in-memory index during prototyping, then transitioning to a persistent vector database in production, all while maintaining the same library interface. These tools also include components for tasks like result re-ranking, source citation tracking, and chaining multiple steps (e.g., performing a vector search, then feeding results into a different model). While they are not databases themselves, they serve as essential components in the LLM application stack, making it much easier to work with your data stores.

  • Ecosystem and cloud integrations: It's worth noting that the major cloud providers are all integrating vector support across their services. We touched on Google and Azure. AWS also has integrations - for example, Amazon Neptune (graph DB) can store node embeddings and perform similarity queries for graph data, and Amazon Kendra (an AI search service) offers semantic ranking out of the box. Many vector DB startups have partnerships or managed offerings on various clouds (Pinecone primarily runs on AWS, and Zilliz Cloud operates on AWS/GCP, etc.). Additionally, consider the benchmarking and monitoring tools emerging for these systems - as the field matures, we will see more standardized ways to measure vector search performance and quality, which will assist in selecting the right tool. For now, it can be beneficial to read recent benchmarks (some vendors publish their own, like Redis's claim of the fastest vector search in a benchmark - take with a grain of salt, but it's an interesting data point).
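The "start with one backend and swap it out" idea from the orchestration bullet can be captured with a tiny interface of your own, independent of any framework. The `Protocol` below is illustrative - it is not LangChain's or any vendor's actual API:

```python
from typing import Protocol

class VectorStore(Protocol):
    """Minimal store interface the application codes against."""
    def upsert(self, doc_id: str, embedding: list[float]) -> None: ...
    def query(self, embedding: list[float], k: int) -> list[str]: ...

class InMemoryStore:
    """Prototype backend: exact scan over a dict. Fine for small data;
    replace with a managed vector DB client in production."""
    def __init__(self):
        self._rows: dict[str, list[float]] = {}

    def upsert(self, doc_id, embedding):
        self._rows[doc_id] = embedding

    def query(self, embedding, k):
        dist = lambda v: sum((x - y) ** 2 for x, y in zip(v, embedding))
        return sorted(self._rows, key=lambda d: dist(self._rows[d]))[:k]

store: VectorStore = InMemoryStore()  # later: a Pinecone/Weaviate/etc. adapter
store.upsert("a", [1.0, 0.0])
store.upsert("b", [0.0, 1.0])
print(store.query([0.9, 0.1], k=1))  # ['a']
```

Because the application only sees `upsert` and `query`, moving from the in-memory prototype to a hosted vector database means writing one adapter class, not rewriting call sites.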

In short, the toolkit for building LLM applications is expanding rapidly. You have more choices than ever, from extending tried-and-true databases to deploying cutting-edge specialized stores. The good news is you don't necessarily have to commit upfront - thanks to the abstraction libraries, you can design your system in a modular way. Use a simple solution to start, prove value, then swap in a more powerful database if needed. The landscape will likely consolidate a bit in the coming years (not every new vector DB startup will survive), but the concepts they popularized are here to stay.


Conclusion - Embrace the Evolution

The database landscape for AI and LLM applications is evolving at lightning speed. It's a bit reminiscent of the NoSQL wave a decade ago - suddenly we have new categories, new jargon (embeddings, ANN, hybrid search), and a flurry of innovation to keep up with the demands of AI. The difference now is that this change is touching almost every part of data architecture. AI isn't a niche use case; it's becoming a core requirement.

For developers and architects, the key takeaway is this: don't be afraid to mix and match technologies to meet your app's needs. Want to keep things simple? See if your existing database can be enhanced with vector search or integrate a managed service that "just works." Need state-of-the-art semantic search at scale? Bring in that purpose-built vector DB and connect it to your app. And for most real-world LLM applications, plan on a retrieval component (whether that's a database or a search engine) to make your AI both smarter and safer.

Finally, keep an eye on this space. Best practices are still emerging. Just in the past year, we've seen major improvements - from faster hybrid search algorithms, to databases generating embeddings internally, to new caching layers for LLM calls. The stack is maturing, but not settled. Subscribe to the blogs of the tools you use (vendors often share tips on indexing parameters, new features, etc.), and consider joining communities (there are active forums and Discords around vector databases and LLM Ops).

We're witnessing databases morph to meet the age of AI: relational rows and JSON docs now live alongside vector embeddings; SQL and semantic search work hand in hand. It's an exciting time. By understanding this new landscape and leveraging the right mix of traditional and new tools, you can build AI applications that are not only intelligent but also efficient, scalable, and grounded in data. And ultimately, that leads to better experiences for users and less "black box" behavior in our AI.


In a nutshell, the database world didn't disappear with the rise of LLMs - it transformed and expanded. As you architect your next AI-powered system, you're not choosing a database; you're choosing the right set of data tools for the job. Embrace the change, experiment with these new capabilities, and you'll find a sweet spot where your databases and your AI models work in harmony.


Written by Dr Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Technology is easy. Mapping it to P&L is hard. At First AI Movers, we don't just write code; we build the 'Executive Nervous System' for EU SMEs—helping you architect AI readiness assessments, digital transformation strategies, and workflow automation design that turn technical capability into business equity.

Is your database architecture creating technical debt or business equity?

👉 Get your AI Readiness Score (Free Company Assessment)

Connect with me on LinkedIn or drop a comment below—let's start crafting AI-native solutions that let everyone breathe smarter.

Happy building!
