In the first article, I talked about multimodal AI at a high level.
In the second article, I focused on Amazon Bedrock Data Automation as the processing layer.
In the third article, I explained multimodal knowledge bases as the retrieval layer.
Now it is time to connect these pieces together.
This is where multimodal RAG becomes important. Amazon Bedrock Knowledge Bases now supports multimodal content including images, audio, and video, and AWS positions it as a managed way to build end-to-end RAG workflows over enterprise data.
What is multimodal RAG?
RAG means Retrieval Augmented Generation.
The idea is simple:
- retrieve relevant content from your own data
- send that context to the model
- generate a grounded answer
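The three steps above can be sketched as a minimal loop. Everything here is a toy stand-in (an in-memory corpus and naive keyword overlap instead of a real embedding retriever, and no actual model call), just to make the control flow concrete:

```python
# A minimal, self-contained sketch of the retrieve -> augment -> generate loop.
# The corpus and retriever are toy stand-ins, not Bedrock APIs.

CORPUS = {
    "doc1": "Check the pump seal if pressure drops below 2 bar.",
    "doc2": "Replace the air filter every 500 operating hours.",
    "doc3": "Firmware updates are applied via the service console.",
}

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        CORPUS.values(),
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment the user query with retrieved context before generation."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

query = "Why did the pump pressure drop?"
context = retrieve(query)
prompt = build_prompt(query, context)
print(prompt)  # this augmented prompt is what a foundation model would receive
```

In a real system the retriever would be a vector search over a knowledge base, but the shape of the loop stays the same.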
A multimodal RAG system follows the same logic, but the retrieved context is not limited to text. It can also include images, audio, video, or processed outputs derived from those inputs. The AWS documentation for multimodal knowledge bases explicitly describes multimedia ingestion and querying, including image queries and time-based retrieval metadata for audio and video.
Why is multimodal RAG different from normal RAG?
Traditional RAG is usually text-focused.
That works well for manuals, policies, reports, and similar documents.
But in many real environments, important knowledge is spread across:
- diagrams
- screenshots
- scanned pages
- recorded calls
- videos
- field images
So the challenge is no longer only “Which paragraph should I retrieve?”
The new challenge becomes:
Which content is relevant, regardless of format?
That is the real value of multimodal RAG. AWS’s newer multimodal retrieval guidance is built around this exact shift from text-only retrieval to retrieval across media types.
How I see the architecture
A simple multimodal RAG architecture on AWS looks like this:
- Data is collected in a source such as Amazon S3
- Raw files are processed if needed
- A knowledge base indexes the usable content
- A query retrieves relevant multimodal context
- A foundation model generates the answer
- The application returns the answer, often with source grounding
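From the application side, most of those steps collapse into a single managed call. The sketch below only builds the request payload for Bedrock's RetrieveAndGenerate API; the knowledge base ID and model ARN are placeholders, and the actual call (commented out) would go through a boto3 `bedrock-agent-runtime` client with valid credentials:

```python
# Sketch of the request an application sends for this workflow.
# KB_ID_PLACEHOLDER and the model ARN are placeholders, not real resources.

def build_rag_request(query: str, kb_id: str, model_arn: str) -> dict:
    """Assemble a RetrieveAndGenerate-style payload: the service handles
    ingestion-side retrieval and prompt augmentation, so the application
    only supplies the query and configuration."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

request = build_rag_request(
    "What should I check first?",
    kb_id="KB_ID_PLACEHOLDER",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/MODEL_ID",
)
# client = boto3.client("bedrock-agent-runtime")
# response = client.retrieve_and_generate(**request)
```

The point of the sketch is how little plumbing sits in the application itself: retrieval, augmentation, and generation are all behind one call.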
AWS describes Knowledge Bases as a fully managed RAG capability that handles ingestion, retrieval, and prompt augmentation, which is why it fits this workflow so well. AWS also shows multimodal examples where Bedrock Data Automation is used before Knowledge Bases to improve downstream retrieval.
Two main multimodal RAG patterns
This is the most important design point for this article.
Not every multimodal RAG system should be built the same way.
AWS currently describes two main approaches for multimodal processing in Knowledge Bases:
1. Retrieval-first approach
This is the better option when the main goal is:
- visual similarity
- image search
- cross-modal retrieval
- media-aware search
In this pattern, Amazon Nova Multimodal Embeddings is the main enabler. AWS describes this approach as the right fit for visual similarity searches and multimodal semantic retrieval.
2. Processing-first approach
This is the better option when the main goal is:
- extracting structured meaning from raw media
- turning audio, video, or documents into usable searchable content
- supporting downstream question answering with processed output
In this pattern, Amazon Bedrock Data Automation becomes the first major step before retrieval. AWS documentation describes BDA as the text-based processing path for multimedia content in multimodal knowledge bases, and AWS has also published solution examples combining BDA with Knowledge Bases for multimodal RAG applications.
How to decide between the two
For me, the design question is simple.
If I want to ask:
“Find content that looks or feels similar.”
then I would think retrieval-first.
If I want to ask:
“Extract useful content from media and use that in RAG.”
then I would think processing-first.
AWS’s own “choose your multimodal processing approach” guidance makes this distinction very clearly, and I think that is the right way to avoid overdesigning the solution.
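The decision rule can be written down as a tiny helper. The goal labels below are my own shorthand for the criteria discussed above, not AWS terminology:

```python
# An illustrative encoding of the "retrieval-first vs processing-first"
# decision. The goal labels are my own shorthand, not AWS terminology.

RETRIEVAL_FIRST_GOALS = {"visual-similarity", "image-search", "cross-modal-retrieval"}
PROCESSING_FIRST_GOALS = {"extract-structure", "transcribe-media", "qa-over-processed-output"}

def choose_approach(goals: set[str]) -> str:
    """Pick a multimodal processing approach from the stated goals."""
    if goals & RETRIEVAL_FIRST_GOALS:
        return "retrieval-first (Nova Multimodal Embeddings)"
    if goals & PROCESSING_FIRST_GOALS:
        return "processing-first (Bedrock Data Automation)"
    return "start with plain text RAG"

print(choose_approach({"image-search"}))       # retrieval-first path
print(choose_approach({"transcribe-media"}))   # processing-first path
```

If a project has goals on both sides, that is usually a sign to split the data sources rather than force one approach.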
A practical workflow example
Imagine a support or operations use case.
Your data may include:
- PDF maintenance procedures
- field images
- audio notes from engineers
- short troubleshooting videos
A user asks:
“What is the likely issue and what should I check first?”
A text-only RAG system may retrieve a manual section.
A multimodal RAG system can do more:
- retrieve a relevant text section
- identify matching visual evidence
- point to the correct moment in a video
- use processed audio or image context to improve the answer
AWS documentation for querying multimodal knowledge bases shows response metadata such as source modality, MIME type, and start and end timestamps for audio and video segments, which makes this type of experience much more practical.
Why Bedrock Knowledge Bases matters here
You can always build your own RAG system.
But one reason Bedrock Knowledge Bases matters is that it reduces the amount of custom plumbing.
AWS positions it as a managed RAG capability that simplifies setup, handles parts of preprocessing and retrieval, and helps ground model responses in proprietary data. For many teams, this is a better starting point than building a fully custom retrieval pipeline from scratch.
Where BDA still matters in multimodal RAG
Even though this article is about RAG, BDA still plays an important role.
Multimodal RAG does not always mean retrieving directly from raw multimedia.
In many cases, the better pattern is:
- process the content first
- extract structured insights
- store or index those outputs
- use them in RAG
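The four steps above can be sketched end to end with toy stand-ins. Here `process_media()` stands in for Bedrock Data Automation, and the index is an in-memory dict rather than a real knowledge base:

```python
# A toy end-to-end sketch of the processing-first pattern: "process" media
# into text, index the output, then retrieve it at question time.
# process_media() is a stand-in for Bedrock Data Automation.

def process_media(filename: str) -> str:
    """Stand-in for BDA: return text extracted from a media file."""
    fake_bda_output = {
        "call_0412.wav": "Engineer reports grinding noise near bearing B2.",
        "site_photo.jpg": "Image shows corrosion on the intake valve.",
    }
    return fake_bda_output[filename]

# Steps 1-3: process the content and index the structured output.
index: dict[str, str] = {}
for f in ("call_0412.wav", "site_photo.jpg"):
    index[f] = process_media(f)

def retrieve_processed(query: str) -> list[str]:
    """Step 4: naive keyword retrieval over the processed outputs."""
    terms = set(query.lower().split())
    return [src for src, text in index.items()
            if terms & set(text.lower().split())]

print(retrieve_processed("grinding noise"))  # finds the audio note's source
```

The key property is that retrieval runs over the processed text, while the answer can still cite the original media file as its source.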
AWS has shown this pattern in solution examples where Amazon Bedrock Data Automation processes multimodal content, the extracted information is stored in a knowledge base, and then a RAG interface is used for question answering.
One point people often miss
A common mistake is to assume multimodal RAG is only about attaching files to a chatbot.
That is too simple.
A real multimodal RAG system usually includes:
- ingestion
- processing
- indexing
- retrieval
- prompt augmentation
- response generation
- source grounding
That is why I see multimodal RAG as an architecture pattern, not just a model feature. AWS Prescriptive Guidance describes Knowledge Bases as covering the RAG workflow from ingestion to retrieval and prompt augmentation, which supports this architecture view.
Constraints to remember
There are also a few practical points to remember.
First, AWS states that multimodal support in Bedrock Knowledge Bases is available with unstructured data sources. Structured data sources do not support multimodal content processing. Second, the available query types and features depend on the processing approach you choose.
So it is important to design the knowledge layer with the right data source model from the start.
Where this is useful
I think multimodal RAG is especially useful in cases like:
- technical support
- operations knowledge assistants
- document and image search
- inspection workflows
- compliance evidence review
- media-rich enterprise search
- predictive maintenance assistants
AWS has published examples including multimodal root-cause diagnosis and agentic multimodal assistants, which shows that this pattern is already moving into real business use cases.
Final thoughts
For me, multimodal RAG is where the previous three topics come together.
- Multimodal AI gives the overall direction
- Bedrock Data Automation helps process raw content
- Multimodal Knowledge Bases provide the retrieval layer
- Multimodal RAG turns all of that into useful answers
AWS now provides a much clearer path for building these solutions than before, especially with managed multimodal retrieval in Knowledge Bases and guidance on choosing between BDA and Nova Multimodal Embeddings depending on the use case.
For me, the key lesson is simple:
Do not start with the model.
Start with the question:
What kind of content do I need to retrieve, and why?
If that answer is clear, the multimodal RAG design becomes much easier.
In the next article, I will move to the next logical topic:
Amazon Nova Multimodal Embeddings.