DEV Community

AWS Data & AI Stories #03: Multimodal Knowledge Bases

In the first article, I talked about multimodal AI at a high level.

In the second one, I focused on Amazon Bedrock Data Automation as the processing layer.

Now the next question is simple:

After we process the content, how do we make it searchable and useful for AI applications?

This is where multimodal knowledge bases come in.

Amazon Bedrock Knowledge Bases now supports multimodal content, including images, audio, and video, in addition to traditional unstructured text sources. It also supports multimodal querying, including image-based search and retrieval across media types.

For me, this is the layer that turns processed content into usable context.

What is a multimodal knowledge base?

A knowledge base is a managed retrieval layer for your own content.

Instead of asking a model to rely only on general training knowledge, a knowledge base helps the system retrieve information from your own files and data sources before generating a response. That is the main idea behind Retrieval Augmented Generation, or RAG. Amazon Bedrock Knowledge Bases is designed for exactly this purpose: it retrieves relevant information from your data sources and uses it to improve response relevance and accuracy.
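The retrieval step described above can be sketched with boto3. This is a minimal sketch, not a full application: the knowledge base ID is a placeholder, and the prompt format in `build_prompt` is my own illustration of grounding, not a specific AWS recommendation.

```python
def retrieve_context(kb_id: str, question: str, top_k: int = 5) -> list[str]:
    """Fetch the top-k relevant chunks for a question via the Retrieve API."""
    import boto3  # AWS SDK for Python

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [r["content"]["text"] for r in response["retrievalResults"]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Ground the model in retrieved chunks instead of training knowledge."""
    context = "\n\n".join(chunks)
    return (
        "Use only this context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The point of the second helper is the RAG idea itself: the retrieved chunks become the context the model is asked to answer from.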

A multimodal knowledge base extends that idea beyond text.

So instead of only working with documents, the system can also work with:

  • images
  • audio
  • video
  • mixed-content files

This matters because enterprise knowledge is rarely text only.

Why does this matter?

Because many real-world systems do not store knowledge in perfect written documents.

A lot of value exists in:

  • diagrams
  • scanned files
  • screenshots
  • inspection photos
  • recorded calls
  • training videos
  • equipment images
  • operational media

If our knowledge layer only understands text, a large part of business context stays outside the system.

With multimodal retrieval in Bedrock Knowledge Bases, AWS now supports ingesting, indexing, and retrieving information from text, images, video, and audio in a more unified workflow. AWS also notes that applications can search using an image query to find visually similar content or relevant scenes in multimedia sources.

Where it fits in the architecture

I see the flow like this:

Raw content → processing layer → knowledge base → retrieval → answer or action

So:

  1. Part 1 was the general multimodal AI view
  2. Part 2 was the processing layer with Bedrock Data Automation
  3. Part 3 is the retrieval layer

That means the knowledge base is not the first step.

It comes after the content is already available in a usable form, whether directly from unstructured sources or after preprocessing.

AWS documentation now makes this separation clearer by distinguishing multimodal processing approaches by goal: Nova Multimodal Embeddings for visual similarity and cross-modal retrieval, or Bedrock Data Automation for text-oriented processing of multimedia content.

Two ways to think about multimodal retrieval

This is the most important design point.

Not every multimodal use case is the same.

AWS currently describes two main multimodal processing approaches for knowledge bases:

1. Nova Multimodal Embeddings approach

This is better when the focus is:

  • visual similarity
  • image search
  • cross-modal retrieval
  • searching media with text or image input

AWS documentation says this approach is suited for visual similarity searches and multimodal semantic retrieval.

2. Bedrock Data Automation approach

This is better when the focus is:

  • extracting structured meaning from multimedia
  • turning media into searchable text-oriented outputs
  • using processed content in downstream RAG

AWS documentation describes this option as the text-based processing path for multimedia content.

For me, the decision is simple:

  • If I want to find similar content across modalities, I think retrieval-first.
  • If I want to extract useful content from media and then search it, I think processing-first.
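As a sketch of where that decision shows up in configuration: the embedding model is chosen when the knowledge base is created, and the Bedrock Data Automation parsing option is set on the data source. The model ARN here is a placeholder, and I am assuming the `BEDROCK_DATA_AUTOMATION` parsing strategy value from the bedrock-agent API; verify both against current AWS documentation.

```python
def vector_kb_config(embedding_model_arn: str) -> dict:
    """knowledgeBaseConfiguration for create_knowledge_base.

    The embedding model chosen here is what makes a knowledge base
    retrieval-first (a multimodal embeddings model) versus text-only.
    """
    return {
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": embedding_model_arn,
        },
    }

def ingestion_config(processing_first: bool) -> dict:
    """vectorIngestionConfiguration for create_data_source.

    Routes media through Bedrock Data Automation when you want
    text-oriented outputs for downstream RAG (processing-first).
    """
    if not processing_first:
        return {}
    return {
        "parsingConfiguration": {"parsingStrategy": "BEDROCK_DATA_AUTOMATION"}
    }
```

The two helpers mirror the two mindsets: retrieval-first is an embedding-model choice, processing-first is a parsing choice.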

What can you query?

This is one of the nice parts of the newer multimodal support.

After ingesting multimodal content, Bedrock Knowledge Bases supports different query patterns depending on the selected approach. AWS documentation for testing and querying multimodal knowledge bases shows support for metadata such as:

  • source modality
  • MIME type
  • chunk start time for audio/video
  • chunk end time for audio/video

It also mentions playback controls with automatic segment positioning for multimedia results in the console.

That means this is not just “retrieve a paragraph.”

It can also become:

  • retrieve a scene from a video
  • return the relevant moment in an audio file
  • find a matching image
  • connect retrieved media segments to an answer

That is a big step forward compared with traditional text-only RAG.

How I would explain it simply

A traditional knowledge base answers:

“Which text chunk is relevant?”

A multimodal knowledge base can answer:

“Which content is relevant, regardless of whether it is text, image, audio, or video?”

That is the real difference.

Data source point to remember

There is one important limitation to keep in mind.

AWS documentation states that multimodal support in Bedrock Knowledge Bases is available when creating a knowledge base with unstructured data sources. Structured data sources do not support multimodal content processing.

That is important for design.

If your use case depends heavily on images, audio, or video, you should think in terms of unstructured content pipelines, not only structured tables.

A practical example

Imagine a support or operations platform.

Your users may store:

  • PDF manuals
  • field photos
  • recorded troubleshooting calls
  • short maintenance videos

A user asks:
“Show me the relevant maintenance guidance for this equipment issue.”

A traditional text-only system may retrieve only written manuals.

A multimodal knowledge base can potentially retrieve:

  • a relevant text section
  • a matching image
  • a useful audio segment
  • a video moment with the right scene

And then that context can be passed to the model for answer generation.
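That last step, retrieval plus generation in one call, is what the RetrieveAndGenerate API is for. A sketch, assuming an existing knowledge base and a generation model ARN as placeholders; the `cited_uris` helper reflects my reading of the response shape and should be checked against the current API reference.

```python
def answer_with_sources(kb_id: str, model_arn: str, question: str) -> dict:
    """One-call RAG: retrieve from the knowledge base, then generate an
    answer grounded in the retrieved content, with citations."""
    import boto3  # AWS SDK for Python

    client = boto3.client("bedrock-agent-runtime")
    return client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )

def cited_uris(response: dict) -> list[str]:
    """Collect the S3 URIs of the sources behind the answer, if present."""
    uris = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri:
                uris.append(uri)
    return uris
```

For the support scenario above, those cited URIs could point at a manual, a photo, or a video file, and the timestamp metadata tells you which segment mattered.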

That is why this is more than just a storage feature.

It is a better retrieval model for real-world knowledge.

Why I like this layer

I like multimodal knowledge bases because they make AI architecture more realistic.

In many enterprise environments, the problem is not lack of data.

The problem is that the useful data is trapped inside different formats and scattered across different files.

A multimodal knowledge base helps solve that by creating a retrieval layer that can work across those formats. AWS positions Knowledge Bases as an out-of-the-box RAG capability that reduces the effort of building pipelines and helps applications answer queries using proprietary content, with source-grounded responses and citations.

Common mistake

A common mistake is to assume that all multimodal use cases need the same architecture.

They do not.

For example:

  • image similarity search is not the same as document extraction
  • video segment retrieval is not the same as audio transcription
  • cross-modal search is not the same as text-based RAG over processed media

AWS’s own multimodal guidance now separates these choices clearly, and I think that is the right way to approach the design.

What I would decide early

Before building the knowledge base, I would answer these questions:

  • Do I need visual similarity or text-oriented retrieval?
  • Am I retrieving directly from raw multimodal content, or from processed output?
  • Do I need image queries?
  • Do I need timestamped retrieval from audio or video?
  • Do I want the knowledge base mainly for search, RAG, or both?

These questions make the architecture much clearer.

Final thoughts

For me, multimodal knowledge bases are the point where multimodal AI becomes operational.

They connect processed or stored media-rich content to retrieval, and they make it possible to build AI systems that are grounded in more than just text. With Amazon Bedrock Knowledge Bases, AWS now supports multimodal ingestion and retrieval across images, audio, video, and text, along with query-time metadata that can point to the right file type and even the right media segment.

That makes this layer very important.

Because once retrieval improves, the answers improve.

And once the answers improve, the AI system becomes much more useful.

In the next article, I will move to the next logical topic:

How to use multimodal retrieval in a real RAG workflow on AWS.
