<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jihun Lim</title>
    <description>The latest articles on DEV Community by Jihun Lim (@heuri).</description>
    <link>https://dev.to/heuri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1149787%2F5f66517c-beb2-481c-9e98-f31186d4ae0f.png</url>
      <title>DEV Community: Jihun Lim</title>
      <link>https://dev.to/heuri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/heuri"/>
    <language>en</language>
    <item>
      <title>Model Distillation for Amazon Nova Vision: Fine-Tuning Text-Image-to-Text</title>
      <dc:creator>Jihun Lim</dc:creator>
      <pubDate>Wed, 07 May 2025 06:12:13 +0000</pubDate>
      <link>https://dev.to/heuri/model-distillation-for-amazon-nova-vision-fine-tuning-text-image-to-text-1bhm</link>
      <guid>https://dev.to/heuri/model-distillation-for-amazon-nova-vision-fine-tuning-text-image-to-text-1bhm</guid>
      <description>&lt;p&gt;In this post, I'll introduce a Text-Image-to-Text fine-tuning method to effectively transfer the Vision capabilities of Amazon &lt;code&gt;Nova Pro&lt;/code&gt; Model to the &lt;code&gt;Lite&lt;/code&gt; Model.&lt;/p&gt;

&lt;p&gt;Before diving into the main content, I'd like to mention that I initially wanted to cover Model Distillation techniques in the Vision field directly, but the current support for this in Amazon Bedrock is limited. As an alternative, I'll share how to implement Vision Language Model distillation indirectly using the &lt;strong&gt;"Fine-Tuning: Text-Image-to-Text"&lt;/strong&gt; approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚗️ Model Distillation&lt;a id="distillation"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;At re:Invent 2024, the Amazon Bedrock ecosystem began providing a new model customization feature called &lt;strong&gt;Model Distillation&lt;/strong&gt;, in addition to Fine-tuning and Continued pre-training. Also, &lt;a href="https://aws.amazon.com/ko/about-aws/whats-new/2025/04/amazon-nova-premier-complex-tasks-model-distillation/" rel="noopener noreferrer"&gt;recently (April 30)&lt;/a&gt;, Amazon released &lt;code&gt;Nova Premier&lt;/code&gt; as a teacher model for model distillation of complex tasks.&lt;/p&gt;

&lt;p&gt;Model distillation is a technique that transfers knowledge from a large teacher model to a smaller student model, allowing you to reduce model size and computational costs while maintaining performance as much as possible.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock Model Distillation consists of two main steps: first, generating the training data, and second, fine-tuning the student model on that generated data to create the distilled model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5urxjteyfab69l7sa79s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5urxjteyfab69l7sa79s.png" alt="Model Distillation" width="800" height="303"&gt;&lt;/a&gt;&lt;br&gt;
Bedrock doesn't officially support model distillation for image tasks at present. However, if you understand the basic principles of the distillation process, you can implement model distillation for image tasks on your own by using a teacher model to generate training data and performing fine-tuning separately.&lt;/p&gt;


&lt;h2&gt;
  
  
  📸 Task Setting - Comparing Image Labeling Tasks&lt;a id="task"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Multimodal models with Vision Understanding capabilities include &lt;a href="https://huggingface.co/docs/transformers/en/tasks/image_captioning" rel="noopener noreferrer"&gt;Image Captioning&lt;/a&gt; functionality that can describe given images. Given an image and a request to extract keywords for the aspects you care about (photography technique, mood, objects, etc.), the model returns relevant keywords for that image.&lt;/p&gt;

&lt;p&gt;👇 &lt;em&gt;Image Labeling Example Prompt&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an image keyword extraction expert. Please analyze the image and extract concise keywords optimized for search.

Extract keywords according to the following 5 categories, but provide the final result as a single list separated by commas without category distinctions:

1. Main objects/people: People (gender, age group, ethnicity), animals, objects, and other core elements
2. Location/background: Places, landscapes, environments (indoor/outdoor), time, season
3. Actions/emotions: Verbs describing activities, adjectives indicating mood
4. Visual characteristics: Main colors, composition, photography techniques, image style
5. Contextual elements: Fashion, landmarks, cultural context, event/festival-related information

Please provide 2-5 keywords per category, totaling 15-25 search-optimized keywords. Avoid duplications and be concise.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image below shows the results of Image Labeling performed using the &lt;code&gt;Nova Pro&lt;/code&gt; and &lt;code&gt;Lite&lt;/code&gt; models on one of the photos from the &lt;a href="https://huggingface.co/datasets/ShutterstockInc/high_resolution_images" rel="noopener noreferrer"&gt;&lt;code&gt;ShutterstockInc/high_resolution_images&lt;/code&gt;&lt;/a&gt; dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0l1phd061iz22lwrbd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0l1phd061iz22lwrbd9.png" alt="Image Labeling" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that even with the same prompt, the two models' responses differ considerably. &lt;strong&gt;Please remember that in this post, rather than judging which model is superior at Image Labeling tasks, we focus on making the Lite model produce responses similar to those of the Pro model!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To quantify the similarity between the two models' answers, we measured the Jaccard index between the two keyword sets, which came out to 0.129. Now, let's see how much closer the responses can become by fine-tuning the Lite model with data from Pro.&lt;/p&gt;
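&lt;p&gt;For reference, the Jaccard index used throughout this post can be computed in a few lines of Python (a minimal sketch; the keyword lists below are illustrative, not the actual model outputs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the two keyword sets
    sa, sb = set(a), set(b)
    return len(sa.intersection(sb)) / len(sa.union(sb))

pro_keywords = ["woman", "beach", "sunset", "walking"]   # illustrative
lite_keywords = ["woman", "beach", "ocean", "standing"]  # illustrative
print(round(jaccard(pro_keywords, lite_keywords), 3))    # 0.333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;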




&lt;h2&gt;
  
  
  🧑‍🔬 Self-Implementation of VLM Model Distillation&lt;a id="self-diatillation"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataset Preparation Process
&lt;/h3&gt;

&lt;p&gt;To distill VLM models ourselves, we'll perform fine-tuning using the Text-Image-to-Text approach. For this, we need to prepare the fine-tuning dataset in the following four steps.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this post, we used the medium dataset of &lt;a href="https://huggingface.co/datasets/ShutterstockInc/high_resolution_images" rel="noopener noreferrer"&gt;&lt;code&gt;ShutterstockInc/high_resolution_images&lt;/code&gt;&lt;/a&gt; available on Hugging Face to implement VLM model distillation ourselves.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Image Preprocessing
&lt;/h4&gt;

&lt;p&gt;The scope of image preprocessing is very broad. Here, assuming that classification suitable for specific tasks has been completed, we'll only cover preprocessing related to image resizing. Different tasks require different resolutions, but in most cases, high-resolution images are not necessary.&lt;/p&gt;

&lt;p&gt;For example, Claude models calculate the token count of an image using the following formula: &lt;code&gt;Token count = (width px × height px) ÷ 750&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a 300 × 199 image&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total pixels: 300 × 199 = 59,700 pixels&lt;/li&gt;
&lt;li&gt;Required tokens: 59,700 ÷ 750 = 79.6 ≈ &lt;strong&gt;80 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For a 1000 × 665 image&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total pixels: 1000 × 665 = 665,000 pixels&lt;/li&gt;
&lt;li&gt;Required tokens: 665,000 ÷ 750 = 886.67 ≈ &lt;strong&gt;887 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, token consumption varies greatly depending on image resolution, so it's important to appropriately reduce the size of high-resolution images before building a training dataset. This not only reduces model training costs but also contributes to improved processing speed, enabling efficient learning without performance degradation for most tasks.&lt;/p&gt;
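&lt;p&gt;Putting the formula above into code, a small sketch like the following can estimate token counts and pick target dimensions that fit a token budget (the resizing helper is my own illustration, not part of the original pipeline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def approx_tokens(width, height):
    # Claude-style estimate: tokens ≈ (width px × height px) ÷ 750
    return round(width * height / 750)

def resize_dims(width, height, max_tokens=1000):
    # Scale factor that brings the estimated token count under the budget;
    # min() leaves images that already fit untouched
    scale = min(1.0, (max_tokens * 750 / (width * height)) ** 0.5)
    return (int(width * scale), int(height * scale))

print(approx_tokens(300, 199))   # 80
print(approx_tokens(1000, 665))  # 887
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dimensions returned by &lt;code&gt;resize_dims&lt;/code&gt; can then be applied with your image library of choice (e.g., Pillow's &lt;code&gt;Image.resize&lt;/code&gt;) before the images are uploaded for training.&lt;/p&gt;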

&lt;h4&gt;
  
  
  2. Reference Data Composition
&lt;/h4&gt;

&lt;p&gt;In this process, we call the teacher model to generate prompt-response pair data. The responses generated by the teacher model are later used as fine-tuning data for the student model.&lt;/p&gt;

&lt;p&gt;We called the teacher model through the Converse API that supports multimodal functionality, and saved the model's responses and corresponding image filenames in JSONL format for building the fine-tuning dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system_prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompts&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;teacher_model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reponse_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;jsonl_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reponse_text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Training Dataset Creation
&lt;/h4&gt;

&lt;p&gt;Following Bedrock's fine-tuning requirements, we create the dataset needed for model learning in JSONL format, referencing the &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html#custom-fine-tune-constraints" rel="noopener noreferrer"&gt;Preparing data for fine-tuning Understanding models&lt;/a&gt; guidelines.&lt;/p&gt;

&lt;p&gt;In this post, we prepare the data in the &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html#customize-fine-tune-examples" rel="noopener noreferrer"&gt;Single image custom fine tuning format&lt;/a&gt;.&lt;br&gt;
In this process, we &lt;strong&gt;use the data generated in the second step&lt;/strong&gt; to populate the &lt;code&gt;text&lt;/code&gt; fields under &lt;code&gt;system&lt;/code&gt; and &lt;code&gt;messages&lt;/code&gt;, and the &lt;code&gt;uri&lt;/code&gt; field of &lt;code&gt;image&lt;/code&gt;, completing the dataset.&lt;/p&gt;
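&lt;p&gt;To make this step concrete, here is a hypothetical sketch of assembling one training record. The &lt;code&gt;schemaVersion&lt;/code&gt; value, S3 bucket, and placeholder variables are assumptions for illustration; verify the exact field layout against the linked documentation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Placeholder values, assumed for illustration
system_prompt = "You are an image keyword extraction expert."
user_prompt = "Analyze the image and extract search-optimized keywords."
image_name = "sample.jpg"                  # image filename from step 2
teacher_response = "woman, beach, sunset"  # teacher model's response from step 2

record = {
    "schemaVersion": "bedrock-conversation-2024",  # assumed; check the docs
    "system": [{"text": system_prompt}],
    "messages": [
        {
            "role": "user",
            "content": [
                {"text": user_prompt},
                {"image": {"format": "jpeg",
                           "source": {"s3Location": {"uri": "s3://my-bucket/images/" + image_name}}}},
            ],
        },
        {"role": "assistant", "content": [{"text": teacher_response}]},
    ],
}

# One JSON object per line, as JSONL requires
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;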

&lt;h4&gt;
  
  
  4. Dataset Validation
&lt;/h4&gt;

&lt;p&gt;Before starting the fine-tuning process, first check the validity of your dataset using the &lt;a href="https://github.com/aws-samples/amazon-bedrock-samples/tree/main/custom-models/bedrock-fine-tuning/nova/understanding/dataset_validation#dataset-validation-for-fine-tuning-nova-understanding-models" rel="noopener noreferrer"&gt;Dataset Validation for Fine-tuning Nova Understanding models&lt;/a&gt; script provided by the &lt;code&gt;aws-samples&lt;/code&gt; GitHub repository.&lt;/p&gt;

&lt;p&gt;Running the command &lt;code&gt;python3 nova_ft_dataset_validator.py -i &amp;lt;file path&amp;gt; -m &amp;lt;model name&amp;gt;&lt;/code&gt; will perform the check, and if all samples pass validation, the message &lt;code&gt;Validation successful, all samples passed&lt;/code&gt; will be displayed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-tuning
&lt;/h3&gt;

&lt;p&gt;Once dataset preparation is complete, the fine-tuning process is very simple. Just specify the S3 location where the dataset is stored in the Amazon Bedrock console and set the necessary hyperparameter values.&lt;/p&gt;

&lt;p&gt;For this training, we increased the Nova Lite model's epoch value from its default of 2 to 5, while keeping the default values for the other parameters.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys4haukxjxqelsgyhjyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys4haukxjxqelsgyhjyv.png" alt="Hyperparams" width="709" height="558"&gt;&lt;/a&gt;&lt;br&gt;
Upon completion of training, training result metrics are stored in the S3 location specified during the fine-tuning process. Through the &lt;code&gt;step_wise_training_metrics.csv&lt;/code&gt; file, you can check training loss values for each step and epoch, allowing you to confirm the model's learning progress.&lt;/p&gt;
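&lt;p&gt;For a quick look at learning progress, a sketch like this can parse the metrics file (the column name here is an assumption; check the header of the CSV Bedrock actually writes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

def training_losses(path):
    # Read the per-step loss values; "training_loss" is an assumed column name
    with open(path) as f:
        return [float(row["training_loss"]) for row in csv.DictReader(f)]

# e.g., plot training_losses("step_wise_training_metrics.csv") with matplotlib
# to confirm the loss decreases across steps and epochs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;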




&lt;h2&gt;
  
  
  🖍️ Fine-Tuning Text-Image-to-Text Results&lt;a id="results"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In this post, we used the &lt;code&gt;medium&lt;/code&gt; dataset of &lt;a href="https://huggingface.co/datasets/ShutterstockInc/high_resolution_images" rel="noopener noreferrer"&gt;🤗 &lt;code&gt;ShutterstockInc/high_resolution_images&lt;/code&gt;&lt;/a&gt;, which consists of 1,000 images.&lt;br&gt;
Of these, we used 900 images as training data, and the remaining 100 images were held out to verify model performance after fine-tuning. Given the limited amount of data, we conducted two training sessions, using 300 and 900 images respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nova Pro &amp;amp; Nova Lite Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, to check the performance difference between Nova Pro and Lite models without fine-tuning, we compared the analysis results for 100 images. The Jaccard similarity between the two models was found to be mostly distributed between 0.1 and 0.4.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1rnidjxcaxsa7v7hhxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1rnidjxcaxsa7v7hhxd.png" alt="case1" width="756" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nova Pro &amp;amp; Nova Lite (300 images)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After training with 300 samples, the Jaccard similarity improved to between 0.2 and 0.6. This shows that even with a relatively small amount of data, the Lite model can approach the Pro model's behavior.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7sixkmk79an861te8sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7sixkmk79an861te8sg.png" alt="case2" width="756" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nova Pro &amp;amp; Nova Lite (900 images)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After training with 900 samples, the Jaccard similarity again fell between 0.2 and 0.6, and compared to the model trained with 300 images (red), the model trained with 900 images (purple) showed slightly higher performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0kw82ny1mb2ekp1wi3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0kw82ny1mb2ekp1wi3z.png" alt="case3" width="756" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this experiment, we used only 900 images due to image data limitations, but Amazon Bedrock's image fine-tuning feature supports up to 20,000 data points. Therefore, we expect performance to improve further if fine-tuning is performed with more data.&lt;/p&gt;




&lt;h2&gt;
  
  
  💸 Model Customization Costs&lt;a id="cost"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;I've listed the costs incurred in the experiment, which I hope will help you estimate expected costs when planning future fine-tuning tasks. 🙃&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nova Lite Fine-Tuning Costs&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage Type&lt;/th&gt;
&lt;th&gt;Data Count&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Training Time&lt;/th&gt;
&lt;th&gt;Provisioned Throughput Cost (No Commitment)&lt;/th&gt;
&lt;th&gt;Model Storage Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;USE1-NovaLite-Customization-Training&lt;/td&gt;
&lt;td&gt;300 images&lt;/td&gt;
&lt;td&gt;About $2.10&lt;/td&gt;
&lt;td&gt;About 1 hour&lt;/td&gt;
&lt;td&gt;$108.15 per hour&lt;/td&gt;
&lt;td&gt;$1.95 per month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USE1-NovaLite-Customization-Training&lt;/td&gt;
&lt;td&gt;900 images&lt;/td&gt;
&lt;td&gt;About $7.50&lt;/td&gt;
&lt;td&gt;About 2 hours&lt;/td&gt;
&lt;td&gt;$108.15 per hour&lt;/td&gt;
&lt;td&gt;$1.95 per month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;These figures do not include the cost of generating the prompt-response pair data with the teacher model. To estimate that cost, run the task once, measure the token consumption, and calculate it separately.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Conclusion&lt;a id="outro"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In this post, we explored how to implement model distillation indirectly through Text-Image-to-Text fine-tuning in a situation where Amazon Bedrock does not officially support model distillation for Vision tasks.&lt;/p&gt;

&lt;p&gt;For successful VLM model distillation, a systematic dataset preparation process is essential. The steps of optimizing token consumption through image preprocessing, building reference data using teacher models, creating training datasets that meet Bedrock requirements, and validating datasets before fine-tuning directly impact model performance.&lt;/p&gt;

&lt;p&gt;Also, after completing fine-tuning, it's necessary to confirm the model's performance improvement through a validation process. In this article, we measured response consistency between models using Jaccard similarity and found that as the amount of data increased, the Lite model came closer to the Pro model's responses.&lt;/p&gt;

&lt;p&gt;While this indirect distillation method is not an officially supported feature, it shows that, with proper dataset composition and fine-tuning, a lightweight model can achieve results similar to a high-performance model. We hope that official support for Vision model distillation in Amazon Bedrock will expand in the future; until then, this approach can be useful in practice. I hope this methodology helps in your projects as well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤣 Actually, this post is part of what I experimented with while preparing for my AWS Seoul Summit 2025 presentation. I'll share the presentation video here when it becomes available!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>genai</category>
      <category>amazonnova</category>
      <category>modeldistillation</category>
      <category>aws</category>
    </item>
    <item>
      <title>Providing a caching layer for LLM with Langchain in AWS</title>
      <dc:creator>Jihun Lim</dc:creator>
      <pubDate>Sat, 23 Dec 2023 15:12:46 +0000</pubDate>
      <link>https://dev.to/heuri/providing-a-caching-layer-for-llm-with-langchain-in-aws-5h7g</link>
      <guid>https://dev.to/heuri/providing-a-caching-layer-for-llm-with-langchain-in-aws-5h7g</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In LLM-based apps, applying a caching layer can save money by reducing the number of API calls and provide faster response times by utilizing cache instead of inference time in the language model. In this post, let's take a look at how you can utilize the Redis offerings from AWS as a caching layer, including &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/vector-search-amazon-memorydb-redis-preview/" rel="noopener noreferrer"&gt;vector search for Amazon MemoryDB for Redis&lt;/a&gt;, which was recently released in preview.&lt;/p&gt;

&lt;p&gt;👇 Architecture with caching for LLM in AWS&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry3p4zp5njl6nau01xuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fry3p4zp5njl6nau01xuz.png" alt="cache_architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://python.langchain.com/docs/integrations/llms/llm_caching" rel="noopener noreferrer"&gt;LLM Caching integrations&lt;/a&gt; : 🦜️🔗, offerings include In Memory, SQLite, Redis, GPTCache, Cassandra, and more.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Caching in 🦜️🔗
&lt;/h2&gt;

&lt;p&gt;Currently, Langchain offers &lt;strong&gt;two major caching&lt;/strong&gt; methods and &lt;strong&gt;the option to choose whether to cache or not&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Cache: Determines cache hits for &lt;strong&gt;prompts&lt;/strong&gt; and &lt;strong&gt;responses&lt;/strong&gt; for exactly the same sentence.&lt;/li&gt;
&lt;li&gt;Semantic Cache: Determines cache hits for &lt;strong&gt;prompts&lt;/strong&gt; and &lt;strong&gt;responses&lt;/strong&gt; for semantically similar sentences.&lt;/li&gt;
&lt;li&gt;Optional Caching: Provides the ability to optionally apply a cache hit or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see how to use the RedisCache provided by Langchain with &lt;code&gt;Redis on EC2&lt;/code&gt; (self-installed), &lt;code&gt;ElastiCache for Redis&lt;/code&gt;, and &lt;code&gt;MemoryDB for Redis&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;✅ &lt;em&gt;Testing is conducted with the &lt;code&gt;Claude 2.1&lt;/code&gt; model through Bedrock in the SageMaker Notebook Instances environment.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🐳 Redis Stack on EC2
&lt;/h2&gt;

&lt;p&gt;This is how to install Redis directly on EC2 and use its VectorDB features. To use Redis's Vector Search, you need Redis Stack, which extends the core features of Redis OSS. I deployed the redis-stack image on EC2 via Docker and used it that way.&lt;/p&gt;

&lt;p&gt;👇 Installing the Redis Stack with Docker&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum update &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install &lt;/span&gt;docker &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;service docker start
&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; redis-stack &lt;span class="nt"&gt;-p&lt;/span&gt; 6379:6379 redis/redis-stack:latest
&lt;span class="nv"&gt;$ &lt;/span&gt;docker ps
&lt;span class="nv"&gt;$ &lt;/span&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; redis-stack


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Use &lt;strong&gt;redis-cli&lt;/strong&gt; to check for connection&lt;br&gt;
&lt;code&gt;$ redis-cli -c -h {$Cluster_Endpoint} -p {$PORT}&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once Redis is ready, install langchain, redis, and boto3 for using Amazon Bedrock.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;$ pip install langchain redis boto3 --quiet&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Standard Cache
&lt;/h3&gt;

&lt;p&gt;Next, import the libraries required for the Standard Cache.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.globals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_llm_cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.llms.bedrock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bedrock&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisCache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Write the code to invoke the LLM as follows, providing the caching layer via the &lt;code&gt;set_llm_cache()&lt;/code&gt; function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;ec2_redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://{EC2_Endpoiont}:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ec2_redis&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-v2:1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;set_llm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When measuring time using the built-in &lt;code&gt;%%time&lt;/code&gt; command in Jupyter, it can be observed that the Wall time significantly reduces from &lt;strong&gt;7.82s&lt;/strong&gt; to &lt;strong&gt;97.7ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jc6wxl01zmdp2me39k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jc6wxl01zmdp2me39k.png" alt="redisStandard"&gt;&lt;/a&gt;&lt;/p&gt;
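&lt;p&gt;&lt;em&gt;Conceptually, the Standard Cache keys on the exact prompt string. Here is a minimal, self-contained sketch of that lookup behavior — the &lt;code&gt;ExactMatchCache&lt;/code&gt; class and &lt;code&gt;invoke&lt;/code&gt; helper are illustrative stand-ins I wrote for this post, not the Langchain/Bedrock API:&lt;/em&gt;&lt;/p&gt;

```python
class ExactMatchCache:
    """Toy stand-in for RedisCache: the key is the exact prompt string."""
    def __init__(self):
        self._store = {}

    def lookup(self, prompt):
        return self._store.get(prompt)

    def update(self, prompt, answer):
        self._store[prompt] = answer


def invoke(cache, prompt):
    """Return (answer, cache_hit); the f-string stands in for a real model call."""
    hit = cache.lookup(prompt)
    if hit is not None:
        return hit, True
    answer = f"answer to: {prompt}"
    cache.update(prompt, answer)
    return answer, False


cache = ExactMatchCache()
print(invoke(cache, "Where is Las Vegas?")[1])  # False: miss, the model is called
print(invoke(cache, "Where is Las Vegas?")[1])  # True: identical prompt, cache hit
print(invoke(cache, "Where is Vegas?")[1])      # False: different string, miss again
```

&lt;p&gt;&lt;em&gt;Note the last call: any change to the prompt string misses the cache, which is exactly the limitation the Semantic Cache addresses.&lt;/em&gt;&lt;/p&gt;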

&lt;h3&gt;
  
  
  Semantic Cache
&lt;/h3&gt;

&lt;p&gt;The Redis Stack Docker image I used supports a vector similarity search feature called &lt;a href="https://github.com/RediSearch/RediSearch" rel="noopener noreferrer"&gt;RediSearch&lt;/a&gt;. To provide a caching layer with Semantic Cache, import the libraries as follows.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.globals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_llm_cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisSemanticCache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.llms.bedrock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bedrock&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.embeddings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockEmbeddings&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unlike the Standard Cache, the Semantic Cache uses an embedding model to match queries with semantically similar meanings, so we'll use the &lt;strong&gt;Amazon Titan Embedding&lt;/strong&gt; model.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-v2:1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bedrock_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;set_llm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RedisSemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ec2_redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_embeddings&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Querying for the location of &lt;strong&gt;Las Vegas&lt;/strong&gt; and then making a second query for &lt;strong&gt;Vegas&lt;/strong&gt;, which is semantically similar, produced a cache hit: the Wall time dropped dramatically from &lt;strong&gt;4.6s&lt;/strong&gt; to &lt;strong&gt;532ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogzjxfc9knxhp1uq8dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogzjxfc9knxhp1uq8dg.png" alt="redisSemantic"&gt;&lt;/a&gt;&lt;/p&gt;
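&lt;p&gt;&lt;em&gt;To see why "Vegas" can hit a cache entry for "Las Vegas", here is a self-contained sketch of similarity-based lookup. The bag-of-words "embedding" and the &lt;code&gt;SemanticCache&lt;/code&gt; class are toy stand-ins for Titan Embeddings and RedisSemanticCache, not their real APIs:&lt;/em&gt;&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class SemanticCache:
    """Toy stand-in for RedisSemanticCache: a hit is any stored entry whose
    embedding similarity to the query meets the threshold."""
    def __init__(self, embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, answer)

    def lookup(self, prompt):
        v = self.embed(prompt)
        for vec, answer in self.entries:
            if cosine(v, vec) >= self.threshold:
                return answer
        return None

    def update(self, prompt, answer):
        self.entries.append((self.embed(prompt), answer))

# Toy "embedding": bag-of-words over a tiny vocabulary, standing in for Titan.
VOCAB = ["where", "is", "las", "vegas", "located"]
def embed(text):
    words = text.lower().replace("?", "").split()
    return [words.count(w) for w in VOCAB]

cache = SemanticCache(embed, threshold=0.8)
cache.update("Where is Las Vegas located?", "Nevada, USA")
print(cache.lookup("Where is Vegas located?"))         # Nevada, USA (similar enough)
print(cache.lookup("What is the capital of France?"))  # None (below threshold)
```

&lt;p&gt;&lt;em&gt;The real setup works the same way, just with Titan embeddings and Redis vector search instead of a Python loop: a near-duplicate question lands above the similarity threshold and returns the cached answer without calling the LLM.&lt;/em&gt;&lt;/p&gt;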




&lt;h2&gt;
  
  
  ☁️ Amazon ElastiCache (Serverless) for Redis
&lt;/h2&gt;

&lt;p&gt;Amazon ElastiCache is a fully managed, Redis-compatible service. Simply swapping the &lt;code&gt;Redis on EC2&lt;/code&gt; endpoint for the ElastiCache endpoint in the same code yields the following results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗️ &lt;em&gt;If you are using &lt;a href="https://aws.amazon.com/ko/blogs/aws/amazon-elasticache-serverless-for-redis-and-memcached-now-generally-available/" rel="noopener noreferrer"&gt;ElastiCache Serverless&lt;/a&gt;, which was announced on 11/27/2023, there are some differences. When specifying the 'url', you need to write &lt;code&gt;rediss:&lt;/code&gt; instead of &lt;code&gt;redis:&lt;/code&gt; as it encrypts the data in transit via &lt;code&gt;TLS&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
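&lt;p&gt;&lt;em&gt;The scheme difference can be captured in a small helper — a hypothetical convenience function I'm sketching here, not part of Langchain:&lt;/em&gt;&lt;/p&gt;

```python
def cache_url(endpoint, port=6379, tls=False):
    """Build a Redis connection URL; TLS-only endpoints such as ElastiCache
    Serverless (and MemoryDB) need the rediss:// scheme instead of redis://."""
    scheme = "rediss" if tls else "redis"
    return f"{scheme}://{endpoint}:{port}"


print(cache_url("my-ec2-host"))                    # redis://my-ec2-host:6379
print(cache_url("my-serverless-cache", tls=True))  # rediss://my-serverless-cache:6379
```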

&lt;p&gt;⚡️ &lt;em&gt;How to enable TLS with &lt;code&gt;redis-cli&lt;/code&gt; on Amazon Linux 2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Build the &lt;code&gt;redis-cli&lt;/code&gt; utility with the TLS option enabled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;openssl-devel gcc
&lt;span class="nv"&gt;$ &lt;/span&gt;wget http://download.redis.io/redis-stable.tar.gz
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;tar &lt;/span&gt;xvzf redis-stable.tar.gz
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;redis-stable
&lt;span class="nv"&gt;$ &lt;/span&gt;make distclean
&lt;span class="nv"&gt;$ &lt;/span&gt;make redis-cli &lt;span class="nv"&gt;BUILD_TLS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 755 src/redis-cli /usr/local/bin/


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Connectivity : &lt;code&gt;$ redis-cli -c -h {$Cluster_Endpoint} --tls -p {$PORT}&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Cache
&lt;/h3&gt;

&lt;p&gt;Since the Standard Cache stores no separate embedding values, LLM caching works on ElastiCache, which is built on Redis OSS technology. For the same question, the Wall time drops significantly from &lt;strong&gt;45.4ms&lt;/strong&gt; to &lt;strong&gt;2.76ms&lt;/strong&gt; across two iterations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf1ck8177a6a7zas0owu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf1ck8177a6a7zas0owu.png" alt="ecStandard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Cache
&lt;/h3&gt;

&lt;p&gt;For the Semantic Cache, on the other hand, ElastiCache does not support Vector Search, so running the same code as above produces the following error: &lt;code&gt;ResponseError: unknown command 'module', with args beginning with: LIST&lt;/code&gt;. This occurs because ElastiCache does not expose the &lt;code&gt;MODULE LIST&lt;/code&gt; command and does not load RediSearch. In other words, ElastiCache provides no Vector Search, so the Semantic Cache cannot be used.&lt;/p&gt;
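&lt;p&gt;&lt;em&gt;One way to fail fast before enabling the Semantic Cache is to probe for the module — a sketch assuming a redis-py-style client that exposes &lt;code&gt;module_list()&lt;/code&gt;; the &lt;code&gt;NoModuleClient&lt;/code&gt; stub below only mimics ElastiCache's rejection of the command:&lt;/em&gt;&lt;/p&gt;

```python
def supports_redisearch(client):
    """Return True if the server reports the RediSearch ("search") module.
    MODULE LIST raises on ElastiCache/MemoryDB, which we treat as unsupported."""
    try:
        modules = client.module_list()  # redis-py: MODULE LIST
    except Exception:
        return False  # e.g. "unknown command 'module'" on ElastiCache
    names = {m.get("name") or m.get(b"name", b"").decode() for m in modules}
    return "search" in names


# Stub mimicking ElastiCache, which rejects the MODULE command entirely:
class NoModuleClient:
    def module_list(self):
        raise RuntimeError("unknown command 'module', with args beginning with: LIST")


print(supports_redisearch(NoModuleClient()))  # False
```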




&lt;h2&gt;
  
  
  ⛅️ Amazon MemoryDB for Redis
&lt;/h2&gt;

&lt;p&gt;MemoryDB is another Redis-compatible in-memory database service from AWS, with added durability. Again, it works well with the Standard Cache, which stores no embedding values, but the Semantic Cache returns the same error message as ElastiCache, because standard MemoryDB does not support Vector Search either.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗️ &lt;em&gt;Note that MemoryDB also uses &lt;code&gt;TLS&lt;/code&gt; by default, just like ElastiCache Serverless.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Standard Cache
&lt;/h3&gt;

&lt;p&gt;Since MemoryDB does not support Vector search, I will only cover the Standard Cache case in this section. For the same question, the Wall time per iteration drops from &lt;strong&gt;6.67s&lt;/strong&gt; to &lt;strong&gt;38.2ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrccf1xp54gprval34d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrccf1xp54gprval34d4.png" alt="mmrStandard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌩️ Vector search for Amazon MemoryDB for Redis
&lt;/h2&gt;

&lt;p&gt;Finally, it's time for MemoryDB with Vector search. This newly launched capability, available in Public Preview, runs on the same MemoryDB service: when creating a cluster, you can activate &lt;strong&gt;Vector search&lt;/strong&gt;, but this setting cannot be changed after the cluster is created.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗️ &lt;em&gt;The content is based on testing during the 'public preview' stage and the results may vary in the future.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Standard Cache
&lt;/h3&gt;

&lt;p&gt;For the same question, it can be observed that the Wall time for each iteration has reduced from &lt;strong&gt;14.8s&lt;/strong&gt; to &lt;strong&gt;2.13ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0skj23yx653vpxwfvkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0skj23yx653vpxwfvkf.png" alt="vmmrStandard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Cache
&lt;/h3&gt;

&lt;p&gt;Before running this test, I expected the same results as with the Redis Stack, since Vector search is supported. However, I got the same error messages as with the Redis offerings that do not support Vector Search.&lt;/p&gt;

&lt;p&gt;Of course, the lack of Langchain Cache support doesn't mean this update fails to deliver Vector search; I'll clarify this in the next section.&lt;/p&gt;




&lt;h2&gt;
  
  
  Redis as a Vector Database
&lt;/h2&gt;

&lt;p&gt;If you check the &lt;a href="https://github.com/aws-samples/amazon-memorydb-for-redis-samples/tree/main/tutorials/langchain-memorydb" rel="noopener noreferrer"&gt;Langchain MemoryDB Github&lt;/a&gt; on aws-samples, you can find example code to utilize Redis as a VectorStore. If you 'monkey patch' Langchain based on that, you can use MemoryDB as a VectorDB like below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgr19a9gu8l1h49ow7bl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgr19a9gu8l1h49ow7bl.png" alt="vmmrSemantic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, the cache is implemented using the &lt;a href="https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-examples.html#vector-search-examples-foundational-model-buffer-memory" rel="noopener noreferrer"&gt;Foundation Model (FM) Buffer Memory&lt;/a&gt; method introduced in the AWS documentation. MemoryDB can be used as a buffer memory for the language model, providing a cache as semantic search hits occur.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗️ &lt;em&gt;This example is only possible on MemoryDB with Vector search enabled. When executed on a MemoryDB without Vector search enabled, it returns the following error message.&lt;/em&gt; &lt;code&gt;ResponseError: -ERR Command not enabled, instance needs to be configured for Public Preview for Vector Similarity Search&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Outro
&lt;/h2&gt;

&lt;p&gt;The test results so far are tabulated as follows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langchain Cache Test Results&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache/DB&lt;/th&gt;
&lt;th&gt;Redis Stack on EC2&lt;/th&gt;
&lt;th&gt;ElastiCache (Serverless)&lt;/th&gt;
&lt;th&gt;MemoryDB&lt;/th&gt;
&lt;th&gt;VectorSearch MemoryDB (Preview)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;O&lt;/td&gt;
&lt;td&gt;O&lt;/td&gt;
&lt;td&gt;O&lt;/td&gt;
&lt;td&gt;O&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;O&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;Partial support (expected to be available in the future)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since Langchain already supports many AWS services, it would be nice to see MemoryDB in the Langchain documentation as well. I originally planned to test only MemoryDB with Vector search, but out of curiosity I kept adding test targets. Nevertheless, it was fun to explore the various Redis-compatible services on AWS and their subtle differences, such as whether they require TLS.&lt;/p&gt;

&lt;p&gt;Thanks for taking the time to read this, and please point out any errors! 😃&lt;/p&gt;

</description>
      <category>llm</category>
      <category>langchain</category>
      <category>memorydb</category>
      <category>cache</category>
    </item>
  </channel>
</rss>
