Mai Chi Bao

How to finetune Qwen2 VL model on custom dataset

Key Takeaways

  • Learn how to create a custom dataset and fine-tune the Qwen2VL model on it.
  • Understand the importance of fine-tuning models for real-life tasks.

Why Fine-Tuning?

You may find numerous resources explaining why fine-tuning a model is beneficial. Below are some real-world challenges that motivated me to fine-tune a model:

  1. Privacy Concerns: Closed-source models like ChatGPT or Gemini may not be an option due to privacy policies. Fine-tuning an open model that you host yourself keeps sensitive data on your own infrastructure.
  2. Resource Constraints: Running large models with over 7B parameters can be resource-intensive. A smaller model fine-tuned for one narrow task can give comparable results on that task with reduced computational demands.
  3. Domain-Specific Tasks: Some tasks require customization for niche domains, and without fine-tuning, the model may fail to perform effectively.

Problem Statement

Today's challenge is to read a golf scorecard like the one below. The goal is to extract the names of players and their scores for each hole and return the data in JSON format.

Golf Scorecard Example 1

Golf Scorecard Example 2

Challenges:

  • Handwritten Characters: Difficult to interpret.
  • Non-Standardized Format: Unlike identification cards or passports, the layout lacks a fixed structure.
  • Variable Image Quality: Images captured by customers vary in lighting and angles.
  • Output Format: Qwen2VL is not optimized for JSON output. Moreover, ensuring deterministic results (same image yields identical results) requires additional processing.
  • High Accuracy: The task demands both high precision and reasonable processing time. Fine-tuning focuses the model on specific tasks, reducing hallucination and improving relevance.
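
To make the output-format challenge concrete, here is a minimal sketch of how the model's reply could be checked against the JSON shape used later in this post. The function name parse_scorecard is my own illustration, not part of Qwen2VL or the fine-tuning repo, and it assumes scores are integers.

```python
import json


def parse_scorecard(raw: str) -> dict:
    """Parse model output shaped like {"golf": [{"name": str, "scores": [int, ...]}]}."""
    # Keep only the outermost {...} in case the model adds text around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(raw[start:end + 1])

    players = data.get("golf")
    if not isinstance(players, list):
        raise ValueError('expected a "golf" list')
    for player in players:
        if not isinstance(player, dict) or not isinstance(player.get("name"), str):
            raise ValueError("each player needs a string name")
        scores = player.get("scores")
        # Assumption: scores are plain integers, as in the examples in this post.
        if not isinstance(scores, list) or not all(isinstance(s, int) for s in scores):
            raise ValueError("scores must be a list of integers")
    return data
```

A reply that fails this check can be retried or flagged for manual review instead of being passed downstream.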

Fine-Tuning Process

Requirements:

  • A GPU with at least 12 GB VRAM.
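
If you are not sure whether your machine meets this requirement, the sketch below reads the total memory of each GPU from nvidia-smi. The helper names and the 12,000 MiB threshold are my own assumptions: some cards sold as 12 GB may report slightly less than 12288 MiB, so the threshold is deliberately approximate.

```python
import subprocess

# Assumption: treat "12 GB" as roughly 12,000 MiB, because reported totals
# can be a little below 12 * 1024 MiB on some cards.
MIN_VRAM_MIB = 12_000


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Turn one-number-per-line nvidia-smi output into a list of MiB values."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(vram_mib: list[int], minimum: int = MIN_VRAM_MIB) -> list[int]:
    """Return the positions (in nvidia-smi order) of GPUs that meet the minimum."""
    return [i for i, mib in enumerate(vram_mib) if mib >= minimum]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU, in MiB. Requires the NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)
```

On a machine with the NVIDIA driver installed, gpus_with_enough_vram(query_vram_mib()) lists the GPUs that should be large enough.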

We will fine-tune the Qwen2VL 2B Instruct model to address this problem.

Steps:

1. Install Requirements

git clone https://github.com/zhangfaen/finetune-Qwen2-VL
cd finetune-Qwen2-VL
conda create --name qwen2-VL python=3.10
conda activate qwen2-VL
pip install -r requirements.txt

2. Prepare the Dataset

  • Copy all images into the train_data folder.
  • Create a train_data/data.json file with the following structure:
[
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. You will help me to output the result of golf score card, the result should be in json format, no annotation, and remember to close any brackets."},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "path_to_img_1.png"},
                    {"type": "text", "text": "Return the score of each player as this json format {\"golf\": [{\"name\": \"\", \"scores\": []}, {\"name\": \"\", \"scores\": []} ] }"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": "{\"golf\": [{\"name\": \"jason\", \"scores\": [1,2,3,4,5,6,7,8,9]}, {\"name\": \"Britney\", \"scores\": [2,4,5,7,8,9,1,2,0]} ] }"}
                ]
            }
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. You will help me to output the result of golf score card, the result should be in json format, no annotation, and remember to close any brackets."},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "path_to_img_2.png"},
                    {"type": "text", "text": "Return the score of each player as this json format {\"golf\": [{\"name\": \"\", \"scores\": []}, {\"name\": \"\", \"scores\": []} ] }"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": "{\"golf\": [{\"name\": \"jason\", \"scores\": [1,2,3,4,5,6,7,8,9]}, {\"name\": \"Britney\", \"scores\": [2,4,5,7,8,9,1,2,0]} ] }"}
                ]
            }
        ]
    }
]
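
Writing this file by hand gets tedious beyond a few images. Below is a minimal sketch of generating it from a dictionary of labels instead. The helpers make_sample and write_dataset and the shape of the labels dictionary are my own assumptions for illustration. The prompts are copied from the structure above.

```python
import json
from pathlib import Path

SYSTEM_PROMPT = (
    "You are a helpful assistant. You will help me to output the result of golf "
    "score card, the result should be in json format, no annotation, and remember "
    "to close any brackets."
)
USER_PROMPT = (
    'Return the score of each player as this json format '
    '{"golf": [{"name": "", "scores": []}, {"name": "", "scores": []} ] }'
)


def make_sample(image_path: str, players: list[dict]) -> dict:
    """Build one training sample in the messages format shown above."""
    # The assistant answer is stored as a JSON string, not as a nested object.
    answer = json.dumps({"golf": players})
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": USER_PROMPT},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


def write_dataset(labels: dict[str, list[dict]], out_file: str) -> None:
    """Write all samples to out_file. labels maps image path -> list of players."""
    samples = [make_sample(path, players) for path, players in labels.items()]
    Path(out_file).write_text(json.dumps(samples, indent=4), encoding="utf-8")
```

For example, write_dataset({"path_to_img_1.png": [{"name": "jason", "scores": [1, 2, 3]}]}, "train_data/data.json") would produce a file with a single sample.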

3. Adjust Parameters

  • Modify the relevant parameters in finetune.py (e.g., batch_size, padding, model type).
  • Update GPU settings in finetune.sh. For example:
CUDA_VISIBLE_DEVICES="0"
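
Before starting a run, it can save time to sanity-check the dataset file. The sketch below looks for missing image files and for assistant answers that are not valid JSON. find_problems is a hypothetical helper of mine, and image_root assumes image paths are relative to the folder you pass in, so adjust it to match your setup.

```python
import json
from pathlib import Path


def find_problems(samples: list[dict], image_root: str = ".") -> list[str]:
    """Return human-readable problems found in a list of training samples."""
    problems = []
    for i, sample in enumerate(samples):
        for message in sample.get("messages", []):
            content = message.get("content")
            if not isinstance(content, list):
                continue  # the system prompt is a plain string, nothing to check
            for part in content:
                if part.get("type") == "image":
                    # Assumption: image paths are relative to image_root.
                    if not (Path(image_root) / part["image"]).is_file():
                        problems.append(f"sample {i}: missing image {part['image']}")
                if message.get("role") == "assistant" and part.get("type") == "text":
                    try:
                        json.loads(part["text"])
                    except json.JSONDecodeError:
                        problems.append(f"sample {i}: assistant text is not valid JSON")
    return problems
```

A typical call would be find_problems(json.loads(Path("train_data/data.json").read_text(encoding="utf-8"))). An empty list means no problems were found by these two checks.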

4. Run Fine-Tuning

./finetune.sh

Conclusion

That's it! You have now fine-tuned the Qwen2VL model for your specific task. With these steps, you can adapt the model to meet your needs while ensuring privacy and efficiency.

Happy fine-tuning!

Reference

zhangfaen/finetune-Qwen2-VL

More
If you’d like to learn more, be sure to check out my other posts and give me a like! It would mean a lot to me. Thank you.

Top comments (1)

Mai Chi Bao

💡 Wow, this is awesome! I thought fine-tuning this VLM model would be so hard before reading this post. 🙌