Mai Chi Bao

📈 From Images to JSON: Fine-Tuning Qwen2VL on custom Dataset

Key Takeaways

  • Learn how to create a custom dataset and fine-tune the Qwen2VL model on it.
  • Understand the importance of fine-tuning models for real-life tasks.

Why Fine-Tuning?

You may find numerous resources explaining why fine-tuning a model is beneficial. Below are some real-world challenges that motivated me to fine-tune a model:

  1. Privacy Concerns: Closed-source models like ChatGPT or Gemini may not be an option due to privacy policies. Fine-tuning an open model that you host yourself keeps sensitive data from leaving your infrastructure.
  2. Resource Constraints: Running large models with over 7B parameters can be resource-intensive. A smaller model fine-tuned for a single task can approach similar results on that task with reduced computational demands.
  3. Domain-Specific Tasks: Some tasks require customization for niche domains, and without fine-tuning, the model may fail to perform effectively.

Problem Statement

Today's challenge is to read a golf scorecard like the one below. The goal is to extract the names of players and their scores for each hole and return the data in JSON format.

Golf Scorecard Example 1

Golf Scorecard Example 2

Challenges:

  • Handwritten Characters: Difficult to interpret.
  • Non-Standardized Format: Unlike identification cards or passports, the layout lacks a fixed structure.
  • Variable Image Quality: Images captured by customers vary in lighting and angles.
  • Output Format: Qwen2VL is not optimized for JSON output. Moreover, ensuring deterministic results (same image yields identical results) requires additional processing.
  • High Accuracy: The task demands both high precision and reasonable processing time. Fine-tuning focuses the model on specific tasks, reducing hallucination and improving relevance.
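
The "additional processing" mentioned under Output Format can start with something as small as stripping the Markdown fence the model tends to wrap around its answer and parsing what is left. The sketch below shows one way to do that. `extract_json` is a hypothetical helper name, not part of Qwen2VL or the fine-tuning repository, and it assumes the reply contains at most one fenced block.

```python
import json
import re

# Three backticks, built this way so the example itself stays easy to embed.
FENCE = "`" * 3

# Matches an optional "json" tag and captures whatever sits between two fences.
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL,
)


def extract_json(reply: str):
    """Return the parsed JSON payload of a model reply, or None if it does not parse."""
    match = _FENCED.search(reply)
    payload = match.group(1) if match else reply.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # Truncated or malformed output: the caller decides whether to retry.
        return None


reply = FENCE + 'json\n{"golf": [{"name": "A Quang", "score": [1, 1, 3]}]}\n' + FENCE
print(extract_json(reply))
```

Returning `None` instead of raising keeps the calling code simple when a reply is cut off mid-array, which is a failure mode worth planning for with small models.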

Fine-Tuning Process

Requirements:

  • A GPU with at least 16 GB VRAM.

We will fine-tune the Qwen2VL 2B Instruct model to address this problem.

Steps:

1. Install Requirements

git clone https://github.com/zhangfaen/finetune-Qwen2-VL
cd finetune-Qwen2-VL
conda create --name qwen2-VL python=3.10
conda activate qwen2-VL
pip install -r requirements.txt

2. Prepare the Dataset

  • Copy all images into the train_data folder. The images can be found here
  • Create a train_data/data.json file with the following structure:
[
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "train_data/score_card_2/IMG_9376.jpg"
                    },
                    {
                        "type": "text",
                        "text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "```json\n[\n    {\n        \"golf\": [\n            {\"name\": \"A Quang\", \"score\": [1, 1, 3, 1, 1, 2, 2, 2, 0]},\n            {\"name\": \"A Hiệp\", \"score\": [2, 1, 1, 2, 1, 1, 1, 0, 1]},\n            {\"name\": \"A Hiếu\", \"score\": [3, 2, 0, 2, 2, -1, 1, 2, 1]},\n            {\"name\": \"A Dũng\", \"score\": [0, 1, 1, 1, 4, 2, 2, 0, 0]}\n        ]\n    }\n]\n```"
                    }
                ]
            }            
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "train_data/score_card_2/IMG_9378.JPG"
                    },
                    {
                        "type": "text",
                        "text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "```json\n[\n    {\n        \"golf\": [\n            {\n                \"name\": \"A Quang\", \n                \"score\": [1, 0, 0, -1, 1, 0, 0, 1, 1, 3, 1, 1, 1, 2, 1, 1, 3, 3, 1, 14, 17]\n            },\n            {\n                \"name\": \"A Thành\", \n                \"score\": [4, 1, 0, 0, 2, 1, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 9]\n            },\n            {\n                \"name\": \"A Tuấn\", \n                \"score\": [4, 1, 0, 0, 2, 2, 1, 1, 2, 13, 1, 0, 0, 1, 2, 1, 3, 2, 0, 10, 23]\n            }\n        ]\n    }\n]\n```"
                    }
                ]
            }            
        ]
    }
]
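
Writing these entries by hand gets error-prone once the dataset grows, mostly because of the escaped quotes inside the assistant text. A minimal sketch of generating one entry from plain Python data follows. `make_sample` and `USER_PROMPT` are hypothetical names introduced here, the prompt is abbreviated, and the message layout simply mirrors the structure shown above.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that responds in JSON format with the golf "
    "score card information. Do not add any annotations there. "
    "Remember to close any bracket."
)
# Abbreviated placeholder: use the full user prompt from the example above.
USER_PROMPT = "From this image, return json ..."
FENCE = "`" * 3  # three backticks


def make_sample(image_path, players, user_prompt=USER_PROMPT):
    """Build one data.json entry; `players` is a list of {"name", "score"} dicts."""
    # ensure_ascii=False keeps Vietnamese names readable instead of \u escapes.
    answer = json.dumps([{"golf": players}], ensure_ascii=False, indent=4)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": user_prompt},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": f"{FENCE}json\n{answer}\n{FENCE}"},
            ]},
        ]
    }


sample = make_sample(
    "train_data/score_card_2/IMG_9376.jpg",
    [{"name": "A Quang", "score": [1, 1, 3, 1, 1, 2, 2, 2, 0]}],
)
# The full dataset is a list of such entries, serialized the same way.
print(json.dumps([sample], ensure_ascii=False, indent=4))
```

Letting `json.dumps` produce the assistant text means the escaping is handled by the library, so a stray quote or missing bracket in a label cannot end up in the training file.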

3. Adjust Parameters

  • Modify relevant parameters in finetune.py (e.g., batch_size, padding, model type).
  • Update GPU settings in finetune.sh. For example:
CUDA_VISIBLE_DEVICES="0"

4. Run Fine-Tuning

./finetune.sh

Comparison

This is the original image

Image description

Gemini 1.5 Flash

{"golf": [{"name": "Leked", "score": ["6", "5", "6", "4", "7", "4", "6", "5", "7", "36", "3", "6", "3", "5", "5", "5", "5", "3", "4", "36"]}]}

Gemini 1.5 Flash demonstrates strong capability in recognizing handwritten numbers, but it fails to return results for the second row of scores. Additionally, the Vietnamese name is incorrect, indicating limitations in handling names from this region.

Without Fine-Tuning (Base Qwen2VL)

['```json\n'
 '{\n'
 '  "player": "Mr Lê Hiên",\n'
 '  "scores": [\n'
 '    [65, 64, 47, 46, 57],\n'
 '    [21, 24, 20, 12, 23],\n'
 '    [10, 12, 20, 12, 23],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, ']



The base Qwen2VL (2B) model manages to produce a mostly correct name, but the overall result is poor. The score format is inconsistent with the expected structure, and the output repeats the same row until it is cut off, suggesting the model's limitations in structured JSON generation. This issue likely arises because Qwen2VL was not explicitly trained on JSON datasets, so forcing it to generate JSON causes performance degradation.
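
Repetition like this is easy to catch automatically before a reply is passed on. The sketch below counts the longest run of identical consecutive lines. Both function names are hypothetical, and the threshold of 5 is an arbitrary assumption that would need tuning on real outputs.

```python
def longest_repeat_run(text: str) -> int:
    """Length of the longest run of identical consecutive non-empty lines."""
    best = run = 0
    previous = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Extend the current run if the line repeats, otherwise start a new one.
        run = run + 1 if line == previous else 1
        best = max(best, run)
        previous = line
    return best


def looks_degenerate(text: str, max_run: int = 5) -> bool:
    """Flag replies where one line repeats more than `max_run` times in a row."""
    return longest_repeat_run(text) > max_run


# A reply shaped like the base model output above: one row repeated, then cut off.
reply = '{\n  "scores": [\n' + "    [11, 15, 13, 17, 7],\n" * 20 + "    [11, 15, "
print(longest_repeat_run(reply), looks_degenerate(reply))
```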

With Fine-Tuning

This is the result



['```json\n'
 '{\n'
 '    "golf": [\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [6, 5, 6, 4, 7, 4, 6, 5, 7]\n'
 '        },\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [2, 1, 2, 2, 1, 2, 0, 1, 2, 3]\n'
 '        },\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [1, 2, 2, 1, 2, 0, 1, 2, 3, 0]\n'
 '        }\n'
 '    ]\n'
 '}\n'
 '```']



The fine-tuned Qwen2VL model produces results that are much closer to the desired output. The JSON structure is correctly formatted, and the Vietnamese name is accurately captured. The model also performs well in extracting handwritten scores, though minor inaccuracies remain. These can be attributed to the limited fine-tuning dataset of approximately 35 images. Expanding the training dataset could further improve accuracy and consistency.
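
Because scorecards carry their own subtotals, some of these inaccuracies can be flagged without a human in the loop. The sketch below assumes the row layout seen in the second training label: nine front scores, their sum, nine back scores, their sum, then the grand total. `check_totals` is a hypothetical helper, and shorter rows are only checked as far as they go.

```python
def check_totals(score):
    """Return a list of readable problems; an empty list means the row is consistent."""
    problems = []
    # Element 10 (index 9) is assumed to be the sum of the front nine.
    if len(score) >= 10 and sum(score[:9]) != score[9]:
        problems.append(f"front nine sums to {sum(score[:9])}, card says {score[9]}")
    # Element 20 (index 19) is assumed to be the sum of the back nine.
    if len(score) >= 20 and sum(score[10:19]) != score[19]:
        problems.append(f"back nine sums to {sum(score[10:19])}, card says {score[19]}")
    # Element 21 (index 20) is assumed to be the grand total.
    if len(score) >= 21 and score[9] + score[19] != score[20]:
        problems.append(f"totals add up to {score[9] + score[19]}, card says {score[20]}")
    return problems


# First row of the second training label, then the second row of the fine-tuned output.
print(check_totals([1, 0, 0, -1, 1, 0, 0, 1, 1, 3, 1, 1, 1, 2, 1, 1, 3, 3, 1, 14, 17]))
print(check_totals([2, 1, 2, 2, 1, 2, 0, 1, 2, 3]))
```

Under that layout assumption, the training row passes, while the second row of the fine-tuned output is flagged: its first nine values sum to 13 but the tenth is 3. A flagged row does not say which digit is wrong, only that the row deserves a second look.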

Conclusion

That's it! You have now fine-tuned the Qwen2VL model for your specific task. With these steps, you can adapt the model to meet your needs while ensuring privacy and efficiency.

Happy fine-tuning!

Reference

zhangfaen/finetune-Qwen2-VL

More
If you’d like to learn more, be sure to check out my other posts and give me a like! It would mean a lot to me. Thank you.


Top comments (2)

Mai Chi Bao

💡 Wow, this is awesome! I thought fine-tuning this VLM model would be so hard before reading this post. 🙌

Mai Chi Bao

✅ I've edited the post to include an example showcasing the power of fine-tuning. Hope u enjoy!
