Key Takeaways
- Learn how to create a custom dataset and fine-tune the Qwen2VL model on it.
- Understand the importance of fine-tuning models for real-life tasks.
Why Fine-Tuning?
You may find numerous resources explaining why fine-tuning a model is beneficial. Below are some real-world challenges that motivated me to fine-tune a model:
- Privacy Concerns: Closed-source models like ChatGPT or Gemini may not be an option due to privacy policies. Fine-tuning a model you host yourself keeps sensitive data from leaving your environment.
- Resource Constraints: Running large models with over 7B parameters can be resource-intensive. A smaller model fine-tuned for one narrow task can deliver comparable results with reduced computational demands.
- Domain-Specific Tasks: Some tasks require customization for niche domains, and without fine-tuning, the model may fail to perform effectively.
Problem Statement
Today's challenge is to read a golf scorecard like the one below. The goal is to extract the names of players and their scores for each hole and return the data in JSON format.
Challenges:
- Handwritten Characters: Difficult to interpret.
- Non-Standardized Format: Unlike identification cards or passports, the layout lacks a fixed structure.
- Variable Image Quality: Images captured by customers vary in lighting and angles.
- Output Format: Qwen2VL is not optimized for JSON output. Moreover, ensuring deterministic results (same image yields identical results) requires additional processing.
- High Accuracy: The task demands both high precision and reasonable processing time. Fine-tuning focuses the model on specific tasks, reducing hallucination and improving relevance.
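The "additional processing" mentioned under Output Format can start with something as small as stripping the Markdown fence the model wraps around its answer and checking that the rest is valid JSON. The following is a minimal sketch, not part of the original pipeline; the function name `parse_model_reply` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker the model wraps around its answer

# Matches an optional fence (with or without the "json" tag) around the payload.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_reply(reply: str):
    """Return the parsed JSON payload, or None if the reply is not valid JSON.

    Hypothetical helper: a truncated or malformed reply yields None so the
    caller can retry or flag the image for manual review.
    """
    match = FENCE_RE.match(reply)
    payload = match.group(1) if match else reply.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None
```

A reply that is cut off mid-list, like the base model output later in this post, has no closing fence and fails to parse, so it comes back as `None` instead of raising.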
Fine-Tuning Process
Requirements:
- A GPU with at least 16 GB VRAM.
We will fine-tune the Qwen2VL 2B Instruct model to address this problem.
Steps:
1. Install Requirements
git clone https://github.com/zhangfaen/finetune-Qwen2-VL
cd finetune-Qwen2-VL
conda create --name qwen2-VL python=3.10
conda activate qwen2-VL
pip install -r requirements.txt
2. Prepare the Dataset
- Copy all images into the train_data folder. The images can be found here.
- Create a train_data/data.json file with the following structure:
[
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "train_data/score_card_2/IMG_9376.jpg"
},
{
"type": "text",
"text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "```json\n[\n {\n \"golf\": [\n {\"name\": \"A Quang\", \"score\": [1, 1, 3, 1, 1, 2, 2, 2, 0]},\n {\"name\": \"A Hiệp\", \"score\": [2, 1, 1, 2, 1, 1, 1, 0, 1]},\n {\"name\": \"A Hiếu\", \"score\": [3, 2, 0, 2, 2, -1, 1, 2, 1]},\n {\"name\": \"A Dũng\", \"score\": [0, 1, 1, 1, 4, 2, 2, 0, 0]}\n ]\n }\n]\n```"
}
]
}
]
},
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "train_data/score_card_2/IMG_9378.JPG"
},
{
"type": "text",
"text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "```json\n[\n {\n \"golf\": [\n {\n \"name\": \"A Quang\", \n \"score\": [1, 0, 0, -1, 1, 0, 0, 1, 1, 3, 1, 1, 1, 2, 1, 1, 3, 3, 1, 14, 17]\n },\n {\n \"name\": \"A Thành\", \n \"score\": [4, 1, 0, 0, 2, 1, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 9]\n },\n {\n \"name\": \"A Tuấn\", \n \"score\": [4, 1, 0, 0, 2, 2, 1, 1, 2, 13, 1, 0, 0, 1, 2, 1, 3, 2, 0, 10, 23]\n }\n ]\n }\n]\n```"
}
]
}
]
}
]
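Writing dozens of these records by hand is error-prone, so it can help to generate the file from a small script. The sketch below mirrors the structure shown above; the helper names `make_record` and `write_dataset` are hypothetical, and the prompts are passed in as parameters rather than repeated here.

```python
import json
from pathlib import Path

FENCE = "`" * 3  # Markdown fence used around the assistant's JSON answer


def make_record(image_path, players, system_prompt, user_prompt):
    """Build one training record in the same shape as the entries in data.json.

    `players` is a list like [{"name": "A Quang", "score": [1, 1, 3]}].
    """
    answer = json.dumps([{"golf": players}], ensure_ascii=False, indent=2)
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": user_prompt},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": f"{FENCE}json\n{answer}\n{FENCE}"},
                ],
            },
        ]
    }


def write_dataset(records, out_path="train_data/data.json"):
    """Write all records as one JSON array, keeping Vietnamese names readable."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Using `ensure_ascii=False` keeps names such as "A Hiệp" as plain UTF-8 text instead of escape sequences, which makes the labels easier to review by eye.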
3. Adjust Parameters
- Modify relevant parameters in finetune.py (e.g., batch_size, padding, model types, etc.).
- Update GPU settings in finetune.sh. For example:
CUDA_VISIBLE_DEVICES="0"
4. Run Fine-Tuning
./finetune.sh
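Once training finishes, the checkpoint can be queried in the same way as the base model. The sketch below follows the standard Qwen2-VL usage pattern from the transformers library and the qwen_vl_utils package. It assumes both are installed along with accelerate (needed for `device_map="auto"`), and that `model_dir` points at wherever your run saved its weights; that path is not specified in this post. If the checkpoint folder lacks processor files, pass the base model as `processor_dir`. Greedy decoding (`do_sample=False`) is used so that the same image generally yields the same text.

```python
def build_messages(image_path, system_prompt, user_prompt):
    """Build the chat messages for one image, matching the training format."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": user_prompt},
            ],
        },
    ]


def run_inference(model_dir, messages, processor_dir=None, max_new_tokens=512):
    """Generate a reply with greedy decoding. Returns a list with one string."""
    # Heavy imports live here so build_messages stays usable without a GPU stack.
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_dir, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(processor_dir or model_dir)

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    # do_sample=False disables sampling, i.e. greedy decoding.
    generated = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens so only the newly generated answer is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
```

Reusing the exact system and user prompts from data.json at inference time keeps the input close to what the model saw during fine-tuning.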
Comparison
This is the original image
Gemini 1.5 Flash
{"golf": [{"name": "Leked", "score": ["6", "5", "6", "4", "7", "4", "6", "5", "7", "36", "3", "6", "3", "5", "5", "5", "5", "3", "4", "36"]}]}
Gemini 1.5 Flash demonstrates strong capability in recognizing handwritten numbers, but it fails to return results for the second row of scores. Additionally, the Vietnamese name is incorrect, indicating limitations in handling names from this region.
Without Fine-Tuning (Base Qwen2VL)
['```json\n'
'{\n'
' "player": "Mr Lê Hiên",\n'
' "scores": [\n'
' [65, 64, 47, 46, 57],\n'
' [21, 24, 20, 12, 23],\n'
' [10, 12, 20, 12, 23],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, 13, 17, 7],\n'
' [11, 15, ']
The base Qwen2VL (2B) model manages to produce a mostly correct name, but the overall result is poor. The score format is inconsistent with the expected structure, and the output repeats the same row until it is cut off, suggesting the model's limitations in structured JSON generation. This issue likely arises because Qwen2VL was not explicitly trained on JSON datasets, so forcing it to generate JSON causes performance degradation.
With Fine-Tuning
This is the result
['```json\n'
'{\n'
' "golf": [\n'
' {\n'
' "name": "Mr. Lê Hiền",\n'
' "score": [6, 5, 6, 4, 7, 4, 6, 5, 7]\n'
' },\n'
' {\n'
' "name": "Mr. Lê Hiền",\n'
' "score": [2, 1, 2, 2, 1, 2, 0, 1, 2, 3]\n'
' },\n'
' {\n'
' "name": "Mr. Lê Hiền",\n'
' "score": [1, 2, 2, 1, 2, 0, 1, 2, 3, 0]\n'
' }\n'
' ]\n'
'}\n'
'```']
The fine-tuned Qwen2VL model produces results that are much closer to the desired output. The JSON structure is correctly formatted, and the Vietnamese name is accurately captured. The model also performs well in extracting handwritten scores, though minor inaccuracies remain. These can be attributed to the limited fine-tuning dataset of approximately 35 images. Expanding the training dataset could further improve accuracy and consistency.
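Because a scorecard carries its own checksums, some of the remaining inaccuracies can be caught automatically. The sketch below assumes the layout seen in the longer training examples: nine front scores, their subtotal, nine back scores, their subtotal, then the overall total. The 21st element being the total is inferred from those examples rather than stated in the prompt, and the function name `check_score_row` is hypothetical.

```python
def check_score_row(score):
    """Return a list of problems found in one player's score list.

    Assumed layout (0-based): 0-8 front nine, 9 front subtotal,
    10-18 back nine, 19 back subtotal, 20 overall total.
    Shorter lists are only checked as far as they go.
    """
    problems = []
    if len(score) >= 10:
        front = sum(score[:9])
        if front != score[9]:
            problems.append(f"front nine sums to {front}, row says {score[9]}")
    if len(score) >= 20:
        back = sum(score[10:19])
        if back != score[19]:
            problems.append(f"back nine sums to {back}, row says {score[19]}")
    if len(score) >= 21 and score[9] + score[19] != score[20]:
        problems.append(
            f"subtotals add up to {score[9] + score[19]}, row says {score[20]}"
        )
    return problems
```

Rows that come back with a non-empty list can be sent for a second pass or manual review. Under this assumed layout, the second row of the fine-tuned output above would be flagged, since its first nine values sum to 13 while the tenth value is 3.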
Conclusion
That's it! You have now fine-tuned the Qwen2VL model for your specific task. With these steps, you can adapt the model to meet your needs while ensuring privacy and efficiency.
Happy fine-tuning!
More
If you’d like to learn more, be sure to check out my other posts and give me a like! It would mean a lot to me. Thank you.