📈 From Images to JSON: Fine-Tuning Qwen2VL on a Custom Dataset

#ai #mrzaizai2k #finetuningai #multimodalmodels

Privacy Concerns: Closed-source models like ChatGPT or Gemini may not be an option due to privacy policies. Fine-tuning an open model that you host yourself keeps sensitive data from leaving your infrastructure.
Resource Constraints: Running large models with over 7B parameters can be resource-intensive. A smaller fine-tuned model can give comparable results on a narrow task with reduced computational demands.
Domain-Specific Tasks: Some tasks require customization for niche domains, and without fine-tuning, the model may fail to perform effectively.

Problem Statement

Today's challenge is to read a golf scorecard like the one below. The goal is to extract the names of players and their scores for each hole and return the data in JSON format.

Challenges:

Handwritten Characters: Difficult to interpret.
Non-Standardized Format: Unlike identification cards or passports, the layout lacks a fixed structure.
Variable Image Quality: Images captured by customers vary in lighting and angles.
Output Format: Qwen2VL is not optimized for JSON output. Moreover, ensuring deterministic results (same image yields identical results) requires additional processing.
High Accuracy: The task demands both high precision and reasonable processing time. Fine-tuning focuses the model on specific tasks, reducing hallucination and improving relevance.
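As the outputs later in this post show, the model's reply arrives as text wrapped in a Markdown code fence, so it has to be unwrapped and validated before it can be used as JSON. The following is a minimal sketch of that post-processing step using only the Python standard library. The name `parse_model_reply` is a hypothetical helper, not part of Qwen2VL or the fine-tuning repository.

```python
import json
import re
from typing import Any, Optional

# Three backticks (a Markdown code fence), built indirectly to keep this listing readable.
TICKS = "`" * 3

# Matches an optional "json" language tag and captures everything up to the closing fence.
_FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS, re.DOTALL)


def parse_model_reply(reply: str) -> Optional[Any]:
    """Return the parsed JSON payload of a reply, or None if it is not valid JSON."""
    match = _FENCE.search(reply)
    # Fall back to the raw text when the reply has no complete fence.
    payload = match.group(1) if match else reply.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


good = TICKS + 'json\n{"golf": [{"name": "A Quang", "score": [1, 1, 3]}]}\n' + TICKS
cut_off = TICKS + 'json\n{"golf": [{"name": "A Qu'
print(parse_model_reply(good))
print(parse_model_reply(cut_off))
```

Returning `None` instead of raising lets the caller decide whether to retry the image or flag it for manual review when a reply is truncated or malformed.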

Fine-Tuning Process

Requirements:

A GPU with at least 16 GB VRAM.

We will fine-tune the Qwen2VL 2B Instruct model to address this problem.

Steps:

1. Install Requirements

git clone https://github.com/zhangfaen/finetune-Qwen2-VL
cd finetune-Qwen2-VL
conda create --name qwen2-VL python=3.10
conda activate qwen2-VL
pip install -r requirements.txt

2. Prepare the Dataset

Copy all images into the train_data folder. The images can be found here
Create a train_data/data.json file with the following structure:

[
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "train_data/score_card_2/IMG_9376.jpg"
                    },
                    {
                        "type": "text",
                        "text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "```json\n[\n    {\n        \"golf\": [\n            {\"name\": \"A Quang\", \"score\": [1, 1, 3, 1, 1, 2, 2, 2, 0]},\n            {\"name\": \"A Hiệp\", \"score\": [2, 1, 1, 2, 1, 1, 1, 0, 1]},\n            {\"name\": \"A Hiếu\", \"score\": [3, 2, 0, 2, 2, -1, 1, 2, 1]},\n            {\"name\": \"A Dũng\", \"score\": [0, 1, 1, 1, 4, 2, 2, 0, 0]}\n        ]\n    }\n]\n```"
                    }
                ]
            }            
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that responds in JSON format with the golf score card information. Do not add any annotations there. Remember to close any bracket."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "train_data/score_card_2/IMG_9378.JPG"
                    },
                    {
                        "type": "text",
                        "text": "From this image, return json that has field list is the list of information (hand written text) that each element has field name is the name of player, score is the list of all scores of that player in format like this. The golf score is calculated like this, they hasve 9 front and 9 back scores. Remember that the each score would be from 0-9 but after 9 digits, which mean the 10th element is the sum of the last 9 scores, and the 20th element will be the sum of hole 11 to 19. for example the score will be [0, 1, 2, 1, 3, 2, 0, 0, 1, 11][{\"golf\": [{\"name\": \"\", \"score\": []}, {\"name\": \"\", \"score\": []}]"
                    }
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "```json\n[\n    {\n        \"golf\": [\n            {\n                \"name\": \"A Quang\", \n                \"score\": [1, 0, 0, -1, 1, 0, 0, 1, 1, 3, 1, 1, 1, 2, 1, 1, 3, 3, 1, 14, 17]\n            },\n            {\n                \"name\": \"A Thành\", \n                \"score\": [4, 1, 0, 0, 2, 1, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 9]\n            },\n            {\n                \"name\": \"A Tuấn\", \n                \"score\": [4, 1, 0, 0, 2, 2, 1, 1, 2, 13, 1, 0, 0, 1, 2, 1, 3, 2, 0, 10, 23]\n            }\n        ]\n    }\n]\n```"
                    }
                ]
            }            
        ]
    }
]
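Writing these records by hand becomes error-prone as the dataset grows, especially the escaped JSON inside the assistant's answer. Below is a rough sketch of generating the records programmatically instead. `make_example` and `write_dataset` are hypothetical helpers, not part of the fine-tuning repository, and the sketch assumes every record shares the same system prompt and uses a fenced JSON string as the assistant's answer, as in the two records above. Note that `json.dumps` with `indent=4` lays out the answer slightly differently from the hand-written records (each score goes on its own line); whether that matters for training is not something this post establishes.

```python
import json
from pathlib import Path

SYSTEM_PROMPT = (
    "You are a helpful assistant that responds in JSON format with the golf "
    "score card information. Do not add any annotations there. "
    "Remember to close any bracket."
)

# Three backticks (a Markdown code fence), built indirectly to keep this listing readable.
TICKS = "`" * 3


def make_example(image_path: str, user_prompt: str, players: list[dict]) -> dict:
    """Build one training record in the messages layout shown above."""
    # ensure_ascii=False keeps Vietnamese names readable instead of \u escapes.
    answer = json.dumps([{"golf": players}], ensure_ascii=False, indent=4)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": user_prompt},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": f"{TICKS}json\n{answer}\n{TICKS}"},
                ],
            },
        ]
    }


def write_dataset(examples: list[dict], out_path: str) -> None:
    """Serialize all records to a single JSON file, creating the folder if needed."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(examples, ensure_ascii=False, indent=4), encoding="utf-8"
    )
```

A typical call would build one record per labelled image, passing the same instruction text as `user_prompt` each time, and then call `write_dataset(records, "train_data/data.json")`.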

3. Adjust Parameters

Modify relevant parameters in finetune.py (e.g., batch_size, padding, model types, etc.).
Update GPU settings in finetune.sh. For example:

CUDA_VISIBLE_DEVICES="0"

4. Run Fine-Tuning

./finetune.sh

Comparison

This is the original image

Gemini 1.5 Flash

{"golf": [{"name": "Leked", "score": ["6", "5", "6", "4", "7", "4", "6", "5", "7", "36", "3", "6", "3", "5", "5", "5", "5", "3", "4", "36"]}]}

Gemini 1.5 Flash demonstrates strong capability in recognizing handwritten numbers, but it fails to return results for the second row of scores. Additionally, the Vietnamese name is incorrect, indicating limitations in handling names from this region.
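Gemini returns every score as a string, whereas the training targets in this post use integers, so the outputs need to be normalized before they can be compared. A small sketch of that step follows; `to_int_scores` is a hypothetical helper, and mapping unreadable cells to `None` is one possible convention rather than anything prescribed here.

```python
def to_int_scores(raw_scores):
    """Convert score cells such as "6" or 6 to ints; unreadable cells become None."""
    cleaned = []
    for cell in raw_scores:
        try:
            cleaned.append(int(str(cell).strip()))
        except ValueError:
            # Keep the position so later cells stay aligned with their holes.
            cleaned.append(None)
    return cleaned


# First cells of the Gemini row shown above.
print(to_int_scores(["6", "5", "6", "4", "7", "4", "6", "5", "7", "36"]))
```

Keeping a placeholder for bad cells, rather than dropping them, preserves the hole-by-hole alignment that any later comparison relies on.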

Without Fine-Tuning (Base Qwen2VL)

['```json\n'
 '{\n'
 '  "player": "Mr Lê Hiên",\n'
 '  "scores": [\n'
 '    [65, 64, 47, 46, 57],\n'
 '    [21, 24, 20, 12, 23],\n'
 '    [10, 12, 20, 12, 23],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, 13, 17, 7],\n'
 '    [11, 15, ']

The base Qwen2VL (2B) model manages to produce a mostly correct name, but the overall result is poor. The score format is inconsistent with the expected structure, and the output repeats the same row until it is cut off, suggesting the model's limitations in structured JSON generation. This issue likely arises because Qwen2VL was not explicitly trained on JSON datasets, so forcing it to generate JSON causes performance degradation.
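Runaway repetition like this can be caught cheaply before attempting to parse the reply. The sketch below counts the longest run of identical consecutive lines; the helper names and the threshold of 5 are illustrative assumptions, not values taken from this experiment.

```python
def longest_repeat_run(lines: list[str]) -> int:
    """Length of the longest run of identical consecutive non-empty lines."""
    best = run = 0
    previous = None
    for line in (raw.strip() for raw in lines):
        if not line:
            continue  # blank lines neither extend nor break a run
        run = run + 1 if line == previous else 1
        best = max(best, run)
        previous = line
    return best


def looks_degenerate(reply: str, max_run: int = 5) -> bool:
    """Flag replies where one line repeats more than max_run times in a row."""
    return longest_repeat_run(reply.splitlines()) > max_run


# Rebuild the shape of the base-model reply above: one row repeated 21 times, then cut off.
reply = '{\n  "scores": [\n' + "    [11, 15, 13, 17, 7],\n" * 21 + "    [11, 15, "
print(longest_repeat_run(reply.splitlines()), looks_degenerate(reply))
```

A reply flagged this way can be discarded or regenerated without spending time on JSON parsing that is bound to fail.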

With Fine-Tuning

This is the result:

['```json\n'
 '{\n'
 '    "golf": [\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [6, 5, 6, 4, 7, 4, 6, 5, 7]\n'
 '        },\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [2, 1, 2, 2, 1, 2, 0, 1, 2, 3]\n'
 '        },\n'
 '        {\n'
 '            "name": "Mr. Lê Hiền",\n'
 '            "score": [1, 2, 2, 1, 2, 0, 1, 2, 3, 0]\n'
 '        }\n'
 '    ]\n'
 '}\n'
 '```']

The fine-tuned Qwen2VL model produces results that are much closer to the desired output. The JSON structure is correctly formatted, and the Vietnamese name is accurately captured. The model also performs well in extracting handwritten scores, though minor inaccuracies remain. These are most likely due to the limited fine-tuning dataset of approximately 35 images. Expanding the training dataset could further improve accuracy and consistency.
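To track whether a larger dataset actually helps, it is useful to put a number on "minor inaccuracies". One simple option, sketched below, is the fraction of score cells that match a hand-labelled reference. `score_accuracy` is a hypothetical helper, and matching players by position rather than by name is an assumption made to keep the example short.

```python
def score_accuracy(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of expected score cells that the prediction reproduces exactly.

    Players are matched by position. A missing player or a missing cell counts
    as wrong; extra predicted cells are ignored.
    """
    total = correct = 0
    for index, truth in enumerate(expected):
        guess = predicted[index]["score"] if index < len(predicted) else []
        for cell, value in enumerate(truth["score"]):
            total += 1
            if cell < len(guess) and guess[cell] == value:
                correct += 1
    return correct / total if total else 0.0


# Made-up values for illustration only, not results from this experiment.
expected = [{"name": "A", "score": [1, 2, 3, 4]}]
predicted = [{"name": "A", "score": [1, 2, 0]}]
print(score_accuracy(predicted, expected))
```

Averaging this value over a held-out set of labelled scorecards gives a single figure to compare before and after adding training images.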

Conclusion

That's it! You have now fine-tuned the Qwen2VL model for your specific task. With these steps, you can adapt the model to meet your needs while ensuring privacy and efficiency.

Happy fine-tuning!