
Evan Lin

Posted on • Originally published at evanlin.com on

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models

title: [Paper Review] Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
published: false
date: 2023-12-27 00:00:00 UTC
tags: 
canonical_url: http://www.evanlin.com/paper-gpt4v-vs-gemini-pro/
---

### Paper Title: Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

![image-20231228114149669](http://www.evanlin.com/images/2022/image-20231228114149669.png)

[https://arxiv.org/abs/2312.15011](https://arxiv.org/abs/2312.15011)

## Quick Summary

This paper reuses the relevant test cases from the earlier Microsoft paper and adds several new types of cases. TL;DR: GPT-4V is more concise and accurate, while Gemini Pro's descriptions are clearer. The paper includes many images and worked cases, so it is worth reading.

## Several Interesting Cases

### Being a Detective

Both models picked out several relevant clues, which makes this kind of task a good fit for side projects. :p

![image-20231228114214976](http://www.evanlin.com/images/2022/image-20231228114214976.png)

### Identifying the Brand of Shoes

It's pretty impressive that the shoes were identified as Nike Air Force 1.

![image-20231228114301430](http://www.evanlin.com/images/2022/image-20231228114301430.png)

### Reading the First Page Image of a Paper

The results are good. If a paper's information cannot be extracted from arXiv directly, feeding an image of its first page to a vision-language model is a workable fallback.

![image-20231228114621106](http://www.evanlin.com/images/2022/image-20231228114621106.png)
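
The fallback idea above can be sketched roughly as follows. This is only an illustration, not something from the paper: it assumes the metadata arrives as an Atom XML response (the format arXiv's API returns), and `read_first_page` is a hypothetical hook that you would implement yourself around whichever vision-language model you use. The names `parse_arxiv_atom` and `paper_info` are made up for this example, and fetching the XML or the image is left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def parse_arxiv_atom(xml_text: str) -> Optional[dict]:
    """Return title and authors from an Atom response, or None on failure."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        return None
    title = entry.findtext(f"{ATOM}title")
    if not title or not title.strip():
        return None
    authors = []
    for author in entry.findall(f"{ATOM}author"):
        name = author.findtext(f"{ATOM}name")
        if name and name.strip():
            authors.append(" ".join(name.split()))
    # Collapse line breaks and repeated spaces inside the title.
    return {"title": " ".join(title.split()), "authors": authors, "source": "arxiv"}


def paper_info(
    xml_text: str,
    first_page_image: bytes,
    read_first_page: Callable[[bytes], dict],
) -> dict:
    """Prefer the Atom metadata; fall back to reading the first-page image."""
    info = parse_arxiv_atom(xml_text) if xml_text else None
    if info is not None:
        return info
    # Hypothetical hook: wrap your vision-language model call here.
    result = dict(read_first_page(first_page_image))
    result["source"] = "vision"
    return result
```

Keeping the model call behind a plain callable means the fallback logic can be tried out with a stub before wiring in a real model.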
