This is a Plain English Papers summary of a research paper called AI Model Learns to Find Images Based on Reference Photos and Text Modifications. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.
Overview
- CoLLM is a framework for composed image retrieval that works without manual training data
- Uses LLMs to generate training triplets from image-caption pairs on-the-fly
- Creates joint embeddings of reference images and modification texts
- Introduces a new 3.4M sample dataset called Multi-Text CIR (MTCIR)
- Refines existing benchmarks for better evaluation reliability
- Achieves state-of-the-art performance with up to 15% improvement
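The retrieval step implied by the bullets above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: CoLLM learns a joint embedding of the reference image and modification text, whereas here the fusion is a simple element-wise sum over random placeholder vectors, just to show how a composed query is scored against a gallery.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size, purely illustrative

def normalize(v):
    # Scale vectors to unit length so dot products are cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def compose(image_emb, text_emb):
    # Simplest possible fusion: element-wise sum. A learned joint
    # embedding (as in CoLLM) would replace this, but the retrieval
    # step downstream is the same.
    return normalize(image_emb + text_emb)

reference_image = rng.normal(size=dim)        # e.g. the red dress photo
modification_text = rng.normal(size=dim)      # e.g. "in blue, short sleeves"
gallery = normalize(rng.normal(size=(100, dim)))  # candidate image embeddings

query = compose(reference_image, modification_text)
scores = gallery @ query                  # cosine similarity to every candidate
top_k = np.argsort(scores)[::-1][:5]      # indices of the 5 best matches
print(top_k)
```

The key design point the sketch shows is that once reference image and modification text are fused into a single vector, retrieval reduces to an ordinary nearest-neighbor search over image embeddings.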
Plain English Explanation
Finding specific images based on both a reference picture and a text description is hard. Imagine showing a search engine a photo of a red dress and saying "like this but in blue with short sleeves." This is what [composed image retrieval](https://aimodels.fyi/papers/arxiv/comp...