Author: Harpreet Sahota (Hacker in Residence at Voxel51)
A Cool Way to Discover Topics and Trends at the Biggest CV Conference of the Year
The 2024 Conference on Computer Vision and Pattern Recognition (CVPR) received 11,532 valid paper submissions, and only 2,719 were accepted for an overall acceptance rate of about 23.6%.
But keeping up with the vast array of research being presented at this year's CVPR can be challenging. CVPR has an awesome website listing all the papers, but the information I want is scattered across various links and platforms. Needless to say, getting a good idea of what's being presented is time-consuming (and a bit disorganized).
But what if you could access all this knowledge in one convenient location, allowing you to identify trends and gain valuable insights easily?
Well, I curated a dataset, hosted on Hugging Face and built with FiftyOne, that does just that: it helps you explore this year's conference offerings. I was able to find/scrape 2,389 of the 2,719 accepted papers, and I put them into a dataset that we will explore together!
Btw this post is available as a Google Colab notebook here, though I recommend running it locally if you can.
tl;dr
• CVPR 2024 received 11,532 paper submissions, with 2,719 accepted for a 23.6% acceptance rate.
• I curated a dataset of 2,389 accepted papers, hosted on Hugging Face and built with FiftyOne. It includes paper images, titles, authors, abstracts, links, categories, and keywords.
• Once loaded into FiftyOne, the dataset can be managed, queried, visualized, and analyzed.
• Text embeddings were generated for the titles and abstracts using the gte-large-en-v1.5 model, loaded with Sentence Transformers.
• FiftyOne Brain was used to visualize the embeddings with UMAP, compute uniqueness scores to find the most unique papers, and index the embeddings by similarity to easily find similar papers.
🧐 What's in this dataset?
The dataset consists of images of the first pages of papers, their titles, a list of authors, their abstracts, direct links to papers on arXiv, project pages, a category breakdown according to the arXiv taxonomy, and keywords that I bucketed from the 2024 CVPR call for papers.
Here are the fields:
- An image of the first page of the paper
- title: The title of the paper
- authors_list: The list of authors
- abstract: The abstract of the paper
- arxiv_link: Link to the paper on arXiv
- other_link: Link to the project page, if found
- category_name: The primary category of this paper, according to arXiv taxonomy
- all_categories: All categories this paper falls into, according to arXiv taxonomy
- keywords: Extracted using GPT-4o
This should give us enough information to pick up some interesting trends for this year's CVPR!
PS: Check out my picks for awesome papers at CVPR in my GitHub repo. Here's some general code for how I scraped the CVPR data.
Let's start by installing some dependencies:
%%capture
!pip install fiftyone sentence-transformers umap-learn lancedb scikit-learn==1.4.2
This tutorial will make use of the clustering plugin. Check out all available plugins here.
!fiftyone plugins download https://github.com/jacobmarks/clustering-plugin
import fiftyone as fo
import fiftyone.utils.huggingface as fouh
FiftyOne natively integrates with Hugging Face's datasets library.
The integration allows you to push datasets to and load datasets from the Hugging Face Hub. It's a nice integration that simplifies sharing datasets with the machine learning community and accessing popular vision datasets. You can load datasets from specific revisions, handle multiple media fields, and configure advanced settings through the integration - check out the Hugging Face organization page here to see what datasets are available.
I've posted the dataset on Hugging Face - feel free to smash a like on it to help spread the word - and you can access it as follows:
dataset = fouh.load_from_hub("Voxel51/CVPR_2024_Papers")
You've now loaded the dataset into FiftyOne format!
The FiftyOne dataset object gives you a high-level interface for performing various dataset-related tasks, such as loading data, applying transformations, evaluating models, and exporting datasets in different formats. The dataset represents a collection of samples and fields (associated metadata, labels, and other annotations).
It provides a convenient way to store, manipulate, and query datasets in FiftyOne.
Some cool things you can do with the dataset object:
- Visualize various data, including images and videos and associated annotations like bounding boxes, segmentation masks, arbitrary text, and classification labels.
- Attach metadata to each sample in the dataset, like arbitrary text fields, lists, etc.
- Query it to filter and select subsets of samples based on their metadata, labels, or other criteria.
You can launch the app like so:
session = fo.launch_app(dataset, auto=False)
session.show()
Take a look at the app below
With it, you can get insight into the distribution of keywords, categories, and the number of papers a given author (or at least someone with that name) has attributed to them at this year's conference!
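The same kind of distribution can also be computed outside the app. Here is a minimal sketch, assuming that dataset.values(field) returns one entry per paper (a string, a list of strings, or None when the field is missing). The helper name top_counts is hypothetical, and the toy list stands in for real values:

```python
from collections import Counter


def top_counts(values, k=5):
    """Count occurrences across a list of per-sample values.

    Each entry may be a single string, a list of strings, or None
    (for papers where the field is missing).
    """
    counter = Counter()
    for value in values:
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            counter.update(value)
        else:
            counter[value] += 1
    return counter.most_common(k)


# Toy stand-in for dataset.values("authors_list"); not real data.
toy_authors = [["A. Author", "B. Author"], ["A. Author"], None, ["C. Author"]]
print(top_counts(toy_authors, k=2))
```

With the real dataset you would pass something like dataset.values("authors_list") or dataset.values("keywords") instead of the toy list.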
You can do more interesting analysis from here. Start by getting embeddings for the title and abstract of each paper. For that, you can make use of gte-large-en-v1.5. It's small, it's fast, and it's good.
Of course, feel free to choose any model you'd like.
%%capture
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'Alibaba-NLP/gte-large-en-v1.5',
    trust_remote_code=True
)
The code below will help generate and add text embeddings to a FiftyOne dataset.
get_text_embeddings(dataset, field, model)
def get_text_embeddings(dataset, field, model):
    """
    Returns the embeddings of the text stored in a field of the dataset.

    Args:
        dataset: A FiftyOne dataset object.
        field: The name of the field containing the text to embed.
        model: An embedding model with an encode() method.

    Returns:
        A list of embeddings, one per sample.
    """
    texts = dataset.values(field)
    text_embeddings = []

    for text in texts:
        embeddings = model.encode(text)
        text_embeddings.append(embeddings)

    return text_embeddings
- This function takes a FiftyOne dataset, a field name containing text data, and a pre-trained embedding model.
- It retrieves the text data from the specified field of the dataset. It generates embeddings for each text using the provided embedding model.
- It returns a list of embeddings.
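The function above calls the model once per text. If that turns out to be slow, one option is to encode in batches. The sketch below is an assumption-laden variant, not the code used to build the dataset: encode_in_batches is a hypothetical helper, it accepts any object with an encode(list_of_strings) method (Sentence Transformers models accept a list of strings), and it replaces missing texts with an empty string so the output stays aligned with the samples. FakeModel exists only so the example runs without downloading a model.

```python
def encode_in_batches(texts, model, batch_size=32):
    """Encode texts in batches instead of one call per text.

    `model` is any object with an `encode(list_of_str)` method that returns
    one embedding per input, such as a SentenceTransformer.
    Missing texts (None) are replaced with an empty string, which is an
    assumption made here so the output stays aligned with the samples.
    """
    cleaned = [text if text is not None else "" for text in texts]
    embeddings = []
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned[start:start + batch_size]
        embeddings.extend(model.encode(batch))
    return embeddings


class FakeModel:
    """Stand-in for a real embedding model, used only to exercise the helper."""

    def encode(self, batch):
        # One 2-dimensional "embedding" per text: [character count, word count]
        return [[float(len(t)), float(len(t.split()))] for t in batch]


demo = encode_in_batches(["a b", None, "hello"], FakeModel(), batch_size=2)
print(demo)
```

With the real model you would pass dataset.values("abstract") as texts and the SentenceTransformer instance as model.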
add_embeddings_to_dataset(dataset, field, embeddings)
def add_embeddings_to_dataset(dataset, field, embeddings):
    """
    Adds the embeddings to the dataset.

    Args:
        dataset: A FiftyOne dataset object.
        field: The name of the new field that will store the embeddings.
        embeddings: A list of embeddings, one per sample.
    """
    dataset.add_sample_field(field, fo.VectorField)
    dataset.set_values(field, embeddings)
- This function takes a FiftyOne dataset, a field name to store the embeddings, and a list of embeddings.
- It adds a new sample field to the dataset to store the embeddings.
- It sets the values of the newly added field to the provided embeddings.
Combine them:
abstract_embeddings = get_text_embeddings(
    dataset=dataset,
    field="abstract",
    model=model
)

add_embeddings_to_dataset(
    dataset=dataset,
    field="abstract_embeddings",
    embeddings=abstract_embeddings
)

title_embeddings = get_text_embeddings(
    dataset=dataset,
    field="title",
    model=model
)

add_embeddings_to_dataset(
    dataset=dataset,
    field="title_embeddings",
    embeddings=title_embeddings
)
And, in a nutshell, by running this code you've:
- Extracted text data from a specific field in a FiftyOne dataset.
- Generated embeddings for each text using a pre-trained embedding model.
- Added the generated embeddings back to the dataset as a new field.
Making use of the embeddings
You can use FiftyOne Brain to do some cool stuff with embeddings, like:
- Visualizing datasets in low-dimensional embedding spaces to observe patterns and clusters.
- Computing uniqueness scores for images (or embeddings) to identify the most (or least) unique samples.
- Indexing datasets by similarity to easily find similar samples.
Visualizing embeddings
Below are the supported dimensionality reduction methods in the Brain:
UMAP (Uniform Manifold Approximation and Projection)
UMAP is a dimensionality reduction technique that uses applied Riemannian geometry and algebraic topology to find low-dimensional embeddings of structured data.
It is particularly well-suited for text embeddings because it can handle high-dimensional data and preserves local neighborhoods while retaining much of the global structure of the data, making it useful for both visualization and preprocessing for clustering algorithms.
t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is a non-linear dimensionality reduction technique used to visualize high-dimensional data. It is similar to UMAP but tends to be slower and less scalable.
While it can be effective for certain data types, it may not perform as well as UMAP for large datasets.
PCA (Principal Component Analysis)
PCA is a linear dimensionality reduction technique that projects high-dimensional data onto lower-dimensional subspaces. It is fast and easy to implement but may not capture non-linear relationships in the data as effectively as UMAP or t-SNE.
PCA is often used for simpler data sets where linearity is a reasonable assumption.
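To make "linear projection" concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It only illustrates the idea and is not FiftyOne Brain's implementation. The function name pca_project is hypothetical, and the random matrix is a stand-in for real embeddings.

```python
import numpy as np


def pca_project(embeddings, num_dims=2):
    """Project row vectors onto their top principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:num_dims].T


rng = np.random.default_rng(51)
toy_embeddings = rng.normal(size=(100, 16))  # random stand-in, not real embeddings
points = pca_project(toy_embeddings, num_dims=2)
print(points.shape)
```

Every output coordinate is a fixed weighted sum of the input dimensions, which is why PCA is fast but cannot untangle curved, non-linear structure the way UMAP or t-SNE can.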
Manual
Manually computing a low-dimensional representation involves creating a custom method to reduce the dimensionality of the data. This approach can be time-consuming and requires a deep understanding of the data and the desired outcome.
import fiftyone.brain as fob

fob.compute_visualization(
    dataset,
    embeddings="abstract_embeddings",
    num_dims=2,
    method="umap",
    brain_key="umap_abstract",
    verbose=True,
    seed=51
)

fob.compute_visualization(
    dataset,
    embeddings="title_embeddings",
    num_dims=2,
    method="umap",
    brain_key="umap_title",
    verbose=True,
    seed=51
)
Computing uniqueness
The code below adds a uniqueness field to each sample, scoring how unique it is with respect to the rest of the samples. This is interesting because you can understand which papers are the most unique (based on their abstracts) among all the papers in the dataset.
fob.compute_uniqueness(
    dataset,
    embeddings="abstract_embeddings",
    uniqueness_field="uniqueness_abstract",
)

fob.compute_uniqueness(
    dataset,
    embeddings="title_embeddings",
    uniqueness_field="uniqueness_title",
)
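For intuition about what a uniqueness score can capture, here is a toy sketch that scores each vector by the cosine distance to its nearest neighbor and scales the result to [0, 1]. This is an illustrative assumption about one way to define uniqueness, not necessarily the scoring that compute_uniqueness uses. The helper names are hypothetical, and the vectors are made up and assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def toy_uniqueness(embeddings):
    """Score each vector by the distance to its nearest neighbor, scaled to [0, 1]."""
    nearest = []
    for i, a in enumerate(embeddings):
        distances = [cosine_distance(a, b) for j, b in enumerate(embeddings) if j != i]
        nearest.append(min(distances))
    top = max(nearest)
    return [d / top if top > 0 else 0.0 for d in nearest]


# Two near-duplicate vectors and one outlier; made-up values
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = toy_uniqueness(toy)
print(scores)
```

The outlier gets the highest score because nothing else sits close to it. To list the highest-scoring papers from the fields computed above, something like dataset.sort_by("uniqueness_abstract", reverse=True) should do it.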
Computing similarity
The code below will index the abstract embeddings by similarity so that you can easily query and sort your dataset to find similar examples. Once you've indexed a dataset by similarity, you can use the sort_by_similarity() view stage to sort the dataset by abstract similarity programmatically! The code below uses LanceDB as the backend (read about the integration here), but there are several other backends you can use:
• sklearn (default): a scikit-learn backend
• qdrant: a Qdrant backend
• redis: a Redis backend
• pinecone: a Pinecone backend
• mongodb: a MongoDB backend
• milvus: a Milvus backend
The library is open source, and we welcome contributions. Feel free to integrate it with your favorite vector database.
sim_abstract = fob.compute_similarity(
    dataset,
    embeddings="abstract_embeddings",
    brain_key="abstract_similarity",
    backend="lancedb",
)
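Conceptually, a similarity query ranks every other embedding by how close it is to the query embedding. The sketch below shows that idea with a brute-force cosine similarity ranking in plain Python. It is an illustration only: real backends such as LanceDB use their own index structures, the helper names are hypothetical, and the vectors are made up and assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_index, embeddings, k=3):
    """Return (index, similarity) pairs for the k vectors closest to the query."""
    query = embeddings[query_index]
    scored = [
        (i, cosine_similarity(query, emb))
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-dimensional "embeddings"
toy = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.7, 0.7]]
print(most_similar(0, toy, k=2))
```

In FiftyOne, the equivalent query against the index built above would look something like dataset.sort_by_similarity(query_id, k=10, brain_key="abstract_similarity"), where query_id is the ID of the paper you want neighbors for.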
Now, let's check all this out in the app!
Check out the short video below, where I'll show you how to use everything we've created to find interesting research.
Feel free to build on this and run your own analysis on this dataset! If you find some interesting trends or insights, please share them with the community!
There's a lot more that you can do with FiftyOne, more than I can share in this note-blog. But I hope you'll join me for a workshop where I'll spend ~90 minutes teaching you how to use FiftyOne! Sign up here!
Thanks for reading!
The curated dataset hosted on Hugging Face and built with FiftyOne, along with the integration of FiftyOne with Hugging Face, provides access to a comprehensive collection of 2,389 accepted papers with essential metadata.
Researchers can manage, query, visualize, and analyze papers more effectively with this curated dataset and FiftyOne. FiftyOne Brain's features, such as visualizing embeddings with UMAP, computing uniqueness scores, and indexing embeddings by similarity, enable researchers to identify unique papers, find similar research, and understand the conference's offerings.
This resource simplifies navigating the vast amount of research presented at CVPR 2024, and I hope it will be a more accessible way for the computer vision community to discover research.