Author: Harpreet Sahota (Hacker in Residence at Voxel51)
Welcome to Voxel51’s bi-weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.
📰 The Industry Pulse
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
OMG-LLaVA combines robust pixel-level vision understanding with reasoning abilities in a single end-to-end trained model. It uses a universal segmentation method as the visual encoder to integrate image information, perception priors, and visual prompts into visual tokens provided to a large language model (LLM).
The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. This allows OMG-LLaVA to achieve image-level, object-level, and pixel-level reasoning and understanding, matching or surpassing the performance of specialized methods on multiple benchmarks.
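Here’s a rough conceptual sketch of that flow; the function and component names are invented for illustration and are not the released OMG-LLaVA API:

```python
# Conceptual sketch only: a universal segmentation encoder produces pixel- and
# object-level visual tokens, a projector maps them into the LLM's input space,
# and the LLM returns text plus segmentation tokens that are decoded into masks.
def omg_llava_answer(image, instruction, seg_encoder, projector, llm, mask_decoder):
    pixel_tokens, object_tokens = seg_encoder(image)          # perception priors
    visual_tokens = projector(pixel_tokens + object_tokens)   # token lists concatenated, then projected
    text, seg_tokens = llm(visual_tokens, instruction)        # text answer + [SEG]-style tokens
    masks = mask_decoder(seg_tokens, pixel_tokens)            # pixel-level segmentation output
    return text, masks
```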
What you need to know:
- Elegant end-to-end training of one encoder, one decoder, and one LLM rather than using an LLM to connect multiple specialist models
- Ability to accept various visual and text prompts for flexible user interaction
- Strong performance on image, object, and pixel-level reasoning tasks compared to specialized models
Instructions on running the model are here, and you can run the demo on Hugging Face Spaces here.
The Robots are Coming for Our Jobs!
Agility Robotics has signed a multi-year deal with GXO Logistics to deploy its Digit humanoid robots in various logistics operations. The first official deployment is already underway at a Spanx facility in Connecticut, where a small fleet of Digit robots is being used under a robotics-as-a-service (RaaS) model.
Digit robots pick up totes from 6 River Systems' Chuck autonomous mobile robots (AMRs) and place them onto conveyors at the Spanx facility. Digit can handle empty and full totes and pick them up from an AMR's bottom or top shelf. The robots are orchestrated through Agility Arc, the company's cloud automation platform.
GXO Logistics is also testing other humanoid robots, such as Apptronik's Apollo. There are currently no safety standards specifically for humanoids; most manufacturers and integrators are leveraging existing industrial robot standards as a baseline, and Digit is not working with or near humans at the Spanx facility.
The Robots Are Going to Help Us with ADHD!
CARMEN (Cognitively Assistive Robot for Motivation and Neurorehabilitation) is a small, tabletop robot designed to help people with mild cognitive impairment (MCI) learn skills to improve memory, attention, and executive functioning at home.
Developed by researchers at UC San Diego in collaboration with clinicians, people with MCI, and their care partners, CARMEN is the only robot that teaches compensatory cognitive strategies to help improve memory and executive function.
Here’s what CARMEN is currently capable of:
- Delivers simple cognitive training exercises through interactive games and activities
- Designed to be used independently without clinician or researcher supervision
- Plug and play with limited moving parts and able to function with limited internet access
- Communicates clearly with users, expresses compassion and empathy, and provides breaks after challenging tasks
In a study, CARMEN was deployed for a week in the homes of several people with MCI, as well as with clinicians experienced in working with MCI patients.
After using CARMEN, participants with MCI reported trying strategies they previously thought were impossible and finding the robot easy to use. The next steps include:
- Deploying CARMEN in more homes.
- Enabling conversational abilities while preserving privacy.
- Exploring how the robot could assist users with other conditions like ADHD.
Many elements of the CARMEN project are open-source and available on GitHub.
💎 GitHub Gems
You didn’t think the FiftyOne team would sleep on the Florence2 release, did you?
Jacob Marks, OG DevRel at FiftyOne, created the [fiftyone_florence2_plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin) repository on GitHub. This plugin integrates the Florence2 model into the FiftyOne open-source computer vision toolkit.
The key components of the plugin include:
- Code to load the Florence2 model and generate embeddings and predictions on image data
- Integration with FiftyOne to visualize the Florence2 model outputs alongside the image dataset
Here’s a notebook that shows you how to use the plugin!
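If you want to try it outside the notebook, here’s a rough sketch of getting set up; the plugin’s operators run from the FiftyOne App’s operator browser, and the dataset below is just a stand-in:

```python
# First, install the plugin with the FiftyOne CLI:
#   fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
import fiftyone as fo
import fiftyone.zoo as foz

# Any image dataset works; the zoo "quickstart" dataset is an easy test bed
dataset = foz.load_zoo_dataset("quickstart", max_samples=25)

# Launch the App, then open the operator browser (the ` key) to run the
# plugin's Florence2 operators on the dataset
session = fo.launch_app(dataset)
```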
📙 Good Reads
This week’s good read is a massive collaborative three-part series by some popular AI/ML folks on Twitter titled “What We Learned from a Year of Building with LLMs.”
It’s a solid read with some down-to-earth, practical, no-nonsense advice. If you’ve been building with AI/ML for a while, you’ll find that what they say about building with LLMs isn’t too different from what you already know. I feel kinda smart reading this and having many of my thoughts and experiences validated by several people in this space that I admire and consider virtual mentors.
Here’s what I think are the best pieces of advice from the series:
- Retrieval-augmented generation (RAG) will remain important even with long-context LLMs. Effective retrieval is still needed to select the most relevant information to feed the model. A hybrid approach combining keyword search and vector embeddings tends to work best (a minimal sketch follows this list).
- Break complex tasks into step-by-step, multi-turn flows executed in a deterministic way. This can significantly boost performance and reliability compared to a single prompt or non-deterministic AI agent.
- Rigorous, continuous evaluation using real data is critical. Have LLMs evaluate each other's outputs, but don't rely on that alone. Regularly review model inputs and outputs yourself to identify failure modes. Design the UX to enable human-in-the-loop feedback.
- Building LLM apps requires diverse roles beyond AI engineers. Hiring the right people at the right time, like product managers, UX designers, and domain experts, is key. Focus on the end-to-end process, not just the core LLM.
- Center humans in the workflow and use LLMs to enhance productivity rather than replace people entirely. Build AI that supports human capabilities.
- Use LLM APIs to validate ideas quickly, but consider self-hosting for more control and savings at scale. Avoid generic LLM features; differentiate your core product.
- LLM capabilities are rapidly increasing while costs decrease. Plan for what's infeasible now to become economical soon. Move beyond demos to reliable, scalable products, which takes significant engineering.
- The technology is less durable than the system and data flywheel you build around it. Start simple, specialize in memorable UX, and adapt as the tech evolves. A thoughtful, human-centered strategy is essential.
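Since hybrid retrieval comes up repeatedly in the series, here’s a minimal sketch of what it can look like; the libraries, blending weight, and toy documents are my own choices, not the authors’:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Contact support via the in-app chat for billing issues.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])          # keyword index
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # small embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # unit-norm vectors

def hybrid_search(query, alpha=0.5, k=2):
    kw = bm25.get_scores(query.lower().split())
    kw = kw / (kw.max() + 1e-9)                    # normalize keyword scores
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    sem = doc_vecs @ q_vec                         # cosine similarity
    scores = alpha * kw + (1 - alpha) * sem        # blend the two signals
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

print(hybrid_search("how do I return an item?"))
```

In practice you’d retrieve from a real index and tune the blending (or use reciprocal rank fusion), but the shape of the idea is the same.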
Many of the authors recently appeared on a podcast (which I haven’t listened to yet) to discuss the piece and answer questions from the audience.
🎙️ Good Listens
Aravind Srinivas: Perplexity CEO on the Lex Fridman Podcast
I’m a huge fan of Perplexity.ai.
Perplexity AI is an "answer engine" that provides direct answers to questions by retrieving relevant information from the web and synthesizing it into a concise response using large language models. Every sentence in the answer includes a citation. It uses a retrieval augmented generation (RAG) approach - retrieving relevant documents for a query, extracting key snippets, and using those to generate an answer. The LLM is constrained only to use information from the retrieved sources.
I first heard about it at the beginning of the year, and after using the free tier for two weeks, I realized that it’s a tool worth investing in. I quickly signed up for their “Pro” tier, and it accelerated the pace at which I could conduct research and access knowledge.
I was so excited when I saw Perplexity CEO Aravind Srinivas on the Lex Fridman podcast. I’ve only heard him on short-form podcasts, which always left me wanting to hear more from him. In a three-hour conversation (which I’ve listened to twice), Aravind and Lex discussed Perplexity's technical approach, product vision, competitive landscape, and the future of AI and knowledge dissemination on the internet.
Here are some interesting takeaways from this conversation:
- Indexing the web involves complex crawling, content extraction, and ranking using traditional methods like BM25 and newer approaches with LLMs and embeddings. Serving answers with low latency at scale is an engineering challenge.
- Perplexity has a web crawler called PerplexityBot that decides which URLs and domains to crawl and how frequently. It has to handle JavaScript rendering and respect publisher policies in robots.txt. Building the right index is key.
- Perplexity uses a RAG architecture in which, given a query, it retrieves relevant documents and paragraphs and uses those to generate an answer. The key principle is to only say things that can be cited from the retrieved documents (there's a bare-bones sketch of this pattern after the list).
- There are multiple ways hallucinations can occur in the answers—if the model is not skilled enough to understand the query and paragraphs semantically, if the retrieved snippets are poor quality or outdated, or if too much irrelevant information is provided to the model. Improving retrieval quality, snippet freshness, and model reasoning abilities can reduce hallucinations.
- Increasing the context window length (e.g., to 100K+ tokens) allows ingesting more detailed pages while answering. However, there are tradeoffs: feeding the model too much irrelevant information can confuse it and degrade instruction-following performance.
- By incorporating human feedback into the training process via RLHF, Perplexity wants to create an AI knowledge assistant that provides high-quality, relevant answers to user queries. The goal is for the AI to understand the user's intent and give them the information they seek.
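Here’s a bare-bones sketch of the “only say what you can cite” pattern; `retrieve` and `llm_generate` are placeholders I made up, not Perplexity's actual stack:

```python
def answer_with_citations(query, retrieve, llm_generate, k=4):
    """Constrain an LLM to answer only from retrieved, numbered sources."""
    snippets = retrieve(query, k=k)  # e.g., top-k passages from a search index

    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using ONLY the numbered sources below. "
        "Cite a source like [1] after every sentence. If the sources are "
        "insufficient, say so instead of guessing.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```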
Here are a few clips from the conversation that you might find insightful:
👨🏽‍🔬 Good Research: Is Tokenization the Key to Truly Multimodal Models?
A lot of what's being hyped as a "multimodal model" these days is really just a vision-language model, which arguably isn't truly multimodal because it covers only two modalities.
While these models have yielded impressive results and serve as a foundation for more sophisticated architectures, they're basically some type of Frankenstein monster. You glue together a pretrained vision encoder and text encoder, freeze the vision encoder, and let gradients flow through the text encoder during training. Don't get me wrong, VLMs are an important step toward more comprehensive multimodal AI, but this architectural choice has limitations in fully integrating the modalities.
I don’t mean to downplay our progress so far, and I fully appreciate how difficult it is to unify diverse modalities into a single model.
Modalities are all over the place in their dimensionality, types, and values. Images are typically represented as high-dimensional tensors with spatial relationships. In contrast, text is represented as variable-length sequences of discrete tokens. Structured data like vectors and poses have unique formats and characteristics. Feature maps, intermediate representations learned by neural networks, add another layer of complexity to multimodal learning.
Recent approaches to multimodal learning often rely on separate encoders for each modality, such as vision transformers for images and large language models for text. ImageBind, for example, encodes six modalities, but it does so by projecting four of them into a frozen CLIP embedding space, essentially aligning them with the vision and language modalities.
While these specialized encoders can effectively process their respective modalities, they create a bottleneck when fusing information across modalities, as they lack a common representation space.
However, new research from Apple might just change the way we architect multimodal networks.
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities introduces an any-to-any model that can handle 21 modalities across the following categories: RGB, geometric, semantic, edges, feature maps, metadata, and text. Check out the demo here.
And the big insight in the paper: It all comes down to tokenization.
To address the challenge of unifying diverse modalities, the 4M-21 model introduces modality-specific tokenization schemes that convert each data type into a sequence of discrete tokens. This unifies the representation of all modalities in a common space of discrete tokens, allowing a single model to handle them with the same architecture and training objective. The coolest part about tokenizing this way is that every task the model handles is formulated as a per-token classification problem, which can be trained with a cross-entropy loss using an encoder-decoder transformer (see the sketch below).
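Here’s a tiny, self-contained PyTorch sketch of that training objective; it is not the 4M-21 code, and the vocabulary size, sequence lengths, and use of `nn.Transformer` as a stand-in encoder-decoder are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 8192, 512   # assumed unified discrete-token vocabulary

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, batch_first=True)  # stand-in encoder-decoder
to_logits = nn.Linear(d_model, vocab_size)

# e.g., tokenized RGB patches in, tokenized depth-map patches out (14x14 grids)
src_tokens = torch.randint(0, vocab_size, (2, 196))
tgt_tokens = torch.randint(0, vocab_size, (2, 196))

hidden = transformer(embed(src_tokens), embed(tgt_tokens))  # (batch, seq, d_model)
logits = to_logits(hidden)                                  # (batch, seq, vocab)

# every task reduces to per-token classification over the shared vocabulary
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tgt_tokens.reshape(-1))
```

(A real autoregressive setup would shift the target sequence and mask future tokens; this only shows the per-token cross-entropy idea.)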
I want to focus on the tokenization for this post, but I encourage you to check out the project page to learn more. Here are my Cliff's Notes on the tokenizer (with a small vector-quantization sketch right after the list):
- For image-like modalities such as RGB, surface normals, depth, and feature maps from models like CLIP, DINOv2, and ImageBind, they used Transformer-based VQ-VAE tokenizers. These tokenizers compress the dense, high-dimensional image data into a smaller grid of discrete tokens (e.g., 14x14 or 16x16) while preserving the spatial structure. They also used a diffusion decoder for edges to generate more visually plausible reconstructions. The autoencoders learn to encode spatial patches of an image into discrete tokens, effectively capturing local patterns and spatial structure. The VQ-VAEs can map similar patches to the same token using a discrete latent space, providing a compact and semantically meaningful representation.
- For non-image-like modalities such as DINOv2 and ImageBind global embeddings and 3D human poses, they employed MLP-based discrete VAEs with Memcodes quantization. This allows compressing the vectors or pose parameters into a small set of discrete tokens (e.g., 16) without imposing any spatial structure. These autoencoders learn to map continuous inputs to discrete latent variables, which the shared transformer architecture can then process. By discretizing the continuous data, the model can more effectively capture and reason about their underlying structure and relationships.
- For text-based modalities like captions, object bounding boxes, image metadata, and color palettes, they utilized a shared WordPiece tokenizer with special token prefixes to encode the type and value of each data field. This tokenizer breaks down words into subword units, allowing the model to handle out-of-vocabulary words and maintain a fixed vocabulary size. Using a shared vocabulary across all modalities, the WordPiece tokenizer enables the model to learn cross-modal associations and alignments.
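To make the vector-quantization step concrete, here’s a minimal sketch of the codebook lookup at the heart of a VQ-VAE tokenizer; the codebook size, latent dimension, and 14x14 grid are illustrative assumptions, not the 4M-21 configuration:

```python
import torch

codebook = torch.randn(1024, 64)          # 1024 learned codes, 64-dim latents
patch_latents = torch.randn(14 * 14, 64)  # encoder output for one image's patches

# snap each patch latent to its nearest codebook entry; the entry's index
# becomes the discrete token the transformer sees
dists = torch.cdist(patch_latents, codebook)  # (196, 1024) pairwise distances
token_ids = dists.argmin(dim=1)               # (196,) discrete tokens
quantized = codebook[token_ids]               # latents the decoder reconstructs from
```

In training, the encoder, decoder, and codebook are all learned; everything here is random just to show the shapes and the lookup.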
This modality-specific tokenization approach seems promising.
It provides a common representation space for all modalities, enabling the model to process them using a shared architecture. It also preserves modality-specific information, such as spatial structure in images and semantic meaning in text, which is crucial for effective multimodal learning and reasoning. Finally, converting all data types into sequences of discrete tokens enables cross-modal interactions and attention mechanisms, allowing different modalities to communicate and exchange information.
🗓️ Upcoming Events
Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.