
Pieces 🌟

Posted on • Originally published at code.pieces.app

Multimodal AI: Bridging the Gap Between Human and Machine Understanding


Multimodal AI is a powerful technology that integrates multiple sensory inputs to better understand and interact with its environment. This technology draws on diverse sources of data, merging them to improve decision-making and response accuracy. In this blog, we'll explore how combining these various forms of data enhances AI's functionality, making it more adaptable and intuitive. We'll discuss the key technologies that support multimodal learning, their applications, and how they are shaping the future of artificial intelligence.

What is Multimodal AI?

Before defining a multimodal approach, let’s take a moment to define modality. Modality refers to the mode or channel through which information is represented, communicated, or perceived; it captures the different ways that AI converses with humans and interacts with the world around it. For artificial intelligence, modality essentially means data type. Multimodal data includes text, images, audio, and video in conventional generative models, and extends to LiDAR, spatial mapping, sensor readings, and more in real-world systems. With that background, multimodal AI can be defined as systems that can process and understand information from multiple forms or modes of input, including text, images, audio, video, and even other sensory data.

Why Do We Need Multimodality?

The advantage of multimodal AI is its capacity to provide more accurate, context-rich, and useful outputs than unimodal systems (those that handle only one type of data). It mimics human cognitive abilities more closely, as humans naturally perceive and interpret the world through multiple senses. Let’s look a bit deeper:

  1. Holistic Understanding: Humans experience the world through multiple senses, and a similar approach in multimodal AI allows systems to gain a more comprehensive understanding of their environment or data. For instance, in autonomous driving, combining visual data (like road signs and traffic lights) with auditory signals (like sirens or honking) provides more complete situational awareness than relying on visual data alone.
  2. Improved Accuracy and Robustness: Multimodal AI can cross-verify information across different data types, enhancing accuracy and reliability. For example, in speech recognition, combining audio signals with lip-reading (video input) can greatly improve accuracy, especially in noisy environments.
  3. Enhanced User Interaction: Models trained with multimodal learning can interact with users in more natural and flexible ways. For example, a smart assistant that can both see and hear can respond to verbal commands and gestures alike, making the interaction more intuitive and accessible for users with different needs.
  4. Accessibility: Multimodal AI can significantly improve accessibility for people with disabilities. For example, an AI system that translates spoken language into sign language in real time can make communication more accessible to the deaf and hard-of-hearing community.
  5. Contextual Awareness: Multimodal strategies create AI that is better at understanding context, which is crucial for applications like recommendation systems or personalized services. By analyzing multiple data types (such as previous purchases, textual reviews, and video content), these systems can offer more relevant and customized suggestions to users.
  6. Complex Decision Making: In scenarios that require complex decision-making, such as medical diagnostics, combining multimodal data (like medical imaging, audio descriptions of symptoms, and written patient records) can lead to more accurate and timely diagnoses.

The rationale for favoring multimodality over unimodality is similar to why companies prefer face-to-face interviews (or video calls for remote positions) rather than just asking a bunch of questions over email: interviewers need to assess all facets and behaviors of a candidate before making a decision. Experience has shown that decisions informed by multiple modalities tend to be the most effective.

Unimodal vs Multimodal

Before moving further, let’s briefly look at how unimodal and multimodal systems differ, specifically among generative AI models. Consider a simple example task: getting some information about Mars.

In a unimodal system, you can ask a question through a simple prompt like “How far is Mars from Earth?” and the model generates its response, token by token, from the data it was pre-trained on. Some recent models go a step further and browse the web to make sure the output they produce is verified and up to date.

A flowchart of unimodal models.

In multimodal machine learning, you can ask questions in different ways. You can use text, images, or even speak your questions. The system can answer in the same way you asked—like talking back to you or showing you images. You can also upload videos or other types of files and expect answers in those formats. But there’s more—you can ask the system to create a song about Mars or to create cool pictures of the Starship on Mars. You can even upload a link to an image, add some files, and ask your question out loud all at once as a single prompt.

A flowchart depiction of multimodal models.

New models are also bringing video into the fold. This allows multimodal AI models to process more diverse data inputs, which expands the information available for training and inference. For example, footage captured by video cameras can be ingested as part of multimodal training and instruction. In the future, you might even get answers through videos featuring your favorite characters.
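To make the contrast concrete, here is a minimal sketch of a mixed text-and-image prompt, assuming the OpenAI Python SDK (openai>=1.0) and a vision-capable chat model; the model name and image URL are placeholders rather than a specific recommended setup.

```python
# Minimal sketch: one prompt combining text and an image, assuming the OpenAI
# Python SDK (openai>=1.0). The model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any vision-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                # The text part of the prompt
                {"type": "text", "text": "How far is the planet in this photo from Earth?"},
                # The image part of the prompt, passed by URL
                {"type": "image_url", "image_url": {"url": "https://example.com/mars.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

A unimodal version of the same request would simply pass the question as a plain string, with no image part in the message content.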

How Do Multimodal Systems Work?

Multimodal AI systems are designed to recognize patterns among various types of data inputs. They mainly consist of three elements:

  • Input module: A multimodal AI system is composed of many unimodal neural networks. These networks form the input module, which receives multiple data types. Each layer is dedicated to processing a different variety of data, and these layers individually implement technologies including:
    ◦ Natural language processing (NLP): keyword extraction, synonym/antonym detection, part-of-speech (POS) tagging, information extraction, tokenization, etc.
    ◦ Computer vision: optical character recognition (OCR), face recognition, gender/age recognition, semantic segmentation, object recognition, etc.
    ◦ Speech processing: hot word/trigger word detection, acoustic model creation, language model extraction from web text, noise cancellation, etc.
    ◦ Data mining: data generation for machine learning, KPI prediction, Big Data processing, recommendation algorithms, predictive analytics, etc.
  • Fusion module: Multimodal AI systems use data fusion techniques to integrate these data types, building a more comprehensive and accurate understanding of the underlying data. There are several strategies for blending different types of data, and they can be grouped by when they merge the data:
    ◦ Early fusion: Different kinds of data are mixed from the start to form a common space where everything is expressed in the same way. It’s like creating a single, new language that captures the essence of all the original data types.
    ◦ Mid-fusion: Merging happens at various stages of preparing the data. A layer in the neural network is designed just for merging, allowing the system to combine information at different points in the process.
    ◦ Late fusion: Separate models, each trained on a different type of data, make their own predictions, and those decisions are brought together at the very end. It’s like having a team of experts in different fields who all weigh in before making a final decision.

There isn’t a one-size-fits-all fusion method that works for everything. The best approach depends on the specific task the AI is working on, and it often takes some experimenting to find the perfect setup for the job. A toy code sketch of all three strategies follows this list.

  • Output module: The model’s generated results may be polished and improved through further processing. For example, it could pick the most fitting and relevant text response to raise the quality of its output. It might also put in place extra safeguards to avoid creating any harmful or offensive content. That said, we should expect the model’s output to vary each time we prompt it, even when given the exact same prompt as before. This variability can be by design, to keep the model adaptive, or it might stem from the processing steps just mentioned.
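To make the fusion strategies more tangible, here is a toy PyTorch sketch of early, mid, and late fusion. The dimensions, layer sizes, and class names are made up for illustration; real systems would use proper encoders (CNNs, transformers, and so on) in place of the linear layers.

```python
# Toy PyTorch sketch of early, mid, and late fusion. A 128-dim "image" vector
# and a 64-dim "text" vector stand in for the outputs of two unimodal encoders.
import torch
import torch.nn as nn

IMG_DIM, TXT_DIM, HIDDEN, NUM_CLASSES = 128, 64, 32, 10


class EarlyFusion(nn.Module):
    """Concatenate raw modality features first, then process them jointly."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, NUM_CLASSES),
        )

    def forward(self, img, txt):
        return self.classifier(torch.cat([img, txt], dim=-1))


class MidFusion(nn.Module):
    """Encode each modality separately, then merge in a dedicated fusion layer."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Linear(IMG_DIM, HIDDEN)
        self.txt_enc = nn.Linear(TXT_DIM, HIDDEN)
        self.fusion = nn.Linear(2 * HIDDEN, HIDDEN)  # the dedicated merging layer
        self.classifier = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, img, txt):
        h = torch.cat(
            [torch.relu(self.img_enc(img)), torch.relu(self.txt_enc(txt))], dim=-1
        )
        return self.classifier(torch.relu(self.fusion(h)))


class LateFusion(nn.Module):
    """Run a full model per modality and average their final predictions."""
    def __init__(self):
        super().__init__()
        self.img_model = nn.Linear(IMG_DIM, NUM_CLASSES)
        self.txt_model = nn.Linear(TXT_DIM, NUM_CLASSES)

    def forward(self, img, txt):
        return (self.img_model(img) + self.txt_model(txt)) / 2


if __name__ == "__main__":
    img, txt = torch.randn(4, IMG_DIM), torch.randn(4, TXT_DIM)
    for model in (EarlyFusion(), MidFusion(), LateFusion()):
        print(type(model).__name__, model(img, txt).shape)  # each prints (4, 10)
```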

We’ve looked at some fairly low-level concepts; now let’s see how they all come together in a model like GPT-4 with a multimodal learning example. Consider a case where we prompt the model with something like “Draw a picture of a cat” through a voice command. The model first converts the voice message into text input using Whisper. This text input is passed on to the inner model, the core GPT-4, which produces a text output describing the requested image of a cat. That text output is then fed into the outer model, the image generation model, which in this case is DALL-E. This model generates the image based on the prompt it received, ensures a few safety standards are met, and displays it to the user. Guardrails are set at every phase of this interaction to ensure there are no loopholes. If you want to know more about ChatGPT’s multimodal approach, have a look at their official blog.

A diagram of ChatGPT's multimodal approach.
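As a rough illustration of that voice-to-image chain, the sketch below strings together speech-to-text, text-to-text, and text-to-image calls using the OpenAI Python SDK. The model names and file path are illustrative, and this is an approximation of the flow described above, not ChatGPT's actual internal wiring.

```python
# Rough sketch of the voice -> text -> image chain, assuming the OpenAI Python
# SDK (openai>=1.0). Model names and the audio file path are illustrative only.
from openai import OpenAI

client = OpenAI()

# 1. Speech to text: transcribe the spoken prompt with Whisper.
with open("draw_a_cat.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Text to text: have the language model turn the request into a detailed image description.
chat = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Rewrite the user's request as a detailed image prompt."},
        {"role": "user", "content": transcript.text},
    ],
)
image_prompt = chat.choices[0].message.content

# 3. Text to image: hand the description to DALL-E and fetch the result's URL.
image = client.images.generate(model="dall-e-3", prompt=image_prompt, n=1)
print(image.data[0].url)
```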

Where Do We See Multimodal AI?

Let’s have a look at a few of the latest pieces of tech that leverage multimodality. Some of the interdisciplinary models and products that exist currently are:

  • Ray-Ban Meta smart glasses: The Ray-Ban Meta smart glasses are a collaboration between the tech company Meta (formerly Facebook) and the eyewear brand Ray-Ban. They feature built-in cameras and speakers, allowing users to capture photos and videos, listen to music, and take calls, all while maintaining the classic Ray-Ban style. They represent an effort to integrate technology seamlessly into everyday life.
  • Tesla’s self-driving cars: Tesla's self-driving cars utilize multimodal AI by processing multiple types of input data, such as visual data from cameras, radar signals, and ultrasonic sensors. This combination helps the car understand its surroundings, enabling it to navigate streets, identify obstacles, and follow traffic rules. It also uses audio input to process commands from inside the car, enhancing user interaction. The output is then translated into control signals for the vehicle's systems, demonstrating the multimodal output capabilities.
  • Generative AI models: First-generation generative AI models were typically text-to-text, responding to user prompts with text answers. Multimodal models like GPT-4 Turbo, Google Gemini, and CLIP introduce new capabilities that enhance user experience. By accepting prompts in different forms and producing content in diverse formats, multimodal AI agents offer vast potential.
  • AI copilots: Copilots powered by various LLMs, like Pieces Copilot, can leverage computer vision technologies for inputs beyond text and code. For example, the optical character recognition software at Pieces uses Tesseract as its main OCR engine, extended with bicubic upsampling; Pieces then uses edge-ML models to auto-correct potential defects in the resulting code or text, which users can feed to the AI copilot as prompts (a rough illustration of this kind of OCR preprocessing follows this list). Pieces Copilot in its current iteration also includes a tool called the Workstream Pattern Engine, which gathers real-time context from any application through computer vision, enabling Pieces to understand everything on your screen and pass it through to the LLM so you can talk to the AI about it.
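As a hedged illustration of the OCR-plus-upsampling idea mentioned above, the snippet below runs Tesseract (via pytesseract) on a bicubically upsampled screenshot. This is a generic sketch of the technique, not Pieces' actual pipeline; the scale factor and file name are placeholders.

```python
# Generic sketch: bicubic upsampling followed by Tesseract OCR, using Pillow
# and pytesseract. Not Pieces' actual pipeline; values below are placeholders.
from PIL import Image
import pytesseract

SCALE = 2  # illustrative upsampling factor

screenshot = Image.open("code_snippet.png")

# Bicubic upsampling often helps OCR accuracy on small, low-resolution text.
upsampled = screenshot.resize(
    (screenshot.width * SCALE, screenshot.height * SCALE),
    resample=Image.BICUBIC,
)

# Run Tesseract on the enlarged image to extract the text or code.
extracted_text = pytesseract.image_to_string(upsampled)
print(extracted_text)
```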

Conclusion

As technology advances, it becomes increasingly clear that we are moving closer to achieving Artificial General Intelligence (AGI). Multimodal AI plays a crucial role in this progress. Regardless of the specific architecture that future AGI might utilize, multimodality is considered fundamental. All emerging architectures and future AI tools are expected to inherently support multimodal capabilities, ensuring that they can process and integrate diverse types of data seamlessly. This integration is key to building systems that more closely mimic human cognitive abilities, propelling us towards more capable and versatile AI solutions. As we continue to innovate, the centrality of multimodality in emerging AI development underscores its importance in achieving the next leaps in artificial intelligence.
