🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - How to build an AI Engine to generate near real time insights for videos at scale
In this video, Aza Kai from Tokyo-based startup infinimind presents their AI-powered video search engine, which addresses the "dark data crisis": roughly 90% of enterprise video data remains unanalyzed, with enterprises storing three to eight petabytes of it. Their foundational model features four capabilities: Causal Event Reasoning (understanding "why" events occur), Omnimodal Retrieval (searching across all video modalities), Precise Temporal Grounding (finding exact moments), and Long-Context Understanding (analyzing hours to years of footage). Currently deployed for TV broadcasting in Japan, they're expanding to retail, manufacturing, and security. Built on AWS infrastructure using ParallelCluster, S3, and NVIDIA GPUs including the H100, L40S, and B200, they're pursuing SOC 2 certification and developing on-premise deployment options for enterprise customers.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Solving the Dark Data Crisis: Building a Next-Generation AI Video Understanding Model with Four Core Capabilities
Hi everyone. I'm very happy to present to this wonderful audience. My name is Aza Kai, and I flew 16 hours all the way from Tokyo, Japan to do this presentation for you about our product, our startup, and the amazing technology we're building. The title of this presentation is "How to build an AI Engine to generate near real time insights for videos at scale," which means our company is building a next-generation AI-based search engine for large-scale video datasets. We provide a search engine that allows you to search for anything in the video content you want.
Let's start with the problem we're trying to solve with our technology. As you know, the world is being recorded constantly. I see several cameras recording me at this moment, and all this recorded video gets stored but never analyzed. If you look at the entire internet's data, more than 80% of it is video. Enterprises have between three and eight petabytes of data stored that remains unanalyzed; we call this "dark data." About 90% of video data is never analyzed for its content, and we call this the "dark data crisis." The main reason is that current technology does not allow us to analyze video content cheaply and accurately and make it searchable. The technology simply isn't there yet, and my company is trying to solve this problem for large enterprises by making this data searchable and analyzable.
There's an interesting concept here called "from points to lines." Currently, AI models allow basic understanding: you can do object detection, face detection, and point-level understanding of video. What we're trying to build is the lines, meaning we want to understand video through time, understand the causal relationships between events within the video, and then provide an overall story of what happened. We're providing the next level of understanding: human-level understanding of video content.
In order to achieve this, we are currently developing a foundational video understanding model with four capabilities. The first is "Causal Event Reasoning," which answers the question "why" inside the video. The second is "Omnimodal Retrieval": most currently available technologies only let you search through visual elements or audio separately, whereas what we're building combines every modality of the video into a single space and makes the entire video searchable. The third capability is "Precise Temporal Grounding," which means we can understand the temporal nature of video content and search for specific events at specific moments in the video using natural language.
The fourth capability is "Long-Context Understanding." You can't solve these hard problems with just simple clip-based understanding. We're solving the problem of understanding not just a single 30-minute clip, but hours, days, weeks, or even years of video content. You can ask any questions you want and get answers across this extended timeframe.
For causal event reasoning, here's an example. Imagine you have a security camera recording, and you see a person running somewhere, but you don't know the reason. Unless you go back and watch the video yourself, you wouldn't understand why this person is running. Now you can ask the question, "Why is this person running?" Our video understanding model will reason over the past events and provide you with the exact answer.
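To make the idea of a "why" query concrete, here is a minimal, hypothetical sketch (not infinimind's actual API): events detected in a window before the moment in question are collected and handed, together with the question, to a reasoning model. The event list, timestamps, and helper names are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical illustration only: a minimal "causal event reasoning" flow.
# Events would normally come from a video understanding model; here they
# are hard-coded so the example is self-contained.

@dataclass
class Event:
    t: float          # seconds from the start of the recording
    description: str  # natural-language description of what was detected

events = [
    Event(t=112.0, description="a dog enters the yard through the open gate"),
    Event(t=118.5, description="the dog barks and moves toward the mailbox"),
    Event(t=121.0, description="a person starts running toward the gate"),
]

def preceding_context(events: list[Event], t: float, window_s: float = 60.0) -> list[Event]:
    """Return events detected in the window_s seconds before time t."""
    return [e for e in events if t - window_s <= e.t < t]

def build_why_prompt(question: str, t: float) -> str:
    """Compose a prompt asking a reasoning model to explain the event at time t."""
    context = "\n".join(f"[{e.t:7.1f}s] {e.description}" for e in preceding_context(events, t))
    return f"Preceding events:\n{context}\n\nQuestion: {question}\nAnswer with the likely cause."

# The prompt would then be sent to a video-aware reasoning model (not shown here).
print(build_why_prompt("Why is this person running?", t=121.0))
```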
The next use case is "Precise Temporal Grounding," which means finding exact moments and giving you the answer. For example, let's say you have four hours of a recorded meeting, and at some point the person signed a contract. Now you want to find the exact moment when the contract was signed and what was said. You pass this video to our system, we understand it, and you can ask us to find the exact time when the contract was signed. We can pull out the five-second clip of the contract being signed so you can review that specific moment. This is the kind of capability we're currently building.
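As a rough sketch of how such a grounding result might be consumed, assume the engine returns a start and end timestamp for the query (the `result` dictionary, timestamps, and file names below are made up); the five-second clip can then be cut with a standard tool such as ffmpeg.

```python
import subprocess

# Hypothetical illustration only: `result` stands in for a temporal grounding
# response to "when was the contract signed?". Values and file names are invented.
result = {
    "query": "the moment the contract is signed",
    "start_s": 7380.0,   # 2:03:00 into the recording
    "end_s": 7385.0,     # a five-second window around the event
}

def extract_clip(source: str, start_s: float, end_s: float, out_path: str) -> None:
    """Cut [start_s, end_s] out of `source` with ffmpeg (stream copy, no re-encode)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", source,
            "-ss", str(start_s),
            "-to", str(end_s),
            "-c", "copy",   # fast, but cut points land on the nearest keyframes
            out_path,
        ],
        check=True,
    )

extract_clip("meeting_recording.mp4", result["start_s"], result["end_s"], "contract_signed.mp4")
```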
The next one is Omnimodal Retrieval, a very interesting use case. Video is not only images; it is frames plus text, audio, and many other signals contained in a single data point. What we're trying to do is combine all the different modalities into a single space and make it searchable. For example, you might search for an incident at your home or facility where a glass-breaking sound coincides with a person running. This means the model needs a good understanding of both the audio and the visual nature of the video content. We combine all these capabilities to find the exact moments you're looking for in the video content.
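One generic way to make several modalities searchable together (a NumPy sketch of the general technique, not a description of infinimind's model) is to embed each modality per time segment, fuse the vectors into one shared space, and rank segments by cosine similarity against the query embedding. The encoders are placeholders here; random vectors stand in for their outputs.

```python
import numpy as np

# Generic sketch of omnimodal retrieval: per-segment visual and audio embeddings
# are fused into one vector and searched with cosine similarity.
rng = np.random.default_rng(0)
DIM = 256
N_SEGMENTS = 1000  # e.g. 1000 ten-second segments of footage

visual_emb = rng.standard_normal((N_SEGMENTS, DIM))  # placeholder visual encoder output
audio_emb = rng.standard_normal((N_SEGMENTS, DIM))   # placeholder audio encoder output

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Fuse modalities into a single space (simple average of normalized vectors).
segment_index = l2_normalize(l2_normalize(visual_emb) + l2_normalize(audio_emb))

def search(query_emb: np.ndarray, top_k: int = 5) -> list[tuple[int, float]]:
    """Return (segment_id, score) pairs ranked by cosine similarity."""
    scores = segment_index @ l2_normalize(query_emb)
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

# A text query such as "glass breaking while a person runs" would be encoded
# with a matching text encoder; a random vector stands in for it here.
query = rng.standard_normal(DIM)
print(search(query))
```

Averaging normalized vectors is only the simplest possible fusion; a production system would typically learn the joint space so that text, audio, and visual signals for the same moment land close together.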
Lastly, there's Long-Context Understanding, which involves understanding lengthy videos as completely as possible. One use case is security and forensic research. Imagine you're a police officer looking for a suspect, and you get footage from across the city with hundreds of hours of video. You want to track the suspect across all the different camera angles. Because we provide this kind of long-context understanding, the model can retain what happened across all that time and all those events. You can ask questions like "track the suspect across 12 cameras over six hours," and we can scan through all the video content and provide the exact moments when that suspect was present.
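To illustrate the output side of such a cross-camera query, here is a small hypothetical sketch: given per-camera retrieval hits for a description of the suspect, confident hits are stitched into one chronological trail. The hit values and field names are invented; in practice they would come from the retrieval step above.

```python
from dataclasses import dataclass

# Hypothetical illustration only: assemble per-camera retrieval hits into a
# single chronological trail for a query like "person in a red jacket".

@dataclass
class Hit:
    camera_id: str
    start_s: float   # seconds since the start of the shared six-hour window
    end_s: float
    score: float     # retrieval confidence

hits = [
    Hit("cam-03", 1200.0, 1212.0, 0.91),
    Hit("cam-07", 1425.0, 1440.0, 0.88),
    Hit("cam-01", 300.0, 318.0, 0.95),
    Hit("cam-11", 5120.0, 5131.0, 0.73),
]

def build_trail(hits: list[Hit], min_score: float = 0.8) -> list[Hit]:
    """Keep confident hits and order them in time to form a cross-camera trail."""
    return sorted((h for h in hits if h.score >= min_score), key=lambda h: h.start_s)

for h in build_trail(hits):
    print(f"{h.start_s:7.1f}s-{h.end_s:7.1f}s  {h.camera_id}  (score {h.score:.2f})")
```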
From TV Broadcasting in Japan to Global Enterprise Deployment: Leveraging AWS Infrastructure for Secure Video Intelligence
These are the four capabilities that make up the next-generation video search engine enterprises need right now. We developed the initial technology and infrastructure and deployed it in Japan for a specific use case. We are a Japanese startup, and our first deployment in Japan is TV broadcasting. Currently, we're tracking all TV broadcasting in Japan and making everything searchable. You can search for products, brands, locations, names, and phone numbers in video content broadcast on TV. From there, you can summarize the video, extract sentiment, segment content, track certain events, and more. This is what we're doing for our customers at the moment using this technology.
After discussing with many different customers in different industries, we found that this technology is extremely useful not only for TV broadcasting but also for security, manufacturing, retail, and more. To adapt our base model to different kinds of use cases, we're building what we call a fine-tuning or adaptation factory: we have a base model that we can quickly fine-tune for a specific use case to provide the best model for that use case in the video understanding space. This is what we're currently developing as well.
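As a rough illustration of the kind of lightweight adaptation such a factory could perform (a generic PyTorch sketch under assumed class counts and shapes, not infinimind's pipeline), one common option is to freeze the pretrained video encoder and train only a small task head on labeled domain clips.

```python
import torch
import torch.nn as nn

# Generic domain-adaptation sketch: freeze a pretrained encoder, train a small head.
# The encoder below is a stand-in; a real one would be loaded from a checkpoint.

class StandInVideoEncoder(nn.Module):
    """Placeholder for a pretrained encoder mapping a clip to a 512-d feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels, frames, height, width) -> per-channel mean
        return self.proj(clips.mean(dim=(2, 3, 4)))

encoder = StandInVideoEncoder()
for p in encoder.parameters():
    p.requires_grad = False              # keep the base model frozen

head = nn.Linear(512, 5)                 # 5 domain-specific event classes (assumed)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One toy training step; random tensors stand in for labeled domain clips.
clips = torch.randn(8, 3, 16, 64, 64)    # batch of 8 sixteen-frame clips
labels = torch.randint(0, 5, (8,))

with torch.no_grad():
    features = encoder(clips)
logits = head(features)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(f"adaptation step loss: {loss.item():.3f}")
```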
Some of the use cases we're considering at the moment and working on with customers include, apart from TV broadcasting and media, retail intelligence: we can understand a shopper's journey within the retail space through video, so we can understand pretty much all user behavior. Then there's manufacturing, including quality control, safety control, and productivity control; we're experimenting with these at the moment. The last piece is security, the most obvious use case, which is centered on forensic research.
We're doing this by partnering with AWS and utilizing the available technologies. We mainly use AWS ParallelCluster with Slurm to train our models, along with S3, EC2, and different kinds of environments. We also use different NVIDIA GPUs, including the H100, L40S, and B200. The Amazon ecosystem provides this kind of technology for us. We store terabytes and terabytes of data, maybe close to a petabyte, to train this kind of model. We use S3 buckets and FSx mounted onto the AWS clusters to train our models. This is the ecosystem we're currently building.
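For a sense of what a training entry point on such a cluster might look like, here is a minimal sketch assuming PyTorch launched with torchrun inside a Slurm job, with the dataset on an FSx for Lustre mount at a made-up path. This is illustrative only, not infinimind's actual training code, and the model is a trivial stand-in.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative skeleton of a distributed training worker as it might be launched
# with `torchrun` on a ParallelCluster compute node. Paths and the model are placeholders.

DATA_ROOT = "/fsx/video-shards"   # assumed FSx for Lustre mount backed by S3

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker process.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        torch.cuda.set_device(device)

    model = torch.nn.Linear(512, 512).to(device)   # stand-in for the video model
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Each rank would stream its own shard of clips from DATA_ROOT; random
    # tensors stand in for decoded video features here.
    for step in range(10):
        batch = torch.randn(32, 512, device=device)
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 5 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On ParallelCluster, a script like this would typically be submitted through Slurm (for example, an sbatch script wrapping a torchrun invocation), with the training data reaching the compute nodes through the FSx for Lustre file system linked to S3.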
For enterprise customers specifically, quality, trust, robustness, and security really matter. Since we're a small startup growing fast, in the upcoming quarters we're going to obtain SOC 2 certification. At the same time, we're developing an application containing our model that you can deploy in your own secure environment, so that we don't see your data at all. You can analyze and query your video content with our technology.
If you have any use cases or customers, I would love to discuss different possibilities with you. Feel free to scan the QR code on our website or reach us at info@infinimind.io. We're happy to answer any questions and do some experimentation with you. Unlock the intelligence in your video data. We're happy to support you on this journey. Thank you so much.
This article is entirely auto-generated using Amazon Bedrock.














