
Elizabeth Fuentes L for AWS

Originally published at builder.aws.com

Build Agentic Video Analysis with TwelveLabs Pegasus in minutes

Learn to build intelligent video analysis agents in minutes using TwelveLabs Pegasus 1.2 and the Strands Agents SDK. Explore open source and AWS-native approaches with minimal code.

Since the generative AI boom began, one concept has captured widespread attention: embeddings. This technique converts text into numerical representations and projects them into a vector space. Through mathematical operations such as cosine similarity, you can then measure how similar two vectors are.
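
As a quick illustration, here's a minimal cosine similarity computation with NumPy (the vectors are made-up toy values, not real embeddings):

import numpy as np

# Two toy "embedding" vectors (illustrative values only)
a = np.array([0.12, 0.87, 0.33])
b = np.array([0.10, 0.91, 0.29])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {similarity:.4f}")  # values close to 1.0 indicate similar vectors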

Semantic search operates as a sophisticated mathematical search process. The technique isn't new: Google's Word2Vec popularized it back in 2013. With ChatGPT and Retrieval-Augmented Generation (RAG), embeddings and vector databases gained mainstream attention.

This popularization brought significant evolution. Modern foundation models (FMs) democratized embedding use, extending their application beyond text to all types of content. In previous blog posts, I showed how video processing was particularly complex: first you separated the video into individual frames and extracted the audio, then you generated separate embeddings before creating a multimodal vector database. This approach works, but it requires significant effort and, depending on your use case, extensive infrastructure.

With multimodal models such as TwelveLabs, Gemini Embedding, or ImageBind, you no longer need to decompose video into constituent parts. These models process video, audio, and context natively. They generate unified embeddings that capture complete content semantics in one operation.

This post demonstrates the next evolution in video processing. You'll create an agentic application using TwelveLabs and the Strands Agents SDK. With minimal code, you'll build an intelligent system that understands video content and delivers results.

How TwelveLabs Pegasus 1.2 Processes Video

Traditional approaches treat videos as sequences of independent frames. TwelveLabs uses a "video-first" architecture that divides content into temporal segments. The default is 6-second clips, configurable from 2-10 seconds. Each segment generates video-native embeddings between 512 and 1024 dimensions.

These embeddings capture temporal relationships across visual, audio, and text modalities. With internal sampling of approximately 1 frame per second and 256 patches per frame, a 3-minute video translates to approximately 46,000 tokens. This approach is more efficient than image-first models. It processes videos up to 1 hour long while maintaining complete temporal context. Precise timestamp metadata enables accurate video search and retrieval.
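
A quick back-of-the-envelope check of those numbers:

# Rough math for a 3-minute video, using the figures above
duration_seconds = 3 * 60              # 180 seconds
sampled_frames = duration_seconds * 1  # ~1 frame sampled per second
tokens = sampled_frames * 256          # 256 patches per frame
segments = duration_seconds // 6       # default 6-second temporal segments

print(tokens)    # 46080 -> roughly 46,000 tokens
print(segments)  # 30 segments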

Key capabilities of Pegasus 1.2:

  • Long-form video understanding: Handles videos up to 1 hour with state-of-the-art accuracy
  • Temporal precision: Maintains context across extended sequences
  • Multimodal comprehension: Integrates visual, audio, and textual information
  • Production-ready: Low latency and cost-efficient for enterprise applications

Two Approaches

This tutorial presents two implementation approaches using TwelveLabs Pegasus 1.2. Both approaches share identical agent logic and capabilities. The key difference: video storage location and model invocation method.

Clone the repository:


git clone https://github.com/aws-samples/sample-multimodal-agent-tutorial
cd sample-multimodal-agent-tutorial/notebooks
# open 08-agentic-video-analysis.ipynb

🔓 Approach 1: Fully Open Source

TwelveLabs video analysis agents

Requirements:

  • TwelveLabs API Key (set as TWELVELABS_API_KEY)
  • Your preferred LLM provider (Anthropic, OpenAI, etc.)

Let's build the fully open source version. Here's the complete implementation:

Step 1: Environment Setup

This approach uses the twelvelabs_video_analysis tool from the repository.

# Install dependencies
pip install "strands-agents[anthropic]" twelvelabs strands-agents-tools

# Set environment variables
export TWELVELABS_API_KEY="your-twelvelabs-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

Step 2: Initialize the Agent

import os
from strands import Agent
from twelvelabs_video_tool import twelvelabs_video_analysis
from strands.models.anthropic import AnthropicModel

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") 

anthropic_model = AnthropicModel(
    client_args={"api_key": ANTHROPIC_API_KEY},
    model_id="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    params={"temperature": 0.3}
)

twelvelabs_agent = Agent(
    tools=[twelvelabs_video_analysis],
    model=anthropic_model,
    system_prompt="""You are a specialized video analysis agent using TwelveLabs API. You can:

    1. Upload videos to create searchable indexes
    2. Generate video insights (titles, topics, hashtags)
    3. Answer detailed questions about video content
    4. List and search through video collections

    Always be helpful and provide comprehensive video analysis.
    When users ask about videos, first check what videos are available.
    """
)

That's it. You now have a fully functional video analysis agent.

Step 3: Test Your Agent

video_path = "/path/to/your/video.mp4"

response = twelvelabs_agent(f"Upload and analyze the video at {video_path} with the name 'sample-video'")

What happens behind the scenes:

  1. Agent receives your request
  2. Tool creates a new index in TwelveLabs
  3. Uploads and processes the video
  4. Waits for processing completion
  5. Returns the response
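
If you're curious what a tool like twelvelabs_video_analysis does under the hood, here's a rough sketch using the TwelveLabs Python SDK. Method names and index options vary between SDK versions, so treat this as an approximation of the flow rather than the tool's actual implementation (the repository has the real code):

import os
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key=os.environ["TWELVELABS_API_KEY"])

# 1. Create an index backed by the Pegasus model (option names may differ by SDK version)
index = client.index.create(
    name="sample-video-index",
    models=[{"name": "pegasus1.2", "options": ["visual", "audio"]}],
)

# 2. Upload the video and block until processing completes
task = client.task.create(index_id=index.id, file="/path/to/your/video.mp4")
task.wait_for_done(sleep_interval=5)

# 3. Ask Pegasus about the processed video
result = client.generate.text(
    video_id=task.video_id,
    prompt="Summarize this video and list its main topics.",
)
print(result.data)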

You can list existing videos:

response = twelvelabs_agent("List all available videos")

Want to analyze a specific video? Just reference it by name:

response = twelvelabs_agent("Summarize the  
AI-Tool-Calling video)

This video: AI Tool Calling via Natural Language: LLMs, APIs & Docker in Action


Note: Generating the response took only 19 seconds.

You can ask a follow-up question:

response = twelvelabs_agent("When is S3 mentioned in the video?")


☁️ Approach 2: AWS-Native

The AWS-native implementation provides the same capabilities with managed infrastructure.

TwelveLabs video analysis agents with AWS

Requirements:

  • AWS Account with Bedrock access
  • Amazon S3 bucket for video storage
  • IAM permissions for Bedrock

Step 1: Setup with AWS Credentials

This approach uses the bedrock_video_analysis tool from the repository.

import os

os.environ['S3_BUCKET_NAME'] = 'your-video-bucket'  # Replace with your S3 bucket
os.environ['AWS_REGION'] = 'us-east-1'  # Replace with your preferred region

Step 2: Initialize the Agent

from strands import Agent
from bedrock_video_tool import bedrock_video_analysis

bedrock_agent = Agent(
    tools=[bedrock_video_analysis],
    system_prompt="""You are a specialized video analysis agent using AWS Bedrock Pegasus model. You can:

    1. Analyze videos directly from local files or S3 URIs
    2. Automatically upload local videos to S3 when needed
    3. Answer questions about video content and scenes
    4. List and search videos in S3 buckets

    Always be helpful and provide detailed video analysis.
    When analyzing local files, I'll automatically upload them to S3 first.
    """
)

Step 3: Test Your Agent

video_path = "/path/to/your/video.mp4"
response = bedrock_agent(f"Upload and analyze the video at {video_path} with the name 'sample-video'")

What happens behind the scenes:

  1. Receives your request
  2. Uploads the video to Amazon S3
  3. Invokes the TwelveLabs Pegasus model through Amazon Bedrock
  4. Returns the response
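
In code, that flow looks roughly like the sketch below: upload the file with boto3, then call Pegasus through the Bedrock Runtime API. The model ID and request fields (inputPrompt, mediaSource) are assumptions on my part, so verify them against the Amazon Bedrock documentation for your Region:

import json
import os
import boto3

bucket = os.environ["S3_BUCKET_NAME"]
region = os.environ.get("AWS_REGION", "us-east-1")

# 1. Upload the local video to Amazon S3
s3 = boto3.client("s3", region_name=region)
s3.upload_file("/path/to/your/video.mp4", bucket, "videos/sample-video.mp4")

# 2. Invoke Pegasus through the Bedrock Runtime API
#    (model ID and request schema are assumptions; check the Bedrock docs)
bedrock = boto3.client("bedrock-runtime", region_name=region)
account_id = boto3.client("sts").get_caller_identity()["Account"]

body = {
    "inputPrompt": "Summarize this video and list its main topics.",
    "mediaSource": {
        "s3Location": {
            "uri": f"s3://{bucket}/videos/sample-video.mp4",
            "bucketOwner": account_id,
        }
    },
}

response = bedrock.invoke_model(
    modelId="us.twelvelabs.pegasus-1-2-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))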

You can list existing videos:

response = bedrock_agent("List all available videos")

Want to analyze a specific video? Just reference it by name:

response = bedrock_agent("Summarize the  
AI-Tool-Calling video)

The same video: AI Tool Calling via Natural Language: LLMs, APIs & Docker in Action


You can ask a follow-up question:

response = bedrock_agent("When is S3 mentioned in the video?")


Happy building! 🚀

What's Next: Production Deployment

You've built a powerful agentic video analysis system with minimal code. But how do you take this to production?

In the next episode, we'll explore: 🚀 Deploying to Production with Amazon Bedrock AgentCore

Conclusion

You've built an agentic video analysis system with minimal code using TwelveLabs Pegasus 1.2 and the Strands Agents SDK. This solution demonstrates how multimodal models simplify video processing by eliminating the need for manual frame extraction and separate embedding generation.

Technical Achievements:

  • Native video processing without manual decomposition
  • Temporal context preservation across extended sequences
  • Unified embeddings for visual, audio, and text modalities
  • Production-ready performance with low latency

Implementation Flexibility:

  • Open source approach: Complete control with any LLM provider
  • AWS-native approach: Managed infrastructure with Amazon Bedrock (or any LLM provider) and Amazon S3
  • Identical agent logic across both approaches
  • Minimal code requirements for enterprise-grade capabilities

Ready to create your own Strands agent?

Here are some resources:


Thank you!

