
Elizabeth Fuentes L for AWS

Originally published at builder.aws.com

Build Agentic Video Analysis with TwelveLabs Pegasus in minutes

Learn to build intelligent video analysis agents in minutes using TwelveLabs Pegasus 1.2 and the Strands Agents SDK. Explore open source and AWS-native approaches with minimal code.

Since the generative AI boom began, one concept has captured widespread attention: embeddings. This technique converts text into numerical representations and projects them into a vector space. Through mathematical operations such as cosine similarity, you can then measure how similar two vectors are.
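
As a quick illustration, here's a minimal cosine similarity computation with NumPy (the vectors are made-up toy values, not real embeddings):

import numpy as np

# Two toy "embedding" vectors (illustrative values only)
a = np.array([0.12, 0.87, 0.33])
b = np.array([0.10, 0.91, 0.29])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {similarity:.4f}")  # values close to 1.0 indicate similar vectors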

Semantic search operates as a sophisticated mathematical search process. The technique isn't new: Google's Word2Vec popularized it back in 2013. With ChatGPT and Retrieval-Augmented Generation (RAG), embeddings and vector databases gained mainstream attention.

This popularization brought significant evolution. Modern foundation models (FMs) democratized embedding use, extending their application beyond text to all types of content. In previous blog posts, I showed how video processing was particularly complex: first you separated the video into individual frames and extracted the audio, then you generated separate embeddings before creating a multimodal vector database. This approach works, but it requires significant effort and, depending on your use case, extensive infrastructure.

With multimodal models such as TwelveLabs, Gemini Embedding, or ImageBind, you no longer need to decompose video into constituent parts. These models process video, audio, and context natively. They generate unified embeddings that capture complete content semantics in one operation.

This post demonstrates the next evolution in video processing. You'll create an agentic application using TwelveLabs and the Strands Agents SDK. With minimal code, you'll build an intelligent system that understands video content and delivers results.

How TwelveLabs Pegasus 1.2 Processes Video

Traditional approaches treat videos as sequences of independent frames. TwelveLabs uses a "video-first" architecture that divides content into temporal segments. The default is 6-second clips, configurable from 2-10 seconds. Each segment generates video-native embeddings between 512 and 1024 dimensions.

These embeddings capture temporal relationships across visual, audio, and text modalities. With internal sampling of approximately 1 frame per second and 256 patches per frame, a 3-minute video translates to approximately 46,000 tokens. This approach is more efficient than image-first models. It processes videos up to 1 hour long while maintaining complete temporal context. Precise timestamp metadata enables accurate video search and retrieval.
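
A quick back-of-the-envelope check of those numbers:

# Rough math for a 3-minute video, using the figures above
duration_seconds = 3 * 60              # 180 seconds
sampled_frames = duration_seconds * 1  # ~1 frame sampled per second
tokens = sampled_frames * 256          # 256 patches per frame
segments = duration_seconds // 6       # default 6-second temporal segments

print(tokens)    # 46080 -> roughly 46,000 tokens
print(segments)  # 30 segments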

Key capabilities of Pegasus 1.2:

  • Long-form video understanding: Handles videos up to 1 hour with state-of-the-art accuracy
  • Temporal precision: Maintains context across extended sequences
  • Multimodal comprehension: Integrates visual, audio, and textual information
  • Production-ready: Low latency and cost-efficient for enterprise applications

Two Approaches

This tutorial presents two implementation approaches using TwelveLabs Pegasus 1.2. Both approaches share identical agent logic and capabilities. The key difference: video storage location and model invocation method.

Clone the repository:


git clone https://github.com/aws-samples/sample-multimodal-agent-tutorial
cd sample-multimodal-agent-tutorial/notebooks
# open 08-agentic-video-analysis.ipynb

🔓 Approach 1: Fully Open Source

TwelveLabs video analysis agents

Requirements:

  • TwelveLabs API Key (set as TWELVELABS_API_KEY)
  • Your preferred LLM provider (Anthropic, OpenAI, etc.)

Let's build the fully open source version. Here's the complete implementation:

Step 1: Environment Setup

This approach uses the twelvelabs_video_analysis tool from the repository.

# Install dependencies
pip install "strands-agents[anthropic]" twelvelabs strands-agents-tools

# Set environment variables
export TWELVELABS_API_KEY="your-twelvelabs-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

Step 2: Initialize the Agent

import os
from strands import Agent
from twelvelabs_video_tool import twelvelabs_video_analysis
from strands.models.anthropic import AnthropicModel

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") 

anthropic_model = AnthropicModel(
    client_args={"api_key": ANTHROPIC_API_KEY},
    model_id="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    params={"temperature": 0.3}
)

twelvelabs_agent = Agent(
    tools=[twelvelabs_video_analysis],
    model=anthropic_model,
    system_prompt="""You are a specialized video analysis agent using TwelveLabs API. You can:

    1. Upload videos to create searchable indexes
    2. Generate video insights (titles, topics, hashtags)
    3. Answer detailed questions about video content
    4. List and search through video collections

    Always be helpful and provide comprehensive video analysis.
    When users ask about videos, first check what videos are available.
    """
)

That's it. You now have a fully functional video analysis agent.

Step 3: Test Your Agent

video_path = "/path/to/your/video.mp4"

response = twelvelabs_agent(f"Upload and analyze the video at {video_path} with the name 'sample-video'")

What happens behind the scenes:

  1. Agent receives your request
  2. Tool creates a new index in TwelveLabs
  3. Uploads and processes the video
  4. Waits for processing completion
  5. Returns the response
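
If you're curious what a tool like twelvelabs_video_analysis does under the hood, here's a rough sketch using the TwelveLabs Python SDK. Method names and index options vary between SDK versions, so treat this as an approximation of the flow rather than the tool's actual implementation (the repository has the real code):

import os
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key=os.environ["TWELVELABS_API_KEY"])

# 1. Create an index backed by the Pegasus model (option names may differ by SDK version)
index = client.index.create(
    name="sample-video-index",
    models=[{"name": "pegasus1.2", "options": ["visual", "audio"]}],
)

# 2. Upload the video and block until processing completes
task = client.task.create(index_id=index.id, file="/path/to/your/video.mp4")
task.wait_for_done(sleep_interval=5)

# 3. Ask Pegasus about the processed video
result = client.generate.text(
    video_id=task.video_id,
    prompt="Summarize this video and list its main topics.",
)
print(result.data)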

You can list existing videos:

response = twelvelabs_agent("List all available videos")

Want to analyze a specific video? Just reference it by name:

response = twelvelabs_agent("Summarize the  
AI-Tool-Calling video)

This video: AI Tool Calling via Natural Language: LLMs, APIs & Docker in Action


Note: Generating the response took only 19 seconds.

You can ask a follow-up question:

response = twelvelabs_agent("When is S3 mentioned in the video?")


☁️ Approach 2: AWS-Native

The AWS-native implementation provides the same capabilities with managed infrastructure.

TwelveLabs video analysis agents with AWS

Requirements:

  • AWS Account with Bedrock access
  • Amazon S3 bucket for video storage
  • IAM permissions for Bedrock

Step 1: Setup with AWS Credentials

This approach uses the bedrock_video_analysis tool from the repository.

import os

os.environ['S3_BUCKET_NAME'] = 'your-video-bucket'  # Replace with your S3 bucket
os.environ['AWS_REGION'] = 'us-east-1'  # Replace with your preferred region

Step 2: Initialize the Agent

from strands import Agent
from bedrock_video_tool import bedrock_video_analysis

bedrock_agent = Agent(
    tools=[bedrock_video_analysis],
    system_prompt="""You are a specialized video analysis agent using AWS Bedrock Pegasus model. You can:

    1. Analyze videos directly from local files or S3 URIs
    2. Automatically upload local videos to S3 when needed
    3. Answer questions about video content and scenes
    4. List and search videos in S3 buckets

    Always be helpful and provide detailed video analysis.
    When analyzing local files, I'll automatically upload them to S3 first.
    """
)

Step 3: Test Your Agent

video_path = "/path/to/your/video.mp4"
response = bedrock_agent(f"Upload and analyze the video at {video_path} with the name 'sample-video'")

What happens behind the scenes:

  1. Receives your request
  2. Uploads the video to Amazon S3
  3. Invokes the TwelveLabs Pegasus model through Amazon Bedrock
  4. Returns the response
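
In code, that flow looks roughly like the sketch below: upload the file with boto3, then call Pegasus through the Bedrock Runtime API. The model ID and request fields (inputPrompt, mediaSource) are assumptions on my part, so verify them against the Amazon Bedrock documentation for your Region:

import json
import os
import boto3

bucket = os.environ["S3_BUCKET_NAME"]
region = os.environ.get("AWS_REGION", "us-east-1")

# 1. Upload the local video to Amazon S3
s3 = boto3.client("s3", region_name=region)
s3.upload_file("/path/to/your/video.mp4", bucket, "videos/sample-video.mp4")

# 2. Invoke Pegasus through the Bedrock Runtime API
#    (model ID and request schema are assumptions; check the Bedrock docs)
bedrock = boto3.client("bedrock-runtime", region_name=region)
account_id = boto3.client("sts").get_caller_identity()["Account"]

body = {
    "inputPrompt": "Summarize this video and list its main topics.",
    "mediaSource": {
        "s3Location": {
            "uri": f"s3://{bucket}/videos/sample-video.mp4",
            "bucketOwner": account_id,
        }
    },
}

response = bedrock.invoke_model(
    modelId="us.twelvelabs.pegasus-1-2-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))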

You can list existing videos:

response = bedrock_agent("List all available videos")

Want to analyze a specific video? Just reference it by name:

response = bedrock_agent("Summarize the  
AI-Tool-Calling video)

The same video: AI Tool Calling via Natural Language: LLMs, APIs & Docker in Action


You can ask a follow-up question:

response = bedrock_agent("When is S3 mentioned in the video?")


Happy building! 🚀

What's Next: Production Deployment

You've built a powerful agentic video analysis system with minimal code. But how do you take this to production?

In the next episode, we'll explore: 🚀 Deploying to Production with Amazon Bedrock AgentCore

Conclusion

You've built an agentic video analysis system with minimal code using TwelveLabs Pegasus 1.2 and the Strands Agents SDK. This solution demonstrates how multimodal models simplify video processing by eliminating the need for manual frame extraction and separate embedding generation.

Technical Achievements:

  • Native video processing without manual decomposition
  • Temporal context preservation across extended sequences
  • Unified embeddings for visual, audio, and text modalities
  • Production-ready performance with low latency

Implementation Flexibility:

  • Open source approach: Complete control with any LLM provider
  • AWS-native approach: Managed infrastructure with Amazon Bedrock (or any LLM provider) and Amazon S3
  • Identical agent logic across both approaches
  • Minimal code requirements for enterprise-grade capabilities

Ready to create your own Strands agent?

Here are some resources:


Thank you!

