DEV Community

Chishan

How to Extract AI-Ready Prompts from Any Video Using Computer Vision

If you've ever tried to describe a video scene in words precise enough for Sora or Runway to reproduce, you know the frustration. Human descriptions tend to be vague — we say "moody" when the AI needs "low-key lighting, 5600K color temperature, desaturated teal shadows with lifted blacks."

Computer vision can close that gap by systematically extracting visual attributes from video frames. In this article, I'll walk through the practical techniques for building a frame-level analysis system that turns any video into structured prompt data.

Why Frame Analysis Matters for Prompt Quality

Most people approach AI video generation by writing prompts from memory or imagination. The problem: we're remarkably bad at articulating visual details.

Consider describing a 10-second clip from a cooking video. You might write:

"A chef cooking in a kitchen"

But the visual reality contains dozens of describable attributes:

"Overhead shot, warm tungsten lighting mixed with natural window light,
shallow depth of field focusing on hands, wooden cutting board on marble
counter, steam rising from pan, slow motion at 120fps, slightly
desaturated warm tones, professional kitchen environment"

Computer vision can extract these attributes programmatically, producing consistently detailed prompts.

The Frame Analysis Stack

Building a practical extraction system requires four components working together.

1. Adaptive Frame Sampling

Not all frames are equally informative. A 30-second video at 30fps gives you 900 frames, but most are nearly identical. Intelligent sampling can cut the frames you analyze by 95% or more while keeping the informative ones.

The approach I've found most effective combines three strategies:

Content-based sampling uses perceptual hashing to detect visual changes:

import cv2
import numpy as np

def sample_frames_adaptive(video_path, min_distance=0.15):
    cap = cv2.VideoCapture(video_path)
    selected_frames = []
    prev_hist = None

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Compute a normalized 2D hue/saturation histogram as the frame feature
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)

        if prev_hist is None:
            selected_frames.append(frame)
            prev_hist = hist
            continue

        # Bhattacharyya distance between this frame and the last selected one
        distance = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)

        if distance > min_distance:
            selected_frames.append(frame)
            prev_hist = hist

    cap.release()
    return selected_frames

Temporal anchoring ensures coverage by always including the first frame, last frame, and frames at regular intervals regardless of visual similarity. This prevents missing slow transitions that content-based sampling might skip.
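
Temporal anchoring can be a few lines layered on top of the content-based sampler. Here is a minimal sketch; the helper names `anchor_frame_indices` and `merge_samples` are my own, not a library API:

```python
def anchor_frame_indices(total_frames, fps, interval_sec=2.0):
    # Always keep the first frame, the last frame, and one frame
    # every interval_sec seconds, regardless of visual similarity
    step = max(1, int(fps * interval_sec))
    anchors = set(range(0, total_frames, step))
    anchors.add(total_frames - 1)
    return anchors

def merge_samples(content_indices, total_frames, fps, interval_sec=2.0):
    # Union of content-change frames and temporal anchors, in playback order
    anchors = anchor_frame_indices(total_frames, fps, interval_sec)
    return sorted(set(content_indices) | anchors)
```

Feeding the frame indices selected by the content-based pass through `merge_samples` guarantees the first frame, last frame, and periodic coverage survive even in visually static footage.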

Motion peak detection uses optical flow magnitude to identify moments of peak activity — camera movements, subject gestures, or scene transitions — which often contain the most prompt-relevant information.
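
If you already have a mean flow magnitude per frame pair (section 4 below shows how to compute optical flow), peak picking can be as simple as a local-maximum scan. A sketch, with an illustrative prominence threshold:

```python
import numpy as np

def motion_peaks(magnitudes, min_prominence=0.5):
    # A frame is a motion peak if its flow magnitude is a local maximum
    # and exceeds the clip-wide mean by at least min_prominence
    mags = np.asarray(magnitudes, dtype=np.float32)
    baseline = mags.mean()
    return [
        i for i in range(1, len(mags) - 1)
        if mags[i] > mags[i - 1]
        and mags[i] >= mags[i + 1]
        and mags[i] > baseline + min_prominence
    ]
```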

2. Color and Lighting Extraction

Color grading and lighting conditions are among the most impactful prompt elements. AI generators are highly responsive to specific color and lighting descriptions.

Dominant Color Palette

K-means clustering on the HSV color space identifies the primary palette:

from sklearn.cluster import KMeans

def extract_palette(frame, n_colors=5):
    pixels = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pixels = pixels.reshape(-1, 3).astype(np.float32)

    # Sample pixels for efficiency
    sample_size = min(10000, len(pixels))
    indices = np.random.choice(len(pixels), sample_size, replace=False)
    sample = pixels[indices]

    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=42)
    kmeans.fit(sample)

    # Sort clusters by frequency; index centers via the label values,
    # since np.unique omits any empty clusters
    labels, counts = np.unique(kmeans.labels_, return_counts=True)
    sorted_indices = np.argsort(-counts)

    return [
        {
            'rgb': kmeans.cluster_centers_[labels[i]].astype(int).tolist(),
            'percentage': counts[i] / len(sample) * 100
        }
        for i in sorted_indices
    ]

Color Temperature Classification

Mapping the dominant colors to a temperature scale provides lighting context:

def classify_color_temperature(palette):
    avg_rgb = np.mean([c['rgb'] for c in palette[:3]], axis=0)
    r, g, b = avg_rgb

    if r > b * 1.3:
        return "warm (tungsten/golden hour)"
    elif b > r * 1.2:
        return "cool (daylight/blue hour)"
    else:
        return "neutral"

Contrast and Dynamic Range

The histogram spread indicates contrast level:

  • Compressed histogram: low contrast, flat look
  • Wide histogram with peaks at extremes: high contrast, dramatic
  • Bimodal distribution: film noir or high-key style

These measurements translate directly into prompt language: "high contrast with deep blacks and bright highlights" vs "low contrast, lifted shadows, muted look."
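
As a sketch, the classification can come from the spread between low and high luminance percentiles of a grayscale frame. The thresholds below are illustrative and assume 8-bit values:

```python
import numpy as np

def classify_contrast(gray):
    # gray: 8-bit grayscale frame, e.g. cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Percentiles are more robust to a few outlier pixels than raw min/max
    p5, p95 = np.percentile(gray, [5, 95])
    spread = p95 - p5
    if spread < 100:
        return "low contrast, flat look"
    elif spread > 200:
        return "high contrast, dramatic"
    return "medium contrast"
```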

3. Composition and Spatial Analysis

Where subjects sit in the frame affects which composition terms to include in prompts.

Edge density mapping divides the frame into a 3x3 grid and measures visual complexity in each region:

def analyze_spatial_composition(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    h, w = edges.shape
    h3, w3 = h // 3, w // 3

    regions = {}
    for row, row_name in enumerate(['top', 'middle', 'bottom']):
        for col, col_name in enumerate(['left', 'center', 'right']):
            region = edges[row*h3:(row+1)*h3, col*w3:(col+1)*w3]
            regions[f'{row_name}_{col_name}'] = np.mean(region)

    # Determine composition pattern
    center_weight = regions['middle_center']
    periphery_weight = np.mean([
        v for k, v in regions.items() if k != 'middle_center'
    ])

    if center_weight > periphery_weight * 1.5:
        return "centered composition"
    elif regions['middle_left'] > center_weight:
        return "subject positioned left, rule of thirds"
    elif regions['middle_right'] > center_weight:
        return "subject positioned right, rule of thirds"
    else:
        return "balanced wide composition"

This feeds directly into prompt construction. "Centered composition" suggests close-up or portrait framing, while "rule of thirds" placement suggests more cinematic framing with negative space.
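
That interpretation can be made explicit with a lookup from the labels `analyze_spatial_composition` returns to framing language. The hint strings here are illustrative, not canonical prompt vocabulary:

```python
# Maps detected composition labels to framing suggestions for the prompt
FRAMING_HINTS = {
    "centered composition": "close-up or portrait framing",
    "subject positioned left, rule of thirds": "cinematic framing, negative space on the right",
    "subject positioned right, rule of thirds": "cinematic framing, negative space on the left",
    "balanced wide composition": "wide establishing shot",
}

def framing_hint(composition):
    # Fall back to a neutral framing term for unrecognized labels
    return FRAMING_HINTS.get(composition, "medium shot")
```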

4. Motion and Camera Movement Detection

For video-to-prompt conversion, understanding camera movement is essential for generators like Sora that explicitly support motion directives.

Optical flow between consecutive frames reveals both camera and subject motion:

def classify_camera_movement(frame1, frame2):
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    flow = cv2.calcOpticalFlowFarneback(
        gray1, gray2, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    )

    # Global flow direction indicates camera movement. Note that the
    # scene moves opposite to the camera: content drifting right across
    # the frame means the camera is panning left.
    mean_flow_x = np.mean(flow[..., 0])
    mean_flow_y = np.mean(flow[..., 1])
    magnitude = np.sqrt(mean_flow_x**2 + mean_flow_y**2)

    if magnitude < 0.5:
        return "static camera, locked-off shot"

    # Image y-axis points down, so positive flow y means content moving down
    angle = np.arctan2(mean_flow_y, mean_flow_x) * 180 / np.pi

    if -45 <= angle < 45:
        return "panning left"      # content moves right
    elif 45 <= angle < 135:
        return "tilting up"        # content moves down
    elif -135 < angle <= -45:
        return "tilting down"      # content moves up
    else:
        return "panning right"     # content moves left

The flow magnitude also indicates speed: low magnitude with consistent direction suggests a slow, smooth dolly or pan, while high magnitude with varied directions suggests handheld or action footage.
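
The same flow field supports that speed-and-stability read. A sketch using direction consistency, measured as the mean resultant length of the flow angles (the thresholds are illustrative):

```python
import numpy as np

def classify_motion_quality(flow):
    # flow: (H, W, 2) displacement field, e.g. from calcOpticalFlowFarneback
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    angles = np.arctan2(flow[..., 1], flow[..., 0])

    # Mean resultant length of the angles: 1.0 = perfectly uniform direction,
    # near 0 = vectors pointing every which way
    consistency = np.abs(np.mean(np.exp(1j * angles)))

    if mag.mean() < 0.5:
        return "static camera"
    if consistency > 0.8:
        return "slow, smooth dolly or pan" if mag.mean() < 3 else "fast pan"
    return "handheld or action footage"
```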

Putting It Together: From Analysis to Prompt

Once you have all four layers of analysis, combining them into a coherent prompt follows a priority order:

  1. Shot type and framing (wide, medium, close-up)
  2. Camera movement (static, pan, dolly, handheld)
  3. Subject description (what's in the frame)
  4. Lighting (direction, quality, color temperature)
  5. Color grading (palette, contrast, saturation)
  6. Aesthetic style (cinematic, documentary, commercial)

A small assembler can walk that priority list, skipping attributes that were not detected:

def describe_palette(palette):
    # Turn the top clusters into a short comma-separated description
    return ", ".join(
        f"rgb({c['rgb'][0]}, {c['rgb'][1]}, {c['rgb'][2]})" for c in palette[:3]
    )

def assemble_prompt(analysis):
    parts = []

    if analysis['composition']:
        parts.append(analysis['composition'])

    # classify_camera_movement returns "static camera, ..." for locked-off shots
    if not analysis['camera_movement'].startswith('static'):
        parts.append(analysis['camera_movement'])

    if analysis['color_temperature']:
        parts.append(f"{analysis['color_temperature']} lighting")

    if analysis['contrast'] == 'high':
        parts.append("high contrast with deep shadows")
    elif analysis['contrast'] == 'low':
        parts.append("low contrast, lifted shadows")

    parts.append(f"color palette: {describe_palette(analysis['palette'])}")

    return ", ".join(parts)

Practical Applications

This frame analysis approach powers several real-world use cases:

Style matching: Analyze a reference video, generate its prompt, then use that prompt with a different subject to achieve the same visual style.

Prompt iteration: Start with a computer vision-extracted prompt, generate a video, then compare the output's analysis with the original to identify what to adjust.

Learning: By seeing exactly what visual attributes the system extracts, you develop better intuition for writing prompts manually.

Tools like TubePrompter implement this full pipeline, letting you paste a YouTube, TikTok, or Instagram URL and receive optimized prompts for Sora, Midjourney, Runway, and other generators without building the infrastructure yourself.

Limitations to Be Aware Of

Frame-level analysis has known blindspots:

  • Narrative context: Computer vision sees pixels, not story. A suspenseful scene and a mundane one might have similar visual attributes.
  • Audio influence: Sound design heavily influences perceived mood, but video analysis misses this entirely.
  • Temporal coherence: Analyzing individual frames loses information about how visual elements change over time.
  • Style subjectivity: "Cinematic" means different things to different people and different AI models.

These limitations are why the best results come from using extracted prompts as a starting point, then refining manually based on your creative intent.

What's Next

The gap between what we see and what we can describe to AI is narrowing. As vision-language models improve, frame analysis will become more nuanced — detecting not just what's in a frame, but the emotional tone, the narrative purpose, and the cultural context.

For now, the practical takeaway is straightforward: let computer vision handle the objective visual analysis, and focus your creative energy on the subjective elements that make your vision unique.


What visual attributes do you find hardest to describe in prompts? I'd be curious to hear what gaps computer vision could help fill for your workflow. Drop a comment below.
