Tsubasa Kanno

Beyond Transcription - Multimodal Video and Audio Analysis with Snowflake AI_COMPLETE

Introduction

Snowflake's unstructured data analytics has just taken another big leap forward. After progressively adding support for images, documents, and audio, AI_COMPLETE can now accept video and audio files as direct input!

This update — the multimodal video and audio extension of AI_COMPLETE — has just been released as Public Preview. It's one of the more impactful additions among Snowflake Cortex AI Functions, in my opinion. You can now hand a video or audio file directly to AI from SQL, and get summaries, classification, sentiment analysis, and structured extraction in a single shot.

Until now, audio analysis usually meant a two-step pipeline: transcribe with AI_TRANSCRIBE, then analyze the text with another function. Video analysis required even more setup — either preprocessing with Python video libraries to extract frames and split audio tracks, or building a dedicated processing pipeline on Snowpark Container Services (SPCS). With this update, you can skip those preprocessing steps and let the model directly understand vocal tone, pauses, and visual content — opening the door to analyses that were previously hard to attempt.

With images, documents, audio, and now video — every major unstructured data format used in business is queryable from SQL. This means you can now build multimodal analytics in a much simpler way than before.

This article walks through the basics of AI_COMPLETE's video and audio support, how it compares with AI_TRANSCRIBE on cost and quality, business use cases, and a simple sample app built with Streamlit in Snowflake. Stay till the end!

Note: The video and audio support introduced in this article is in Public Preview. Supported models and limits will continue to expand in future updates, so expect this space to keep evolving.

Note: This article reflects my personal views and does not represent Snowflake's official position.

What's New: AI_COMPLETE Multimodal Extension

AI_COMPLETE is the central general-purpose LLM function in the Snowflake Cortex AI Functions family. It already supported text, images, and documents as multimodal input, but with this update you can now pass video and audio files directly.

Audio analytics used to follow a "transcribe with AI_TRANSCRIBE → analyze text with AI_COMPLETE" pattern. Video analytics required either chopping frames into images via Python libraries and feeding them into AI_COMPLETE's image input, or building an end-to-end frame-extraction-and-inference pipeline with SPCS. With this multimodal extension, you can skip those preprocessing steps and hand the audio or video file itself to AI_COMPLETE for summarization, sentiment analysis, and structured extraction.

Key Highlights

  • SQL-native: Same calling convention as before — just pass a video or audio file
  • Joint audio + visual understanding: For video, the model considers both visual content and audio track
  • Multimodal composition: Combine with text prompts for conditional analysis or JSON-structured output
  • Wide format support: Most common video and audio formats are accepted
  • Secure processing: Files stay inside Snowflake stages

Part of the Cortex AI Functions Family

The multimodal extension shines when you chain its output into downstream AI Functions. After extracting a summary from a video, common follow-up patterns include:

  • AI_CLASSIFY: Categorize the extracted summary into industries or risk levels
  • AI_SENTIMENT: Score sentiment along multiple dimensions
  • AI_AGG: Aggregate insights across many media files
  • AI_EMBED: Embed extracted text or images for cross-corpus search
  • AI_TRANSCRIBE: A purpose-built function for transcription, ideal when you need timestamps and speaker labels
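A chained call might look like the following sketch, driven off the stage's directory table (the stage name, file filter, and prompts are illustrative, not from an official example; Snowflake's lateral column aliasing lets `summary` be reused within the same SELECT):

```sql
-- Sketch: summarize each video, then score the summary's sentiment.
-- Stage name, filter, and prompts are illustrative assumptions.
SELECT
    relative_path,
    AI_COMPLETE(
        'gemini-3.1-pro',
        'Summarize this video in about 100 words.',
        TO_FILE('@media_stage', relative_path)
    ) AS summary,
    AI_SENTIMENT(summary) AS sentiment
FROM DIRECTORY(@media_stage)
WHERE relative_path ILIKE '%.mp4';
```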

Basic Usage

Passing a video or audio file directly to AI_COMPLETE follows the same simple syntax as image input.

AI_COMPLETE(
    <model>,
    <prompt>,
    <file>
)

Parameters

  • model: Name of an LLM that supports video/audio input. As of now, gemini-3.1-pro is officially listed as supported. More multimodal models are expected to be added.
  • prompt: Natural-language instruction
  • file: A reference to a staged file via TO_FILE

Example 1: Summarizing a Video File

The simplest use case is to hand a video to the model and ask for a summary that takes both visuals and audio into account.

This article uses Pexels: Pouring Fresh Coffee into Ceramic Mug (Pexels License) as the sample video.

SELECT AI_COMPLETE(
    'gemini-3.1-pro',
    'Summarize this video in about 200 words. Include what is shown, the mood, and the likely scene.',
    TO_FILE('@media_stage', 'pouring_coffee.mp4')
) AS summary;

The actual response looks like this:

A glass coffee server slowly pours freshly brewed coffee into a uniquely designed
small mug shaped like a blue character, sitting on a white table. The cute design
of the cup combined with the quiet motion of the cup gradually filling up evokes
a warm, relaxed atmosphere — exactly the kind of feeling that lets you take a
breather. It feels like a calm morning coffee at home or an afternoon break.

The same pattern works for audio. Hand an English-narration audio file to the model and ask for a summary, and you get an output that captures both the content and the speaker's tone:

SELECT AI_COMPLETE(
    'gemini-3.1-pro',
    'Summarize this audio in about 150 words. Take the speaker''s emotion and atmosphere into account.',
    TO_FILE('@media_stage', 'english_narration.m4a')
) AS summary;

The response:

For Sherlock Holmes, she (Irene Adler) is always "the woman" — a uniquely special
presence. The narrator, Watson, says he has rarely heard Holmes refer to her by
any other name. The calm, reminiscent delivery quietly conveys the deep respect
Holmes holds for her, and her overwhelming presence.

Japanese audio works the same way. When I tested with a synthesized customer-service-style audio sample (using say for reproducibility — you can also use Mozilla Common Voice's Japanese dataset (CC0) or VOICEVOX to generate equivalent audio), the model neatly captured both the request and the emotional tone.

The speaker is reaching out to confirm the current status of a delayed delivery
of a recently purchased item. The emotional tone is highly polite and composed.

Example 2: Going Beyond Words — Tone, Pitch, and Pace

This is where AI_COMPLETE diverges decisively from AI_TRANSCRIBE: it lets the model evaluate the vocal delivery itself, including pitch, pace, tone, volume, and pauses (the official docs explicitly demonstrate this in the call-center example). The phrase "Hello" can carry completely different meaning depending on whether it's said in a bright, upbeat tone or a heavy, somber one. That nuance is lost the moment you transcribe to text, but AI_COMPLETE can capture it.

To verify, I synthesized the exact same Japanese phrase ("Hello, thank you for joining us today.") in two different tones — bright/fast vs. dark/slow — and passed each to AI_COMPLETE.

SELECT AI_COMPLETE(
    'gemini-3.1-pro',
    'You are an acoustic analyzer. Listen to the attached audio and evaluate the vocal delivery (pitch, pace, tone, volume). Return a short description of the vocal delivery only (do not transcribe content).',
    TO_FILE('@media_stage', 'tone_bright.m4a')
) AS bright_analysis;

Output for the bright/fast version:

Spoken with a very high pitch and fast pace. The tone is bright and slightly
artificial (sounds sped-up or processed). Volume is moderate.

Output for the dark/slow version:

Pitch is moderate and steady, and the pace is calm and easy to follow. The tone
is polite and composed, with consistent moderate volume.

The actual words are identical, but the model returned clearly different evaluations based on how the voice was delivered. This kind of signal is impossible to capture with AI_TRANSCRIBE alone. For use cases where the way something is said matters as much as what is said — call center quality scoring, interview evaluation, escalation detection — AI_COMPLETE is uniquely powerful.

Example 3: Structured Analysis of a Video File

Use a JSON Schema to get structured output that downstream pipelines can consume directly.

SELECT TO_JSON(AI_COMPLETE(
    'gemini-3.1-pro',
    'Analyze this video and return scene summary, visible objects, and the overall mood as JSON.',
    TO_FILE('@media_stage', 'pouring_coffee.mp4'),
    {},
    {
        'type': 'json',
        'schema': {
            'type': 'object',
            'properties': {
                'summary': {'type': 'string'},
                'objects': {'type': 'array', 'items': {'type': 'string'}},
                'mood': {'type': 'string'}
            },
            'required': ['summary', 'objects', 'mood']
        }
    }
)) AS analysis;

Actual output:

{
  "mood": "Calm, relaxed",
  "objects": ["Coffee pot", "Mug", "Coffee"],
  "summary": "Coffee being poured from a black coffee pot into a unique mug featuring a blue bird's face design."
}

The keys and types defined in the schema are honored, which makes it trivial to forward the output into the next stage of a data pipeline.

JSON Schema specification works exactly the same way for audio. You can ask the model to return all values in any language you like by spelling that out in the prompt — useful when you want to feed the output directly into a non-English dashboard or report.
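Because the schema's keys and types are honored, client-side validation before the next pipeline stage stays trivial. A minimal Python sketch (the sample payload mirrors the example output above; `parse_analysis` is a hypothetical helper, not a Snowflake API):

```python
import json

# Sample payload mirroring the structured output above; in a real pipeline
# this string comes straight from the AI_COMPLETE query result.
raw = '{"mood": "Calm, relaxed", "objects": ["Coffee pot", "Mug", "Coffee"], "summary": "Coffee being poured into a mug."}'

# The keys and types the JSON Schema in the query promised.
REQUIRED = {"summary": str, "objects": list, "mood": str}

def parse_analysis(payload: str) -> dict:
    """Parse the model's JSON response and verify the schema-promised fields."""
    data = json.loads(payload)
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return data

analysis = parse_analysis(raw)
print(analysis["objects"])  # ['Coffee pot', 'Mug', 'Coffee']
```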

Example 4: Conditional Analysis with Text Prompts

When you want only a specific aspect extracted, prompt design takes care of it. Including instructions like "if there are none, just return 'none'" prevents hallucinated results.

This article uses Pexels: Person Having an Interview (Pexels License). The video doesn't contain any product names, so the expected output is "none".

SELECT AI_COMPLETE(
    'gemini-3.1-pro',
    'Extract only the moments where product names (brand names) appear in this video, and list each name with its timing as bullet points. If there are none, just return "none".',
    TO_FILE('@media_stage', 'interview.mp4')
) AS extracted;

The model returned exactly none, faithfully following the prompt.

Supported Models and Formats

Supported Models

According to the official documentation, the model currently listed as supporting direct video/audio input is gemini-3.1-pro. Snowflake partners with multiple model providers, so the lineup of multimodal-capable models is expected to grow over time.

One thing I appreciate about AI_COMPLETE is that the model behind every call is explicit. With some text-processing managed services, the underlying model can change without notice. AI_COMPLETE always takes the model name as its first argument, so you stay in control of cost characteristics, performance trade-offs, and behavior.

Supported Formats

A broad set of mainstream formats is supported, so you can work with most video and audio files you'd find in the wild.

| Type  | Supported formats |
| ----- | ----------------- |
| Video | mp4, mpeg, mov, avi, flv, mpg, webm, wmv, 3gpp |
| Audio | wav, mp3, aiff, aac, ogg, flac, m4a, mp4, pcm, webm |

Per-Request Limits

| Item | Limit |
| ---- | ----- |
| Total size per request | 100 MB |
| Video files per request | Up to 10 |
| Audio files per request | Up to 10 |

For longer videos, splitting by scene or duration before sending lets you analyze plenty of content within current limits. Limits are also expected to be relaxed as the feature matures.
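A small helper can compute the segment boundaries before you cut and upload the pieces. This is a sketch under the assumption that the actual cutting happens outside Snowflake (e.g. with ffmpeg); `segment_bounds` is illustrative, not a Snowflake function:

```python
def segment_bounds(total_seconds: float, chunk_seconds: float = 300.0):
    """Return (start, end) second offsets for fixed-length segments.

    The last segment is shorter when the duration doesn't divide evenly.
    Feed these offsets to your cutting tool (e.g. ffmpeg -ss / -t) and
    upload each piece to the stage separately.
    """
    total_seconds = float(total_seconds)
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A 12.5-minute video split into 5-minute segments:
print(segment_bounds(750, 300))  # [(0.0, 300.0), (300.0, 600.0), (600.0, 750.0)]
```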

Regional Availability

Check the Cortex LLM Regional availability page for gemini-3.1-pro's native regions. If your region isn't natively supported, you can enable cross-region inference with a single SQL statement and use the same functionality.

ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';

There may be slightly higher latency when going via cross-region inference, but the functionality remains the same.

AI_TRANSCRIBE vs. AI_COMPLETE — Cost and Quality

"Should I use AI_TRANSCRIBE or AI_COMPLETE for audio?" is a natural question when trying out the new feature.

The fundamental thing to understand is that AI_TRANSCRIBE is a purpose-built, dedicated AI Function for converting audio to text. The output format is fixed (text + timestamps + speaker labels), giving you less flexibility — but in return, it's simple to use when your goal is clear and costs are predictable per second of audio. AI_COMPLETE is a general-purpose LLM, so prompts can express anything you want, and the model can incorporate vocal delivery and visual information into its interpretation. The two aren't competitors — they're complementary: AI_TRANSCRIBE for clear, fixed goals; AI_COMPLETE for free-form interpretation.

Functional Differences

| Aspect | AI_TRANSCRIBE | AI_COMPLETE (audio direct) |
| ------ | ------------- | -------------------------- |
| Primary purpose | Structured audio-to-text conversion | Free-form analysis (summary, classification, extraction) |
| Output | Text + timestamps + speaker labels | Whatever you ask for in the prompt (incl. JSON) |
| Speaker separation | Auto speaker labels (SPEAKER_00, etc.) | Achieved by prompt instruction |
| Timestamps | Word- or speaker-level | Achievable via prompt + structured output |
| Vocal tone / emotion | Text only | Includes pitch, pace, tone, volume, pauses |
| Multilingual | Broad coverage | LLM-native flexibility |

Cost Comparison

| Function | Pricing characteristic |
| -------- | ---------------------- |
| AI_TRANSCRIBE | 1 sec of audio = 50 tokens (flat); 180,000 tokens per hour |
| AI_COMPLETE (audio direct) | Standard LLM token pricing; varies by input size and output length |

AI_TRANSCRIBE bills linearly with audio duration, so it shines when you want "just the transcript" or stable, predictable costs in big batch jobs. AI_COMPLETE varies with prompt and output size, which can swing costs depending on the task. In return, you explicitly choose the model in the first argument, giving you direct control over the cost-vs-quality trade-off in your application.
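The AI_TRANSCRIBE side of that trade-off is easy to budget ahead of time because billing is flat per second of audio. A quick sketch of the arithmetic using the flat rate from the table above (the helper names are illustrative):

```python
def transcribe_tokens(audio_seconds: float) -> int:
    """AI_TRANSCRIBE's flat rate: 1 second of audio = 50 tokens."""
    return int(audio_seconds * 50)

def batch_transcribe_tokens(durations_seconds) -> int:
    """Total tokens for a batch of recordings; linear in duration,
    so the cost is predictable before the job runs."""
    return sum(transcribe_tokens(d) for d in durations_seconds)

# One hour of audio is always 180,000 tokens:
print(transcribe_tokens(3600))  # 180000

# A batch of three calls (5, 12, and 30 minutes):
print(batch_transcribe_tokens([300, 720, 1800]))  # 141000
```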

Quality Comparison

| Scenario | Recommendation |
| -------- | -------------- |
| Meeting minutes drafting (need accurate transcript) | AI_TRANSCRIBE → AI_COMPLETE chain |
| Call center sentiment analysis (with vocal tone) | AI_COMPLETE direct |
| Large-scale, fast batch transcription | AI_TRANSCRIBE |
| Video (visual + audio) summarization & moderation | AI_COMPLETE direct |
| JSON-structured extraction from audio | AI_COMPLETE direct |

Rule of thumb: "Need text to keep on file" → AI_TRANSCRIBE. "Need the model to interpret content" → AI_COMPLETE. The "AI_TRANSCRIBE → AI_COMPLETE" pipeline is also the most cost-effective and high-quality option for many use cases.

Business Use Cases

1. Call Center Quality Scoring

Hand call recordings directly to AI_COMPLETE and get a multi-dimensional quality score — agent professionalism, customer anger level, escalation signals — that takes vocal delivery into account.

  • Tone-aware sentiment analysis: Catches subtle nuances that transcription would miss
  • Structured output integration: Returns each metric as JSON for dashboarding
  • Agent training: Identify improvement areas by comparing against high-performer recordings

2. Structured Meeting Minutes from Video

Pass a long meeting recording and get summaries, decisions, and action items broken down by topic and participant.

  • Topic-aware structuring: Prompt-driven chapter segmentation
  • Action item extraction: Get owner and due date as JSON
  • Cross-modal context: Whiteboards and on-screen materials are also picked up

3. Compliance Checks for Promotional Videos

Run pre-publication checks for marketing videos to catch potential regulatory issues or brand-guideline violations.

  • Custom condition detection: Define check criteria via prompt
  • Cross-modal consistency: Spot inconsistencies between on-screen captions and spoken content
  • Multi-language QA: Verify localized versions in batch via SQL

4. Auto-Chaptering for Educational Content

Analyze online learning content with AI_COMPLETE to generate chapter titles, summaries, and timestamped indexes for viewers.

  • Topic-shift detection: AI identifies content boundaries
  • Audience-specific summaries: Generate multiple versions per skill level
  • Search-enable content: Embed extracted segments with AI_EMBED for cross-corpus retrieval

5. Social / UGC Moderation

UGC (User Generated Content) refers to videos, audio, and images posted by general users on social media or video platforms. You can implement harmful or policy-violating content detection as part of your SQL workflow.

  • Batch moderation: Auto-process newly arriving videos via Streams + Tasks
  • Reviewer notification: Send results to moderators via Snowflake Notifications
  • Multi-axis scoring: Score for violence, discriminatory speech, copyright concerns in parallel

Sample App: Video Analysis with Streamlit in Snowflake

To give you an instant feel for AI_COMPLETE's video support, here's a simple Streamlit in Snowflake app that lets you upload a video and analyze it on the spot.

What the App Does

  1. Video upload: Upload directly via browser; auto-saved to a stage
  2. Video preview: Play the uploaded video in place
  3. AI analysis: Run AI_COMPLETE with video input for summarization, extraction, classification
  4. Editable prompt: Tweak the analysis prompt freely from the sidebar

Prerequisites

  • A Snowflake account where Streamlit in Snowflake is available
  • gemini-3.1-pro available natively, or cross-region inference enabled
  • Video files smaller than 100 MB
  • Streamlit 1.26.0 or later for st.file_uploader support (this article pins 1.52.2, the latest version available in Streamlit in Snowflake)

Specifying the Streamlit Version (Important)

Streamlit in Snowflake's default version may be older than expected, so I recommend pinning the latest version explicitly. Just place an environment.yml next to your app files in the stage:

# environment.yml
name: app_environment
channels:
  - snowflake
dependencies:
  - streamlit=1.52.2

If you create the SiS app from Snowsight, you can change the Streamlit version to 1.52.2 from the "Packages" dropdown in the upper right of the editor.

How to Set It Up

  1. Create a new SiS app from the Snowsight left pane → "Streamlit" → "+ Streamlit App"
  2. Paste the sample code below; the stage is created automatically on first run
  3. Click "Run" in the upper right, upload a video, and click "Analyze with AI"

Sample Code

import streamlit as st
import io
import os
import re
import uuid
from datetime import datetime
from snowflake.snowpark.context import get_active_session

session = get_active_session()
STAGE_NAME = "VIDEO_ANALYSIS_STAGE"

db = session.get_current_database().strip('"')
schema = session.get_current_schema().strip('"')
STAGE_FQN = f"{db}.{schema}.{STAGE_NAME}"

st.set_page_config(layout="wide")
st.title("🎬 Video Analyzer")
st.caption("Analyze videos on the spot with AI_COMPLETE × gemini-3.1-pro")

st.sidebar.title("⚙️ Settings")
st.sidebar.info("Model for video analysis: gemini-3.1-pro")

prompt_template = st.sidebar.text_area(
    "📝 Analysis prompt",
    value=(
        "Summarize this video in about 300 words. "
        "Include the people involved, key scenes, and the emotion conveyed by the audio."
    ),
    height=180
)

@st.cache_resource
def setup_stage():
    try:
        session.sql(f"DESC STAGE {STAGE_FQN}").collect()
    except Exception:
        session.sql(f"""
            CREATE STAGE IF NOT EXISTS {STAGE_FQN}
            ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
            DIRECTORY = (ENABLE = TRUE)
        """).collect()

setup_stage()


def clean_ai_response(text):
    """Strip wrapping double quotes and unescape newlines from AI response."""
    if not isinstance(text, str):
        return text
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    text = text.replace("\\n", "\n").replace('\\"', '"')
    return text


uploaded_file = st.file_uploader(
    "Upload a video file (under 100MB)",
    type=["mp4", "mov", "webm", "avi", "mpeg", "mpg", "flv", "wmv"]
)

if uploaded_file is not None:
    st.subheader("▶️ Video Preview")
    st.video(uploaded_file)

    if st.button("🤖 Analyze with AI", use_container_width=True, type="primary"):
        try:
            ext = os.path.splitext(uploaded_file.name)[1].lower() or ".mp4"
            safe_ext = ext if re.fullmatch(r"\.[a-z0-9]{2,5}", ext) else ".mp4"
            video_filename = (
                f"video_{datetime.now().strftime('%Y%m%d_%H%M%S')}_"
                f"{uuid.uuid4().hex[:8]}{safe_ext}"
            )

            with st.spinner("Uploading video to stage..."):
                video_stream = io.BytesIO(uploaded_file.getvalue())
                session.file.put_stream(
                    video_stream,
                    f"@{STAGE_FQN}/{video_filename}",
                    auto_compress=False,
                    overwrite=True
                )

            with st.spinner("AI is analyzing the video..."):
                query = f"""
                    SELECT AI_COMPLETE(
                        'gemini-3.1-pro',
                        ?,
                        TO_FILE('@{STAGE_FQN}', ?)
                    ) AS analysis
                """
                result = session.sql(query, params=[prompt_template, video_filename]).collect()

            st.subheader("📊 Analysis Result")
            if result and len(result) > 0 and result[0]["ANALYSIS"]:
                st.markdown(clean_ai_response(result[0]["ANALYSIS"]))
            else:
                st.warning("Couldn't get an analysis result. Try a different prompt or video.")

        except Exception as e:
            st.error(f"Something went wrong: {str(e)}")
else:
    st.info("👆 Upload a video file to start analyzing.")

Implementation Notes

  • No extra packages: Works with default Streamlit in Snowflake packages
  • Auto-created stage: Created with Server Side Encryption + Directory Table on first run
  • Editable prompt: Update analysis prompt instantly from the sidebar
  • Video preview: Use st.video to play the uploaded file in-browser

App Screenshots

The demo video shown is Pixabay: Meadow, Grass, Spring Meadow, Wind (Pixabay Content License).

Cost

AI_COMPLETE billing follows token-based pricing, like other AI Functions. For direct video/audio input, factor in:

  • Input tokens: Vary with file length and content
  • Output tokens: Vary with the requested output length
  • Model rate: gemini-3.1-pro's credit rate applies

See the Snowflake Service Consumption Table for current pricing.

Cost optimization tip: When you only need transcription, use AI_TRANSCRIBE; when you need interpretation, use AI_COMPLETE. A two-stage pipeline that filters out clearly irrelevant videos with a cheap prompt before deeper analysis is also effective.
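That two-stage pattern can be expressed in a single statement. This is a sketch, not an official pattern; the gating prompt, stage name, and yes/no match are illustrative assumptions:

```sql
-- Stage 1: cheap yes/no gate with a short prompt (tiny output = few tokens).
-- Stage 2: run the expensive structured analysis only on videos that pass.
WITH gate AS (
    SELECT
        relative_path,
        AI_COMPLETE(
            'gemini-3.1-pro',
            'Does this video show a customer interaction? Answer yes or no only.',
            TO_FILE('@media_stage', relative_path)
        ) AS is_relevant
    FROM DIRECTORY(@media_stage)
    WHERE relative_path ILIKE '%.mp4'
)
SELECT
    relative_path,
    AI_COMPLETE(
        'gemini-3.1-pro',
        'Score this interaction for politeness, resolution, and escalation risk. Return JSON.',
        TO_FILE('@media_stage', relative_path)
    ) AS deep_analysis
FROM gate
WHERE is_relevant ILIKE '%yes%';
```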

Sample Materials Used in This Article

All assets used here are free and carry licenses that permit commercial use, so you can reproduce the examples yourself.

Conclusion

AI_COMPLETE's video and audio support is a meaningful step toward simpler multimodal analytics on Snowflake. With video joining images, documents, and audio, nearly every major unstructured data format can now be queried straight from SQL.

Key Benefits

  1. Pass media as-is: Skip transcription — the model picks up vocal tone and visual context
  2. Composes with other AI Functions: Chain with AI_TRANSCRIBE / AI_CLASSIFY / AI_AGG for flexible pipelines
  3. Governance preserved: Your media data never leaves Snowflake
  4. SQL-native: Drops cleanly into your existing data pipelines

Supported models and limits will keep expanding, opening up even more use cases over time. Whether it's call centers, meeting analysis, content moderation, education, or marketing, I hope you'll use AI_COMPLETE's video/audio support to unlock new value from media that previously sat untouched!

Promotion

Snowflake What's New Updates on X

I share Snowflake What's New updates on X. Follow for the latest insights:

English Version

Snowflake What's New Bot (English Version)

Japanese Version

Snowflake's What's New Bot (Japanese Version)

Change Log

(20260517) Initial post

Original Japanese Article
