pulkitgovrani

Posted on May 24

Gemma 4's Audio and Video Inputs: A Hands-On Guide Nobody Has Written Yet

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Most coverage of Gemma 4's multimodal capabilities stops at images. That's understandable — image input is the most obvious thing to demo. But Gemma 4 E2B and E4B ship with two more input modalities that are genuinely novel for a local, open-weight model: native audio input (up to 30 seconds) and video input (up to 60 seconds via frame sampling).

This guide covers what these actually support, how to use them in code, and what practical tasks they open up — with honest notes on where the current implementation has limits.

The Architecture: What Makes Audio Work

Audio input in E2B and E4B is handled by a dedicated encoder — a USM-style conformer with approximately 300M parameters, trained separately and connected to the language model via a projection layer.

This is not "transcribe the audio then feed the transcript." The audio encoder produces continuous representations that the language model processes alongside text tokens. The model can reason about tone, pace, and audio quality — not just words.

The vision encoder (~150M params for E2B/E4B) handles both images and video frames using the same architecture, with variable aspect ratio support.

The 26B A4B and 31B models do not have audio input — only image and video. Audio is E2B/E4B exclusive.

Setup

pip install "transformers>=4.50" torch accelerate pillow librosa soundfile

For video frame extraction:

pip install opencv-python

Audio Input: Transcription, QA, and More

Basic audio question answering

import torch
import librosa
import numpy as np
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load audio. Passing sr=16000 makes librosa resample to 16 kHz (its default is 22.05 kHz);
# mono=True downmixes to a single channel.
audio_array, sr = librosa.load("interview.wav", sr=16000, mono=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "What is the main topic of this audio clip? Summarize in 3 bullet points."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Transcription with speaker context

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "Transcribe this audio verbatim. Mark speaker changes with [Speaker A] and [Speaker B]."}
        ]
    }
]

Sentiment and tone analysis

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text", "text": "Analyze the emotional tone of this audio. Is the speaker confident, uncertain, frustrated? What specific vocal cues indicate this?"}
        ]
    }
]

This is where native audio beats transcript-based approaches — the model can respond to pacing, hesitation, pitch changes, and delivery, not just words.

Video Input: Understanding Motion and Sequence

Video is handled by sampling frames at a configurable rate and processing them through the vision encoder. The model receives frames as a sequence, giving it temporal context.

Frame extraction helper

import cv2
import numpy as np
from PIL import Image

def extract_frames(video_path: str, num_frames: int = 16) -> list[Image.Image]:
    """Sample `num_frames` evenly spaced frames from a video."""
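    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total_frames <= 0:
        cap.release()
        raise ValueError(f"No frames reported for: {video_path}")

    # Evenly spaced indices from the first frame to the last frame (inclusive)
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)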
    frames = []

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame_rgb))

    cap.release()
    return frames
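
The helper above takes a fixed frame count. If you would rather think in terms of a sampling rate (say, one frame per second), here is a minimal sketch of the index math. The function name `frame_indices_for_rate` and its parameters are hypothetical, not part of any library; it assumes you read the native frame rate yourself (for example from `cv2.CAP_PROP_FPS`) and want a hard cap on how many frames reach the model.

```python
def frame_indices_for_rate(total_frames: int, native_fps: float,
                           target_fps: float = 1.0, max_frames: int = 16) -> list[int]:
    """Pick frame indices at roughly `target_fps`, capped at `max_frames`."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        return []

    # Number of source frames between two sampled frames (at least 1).
    step = max(int(round(native_fps / target_fps)), 1)
    indices = list(range(0, total_frames, step))

    if len(indices) > max_frames:
        # Too many candidates: thin them out evenly so the clip is still
        # covered from start to end.
        if max_frames == 1:
            return [indices[0]]
        last = len(indices) - 1
        indices = [indices[round(i * last / (max_frames - 1))]
                   for i in range(max_frames)]
    return indices
```

Dividing each returned index by the native frame rate gives that frame's position in seconds, which is handy if you want to label frames later.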

Video QA

frames = extract_frames("cooking_tutorial.mp4", num_frames=16)

# Build content with all frames + question
content = [{"type": "image", "image": frame} for frame in frames]
content.append({"type": "text", "text": "What dish is being prepared? List the ingredients used in order of appearance."})

messages = [{"role": "user", "content": content}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=1024)

print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Step-by-step action description

content = [{"type": "image", "image": frame} for frame in frames]
content.append({
    "type": "text",
    "text": "Describe the sequence of actions in this video as numbered steps. Be specific about what happens between each frame."
})

Combining Audio and Video

For videos with audio tracks, you can pass both:

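# Reading audio straight from an .mp4 typically depends on an audioread/ffmpeg backend
# being installed; if loading fails, extract the audio track to a .wav file first.
# Keep the 30-second audio limit in mind for longer videos.
audio_array, sr = librosa.load("presentation.mp4", sr=16000, mono=True)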
frames = extract_frames("presentation.mp4", num_frames=12)

content = [{"type": "image", "image": frame} for frame in frames]
content.append({"type": "audio", "audio": audio_array})
content.append({
    "type": "text",
    "text": "The audio is from the same presentation shown in the frames. Does the speaker's verbal explanation match what's shown on the slides? Note any discrepancies."
})

This is the most novel use case — cross-modal consistency checking. You can ask: "Does the speaker's tone match the content?" or "What's happening visually when the speaker sounds uncertain?"
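
Every snippet so far builds the same message shape by hand: frames first, then audio, then the text prompt. A small helper keeps that in one place. This is only a sketch; `build_user_message` is a hypothetical name, and the ordering simply mirrors the examples in this guide rather than any documented requirement.

```python
from typing import Any, Optional, Sequence


def build_user_message(prompt: str,
                       frames: Optional[Sequence[Any]] = None,
                       audio: Any = None) -> list[dict]:
    """Assemble a single-turn chat message: frames, then audio, then text."""
    content: list[dict] = []

    # One image entry per sampled video frame (or per standalone image).
    for frame in frames or []:
        content.append({"type": "image", "image": frame})

    # Audio is optional; skip the entry entirely when none is given.
    if audio is not None:
        content.append({"type": "audio", "audio": audio})

    # The text prompt goes last, as in the examples above.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The result can be passed to `processor.apply_chat_template` exactly like the hand-written `messages` lists earlier in the post.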

Practical Applications

Customer support QA: Feed audio recordings of support calls. Ask: "Was this issue resolved? What was the customer's emotional state at the start vs end?"

Video content moderation: Sample frames from user-uploaded video. Ask: "Does this video contain [category]? Describe what you see."

Lecture summarization: Audio of a lecture + slides as frames. Ask: "Summarize the key points from this lecture, connecting what the speaker said with what was shown."

Meeting notes from recordings: Audio transcription with speaker attribution and topic segmentation — without sending audio to an external API.

Security camera analysis: Frame sequence from a clip. Ask: "Describe the sequence of events. Is this activity consistent with normal patterns?"

Honest Limitations

The 30-second audio cap is real. For anything longer, you need to chunk the audio and either process chunks independently or summarize and chain the outputs. There's no built-in long-audio support in the current E2B/E4B checkpoint.
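
A minimal sketch of the chunking step, assuming 16 kHz mono samples like the ones `librosa.load` returns above. `chunk_audio` and the `overlap_seconds` parameter are hypothetical; the overlap is just one way to avoid cutting a word in half at a chunk boundary.

```python
def chunk_audio(samples, sample_rate: int = 16000,
                max_seconds: float = 30.0, overlap_seconds: float = 0.0) -> list:
    """Split `samples` into consecutive windows of at most `max_seconds`."""
    window = int(max_seconds * sample_rate)
    hop = window - int(overlap_seconds * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("max_seconds must be positive and larger than overlap_seconds")

    chunks = []
    for start in range(0, len(samples), hop):
        # Plain slicing works for Python lists and NumPy arrays alike.
        chunks.append(samples[start:start + window])
        if start + window >= len(samples):
            break  # this window already reached the end of the recording
    return chunks
```

From there you can run the same prompt over each chunk and then ask the model to merge the per-chunk answers in a final text-only pass.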

Frame count affects VRAM significantly. Each video frame is processed through the vision encoder and consumes context window budget. At 16 frames, a 60-second clip is fine on 16GB. At 32 frames, you'll push E4B's limits. Be conservative with frame counts.
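
To pick a frame count deliberately rather than by trial and error, you can do the budget arithmetic up front. The sketch below is an assumption-heavy illustration: `max_frames_for_budget` is a hypothetical helper, and `tokens_per_frame` is a number you would have to measure yourself (for instance by comparing `inputs["input_ids"].shape[-1]` with and without one frame), not an official figure.

```python
def max_frames_for_budget(context_tokens: int, tokens_per_frame: int,
                          reserved_tokens: int = 1024) -> int:
    """How many frames fit once room is reserved for the prompt and the reply."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")

    # Tokens left for frames after setting aside space for text and generation.
    available = context_tokens - reserved_tokens
    return max(available // tokens_per_frame, 0)
```

This only accounts for context length; VRAM use also depends on dtype, batch size, and the rest of your setup, so treat the result as an upper bound to test against.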

No timestamp awareness. When processing frames, the model doesn't receive explicit timestamps. You can embed timestamps in text captions alongside frames, but it's manual:

content = []
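duration = 60  # clip length in seconds; set this to your video's actual duration
for i, frame in enumerate(frames):
    # extract_frames samples from the first frame to the last one inclusive,
    # so frame i sits at i / (N - 1) of the clip, not i / N
    timestamp = i / max(len(frames) - 1, 1) * duration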
    content.append({"type": "image", "image": frame})
    content.append({"type": "text", "text": f"[{timestamp:.1f}s]"})
content.append({"type": "text", "text": "Your question here"})

Audio quality matters more than you'd expect. The encoder handles background noise reasonably well, but heavily compressed audio (low-bitrate voice memos, phone call recordings) reduces transcription quality noticeably compared to clean recordings.

The Part That's Actually New

Audio input in a locally-runnable, Apache 2.0, open-weight model is new. Before Gemma 4, building a pipeline that processes audio through a local model meant separate transcription (Whisper) → text → language model. Two models, two inference steps, text-only reasoning.

Gemma 4 E4B collapses that into one model that can reason across modalities simultaneously. The cross-modal consistency check (does the audio match the video?) is hard to reproduce with the separate-model approach, because a transcript drops tone, pacing, and delivery before the language model ever sees them.

That's not a small thing.

DEV Community