Automatically Generate Social Media Clips from Live Streams with the Strands Agents SDK

#aws #amazonivs #livestreaming #strandsagents

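Generative AI excels at analyzing large amounts of data, identifying the meaningful parts of it based on your instructions, and summarizing them. The source of the data can be anything: your email inbox, a large text file on your local disk, or a web page on the internet. Even when it cannot complete a large, data-heavy task perfectly, it can give you an efficient head start. One topic developers often ask me about is how AI can be used to analyze video content. Multimodal LLMs have come a long way, but they are not yet at the point where they can ingest an hour-long live stream recording and summarize it or suggest clips to share on social media. To solve this problem, we need to take the slightly different approach described in this post.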

Solution Overview

To build a tool that analyzes live stream recordings (VODs), I took the following approach:

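Download the HLS media locally and convert it to an MP4
Create a transcript of the MP4 file
Use an AI agent to identify start/end timestamps for candidate clips based on the transcript
Create MP4 "clips" based on the identified timestamps
Generate a summary of the work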

Solution Architecture

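To implement this solution, I decided to try out the new Strands Agents SDK. The Strands Agents SDK is a relatively new framework for building agents, and it makes it easy to define tools that an agent can use while generating a response to a prompt. This solution requires FFmpeg and Whisper to be installed on the machine where the agent runs. I am working locally, but the solution could easily be converted to a server-based one using FastAPI or another web framework and deployed to the cloud as a Docker or Podman container.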
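
Because the agent depends on FFmpeg and Whisper being present, it can help to check for them before running anything. The following is a minimal sketch, not part of the original agent: the function name missing_prerequisites and the "binary:"/"module:" labels are made up for illustration, and it assumes the ffmpeg executable is expected on PATH and that the Python packages are importable as whisper and ffmpeg.

```python
import importlib.util
import shutil


def missing_prerequisites(binaries=("ffmpeg",), modules=("whisper", "ffmpeg")):
    """Return labels for required binaries and Python modules that are not available."""
    missing = []
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(f"binary:{name}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            missing.append(f"module:{name}")
    return missing
```

Calling missing_prerequisites() with no arguments returns an empty list when everything is installed, so a script could exit early with a helpful message whenever the list is non-empty.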

Defining the Tools

The official documentation gives an overview of Strands Agents, so I will skip that here. Instead, let's see how it works in practice through a demo.

First, create a new Python virtual environment.

python -m venv .venv
source .venv/bin/activate

Next, install the required dependencies. Add the following to requirements.txt:

ffmpeg-python
git+https://github.com/openai/whisper.git
strands-agents
strands-agents-tools 
click

Install the dependencies:

pip install -r requirements.txt

Next, create an agent.py file and add the imports.

import os
import ffmpeg
import tempfile
import uuid
import shutil
import whisper
import click
from whisper.utils import get_writer
from strands import Agent, tool
from strands_tools import (
    file_read,
    file_write,
    use_aws,
)

Create a temporary working directory and define a few variables for the temporary files.

temp_dir = tempfile.mkdtemp()
mp4_file = f"{uuid.uuid4()}.mp4"
mp4_path = os.path.join(temp_dir, mp4_file)
transcript_file = f"{uuid.uuid4()}.vtt"
transcript_path = os.path.join(temp_dir, transcript_file)

Now we can define the tools the agent needs to complete its task. A tool is defined with the @tool decorator.
The first tool uses FFmpeg to download the m3u8 file (the HLS playlist) at a given URL and save it as an MP4 file.

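@tool
def convert_m3u8_to_mp4(path_to_m3u8: str):
    """Download the HLS playlist at path_to_m3u8 with FFmpeg and save it as an mp4"""
    # `verbose` is a module-level flag that is set before the agent runs
    stream = ffmpeg.input(path_to_m3u8)
    stream = ffmpeg.output(
        stream,
        mp4_path,
        # copy the audio and video streams as-is instead of re-encoding them
        acodec="copy",
        vcodec="copy",
        # crf is an encoder setting, so it has no effect while streams are copied
        crf=50,
        # convert the ADTS AAC audio from the HLS segments for the MP4 container
        **{"bsf:a": "aac_adtstoasc"},
        loglevel=("verbose" if verbose else "quiet"),
    )
    ffmpeg.run(stream, overwrite_output=True)
    print(f"Saved stream to {mp4_path}")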

The next tool uses Whisper to transcribe the MP4 file.

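@tool
def transcribe_mp4():
    """Transcribe the mp4 with Whisper and save the transcript as a VTT file"""
    print(f"Transcribing {mp4_path} to {transcript_file}")
    model = whisper.load_model("base")
    result = model.transcribe(mp4_path, verbose=verbose)
    # write the timestamped transcript to temp_dir in WebVTT format
    vtt_writer = get_writer("vtt", temp_dir)
    vtt_writer(result, transcript_file)
    print("Transcribe complete")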
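
The agent reads the VTT transcript directly, but it can be useful to load the cues yourself, for example to sanity-check the timestamps it suggests. The following is a minimal sketch under the assumption that the file consists of blank-line-separated cues whose timing line looks like "01:26.580 --> 02:12.220", with an optional hours field. The helper names and the sample cue timings in the usage note are made up for illustration.

```python
import re

# Matches cue timing lines such as "01:26.580 --> 02:12.220" (hours are optional)
CUE_TIMING = re.compile(
    r"^((?:\d+:)?\d{2}:\d{2}\.\d{3}) --> ((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(timestamp: str) -> float:
    """Convert "MM:SS.mmm" or "HH:MM:SS.mmm" to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_vtt(text: str):
    """Return a list of (start_seconds, end_seconds, cue_text) tuples."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if match:
                # everything after the timing line is the spoken text
                cue_text = " ".join(lines[i + 1:]).strip()
                start, end = (to_seconds(g) for g in match.groups())
                cues.append((start, end, cue_text))
                break
    return cues
```

For instance, parse_vtt("WEBVTT\n\n01:26.580 --> 01:30.000\nNOW! We're live.\n") yields a single cue that starts at roughly 86.58 seconds. Blocks without a timing line, such as the WEBVTT header, are skipped.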

The final tool creates a new clip from the original MP4 file for a given time range, based on the start/end timestamps the agent identifies.

@tool
def create_mp4_clip(
    input_file: str,
    start_time: str,
    end_time: str,
    output_file: str,
):
    """Clip an mp4 using FFmpeg"""
    # seek to start_time and stop reading at end_time on the input
    stream = ffmpeg.input(input_file, ss=start_time, to=end_time)
    stream = ffmpeg.output(
        stream,
        output_file,
        # copy the streams without re-encoding
        c="copy",
        loglevel=("verbose" if verbose else "quiet"),
    )
    ffmpeg.run(stream, overwrite_output=True)
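
The imports include shutil, and the prompt later in this post refers to a tool that can clean up the temp directory, but that tool is not shown here. A minimal sketch of what such a helper might look like follows. The name cleanup_temp_dir and its boolean return value are assumptions, not the author's code. In agent.py it would presumably be decorated with @tool and added to the agent's tool list.

```python
import os
import shutil
import tempfile


def cleanup_temp_dir(path: str) -> bool:
    """Delete a temporary working directory and everything in it.

    Returns True if the directory was removed, False if it did not exist.
    """
    if not os.path.isdir(path):
        return False
    shutil.rmtree(path)
    return True


# Example usage with a throwaway directory
scratch = tempfile.mkdtemp()
cleanup_temp_dir(scratch)
```

Returning a boolean rather than raising keeps the helper safe to call twice, which is convenient when an agent decides on its own when to invoke it.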

Prompting the Agent

With the tools defined, let's create an instance of the agent. We pass it the custom tools we just created along with some built-in tools (file_read, file_write, use_aws, and so on) that it can use as needed.

agent = Agent(
    tools=[
        create_mp4_clip,
        convert_m3u8_to_mp4,
        transcribe_mp4,
        file_read,
        file_write,
        use_aws,
    ],
)

Next, ask the user whether to run in verbose mode (which shows the standard output from FFmpeg and Whisper) and prompt for the m3u8 (HLS playlist) URL.

verbose = click.confirm("Enable verbose output?", default=False)
m3u8_url = click.prompt("Enter M3U8 URL", type=str)
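
The prompt above accepts any string, so a small check before handing the value to the agent can catch typos early. This is a sketch only: the function name looks_like_m3u8_url is made up, and it assumes the playlist is served over http(s) with a path ending in .m3u8, which may not hold for every HLS endpoint.

```python
from urllib.parse import urlparse


def looks_like_m3u8_url(value: str) -> bool:
    """Return True for http(s) URLs whose path ends in .m3u8."""
    parsed = urlparse(value.strip())
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".m3u8")
    )
```

Because only the path is inspected, a playlist URL that carries a query string (such as a playback token) still passes the check.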

Then pass the prompt to the agent.

agent(
    f"""
      your purpose is to generate small clips from a larger live stream 
      so that they can be shared on social media.
      your task involves the following steps:

      1. download the HLS segments and convert them to an MP4 file
      2. transcribe the downloaded MP4 using whisper
      3. analyze the transcript to identify potential clips
      4. create the clips
      5. provide a summary of your work

      here are some details:

      Step 1: using the tool `convert_m3u8_to_mp4`:
      given the HLS manifest file located at {m3u8_url},
      convert the live stream to an MP4 and save it to {mp4_path}.

      Step 2: using the tool `transcribe_mp4`:
      transcribe the MP4 and save the transcript to {transcript_path}.
      this will produce a timestamped transcript from the live stream. 

      Step 3: take a look at the transcript and identify a list of suggested 
      timestamps (start and end) for clips that might be good to share 
      on social media. provide a suggested title, the start and end 
      timestamp, the transcript in that section, a summary/description 
      and several tags.

      Step 4: using the tool `create_mp4_clip`:
      after you have identified them, create an mp4 for each based on 
      the timestamp identified. use the original mp4 that was created 
      in the first step that is located at {mp4_path}.
      each output file should use a kebab case mp4 filename
      generated from your suggested title. 
      save the new clips in {temp_dir}.

      do not cleanup the temp directory at this time, even though 
      you have a tool that can perform this step. 

      Step 5: provide a summary of your work, 
      in the summary, output the following information for each clip:
        - Title
        - Start Time
        - End Time 
        - Transcript
        - Summary
        - Tags
        - File Location of Clip
    """
)
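
In this post the agent itself comes up with the kebab-case file names requested in Step 4. If you would rather generate them deterministically, a helper along these lines could be used instead. It is a sketch, the name kebab_case_filename is made up, and it assumes titles contain enough ASCII letters or digits to produce a non-empty name.

```python
import re


def kebab_case_filename(title: str, extension: str = ".mp4") -> str:
    """Turn a clip title into a lowercase, hyphen-separated file name."""
    # replace every run of non-alphanumeric characters with a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}{extension}"
```

For example, kebab_case_filename("Starting a Creative Coding Stream with Rust") returns "starting-a-creative-coding-stream-with-rust.mp4".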

Running the Application

Running the application with python agent.py prompts you to enable verbose mode and enter a URL.

Enable verbose output? [y/N]: 
Enter M3U8 URL:

For this test, I ran the agent against a recent live stream from my friend Darko Mesaroš.

I'll help you generate small clips from the larger live stream following the steps you've outlined. Let's begin:

## Step 1: Download the HLS segments and convert them to an MP4 file

First, I'll convert the HLS manifest file to an MP4.
Tool #1: convert_m3u8_to_mp4
Saved stream to /[redacted]/0a464c96-e071-4300-83d7-a74eeee82e28.mp4

Just as expected. The MP4 file was created successfully, exactly as prompted.

## Step 2: Transcribe the MP4 file

Now, I'll transcribe the downloaded MP4 file:
Tool #3: transcribe_mp4
Transcribing /[redacted]/0a464c96-e071-4300-83d7-a74eeee82e28.mp4 to 55c5f5f6-85ed-46d4-80f9-d455f6c51fcb.vtt
/[redacted]/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detected language: English
Transcribe complete

And the transcript is done too! Next, it should analyze the transcript and suggest candidate clips.

## Step 3: Analyze the transcript to identify potential clips

Based on the transcript, I'll identify several potential clips that could be good to share on social media. These will be moments that are engaging, informative, or entertaining.

### Clip 1: Stream Introduction and Technical Difficulties

**Title:** "Starting a Creative Coding Stream with Rust"

**Start Time:** 01:26.580
**End Time:** 02:12.220

**Transcript:**
"NOW! We're live. Yes. Okay. Now there's audio. There's audio. Okay. I was like, ah, um, so last moment I had to cut off my, um, what is it called? My, uh, my audio. Not only my whole Thunderbolt docked here because it just froze. So lost all my audio and everything else as I went along. So how do you, how do we, Jay? How are you doing my friend? It has been way too long. So I hope you're doing good. Um, welcome to the show. Welcome. Welcome. It has been actually, um, actually even a good old while for myself in the terms of live streaming. I think I have a stream for like 20 days. So incredibly happy to be back. So yeah, let's do some live streaming."

**Summary:** The streamer begins their first live stream in 20 days, explaining how they had technical difficulties with their audio setup when trying to start the stream. Despite the challenges, they're enthusiastic about being back and connecting with viewers.

**Tags:** #RustDevelopment #LiveCoding #StreamingTech #CreativeCoding #TechnicalDifficulties

### Clip 2: Introducing GGEZ - A Rust Game Engine

**Title:** "What is GGEZ? A Rust Game Engine for Creative Coding"

**Start Time:** 02:30.860
**End Time:** 03:29.820

**Transcript:**
"Um, okay. So the plan for today, my friends, is to do some stuff with, well, as always, Rust. So I'm going to do some creative coding today. Something with, um, um, Rust and a, and a framework I was using over the last few days called, GGE. It is a, I think they call themselves like a rust game engine thing. Like not terribly serious. It's not a full blown game engine. Um, it's a, it's a thing and it kind of does graphics. It is very similar to like P5JS, which is a very much, uh, an artistic type of coding framework, which I used in the past stream a long time ago. And, and of course, P5JS comes from processing. If anybody knows what processing is, um, so processing is an old framework. I think it uses its own bespoke language to do like graphics and like stuff. So, um, that's the whole plan today to do a little bit of that. Um, and have some fun."

**Summary:** The streamer introduces GGEZ, a Rust game engine framework they'll be using for creative coding. They explain how it's similar to P5JS and Processing, highlighting that it's not a full game engine but a tool for making graphics and creative coding projects.

**Tags:** #RustProgramming #GGEZ #GameDevelopment #CreativeCoding #GraphicsProgramming

### Clip 3: Drawing Fire - Creative Coding Goal

**Title:** "Let's Draw Fire with Rust: My Creative Coding Challenge"

**Start Time:** 06:03.980
**End Time:** 07:09.740

**Transcript:**
"So what, why would I like to do is I would love to do some creative coding, but in Rust. So, so that's the plan for today. Do a little bit of that. Um, there's no certain point to today. Like, in a sense, like we're not building up to, we're going to share. We're going to try to draw some pixels on the screen. We're not going to go too crazy and be like S.D.L type of approach. And I'm like, just super complicated. We're going to just do, um, squares and circles and whatnot. So, um, I want to try to draw flames. This is one of my, one of my tests when I draw this, these kinds of, when I work with these, a kind of artistic tools is like, can I draw fire, right? A C doughnut. Is, is a C doughnut, so. I know a C, oh, oh, wow. This would be a, I mean, probably at one point, I'm not as good as, as being able to do something like this. I know this whole thing code that actually is shaped like doughnut and it's just math. I'm not as good as math in this, but, um, yeah, yeah, yeah. We're going to try to draw some flames just simple as that. And I've done that in the past in Lua, but I want to try to do a Darwin rust. And I want to use this GGZ to set it up."

**Summary:** The streamer explains their goal for the creative coding session: to draw flames using Rust and GGEZ. They mention this as their go-to test when trying out new creative coding frameworks, comparing it to the famous "donut" code visualization but clarifying they'll stick to something simpler that they've previously implemented in Lua.

**Tags:** #RustCreativeCoding #DigitalArt #FlameSimulation #CreativeTech #VisualProgramming

### Clip 4: Setting Up GGEZ Fundamentals

**Title:** "How GGEZ Works: The Core Architecture of a Rust Game Engine"

**Start Time:** 12:46.140
**End Time:** 13:21.340

**Transcript:**
"So the way this works is that, um, your application has, um, basically a state, a main state where it kind of keeps stuff between frames. And, um, it has specific events on that state. That means as your application runs, there is a draw event, and there's an update event. And within those two events, you can do certain things. Now, of course, in the draw event, you draw on the screen, and the update event, you update whatever you're doing. And those are the two, two main important things we're going to be creating. And GGZ requires you to do it in a specific way."

**Summary:** The streamer explains the fundamental architecture of GGEZ, detailing how the game loop works with a main state that maintains data between frames, and the two key events - draw and update - that form the backbone of any GGEZ application.

**Tags:** #GameDevelopment #RustProgramming #GameLoop #GGEZ #GraphicsProgramming

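The output is lengthy, but it is exactly what we were hoping for. The agent analyzed the 20-minute live stream it was given and suggested four candidate clips that are "engaging, informative, or entertaining". Judging by these summaries, it did a very good job of finding short clips worth extracting from a long stream for sharing on social media.
The next step is to generate the actual MP4 clip files.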

Now I'll create MP4 clips for each of the identified segments:
Tool #6: create_mp4_clip
Tool #7: create_mp4_clip
Tool #8: create_mp4_clip
Tool #9: create_mp4_clip

Finally, the agent provides a summary of all of its work.

## Step 5: Summary of Work

### Clip 1: Starting a Creative Coding Stream with Rust
- **Title:** Starting a Creative Coding Stream with Rust
- **Start Time:** 00:01:26.580
- **End Time:** 00:02:12.220
- **Transcript:** "NOW! We're live. Yes. Okay. Now there's audio. There's audio. Okay. I was like, ah, um, so last moment I had to cut off my, um, what is it called? My, uh, my audio. Not only my whole Thunderbolt docked here because it just froze. So lost all my audio and everything else as I went along. So how do you, how do we, Jay? How are you doing my friend? It has been way too long. So I hope you're doing good. Um, welcome to the show. Welcome. Welcome. It has been actually, um, actually even a good old while for myself in the terms of live streaming. I think I have a stream for like 20 days. So incredibly happy to be back. So yeah, let's do some live streaming."
- **Summary:** The streamer begins their first live stream in 20 days, explaining how they had technical difficulties with their audio setup when trying to start the stream. Despite the challenges, they're enthusiastic about being back and connecting with viewers.
- **Tags:** #RustDevelopment #LiveCoding #StreamingTech #CreativeCoding #TechnicalDifficulties
- **File Location:** /[redacted]/starting-a-creative-coding-stream-with-rust.mp4

[trimmed for brevity]

Here is an example of one of the clips the agent generated from the longer live stream: