Building a Media Tool with an LLM: A Step-by-Step Guide

#engineering #oxlo #ai

We are building a media transcription and analysis CLI that turns raw audio into structured show notes. It is useful for podcast producers, journalists, and media archivists who need to extract summaries, key quotes, and topics from long recordings without manual review.

What you'll need

Python 3.10 or newer
An Oxlo.ai API key from https://portal.oxlo.ai
The OpenAI SDK: pip install openai
A sample audio file (MP3 or WAV) to test with

Step 1: Configure the Oxlo.ai client

First, I point the OpenAI SDK at Oxlo.ai by overriding the client's base URL. I use Llama 3.3 70B for the analysis step because it handles long context reliably.

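import os

from openai import OpenAI

# Point the OpenAI SDK at Oxlo.ai's endpoint instead of the default one.
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY"),  # read the key from the environment, not from source
)

# Model identifiers used in the later steps.
CHAT_MODEL = "llama-3.3-70b"
TRANSCRIPTION_MODEL = "whisper-large-v3"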
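
Because os.getenv returns None when the variable is missing, a misconfigured shell only surfaces as an error once the client is built or the first request goes out. A minimal sketch of a fail-fast check follows. The helper name require_api_key and its error message are my own illustration, not part of the Oxlo.ai or OpenAI SDKs.

```python
import os


def require_api_key(env=None, name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the tool")
    return key
```

With this in place, the client setup above could pass api_key=require_api_key() instead of calling os.getenv directly.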

Step 2: Transcribe the audio file

I send the audio to Oxlo.ai's Whisper endpoint. The response contains the full transcript text, which I analyze in Step 4.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY")
)

TRANSCRIPTION_MODEL = "whisper-large-v3"

def transcribe_audio(file_path: str) -> str:
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model=TRANSCRIPTION_MODEL,
            file=audio_file,
        )
    return response.text

if __name__ == "__main__":
    transcript = transcribe_audio("episode_42.mp3")
    print(f"Transcript length: {len(transcript)} characters")
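
Uploading a missing, empty, or unsupported file wastes a request, so a local check before the upload can help. The sketch below is one way to do it, under assumptions: the accepted extensions mirror the MP3 and WAV formats listed in the requirements, and max_bytes is a hypothetical parameter, since this guide does not state an upload size limit for the endpoint.

```python
from pathlib import Path

# Formats named in this guide's requirements; extend if the endpoint accepts more.
SUPPORTED_SUFFIXES = {".mp3", ".wav"}


def check_audio_file(file_path, max_bytes=None):
    """Validate an audio file locally and return it as a Path.

    Hypothetical helper: `max_bytes` is optional because the real upload
    limit (if any) is not documented in this guide.
    """
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such audio file: {path}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported audio format: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size == 0:
        raise ValueError(f"Audio file is empty: {path}")
    if max_bytes is not None and size > max_bytes:
        raise ValueError(f"Audio file is {size} bytes, above the {max_bytes} byte limit")
    return path
```

Calling check_audio_file(file_path) at the top of transcribe_audio would turn a confusing API error into a local one.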

Step 3: Define the analysis system prompt

I want consistent, structured output, so I write a detailed system prompt that tells the model exactly what fields to return.

SYSTEM_PROMPT = """You are a media research assistant. Analyze the provided transcript and produce structured show notes.

Follow these rules:
- Write a 3-sentence summary of the main topic.
- Extract exactly 3 key quotes with speaker names.
- List up to 5 topics as single-word tags.
- Identify any factual claims that need verification.
- Output strictly in JSON format.

Required JSON schema:
{
  "summary": string,
  "key_quotes": [{"speaker": string, "quote": string}],
  "topics": [string],
  "claims_to_verify": [string]
}"""
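
The schema in the prompt is only a request to the model, so it can be worth checking the parsed result against it. The sketch below is a hand-written validator for the four fields above. The function name validate_show_notes is my own, and a schema library could replace it.

```python
def validate_show_notes(notes):
    """Return a list of problems; an empty list means the notes match the schema."""
    if not isinstance(notes, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if not isinstance(notes.get("summary"), str):
        problems.append("summary must be a string")

    quotes = notes.get("key_quotes")
    if not isinstance(quotes, list):
        problems.append("key_quotes must be a list")
    else:
        for i, item in enumerate(quotes):
            ok = (
                isinstance(item, dict)
                and isinstance(item.get("speaker"), str)
                and isinstance(item.get("quote"), str)
            )
            if not ok:
                problems.append(f"key_quotes[{i}] needs string 'speaker' and 'quote'")

    # Both remaining fields are plain lists of strings.
    for field in ("topics", "claims_to_verify"):
        value = notes.get(field)
        if not (isinstance(value, list) and all(isinstance(x, str) for x in value)):
            problems.append(f"{field} must be a list of strings")

    return problems
```

An empty return value means the notes are safe to write to disk; anything else is a reason to retry or log the raw response.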

Step 4: Generate structured show notes with JSON mode

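I call the chat endpoint with JSON mode enabled so the model is constrained to return a JSON object. In OpenAI-compatible APIs, JSON mode does not enforce the schema itself, and a response cut off at the token limit can still fail to parse, so the schema in the system prompt still matters. Oxlo.ai's flat per-request pricing means a long transcript does not inflate the cost based on token count.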

from openai import OpenAI
import json
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY")
)

CHAT_MODEL = "llama-3.3-70b"

SYSTEM_PROMPT = """You are a media research assistant. Analyze the provided transcript and produce structured show notes.

Follow these rules:
- Write a 3-sentence summary of the main topic.
- Extract exactly 3 key quotes with speaker names.
- List up to 5 topics as single-word tags.
- Identify any factual claims that need verification.
- Output strictly in JSON format.

Required JSON schema:
{
  "summary": string,
  "key_quotes": [{"speaker": string, "quote": string}],
  "topics": [string],
  "claims_to_verify": [string]
}"""

def analyze_transcript(transcript: str) -> dict:
    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    sample = "Host: Welcome to the show. Dr. Chen: Thank you for having me."
    print(json.dumps(analyze_transcript(sample), indent=2))
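
Calling json.loads directly on the message content raises a bare JSONDecodeError if the output was truncated or empty. A minimal sketch of a more defensive parse step follows. It assumes the endpoint reports finish_reason the way the OpenAI chat completions API does ("length" when the token limit was hit), and the helper name parse_model_json is hypothetical.

```python
import json


def parse_model_json(content, finish_reason="stop"):
    """Parse the model's message content into a dict with clear error messages.

    Assumes an OpenAI-style finish_reason, where "length" means the
    response hit the token limit before it was complete.
    """
    if finish_reason == "length":
        raise ValueError("response was cut off at the token limit; JSON is likely incomplete")
    if not content:
        raise ValueError("model returned an empty message")
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

Inside analyze_transcript, this could be called as parse_model_json(choice.message.content, choice.finish_reason) with choice = response.choices[0].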

Step 5: Build the CLI wrapper

I tie both stages together into a single script, media_tool.py, that accepts a file path and writes the results to a JSON file.

from openai import OpenAI
import argparse
import json
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY")
)

CHAT_MODEL = "llama-3.3-70b"
TRANSCRIPTION_MODEL = "whisper-large-v3"

SYSTEM_PROMPT = """You are a media research assistant. Analyze the provided transcript and produce structured show notes.

Follow these rules:
- Write a 3-sentence summary of the main topic.
- Extract exactly 3 key quotes with speaker names.
- List up to 5 topics as single-word tags.
- Identify any factual claims that need verification.
- Output strictly in JSON format.

Required JSON schema:
{
  "summary": string,
  "key_quotes": [{"speaker": string, "quote": string}],
  "topics": [string],
  "claims_to_verify": [string]
}"""

def transcribe_audio(file_path: str) -> str:
    """Upload the audio file and return the transcript text."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model=TRANSCRIPTION_MODEL,
            file=audio_file,
        )
    return response.text

def analyze_transcript(transcript: str) -> dict:
    """Send the transcript to the chat model and parse the JSON show notes."""
    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        # JSON mode: ask the endpoint to return a single JSON object.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def main():
    parser = argparse.ArgumentParser(description="Generate show notes from audio")
    parser.add_argument("audio_file", help="Path to the audio file")
    parser.add_argument("-o", "--output", default="show_notes.json", help="Output JSON path")
    args = parser.parse_args()

    print("Transcribing audio...")
    transcript = transcribe_audio(args.audio_file)

    print("Analyzing transcript...")
    notes = analyze_transcript(transcript)

    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(notes, f, indent=2, ensure_ascii=False)

    print(f"Done. Wrote results to {args.output}")

if __name__ == "__main__":
    main()
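
With a fixed default of show_notes.json, processing a second recording overwrites the first result. One possible variation, sketched below, derives the output path from the audio file name when -o is not given. The <name>.show_notes.json convention and the helpers build_parser and resolve_output are my own assumptions, not part of the script above.

```python
import argparse
from pathlib import Path


def build_parser():
    """Same arguments as the CLI above, but --output defaults to None."""
    parser = argparse.ArgumentParser(description="Generate show notes from audio")
    parser.add_argument("audio_file", help="Path to the audio file")
    parser.add_argument(
        "-o", "--output", default=None,
        help="Output JSON path (default: <audio name>.show_notes.json next to the audio)",
    )
    return parser


def resolve_output(args):
    """Return the explicit --output path, or one derived from the audio file name."""
    if args.output:
        return Path(args.output)
    audio = Path(args.audio_file)
    return audio.with_name(audio.stem + ".show_notes.json")
```

For example, running the tool on episodes/interview.mp3 without -o would write episodes/interview.show_notes.json.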

Run it

I run the tool against a 30-minute interview MP3. The transcription stage takes a few seconds, and the analysis stage returns structured JSON without a warm-up delay because Oxlo.ai serves Llama 3.3 70B with no cold starts.

$ python media_tool.py interview.mp3
Transcribing audio...
Analyzing transcript...
Done. Wrote results to show_notes.json

$ cat show_notes.json
{
  "summary": "Dr. Jane Chen discusses the challenges of battery recycling for electric vehicles, covering current chemical processes and emerging closed-loop technologies.",
  "key_quotes": [
    {"speaker": "Dr. Jane Chen", "quote": "We recover over 95 percent of lithium and cobalt in our pilot plant."},
    {"speaker": "Dr. Jane Chen", "quote": "The real bottleneck is not chemistry, it is logistics."},
    {"speaker": "Host", "quote": "So consumers should keep those old phones in a drawer?"}
  ],
  "topics": ["recycling", "batteries", "chemistry", "logistics", "EVs"],
  "claims_to_verify": [
    "95 percent recovery rate in pilot plant",
    "Logistics are the primary bottleneck in battery recycling"
  ]
}
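
The JSON is convenient for machines, but producers usually want something they can paste into an episode page. As a rough sketch, the function below turns the parsed notes into plain text. The layout and the name render_show_notes are my own choices, and the function assumes the notes already match the schema from Step 3.

```python
def render_show_notes(notes):
    """Format a show-notes dict (Step 3 schema) as plain text."""
    lines = ["Summary", notes["summary"], "", "Key quotes"]
    for item in notes["key_quotes"]:
        lines.append(f'- {item["speaker"]}: "{item["quote"]}"')
    lines += ["", "Topics", ", ".join(notes["topics"]), "", "Claims to verify"]
    lines += [f"- {claim}" for claim in notes["claims_to_verify"]]
    return "\n".join(lines)
```

Loading show_notes.json with json.load and printing render_show_notes(notes) gives a readable version of the same content.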

Wrap-up

Next steps: wire the tool into a watched folder so new uploads are processed automatically, or swap Llama 3.3 70B for Kimi K2.6 if you need vision support for video thumbnails. For pricing details on high-volume transcription workloads, see https://oxlo.ai/pricing.
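
For the watched-folder idea, a simple polling approach may be enough before reaching for a file-system event library. The sketch below lists recordings that do not yet have notes. It assumes, hypothetically, that notes are written next to each recording as <name>.show_notes.json; the function names are illustrative.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".wav"}  # formats used in this guide


def notes_path_for(audio_path):
    """Hypothetical convention: notes live next to the audio file."""
    audio_path = Path(audio_path)
    return audio_path.with_name(audio_path.stem + ".show_notes.json")


def find_unprocessed(folder):
    """Return audio files in `folder` that have no notes file yet, sorted by name."""
    pending = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if not notes_path_for(path).exists():
            pending.append(path)
    return pending
```

A loop that calls find_unprocessed every few seconds with time.sleep, runs the transcribe and analyze steps on each result, and writes the output to notes_path_for(path) would process new uploads without reprocessing old ones.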