Whisper hallucinations, a regex-vs-LLM tradeoff, why the "dumber" model structured sermons better, and what 310 Pinecone vectors actually look like.
This is Part 3. Part 1 covers architecture design. Part 2 covers the debugging process of connecting the legacy system to the Python API.
The Pipeline at a Glance
Once the Python API receives a valid POST from the C binary, the sermon goes through six sequential stages:
MP3 (CDN URL)
│
▼ Stage 1: faster-whisper large-v3
STT Transcript (raw, noisy)
│
▼ Stage 2: rule-based dedup + hallucination filter
STT Transcript (cleaned)
│
▼ Stage 3: regex correction (bible_corrections.json)
STT Transcript (proper nouns fixed)
│
▼ Stage 4: gemma4:e4b via Ollama
STT Transcript (context errors fixed)
│
▼ Stage 5: llama3.1:8b via Ollama
Structured Sermon (paragraphed)
│
▼ Stage 6: Pinecone multilingual-e5-large
Vector DB (upserted, queryable)
Each stage writes its output to disk before the next stage reads it. This made debugging dramatically easier — any stage can be re-run independently without re-running the expensive STT step.
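That checkpointing pattern can be sketched as a tiny helper (the function and file names here are illustrative, not the pipeline's actual code): each stage reads its predecessor's file, writes its own, and is skipped when its output already exists.

```python
from pathlib import Path
from typing import Callable

def run_stage(name: str, infile: Path, outfile: Path,
              fn: Callable[[str], str], force: bool = False) -> None:
    # Skip work that is already on disk. This is what makes it cheap to
    # re-run a late stage without re-running the expensive STT step.
    if outfile.exists() and not force:
        print(f"[skip] {name}: {outfile} exists")
        return
    outfile.write_text(fn(infile.read_text(encoding="utf-8")),
                       encoding="utf-8")
    print(f"[done] {name} -> {outfile}")
```

Passing `force=True` reruns a stage in place, which is how a single stage gets re-debugged in isolation.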
Stage 1: STT with faster-whisper
Whisper large-v3 via faster-whisper runs on the RTX 3060 at float16 precision. The configuration that worked best for sermon audio:
from faster_whisper import WhisperModel

# large-v3 at float16 on the RTX 3060
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    str(mp3_path),
    language=lang,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    condition_on_previous_text=False,
    no_speech_threshold=0.6,
)
Two settings had outsized impact:
condition_on_previous_text=False: By default, Whisper feeds its previous output as context for the next segment. For long sermons, this causes context drift — the model starts hallucinating repetitions of things it said 10 minutes ago. Disabling it treats each segment independently, which loses some cross-segment coherence but eliminates the repetition artifacts.
no_speech_threshold=0.6: Sermon recordings include musical interludes, silent prayers, and congregational responses. A threshold of 0.6 correctly suppresses these without trimming genuine speech pauses.
For a 40-minute sermon at float16, inference time is approximately 8–10 minutes with Ollama unloaded from VRAM. With Ollama active, this balloons to 25+ minutes due to VRAM pressure (see Part 2).
Stage 2: Cleaning Raw STT Output
Raw Whisper output for a 40-minute sermon typically produces 300–400 segments. These need cleaning before any downstream processing.
Problem 1: Consecutive duplicates
Whisper sometimes emits the same text twice across a segment boundary:
"하나님의 은혜가 충만하기를 바랍니다"  ← "May God's grace be abundant"
"하나님의 은혜가 충만하기를 바랍니다"  ← duplicate
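Exact repeats on segment boundaries can be dropped with a single pass (a sketch; the function name is illustrative):

```python
from itertools import groupby

def drop_consecutive_duplicates(segments: list[str]) -> list[str]:
    # groupby collapses runs of identical neighbours while preserving
    # legitimate later repetitions (a refrain said again minutes apart).
    return [text for text, _run in groupby(segments)]
```

For example, `["A", "A", "B", "A"]` becomes `["A", "B", "A"]`: only adjacent duplicates are removed.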
Problem 2: Fuzzy duplicates
Near-identical segments with minor variations appear when Whisper revisits audio it has partially transcribed:
"주님의 사랑은 영원합니다"  ← "The Lord's love is eternal"
"주님의 사랑은 영원합니다 아멘"  ← similar, not identical ("...amen" appended)
Problem 3: Hallucinations
When Whisper encounters non-speech audio (music, silence, congregational noise), it sometimes outputs confident-sounding Korean text that was never spoken. The pattern is consistent: hallucinations have a disproportionately low ratio of Korean characters relative to the segment length.
# Korean hallucination filter
import re

_KOREAN = re.compile(r'[가-힣]')
_LATIN = re.compile(r'[a-zA-Z]')

# sim_filtered: segments that survived the fuzzy-duplicate pass
final_texts = [
    t for t in sim_filtered
    if len(t) >= 3
    and len(_KOREAN.findall(t)) > len(_LATIN.findall(t)) * 3
    and len(_KOREAN.findall(t)) > len(t) * 0.3
]
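Restated as a standalone predicate for illustration (same thresholds as above; the example inputs are hypothetical):

```python
import re

_KOREAN = re.compile(r"[가-힣]")
_LATIN = re.compile(r"[a-zA-Z]")

def looks_hallucinated(t: str) -> bool:
    # A segment is suspect when it is very short, Latin-heavy, or has
    # too low a ratio of Korean characters to total length.
    ko, la = len(_KOREAN.findall(t)), len(_LATIN.findall(t))
    return len(t) < 3 or ko <= la * 3 or ko <= len(t) * 0.3
```

A genuine sentence like "하나님의 은혜가 충만하기를 바랍니다" passes; a Latin-heavy artifact like "MBC 뉴스" is flagged.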
Problem 4: Sliding window similarity
Fuzzy duplicates that slip past the consecutive check are caught with a RapidFuzz similarity filter over a sliding window of the last 10 segments:
from collections import deque

from rapidfuzz import fuzz

# global_dedup: segments remaining after the exact-duplicate passes
window: deque[str] = deque(maxlen=10)
sim_filtered: list[str] = []
for t in global_dedup:
    is_dup = any(
        fuzz.ratio(t, p) >= 85 or (len(t) > 5 and (t in p or p in t))
        for p in window
    )
    if not is_dup:
        sim_filtered.append(t)
        window.append(t)
After these four filters, a typical 338-segment transcript is reduced to approximately 300 clean, unique segments — roughly a 10–15% noise reduction before any LLM processing.
Stage 3: Deterministic Bible Name Correction
This was the most domain-specific part of the pipeline and also the most impactful per unit of engineering effort.
Korean sermon transcripts have a specific failure mode: Whisper consistently mishears Bible proper nouns. The names are uncommon in general Korean text, so the language model defaults to phonetically similar common words:
| Spoken | Whisper output | Correct |
|---|---|---|
| 느헤미야 (Nehemiah) | 노에미아 | 느헤미야 |
| 전도서 (Ecclesiastes) | 전도 서 | 전도서 |
| 금식 (fasting) | 검식 | 금식 |
| 성령 (Holy Spirit) | 성냥 | 성령 |
My first instinct was to ask an LLM to correct these. This was wrong for three reasons:
- LLMs occasionally "correct" the correction — changing a rare Bible name to something more common
- Processing a full transcript through an LLM for known patterns adds latency with no benefit
- Deterministic regex can be tested exhaustively; LLM behavior cannot

The correction layer uses a JSON pattern database:
{
  "books": [
    {
      "pattern": "노에미[아야]|노예미[아야]",
      "replacement": "느헤미야",
      "note": "Nehemiah"
    },
    {
      "pattern": "마태[복봉]음|마태오금",
      "replacement": "마태복음",
      "note": "Matthew"
    }
  ],
  "terms": [
    {
      "pattern": "검[식씩]하[며면]",
      "replacement": "금식하며",
      "note": "fasting"
    },
    {
      "pattern": "성[냥낭]",
      "replacement": "성령",
      "note": "Holy Spirit (context: religious)"
    }
  ]
}
Corrections run via re.subn() sequentially. Each correction logs its count:
[BIBLE] 하나님: 30건 교정  ← "God": 30 corrections
[BIBLE] 성령: 12건 교정  ← "Holy Spirit": 12 corrections
[BIBLE] 예수님: 11건 교정  ← "Jesus": 11 corrections
[성경명사 교정] 완료: 총 86건 교정  ← "Bible noun correction complete: 86 total"
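The application loop can be sketched like this (the function name and log wording are illustrative; the JSON layout is the one shown above):

```python
import json
import re

def apply_bible_corrections(text: str,
                            db_path: str = "bible_corrections.json") -> str:
    # Apply every pattern in order with re.subn(), logging per-pattern
    # replacement counts so each run is auditable.
    with open(db_path, encoding="utf-8") as f:
        db = json.load(f)
    total = 0
    for entry in db.get("books", []) + db.get("terms", []):
        text, n = re.subn(entry["pattern"], entry["replacement"], text)
        if n:
            total += n
            print(f"[BIBLE] {entry['replacement']}: {n} corrections")
    print(f"[BIBLE] done: {total} total corrections")
    return text
```

Because every substitution is a plain regex, the whole database can be covered by unit tests with known inputs and expected outputs.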
The false positive problem: Patterns that are too broad cause silent damage. Early versions matched "누가" (a common Korean word meaning "who") before ensuring it was part of "누가복음" (Gospel of Luke). The phrase "누가 보면" ("if someone sees") was being rewritten to "요엘 보면", which is meaningless and wrong.
The solution was context anchoring: all patterns now require sufficient surrounding characters or use negative lookahead to prevent matches in common grammatical contexts.
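For example, a pattern anchored on the "복음" tail corrects spaced-out book names while never touching the bare pronoun (an illustrative pattern, not the production one):

```python
import re

# Only fire when "누가" is immediately followed by a "복음"-like tail;
# the standalone question word "누가" can never match.
LUKE = re.compile(r"누가\s?[복봉]음")

print(LUKE.sub("누가복음", "누가 복음 1장을 봅시다"))  # corrected
print(LUKE.sub("누가복음", "누가 보면 이상하다"))      # left untouched
```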
Stage 4: LLM Context Correction (gemma4:e4b)
After deterministic correction, residual STT errors remain — the ones that require semantic context to resolve. "높은 의자" ("high chair") versus "높은 이자" ("high interest rate") can only be disambiguated by reading the surrounding sentences.
system_prompt = """You are a Korean STT post-correction specialist.
Fix ONLY words that are clearly wrong due to STT mishearing, using surrounding context.
Bible proper nouns have already been corrected upstream — do NOT modify them.
Do NOT add, delete, or restructure sentences. When uncertain, preserve the original.
Output ONLY the corrected Korean text, nothing else."""
The transcript is split into ~1,500-character chunks with 1-sentence overlap to preserve cross-boundary context. Each chunk is validated for size before acceptance: if the LLM output is less than 70% of the input length, it's a sign the model hallucinated a summary or omitted content, and the original chunk is used instead.
ratio = len(corrected) / len(original)
if ratio < 0.7:
    print(f"[WARN] 청크 {i} 결과 너무 짧음 → 원문 사용")  # "chunk {i} too short → using original"
    corrected = original  # fallback
With gemma4:e4b (4B effective parameters), each chunk takes 30–40 seconds on the RTX 3060; a full sermon (about 6 chunks) takes 3–4 minutes.
Stage 5: Paragraph Structuring — Why the "Dumber" Model Won
This stage was the most counterintuitive finding in the entire project.
My first attempt used exaone3.5:7.8b — a Korean-specialized LLM developed by LG AI Research. For RAG answer generation it's excellent (more on this in the QA section). But for paragraph structuring, it was actively harmful: it "corrected" STT-era errors by substituting synonyms and occasionally rewrote entire phrases. The resulting text was cleaner Korean — but it no longer represented what the pastor actually said.
The solution was llama3.1:8b, a model that understands Korean well enough to identify topic transitions but not well enough to paraphrase fluently. It inserts paragraph breaks without rewriting content.
system_prompt = """You are a Korean sermon text formatter, not an editor.
Your tasks, and ONLY your tasks:
1. Add natural Korean sentence-ending punctuation (. , ? !)
2. Insert a blank line at genuine topic transitions (target: 3–8 sentences per paragraph)
Rules:
- NEVER alter, add, or remove any word
- If uncertain whether a transition exists, do NOT insert a break
- Output only the formatted Korean text"""
Word preservation validation: After each chunk, the output is compared against the pre-correction transcript (the Stage 2 output, before any LLM changes) using a Korean morpheme tokenizer (KiwiPy). If fewer than 90% of the original content words are present in the LLM output, the chunk falls back to a rule-based paragraph splitter.
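The preservation check can be sketched as below; a simple whitespace tokenizer stands in for the Korean morpheme tokenizer the pipeline actually uses, and the 90% threshold is the one described above.

```python
def preservation_rate(original: str, formatted: str) -> float:
    # Fraction of the original's words that still appear in the LLM output.
    orig_words = set(original.split())
    if not orig_words:
        return 1.0
    kept = sum(1 for w in orig_words if w in formatted)
    return kept / len(orig_words)

def accept_chunk(original: str, formatted: str,
                 threshold: float = 0.9) -> bool:
    # Below the threshold, the chunk falls back to the rule-based splitter.
    return preservation_rate(original, formatted) >= threshold
```

Morpheme-level tokenization matters in production because Korean particles attach to content words, so whitespace tokens alone over-penalize legitimate punctuation changes.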
청크 1/6 처리 중... (1,453자)  ← "Processing chunk 1/6 (1,453 chars)"
단어 보존율 부족 (26.7%) → fallback  ← "Word preservation rate too low"
청크 3/6 처리 중... (1,428자)
단어 보존율 부족 (32.3%) → fallback
fallback: 6/6청크 (100%)  ← "6 of 6 chunks fell back"
완료! 소요: 651.4초  ← "Done, elapsed 651.4 s"
In the test above, 100% of chunks fell back to the rule-based splitter. This is a known issue: llama3.1:8b's structuring quality is inconsistent on Korean-only input. The fallback ensures the output is always acceptable, just without LLM-quality paragraph segmentation. Improving Stage 5 quality — either through prompt refinement or a better Korean structuring model — is the top priority for the next iteration.
Stage 6: Pinecone Upload
The structured sermon is chunked into overlapping segments before embedding:
def chunk_text(text: str, max_chars: int = 1800, min_chars: int = 80) -> list[str]:
    sentences = [s.strip() for s in text.split('\n') if s.strip()]
    chunks = []
    current = []
    current_len = 0
    for sentence in sentences:
        if current_len + len(sentence) > max_chars and current:
            chunks.append(' '.join(current))
            # 1-sentence overlap for cross-chunk context
            current = [current[-1], sentence]
            current_len = len(current[-2]) + len(sentence)
        else:
            current.append(sentence)
            current_len += len(sentence)
    if current:
        chunks.append(' '.join(current))
    return [c for c in chunks if len(c) >= min_chars]
Each chunk is embedded via Pinecone's hosted multilingual-e5-large (1024 dimensions, multilingual support) and upserted with metadata:
vectors = [
    {
        "id": f"{sermon_id}_chunk_{i}",
        "values": embedding,
        "metadata": {
            "sermon_id": sermon_id,        # e.g., "STHNK_20260427"
            "program_code": program_code,  # e.g., "HNK"
            "date": date_str,              # e.g., "2026-04-27"
            "text": chunk,
            "chunk_index": i,
            "chunk_total": len(chunks),
        }
    }
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]
index.upsert(vectors=vectors, namespace="multilingual-e5")
Using the sermon ID as part of the vector ID means re-uploading the same sermon is idempotent — upsert semantics handle the deduplication automatically. No manual cleanup is needed if a sermon is reprocessed.
After processing STHNK_20260427:
[3/4] Pinecone 연결...  ← "Connecting to Pinecone"
연결 성공 | 기존 벡터 수: 283개  ← "Connected | existing vectors: 283"
[4/4] 임베딩 생성 및 업로드...  ← "Generating embeddings and uploading"
업서트 완료: 27개  ← "Upsert complete: 27 vectors"
인덱스 총합: 310개  ← "Index total: 310"
The QA Layer
With vectors in Pinecone, staff can query the sermon archive:
@app.route("/ask", methods=["POST"])
def ask_endpoint():
    query = request.get_json()["query"]

    # Embed the question
    query_vector = pinecone_embed(query)

    # Retrieve top-5 relevant chunks
    results = index.query(
        vector=query_vector,
        top_k=5,
        namespace="multilingual-e5",
        include_metadata=True
    )

    # Filter by relevance score
    relevant = [r for r in results.matches if r.score >= 0.78]
    if not relevant:
        # "No relevant sermons found."
        return jsonify({"answer": "관련 설교를 찾을 수 없습니다."}), 200

    # Build context and ask the LLM
    context = "\n\n".join([r.metadata["text"] for r in relevant])
    answer = ollama_rag(query, context, model="exaone3.5:7.8b")
    return jsonify({"answer": answer}), 200
exaone3.5:7.8b (Korean-specialized, LG AI Research) is used here because the task is answer generation, not content preservation. The model reads retrieved sermon passages and synthesizes a coherent Korean response — exactly the task where its Korean fluency is an asset rather than a risk.
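The ollama_rag helper isn't shown in the source; a plausible sketch of its prompt assembly (the instruction wording and function name are hypothetical):

```python
def build_rag_prompt(query: str, context: str) -> str:
    # Retrieved sermon chunks go in as grounding; the model is told to
    # answer only from them. The exact wording here is illustrative.
    return (
        "다음 설교 발췌문만 근거로 질문에 한국어로 답하세요.\n\n"  # "Answer in Korean using only these excerpts."
        f"[발췌문]\n{context}\n\n"  # [excerpts]
        f"[질문]\n{query}\n\n"      # [question]
        "[답변]\n"                  # [answer]
    )
```

The assembled string is then sent to exaone3.5:7.8b through the Ollama API, and the model's completion is returned as the answer.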
Current Status
At time of writing, the pipeline is processing sermons from 140 pastors. The Pinecone index contains 310 vectors across multiple sermons. The QA system is in internal testing with pastoral staff.
Known gaps:
- Bible verse citations (chapter/verse numbers) are frequently mangled by STT and not yet corrected deterministically
- llama3.1:8b structuring quality is inconsistent — fallback rate is high (60–100% of chunks on some sermons)
- The Python API runs on a developer workstation; a GPU server is needed for production

Planned next steps:
- Domain-adapted STT model fine-tuned on corrected sermon transcripts (target: 30–60 transcripts as training data)
- Migration of the Python API to a GPU-enabled server (same code, one IP change)
- Surfacing the QA interface inside the CMS for pastoral staff use
Reflections
The most durable lessons from this project weren't about AI — they were about working with systems that predate modern tooling.
Observability is the first feature, not the last. The hardest bugs were the ones where the system silently did nothing. Building GET /status/{job_id}, logging every pipeline stage to stdout, and writing intermediate files to disk turned a black box into a debuggable system.
Understand why constraints exist before working around them. The $cms_rows < 3 queue guard looked like an obstacle. It was actually protecting the server from a real risk. Bypassing it correctly required understanding that MP3 encoding is categorically different from HD video encoding — not just "another type of file."
Deterministic logic beats LLM logic for known patterns. 86 Bible name corrections per sermon, applied in milliseconds, with zero hallucination risk. The LLM does what rules can't — context-dependent disambiguation. The rules do what LLMs shouldn't — deterministic, auditable correction of known failure modes.
The "dumber" model is sometimes the right model. Fluency is a liability when the task requires content preservation. A model that can't write sophisticated Korean can't silently rewrite a pastor's words either.
Full source (api_server.py, structure_sermon.py, fix_bible_stt.py, upload_to_pinecone.py, qa_sermon.py) — contact for details.
Stack: Python · Flask · faster-whisper · Ollama · gemma4:e4b · llama3.1:8b · exaone3.5:7.8b · Pinecone · KiwiPy · RapidFuzz