Maksims Gavrilovs

Posted on Jun 11

Zero to Autopilot, Part 6: A Thompson-Sampling Bandit That Picks the Next Video

#ai #machinelearning #python #reinforcementlearning

Series: Zero to Autopilot — Building a Self-Improving AI Media Channel. Part 6 of 7. Part 5 gave the channel a memory. This part gives it a decision — the explore/exploit engine that picks what to make next.

Data status (Part 6): real-now (mechanism). The bandit, its math, and the real bets it's choosing among are shown below. Which arms won (the quantitative payoff) lands in Part 7, once the data matures.

The dilemma, made concrete

Part 5 ends with the channel remembering that the "heretic mathematician" format won big. So… just make that forever? No — that's how a channel flatlines. But chasing novelty every time throws away everything you learned. This is the explore/exploit dilemma, and for a small channel it bites hard: you have maybe one video a day of budget, so every pick is expensive. Over-exploit and you plateau; over-explore and you never compound.

The honest first version of this in my code was a fixed 60/40 split — 60% of the time make something like a known winner, 40% try something new. It works, but it's dumb in two specific ways:

It over-explores weak arms — a 40% explore rate keeps spending on themes that have already proven mediocre.
It's context-blind — it treats "make a winner" as one bucket, ignoring which features of past videos actually drove the wins.
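For reference, a fixed split can be sketched in a few lines. This is an illustration rather than the channel's original code: `fixed_split_pick`, the two pools, and the arm names are hypothetical. It shows the first flaw directly: the weak arm keeps receiving roughly 40% of the picks no matter how often it has already lost.

```python
import random

def fixed_split_pick(winners, novel, rng, explore_rate=0.4):
    """Hypothetical fixed-split picker: explore with a constant probability."""
    # Fall back to whichever pool is non-empty.
    if not winners:
        return rng.choice(novel)
    if not novel:
        return rng.choice(winners)
    pool = novel if rng.random() < explore_rate else winners
    return rng.choice(pool)

rng = random.Random(0)
# One proven arm and one arm assumed to be mediocre (made-up names).
picks = [fixed_split_pick(["heretic"], ["weak-theme"], rng) for _ in range(10_000)]
explore_share = picks.count("weak-theme") / len(picks)
print(f"share spent on the weak arm: {explore_share:.2f}")
```

Nothing in this picker reads the outcomes, so the share never moves. That is the constant the bandit replaces.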

A contextual bandit fixes both. But first, a phase gate.

Phase 1: cold-start (you have no baseline yet)

You can't run a bandit with zero data — and worse, on a brand-new channel even your "good" videos get tiny numbers, so absolute scores lie. So the channel runs a cold-start phase first: the first 10 deployed videos are pure exploration, deliberately spread across themes, with no winner/loser judgment at all.

@property
def in_cold_start(self) -> bool:
    return self.deployed_count < self.bootstrap_target   # default 10

Relative scoring (the portfolio percentile from Part 5) only unlocks once there are enough videos to be a portfolio. Until then: explore, gather, don't pretend you know anything. After that, the bandit takes over.

There's a hidden footgun here: the seed set teaches the bandit what the universe looks like. If the first ten videos are all the same shape, or all weak scripts, the posterior doesn't learn "audience taste" — it learns your bad sampling strategy. Cold-start needs varied but hooky seed videos: different themes, different emotional promises, different formats, each still a real falsifiable bet. You are not feeding it random content. You are giving it enough distinct arms that "exploit the winner" will mean something later.
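One simple way to get that spread during cold-start is to always pick the planned bet whose theme has been deployed least. The sketch below assumes bets are plain (idea, theme) pairs and that a list of already-deployed themes is available. `cold_start_pick` and the sample themes are hypothetical, not the channel's actual selector.

```python
from collections import Counter

def cold_start_pick(planned, deployed_themes):
    """Pick the planned bet whose theme has been deployed least so far.

    `planned` is a list of (idea, theme) pairs and `deployed_themes` is the
    list of themes already published. Both shapes are assumptions.
    """
    seen = Counter(t.strip().lower() for t in deployed_themes)
    # min() returns the first minimal element, so ties keep backlog order
    # and the pick stays deterministic.
    return min(planned, key=lambda bet: seen[bet[1].strip().lower()])

planned = [("Cantor", "infinity"), ("Lion hunt", "humor"), ("Galois", "algebra")]
pick = cold_start_pick(planned, deployed_themes=["infinity", "infinity", "algebra"])
print(pick)  # ('Lion hunt', 'humor'): the only theme not deployed yet
```

This only enforces variety. Whether each seed video is hooky enough to be a real bet is still a judgment the selector cannot make.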

Phase 2: a warm-started contextual Thompson bandit

Once there's a baseline, picking the next bet becomes a Thompson-sampling problem. Three design decisions make it fit this domain:

1. Context = what's knowable before production. A bet's features are its theme and tags. Not its effects or animators — those only exist after rendering, so they're a learning/attribution concern, not a selection signal.

def _features(e: Entry) -> list[tuple[str, str]]:
    """Selection context known at planning time: theme + tags."""
    feats = []
    if e.theme:
        feats.append(("theme", e.theme.strip().lower()))
    # Normalize tags and skip blank ones so "Heretic-Format " and "heretic-format" share one arm
    feats += [("tag", t.strip().lower()) for t in e.tags if t.strip()]
    return feats

2. Per-feature Beta-Bernoulli posteriors, warm-started from the channel's base rate. Each feature (theme:infinity, tag:heretic-format, …) gets its own Beta(α, β) win-probability posterior. The key trick: a flat Beta(1,1) prior has mean 0.5, which flatters every brand-new arm whenever the channel's real win rate is lower than that, and so causes over-exploration. Instead I warm-start the prior from the channel's actual base win rate, with a weak pseudo-count (prior_strength, default 2.0, each parameter floored at 0.5) so real data dominates fast:

from collections import defaultdict

def posteriors(measured, prior_strength=2.0):
    evidence = list(_evidence(measured))                    # (entry, win) pairs, computed once
    base = _base_rate(evidence)                             # channel's actual win rate
    # Warm-start pseudo-counts, floored at 0.5 so neither parameter collapses toward 0
    pa = max(base * prior_strength, 0.5)
    pb = max((1 - base) * prior_strength, 0.5)
    stats = defaultdict(lambda: [pa, pb])                   # every feature starts here
    for e, win in evidence:
        for f in _features(e):
            stats[f][0 if win else 1] += 1.0                # +win → α, +loss → β
    return stats

A win on a feature pushes its α up; a loss pushes β up. Wins and losses are the relative outcomes from Part 5 — only measured, non-cold-start bets with a real percentile count as evidence.
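A quick worked example makes the warm start concrete. The numbers are made up for illustration: a channel base win rate of 0.25, and a feature that then records 3 wins and 1 loss. `warm_prior` repeats the formula from `posteriors()`; `beta_mean` is a small helper added for this sketch.

```python
def warm_prior(base, prior_strength=2.0):
    # Same formula as in posteriors(): split the pseudo-count by the base
    # rate and floor each parameter at 0.5.
    return max(base * prior_strength, 0.5), max((1 - base) * prior_strength, 0.5)

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

pa, pb = warm_prior(base=0.25)             # (0.5, 1.5)
unseen_warm = beta_mean(pa, pb)            # 0.25: a new arm starts at the base rate
unseen_flat = beta_mean(1.0, 1.0)          # 0.5: a flat prior starts every arm at a coin flip
after = beta_mean(pa + 3, pb + 1)          # 3.5 / 6, about 0.58
after_flat = beta_mean(1.0 + 3, 1.0 + 1)   # 4 / 6, about 0.67

print(pa, pb, unseen_warm, unseen_flat, round(after, 3), round(after_flat, 3))
```

With only two pseudo-counts in the prior, four real outcomes already move the estimate from 0.25 to about 0.58, which is the "real data dominates fast" behavior.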

3. Score a candidate by Thompson-sampling its features and averaging. For each planned bet, draw a sample from each of its features' posteriors and average them. Arms with little history have wide posteriors, so they sometimes draw high — that's exploration emerging naturally from the uncertainty, no explicit explore-rate knob needed:

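def score(e, stats, prior, rng):
    feats = _features(e)
    if not feats:
        return rng.betavariate(*prior)          # no context: draw from the bare prior
    # .get() falls back to the prior for unseen features without adding them to stats
    samples = [rng.betavariate(*stats.get(f, prior)) for f in feats]
    return sum(samples) / len(samples)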

def pick(planned, measured, ...):
    # highest Thompson draw wins; well-proven arms usually win,
    # but uncertain arms self-explore via their wide posteriors
    return rank(planned, measured, ...)[0]

A proven feature (tag:heretic-format with lots of wins) has a tight, high posterior and usually wins the draw — exploit. A fresh theme has a wide posterior and occasionally spikes — explore. The split is adaptive and per-feature, not a global 60/40.
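A small simulation shows the adaptive split in action. It is a sketch under made-up numbers: the prior (0.5, 1.5) assumes a 0.25 base rate, the proven tag is given 12 wins and 2 losses on top of that prior, and the fresh theme has no history at all. This `score` takes a feature list directly instead of an entry, to keep the example self-contained.

```python
import random

def score(feats, stats, prior, rng):
    # Mirrors the article's score(): one Beta draw per feature, then the mean.
    if not feats:
        return rng.betavariate(*prior)
    samples = [rng.betavariate(*stats.get(f, prior)) for f in feats]
    return sum(samples) / len(samples)

prior = (0.5, 1.5)                                  # assumed warm-start prior
stats = {("tag", "heretic-format"): (12.5, 3.5)}    # made-up: prior + 12 wins + 2 losses
proven = [("tag", "heretic-format")]
fresh = [("theme", "persian-poetry")]               # unseen, so it draws from the prior

rng = random.Random(42)
rounds = 10_000
fresh_wins = sum(
    score(fresh, stats, prior, rng) > score(proven, stats, prior, rng)
    for _ in range(rounds)
)
explore_share = fresh_wins / rounds
print(f"fresh arm wins the draw in {explore_share:.1%} of rounds")
```

The fresh arm wins only a minority of rounds, and that share shrinks further if its first results are losses. No explore-rate constant appears anywhere in the code.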

One practical detail: pick() is stochastic (that's the whole point), but the caller passes a state-seeded RNG, so the same journal state yields the same pick. That matters because the autonomous driver in Part 7 calls this from two places per cycle, and they must agree.
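One way to build such a state-seeded RNG is to hash a few strings that describe the journal and use the digest as the seed. The sketch below is an assumption about how this could be done, not the project's code. `rng_from_state` and the state strings are hypothetical.

```python
import hashlib
import random

def rng_from_state(state_parts):
    """Derive a deterministic RNG from strings describing the journal state."""
    # Use hashlib rather than the built-in hash(): str hashes are salted per
    # process, so they would differ between two runs of the driver.
    digest = hashlib.sha256("\x1f".join(state_parts).encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

state = ["deployed=14", "measured=11", "last=cantor-asylum"]   # made-up state
first = rng_from_state(state).betavariate(2.0, 3.0)
second = rng_from_state(state).betavariate(2.0, 3.0)
changed = rng_from_state(state + ["measured=12"]).betavariate(2.0, 3.0)

print(first == second)   # True: same state, same draw
print(first == changed)  # a different state gives an independent draw
```

Two call sites that each build their RNG from the same state get identical draws, and the pick only changes once new evidence changes the state.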

Where do new candidates come from? `ideate`

The bandit chooses among planned bets — but something has to generate them, or it'd just reshuffle the same backlog. That's ideate: an LLM proposes new bets from three inputs — the learned Strategy, the most relevant past episodes (via recall from Part 5), and live trend signals gathered by web search:

# ideate.generate(): build the prompt from learned state + recalled winners + trends
query = " ".join([*j.strategy.next_seeds, *j.strategy.winning_patterns, ...])
episodes = memory.recall_block(j, query, k=6)   # the relevant past, not the recent past
# → LLM returns new bets: {idea, hook, assumption, goal, theme, tags}

So exploration isn't random either — it's informed exploration: new bets that rhyme with what's working and with what's currently trending, each still a falsifiable hypothesis. (No LLM key? It falls back to deterministic seeds from the strategy.)
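Because the LLM's reply becomes the bandit's arms, it is worth checking each returned bet before it enters the backlog. The sketch below assumes the model replies with a JSON list of objects carrying the six fields named above. `parse_bets` is a hypothetical helper, and the reply format is an assumption.

```python
import json

REQUIRED = ("idea", "hook", "assumption", "goal", "theme", "tags")

def parse_bets(raw):
    """Keep only well-formed bets from an LLM's JSON reply (hypothetical helper)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []                      # caller can fall back to deterministic seeds
    if not isinstance(items, list):
        return []
    bets = []
    for item in items:
        if not isinstance(item, dict):
            continue
        if any(key not in item for key in REQUIRED):
            continue                   # a bet without an assumption is not falsifiable
        if not isinstance(item["tags"], list):
            continue
        bets.append({key: item[key] for key in REQUIRED})
    return bets

raw = json.dumps([
    {"idea": "Cantor", "hook": "h", "assumption": "a", "goal": "g",
     "theme": "infinity", "tags": ["heretic-format"]},
    {"idea": "missing fields"},
])
bets = parse_bets(raw)
print(len(bets))  # 1: the incomplete bet is dropped
```

An empty result is a natural trigger for the deterministic fallback, since it covers both a malformed reply and a reply with no usable bets.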

The loop, end to end

Put together, the decision engine is a closed cycle:

ideate ──► backlog of planned bets (each: idea + hook + assumption + tags)
   ▲                │
   │                ▼
 learn        bandit.pick()  ── exploit proven theme+tags, explore uncertain ones
   ▲                │
   │                ▼
 measure ◄──── produce + publish  (the cheap pipeline from Parts 2–4)

One guard rail sits inside that produce step. The bandit picks what to make, but a topic isn't a script — so before any money is spent, the chosen bet's scenario passes through a content critic that can send it back for a rewrite if the writing is hollow. The bandit chooses the bet; the critic guards the execution. That gate is its own Part 7 story (it exists because the autopilot, unsupervised, shipped an uninformative video); here it's enough to know the loop won't spend on a good pick with a bad script.

And it's running on real bets right now. The journal's winning pattern — the heretic-mathematician format, tag:heretic-format — means the bandit favors arms carrying that feature, which is why the backlog filled with Cantor (infinity → asylum), Galois (algebra → fatal duel), Russell (one sentence breaks math), Gödel (math can't prove itself). Each is the same proven feature (heretic + tragedy + paradox) on a new theme (set theory, group theory, logic) — textbook exploit-the-feature-while-exploring-the-instance. The bandit didn't invent the format; the memory learned it and the bandit is pressing it, while leaving room for the occasional wildcard to keep finding new winners.

And those wildcards are real, not hypothetical. Alongside the math-mystery core, the loop has spent explore-picks on genuinely different lanes: deadpan academic humor ("how mathematicians catch a lion"), science-horror (a 100%-fatal-virus explainer), and a run of atmospheric Persian poetry. Each carries a theme+tags combination the posteriors had never seen, so their wide priors occasionally win the Thompson draw and buy a probe into fresh territory. Which of those probes hardened into new winning arms is the quantitative reveal I'm saving for Part 7 — the point here is that the exploration is informed and deliberate, emerging from each arm's uncertainty, not a blind 40% dice roll.

What I'd tell another AI engineer

Takeaway: A fixed explore/exploit split is a code smell — it's a constant where you want a posterior. Make exploration emerge from uncertainty: per-feature Beta-Bernoulli posteriors, Thompson-sampled, and the wide posteriors of under-tried arms self-explore for free. Two domain details earned their keep: warm-start the prior from your own base rate (a flat optimistic prior over-explores), and only use features knowable at decision time as context (everything else is post-hoc attribution). Seed the RNG from state so an autonomous caller is reproducible. The result is a picker with one honest knob (prior_strength) instead of a magic split.

Next — Part 7: Autopilot. Every piece now exists — cheap production, memory, scoring, a bandit, ideation. The finale wires them into a scheduler that runs the whole loop unattended (handling the 48–72h measurement wait), and — finally — reveals the real numbers: what the channel did, what the autonomous loop decided, and what actually worked.

▶ Live effects gallery: dasein108.github.io/slope-studio
⭐ Star the repo: github.com/dasein108/slope-studio
🔔 Subscribe to watch the experiment grow from zero: the Lobachevsky Short
