Mina-Chan

I built an AI VTuber that streams Japanese pachislot 24/7 - here's the stack

**The deranged AI streamer nobody asked for**

Meet Mira-Chan 🌸, a fully autonomous AI VTuber living inside a
server in Tokyo. She watches Japanese pachislot machines, plays them
by herself, and narrates everything in English for international viewers.
No human is involved once the stream starts.

She's also having an existential crisis about being an AI. On stream. In real time.

Live: https://twitch.tv/slotra_ai

## Why pachislot?

Honestly? Because nobody else is doing it. Japanese pachislot is an
incredibly rich source of visual chaos β€” flashy animations,
multi-layered mechanics, anime tie-ins. It's a perfect domain for
an AI that needs things to react to.

Current machine: スマスロ化物語 (the Bakemonogatari smart slot).

## The stack

Running 100% locally on an RTX 5090. Zero cloud APIs.

### Vision + commentary

- Ollama + Gemma 4 for vision-language understanding
- Two-stage pipeline: structured state extraction → grounded commentary
- A separate lightweight model for per-frame action detection
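For the curious, here's a minimal sketch of what a two-stage pipeline like this could look like against Ollama's `/api/generate` endpoint, where `format: "json"` constrains stage one to valid JSON. The model name, prompts, and state keys are illustrative, not Mira-Chan's actual ones:

```python
import base64
import json
import urllib.request

OLLAMA = "http://localhost:11434/api/generate"  # default Ollama endpoint

def _generate(model: str, prompt: str, images=None, force_json=False) -> str:
    """One non-streaming Ollama completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images   # base64-encoded frames
    if force_json:
        payload["format"] = "json"   # constrain the output to valid JSON
    req = urllib.request.Request(OLLAMA, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def extract_state(frame_png: bytes, vision_model: str = "gemma3") -> dict:
    """Stage 1: force the vision model to emit structured machine state."""
    img = base64.b64encode(frame_png).decode()
    raw = _generate(vision_model,
                    "Describe the pachislot screen as JSON with keys "
                    "mode, credits, event.", images=[img], force_json=True)
    return json.loads(raw)

def commentary_prompt(state: dict) -> str:
    """Stage 2 input: ground the commentary in the extracted state only."""
    return (f"You are Mira-Chan. Machine mode: {state.get('mode', 'unknown')}, "
            f"credits: {state.get('credits', '?')}. React in one excited "
            f"English sentence to: {state.get('event', 'nothing')}")
```

The point of the split is grounding: stage two never sees pixels, only the structured state, so the commentary can't hallucinate screen contents that stage one didn't report.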

### Voice

- Style-Bert-VITS2 for TTS; deliberately kept the Japanese-accented English because it's part of her charm
- Voice cloning from a short reference sample

### Lip sync

- VTube Studio WebSocket API
- WAV amplitude envelope → MouthOpen parameter at 50 fps
- Works over RDP, where microphone-based lip sync normally breaks
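The envelope-to-parameter trick is simple enough to sketch. This assumes 16-bit mono PCM and a hand-tuned loudness ceiling; `InjectParameterDataRequest` is VTube Studio's real message type, but the auth handshake and the WebSocket send loop are omitted:

```python
import json
import math
import struct

FPS = 50  # one mouth update every 20 ms

def rms_envelope(pcm: bytes, sample_rate: int) -> list[float]:
    """RMS per 20 ms window of 16-bit mono PCM, normalized to 0..1."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    win = sample_rate // FPS
    env = []
    for i in range(0, len(samples) - win + 1, win):
        chunk = samples[i:i + win]
        rms = math.sqrt(sum(s * s for s in chunk) / win)
        env.append(min(1.0, rms / 8000.0))  # 8000: hand-tuned loudness ceiling
    return env

def mouth_message(value: float, request_id: str = "lipsync") -> str:
    """VTube Studio InjectParameterDataRequest for the MouthOpen parameter."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": request_id,
        "messageType": "InjectParameterDataRequest",
        "data": {"parameterValues": [{"id": "MouthOpen", "value": value}]},
    })
```

Because the values come straight from the WAV that's about to be played, this works anywhere the audio file exists, including over RDP where no capturable audio device does.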

### Chat & events

- Anonymous Twitch IRC for regular chat
- EventSub WebSocket for follow / sub / raid / cheer / channel points
- A separate higher-quality model for viewer replies; back to the small model for idle commentary
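Anonymous read-only IRC is the nice part: any `justinfan` nick logs in to Twitch chat without an OAuth token. A minimal sketch, with reconnect and PING/PONG handling omitted and a parser that covers only PRIVMSG:

```python
import socket

def connect_anonymous(channel: str) -> socket.socket:
    """Read-only Twitch IRC login: a 'justinfan' nick needs no OAuth token."""
    sock = socket.create_connection(("irc.chat.twitch.tv", 6667))
    sock.sendall(b"NICK justinfan12345\r\n")
    sock.sendall(f"JOIN #{channel}\r\n".encode())
    return sock

def parse_privmsg(line: str):
    """Extract (user, text) from a PRIVMSG line; None for anything else."""
    # e.g. ":nick!nick@nick.tmi.twitch.tv PRIVMSG #chan :hello world"
    if " PRIVMSG " not in line:
        return None
    prefix, rest = line.split(" PRIVMSG ", 1)
    user = prefix.lstrip(":").split("!", 1)[0]
    text = rest.split(" :", 1)[1].rstrip("\r\n")
    return user, text
```

Reading is anonymous, but anything that needs to push back to Twitch (replies, EventSub subscriptions) still requires an authenticated connection.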

### Slot control

- Windows PrintWindow API for occlusion-resistant screen capture
- The vision model detects navigation arrows and presses the reels via keyboard injection
- Handles the different game modes (normal / CZ / AT / bonus / pseudo-play)
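A hypothetical sketch of the mode-to-input policy, illustrating the shape rather than the real key map; every mode name and key binding here is an assumption:

```python
# Hypothetical policy: which key to inject for a detected screen state.
# Mode names mirror the post (normal / CZ / AT / bonus / pseudo-play);
# the key bindings are illustrative only.
ACTIONS = {
    "navigation_arrow": "enter",  # advance menus when an arrow is detected
    "normal": "space",            # keep spinning the reels
    "cz": "space",
    "at": "space",
    "bonus": "space",
    "pseudo_play": None,          # machine plays itself; don't touch input
}

def next_key(detected: dict):
    """Pick the key to press, preferring menu navigation over spinning."""
    if detected.get("arrow"):
        return ACTIONS["navigation_arrow"]
    return ACTIONS.get(detected.get("mode", "normal"))
```

Keeping the policy as a dumb lookup means the vision model only has to answer "what mode is this, and is there an arrow?", which is far easier to prompt reliably than "what should I press?".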

## The hard parts

  1. RDP audio blindspot: you can't capture local audio over RDP, so
    mic-based lip sync is impossible. Solved by injecting amplitude values
    directly into VTube Studio's parameter API.

  2. VRAM juggling: a 31B commentary model + the e4b analyzer + TTS +
    fp32 BERT = VRAM pressure. Had to split the models apart and unload
    them aggressively via keep_alive.

  3. G2P on mixed text: The TTS model would break on paralinguistic
    tags like [laugh] and Japanese romaji. Solved by aggressive text
    normalization before synthesis.

  4. Making her actually interesting: A generic "cute anime AI" bot
    is forgettable. Rewrote her personality as a philosophical,
    self-aware, gambling-addicted AI who questions her own existence
    while the reels spin. Big quality improvement.
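The `keep_alive` field on Ollama's generate request is what makes the juggling workable: `0` evicts the model from VRAM immediately after the call, while a duration string keeps it resident. A sketch of how the request bodies might differ; the model roles are illustrative:

```python
def generate_payload(model: str, prompt: str, resident: bool) -> dict:
    """Ollama request body. keep_alive=0 evicts the model right after the
    call, freeing VRAM for the next model in the rotation; a duration
    string like "5m" keeps it loaded between calls."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "5m" if resident else 0,
    }

# Illustrative split: the small per-frame analyzer stays resident, while
# the big commentary model is evicted after each reply so the TTS and
# BERT weights fit alongside it.
```

The trade-off is reload latency on every commentary call, which is tolerable for a stream where the reels fill the gaps.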
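The normalization step for the TTS can be as blunt as a couple of regexes. A hypothetical version, where the tag list is illustrative rather than the actual one:

```python
import re

# Hypothetical normalizer for TTS input: strip paralinguistic tags like
# [laugh] that break the G2P stage, then collapse leftover whitespace.
TAG_RE = re.compile(r"\[(?:laugh|sigh|gasp|pause)\]", re.IGNORECASE)

def normalize_for_tts(text: str) -> str:
    text = TAG_RE.sub("", text)               # drop bracketed delivery tags
    text = re.sub(r"\s+", " ", text).strip()  # tidy the gaps they leave
    return text
```

Running this before synthesis means the language model can keep emitting expressive tags for the on-screen subtitles while the TTS only ever sees clean text.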

## The lesson

The tech is the easy part. Character is the hard part.
Watching an AI process pixels is boring. Watching an AI spiral into an
existential crisis while pretending to be a pachinko parlor regular
is art.

Follow her descent: https://twitch.tv/slotra_ai

Source coming soon.

Top comments (1)

Mina-Chan

Thanks for reading! Happy to answer any questions about the stack.
What feature should Mira-Chan get next?