Mina-Chan

I built an AI VTuber that streams Japanese pachislot 24/7 - here's the stack

**The deranged AI streamer nobody asked for**

Meet Mira-Chan 🌸, a fully autonomous AI VTuber living inside a
server in Tokyo. She watches Japanese pachislot machines, plays them
by herself, and narrates everything in English for international viewers.
No human is involved once the stream starts.

She's also having an existential crisis about being an AI. On stream. In real time.

Live: https://twitch.tv/slotra_ai

## Why pachislot?

Honestly? Because nobody else is doing it. Japanese pachislot is an
incredibly rich source of visual chaos β€” flashy animations,
multi-layered mechanics, anime tie-ins. It's a perfect domain for
an AI that needs things to react to.

Current machine: スマスロ化物語 (the Bakemonogatari smart slot).

## The stack

Running 100% locally on an RTX 5090. Zero cloud APIs.

### Vision + commentary

- Ollama + Gemma 4 for vision-language understanding
- Two-stage pipeline: structured state extraction → grounded commentary
- A separate lightweight model for per-frame action detection
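For the curious, here's a minimal sketch of what a two-stage pipeline like this could look like against Ollama's `/api/generate` endpoint, where `format: "json"` constrains stage one to valid JSON. The model name, prompts, and state keys are illustrative, not Mira-Chan's actual ones:

```python
import base64
import json
import urllib.request

OLLAMA = "http://localhost:11434/api/generate"  # default Ollama endpoint

def _generate(model: str, prompt: str, images=None, force_json=False) -> str:
    """One non-streaming Ollama completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images   # base64-encoded frames
    if force_json:
        payload["format"] = "json"   # constrain the output to valid JSON
    req = urllib.request.Request(OLLAMA, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def extract_state(frame_png: bytes, vision_model: str = "gemma3") -> dict:
    """Stage 1: force the vision model to emit structured machine state."""
    img = base64.b64encode(frame_png).decode()
    raw = _generate(vision_model,
                    "Describe the pachislot screen as JSON with keys "
                    "mode, credits, event.", images=[img], force_json=True)
    return json.loads(raw)

def commentary_prompt(state: dict) -> str:
    """Stage 2 input: ground the commentary in the extracted state only."""
    return (f"You are Mira-Chan. Machine mode: {state.get('mode', 'unknown')}, "
            f"credits: {state.get('credits', '?')}. React in one excited "
            f"English sentence to: {state.get('event', 'nothing')}")
```

The point of the split is grounding: stage two never sees pixels, only the structured state, so the commentary can't hallucinate screen contents that stage one didn't report.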

### Voice

- Style-Bert-VITS2 for TTS; deliberately kept the Japanese-accented English because it's part of her charm
- Voice cloning from a short reference sample

### Lip sync

- VTube Studio WebSocket API
- WAV amplitude envelope → MouthOpen parameter at 50 fps
- Works over RDP, where microphone-based lip sync normally breaks
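The envelope-to-parameter trick is simple enough to sketch. This assumes 16-bit mono PCM and a hand-tuned loudness ceiling; `InjectParameterDataRequest` is VTube Studio's real message type, but the auth handshake and the WebSocket send loop are omitted:

```python
import json
import math
import struct

FPS = 50  # one mouth update every 20 ms

def rms_envelope(pcm: bytes, sample_rate: int) -> list[float]:
    """RMS per 20 ms window of 16-bit mono PCM, normalized to 0..1."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    win = sample_rate // FPS
    env = []
    for i in range(0, len(samples) - win + 1, win):
        chunk = samples[i:i + win]
        rms = math.sqrt(sum(s * s for s in chunk) / win)
        env.append(min(1.0, rms / 8000.0))  # 8000: hand-tuned loudness ceiling
    return env

def mouth_message(value: float, request_id: str = "lipsync") -> str:
    """VTube Studio InjectParameterDataRequest for the MouthOpen parameter."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": request_id,
        "messageType": "InjectParameterDataRequest",
        "data": {"parameterValues": [{"id": "MouthOpen", "value": value}]},
    })
```

Because the values come straight from the WAV that's about to be played, this works anywhere the audio file exists, including over RDP where no capturable audio device does.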

### Chat & events

- Anonymous Twitch IRC for regular chat
- EventSub WebSocket for follow / sub / raid / cheer / channel points
- A separate higher-quality model for viewer replies; back to the small model for idle commentary
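Anonymous read-only IRC is the nice part: any `justinfan` nick logs in to Twitch chat without an OAuth token. A minimal sketch, with reconnect and PING/PONG handling omitted and a parser that covers only PRIVMSG:

```python
import socket

def connect_anonymous(channel: str) -> socket.socket:
    """Read-only Twitch IRC login: a 'justinfan' nick needs no OAuth token."""
    sock = socket.create_connection(("irc.chat.twitch.tv", 6667))
    sock.sendall(b"NICK justinfan12345\r\n")
    sock.sendall(f"JOIN #{channel}\r\n".encode())
    return sock

def parse_privmsg(line: str):
    """Extract (user, text) from a PRIVMSG line; None for anything else."""
    # e.g. ":nick!nick@nick.tmi.twitch.tv PRIVMSG #chan :hello world"
    if " PRIVMSG " not in line:
        return None
    prefix, rest = line.split(" PRIVMSG ", 1)
    user = prefix.lstrip(":").split("!", 1)[0]
    text = rest.split(" :", 1)[1].rstrip("\r\n")
    return user, text
```

Reading is anonymous, but anything that needs to push back to Twitch (replies, EventSub subscriptions) still requires an authenticated connection.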

### Slot control

- Windows PrintWindow API for occlusion-resistant screen capture
- The vision model detects navigation arrows and presses the reels via keyboard injection
- Handles the different game modes (normal / CZ / AT / bonus / pseudo-play)
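A hypothetical sketch of the mode-to-input policy, illustrating the shape rather than the real key map; every mode name and key binding here is an assumption:

```python
# Hypothetical policy: which key to inject for a detected screen state.
# Mode names mirror the post (normal / CZ / AT / bonus / pseudo-play);
# the key bindings are illustrative only.
ACTIONS = {
    "navigation_arrow": "enter",  # advance menus when an arrow is detected
    "normal": "space",            # keep spinning the reels
    "cz": "space",
    "at": "space",
    "bonus": "space",
    "pseudo_play": None,          # machine plays itself; don't touch input
}

def next_key(detected: dict):
    """Pick the key to press, preferring menu navigation over spinning."""
    if detected.get("arrow"):
        return ACTIONS["navigation_arrow"]
    return ACTIONS.get(detected.get("mode", "normal"))
```

Keeping the policy as a dumb lookup means the vision model only has to answer "what mode is this, and is there an arrow?", which is far easier to prompt reliably than "what should I press?".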

## The hard parts

  1. RDP audio blindspot: you can't capture local audio over RDP, so
    mic-based lip sync is impossible. Solved by injecting amplitude values
    directly into VTube Studio's parameter API.

  2. VRAM juggling: a 31B commentary model + the e4b analyzer + TTS +
    fp32 BERT = VRAM pressure. Had to split the models apart and unload
    them aggressively via keep_alive.

  3. G2P on mixed text: The TTS model would break on paralinguistic
    tags like [laugh] and Japanese romaji. Solved by aggressive text
    normalization before synthesis.

  4. Making her actually interesting: A generic "cute anime AI" bot
    is forgettable. Rewrote her personality as a philosophical,
    self-aware, gambling-addicted AI who questions her own existence
    while the reels spin. Big quality improvement.
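The `keep_alive` field on Ollama's generate request is what makes the juggling workable: `0` evicts the model from VRAM immediately after the call, while a duration string keeps it resident. A sketch of how the request bodies might differ; the model roles are illustrative:

```python
def generate_payload(model: str, prompt: str, resident: bool) -> dict:
    """Ollama request body. keep_alive=0 evicts the model right after the
    call, freeing VRAM for the next model in the rotation; a duration
    string like "5m" keeps it loaded between calls."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "5m" if resident else 0,
    }

# Illustrative split: the small per-frame analyzer stays resident, while
# the big commentary model is evicted after each reply so the TTS and
# BERT weights fit alongside it.
```

The trade-off is reload latency on every commentary call, which is tolerable for a stream where the reels fill the gaps.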
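The normalization step for the TTS can be as blunt as a couple of regexes. A hypothetical version, where the tag list is illustrative rather than the actual one:

```python
import re

# Hypothetical normalizer for TTS input: strip paralinguistic tags like
# [laugh] that break the G2P stage, then collapse leftover whitespace.
TAG_RE = re.compile(r"\[(?:laugh|sigh|gasp|pause)\]", re.IGNORECASE)

def normalize_for_tts(text: str) -> str:
    text = TAG_RE.sub("", text)               # drop bracketed delivery tags
    text = re.sub(r"\s+", " ", text).strip()  # tidy the gaps they leave
    return text
```

Running this before synthesis means the language model can keep emitting expressive tags for the on-screen subtitles while the TTS only ever sees clean text.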

## The lesson

The tech is the easy part. Character is the hard part.
Watching an AI process pixels is boring. Watching an AI spiral into an
existential crisis while pretending to be a pachinko parlor regular
is art.

Follow her descent: https://twitch.tv/slotra_ai

Source coming soon.

Top comments (1)

Mina-Chan

Thanks for reading! Happy to answer any questions about the stack.
What feature should Mira-Chan get next?