We built StoryBirdie — a tool that takes a screenplay file and turns it into a complete storyboard with shot lists, camera angles, and AI-generated frames. The entire backend is Go.
This post covers why we chose Go, how we structured the pipeline, and the architecture decisions that worked out (and the ones that didn't).
The Problem
Directors write screenplays. Then they need storyboards — visual frame-by-frame plans showing what each camera shot looks like. Traditionally this means either drawing (most directors can't) or hiring a storyboard artist ($500-2,000 per scene).
We wanted to automate the pipeline: screenplay file in, production-ready storyboard PDF out.
Why Go
We considered Python (FastAPI) and Go (Chi). Go won for three reasons:
Single binary deployment. Our backend compiles to one binary. No virtualenvs, no dependency resolution at deploy time, no "works on my machine." We deploy to Azure Container Apps — push the image, done.
Concurrency for batch operations. Parts of our pipeline run in parallel (generating multiple storyboard frames simultaneously). Go's goroutines made this trivial. In Python we would've needed asyncio or multiprocessing with significantly more complexity.
Compile-time type safety with sqlc. This was the unexpected killer feature. sqlc reads our SQL queries and generates type-safe Go code at compile time. When we change a database column, the build breaks immediately rather than failing at runtime in production. For a startup moving fast and changing schemas weekly, this catches bugs before they ship.
// sqlc generates this from our SQL — change the schema
// and the build fails until you update the query
type Shot struct {
	ID             uuid.UUID `json:"id"`
	ScreenplayID   uuid.UUID `json:"screenplay_id"`
	ShotNumber     int32     `json:"shot_number"`
	ShotSize       string    `json:"shot_size"`
	CameraMovement string    `json:"camera_movement"`
}
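For a sense of what sqlc consumes (the query and struct names here are hypothetical, not our actual schema), the `-- name:` annotation is what tells sqlc which Go method to generate and what it returns:

```sql
-- name: GetShotsByScreenplay :many
-- sqlc generates roughly:
--   GetShotsByScreenplay(ctx context.Context, screenplayID uuid.UUID) ([]Shot, error)
SELECT id, screenplay_id, shot_number, shot_size, camera_movement
FROM shots
WHERE screenplay_id = $1
ORDER BY shot_number;
```

Rename a column in the migration without touching this query and the generated code no longer compiles — that is the whole feedback loop.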
The Pipeline (High Level)
Screenplay Upload → Parsing → Analysis → Shot List → Frame Generation → Export
Each stage is its own package in the codebase. The stages communicate through interfaces, not concrete types, which means we can swap implementations without touching the rest of the system.
type ScreenplayService interface {
	Upload(ctx context.Context, projectID uuid.UUID, file io.Reader, filename string) (*Screenplay, error)
	Analyze(ctx context.Context, id uuid.UUID) (*Analysis, error)
	GenerateShots(ctx context.Context, id uuid.UUID) (*ShotGeneration, error)
}
The interface-based design paid off immediately when we needed different AI providers for different stages: analysis uses one model, shot generation uses another, image generation uses a third. Each provider sits behind an interface.
Parsing
Screenplays come in PDF, DOCX, and Fountain formats. Fountain is the cleanest — it's a plain-text markup format designed for screenplays, so parsing it is straightforward. PDFs were the hardest because different screenwriting apps export with different formatting, margins, and headers. We ended up using an LLM for PDF extraction rather than regex, because regex broke on about 30% of real-world files.
Analysis
This is where the real value is. Before generating any shots, we analyze the screenplay for characters, props, locations, blocking, and potential issues. I can't go into too much detail here since the analysis engine is core IP, but the key insight is that understanding the relationships between scenes (not just individual scenes) is what separates useful analysis from surface-level parsing.
Shot List Generation
The AI generates a structured shot list based on filmmaking conventions. Dialogue scenes get standard coverage patterns, action gets wider framing, emotional moments get tighter close-ups. Each shot carries metadata — camera angle, movement, characters, blocking, dialogue mapping.
The hard part wasn't generating shots — it was generating shots that a real director would find useful. Early versions were technically correct but creatively boring. Getting the AI to produce varied, contextually appropriate coverage took significant iteration on the prompting approach.
Frame Generation
Each shot becomes a visual frame through image generation. We batch these (3 concurrent) to balance speed and cost. The prompts are built from the shot metadata so the generated image matches the intended composition.
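The batching itself is plain Go: a buffered channel as a semaphore caps in-flight work. A minimal sketch (the `generateFrame` function is a stand-in for the real image-generation call, and a production version would also propagate errors and rate limits):

```go
package main

import (
	"fmt"
	"sync"
)

// generateFrame is a stand-in for the real image-generation API call.
func generateFrame(shot int) string {
	return fmt.Sprintf("frame-%d", shot)
}

// renderAll generates frames with at most `limit` running concurrently,
// using a buffered channel as a counting semaphore.
func renderAll(shots []int, limit int) []string {
	frames := make([]string, len(shots))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, s := range shots {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` workers are in flight
		go func(i, s int) {
			defer wg.Done()
			defer func() { <-sem }()
			frames[i] = generateFrame(s)
		}(i, s)
	}
	wg.Wait()
	return frames
}

func main() {
	fmt.Println(renderAll([]int{1, 2, 3, 4, 5}, 3))
}
```

Each goroutine writes to its own slice index, so no mutex is needed around `frames`.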
Frame consistency across a sequence (the same character looking the same in frame 1 and frame 15) is still an active challenge. We're making progress, but it's not solved.
Architecture Patterns That Worked
Feature-based package structure. Each feature (screenplay, shot, storyboard, event tracking) is its own Go package with its own handler, service, repository, and router. No shared "models" package that everything depends on. This keeps the dependency graph clean and lets us work on features independently.
Background workers as goroutines. We have several background processes (credit drips, behavioral emails, analytics) running on tickers. Each starts as a goroutine with a context for graceful shutdown. Simple, no external job queue needed.
func (s *Service) Start(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.processAll(ctx)
		}
	}
}
Migrations auto-run on startup. In production, the binary runs pending database migrations before starting the HTTP server. No separate migration step in CI/CD. This means every deploy is self-contained — the code and its required schema ship together.
Config from environment. Every configurable value comes from environment variables loaded at startup. Different env for dev vs production, validated on boot with clear error messages. No config files, no YAML.
What Didn't Work (Initially)
Regex for PDF parsing. We spent two weeks building a regex-based screenplay parser. It handled 70% of PDFs perfectly and mangled the other 30%, because different apps export different formatting. An LLM handles the ambiguity much better, and the cost is negligible (~$0.01 per parse).
One LLM for everything. We initially used the same model for analysis, shot generation, and prompt building. Performance was mediocre across all three. Splitting to different models (and different prompting strategies) per stage improved quality significantly.
Synchronous image generation. Early versions generated frames one at a time, so a 15-shot scene took 45 seconds. Batching three at a time (with rate limiting) cut it to under 20 seconds. Go makes this trivial; we should have done it from day one.
The Stack
| Component | Technology |
|---|---|
| Language | Go 1.25 |
| HTTP | Chi (stdlib-compatible) |
| Database | PostgreSQL 16 (pgx v5 driver) |
| Queries | sqlc (compile-time type safety) |
| Migrations | golang-migrate |
| Auth | Supabase (Google OAuth) |
| Storage | Azure Blob Storage |
| Hosting | Azure Container Apps |
| Frontend | Next.js (App Router) |
| AI | Multiple models (different per stage) |
Try It
StoryBirdie is free to try — 50 credits, no credit card required. Upload a screenplay and see the full pipeline in action: analysis, shot list, storyboard frames, and PDF export.
If you're building LLM-powered pipelines in Go or working with structured creative output, I'd love to hear your approach. Drop a comment or check out the product.
Amit Timalsina — co-founder at StoryBirdie