Aleksandr Kossarev

AI-Orchestrated 3D Asset Pipeline: From JPEG to Game-Ready GLB Without Touching Blender

TL;DR: I built a pipeline where an AI agent operates Blender through MCP (Model Context Protocol), while a vision model validates every step by looking at screenshots. I never opened Blender's GUI for modeling. Here's what worked, what broke, and the patterns that emerged after rigging 6+ animated models for a Godot 4 project.


The Setup

I needed animated 3D fish for a virtual aquarium in Godot 4. I don't know Blender. Instead of learning it, I built a pipeline where AI does the work and I supervise.

The stack:

  • CLI — my entry point, natural language instructions
  • AI coding agent (via MCP) — writes and executes Blender Python code
  • Blender MCP addon — exposes Blender operations as MCP tools over a local socket
  • Vision model (VLM) — looks at viewport screenshots and validates results
  • Meshy.ai — converts reference photos to 3D models with textures
  • Godot 4 — final destination for rigged, animated GLB files

The architecture:

Human (instructions)
  → AI Agent (generates bpy code)
    → MCP Protocol (JSON-RPC over stdio)
      → Blender Addon (socket :9876, executes Python)
        → Viewport Screenshot
          → Vision Model (validates result)
            → AI Agent (adjusts or proceeds)
              → Export GLB → Godot

The human speaks problems. The AI translates them into Blender Python. The vision model confirms whether the result looks correct. Nobody clicks anything in Blender.
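As a rough sketch of the last hop (agent to Blender addon), the snippet below sends one Python snippet to a local socket on port 9876. The message shape (`{"type": "execute_code", "params": {"code": ...}}`) and the helper names `build_command` and `send_to_blender` are assumptions for illustration; check the addon you run for its actual wire format.

```python
import json
import socket


def build_command(code: str) -> bytes:
    """Encode a Blender Python snippet as a JSON command (assumed message shape)."""
    message = {"type": "execute_code", "params": {"code": code}}
    return json.dumps(message).encode("utf-8")


def send_to_blender(code: str, host: str = "127.0.0.1", port: int = 9876,
                    timeout: float = 10.0) -> dict:
    """Send one command to the addon socket and return the decoded JSON reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_command(code))
        received = b""
        while True:
            data = sock.recv(8192)
            if not data:
                break
            received += data
            try:
                # Return as soon as the bytes so far form a complete JSON document.
                return json.loads(received.decode("utf-8"))
            except ValueError:
                continue  # partial reply, keep reading
    raise ConnectionError("addon closed the connection without a full reply")
```

With Blender and the addon running, `send_to_blender("import bpy; print(len(bpy.data.objects))")` would be one "action" in the loop above.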


Why This Approach

Traditional 3D pipeline: learn Blender (weeks), model manually (hours per asset), rig by hand (more hours), debug in Godot (pain).

AI-orchestrated pipeline: describe what you want, AI executes, vision model validates, iterate until correct. First model takes a couple of hours of prompt debugging. By the tenth model, you're done in 10 minutes.

The key insight: you don't automate Blender by writing a perfect script once. You automate it by teaching an AI agent to handle failures through a vision feedback loop.


Pattern 1: One Action, One Verification

This is the most important pattern. Everything else depends on it.

1. AI executes ONE Blender operation
2. Take a screenshot of the viewport
3. Vision model checks the result
4. If OK → next step. If FAIL → undo → try different approach.

Why not batch operations? If the AI executes 6 bone extrusions in sequence and something breaks at step 2, neither the AI nor you can tell where it went wrong. One action per cycle means deterministic rollback.

Why vision validation? Blender's Python API doesn't always tell you the truth about visual results. A bone might report correct coordinates but visually overlap with another bone. Weights might be "assigned" but produce garbage deformation. The viewport screenshot is ground truth.

Anti-stuck rule: if the same approach fails 3 times in a row, the AI must switch strategy. Extrude not working? Try moving the bone directly. Auto-weights failing? Switch to manual Gaussian assignment.
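The loop and the anti-stuck rule can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: `run_step` and its callables (`execute`, `verify`, `undo`) are hypothetical names standing in for "one Blender operation", "screenshot plus vision check", and "undo".

```python
from typing import Callable, Sequence


def run_step(strategies: Sequence[Callable[[], None]],
             verify: Callable[[], bool],
             undo: Callable[[], None],
             max_attempts: int = 3) -> int:
    """Try each strategy up to max_attempts times; return the index of the one that passed."""
    for index, execute in enumerate(strategies):
        for _ in range(max_attempts):
            execute()        # exactly one Blender operation
            if verify():     # screenshot + vision check
                return index
            undo()           # roll back before the next attempt
    raise RuntimeError("all strategies exhausted; escalate to the human")
```

Passing the strategies as an ordered list makes the switch explicit: after three failures of "extrude", the next callable ("move the bone directly") is tried, and only when every strategy is exhausted does the step fail.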


Pattern 2: Structured Prompts for the Vision Model

A naive prompt to a vision model produces naive answers. "Look at this Blender screenshot" gets you "I see some orange lines." You need structured, domain-specific prompts.

Bad:

"Check the skeleton"

Good:

"You are a rigging tech lead. Count the bones in the armature. 
Check: 1) All bone heads connect to previous bone tails? 
2) Last bone reaches the end of the mesh?
Answer strictly: bones=N|chain_ok=true/false|tail_reach=true/false"
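A strict answer format pays off because the agent can parse it without guessing. A minimal parser for the `key=value|key=value` replies might look like this (`parse_vlm_reply` is a hypothetical helper name):

```python
def parse_vlm_reply(reply: str) -> dict:
    """Parse 'bones=7|chain_ok=true|tail_reach=false' into typed values."""
    parsed = {}
    for field in reply.strip().split("|"):
        key, sep, value = field.partition("=")
        if not sep:
            # The model drifted from the format: fail loudly instead of guessing.
            raise ValueError(f"malformed field: {field!r}")
        key, value = key.strip(), value.strip()
        if value.lower() in ("true", "false"):
            parsed[key] = value.lower() == "true"
        elif value.isdigit():
            parsed[key] = int(value)
        else:
            parsed[key] = value
    return parsed
```

A reply that raises `ValueError` is treated as a failed check, so a chatty or off-format answer never counts as a pass.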

Three prompt templates that cover 90% of validation:

Mode | Prompt format | When to use
--- | --- | ---
Skeleton check | `bones=N\|chain_ok=true/false\|…` | …
Rigging check | `weights_painted=true/false\|only_tip_deforms=true/false\|…` | …
State check | `mode=EDIT/POSE/OBJECT\|selected=Bone.006\|…` | …

Critical tips:

  • Never ask the VLM to count precisely. It hallucinates numbers on complex scenes. Instead, ask it to compare: "Are there MORE, FEWER, or SAME number of bones as the reference (7)?"
  • Use multiple-choice format: "What bent? A) Only the tip B) Whole tail C) Entire body. Answer with one letter." Comparisons work better than open-ended questions.
  • Force the viewport angle before taking screenshots. Side view for spine/tail, front view for gills. The AI must set the camera programmatically before each screenshot.
  • Force a redraw before screenshotting: bpy.ops.wm.redraw_timer(type='DRAW_WIN_SWAP', iterations=1). Without this, the screenshot captures a stale frame.
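The last two tips can be combined into one helper that runs inside Blender before every screenshot. This is a sketch under assumptions: the check-to-view mapping and the names `view_for` and `prepare_viewport` are illustrative, `bpy.context.temp_override` needs Blender 3.2 or newer, and `bpy.context.screen` must be available in the context the addon executes code in.

```python
# Illustrative mapping from the thing being checked to a viewport axis.
VIEW_FOR_CHECK = {"spine": "RIGHT", "tail": "RIGHT", "gills": "FRONT"}


def view_for(check: str) -> str:
    """Return the view axis to use for a given check."""
    return VIEW_FOR_CHECK[check]


def prepare_viewport(check: str) -> None:
    """Run inside Blender: snap the first 3D viewport to the right axis, then redraw."""
    import bpy  # only importable inside Blender

    area = next(a for a in bpy.context.screen.areas if a.type == "VIEW_3D")
    region = next(r for r in area.regions if r.type == "WINDOW")
    # view3d operators need a 3D viewport in their context.
    with bpy.context.temp_override(area=area, region=region):
        bpy.ops.view3d.view_axis(type=view_for(check))
    # Force a fresh frame so the screenshot is not stale.
    bpy.ops.wm.redraw_timer(type="DRAW_WIN_SWAP", iterations=1)
```

If the view change is animated in your Blender preferences, a second redraw or a short wait may be needed before the screenshot.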

Pattern 3: Clean Scene Between Models

Blender retains actions, armature data, and mesh data even after deleting objects from the scene. If you rig Fish A, then import Fish B without cleaning, Fish A's bone animations leak into Fish B's export.

Real incident: Koi bone names appeared in Pterophyllum's GLB export, causing "Animation target not found" warnings in Godot.

Mandatory cleanup script before each new model:

import bpy

# Delete all scene objects
for obj in list(bpy.context.scene.objects):
    bpy.data.objects.remove(obj, do_unlink=True)

# Purge all orphan data blocks
bpy.ops.outliner.orphans_purge(
    do_local_ids=True, 
    do_linked_ids=False, 
    do_recursive=True
)

# Verify: everything should be zero
print(f"Objects: {len(bpy.data.objects)}, "
      f"Actions: {len(bpy.data.actions)}, "
      f"Armatures: {len(bpy.data.armatures)}, "
      f"Meshes: {len(bpy.data.meshes)}")

Rule: one model at a time. Import → rig → weight → test → export → clean. Only then start the next one.
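A cheap pre-export guard against the leak described above is to compare the bone names referenced by animation curves with the bones that actually exist in the current armature. The helper below is a sketch in plain Python; `foreign_bones` and the bone names in the usage note are hypothetical.

```python
import re

# Matches the bone name in data paths such as pose.bones["Tail1"].rotation_quaternion
BONE_PATH = re.compile(r'pose\.bones\["([^"]+)"\]')


def foreign_bones(fcurve_paths, armature_bones):
    """Return bone names referenced by animation curves but absent from the armature."""
    known = set(armature_bones)
    found = set()
    for path in fcurve_paths:
        match = BONE_PATH.match(path)
        if match and match.group(1) not in known:
            found.add(match.group(1))
    return sorted(found)
```

Inside Blender, the inputs could be gathered with something like `[fc.data_path for a in bpy.data.actions for fc in a.fcurves]` and `[b.name for b in armature.data.bones]`, assuming a Blender version where actions still expose `fcurves` directly. A non-empty result means a previous model's action survived the cleanup.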


Pattern 4: Auto-Weights Will Fail on Complex Geometry

Blender's ARMATURE_AUTO weight assignment uses bone heat weighting, which diffuses each bone's influence across the mesh surface. This works for simple, closed meshes. For thin geometry (fins, veils, tails), opposite faces nearly touch and every bone ends up "close" to every vertex, so the solver either fails or produces garbage.

Symptoms:

  • "Bone Heat Weighting: failed to find solution for one or more bones"
  • Root bone influences 100% of vertices
  • Entire body deforms when you rotate one fin

What works instead: manual Gaussian weight assignment.

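import math
import bpy

# Placeholder names: substitute your own mesh, armature, and bone
mesh = bpy.data.objects["Fish"]
arm = bpy.data.objects["Armature"]
bone = arm.data.bones["Tail1"]
group = mesh.vertex_groups.get(bone.name) or mesh.vertex_groups.new(name=bone.name)

sigma = 0.03  # falloff radius in armature space; adjust per bone size
to_arm = arm.matrix_world.inverted() @ mesh.matrix_world  # mesh-local -> armature space
for v in mesh.data.vertices:
    d = (to_arm @ v.co - bone.head_local).length  # distance from the bone head
    if d < sigma * 3:
        w = math.exp(-d * d / (2 * sigma * sigma))
        if w > 0.05:  # skip negligible influence
            group.add([v.index], w, 'REPLACE')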

Follow with normalization and smoothing (vertex_group_smooth(factor=0.3, repeat=1)). Then validate with the vision model.
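To make "normalization" concrete: after Gaussian assignment, the weights on one vertex can sum to more or less than 1, so each weight is divided by the per-vertex total. The sketch below shows the arithmetic on plain dictionaries (`normalize_weights` is a hypothetical helper); inside Blender, `bpy.ops.object.vertex_group_normalize_all()` performs the equivalent operation on real vertex groups.

```python
def normalize_weights(weights_by_bone):
    """Scale weights so each vertex's weights sum to 1.

    weights_by_bone: {bone_name: {vertex_index: weight}}
    """
    totals = {}
    for weights in weights_by_bone.values():
        for vertex, weight in weights.items():
            totals[vertex] = totals.get(vertex, 0.0) + weight
    return {
        bone: {v: w / totals[v] for v, w in weights.items() if totals[v] > 0.0}
        for bone, weights in weights_by_bone.items()
    }
```

For example, a vertex weighted 0.2 to one bone and 0.6 to another ends up at 0.25 and 0.75.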

Another common trap: neutral_bone or Root eating all weights. If a bone sits at origin with use_deform=True, auto-weights assign it to everything. Fix: bone.use_deform = False for utility bones, then re-bind.


Pattern 5: Blender → Godot Translation Gotchas

Many things that work in Blender break silently in Godot. These cost the most debugging time.

Rotation mode

Blender defaults to Quaternion for armatures after GLB import. If your AI writes bone.rotation_euler.x = -0.5, nothing happens. The bone ignores Euler when in Quaternion mode.

Fix: always set bone.rotation_mode = 'XYZ' before animating with Euler, or work in Quaternion throughout.
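If you choose the "Quaternion throughout" route, single-axis rotations are easy to build by hand. A small sketch (`quat_about_axis` is a hypothetical helper) that returns components in Blender's (w, x, y, z) order:

```python
import math


def quat_about_axis(axis: str, angle_rad: float):
    """Return (w, x, y, z) for a rotation of angle_rad about the local X, Y or Z axis."""
    half = angle_rad / 2.0
    s, c = math.sin(half), math.cos(half)
    return {
        "X": (c, s, 0.0, 0.0),
        "Y": (c, 0.0, s, 0.0),
        "Z": (c, 0.0, 0.0, s),
    }[axis]
```

Inside Blender, `pose_bone.rotation_quaternion = quat_about_axis("X", -0.5)` would then have the effect that `rotation_euler.x = -0.5` silently failed to produce in Quaternion mode.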

Rest pose must be identity

If a bone's rest pose isn't aligned to world axes, Godot applies animation offsets relative to a non-identity transform. Result: the jaw nods the entire head instead of opening the mouth.

Fix: in Edit Mode, align all bones strictly along X/Y/Z axes. Set roll = 0 for every bone. After posing, clear all transforms — the mesh should not move. If it moves, rest pose is wrong.

Scale on bones is unreliable

Godot 4.x sometimes ignores bone scale if rest pose doesn't match skeleton rest. Gill breathing animated via scale.x on a bone worked in Blender but did nothing in Godot.

Fix: use Shape Keys (blend shapes) instead of bone scale for facial/gill animation. Shape Keys work deterministically in both Blender and Godot. Bone animation is only for rotation-based movement (swimming, tail wagging).

Constraints don't export

Godot doesn't understand Blender constraints (Copy Rotation, etc). They must be baked before export.

bpy.ops.nla.bake(
    frame_start=1, frame_end=60,
    visual_keying=True,      # bake constraint results
    clear_constraints=True,  # remove constraints from export
    bake_types={'POSE'}
)

Forward axis mismatch

Body axis in Blender is X, in Godot is -Z. All models need a 90° rotation on import. Apply transforms before export: bpy.ops.object.transform_apply(location=True, rotation=True, scale=True).
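The 90° figure can be checked with a few lines of arithmetic. This sketch assumes the glTF exporter's default +Y Up conversion, which maps Blender (x, y, z) to (x, z, -y), and a right-handed rotation about Godot's Y axis; the function names are illustrative.

```python
import math


def blender_to_gltf(vec):
    """Z-up Blender coordinates to Y-up glTF coordinates: (x, y, z) -> (x, z, -y)."""
    x, y, z = vec
    return (x, z, -y)


def rotate_about_y(vec, degrees):
    """Rotate (x, y, z) about the +Y axis (right-handed, Y-up)."""
    x, y, z = vec
    a = math.radians(degrees)
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))


body_axis = blender_to_gltf((1.0, 0.0, 0.0))  # body along Blender +X stays on +X
forward = rotate_about_y(body_axis, 90.0)     # approximately (0, 0, -1), i.e. -Z
```

So a body modeled along Blender's +X arrives in Godot along +X, and a +90° rotation about Y points it down -Z.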

Animation speed

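An action keyframed at 30 FPS plays at half the intended speed in Godot if you timed it for 60 FPS. glTF stores keyframe times in seconds, so the exported clip keeps its Blender duration (60 frames at 30 FPS last 2 seconds). Set AnimationPlayer.speed_scale = 2.0 or author at 60 FPS from the start.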
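The arithmetic behind the speed fix, as a tiny sketch (both function names are hypothetical):

```python
def clip_seconds(frame_count: int, fps: float) -> float:
    """Duration of a clip that is frame_count frames long at the given frame rate."""
    return frame_count / fps


def speed_scale_for(authored_fps: float, intended_fps: float) -> float:
    """Playback multiplier that makes a clip authored at one rate run as if at another."""
    return intended_fps / authored_fps
```

A 60-frame clip lasts `clip_seconds(60, 30)` = 2.0 seconds at 30 FPS, and `speed_scale_for(30, 60)` = 2.0 is the value to put in `AnimationPlayer.speed_scale` to bring it down to one second.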


Pattern 6: The AI Agent Has Limits

One task per call

The coding AI cannot handle multi-step instructions reliably. "Animate Tail1, Tail2, Tail3 and both pectoral fins" produces bpy.ops.pose.select_all and breaks everything.

Fix: one bone per call. Animate Tail1 → vision check → animate Tail2 → vision check → ... → bake all together at the end.

Context mode matters

Blender's API is context-sensitive. Most bpy.ops calls fail with "poll() failed, context is incorrect" if you're in the wrong mode.

Rules the AI must follow:

  • Before mode_set(mode='POSE') → set active = armature
  • Before mode_set(mode='WEIGHT_PAINT') → set active = mesh
  • Before mode_set(mode='EDIT') for armature → first go to OBJECT, then set active, then EDIT
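  • bpy.ops.object.select_all(action='DESELECT') only works in OBJECT mode; POSE and EDIT modes have their own operators (bpy.ops.pose.select_all, bpy.ops.armature.select_all)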
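These rules are easier to enforce when the agent calls one helper instead of raw `mode_set`. The sketch below splits the logic into a pure planning step and a Blender-side apply step; `mode_switch_plan`, `apply_plan`, and the `objects` dictionary are assumptions for illustration.

```python
# Which kind of object must be active before entering each mode.
ACTIVE_FOR_MODE = {"POSE": "armature", "EDIT": "armature", "WEIGHT_PAINT": "mesh"}


def mode_switch_plan(target_mode: str) -> list:
    """Return the ordered steps: drop to OBJECT, activate the right object, enter the mode."""
    kind = ACTIVE_FOR_MODE[target_mode]
    return [("mode", "OBJECT"), ("activate", kind), ("mode", target_mode)]


def apply_plan(plan, objects) -> None:
    """Run inside Blender. objects: {'armature': <Object>, 'mesh': <Object>}."""
    import bpy  # only importable inside Blender

    for action, value in plan:
        if action == "activate":
            bpy.context.view_layer.objects.active = objects[value]
        elif bpy.context.view_layer.objects.active is not None:
            # mode_set needs an active object; skip the OBJECT step if there is none yet.
            bpy.ops.object.mode_set(mode=value)
```

Always passing through OBJECT mode costs one extra call but removes most "context is incorrect" failures from the loop.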

The AI will get stuck

After 3 failed attempts with the same approach, force a strategy change. This must be an explicit rule in the agent's instructions, not a hope.


Pattern 7: Post-Solution Patterns (PSP)

After each model, document what broke and how you fixed it. This creates a growing knowledge base that makes each subsequent model faster.

Format:

Symptom: [what you observed]
Cause: [root cause]
Fix: [code or procedure]
Applies to: [which model types]

Examples from real production:

# | Symptom | Cause | Fix
--- | --- | --- | ---
1 | rotation_euler has no effect | rotation_mode='QUATERNION' | Set rotation_mode='XYZ' first
2 | Entire body moves when rotating fin | use_connect=True on fin bone | Set use_connect=False, parent to Spine1
3 | Orphan animations in exported GLB | Previous model's data not purged | Full cleanup script between models
4 | Jaw nods the head in Godot | Rest pose not identity | Align bones to world axes, roll=0
5 | Gills don't animate in Godot | Scale on bones ignored by Godot 4 | Use Shape Keys instead of bone scale
6 | Vision model says FAIL but code says PASS | Wrong viewport angle | Set camera to RIGHT/FRONT view before screenshot

After several models, PSP becomes your real pipeline. The AI reads it before starting each new model and avoids known pitfalls. First model: about 2 hours. Latest models: about 10 minutes.
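One way to keep PSP machine-readable is a small record type plus a lookup the agent can call when it hits a symptom. This is a sketch, not the project's actual storage format: `PSPEntry`, `find_entries`, and the "fish" model-type label are hypothetical, while the two sample entries come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSPEntry:
    symptom: str
    cause: str
    fix: str
    applies_to: tuple  # model types, e.g. ("fish",); hypothetical labels


def find_entries(entries, observed, model_type):
    """Return entries for model_type whose symptom contains every word of the observation."""
    words = observed.lower().split()
    return [
        e for e in entries
        if model_type in e.applies_to
        and all(w in e.symptom.lower() for w in words)
    ]


PSP = [
    PSPEntry("rotation_euler has no effect", "rotation_mode='QUATERNION'",
             "Set rotation_mode='XYZ' first", ("fish",)),
    PSPEntry("Gills don't animate in Godot", "Scale on bones ignored by Godot 4",
             "Use Shape Keys instead of bone scale", ("fish",)),
]
```

Keyword matching is deliberately crude here; the point is that each fix is stored once and retrieved by symptom rather than rediscovered.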


Pattern 8: Assert Vision — Tests for 3D

The most powerful pattern that emerged: using the vision model as a test framework.

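def assert_vision(question, expected_answer):
    # screenshot() and vlm_ask() are the pipeline's own helpers (not shown here)
    result = vlm_ask(screenshot(), question)
    # Compare the normalized reply exactly: a substring test would let "A" match "Entire body"
    answer = result.strip().strip(".)").upper()
    if answer != expected_answer.upper():
        raise AssertionError(
            f"Vision assert failed: expected '{expected_answer}', got '{result}'"
        )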

Usage:

# After rigging
assert_vision("Tail3 rotated 45°. What bent? A) Only tip B) Whole tail C) Entire body", "A")

# After weight painting  
assert_vision("Head changed position?", "NO")

# After animation bake
assert_vision("Frame 1 and frame 60. Same pose?", "YES")

# After export and Godot import
assert_vision("Skeleton visible? Tail bends?", "YES")

This is CI/CD for 3D. If you change weights tomorrow, run the assert suite. If anything breaks, you know immediately.
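To run the asserts as a suite, it helps to collect every failure instead of stopping at the first one. A minimal sketch, where `run_vision_suite` and the injected `ask` callable (question in, VLM reply out) are hypothetical:

```python
def run_vision_suite(checks, ask):
    """Run all checks and return the failures.

    checks: list of (question, expected_answer)
    ask:    callable taking a question and returning the VLM's reply as a string
    """
    failures = []
    for question, expected in checks:
        reply = ask(question).strip().upper()
        if reply != expected.upper():
            failures.append((question, expected, reply))
    return failures
```

Injecting `ask` also means the suite itself can be tested with canned replies, without a GPU or a running Blender.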


The Complete Workflow for One Model

1.  Clean Blender scene (purge orphans)
2.  Import GLB from Meshy.ai
3.  Orient body along X axis (rotate Z -90°, apply transforms)
4.  Decimate to target polycount (ratio 0.15-0.3)
5.  Create armature: spine chain + fins + jaw
6.  Parent mesh to armature with empty vertex groups
7.  Assign weights: Gaussian for each bone, normalize, smooth
8.  Vision check: rotate each bone → "only target deforms?"
9.  Selective zero: remove weight leaks from body to face bones
10. Vision check: jaw/gills move independently?
11. Create swim animation: sin wave on spine chain, 60 frames
12. Vision check: frame 1 = frame 60? Natural motion?
13. Bake action: visual_keying=True, clear_constraints=True
14. Export GLB with animations and Shape Keys
15. Import in Godot, verify animation plays correctly
16. Clean Blender scene for next model

Between steps 7-10, expect 2-5 iterations per bone. This is normal. The feedback loop (AI executes → vision validates → AI adjusts) converges quickly once PSP covers common failure modes.
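For steps 11 and 12, the loop condition (frame 1 equals frame 60) falls out of the formula if the sine completes exactly one period between the first and last frame. A sketch of the per-bone angle, where `swim_angle`, the amplitude, and the phase step are illustrative values rather than the ones used in the project:

```python
import math


def swim_angle(bone_index, frame, frame_count=60, amplitude=0.3, phase_step=0.6):
    """Rotation in radians for one spine bone at a 1-based frame.

    Each bone further down the chain lags by phase_step, which produces the travelling wave.
    """
    t = (frame - 1) / (frame_count - 1)  # 0.0 at the first frame, 1.0 at the last
    return amplitude * math.sin(2.0 * math.pi * t - phase_step * bone_index)
```

Inside Blender, the value would be keyframed on each spine bone's rotation for frames 1 through 60 before the bake in step 13.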


Results

Metric | First model | After PSP (latest models)
--- | --- | ---
Time to rigged GLB | ~2 hours | ~10 minutes
Manual Blender work | Occasional weight painting | Zero
Vision checks per model | 15-20 | 3-5
Export failures | 3-4 attempts | Usually first try

The bottleneck shifted from "learning Blender" to "debugging AI prompts." When the AI makes a mistake, 90% of the time it's because the vision model gave bad feedback. Fix one line in the VLM prompt — the entire system gets smarter.

Evolution: Unified Vision+Coding Model

An important optimization emerged during the project. The initial architecture used a small local vision model (Qwen3VL-4B) purely for validation, while a separate coding AI generated the Blender Python. This meant two models, two contexts, two sets of prompts, and a manual bridge between them.

Later, I switched to a larger Qwen model accessed through MCP that could both see the viewport and write code. One model that understands what it's looking at AND knows how to fix it. The feedback loop collapsed from "AI writes code → screenshot → VLM checks → human relays feedback → AI adjusts" to "AI writes code → looks at result → adjusts itself."

This cut iteration time significantly. The patterns in this article still apply — one action per check, structured prompts, PSP — but the architecture becomes simpler when vision and coding live in the same model.


Key Takeaways

  1. One action, one check. Never let the AI chain operations blindly. Deterministic rollback requires deterministic steps.

  2. Vision validation is non-negotiable. Code can report success while the viewport shows garbage. The screenshot is ground truth.

  3. Auto-weights fail on thin geometry. Plan for manual Gaussian assignment on fins, veils, and facial features.

  4. Blender and Godot speak different languages. Rest pose identity, quaternion rotation, Shape Keys over bone scale, baked constraints — learn these once, document in PSP, never debug again.

  5. PSP is the real product. The pipeline isn't the code. It's the accumulated knowledge of what breaks and how to fix it. Each model teaches the system.

  6. The human role is supervisor, not operator. You describe problems in natural language. The AI translates to code. The VLM validates visually. You make decisions when the system gets stuck.


What's Next

The same architecture — AI agent + MCP tool + vision validation — applies beyond Blender. Any GUI-heavy professional tool that exposes an API can be orchestrated this way. The patterns (one action/one check, structured VLM prompts, PSP accumulation) are universal.

The agents aren't replacing 3D artists. They're making 3D accessible to people who have ideas but not the specialized skills to execute them. The quality ceiling is still set by human judgment — but the floor has risen dramatically.


Tested on: Linux Mint 22.3, Blender 4.0+, Godot 4.x, NVIDIA RTX 5060 Ti (eGPU via Thunderbolt 4)

MCP Server: BlenderMCP 1.27.1

Vision Models: Qwen3VL-4B (local, llama.cpp) → later Qwen (larger, unified vision+coding via MCP)

Author: Aleksandr Kossarev, Jõgeva, Estonia

Project: Arche Iscrin


This article is based on 2300+ lines of production notes from rigging 6 animated fish models for a Godot virtual aquarium, using an AI-orchestrated pipeline without manual Blender operation.
