<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yha9806</title>
    <description>The latest articles on DEV Community by yha9806 (@yha9806).</description>
    <link>https://dev.to/yha9806</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876647%2F912bd6a6-dd6f-421c-a7f5-23f7c93f90d7.jpeg</url>
      <title>DEV Community: yha9806</title>
      <link>https://dev.to/yha9806</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yha9806"/>
    <language>en</language>
    <item>
      <title>Same `gpt-image-2` API. Two totally different results. The difference is 3 markdown files.</title>
      <dc:creator>yha9806</dc:creator>
      <pubDate>Sun, 26 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://dev.to/yha9806/same-gpt-image-2-api-two-totally-different-results-the-difference-is-3-markdown-files-3ipj</link>
      <guid>https://dev.to/yha9806/same-gpt-image-2-api-two-totally-different-results-the-difference-is-3-markdown-files-3ipj</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We took a Glasgow street photo and tried to add Northern Song gongbi (工笔重彩) painterly elements — red lanterns, Chinese signage, muted-gold trim — without dissolving the photo's existing pixels.&lt;/p&gt;

&lt;p&gt;Two paths, same &lt;code&gt;gpt-image-2&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bare API + naive prompt&lt;/strong&gt; ("repaint in gongbi style"): the entire image gets washed into a unified painterly filter. Photo gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulca-mediated structured prompt&lt;/strong&gt;: photo anchors preserved, gongbi elements painted INTO the scene as discrete additions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference between path 1 and path 2 is &lt;strong&gt;three markdown files&lt;/strong&gt; that an agent produces by walking the brainstorm → spec → plan triad.&lt;/p&gt;

&lt;p&gt;This post walks through what that "structured prompt composition" actually is, the silent &lt;code&gt;input_fidelity&lt;/code&gt; parameter drift we caught between &lt;code&gt;design.md&lt;/code&gt; and the live &lt;code&gt;gpt-image-2&lt;/code&gt; GA endpoint (and the &lt;code&gt;v0.17.12&lt;/code&gt; fix shipped today), and how the same triad evaluates cross-cultural AI generation honestly (including an L2 hard-fail at &lt;code&gt;0.65&lt;/code&gt; that we surface plus a &lt;code&gt;user-override-accept&lt;/code&gt; we ALSO record).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;&lt;br&gt;
Install: &lt;code&gt;pip install "vulca[mcp]==0.17.14"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmukvo05zwae14e1uw0wh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmukvo05zwae14e1uw0wh.png" alt="Same gpt-image-2 API. Two totally different results. The difference is 3 markdown files." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Vulca is an MCP-native toolkit for cultural-art generation. The agent owns the brain (proposal → design → plan); the SDK owns the hands and eyes (prompt composition, layer decomposition, L1-L5 scoring). We were dogfooding our own triad on a real-world brief: &lt;em&gt;"add Chinese gongbi cultural elements to a Scottish street photo, but preserve every photographic anchor"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;User-supplied source: a Glasgow street with red brick Victorian buildings, a Gothic cathedral spire, Stagecoach buses, a woman in a purple jacket walking — your standard urban photograph.&lt;/p&gt;

&lt;p&gt;Target: visible Chinese street culture (lanterns, calligraphy signage, muted-gold trim) painted INTO the scene at gongbi technical fidelity. &lt;em&gt;Not&lt;/em&gt; a Photoshop filter; closer to a Wang Ximeng (王希孟, c.1096–1119) &lt;em&gt;Thousand Li of Rivers and Mountains&lt;/em&gt; (《千里江山图》, Northern Song, 1113) palette discipline applied as overlay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1 — naive bare API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.openai.com/v1/images/edits &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"model=gpt-image-2"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"prompt=Add Chinese gongbi painterly elements to this Glasgow street photo"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"image=@source.png"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result (slide 1, left): the model produced a unified-wash painterly filter applied to the entire image. Red brick became flat color, building edges softened, photographic detail lost. The output reads as a &lt;em&gt;style transfer&lt;/em&gt;, not an &lt;em&gt;additive overlay&lt;/em&gt;. The lanterns are there, but so is everything else, all in the same painterly register.&lt;/p&gt;

&lt;p&gt;This is the failure mode you get when "gongbi" is interpreted globally rather than as a discipline applied to specific painted-in elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2 — Vulca-mediated structured prompt
&lt;/h2&gt;

&lt;p&gt;Same &lt;code&gt;gpt-image-2&lt;/code&gt;. Same source photo. Same intent. The difference is what the prompt looks like by the time it reaches OpenAI.&lt;/p&gt;

&lt;p&gt;Vulca's &lt;code&gt;compose_prompt_from_design()&lt;/code&gt; is a small, deliberately boring helper: it reads a resolved &lt;code&gt;design.md&lt;/code&gt; artifact, parses the YAML frontmatter + the &lt;code&gt;## C. Prompt composition&lt;/code&gt; block, and &lt;strong&gt;concatenates three pieces&lt;/strong&gt; in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;C.base_prompt&lt;/code&gt; — the user-authored prose that names MUST-PRESERVE anchors (Gothic spire, red brick wall, woman in purple jacket, bus, traffic light, sky, distant pedestrians, right-side building silhouette) and ADD-as-gongbi elements (lanterns, calligraphy signage, muted-gold trim, 千里江山图 palette echoes), plus the &lt;code&gt;style_treatment: additive&lt;/code&gt; discipline clause.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C.tradition_tokens&lt;/code&gt; — terminology copied from the &lt;code&gt;chinese_gongbi&lt;/code&gt; cultural registry: &lt;em&gt;meticulous heavy-color painting (工笔重彩) · triple alum nine washes (三矾九染) · plain line drawing (白描) · outline and fill color (勾勒填彩) · boneless technique (没骨法) · court academy painting (院体画)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C.color_constraint_tokens&lt;/code&gt; — cinnabar red (朱砂红) · muted gold (泥金) · stone blue (石青) · malachite green (石绿) — and an explicit &lt;em&gt;forbid: neon saturation, CNY plastic red, cartoon rainbow&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The "value" isn't a clever compiler; it's that &lt;strong&gt;the prose, the terminology, and the color discipline are all archived in &lt;code&gt;design.md&lt;/code&gt;&lt;/strong&gt; — version-controlled, reviewable, and reproducible from disk. A new agent two weeks later can call &lt;code&gt;compose_prompt_from_design("design.md")&lt;/code&gt; and get the same prompt string back. That replayability is what bare-API workflows lose.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;design.md&lt;/code&gt; also asks for &lt;code&gt;input_fidelity=high&lt;/code&gt; on the OpenAI call. &lt;strong&gt;What actually happened&lt;/strong&gt; is documented in &lt;code&gt;plan.md&lt;/code&gt; Notes &lt;code&gt;[param-drift]&lt;/code&gt;: &lt;code&gt;gpt-image-2&lt;/code&gt; GA shipped without &lt;code&gt;input_fidelity&lt;/code&gt; support and &lt;strong&gt;rejected the parameter outright&lt;/strong&gt;. The pre-&lt;code&gt;v0.17.12&lt;/code&gt; &lt;code&gt;openai_provider&lt;/code&gt; was sending it unconditionally and would have failed; we caught the drift mid-session, gated the param by per-model capability (issue #12, fixed in &lt;code&gt;v0.17.12&lt;/code&gt; shipped to PyPI an hour before this post), and re-ran iter 0 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;input_fidelity&lt;/code&gt;. The image still succeeded — &lt;code&gt;style_treatment: additive&lt;/code&gt; plus the prompt-level "do NOT apply unified filter" clause carried preservation discipline through prose alone. The provenance trail is in &lt;code&gt;plan.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result (slide 1, right): the lanterns are painted, the calligraphy signage is gongbi 白描 (plain line drawing) on cinnabar panels — but the brick wall, spire, woman in purple jacket, and bus are recognizably the source photograph. The painterly elements read as &lt;em&gt;intent&lt;/em&gt; (画意), not &lt;em&gt;filter&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqpfa8iwft65yalfecao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqpfa8iwft65yalfecao.png" alt="Glasgow street + Northern Song gongbi additive overlay — full Vulca-mediated result, openai/gpt-image-2 seed 7" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Decompose: 1 image → 10 editable semantic layers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layers_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iters/7/gen_bfbbacd2.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decompose/iter1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestrated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;photograph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:[...]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipeline: YOLO + Grounding DINO + SAM + SegFormer face-parsing. Returns a manifest with per-entity status, alpha-sparse RGBA layers, and a &lt;code&gt;residual&lt;/code&gt; layer.&lt;/p&gt;

&lt;p&gt;iter1 entities. All numbers below come from &lt;code&gt;manifest.json&lt;/code&gt;'s &lt;code&gt;detection_report.per_entity[].pct_after_resolve&lt;/code&gt; (the deduplicated area share that resolves each overlap to a single owning entity):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;pct_after_resolve&lt;/th&gt;
&lt;th&gt;sam_score&lt;/th&gt;
&lt;th&gt;Detector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;person&lt;/td&gt;
&lt;td&gt;5.65%&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;woman&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lanterns&lt;/td&gt;
&lt;td&gt;8.05%&lt;/td&gt;
&lt;td&gt;0.61&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;row&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sign_top&lt;/td&gt;
&lt;td&gt;1.17%&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;red panel calligraphy plaque&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sign_right&lt;/td&gt;
&lt;td&gt;0.47%&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;tall vertical golden plaque&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spire&lt;/td&gt;
&lt;td&gt;2.08%&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;gothic cathedral spire pointed roof&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bus&lt;/td&gt;
&lt;td&gt;3.59%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;blue double decker bus&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;left_buildings&lt;/td&gt;
&lt;td&gt;24.45%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;red brick row&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;right_buildings&lt;/td&gt;
&lt;td&gt;7.51%&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;cathedral ornate stone facade&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sky&lt;/td&gt;
&lt;td&gt;15.50%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;blue sky upper region&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;9 entities sum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68.47%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;residual (deduped)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.53%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;leftover (shadows, awning, road, traffic light)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;success_rate: 1.0&lt;/code&gt;, no suspect detections, no missed entities. (Note: the &lt;code&gt;manifest.json&lt;/code&gt; also stores &lt;code&gt;layers[9].area_pct = 40.69%&lt;/code&gt; for the residual layer using &lt;em&gt;overlap-permissive&lt;/em&gt; accounting — both numbers are exposed; the deduped 31.53% is the canonical "share of canvas not assigned to any named entity".)&lt;/p&gt;
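&lt;p&gt;The deduped accounting is easy to sanity-check: because &lt;code&gt;pct_after_resolve&lt;/code&gt; assigns every pixel to exactly one owner, the nine entity shares plus the deduped residual should tile the canvas exactly.&lt;/p&gt;

```python
# Sanity check on the table above: under deduplicated accounting,
# entity shares + residual must sum to 100% of the canvas.
entity_pcts = [5.65, 8.05, 1.17, 0.47, 2.08, 3.59, 24.45, 7.51, 15.50]
residual_pct = 31.53

entities_sum = round(sum(entity_pcts), 2)
assert entities_sum == 68.47                           # "9 entities sum" row
assert round(entities_sum + residual_pct, 2) == 100.00 # full coverage
```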

&lt;p&gt;The lanterns layer's &lt;code&gt;sam_score&lt;/code&gt; is conspicuously low (&lt;code&gt;0.61&lt;/code&gt; vs the others' 0.93–1.01). That's not a bug — it's the pipeline doing something honest: SAM was given &lt;strong&gt;one bbox&lt;/strong&gt; for "row of red paper lanterns" and asked to mask the whole row as a single object. With six dispersed lanterns + tassels + occluding awning ropes, SAM returns a fragmented streak rather than six clean silhouettes. Multi-instance entity detection (per-lantern bbox + NMS-multi-output) is a &lt;code&gt;v0.18&lt;/code&gt; backlog item; today's pipeline is &lt;code&gt;1 entity = 1 bbox = 1 mask&lt;/code&gt;. &lt;strong&gt;Slide 3's lanterns thumbnail looks "noisy" because that's the real shape of the mask.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fird3xjqk247zkx8fv3dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fird3xjqk247zkx8fv3dp.png" alt="1 image → 10 editable semantic layers via YOLO + Grounding DINO + SAM + SegFormer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A practical workflow note: DINO open-vocabulary detection has a "phrase contamination" failure mode where one entity's label tokens bleed into another entity's matched_phrase. If you ask for both "Chinese calligraphy sign on red panel" and "large Chinese calligraphy signage" in the same plan, DINO may union the bbox into a single region. The defense is: &lt;em&gt;give each entity a phrase-distinct label&lt;/em&gt;. We renamed &lt;code&gt;sign_top&lt;/code&gt; to "red panel calligraphy plaque" and &lt;code&gt;sign_right&lt;/code&gt; to "tall vertical golden plaque" — both detect cleanly.&lt;/p&gt;
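&lt;p&gt;That defense can be pre-flighted before the detector ever runs. A hypothetical check (not a Vulca API): flag any pair of entity labels that share more than one content token, since those are the pairs open-vocabulary DINO is most likely to merge.&lt;/p&gt;

```python
# Hypothetical pre-flight check (not part of Vulca): flag label pairs that
# share too many tokens, the precondition for "phrase contamination".
from itertools import combinations

def risky_label_pairs(labels, max_shared=1):
    pairs = []
    for a, b in combinations(labels, 2):
        shared = set(a.lower().split()) & set(b.lower().split())
        if len(shared) > max_shared:
            pairs.append((a, b, shared))
    return pairs

before = ["Chinese calligraphy sign on red panel",
          "large Chinese calligraphy signage"]
after = ["red panel calligraphy plaque",
         "tall vertical golden plaque"]

assert risky_label_pairs(before)      # shares "chinese" + "calligraphy": merge hazard
assert not risky_label_pairs(after)   # phrase-distinct: detects cleanly
```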

&lt;h2&gt;
  
  
  Redraw: same layer, two paths
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layers_redraw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;artwork_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decompose/iter1/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lanterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;朱砂 cinnabar saturation +15%, 三矾九染 depth richer, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preserve lantern shapes and tassel positions exactly, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep gongbi outline-and-fill discipline, no global filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_gongbi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;layers_redraw&lt;/code&gt; sends the alpha-sparse layer through &lt;code&gt;gpt-image-2&lt;/code&gt;'s edit endpoint with the cultural tradition's prompt-composition layer applied. The slide-4-right artifact you see — a four-lantern + spire reinterpretation in full 工笔重彩 with deeper cinnabar saturation and gongbi-canonical line discipline (outline-and-fill, &lt;em&gt;勾勒填彩&lt;/em&gt;) — was authored via a fresh &lt;code&gt;generate_image&lt;/code&gt; call seeded by this same gongbi prompt scaffold, &lt;strong&gt;not&lt;/strong&gt; the literal &lt;code&gt;layers_redraw&lt;/code&gt; output above.&lt;/p&gt;

&lt;p&gt;The native &lt;code&gt;layers_redraw&lt;/code&gt; path on this row-of-six-lanterns alpha gives a stylistically incoherent result because the cream-flat reference loses per-instance geometry (see &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/visual-specs/2026-04-23-scottish-chinese-fusion/decompose/v0_17_14_native/NOTES.md" rel="noopener noreferrer"&gt;&lt;code&gt;decompose/v0_17_14_native/NOTES.md&lt;/code&gt;&lt;/a&gt; for the v0.17.14 end-to-end MCP run and the v0.18 backlog item). The &lt;code&gt;layers_redraw&lt;/code&gt; verb still works as documented; the carousel's slide-4-right just exercises the related &lt;code&gt;generate_image&lt;/code&gt; path for visual coherence.&lt;/p&gt;

&lt;p&gt;The model approximates the &lt;em&gt;visual register&lt;/em&gt; of gongbi — but as the L1-L5 scorecard below makes explicit, &lt;strong&gt;single-pass diffusion can't simulate the multi-pass alum-wash physics&lt;/strong&gt; of true 三矾九染. The redraw looks gongbi-flavored; it isn't gongbi-correct. Both can be true.&lt;/p&gt;

&lt;p&gt;The agent now has two paths for the lanterns layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alpha-isolated original (preservation, composite-friendly)&lt;/li&gt;
&lt;li&gt;gongbi-reinterpreted output (concept exploration, hero asset)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vulca exposes both paths via MCP. The choosing happens in the agent, not in a static pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11yaq4ats5mhnrpmdcuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11yaq4ats5mhnrpmdcuu.png" alt="Same layer, two paths — alpha-isolated lantern silhouettes vs gpt-image-2 + gongbi prompt reinterpretation" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest scoring — &lt;code&gt;mode="rubric_only"&lt;/code&gt; + agent self-grade
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;v0.17.12&lt;/code&gt; shipped a new evaluate mode this morning. Important: it does &lt;strong&gt;not&lt;/strong&gt; score the image. It returns the rubric so the &lt;strong&gt;agent&lt;/strong&gt; can score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate_artwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.../gen_bfbbacd2.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_gongbi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubric_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# result.score == None
# result.rubric == { weights, terminology, taboos, tradition_layers }
# result.score_schema == { L1: null, L2: null, ..., L5: null }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rubric_only&lt;/code&gt; returns the rubric template (L1-L5 weights from &lt;code&gt;chinese_gongbi.yaml&lt;/code&gt;, six terminology entries, taboos, tradition_layers) and an empty &lt;code&gt;score_schema&lt;/code&gt;. &lt;strong&gt;No VLM call.&lt;/strong&gt; The agent — which already has vision — applies the rubric to the image and fills the scores itself. The split is deliberate: consumer agents already see the pixels; an extra VLM round-trip would be redundant cost. Vulca supplies the rubric; the agent self-grades.&lt;/p&gt;

&lt;p&gt;Our agent self-grade for the iter 0 image, recorded verbatim in &lt;code&gt;plan.md&lt;/code&gt; &lt;code&gt;## Results&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 Visual&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;Gongbi additions read as deliberate; line discipline visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 Technical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.30&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.65 ✗&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;三矾九染 depth shallow; 石青/石绿 under-represented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 Cultural&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;千里江山图 palette intent honored; 朱砂/泥金 read true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4 Critical&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;Additive treatment honored; photo anchors preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L5 Philosophical&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;Cross-cultural intent legible; literati-naming convention borrowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.702&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;L2 hard-fails the 0.70 threshold because &lt;em&gt;triple-alum-nine-washes&lt;/em&gt; (三矾九染) is a &lt;strong&gt;multi-pass physical technique&lt;/strong&gt; — alum fixative applied between successive translucent washes to build depth and luminosity. A single forward pass through any diffusion model — &lt;code&gt;gpt-image-2&lt;/code&gt;, &lt;code&gt;stable-diffusion-xl&lt;/code&gt;, anything — cannot simulate alum-wash layering. &lt;strong&gt;This is a category-level ceiling, not a Vulca regression.&lt;/strong&gt; The model approximates the &lt;em&gt;visual register&lt;/em&gt; of depth; a trained gongbi reviewer will catch the absence of true alum-wash physics instantly.&lt;/p&gt;
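&lt;p&gt;The scorecard arithmetic is replayable from the table alone:&lt;/p&gt;

```python
# Replay the weighted total and the L2 hard-fail gate (0.70 threshold)
# from the scorecard above.
weights = {"L1": 0.15, "L2": 0.30, "L3": 0.25, "L4": 0.15, "L5": 0.15}
scores  = {"L1": 0.78, "L2": 0.65, "L3": 0.72, "L4": 0.75, "L5": 0.65}

weighted = round(sum(weights[d] * scores[d] for d in weights), 3)
assert weighted == 0.702        # matches the "Weighted" row
assert scores["L2"] < 0.70      # hard-fail → strict verdict: reject
```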

&lt;p&gt;The strict rubric verdict was &lt;code&gt;reject&lt;/code&gt;. The maintainer (the human in the loop) decided to &lt;strong&gt;accept for showcase use anyway&lt;/strong&gt; — and &lt;code&gt;plan.md&lt;/code&gt; records BOTH judgments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;verdict: reject ✗ → user-override-accept&lt;/code&gt; (the table cell)&lt;/li&gt;
&lt;li&gt;the override reason in the Notes block: &lt;em&gt;"L2 hard-fail: 三矾九染 depth shallow; 石青/石绿 under-represented. User accepted for showcase use."&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the &lt;strong&gt;dual-judgment provenance&lt;/strong&gt; pattern. The strict rubric retains technical honesty; the human retains veto. Both are archived. A skeptic running the same pipeline gets the same rubric verdict; a different maintainer might decide to &lt;em&gt;not&lt;/em&gt; override. The artifact captures both.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;design.md&lt;/code&gt;'s &lt;code&gt;rollback_trigger&lt;/code&gt; is a separate concern: it fires only when &lt;em&gt;all 3 main seeds score L1&amp;lt;0.6 OR L3&amp;lt;0.6&lt;/em&gt;. Neither condition was met (L1=0.78, L3=0.72), so the L2 hard-fail surfaces as &lt;strong&gt;honest disclosure&lt;/strong&gt;, not as a rollback signal. Different gates for different purposes.&lt;/p&gt;
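&lt;p&gt;As a predicate (illustrative; the actual trigger is declared in &lt;code&gt;design.md&lt;/code&gt;), that gate looks like this:&lt;/p&gt;

```python
# Illustrative form of the rollback gate described above: fire only when
# ALL main seeds fall below threshold on L1, or ALL fall below on L3.
def rollback_triggered(seed_scores, threshold=0.6):
    return (all(s["L1"] < threshold for s in seed_scores)
            or all(s["L3"] < threshold for s in seed_scores))

# iter 0 scored L1=0.78, L3=0.72: neither condition fires, so the L2
# hard-fail surfaces as disclosure without rolling anything back.
assert not rollback_triggered([{"L1": 0.78, "L3": 0.72}])
```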

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzt6384mymk93424clvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzt6384mymk93424clvq.png" alt="plan.md verdict trail — L1 0.78, L2 0.65 (hard-fail), L3 0.72, L4 0.75, L5 0.65, weighted 0.702 → strict reject → user-override-accept; both judgments archived" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The triad — three markdown files
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/visual-specs/2026-04-23-scottish-chinese-fusion/
├── proposal.md              ← /visual-brainstorm output (8K)
├── design.md                ← /visual-spec output (10K)
├── plan.md                  ← /visual-plan output (11K)
├── source.png               ← user-supplied Glasgow photo
├── iters/
│   ├── _baseline_bare/
│   │   └── bare_gpt2_edit.png       ← naive API control
│   └── 7/
│       └── gen_bfbbacd2.png         ← Vulca-mediated
├── decompose/
│   ├── lanterns_before.png          ← alpha-iso (slide 4 left)
│   ├── lanterns_after.png           ← gongbi reinterp (slide 4 right)
│   └── iter1/                       ← 9 entities + residual
└── carousel/                        ← this 6-slide deck
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftta5rh6o2dvrm5d8qemd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftta5rh6o2dvrm5d8qemd.png" alt="The whole project, in 3 markdown files: proposal.md / design.md / plan.md + a directory of artifacts. Pixels reproducible from markdown. The markdown is the product." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three markdown files lock the entire decision trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;proposal.md&lt;/code&gt; — the user's intent in human terms. Style treatment, anchor list, budget, deadline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;design.md&lt;/code&gt; — the technical translation. Provider, model, input_fidelity, prompt composition, L1-L5 weights &amp;amp; thresholds, spike plan, cost budget.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;plan.md&lt;/code&gt; — the execution flow. Phase order, batch size, evaluation gates, fail-fast rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each file is produced by a &lt;code&gt;/visual-*&lt;/code&gt; skill (brainstorm / spec / plan), each gated by a finalize handshake (&lt;code&gt;finalize&lt;/code&gt; / &lt;code&gt;done&lt;/code&gt; / &lt;code&gt;ready&lt;/code&gt; / &lt;code&gt;lock it&lt;/code&gt; / &lt;code&gt;approve&lt;/code&gt;). The agent doesn't free-form image-generate; it walks the triad.&lt;/p&gt;

&lt;p&gt;This is what "agent-mediated prompting" actually means in code. Not magic. A markdown contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"vulca[mcp]==0.17.14"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to your Claude Code MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vulca"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vulca-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in your Claude Code session, type &lt;code&gt;/visual-brainstorm&lt;/code&gt; to start a fresh visual project, or &lt;code&gt;/decompose &amp;lt;image&amp;gt;&lt;/code&gt; to break an existing image into editable layers. The full 22-tool MCP surface is documented in &lt;code&gt;docs/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing note
&lt;/h2&gt;

&lt;p&gt;The two images at the top of this post differ by three markdown files. They're not magic; they're version-controllable, reviewable, replayable contracts. If your AI-art workflow today is "type a prompt, hope, retry" — try the markdown-trio approach once. The first time you get the same image back from a fresh agent two weeks later because the markdown is still there, you'll see why.&lt;/p&gt;

&lt;p&gt;— shipped today as &lt;code&gt;vulca==0.17.14&lt;/code&gt;. The v0.17.14 patches make the &lt;code&gt;layers_redraw&lt;/code&gt; + &lt;code&gt;layers_paste_back&lt;/code&gt; &lt;em&gt;mechanism&lt;/em&gt; fully native: the &lt;code&gt;background_strategy="cream"&lt;/code&gt; flag stops the alpha-sparse hallucination that tainted redraws of sparse layers before v0.17.14, &lt;code&gt;preserve_alpha=True&lt;/code&gt; re-applies the source layer's alpha, and &lt;code&gt;layers_paste_back&lt;/code&gt; is a new glue verb for compositing an edited layer back into a foreign source image. &lt;strong&gt;Visual parity&lt;/strong&gt; with the slide-4-right artifact is a separate goal: that artifact was authored via &lt;code&gt;generate_image&lt;/code&gt; with a gongbi text prompt, not via &lt;code&gt;layers_redraw&lt;/code&gt; on the lanterns layer alone. The v0.17.14 patch closes the &lt;em&gt;out-of-band Python&lt;/em&gt; gap for the canonical edit-and-paste-back flow; per-instance multi-lantern redraw with full visual parity remains a v0.18 backlog item. Reproducible MCP-only validation of the mechanism is archived in &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/visual-specs/2026-04-23-scottish-chinese-fusion/decompose/v0_17_14_native/NOTES.md" rel="noopener noreferrer"&gt;&lt;code&gt;decompose/v0_17_14_native/NOTES.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Built a Free Local AI Art Pipeline on My Mac — Here's What Broke</title>
      <dc:creator>yha9806</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:09:33 +0000</pubDate>
      <link>https://dev.to/yha9806/i-built-a-free-local-ai-art-pipeline-on-my-mac-heres-what-broke-3cip</link>
      <guid>https://dev.to/yha9806/i-built-a-free-local-ai-art-pipeline-on-my-mac-heres-what-broke-3cip</guid>
      <description>&lt;p&gt;What if you could run a complete AI art creation pipeline — 13 cultural traditions, 5-dimension scoring, structured layer generation — entirely on your MacBook, for free?&lt;/p&gt;

&lt;p&gt;No cloud API key. No GPU server. Just &lt;code&gt;pip install vulca&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkniad86le2jqhoz3z9as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkniad86le2jqhoz3z9as.png" alt="Chinese Xieyi ink wash landscape" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9vcc3i7wi2ff9x8chph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9vcc3i7wi2ff9x8chph.png" alt="Japanese traditional snow temple" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ugmb70r1fiqy1buwx91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ugmb70r1fiqy1buwx91.png" alt="Brand design tea packaging" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three traditions, one SDK — generated locally via ComfyUI/SDXL on Apple Silicon, zero cloud API cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These images were generated on an Apple Silicon Mac running ComfyUI locally. No Midjourney subscription. No Replicate credits. No DALL-E API calls. The evaluation scores below come from a VLM (Gemma 4 via Ollama) running on the same machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;span class="go"&gt;
  Score:     90%    Tradition: chinese_xieyi    Risk: low

    L1 Visual Perception         ██████████████████░░ 90%  ✓
    L2 Technical Execution       █████████████████░░░ 85%  ✓
    L3 Cultural Context          ██████████████████░░ 90%  ✓
    L4 Critical Interpretation   ████████████████████ 100%  ✓
    L5 Philosophical Aesthetics  ██████████████████░░ 90%  ✓
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This post is not a product announcement. It is a technical deep dive into what it took to build &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;VULCA&lt;/a&gt; — the bugs we hit, the architectural decisions we made, and the code that holds it together.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What is VULCA + The Local Stack
&lt;/h2&gt;

&lt;p&gt;VULCA is an AI-native cultural art creation SDK. It generates, evaluates, decomposes, and evolves visual art across 13 cultural traditions. It runs locally (ComfyUI + Ollama) or in the cloud (Gemini).&lt;/p&gt;

&lt;p&gt;Not another Midjourney wrapper or ComfyUI plugin — a standalone SDK for cultural art intelligence.&lt;/p&gt;

&lt;p&gt;The project started as academic research. The &lt;a href="https://aclanthology.org/2025.findings-emnlp/" rel="noopener noreferrer"&gt;VULCA Framework&lt;/a&gt; was published at EMNLP 2025 Findings, and &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench&lt;/a&gt; provides 7,410 annotated samples with L1-L5 cultural scoring definitions. The SDK implements this research as a production tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│                  vulca CLI                   │
├─────────────┬──────────┬────────────────────┤
│   create    │ evaluate │  layers / studio   │
├─────────────┴──────────┴────────────────────┤
│              Cultural Engine                 │
│   13 traditions × L1-L5 scoring rubrics     │
├──────────────────┬──────────────────────────┤
│  Image Providers │      VLM Providers       │
│  ┌────────────┐  │  ┌────────────────────┐  │
│  │  ComfyUI   │  │  │  Ollama (Gemma 4)  │  │
│  │  (local)   │  │  │  (local)           │  │
│  ├────────────┤  │  ├────────────────────┤  │
│  │  Gemini    │  │  │  Gemini            │  │
│  │  (cloud)   │  │  │  (cloud)           │  │
│  └────────────┘  │  └────────────────────┘  │
└──────────────────┴──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quickstart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vulca

&lt;span class="c"&gt;# Point at your local ComfyUI + Ollama&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_IMAGE_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_VLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama_chat/gemma4

&lt;span class="c"&gt;# Generate&lt;/span&gt;
vulca create &lt;span class="s2"&gt;"Misty mountains after spring rain"&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--provider&lt;/span&gt; comfyui &lt;span class="nt"&gt;-o&lt;/span&gt; art.png

&lt;span class="c"&gt;# Evaluate&lt;/span&gt;
vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Provider Architecture: Pluggable, Not Locked In
&lt;/h3&gt;

&lt;p&gt;VULCA does not depend on any single backend. Image providers are pluggable classes. ComfyUI is one provider. Gemini is another. You can add your own.&lt;/p&gt;

&lt;p&gt;The key design insight: providers declare their capabilities as a frozen set. VULCA uses these capabilities to decide how to format prompts, whether to pass CJK text directly, and whether RGBA output is available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ComfyUI: CLIP-based encoder, English-only, returns raw RGBA
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ComfyUIImageProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_rgba&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Gemini: LLM-based encoder, understands CJK natively, returns raw RGBA
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GeminiImageProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_rgba&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multilingual_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;multilingual_prompt&lt;/code&gt; capability is the difference between a 120-token structured prompt (Gemini can handle it) and a compressed 60-token flat prompt (CLIP will truncate anything beyond 77 tokens). More on this in section 5.&lt;/p&gt;
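&lt;p&gt;That dispatch can be sketched in a few lines. This is a hypothetical illustration of the pattern, not vulca's actual prompt builder; only the capability names come from the source above.&lt;/p&gt;

```python
from dataclasses import dataclass

CLIP_TOKEN_BUDGET = 77  # CLIP text encoders hard-truncate past 77 tokens

@dataclass(frozen=True)
class Provider:
    name: str
    capabilities: frozenset

def choose_prompt(provider: Provider, structured: str, flat: str) -> str:
    """Return the prompt variant the provider's encoder can actually handle."""
    if "multilingual_prompt" in provider.capabilities:
        return structured  # LLM-based encoder: long structured prompt, CJK ok
    return flat            # CLIP encoder: compressed English, within budget

comfy = Provider("comfyui", frozenset({"raw_rgba"}))
gemini = Provider("gemini", frozenset({"raw_rgba", "multilingual_prompt"}))
print(choose_prompt(comfy, "structured prompt", "flat prompt"))   # flat prompt
print(choose_prompt(gemini, "structured prompt", "flat prompt"))  # structured prompt
```

&lt;p&gt;Because capabilities are a frozen set, adding a new provider never touches the dispatch logic; the provider just declares what it can do.&lt;/p&gt;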

&lt;p&gt;When you ask ComfyUI to generate an image, VULCA constructs a complete ComfyUI workflow as a JSON dict and submits it via the REST API. No ComfyUI nodes to install. No custom workflows to import. The entire workflow is built programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KSampler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randbelow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cfg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sampler_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;euler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denoise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latent_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CheckpointLoaderSimple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ckpt_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sd_xl_base_1.0.safetensors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EmptyLatentImage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIPTextEncode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIPTextEncode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VAEDecode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vae&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SaveImage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename_prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vulca&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is from &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/providers/comfyui.py#L42" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/providers/comfyui.py&lt;/code&gt; lines 42-62&lt;/a&gt;. It constructs a standard SDXL pipeline: checkpoint loader, empty latent, two CLIP text encoders (positive + negative), KSampler, VAE decode, save. The workflow is submitted as a single POST to &lt;code&gt;/prompt&lt;/code&gt;, and VULCA polls &lt;code&gt;/history/{prompt_id}&lt;/code&gt; until the image is ready.&lt;/p&gt;
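&lt;p&gt;The submit-and-poll loop itself is small. A stdlib-only sketch of what it could look like, assuming ComfyUI's standard &lt;code&gt;POST /prompt&lt;/code&gt; and &lt;code&gt;GET /history/{prompt_id}&lt;/code&gt; endpoints (this is not vulca's actual client code):&lt;/p&gt;

```python
import json
import time
import urllib.request

BASE = "http://localhost:8188"  # address of a local ComfyUI instance

def submit_and_wait(workflow: dict, timeout_s: int = 300) -> dict:
    """POST a workflow dict to ComfyUI's /prompt endpoint, then poll
    /history/{prompt_id} roughly once a second until outputs appear."""
    req = urllib.request.Request(
        BASE + "/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        prompt_id = json.load(resp)["prompt_id"]

    for _ in range(timeout_s):
        with urllib.request.urlopen(BASE + "/history/" + prompt_id) as resp:
            history = json.load(resp)
        if prompt_id in history:   # key appears once execution has finished
            return history[prompt_id]["outputs"]
        time.sleep(1.0)
    raise TimeoutError("ComfyUI job " + prompt_id + " did not finish in time")
```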

&lt;p&gt;After the image comes back, VULCA verifies that the payload is actually a valid PNG before accepting it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\x89&lt;/span&gt;&lt;span class="s"&gt;PNG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ComfyUI returned invalid image &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes, header=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That validation was added in commit &lt;a href="https://github.com/vulca-org/vulca/commit/fdc0e45" rel="noopener noreferrer"&gt;&lt;code&gt;fdc0e45&lt;/code&gt;&lt;/a&gt; after we discovered that certain PyTorch MPS bugs cause ComfyUI to return 4KB files with valid PNG headers but all-zero pixel data.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. L1-L5 Cultural Evaluation
&lt;/h2&gt;

&lt;p&gt;Most AI art tools generate. VULCA evaluates.&lt;/p&gt;

&lt;p&gt;The evaluation framework scores artwork across five dimensions, each measuring a different aspect of cultural and artistic quality:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L1&lt;/strong&gt; Visual Perception&lt;/td&gt;
&lt;td&gt;Composition, color harmony, spatial arrangement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L2&lt;/strong&gt; Technical Execution&lt;/td&gt;
&lt;td&gt;Rendering quality, technique fidelity, craftsmanship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L3&lt;/strong&gt; Cultural Context&lt;/td&gt;
&lt;td&gt;Tradition-specific motifs, canonical conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L4&lt;/strong&gt; Critical Interpretation&lt;/td&gt;
&lt;td&gt;Cultural sensitivity, contextual framing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L5&lt;/strong&gt; Philosophical Aesthetics&lt;/td&gt;
&lt;td&gt;Artistic depth, emotional resonance, spiritual qualities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are not arbitrary categories. They come from the &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench paper&lt;/a&gt;, which defines L1-L5 across 7,410 annotated samples.&lt;/p&gt;
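&lt;p&gt;The overall score is then a weighted aggregate of the five dimensions. A hypothetical sketch of that arithmetic, using the &lt;code&gt;chinese_xieyi&lt;/code&gt; weights as illustration (the function name and signature are assumptions, not vulca's API):&lt;/p&gt;

```python
# Illustrative sketch of weighted L1-L5 aggregation; the weights match the
# chinese_xieyi example, but the function itself is hypothetical.
XIEYI_WEIGHTS = {"L1": 0.10, "L2": 0.15, "L3": 0.25, "L4": 0.20, "L5": 0.30}

def overall_score(scores: dict, weights: dict) -> float:
    """Weight-normalized sum of per-dimension scores (each in 0..1)."""
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total

# The per-dimension scores from the CLI output earlier in the post
demo = {"L1": 0.90, "L2": 0.85, "L3": 0.90, "L4": 1.00, "L5": 0.90}
print(round(100 * overall_score(demo, XIEYI_WEIGHTS)))  # aggregate percent
```

&lt;p&gt;Because the weights differ per tradition, the same five raw scores produce different overall ratings for, say, xieyi versus brand design.&lt;/p&gt;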

&lt;h3&gt;
  
  
  13 Traditions, Custom Weights
&lt;/h3&gt;

&lt;p&gt;Each tradition is defined as a YAML file with its own L1-L5 weight distribution. Chinese freehand ink painting (xieyi) weights philosophical aesthetics (L5) at 30% and cultural context (L3) at 25%, because the tradition values spiritual resonance and canonical motifs above raw technical rendering. A brand design tradition would weight L2 (technical execution) much higher.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/vulca/cultural/data/traditions/chinese_xieyi.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;
&lt;span class="na"&gt;display_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chinese&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Freehand&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ink&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(Xieyi)"&lt;/span&gt;
  &lt;span class="na"&gt;zh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;中国写意"&lt;/span&gt;

&lt;span class="na"&gt;weights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;L1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.10&lt;/span&gt;
  &lt;span class="na"&gt;L2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.15&lt;/span&gt;
  &lt;span class="na"&gt;L3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.25&lt;/span&gt;
  &lt;span class="na"&gt;L4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.20&lt;/span&gt;
  &lt;span class="na"&gt;L5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.30&lt;/span&gt;

&lt;span class="na"&gt;terminology&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;term&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spirit resonance and vitality&lt;/span&gt;
    &lt;span class="na"&gt;term_zh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;气韵生动"&lt;/span&gt;
    &lt;span class="na"&gt;definition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Xie&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;He's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Six&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Principles&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Chinese&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Painting..."&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aesthetics&lt;/span&gt;
    &lt;span class="na"&gt;l_levels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;L4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;L5&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
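&lt;p&gt;To make the weights concrete, here is a minimal sketch (not VULCA's actual implementation) of how five sub-scores would combine under the xieyi distribution above:&lt;/p&gt;

```python
# Hypothetical illustration: combine L1-L5 sub-scores (0-100 scale) with
# the chinese_xieyi weights from the YAML above. Not VULCA's actual code.
XIEYI_WEIGHTS = {"L1": 0.10, "L2": 0.15, "L3": 0.25, "L4": 0.20, "L5": 0.30}

def weighted_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * sub_scores[k] for k in weights)

# A piece strong on cultural context (L3) and aesthetics (L5) scores well
# under xieyi even with middling technical execution (L2).
print(weighted_score({"L1": 80, "L2": 70, "L3": 90, "L4": 85, "L5": 95},
                     XIEYI_WEIGHTS))  # 86.5
```

&lt;p&gt;Swapping in an L2-heavy weight set, like the brand-design example, would pull the same sub-scores toward the technical dimension instead.&lt;/p&gt;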



&lt;p&gt;The 13 supported traditions are: &lt;code&gt;chinese_xieyi&lt;/code&gt;, &lt;code&gt;chinese_gongbi&lt;/code&gt;, &lt;code&gt;japanese_traditional&lt;/code&gt;, &lt;code&gt;western_academic&lt;/code&gt;, &lt;code&gt;islamic_geometric&lt;/code&gt;, &lt;code&gt;watercolor&lt;/code&gt;, &lt;code&gt;african_traditional&lt;/code&gt;, &lt;code&gt;south_asian&lt;/code&gt;, &lt;code&gt;brand_design&lt;/code&gt;, &lt;code&gt;photography&lt;/code&gt;, &lt;code&gt;contemporary_art&lt;/code&gt;, &lt;code&gt;ui_ux_design&lt;/code&gt;, and &lt;code&gt;default&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Evaluation Modes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strict&lt;/strong&gt; (judge): Conformance scoring. How well does the art meet the tradition's standards?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference&lt;/strong&gt; (mentor): Cultural guidance with professional terminology. Not a judge, a mentor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fusion&lt;/strong&gt;: Multi-tradition comparison. Pass comma-separated traditions and get cross-cultural analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The API: Three Lines to Score Any Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artwork.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full &lt;code&gt;aevaluate()&lt;/code&gt; signature from &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/evaluate.py#L12" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/evaluate.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;sparse&lt;/code&gt; parameter is worth calling out. When &lt;code&gt;sparse=True&lt;/code&gt;, VULCA runs a &lt;code&gt;BriefIndexer&lt;/code&gt; that determines which L1-L5 dimensions are most relevant to the given intent. All five dimensions are still scored (consistency matters), but the &lt;code&gt;sparse_activation&lt;/code&gt; metadata tells callers which dimensions were most salient. This is useful in pipeline mode where you want to focus review on the dimensions that matter for a specific prompt.&lt;/p&gt;
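&lt;p&gt;In pipeline code, consuming that metadata might look like the following sketch. The &lt;code&gt;sparse_activation&lt;/code&gt; name comes from the post; the surrounding structure is assumed, not VULCA's documented schema:&lt;/p&gt;

```python
# Hypothetical shape of a sparse-mode result (field layout is an assumption).
result_meta = {
    "scores": {"L1": 72, "L2": 68, "L3": 91, "L4": 88, "L5": 93},
    "sparse_activation": ["L3", "L5"],  # dimensions flagged as most salient
}

# Focus human review on the salient dimensions only.
salient = {dim: result_meta["scores"][dim]
           for dim in result_meta["sparse_activation"]}
print(salient)  # {'L3': 91, 'L5': 93}
```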




&lt;h2&gt;
  
  
  3. Deep Dive: Structured Layer Generation
&lt;/h2&gt;

&lt;p&gt;VULCA does not generate images. It generates layers.&lt;/p&gt;

&lt;p&gt;The pipeline works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent parsing&lt;/strong&gt; — user prompt is analyzed for tradition, subject, and composition intent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLM planning&lt;/strong&gt; — Gemma 4 (via Ollama) decomposes the prompt into a layer plan: background, mid-ground elements, foreground, calligraphy/text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-layer generation&lt;/strong&gt; — each layer is generated as a separate full-frame image on the tradition's flat canvas color&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luminance keying&lt;/strong&gt; — non-background layers are keyed to remove canvas color, producing clean alpha&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alpha composite&lt;/strong&gt; — layers are composited in order to produce the final artwork&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qkfb9y14nkg7l36whh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qkfb9y14nkg7l36whh1.png" alt="Layered exploded view" width="800" height="164"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Layer decomposition: paper, distant mountains, forest, calligraphy, composite&lt;/em&gt;&lt;/p&gt;
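&lt;p&gt;Step 4 (luminance keying) can be pictured with a per-pixel sketch like this. It illustrates the idea only; it is not VULCA's implementation:&lt;/p&gt;

```python
# Illustration of luminance keying: alpha is derived from each pixel's
# distance to the flat canvas color. Not VULCA's actual code.
def key_alpha(pixel: tuple[int, int, int],
              canvas: tuple[int, int, int],
              threshold: int = 30) -> int:
    """Return 0 (transparent) near the canvas color, 255 (opaque) elsewhere."""
    dist = max(abs(p - c) for p, c in zip(pixel, canvas))
    return 0 if dist <= threshold else 255

canvas = (237, 228, 207)                # aged-paper tone
print(key_alpha(canvas, canvas))        # 0   -> canvas pixel keyed out
print(key_alpha((40, 38, 35), canvas))  # 255 -> ink stroke kept opaque
```

&lt;p&gt;A production keyer would soften the hard threshold into a ramp to avoid fringing at stroke edges.&lt;/p&gt;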
&lt;h3&gt;
  
  
  Serial-First Style Anchoring
&lt;/h3&gt;

&lt;p&gt;The first layer generates serially as a style anchor. Its raw RGB output becomes the visual reference (&lt;code&gt;style_ref&lt;/code&gt;) for all subsequent layers, which generate in parallel. This is Defense 3 from v0.14 — without it, each layer would independently interpret "Chinese xieyi style" and you would get five different visual interpretations in the same artwork.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Prompt Builder
&lt;/h3&gt;

&lt;p&gt;The core of layer generation is &lt;code&gt;build_anchored_layer_prompt()&lt;/code&gt; in &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/layers/layered_prompt.py#L47" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/layers/layered_prompt.py&lt;/code&gt;&lt;/a&gt;. This function wraps the plan's regeneration prompt in four mandatory anchor blocks: canvas, content (with negative list), spatial, style.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prompt + negative prompt pair for a layer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_anchored_layer_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LayerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TraditionAnchor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sibling_roles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;english_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function has two code paths, controlled by &lt;code&gt;english_only&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When &lt;code&gt;english_only=False&lt;/code&gt;&lt;/strong&gt; (Gemini path): Returns a structured multi-section string with &lt;code&gt;[CANVAS]&lt;/code&gt;, &lt;code&gt;[CONTENT]&lt;/code&gt;, &lt;code&gt;[SPATIAL]&lt;/code&gt;, &lt;code&gt;[STYLE]&lt;/code&gt;, and &lt;code&gt;[USER INTENT]&lt;/code&gt; blocks. Gemini's LLM-based encoder can parse these sections and follow the instructions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CANVAS]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The image MUST be drawn on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;canvas_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The background MUST be the pure canvas color &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canvas_color_hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with absolutely no other elements, textures, shading, or borders.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CONTENT — exclusivity]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This image ONLY contains the element specified in USER INTENT.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do NOT include any of: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;others_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SPATIAL]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MUST occupy &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, covering approximately &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of the canvas area.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STYLE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;style_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[USER INTENT]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When &lt;code&gt;english_only=True&lt;/code&gt;&lt;/strong&gt; (ComfyUI/SDXL path): Returns a &lt;code&gt;LayerPromptResult&lt;/code&gt; with a flat, CLIP-friendly prompt under 70 tokens and a separate &lt;code&gt;negative_prompt&lt;/code&gt;. This is the path that took the most engineering to get right. More on why in section 5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;english_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;canvas_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;others&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CJK-Aware Prompt Handling
&lt;/h3&gt;

&lt;p&gt;VULCA accepts prompts in Chinese, Japanese, and Korean. When the target provider has the &lt;code&gt;multilingual_prompt&lt;/code&gt; capability (Gemini), CJK text passes through natively. When the provider does not have that capability (ComfyUI/SDXL with CLIP), VULCA strips CJK characters and falls back to English equivalents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CJK_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_cjk_parenthetical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Strip CJK parenthetical annotations, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cooked silk (熟绢)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cooked silk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_CJK_PAREN_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So &lt;code&gt;vulca create "水墨山水" -t chinese_xieyi --provider comfyui&lt;/code&gt; works — VULCA translates the prompt for CLIP behind the scenes.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Deep Dive: Making SDXL Work Locally
&lt;/h2&gt;

&lt;p&gt;This is where things got interesting. Two traps nearly derailed the local ComfyUI path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 1: The ANCHOR Hallucination
&lt;/h3&gt;

&lt;p&gt;Our structured layer prompts originally used section headers like &lt;code&gt;[CANVAS ANCHOR]&lt;/code&gt;, &lt;code&gt;[STYLE ANCHOR]&lt;/code&gt;, and &lt;code&gt;[CONTENT ANCHOR]&lt;/code&gt;. The word "ANCHOR" was there to signal to the LLM that these were fixed constraints, not suggestions.&lt;/p&gt;

&lt;p&gt;SDXL's CLIP encoder is not an LLM. It is a text encoder that treats every token as content. When it saw "ANCHOR", it interpreted it as a request to paint an anchor — the nautical kind.&lt;/p&gt;

&lt;p&gt;The result: literal ship anchors appearing on rice paper backgrounds in Chinese ink wash paintings. Misty mountains with a ship anchor in the corner. Bamboo forests with an anchor hovering over them.&lt;/p&gt;

&lt;p&gt;The fix was trivial once diagnosed. Rename the headers to &lt;code&gt;[CANVAS]&lt;/code&gt;, &lt;code&gt;[STYLE]&lt;/code&gt;, &lt;code&gt;[CONTENT]&lt;/code&gt;, &lt;code&gt;[SPATIAL]&lt;/code&gt;. No word that could be interpreted as visual content.&lt;/p&gt;

&lt;p&gt;Commit: &lt;a href="https://github.com/vulca-org/vulca/commit/b168178" rel="noopener noreferrer"&gt;&lt;code&gt;b168178&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;fix(layers): remove ANCHOR from prompt headers — SDXL paints literal anchors&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The lesson: CLIP-based models do not have a concept of "metadata" or "instructions" in a prompt. Every token is content. If your prompt engineering uses structured headers, every header word will influence the generated image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 1b: The 77-Token CLIP Ceiling
&lt;/h3&gt;

&lt;p&gt;Fixing the anchor hallucination revealed a second, subtler problem. Our structured prompt — even without "ANCHOR" — was 120+ tokens. CLIP truncates at 77 tokens. The actual subject description ("misty mountains after spring rain") was buried past the 77-token boundary and never reached the encoder.&lt;/p&gt;

&lt;p&gt;Gallery images (simple prompts, ~30 tokens) worked perfectly, while layered generation (structured prompts, 120+ tokens) produced generic, unfocused results. Debugging was confusing because the same code path succeeded for simple creates and failed only for layered ones.&lt;/p&gt;

&lt;p&gt;The fix: the &lt;code&gt;english_only&lt;/code&gt; branch in &lt;code&gt;build_anchored_layer_prompt()&lt;/code&gt;. Instead of a structured multi-section prompt, VULCA builds a flat, subject-first prompt under 70 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;misty mountains after spring rain, traditional brushwork, ink wash, on aged xuan paper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a separate &lt;code&gt;negative_prompt&lt;/code&gt; field (other layer roles to avoid). The subject comes first so it is guaranteed to be within CLIP's 77-token window.&lt;/p&gt;
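&lt;p&gt;The effect of ordering under a hard token budget is easy to demonstrate. Real CLIP uses a BPE tokenizer, so the whitespace splitting below is only a rough proxy:&lt;/p&gt;

```python
BUDGET = 77  # CLIP's context length

def survives_truncation(prompt: str, phrase: str, budget: int = BUDGET) -> bool:
    """Rough check: does `phrase` survive truncation to `budget` tokens?"""
    kept = " ".join(prompt.split()[:budget])
    return phrase in kept

subject = "misty mountains after spring rain"
style_tail = "traditional brushwork, ink wash, " * 20  # bloated style section

print(survives_truncation(subject + ", " + style_tail, subject))  # True
print(survives_truncation(style_tail + subject, subject))         # False
```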

&lt;p&gt;Commit: &lt;a href="https://github.com/vulca-org/vulca/commit/74f9952" rel="noopener noreferrer"&gt;&lt;code&gt;74f9952&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;fix(layers): CLIP-aware prompt compression for SDXL — flat &amp;lt;70 token prompt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;LayerPromptResult&lt;/code&gt; dataclass was added specifically for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prompt + negative prompt pair for a layer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structured string (Gemini path) returns a single &lt;code&gt;str&lt;/code&gt;. The CLIP path returns a &lt;code&gt;LayerPromptResult&lt;/code&gt; with both positive and negative prompts separated. The caller checks &lt;code&gt;isinstance(result, LayerPromptResult)&lt;/code&gt; to decide which ComfyUI workflow nodes to populate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 2: PyTorch MPS — A Version Minefield
&lt;/h3&gt;

&lt;p&gt;With prompt engineering fixed, we hit the hardware layer. SDXL generation via ComfyUI on Apple Silicon (MPS backend) with PyTorch 2.11.0 produces either all-black images (all-zero pixels, ~4KB files) or pure noise (~2MB of random pixels).&lt;/p&gt;

&lt;p&gt;Key observations that made this hard to diagnose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KSampler diffusion runs to completion — 20 steps, progress bars, no errors&lt;/li&gt;
&lt;li&gt;VAEDecode output is corrupt despite successful sampling&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--force-fp32&lt;/code&gt; does NOT fix it — this is a correctness bug, not a precision issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three compounding PyTorch MPS bugs cause the failure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: SDPA Non-Contiguous Tensor Regression&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/163597" rel="noopener noreferrer"&gt;pytorch/pytorch#163597&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Introduced in PyTorch 2.8.0. MPS SDPA kernels produce wildly incorrect results when given non-contiguous tensors. SDXL's cross-attention performs transpose operations that create non-contiguous views, feeding garbage embeddings into the U-Net. Error magnitude: ~34.0 vs normal ~0.000006.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2: Conv2d Chunk Correctness Bug&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/169342" rel="noopener noreferrer"&gt;pytorch/pytorch#169342&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Affects PyTorch 2.9.0+. The &lt;code&gt;chunk() -&amp;gt; conv()&lt;/code&gt; pattern produces correct results only for the first batch element. Single-image generation (batch=1) is unaffected. Multi-image batch workflows will hit it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3: Metal Kernel Migration Regressions&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/155797" rel="noopener noreferrer"&gt;pytorch/pytorch#155797&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;PyTorch 2.10-2.11 introduced additional MPS regressions during internal operator migrations. Identical symptoms reported on M3 Ultra via &lt;a href="https://github.com/Comfy-Org/ComfyUI/issues/10681" rel="noopener noreferrer"&gt;ComfyUI#10681&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why VAEDecode Is the Failure Point
&lt;/h3&gt;

&lt;p&gt;The VAE decoder is uniquely vulnerable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses Conv2d with large channel counts (hit by Bug 2)&lt;/li&gt;
&lt;li&gt;Uses GroupNorm with float16 inputs (NaN propagation)&lt;/li&gt;
&lt;li&gt;Runs in a single pass, so unlike the iterative KSampler it has no chance to self-correct&lt;/li&gt;
&lt;li&gt;Intermediate values explode to ~9.5e+25; GroupNorm cannot recover, and the output comes back all-zero or random&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Version Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PyTorch Version&lt;/th&gt;
&lt;th&gt;SDXL on MPS&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.4.1&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;td&gt;Last fully validated version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.5.x&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;td&gt;Memory +50%, speed -60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.6.x&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Some SDPA issues, &lt;code&gt;--force-fp32&lt;/code&gt; can help&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.7.x&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Similar to 2.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.8.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;SDPA non-contiguous bug introduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2.9.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Working&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Sweet spot&lt;/strong&gt;: pre-Metal migration, SDPA bug masked by ComfyUI's attention slicing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.10.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;Black images on M3 Ultra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.11.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;Black/noise on Apple Silicon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In ComfyUI venv&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/dev/ComfyUI
./venv/bin/pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.24.0 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0
./venv/bin/python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pin &lt;code&gt;torch==2.9.0&lt;/code&gt;. That is the entire fix. We wrote a &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/apple-silicon-mps-comfyui-guide.md" rel="noopener noreferrer"&gt;complete Apple Silicon MPS + ComfyUI/SDXL Compatibility Guide&lt;/a&gt; that covers diagnosis, workarounds (CPU VAE, force-fp32), environment variables, and verification steps.&lt;/p&gt;





&lt;h2&gt;
  
  
  5. Inpainting and Layer Editing
&lt;/h2&gt;

&lt;p&gt;Once you have layers, you can edit them individually without regenerating the entire artwork.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnuajrw0821aqdmdc6bf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnuajrw0821aqdmdc6bf.png" alt="Inpaint comparison" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Redraw a specific layer with a new instruction&lt;/span&gt;
vulca layers redraw ./layers/ &lt;span class="nt"&gt;--layer&lt;/span&gt; sky &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warm golden sunset"&lt;/span&gt;

&lt;span class="c"&gt;# Region-based inpaint on the composite&lt;/span&gt;
vulca inpaint art.png &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"the sky"&lt;/span&gt; &lt;span class="nt"&gt;--instruction&lt;/span&gt; &lt;span class="s2"&gt;"stormy clouds"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inpaint path uses the same provider architecture. ComfyUI receives an inpaint workflow with a mask, Gemini receives the image + mask + instruction as a multipart prompt. The same &lt;code&gt;capabilities&lt;/code&gt; system determines prompt formatting.&lt;/p&gt;
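&lt;p&gt;A sketch of that capability check (the flag names and payload shapes here are hypothetical; the post only says the &lt;code&gt;capabilities&lt;/code&gt; system decides the formatting):&lt;/p&gt;

```python
def build_inpaint_request(provider: dict, image: bytes, mask: bytes, instruction: str) -> dict:
    """Shape the inpaint payload to whatever the provider can accept."""
    caps = provider["capabilities"]
    if "workflow_json" in caps:
        # ComfyUI-style: an inpaint workflow graph with a mask input
        return {"workflow": "inpaint", "mask": mask, "prompt": instruction}
    if "multipart_prompt" in caps:
        # Gemini-style: image + mask + instruction in one multipart prompt
        return {"parts": [image, mask, instruction]}
    raise ValueError("provider supports neither inpaint path")
```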




&lt;h2&gt;
  
  
  6. What's Working, What's Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current State (v0.15.1)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All 13 traditions generating locally on Apple Silicon via ComfyUI + SDXL&lt;/li&gt;
&lt;li&gt;Full E2E pipeline validated: intent parsing, VLM planning, per-layer generation, keying, composite&lt;/li&gt;
&lt;li&gt;8 E2E phase tests passing in 2.4 seconds (mock mode)&lt;/li&gt;
&lt;li&gt;CJK prompts working end-to-end with automatic CLIP compression&lt;/li&gt;
&lt;li&gt;PNG response validation catches corrupt MPS output&lt;/li&gt;
&lt;li&gt;Structured layer generation with serial-first style anchoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Commit Trail
&lt;/h3&gt;

&lt;p&gt;The local provider path was stabilized across these commits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/b168178" rel="noopener noreferrer"&gt;&lt;code&gt;b168178&lt;/code&gt;&lt;/a&gt; — remove ANCHOR from prompt headers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/42e0e3d" rel="noopener noreferrer"&gt;&lt;code&gt;42e0e3d&lt;/code&gt;&lt;/a&gt; — skip keying for background layers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/fdc0e45" rel="noopener noreferrer"&gt;&lt;code&gt;fdc0e45&lt;/code&gt;&lt;/a&gt; — validate ComfyUI PNG response&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/74f9952" rel="noopener noreferrer"&gt;&lt;code&gt;74f9952&lt;/code&gt;&lt;/a&gt; — CLIP-aware prompt compression&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/e840496" rel="noopener noreferrer"&gt;&lt;code&gt;e840496&lt;/code&gt;&lt;/a&gt; — MPS compatibility guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/485067e" rel="noopener noreferrer"&gt;&lt;code&gt;485067e&lt;/code&gt;&lt;/a&gt; — v0.15.1 release&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Roadmap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini cloud path&lt;/strong&gt;: Currently blocked on free-tier billing limits (image generation returns &lt;code&gt;limit: 0&lt;/code&gt;). Text + VLM vision work. Once billing is enabled, Gemini becomes the zero-setup cloud alternative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM3 text-prompted segmentation&lt;/strong&gt;: Replace luminance keying with SAM3 for cleaner layer extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI / Gradio demo&lt;/strong&gt;: A browser-based interface for non-CLI users.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Get Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5-Minute Local Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install VULCA&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;vulca

&lt;span class="c"&gt;# 2. Install ComfyUI (if you don't have it)&lt;/span&gt;
git clone https://github.com/comfyanonymous/ComfyUI
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
./venv/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="c"&gt;# CRITICAL: pin PyTorch for Apple Silicon&lt;/span&gt;
./venv/bin/pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.24.0 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0

&lt;span class="c"&gt;# 3. Download SDXL checkpoint&lt;/span&gt;
&lt;span class="c"&gt;# Place sd_xl_base_1.0.safetensors in ComfyUI/models/checkpoints/&lt;/span&gt;

&lt;span class="c"&gt;# 4. Start ComfyUI&lt;/span&gt;
./venv/bin/python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188

&lt;span class="c"&gt;# 5. Install Ollama + Gemma 4 (for VLM evaluation)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 6. Generate and evaluate&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_IMAGE_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_VLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama_chat/gemma4

vulca create &lt;span class="s2"&gt;"Misty mountains after spring rain"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--provider&lt;/span&gt; comfyui &lt;span class="nt"&gt;-o&lt;/span&gt; art.png

vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate any image
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artwork.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access individual dimension scores
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L1 Visual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L5 Philosophy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l5&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overall: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What VULCA Is
&lt;/h3&gt;

&lt;p&gt;VULCA is an open-source SDK for AI-native cultural art creation. It brings cultural intelligence to AI art generation: 13 traditions, each with its own L1-L5 scoring rubric, terminology, and taboos.&lt;/p&gt;

&lt;p&gt;It is built on peer-reviewed research (EMNLP 2025 Findings), tested against 7,410 annotated samples (VULCA-Bench), and runs entirely on your local machine if you want it to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What VULCA Is Not
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not a ComfyUI plugin. ComfyUI is one of several image providers.&lt;/li&gt;
&lt;li&gt;Not a Midjourney alternative. VULCA does not host image generation — it orchestrates it.&lt;/li&gt;
&lt;li&gt;Not a wrapper around any single model. Swap ComfyUI for Gemini (or your own provider) with one config change.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;https://pypi.org/project/vulca/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Guide&lt;/strong&gt;: &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/apple-silicon-mps-comfyui-guide.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/apple-silicon-mps-comfyui-guide.md&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt;: &lt;a href="https://aclanthology.org/2025.findings-emnlp/" rel="noopener noreferrer"&gt;VULCA Framework (EMNLP 2025 Findings)&lt;/a&gt; | &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench (arXiv)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fvulca.svg" alt="PyPI" width="86" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython-3.10%2B-blue.svg" alt="Python 3.10+" width="92" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/vulca-org/vulca/blob/master/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Flicense-Apache%25202.0-green.svg" alt="License: Apache 2.0" width="120" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this resonates, &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;star us on GitHub&lt;/a&gt;. Try it, break it, tell us what failed — &lt;a href="https://github.com/vulca-org/vulca/issues" rel="noopener noreferrer"&gt;issues welcome&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you use VULCA in research, please cite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight bibtex"&gt;&lt;code&gt;&lt;span class="nc"&gt;@inproceedings&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;vulca2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{VULCA: A Framework for Cultural Art Evaluation}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;booktitle&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{Findings of the Association for Computational Linguistics: EMNLP 2025}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{2025}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj032vjdb2m6n19gmevmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj032vjdb2m6n19gmevmc.png" alt="Tradition grid" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;13 traditions. One SDK. Your machine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
