<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yha9806</title>
    <description>The latest articles on DEV Community by yha9806 (@yha9806).</description>
    <link>https://dev.to/yha9806</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876647%2F912bd6a6-dd6f-421c-a7f5-23f7c93f90d7.jpeg</url>
      <title>DEV Community: yha9806</title>
      <link>https://dev.to/yha9806</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yha9806"/>
    <language>en</language>
    <item>
      <title>Same `gpt-image-2` API. Two totally different results. The difference is 3 markdown files.</title>
      <dc:creator>yha9806</dc:creator>
      <pubDate>Sun, 26 Apr 2026 10:00:04 +0000</pubDate>
      <link>https://dev.to/yha9806/same-gpt-image-2-api-two-totally-different-results-the-difference-is-3-markdown-files-3ipj</link>
      <guid>https://dev.to/yha9806/same-gpt-image-2-api-two-totally-different-results-the-difference-is-3-markdown-files-3ipj</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We took a Glasgow street photo and tried to add Northern Song gongbi (工笔重彩) painterly elements — red lanterns, Chinese signage, muted-gold trim — without dissolving the photo's existing pixels.&lt;/p&gt;

&lt;p&gt;Two paths, same &lt;code&gt;gpt-image-2&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bare API + naive prompt&lt;/strong&gt; ("repaint in gongbi style"): the entire image gets washed into a unified painterly filter. Photo gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulca-mediated structured prompt&lt;/strong&gt;: photo anchors preserved, gongbi elements painted INTO the scene as discrete additions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference between path 1 and path 2 is &lt;strong&gt;three markdown files&lt;/strong&gt; that an agent produces by walking the brainstorm → spec → plan triad.&lt;/p&gt;

&lt;p&gt;This post walks through what that "structured prompt composition" actually is, the silent &lt;code&gt;input_fidelity&lt;/code&gt; parameter drift we caught between &lt;code&gt;design.md&lt;/code&gt; and the live &lt;code&gt;gpt-image-2&lt;/code&gt; GA endpoint (and the &lt;code&gt;v0.17.12&lt;/code&gt; fix shipped today), and how the same triad evaluates cross-cultural AI generation honestly (including an L2 hard-fail at &lt;code&gt;0.65&lt;/code&gt; that we surface plus a &lt;code&gt;user-override-accept&lt;/code&gt; we ALSO record).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;&lt;br&gt;
Install: &lt;code&gt;pip install "vulca[mcp]==0.17.14"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmukvo05zwae14e1uw0wh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmukvo05zwae14e1uw0wh.png" alt="Same gpt-image-2 API. Two totally different results. The difference is 3 markdown files." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Vulca is an MCP-native toolkit for cultural-art generation. The agent owns the brain (proposal → design → plan); the SDK owns the hands and eyes (prompt composition, layer decomposition, L1-L5 scoring). We were dogfooding our own triad on a real-world brief: &lt;em&gt;"add Chinese gongbi cultural elements to a Scottish street photo, but preserve every photographic anchor"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;User-supplied source: a Glasgow street with red brick Victorian buildings, a Gothic cathedral spire, Stagecoach buses, a woman in a purple jacket walking — your standard urban photograph.&lt;/p&gt;

&lt;p&gt;Target: visible Chinese street culture (lanterns, calligraphy signage, muted-gold trim) painted INTO the scene at gongbi technical fidelity. &lt;em&gt;Not&lt;/em&gt; a Photoshop filter; closer to a Wang Ximeng (王希孟, c.1096–1119) &lt;em&gt;Thousand Li of Rivers and Mountains&lt;/em&gt; (《千里江山图》, Northern Song, 1113) palette discipline applied as overlay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1 — naive bare API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.openai.com/v1/images/edits &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"model=gpt-image-2"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"prompt=Add Chinese gongbi painterly elements to this Glasgow street photo"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"image=@source.png"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result (slide 1, left): the model produced a unified-wash painterly filter applied to the entire image. Red brick became flat color, building edges softened, photographic detail lost. The output reads as a &lt;em&gt;style transfer&lt;/em&gt;, not an &lt;em&gt;additive overlay&lt;/em&gt;. The lanterns are there, but so is everything else, all in the same painterly register.&lt;/p&gt;

&lt;p&gt;This is the failure mode you get when "gongbi" is interpreted globally rather than as a discipline applied to specific painted-in elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2 — Vulca-mediated structured prompt
&lt;/h2&gt;

&lt;p&gt;Same &lt;code&gt;gpt-image-2&lt;/code&gt;. Same source photo. Same intent. The difference is what the prompt looks like by the time it reaches OpenAI.&lt;/p&gt;

&lt;p&gt;Vulca's &lt;code&gt;compose_prompt_from_design()&lt;/code&gt; is a small, deliberately boring helper: it reads a resolved &lt;code&gt;design.md&lt;/code&gt; artifact, parses the YAML frontmatter + the &lt;code&gt;## C. Prompt composition&lt;/code&gt; block, and &lt;strong&gt;concatenates three pieces&lt;/strong&gt; in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;C.base_prompt&lt;/code&gt; — the user-authored prose that names MUST-PRESERVE anchors (Gothic spire, red brick wall, woman in purple jacket, bus, traffic light, sky, distant pedestrians, right-side building silhouette) and ADD-as-gongbi elements (lanterns, calligraphy signage, muted-gold trim, 千里江山图 palette echoes), plus the &lt;code&gt;style_treatment: additive&lt;/code&gt; discipline clause.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C.tradition_tokens&lt;/code&gt; — terminology copied from the &lt;code&gt;chinese_gongbi&lt;/code&gt; cultural registry: &lt;em&gt;meticulous heavy-color painting (工笔重彩) · triple alum nine washes (三矾九染) · plain line drawing (白描) · outline and fill color (勾勒填彩) · boneless technique (没骨法) · court academy painting (院体画)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C.color_constraint_tokens&lt;/code&gt; — cinnabar red (朱砂红) · muted gold (泥金) · stone blue (石青) · malachite green (石绿) — and an explicit &lt;em&gt;forbid: neon saturation, CNY plastic red, cartoon rainbow&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The "value" isn't a clever compiler; it's that &lt;strong&gt;the prose, the terminology, and the color discipline are all archived in &lt;code&gt;design.md&lt;/code&gt;&lt;/strong&gt; — version-controlled, reviewable, and reproducible from disk. A new agent two weeks later can call &lt;code&gt;compose_prompt_from_design("design.md")&lt;/code&gt; and get the same prompt string back. That replayability is what bare-API workflows lose.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;design.md&lt;/code&gt; also asks for &lt;code&gt;input_fidelity=high&lt;/code&gt; on the OpenAI call. &lt;strong&gt;What actually happened&lt;/strong&gt; is documented in &lt;code&gt;plan.md&lt;/code&gt; Notes &lt;code&gt;[param-drift]&lt;/code&gt;: &lt;code&gt;gpt-image-2&lt;/code&gt; GA shipped without &lt;code&gt;input_fidelity&lt;/code&gt; support and &lt;strong&gt;rejected the parameter outright&lt;/strong&gt;. The pre-&lt;code&gt;v0.17.12&lt;/code&gt; &lt;code&gt;openai_provider&lt;/code&gt; was sending it unconditionally and would have failed; we caught the drift mid-session, gated the param by per-model capability (issue #12, fixed in &lt;code&gt;v0.17.12&lt;/code&gt; shipped to PyPI an hour before this post), and re-ran iter 0 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;input_fidelity&lt;/code&gt;. The image still succeeded — &lt;code&gt;style_treatment: additive&lt;/code&gt; plus the prompt-level "do NOT apply unified filter" clause carried preservation discipline through prose alone. The provenance trail is in &lt;code&gt;plan.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result (slide 1, right): the lanterns are painted, the calligraphy signage is gongbi 白描 (plain line drawing) on cinnabar panels — but the brick wall, spire, woman in purple jacket, and bus are recognizably the source photograph. The painterly elements read as &lt;em&gt;intent&lt;/em&gt; (画意), not &lt;em&gt;filter&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqpfa8iwft65yalfecao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqpfa8iwft65yalfecao.png" alt="Glasgow street + Northern Song gongbi additive overlay — full Vulca-mediated result, openai/gpt-image-2 seed 7" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Decompose: 1 image → 10 editable semantic layers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layers_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iters/7/gen_bfbbacd2.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decompose/iter1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestrated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;photograph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:[...]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipeline: YOLO + Grounding DINO + SAM + SegFormer face-parsing. Returns a manifest with per-entity status, alpha-sparse RGBA layers, and a &lt;code&gt;residual&lt;/code&gt; layer.&lt;/p&gt;

&lt;p&gt;iter1 entities. All numbers below come from &lt;code&gt;manifest.json&lt;/code&gt;'s &lt;code&gt;detection_report.per_entity[].pct_after_resolve&lt;/code&gt; (the deduplicated area share that resolves each overlap to a single owning entity):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;pct_after_resolve&lt;/th&gt;
&lt;th&gt;sam_score&lt;/th&gt;
&lt;th&gt;Detector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;person&lt;/td&gt;
&lt;td&gt;5.65%&lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;woman&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lanterns&lt;/td&gt;
&lt;td&gt;8.05%&lt;/td&gt;
&lt;td&gt;0.61&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;row&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sign_top&lt;/td&gt;
&lt;td&gt;1.17%&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;red panel calligraphy plaque&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sign_right&lt;/td&gt;
&lt;td&gt;0.47%&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;tall vertical golden plaque&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spire&lt;/td&gt;
&lt;td&gt;2.08%&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;gothic cathedral spire pointed roof&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bus&lt;/td&gt;
&lt;td&gt;3.59%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;blue double decker bus&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;left_buildings&lt;/td&gt;
&lt;td&gt;24.45%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;red brick row&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;right_buildings&lt;/td&gt;
&lt;td&gt;7.51%&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;cathedral ornate stone facade&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sky&lt;/td&gt;
&lt;td&gt;15.50%&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;dino (&lt;code&gt;blue sky upper region&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;9 entities sum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68.47%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;residual (deduped)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.53%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;leftover (shadows, awning, road, traffic light)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;success_rate: 1.0&lt;/code&gt;, no suspect detections, no missed entities. (Note: the &lt;code&gt;manifest.json&lt;/code&gt; also stores &lt;code&gt;layers[9].area_pct = 40.69%&lt;/code&gt; for the residual layer using &lt;em&gt;overlap-permissive&lt;/em&gt; accounting — both numbers are exposed; the deduped 31.53% is the canonical "share of canvas not assigned to any named entity".)&lt;/p&gt;
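&lt;p&gt;The deduped accounting is easy to sanity-check: because &lt;code&gt;pct_after_resolve&lt;/code&gt; assigns every pixel to exactly one owner, the nine entity shares plus the deduped residual should tile the canvas exactly.&lt;/p&gt;

```python
# Sanity check on the table above: under deduplicated accounting,
# entity shares + residual must sum to 100% of the canvas.
entity_pcts = [5.65, 8.05, 1.17, 0.47, 2.08, 3.59, 24.45, 7.51, 15.50]
residual_pct = 31.53

entities_sum = round(sum(entity_pcts), 2)
assert entities_sum == 68.47                           # "9 entities sum" row
assert round(entities_sum + residual_pct, 2) == 100.00 # full coverage
```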

&lt;p&gt;The lanterns layer's &lt;code&gt;sam_score&lt;/code&gt; is conspicuously low (&lt;code&gt;0.61&lt;/code&gt; vs the others' 0.93–1.01). That's not a bug — it's the pipeline doing something honest: SAM was given &lt;strong&gt;one bbox&lt;/strong&gt; for "row of red paper lanterns" and asked to mask the whole row as a single object. With six dispersed lanterns + tassels + occluding awning ropes, SAM returns a fragmented streak rather than six clean silhouettes. Multi-instance entity detection (per-lantern bbox + NMS-multi-output) is a &lt;code&gt;v0.18&lt;/code&gt; backlog item; today's pipeline is &lt;code&gt;1 entity = 1 bbox = 1 mask&lt;/code&gt;. &lt;strong&gt;Slide 3's lanterns thumbnail looks "noisy" because that's the real shape of the mask.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fird3xjqk247zkx8fv3dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fird3xjqk247zkx8fv3dp.png" alt="1 image → 10 editable semantic layers via YOLO + Grounding DINO + SAM + SegFormer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A practical workflow note: DINO open-vocabulary detection has a "phrase contamination" failure mode where one entity's label tokens bleed into another entity's matched_phrase. If you ask for both "Chinese calligraphy sign on red panel" and "large Chinese calligraphy signage" in the same plan, DINO may union the bbox into a single region. The defense is: &lt;em&gt;give each entity a phrase-distinct label&lt;/em&gt;. We renamed &lt;code&gt;sign_top&lt;/code&gt; to "red panel calligraphy plaque" and &lt;code&gt;sign_right&lt;/code&gt; to "tall vertical golden plaque" — both detect cleanly.&lt;/p&gt;
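&lt;p&gt;That defense can be pre-flighted before the detector ever runs. A hypothetical check (not a Vulca API): flag any pair of entity labels that share more than one content token, since those are the pairs open-vocabulary DINO is most likely to merge.&lt;/p&gt;

```python
# Hypothetical pre-flight check (not part of Vulca): flag label pairs that
# share too many tokens, the precondition for "phrase contamination".
from itertools import combinations

def risky_label_pairs(labels, max_shared=1):
    pairs = []
    for a, b in combinations(labels, 2):
        shared = set(a.lower().split()) & set(b.lower().split())
        if len(shared) > max_shared:
            pairs.append((a, b, shared))
    return pairs

before = ["Chinese calligraphy sign on red panel",
          "large Chinese calligraphy signage"]
after = ["red panel calligraphy plaque",
         "tall vertical golden plaque"]

assert risky_label_pairs(before)      # shares "chinese" + "calligraphy": merge hazard
assert not risky_label_pairs(after)   # phrase-distinct: detects cleanly
```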

&lt;h2&gt;
  
  
  Redraw: same layer, two paths
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;layers_redraw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;artwork_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decompose/iter1/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lanterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;朱砂 cinnabar saturation +15%, 三矾九染 depth richer, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preserve lantern shapes and tassel positions exactly, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep gongbi outline-and-fill discipline, no global filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_gongbi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;layers_redraw&lt;/code&gt; sends the alpha-sparse layer through &lt;code&gt;gpt-image-2&lt;/code&gt;'s edit endpoint with the cultural tradition's prompt-composition layer applied. The slide-4-right artifact you see — a four-lantern + spire reinterpretation in full 工笔重彩 with deeper cinnabar saturation and gongbi-canonical line discipline (outline-and-fill, &lt;em&gt;勾勒填彩&lt;/em&gt;) — was authored via a fresh &lt;code&gt;generate_image&lt;/code&gt; call seeded by this same gongbi prompt scaffold, &lt;strong&gt;not&lt;/strong&gt; the literal &lt;code&gt;layers_redraw&lt;/code&gt; output above.&lt;/p&gt;

&lt;p&gt;The native &lt;code&gt;layers_redraw&lt;/code&gt; path on this row-of-six-lanterns alpha gives a stylistically incoherent result because the cream-flat reference loses per-instance geometry (see &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/visual-specs/2026-04-23-scottish-chinese-fusion/decompose/v0_17_14_native/NOTES.md" rel="noopener noreferrer"&gt;&lt;code&gt;decompose/v0_17_14_native/NOTES.md&lt;/code&gt;&lt;/a&gt; for the v0.17.14 end-to-end MCP run and the v0.18 backlog item). The &lt;code&gt;layers_redraw&lt;/code&gt; verb still works as documented; the carousel's slide-4-right just exercises the related &lt;code&gt;generate_image&lt;/code&gt; path for visual coherence.&lt;/p&gt;

&lt;p&gt;The model approximates the &lt;em&gt;visual register&lt;/em&gt; of gongbi — but as the L1-L5 scorecard below makes explicit, &lt;strong&gt;single-pass diffusion can't simulate the multi-pass alum-wash physics&lt;/strong&gt; of true 三矾九染. The redraw looks gongbi-flavored; it isn't gongbi-correct. Both can be true.&lt;/p&gt;

&lt;p&gt;The agent now has two paths for the lanterns layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alpha-isolated original (preservation, composite-friendly)&lt;/li&gt;
&lt;li&gt;gongbi-reinterpreted output (concept exploration, hero asset)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vulca exposes both paths via MCP. The choosing happens in the agent, not in a static pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11yaq4ats5mhnrpmdcuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11yaq4ats5mhnrpmdcuu.png" alt="Same layer, two paths — alpha-isolated lantern silhouettes vs gpt-image-2 + gongbi prompt reinterpretation" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest scoring — &lt;code&gt;mode="rubric_only"&lt;/code&gt; + agent self-grade
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;v0.17.12&lt;/code&gt; shipped a new evaluate mode this morning. Important: it does &lt;strong&gt;not&lt;/strong&gt; score the image. It returns the rubric so the &lt;strong&gt;agent&lt;/strong&gt; can score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcp_vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate_artwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.../gen_bfbbacd2.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_gongbi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rubric_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# result.score == None
# result.rubric == { weights, terminology, taboos, tradition_layers }
# result.score_schema == { L1: null, L2: null, ..., L5: null }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rubric_only&lt;/code&gt; returns the rubric template (L1-L5 weights from &lt;code&gt;chinese_gongbi.yaml&lt;/code&gt;, six terminology entries, taboos, tradition_layers) and an empty &lt;code&gt;score_schema&lt;/code&gt;. &lt;strong&gt;No VLM call.&lt;/strong&gt; The agent — which already has vision — applies the rubric to the image and fills the scores itself. The split is deliberate: consumer agents already see the pixels; an extra VLM round-trip would be redundant cost. Vulca supplies the rubric; the agent self-grades.&lt;/p&gt;

&lt;p&gt;Our agent self-grade for the iter 0 image, recorded verbatim in &lt;code&gt;plan.md&lt;/code&gt; &lt;code&gt;## Results&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 Visual&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;Gongbi additions read as deliberate; line discipline visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 Technical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.30&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.65 ✗&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;三矾九染 depth shallow; 石青/石绿 under-represented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 Cultural&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;千里江山图 palette intent honored; 朱砂/泥金 read true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4 Critical&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;Additive treatment honored; photo anchors preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L5 Philosophical&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;Cross-cultural intent legible; literati-naming convention borrowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.702&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;L2 hard-fails the 0.70 threshold because &lt;em&gt;triple-alum-nine-washes&lt;/em&gt; (三矾九染) is a &lt;strong&gt;multi-pass physical technique&lt;/strong&gt; — alum fixative applied between successive translucent washes to build depth and luminosity. A single forward pass through any diffusion model — &lt;code&gt;gpt-image-2&lt;/code&gt;, &lt;code&gt;stable-diffusion-xl&lt;/code&gt;, anything — cannot simulate alum-wash layering. &lt;strong&gt;This is a category-level ceiling, not a Vulca regression.&lt;/strong&gt; The model approximates the &lt;em&gt;visual register&lt;/em&gt; of depth; a trained gongbi reviewer will catch the absence of true alum-wash physics instantly.&lt;/p&gt;
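&lt;p&gt;The scorecard arithmetic is replayable from the table alone:&lt;/p&gt;

```python
# Replay the weighted total and the L2 hard-fail gate (0.70 threshold)
# from the scorecard above.
weights = {"L1": 0.15, "L2": 0.30, "L3": 0.25, "L4": 0.15, "L5": 0.15}
scores  = {"L1": 0.78, "L2": 0.65, "L3": 0.72, "L4": 0.75, "L5": 0.65}

weighted = round(sum(weights[d] * scores[d] for d in weights), 3)
assert weighted == 0.702        # matches the "Weighted" row
assert scores["L2"] < 0.70      # hard-fail → strict verdict: reject
```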

&lt;p&gt;The strict rubric verdict was &lt;code&gt;reject&lt;/code&gt;. The maintainer (the human in the loop) decided to &lt;strong&gt;accept for showcase use anyway&lt;/strong&gt; — and &lt;code&gt;plan.md&lt;/code&gt; records BOTH judgments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;verdict: reject ✗ → user-override-accept&lt;/code&gt; (the table cell)&lt;/li&gt;
&lt;li&gt;the override reason in the Notes block: &lt;em&gt;"L2 hard-fail: 三矾九染 depth shallow; 石青/石绿 under-represented. User accepted for showcase use."&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the &lt;strong&gt;dual-judgment provenance&lt;/strong&gt; pattern. The strict rubric retains technical honesty; the human retains veto. Both are archived. A skeptic running the same pipeline gets the same rubric verdict; a different maintainer might decide to &lt;em&gt;not&lt;/em&gt; override. The artifact captures both.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;design.md&lt;/code&gt;'s &lt;code&gt;rollback_trigger&lt;/code&gt; is a separate concern: it fires only when &lt;em&gt;all 3 main seeds score L1&amp;lt;0.6 OR L3&amp;lt;0.6&lt;/em&gt;. Neither condition was met (L1=0.78, L3=0.72), so the L2 hard-fail surfaces as &lt;strong&gt;honest disclosure&lt;/strong&gt;, not as a rollback signal. Different gates for different purposes.&lt;/p&gt;
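&lt;p&gt;As a predicate (illustrative; the actual trigger is declared in &lt;code&gt;design.md&lt;/code&gt;), that gate looks like this:&lt;/p&gt;

```python
# Illustrative form of the rollback gate described above: fire only when
# ALL main seeds fall below threshold on L1, or ALL fall below on L3.
def rollback_triggered(seed_scores, threshold=0.6):
    return (all(s["L1"] < threshold for s in seed_scores)
            or all(s["L3"] < threshold for s in seed_scores))

# iter 0 scored L1=0.78, L3=0.72: neither condition fires, so the L2
# hard-fail surfaces as disclosure without rolling anything back.
assert not rollback_triggered([{"L1": 0.78, "L3": 0.72}])
```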

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzt6384mymk93424clvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzt6384mymk93424clvq.png" alt="plan.md verdict trail — L1 0.78, L2 0.65 (hard-fail), L3 0.72, L4 0.75, L5 0.65, weighted 0.702 → strict reject → user-override-accept; both judgments archived" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The triad — three markdown files
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/visual-specs/2026-04-23-scottish-chinese-fusion/
├── proposal.md              ← /visual-brainstorm output (8K)
├── design.md                ← /visual-spec output (10K)
├── plan.md                  ← /visual-plan output (11K)
├── source.png               ← user-supplied Glasgow photo
├── iters/
│   ├── _baseline_bare/
│   │   └── bare_gpt2_edit.png       ← naive API control
│   └── 7/
│       └── gen_bfbbacd2.png         ← Vulca-mediated
├── decompose/
│   ├── lanterns_before.png          ← alpha-iso (slide 4 left)
│   ├── lanterns_after.png           ← gongbi reinterp (slide 4 right)
│   └── iter1/                       ← 9 entities + residual
└── carousel/                        ← this 6-slide deck
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftta5rh6o2dvrm5d8qemd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftta5rh6o2dvrm5d8qemd.png" alt="The whole project, in 3 markdown files: proposal.md / design.md / plan.md + a directory of artifacts. Pixels reproducible from markdown. The markdown is the product." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three markdown files lock the entire decision trail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;proposal.md&lt;/code&gt; — the user's intent in human terms. Style treatment, anchor list, budget, deadline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;design.md&lt;/code&gt; — the technical translation. Provider, model, input_fidelity, prompt composition, L1-L5 weights &amp;amp; thresholds, spike plan, cost budget.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;plan.md&lt;/code&gt; — the execution flow. Phase order, batch size, evaluation gates, fail-fast rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each file is produced by a &lt;code&gt;/visual-*&lt;/code&gt; skill (brainstorm / spec / plan), each gated by a finalize handshake (&lt;code&gt;finalize&lt;/code&gt; / &lt;code&gt;done&lt;/code&gt; / &lt;code&gt;ready&lt;/code&gt; / &lt;code&gt;lock it&lt;/code&gt; / &lt;code&gt;approve&lt;/code&gt;). The agent doesn't free-form image-generate; it walks the triad.&lt;/p&gt;

&lt;p&gt;This is what "agent-mediated prompting" actually means in code. Not magic. A markdown contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"vulca[mcp]==0.17.14"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to your Claude Code MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vulca"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vulca-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in your Claude Code session, type &lt;code&gt;/visual-brainstorm&lt;/code&gt; to start a fresh visual project, or &lt;code&gt;/decompose &amp;lt;image&amp;gt;&lt;/code&gt; to break an existing image into editable layers. The full 22-tool MCP surface is documented in &lt;code&gt;docs/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing note
&lt;/h2&gt;

&lt;p&gt;The two images at the top of this post differ by three markdown files. They're not magic; they're version-controllable, reviewable, replayable contracts. If your AI-art workflow today is "type a prompt, hope, retry" — try the markdown-trio approach once. The first time you get the same image back from a fresh agent two weeks later because the markdown is still there, you'll see why.&lt;/p&gt;

&lt;p&gt;— shipped today as &lt;code&gt;vulca==0.17.14&lt;/code&gt;. The v0.17.14 patches make the &lt;code&gt;layers_redraw&lt;/code&gt; + &lt;code&gt;layers_paste_back&lt;/code&gt; &lt;em&gt;mechanism&lt;/em&gt; fully native: the &lt;code&gt;background_strategy="cream"&lt;/code&gt; flag stops the alpha-sparse hallucination that tainted redraws of sparse layers before v0.17.14, &lt;code&gt;preserve_alpha=True&lt;/code&gt; re-applies the source layer's alpha, and &lt;code&gt;layers_paste_back&lt;/code&gt; is a new glue verb for compositing an edited layer back into a foreign source image. &lt;strong&gt;Visual parity&lt;/strong&gt; with the slide-4-right artifact is a separate goal: that artifact was authored via &lt;code&gt;generate_image&lt;/code&gt; with a gongbi text prompt, not via &lt;code&gt;layers_redraw&lt;/code&gt; on the lanterns layer alone. The v0.17.14 patch closes the &lt;em&gt;out-of-band Python&lt;/em&gt; gap for the canonical edit-and-paste-back flow; per-instance multi-lantern redraw with full visual parity remains a v0.18 backlog item. Reproducible MCP-only validation of the mechanism is archived in &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/visual-specs/2026-04-23-scottish-chinese-fusion/decompose/v0_17_14_native/NOTES.md" rel="noopener noreferrer"&gt;&lt;code&gt;decompose/v0_17_14_native/NOTES.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Built a Free Local AI Art Pipeline on My Mac — Here's What Broke</title>
      <dc:creator>yha9806</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:09:33 +0000</pubDate>
      <link>https://dev.to/yha9806/i-built-a-free-local-ai-art-pipeline-on-my-mac-heres-what-broke-3cip</link>
      <guid>https://dev.to/yha9806/i-built-a-free-local-ai-art-pipeline-on-my-mac-heres-what-broke-3cip</guid>
      <description>&lt;p&gt;What if you could run a complete AI art creation pipeline — 13 cultural traditions, 5-dimension scoring, structured layer generation — entirely on your MacBook, for free?&lt;/p&gt;

&lt;p&gt;No cloud API key. No GPU server. Just &lt;code&gt;pip install vulca&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkniad86le2jqhoz3z9as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkniad86le2jqhoz3z9as.png" alt="Chinese Xieyi ink wash landscape" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9vcc3i7wi2ff9x8chph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9vcc3i7wi2ff9x8chph.png" alt="Japanese traditional snow temple" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ugmb70r1fiqy1buwx91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ugmb70r1fiqy1buwx91.png" alt="Brand design tea packaging" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three traditions, one SDK — generated locally via ComfyUI/SDXL on Apple Silicon, zero cloud API cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These images were generated on an Apple Silicon Mac running ComfyUI locally. No Midjourney subscription. No Replicate credits. No DALL-E API calls. The evaluation scores below come from a VLM (Gemma 4 via Ollama) running on the same machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;span class="go"&gt;
  Score:     90%    Tradition: chinese_xieyi    Risk: low

    L1 Visual Perception         ██████████████████░░ 90%  ✓
    L2 Technical Execution       █████████████████░░░ 85%  ✓
    L3 Cultural Context          ██████████████████░░ 90%  ✓
    L4 Critical Interpretation   ████████████████████ 100%  ✓
    L5 Philosophical Aesthetics  ██████████████████░░ 90%  ✓
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This post is not a product announcement. It is a technical deep dive into what it took to build &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;VULCA&lt;/a&gt; — the bugs we hit, the architectural decisions we made, and the code that holds it together.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What is VULCA + The Local Stack
&lt;/h2&gt;

&lt;p&gt;VULCA is an AI-native cultural art creation SDK. It generates, evaluates, decomposes, and evolves visual art across 13 cultural traditions. It runs locally (ComfyUI + Ollama) or in the cloud (Gemini).&lt;/p&gt;

&lt;p&gt;Not another Midjourney wrapper or ComfyUI plugin — a standalone SDK for cultural art intelligence.&lt;/p&gt;

&lt;p&gt;The project started as academic research. The &lt;a href="https://aclanthology.org/2025.findings-emnlp/" rel="noopener noreferrer"&gt;VULCA Framework&lt;/a&gt; was published at EMNLP 2025 Findings, and &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench&lt;/a&gt; provides 7,410 annotated samples with L1-L5 cultural scoring definitions. The SDK implements this research as a production tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│                  vulca CLI                   │
├─────────────┬──────────┬────────────────────┤
│   create    │ evaluate │  layers / studio   │
├─────────────┴──────────┴────────────────────┤
│              Cultural Engine                 │
│   13 traditions × L1-L5 scoring rubrics     │
├──────────────────┬──────────────────────────┤
│  Image Providers │      VLM Providers       │
│  ┌────────────┐  │  ┌────────────────────┐  │
│  │  ComfyUI   │  │  │  Ollama (Gemma 4)  │  │
│  │  (local)   │  │  │  (local)           │  │
│  ├────────────┤  │  ├────────────────────┤  │
│  │  Gemini    │  │  │  Gemini            │  │
│  │  (cloud)   │  │  │  (cloud)           │  │
│  └────────────┘  │  └────────────────────┘  │
└──────────────────┴──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quickstart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vulca

&lt;span class="c"&gt;# Point at your local ComfyUI + Ollama&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_IMAGE_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_VLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama_chat/gemma4

&lt;span class="c"&gt;# Generate&lt;/span&gt;
vulca create &lt;span class="s2"&gt;"Misty mountains after spring rain"&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--provider&lt;/span&gt; comfyui &lt;span class="nt"&gt;-o&lt;/span&gt; art.png

&lt;span class="c"&gt;# Evaluate&lt;/span&gt;
vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Provider Architecture: Pluggable, Not Locked In
&lt;/h3&gt;

&lt;p&gt;VULCA does not depend on any single backend. Image providers are pluggable classes. ComfyUI is one provider. Gemini is another. You can add your own.&lt;/p&gt;

&lt;p&gt;The key design insight: providers declare their capabilities as a frozen set. VULCA uses these capabilities to decide how to format prompts, whether to pass CJK text directly, and whether RGBA output is available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ComfyUI: CLIP-based encoder, English-only, returns raw RGBA
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ComfyUIImageProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_rgba&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Gemini: LLM-based encoder, understands CJK natively, returns raw RGBA
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GeminiImageProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_rgba&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multilingual_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;multilingual_prompt&lt;/code&gt; capability is the difference between a 120-token structured prompt (Gemini can handle it) and a compressed 60-token flat prompt (CLIP will truncate anything beyond 77 tokens). More on this in section 5.&lt;/p&gt;
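&lt;p&gt;That dispatch can be sketched in a few lines. This is a hypothetical illustration of the pattern, not vulca's actual prompt builder; only the capability names come from the source above.&lt;/p&gt;

```python
from dataclasses import dataclass

CLIP_TOKEN_BUDGET = 77  # CLIP text encoders hard-truncate past 77 tokens

@dataclass(frozen=True)
class Provider:
    name: str
    capabilities: frozenset

def choose_prompt(provider: Provider, structured: str, flat: str) -> str:
    """Return the prompt variant the provider's encoder can actually handle."""
    if "multilingual_prompt" in provider.capabilities:
        return structured  # LLM-based encoder: long structured prompt, CJK ok
    return flat            # CLIP encoder: compressed English, within budget

comfy = Provider("comfyui", frozenset({"raw_rgba"}))
gemini = Provider("gemini", frozenset({"raw_rgba", "multilingual_prompt"}))
print(choose_prompt(comfy, "structured prompt", "flat prompt"))   # flat prompt
print(choose_prompt(gemini, "structured prompt", "flat prompt"))  # structured prompt
```

&lt;p&gt;Because capabilities are a frozen set, adding a new provider never touches the dispatch logic; the provider just declares what it can do.&lt;/p&gt;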

&lt;p&gt;When you ask ComfyUI to generate an image, VULCA constructs a complete ComfyUI workflow as a JSON dict and submits it via the REST API. No ComfyUI nodes to install. No custom workflows to import. The entire workflow is built programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KSampler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randbelow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cfg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sampler_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;euler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denoise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latent_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CheckpointLoaderSimple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ckpt_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sd_xl_base_1.0.safetensors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EmptyLatentImage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIPTextEncode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIPTextEncode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VAEDecode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vae&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SaveImage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename_prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vulca&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is from &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/providers/comfyui.py#L42" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/providers/comfyui.py&lt;/code&gt; lines 42-62&lt;/a&gt;. It constructs a standard SDXL pipeline: checkpoint loader, empty latent, two CLIP text encoders (positive + negative), KSampler, VAE decode, save. The workflow is submitted as a single POST to &lt;code&gt;/prompt&lt;/code&gt;, and VULCA polls &lt;code&gt;/history/{prompt_id}&lt;/code&gt; until the image is ready.&lt;/p&gt;
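&lt;p&gt;The submit-and-poll loop itself is small. A stdlib-only sketch of what it could look like, assuming ComfyUI's standard &lt;code&gt;POST /prompt&lt;/code&gt; and &lt;code&gt;GET /history/{prompt_id}&lt;/code&gt; endpoints (this is not vulca's actual client code):&lt;/p&gt;

```python
import json
import time
import urllib.request

BASE = "http://localhost:8188"  # address of a local ComfyUI instance

def submit_and_wait(workflow: dict, timeout_s: int = 300) -> dict:
    """POST a workflow dict to ComfyUI's /prompt endpoint, then poll
    /history/{prompt_id} roughly once a second until outputs appear."""
    req = urllib.request.Request(
        BASE + "/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        prompt_id = json.load(resp)["prompt_id"]

    for _ in range(timeout_s):
        with urllib.request.urlopen(BASE + "/history/" + prompt_id) as resp:
            history = json.load(resp)
        if prompt_id in history:   # key appears once execution has finished
            return history[prompt_id]["outputs"]
        time.sleep(1.0)
    raise TimeoutError("ComfyUI job " + prompt_id + " did not finish in time")
```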

&lt;p&gt;After the image comes back, VULCA verifies that the payload is actually a valid PNG before accepting it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\x89&lt;/span&gt;&lt;span class="s"&gt;PNG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ComfyUI returned invalid image &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes, header=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That validation was added in commit &lt;a href="https://github.com/vulca-org/vulca/commit/fdc0e45" rel="noopener noreferrer"&gt;&lt;code&gt;fdc0e45&lt;/code&gt;&lt;/a&gt; after we discovered that certain PyTorch MPS bugs cause ComfyUI to return 4KB files with valid PNG headers but all-zero pixel data.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. L1-L5 Cultural Evaluation
&lt;/h2&gt;

&lt;p&gt;Most AI art tools generate. VULCA evaluates.&lt;/p&gt;

&lt;p&gt;The evaluation framework scores artwork across five dimensions, each measuring a different aspect of cultural and artistic quality:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L1&lt;/strong&gt; Visual Perception&lt;/td&gt;
&lt;td&gt;Composition, color harmony, spatial arrangement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L2&lt;/strong&gt; Technical Execution&lt;/td&gt;
&lt;td&gt;Rendering quality, technique fidelity, craftsmanship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L3&lt;/strong&gt; Cultural Context&lt;/td&gt;
&lt;td&gt;Tradition-specific motifs, canonical conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L4&lt;/strong&gt; Critical Interpretation&lt;/td&gt;
&lt;td&gt;Cultural sensitivity, contextual framing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;L5&lt;/strong&gt; Philosophical Aesthetics&lt;/td&gt;
&lt;td&gt;Artistic depth, emotional resonance, spiritual qualities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are not arbitrary categories. They come from the &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench paper&lt;/a&gt;, which defines L1-L5 across 7,410 annotated samples.&lt;/p&gt;
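&lt;p&gt;The overall score is then a weighted aggregate of the five dimensions. A hypothetical sketch of that arithmetic, using the &lt;code&gt;chinese_xieyi&lt;/code&gt; weights as illustration (the function name and signature are assumptions, not vulca's API):&lt;/p&gt;

```python
# Illustrative sketch of weighted L1-L5 aggregation; the weights match the
# chinese_xieyi example, but the function itself is hypothetical.
XIEYI_WEIGHTS = {"L1": 0.10, "L2": 0.15, "L3": 0.25, "L4": 0.20, "L5": 0.30}

def overall_score(scores: dict, weights: dict) -> float:
    """Weight-normalized sum of per-dimension scores (each in 0..1)."""
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total

# The per-dimension scores from the CLI output earlier in the post
demo = {"L1": 0.90, "L2": 0.85, "L3": 0.90, "L4": 1.00, "L5": 0.90}
print(round(100 * overall_score(demo, XIEYI_WEIGHTS)))  # aggregate percent
```

&lt;p&gt;Because the weights differ per tradition, the same five raw scores produce different overall ratings for, say, xieyi versus brand design.&lt;/p&gt;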

&lt;h3&gt;
  
  
  13 Traditions, Custom Weights
&lt;/h3&gt;

&lt;p&gt;Each tradition is defined as a YAML file with its own L1-L5 weight distribution. Chinese freehand ink painting (xieyi) weights philosophical aesthetics (L5) at 30% and cultural context (L3) at 25%, because the tradition values spiritual resonance and canonical motifs above raw technical rendering. A brand design tradition would weight L2 (technical execution) much higher.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/vulca/cultural/data/traditions/chinese_xieyi.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;
&lt;span class="na"&gt;display_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chinese&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Freehand&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ink&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(Xieyi)"&lt;/span&gt;
  &lt;span class="na"&gt;zh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;中国写意"&lt;/span&gt;

&lt;span class="na"&gt;weights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;L1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.10&lt;/span&gt;
  &lt;span class="na"&gt;L2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.15&lt;/span&gt;
  &lt;span class="na"&gt;L3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.25&lt;/span&gt;
  &lt;span class="na"&gt;L4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.20&lt;/span&gt;
  &lt;span class="na"&gt;L5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.30&lt;/span&gt;

&lt;span class="na"&gt;terminology&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;term&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spirit resonance and vitality&lt;/span&gt;
    &lt;span class="na"&gt;term_zh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;气韵生动"&lt;/span&gt;
    &lt;span class="na"&gt;definition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Xie&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;He's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Six&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Principles&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Chinese&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Painting..."&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aesthetics&lt;/span&gt;
    &lt;span class="na"&gt;l_levels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;L4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;L5&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
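&lt;p&gt;To make the weights concrete, here is a minimal sketch (not VULCA's actual implementation) of how five sub-scores would combine under the xieyi distribution above:&lt;/p&gt;

```python
# Hypothetical illustration: combine L1-L5 sub-scores (0-100 scale) with
# the chinese_xieyi weights from the YAML above. Not VULCA's actual code.
XIEYI_WEIGHTS = {"L1": 0.10, "L2": 0.15, "L3": 0.25, "L4": 0.20, "L5": 0.30}

def weighted_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * sub_scores[k] for k in weights)

# A piece strong on cultural context (L3) and aesthetics (L5) scores well
# under xieyi even with middling technical execution (L2).
print(weighted_score({"L1": 80, "L2": 70, "L3": 90, "L4": 85, "L5": 95},
                     XIEYI_WEIGHTS))  # 86.5
```

&lt;p&gt;Swapping in an L2-heavy weight set, like the brand-design example, would pull the same sub-scores toward the technical dimension instead.&lt;/p&gt;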



&lt;p&gt;The 13 supported traditions are: &lt;code&gt;chinese_xieyi&lt;/code&gt;, &lt;code&gt;chinese_gongbi&lt;/code&gt;, &lt;code&gt;japanese_traditional&lt;/code&gt;, &lt;code&gt;western_academic&lt;/code&gt;, &lt;code&gt;islamic_geometric&lt;/code&gt;, &lt;code&gt;watercolor&lt;/code&gt;, &lt;code&gt;african_traditional&lt;/code&gt;, &lt;code&gt;south_asian&lt;/code&gt;, &lt;code&gt;brand_design&lt;/code&gt;, &lt;code&gt;photography&lt;/code&gt;, &lt;code&gt;contemporary_art&lt;/code&gt;, &lt;code&gt;ui_ux_design&lt;/code&gt;, and &lt;code&gt;default&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Evaluation Modes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strict&lt;/strong&gt; (judge): Conformance scoring. How well does the art meet the tradition's standards?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference&lt;/strong&gt; (mentor): Cultural guidance with professional terminology. Not a judge, a mentor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fusion&lt;/strong&gt;: Multi-tradition comparison. Pass comma-separated traditions and get cross-cultural analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The API: Three Lines to Score Any Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artwork.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full &lt;code&gt;aevaluate()&lt;/code&gt; signature from &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/evaluate.py#L12" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/evaluate.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;sparse&lt;/code&gt; parameter is worth calling out. When &lt;code&gt;sparse=True&lt;/code&gt;, VULCA runs a &lt;code&gt;BriefIndexer&lt;/code&gt; that determines which L1-L5 dimensions are most relevant to the given intent. All five dimensions are still scored (consistency matters), but the &lt;code&gt;sparse_activation&lt;/code&gt; metadata tells callers which dimensions were most salient. This is useful in pipeline mode where you want to focus review on the dimensions that matter for a specific prompt.&lt;/p&gt;
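&lt;p&gt;In pipeline code, consuming that metadata might look like the following sketch. The &lt;code&gt;sparse_activation&lt;/code&gt; name comes from the post; the surrounding structure is assumed, not VULCA's documented schema:&lt;/p&gt;

```python
# Hypothetical shape of a sparse-mode result (field layout is an assumption).
result_meta = {
    "scores": {"L1": 72, "L2": 68, "L3": 91, "L4": 88, "L5": 93},
    "sparse_activation": ["L3", "L5"],  # dimensions flagged as most salient
}

# Focus human review on the salient dimensions only.
salient = {dim: result_meta["scores"][dim]
           for dim in result_meta["sparse_activation"]}
print(salient)  # {'L3': 91, 'L5': 93}
```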




&lt;h2&gt;
  
  
  3. Deep Dive: Structured Layer Generation
&lt;/h2&gt;

&lt;p&gt;VULCA does not generate images. It generates layers.&lt;/p&gt;

&lt;p&gt;The pipeline works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent parsing&lt;/strong&gt; — user prompt is analyzed for tradition, subject, and composition intent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLM planning&lt;/strong&gt; — Gemma 4 (via Ollama) decomposes the prompt into a layer plan: background, mid-ground elements, foreground, calligraphy/text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-layer generation&lt;/strong&gt; — each layer is generated as a separate full-frame image on the tradition's flat canvas color&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Luminance keying&lt;/strong&gt; — non-background layers are keyed to remove canvas color, producing clean alpha&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alpha composite&lt;/strong&gt; — layers are composited in order to produce the final artwork&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qkfb9y14nkg7l36whh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qkfb9y14nkg7l36whh1.png" alt="Layered exploded view" width="800" height="164"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Layer decomposition: paper, distant mountains, forest, calligraphy, composite&lt;/em&gt;&lt;/p&gt;
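&lt;p&gt;Step 4 (luminance keying) can be pictured with a per-pixel sketch like this. It illustrates the idea only; it is not VULCA's implementation:&lt;/p&gt;

```python
# Illustration of luminance keying: alpha is derived from each pixel's
# distance to the flat canvas color. Not VULCA's actual code.
def key_alpha(pixel: tuple[int, int, int],
              canvas: tuple[int, int, int],
              threshold: int = 30) -> int:
    """Return 0 (transparent) near the canvas color, 255 (opaque) elsewhere."""
    dist = max(abs(p - c) for p, c in zip(pixel, canvas))
    return 0 if dist <= threshold else 255

canvas = (237, 228, 207)                # aged-paper tone
print(key_alpha(canvas, canvas))        # 0   -> canvas pixel keyed out
print(key_alpha((40, 38, 35), canvas))  # 255 -> ink stroke kept opaque
```

&lt;p&gt;A production keyer would soften the hard threshold into a ramp to avoid fringing at stroke edges.&lt;/p&gt;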
&lt;h3&gt;
  
  
  Serial-First Style Anchoring
&lt;/h3&gt;

&lt;p&gt;The first layer generates serially as a style anchor. Its raw RGB output becomes the visual reference (&lt;code&gt;style_ref&lt;/code&gt;) for all subsequent layers, which generate in parallel. This is Defense 3 from v0.14 — without it, each layer would independently interpret "Chinese xieyi style" and you would get five different visual interpretations in the same artwork.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Prompt Builder
&lt;/h3&gt;

&lt;p&gt;The core of layer generation is &lt;code&gt;build_anchored_layer_prompt()&lt;/code&gt; in &lt;a href="https://github.com/vulca-org/vulca/blob/master/src/vulca/layers/layered_prompt.py#L47" rel="noopener noreferrer"&gt;&lt;code&gt;src/vulca/layers/layered_prompt.py&lt;/code&gt;&lt;/a&gt;. This function wraps the plan's regeneration prompt in four mandatory anchor blocks: canvas, content (with negative list), spatial, style.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prompt + negative prompt pair for a layer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_anchored_layer_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LayerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TraditionAnchor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sibling_roles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;english_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function has two code paths, controlled by &lt;code&gt;english_only&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When &lt;code&gt;english_only=False&lt;/code&gt;&lt;/strong&gt; (Gemini path): Returns a structured multi-section string with &lt;code&gt;[CANVAS]&lt;/code&gt;, &lt;code&gt;[CONTENT]&lt;/code&gt;, &lt;code&gt;[SPATIAL]&lt;/code&gt;, &lt;code&gt;[STYLE]&lt;/code&gt;, and &lt;code&gt;[USER INTENT]&lt;/code&gt; blocks. Gemini's LLM-based encoder can parse these sections and follow the instructions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CANVAS]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The image MUST be drawn on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;canvas_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The background MUST be the pure canvas color &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canvas_color_hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with absolutely no other elements, textures, shading, or borders.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CONTENT — exclusivity]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This image ONLY contains the element specified in USER INTENT.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do NOT include any of: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;others_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SPATIAL]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MUST occupy &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, covering approximately &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of the canvas area.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STYLE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;style_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[USER INTENT]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When &lt;code&gt;english_only=True&lt;/code&gt;&lt;/strong&gt; (ComfyUI/SDXL path): Returns a &lt;code&gt;LayerPromptResult&lt;/code&gt; with a flat, CLIP-friendly prompt under 70 tokens and a separate &lt;code&gt;negative_prompt&lt;/code&gt;. This is the path that took the most engineering to get right. More on why in section 5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;english_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;canvas_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;others&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CJK-Aware Prompt Handling
&lt;/h3&gt;

&lt;p&gt;VULCA accepts prompts in Chinese, Japanese, and Korean. When the target provider has the &lt;code&gt;multilingual_prompt&lt;/code&gt; capability (Gemini), CJK text passes through natively. When the provider does not have that capability (ComfyUI/SDXL with CLIP), VULCA strips CJK characters and falls back to English equivalents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CJK_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_cjk_parenthetical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Strip CJK parenthetical annotations, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cooked silk (熟绢)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cooked silk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_CJK_PAREN_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So &lt;code&gt;vulca create "水墨山水" -t chinese_xieyi --provider comfyui&lt;/code&gt; works — VULCA translates the prompt for CLIP behind the scenes.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Deep Dive: Making SDXL Work Locally
&lt;/h2&gt;

&lt;p&gt;This is where things got interesting. Two traps nearly derailed the local ComfyUI path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 1: The ANCHOR Hallucination
&lt;/h3&gt;

&lt;p&gt;Our structured layer prompts originally used section headers like &lt;code&gt;[CANVAS ANCHOR]&lt;/code&gt;, &lt;code&gt;[STYLE ANCHOR]&lt;/code&gt;, and &lt;code&gt;[CONTENT ANCHOR]&lt;/code&gt;. The word "ANCHOR" was there to signal to the LLM that these were fixed constraints, not suggestions.&lt;/p&gt;

&lt;p&gt;SDXL's CLIP encoder is not an LLM. It is a text encoder that treats every token as content. When it saw "ANCHOR", it interpreted it as a request to paint an anchor — the nautical kind.&lt;/p&gt;

&lt;p&gt;The result: literal ship anchors appearing on rice paper backgrounds in Chinese ink wash paintings. Misty mountains with a ship anchor in the corner. Bamboo forests with an anchor hovering over them.&lt;/p&gt;

&lt;p&gt;The fix was trivial once diagnosed. Rename the headers to &lt;code&gt;[CANVAS]&lt;/code&gt;, &lt;code&gt;[STYLE]&lt;/code&gt;, &lt;code&gt;[CONTENT]&lt;/code&gt;, &lt;code&gt;[SPATIAL]&lt;/code&gt;. No word that could be interpreted as visual content.&lt;/p&gt;

&lt;p&gt;Commit: &lt;a href="https://github.com/vulca-org/vulca/commit/b168178" rel="noopener noreferrer"&gt;&lt;code&gt;b168178&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;fix(layers): remove ANCHOR from prompt headers — SDXL paints literal anchors&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The lesson: CLIP-based models do not have a concept of "metadata" or "instructions" in a prompt. Every token is content. If your prompt engineering uses structured headers, every header word will influence the generated image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 1b: The 77-Token CLIP Ceiling
&lt;/h3&gt;

&lt;p&gt;Fixing the anchor hallucination revealed a second, subtler problem. Our structured prompt — even without "ANCHOR" — was 120+ tokens. CLIP truncates at 77 tokens. The actual subject description ("misty mountains after spring rain") was buried past the 77-token boundary and never reached the encoder.&lt;/p&gt;

&lt;p&gt;Gallery images (simple prompts, ~30 tokens) worked perfectly, while layered generation (structured prompts, 120+ tokens) produced generic, unfocused results. Debugging was confusing because the same code path succeeded for simple creates and failed only for layered ones.&lt;/p&gt;

&lt;p&gt;The fix: the &lt;code&gt;english_only&lt;/code&gt; branch in &lt;code&gt;build_anchored_layer_prompt()&lt;/code&gt;. Instead of a structured multi-section prompt, VULCA builds a flat, subject-first prompt under 70 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;misty mountains after spring rain, traditional brushwork, ink wash, on aged xuan paper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a separate &lt;code&gt;negative_prompt&lt;/code&gt; field (other layer roles to avoid). The subject comes first so it is guaranteed to be within CLIP's 77-token window.&lt;/p&gt;
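&lt;p&gt;The effect of ordering under a hard token budget is easy to demonstrate. Real CLIP uses a BPE tokenizer, so the whitespace splitting below is only a rough proxy:&lt;/p&gt;

```python
BUDGET = 77  # CLIP's context length

def survives_truncation(prompt: str, phrase: str, budget: int = BUDGET) -> bool:
    """Rough check: does `phrase` survive truncation to `budget` tokens?"""
    kept = " ".join(prompt.split()[:budget])
    return phrase in kept

subject = "misty mountains after spring rain"
style_tail = "traditional brushwork, ink wash, " * 20  # bloated style section

print(survives_truncation(subject + ", " + style_tail, subject))  # True
print(survives_truncation(style_tail + subject, subject))         # False
```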

&lt;p&gt;Commit: &lt;a href="https://github.com/vulca-org/vulca/commit/74f9952" rel="noopener noreferrer"&gt;&lt;code&gt;74f9952&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;fix(layers): CLIP-aware prompt compression for SDXL — flat &amp;lt;70 token prompt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;LayerPromptResult&lt;/code&gt; dataclass was added specifically for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LayerPromptResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prompt + negative prompt pair for a layer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structured string (Gemini path) returns a single &lt;code&gt;str&lt;/code&gt;. The CLIP path returns a &lt;code&gt;LayerPromptResult&lt;/code&gt; with both positive and negative prompts separated. The caller checks &lt;code&gt;isinstance(result, LayerPromptResult)&lt;/code&gt; to decide which ComfyUI workflow nodes to populate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 2: PyTorch MPS — A Version Minefield
&lt;/h3&gt;

&lt;p&gt;With prompt engineering fixed, we hit the hardware layer. SDXL generation via ComfyUI on Apple Silicon (MPS backend) with PyTorch 2.11.0 produces either all-black images (all-zero pixels, ~4KB files) or pure noise (~2MB of random pixels).&lt;/p&gt;

&lt;p&gt;Key observations that made this hard to diagnose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KSampler diffusion runs to completion — 20 steps, progress bars, no errors&lt;/li&gt;
&lt;li&gt;VAEDecode output is corrupt despite successful sampling&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--force-fp32&lt;/code&gt; does NOT fix it — this is a correctness bug, not a precision issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three compounding PyTorch MPS bugs cause the failure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: SDPA Non-Contiguous Tensor Regression&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/163597" rel="noopener noreferrer"&gt;pytorch/pytorch#163597&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Introduced in PyTorch 2.8.0. MPS SDPA kernels produce wildly incorrect results when given non-contiguous tensors. SDXL's cross-attention performs transpose operations that create non-contiguous views, feeding garbage embeddings into the U-Net. Error magnitude: ~34.0 vs normal ~0.000006.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2: Conv2d Chunk Correctness Bug&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/169342" rel="noopener noreferrer"&gt;pytorch/pytorch#169342&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Affects PyTorch 2.9.0+. The &lt;code&gt;chunk() -&amp;gt; conv()&lt;/code&gt; pattern produces correct results only for the first batch element. Single-image generation (batch=1) is unaffected. Multi-image batch workflows will hit it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3: Metal Kernel Migration Regressions&lt;/strong&gt; (&lt;a href="https://github.com/pytorch/pytorch/issues/155797" rel="noopener noreferrer"&gt;pytorch/pytorch#155797&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;PyTorch 2.10-2.11 introduced additional MPS regressions during internal operator migrations. Identical symptoms reported on M3 Ultra via &lt;a href="https://github.com/Comfy-Org/ComfyUI/issues/10681" rel="noopener noreferrer"&gt;ComfyUI#10681&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why VAEDecode Is the Failure Point
&lt;/h3&gt;

&lt;p&gt;The VAE decoder is uniquely vulnerable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses Conv2d with large channel counts (hit by Bug 2)&lt;/li&gt;
&lt;li&gt;Uses GroupNorm with float16 inputs (NaN propagation)&lt;/li&gt;
&lt;li&gt;Runs in a single pass, so unlike the iterative KSampler it has no chance to self-correct&lt;/li&gt;
&lt;li&gt;Intermediate values explode to ~9.5e+25; GroupNorm cannot recover, and the output comes back all-zero or random&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Version Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PyTorch Version&lt;/th&gt;
&lt;th&gt;SDXL on MPS&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.4.1&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;td&gt;Last fully validated version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.5.x&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;td&gt;Memory +50%, speed -60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.6.x&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Some SDPA issues, &lt;code&gt;--force-fp32&lt;/code&gt; can help&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.7.x&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Similar to 2.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.8.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;SDPA non-contiguous bug introduced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2.9.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Working&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Sweet spot&lt;/strong&gt;: pre-Metal migration, SDPA bug masked by ComfyUI's attention slicing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.10.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;Black images on M3 Ultra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.11.0&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;td&gt;Black/noise on Apple Silicon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In ComfyUI venv&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/dev/ComfyUI
./venv/bin/pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.24.0 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0
./venv/bin/python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pin &lt;code&gt;torch==2.9.0&lt;/code&gt;. That is the entire fix. We wrote a &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/apple-silicon-mps-comfyui-guide.md" rel="noopener noreferrer"&gt;complete Apple Silicon MPS + ComfyUI/SDXL Compatibility Guide&lt;/a&gt; that covers diagnosis, workarounds (CPU VAE, force-fp32), environment variables, and verification steps.&lt;/p&gt;





&lt;h2&gt;
  
  
  5. Inpainting and Layer Editing
&lt;/h2&gt;

&lt;p&gt;Once you have layers, you can edit them individually without regenerating the entire artwork.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnuajrw0821aqdmdc6bf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnuajrw0821aqdmdc6bf.png" alt="Inpaint comparison" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Redraw a specific layer with a new instruction&lt;/span&gt;
vulca layers redraw ./layers/ &lt;span class="nt"&gt;--layer&lt;/span&gt; sky &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warm golden sunset"&lt;/span&gt;

&lt;span class="c"&gt;# Region-based inpaint on the composite&lt;/span&gt;
vulca inpaint art.png &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"the sky"&lt;/span&gt; &lt;span class="nt"&gt;--instruction&lt;/span&gt; &lt;span class="s2"&gt;"stormy clouds"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inpaint path uses the same provider architecture. ComfyUI receives an inpaint workflow with a mask, Gemini receives the image + mask + instruction as a multipart prompt. The same &lt;code&gt;capabilities&lt;/code&gt; system determines prompt formatting.&lt;/p&gt;
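&lt;p&gt;A sketch of that capability check (the flag names and payload shapes here are hypothetical; the post only says the &lt;code&gt;capabilities&lt;/code&gt; system decides the formatting):&lt;/p&gt;

```python
def build_inpaint_request(provider: dict, image: bytes, mask: bytes, instruction: str) -> dict:
    """Shape the inpaint payload to whatever the provider can accept."""
    caps = provider["capabilities"]
    if "workflow_json" in caps:
        # ComfyUI-style: an inpaint workflow graph with a mask input
        return {"workflow": "inpaint", "mask": mask, "prompt": instruction}
    if "multipart_prompt" in caps:
        # Gemini-style: image + mask + instruction in one multipart prompt
        return {"parts": [image, mask, instruction]}
    raise ValueError("provider supports neither inpaint path")
```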




&lt;h2&gt;
  
  
  6. What's Working, What's Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current State (v0.15.1)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All 13 traditions generating locally on Apple Silicon via ComfyUI + SDXL&lt;/li&gt;
&lt;li&gt;Full E2E pipeline validated: intent parsing, VLM planning, per-layer generation, keying, composite&lt;/li&gt;
&lt;li&gt;8 E2E phase tests passing in 2.4 seconds (mock mode)&lt;/li&gt;
&lt;li&gt;CJK prompts working end-to-end with automatic CLIP compression&lt;/li&gt;
&lt;li&gt;PNG response validation catches corrupt MPS output&lt;/li&gt;
&lt;li&gt;Structured layer generation with serial-first style anchoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Commit Trail
&lt;/h3&gt;

&lt;p&gt;The local provider path was stabilized across these commits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/b168178" rel="noopener noreferrer"&gt;&lt;code&gt;b168178&lt;/code&gt;&lt;/a&gt; — remove ANCHOR from prompt headers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/42e0e3d" rel="noopener noreferrer"&gt;&lt;code&gt;42e0e3d&lt;/code&gt;&lt;/a&gt; — skip keying for background layers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/fdc0e45" rel="noopener noreferrer"&gt;&lt;code&gt;fdc0e45&lt;/code&gt;&lt;/a&gt; — validate ComfyUI PNG response&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/74f9952" rel="noopener noreferrer"&gt;&lt;code&gt;74f9952&lt;/code&gt;&lt;/a&gt; — CLIP-aware prompt compression&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/e840496" rel="noopener noreferrer"&gt;&lt;code&gt;e840496&lt;/code&gt;&lt;/a&gt; — MPS compatibility guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vulca-org/vulca/commit/485067e" rel="noopener noreferrer"&gt;&lt;code&gt;485067e&lt;/code&gt;&lt;/a&gt; — v0.15.1 release&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Roadmap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini cloud path&lt;/strong&gt;: Currently blocked on free-tier billing limits (image generation returns &lt;code&gt;limit: 0&lt;/code&gt;). Text + VLM vision work. Once billing is enabled, Gemini becomes the zero-setup cloud alternative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM3 text-prompted segmentation&lt;/strong&gt;: Replace luminance keying with SAM3 for cleaner layer extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI / Gradio demo&lt;/strong&gt;: A browser-based interface for non-CLI users.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Get Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5-Minute Local Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install VULCA&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;vulca

&lt;span class="c"&gt;# 2. Install ComfyUI (if you don't have it)&lt;/span&gt;
git clone https://github.com/comfyanonymous/ComfyUI
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
./venv/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="c"&gt;# CRITICAL: pin PyTorch for Apple Silicon&lt;/span&gt;
./venv/bin/pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.24.0 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.9.0

&lt;span class="c"&gt;# 3. Download SDXL checkpoint&lt;/span&gt;
&lt;span class="c"&gt;# Place sd_xl_base_1.0.safetensors in ComfyUI/models/checkpoints/&lt;/span&gt;

&lt;span class="c"&gt;# 4. Start ComfyUI&lt;/span&gt;
./venv/bin/python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188

&lt;span class="c"&gt;# 5. Install Ollama + Gemma 4 (for VLM evaluation)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 6. Generate and evaluate&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_IMAGE_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VULCA_VLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama_chat/gemma4

vulca create &lt;span class="s2"&gt;"Misty mountains after spring rain"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--provider&lt;/span&gt; comfyui &lt;span class="nt"&gt;-o&lt;/span&gt; art.png

vulca evaluate art.png &lt;span class="nt"&gt;-t&lt;/span&gt; chinese_xieyi &lt;span class="nt"&gt;--mode&lt;/span&gt; reference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate any image
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;vulca&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aevaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artwork.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tradition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chinese_xieyi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access individual dimension scores
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L1 Visual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L5 Philosophy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l5&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overall: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What VULCA Is
&lt;/h3&gt;

&lt;p&gt;VULCA is an open-source SDK for AI-native cultural art creation. It brings cultural intelligence to AI art generation: 13 traditions, each with its own L1-L5 scoring rubric, terminology, and taboos.&lt;/p&gt;

&lt;p&gt;It is built on peer-reviewed research (EMNLP 2025 Findings), tested against 7,410 annotated samples (VULCA-Bench), and runs entirely on your local machine if you want it to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What VULCA Is Not
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not a ComfyUI plugin. ComfyUI is one of several image providers.&lt;/li&gt;
&lt;li&gt;Not a Midjourney alternative. VULCA does not host image generation — it orchestrates it.&lt;/li&gt;
&lt;li&gt;Not a wrapper around any single model. Swap ComfyUI for Gemini (or your own provider) with one config change.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;https://github.com/vulca-org/vulca&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;https://pypi.org/project/vulca/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Guide&lt;/strong&gt;: &lt;a href="https://github.com/vulca-org/vulca/blob/master/docs/apple-silicon-mps-comfyui-guide.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/apple-silicon-mps-comfyui-guide.md&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt;: &lt;a href="https://aclanthology.org/2025.findings-emnlp/" rel="noopener noreferrer"&gt;VULCA Framework (EMNLP 2025 Findings)&lt;/a&gt; | &lt;a href="https://arxiv.org/abs/2601.07986" rel="noopener noreferrer"&gt;VULCA-Bench (arXiv)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fvulca.svg" alt="PyPI" width="86" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pypi.org/project/vulca/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython-3.10%2B-blue.svg" alt="Python 3.10+" width="92" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/vulca-org/vulca/blob/master/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Flicense-Apache%25202.0-green.svg" alt="License: Apache 2.0" width="120" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this resonates, &lt;a href="https://github.com/vulca-org/vulca" rel="noopener noreferrer"&gt;star us on GitHub&lt;/a&gt;. Try it, break it, tell us what failed — &lt;a href="https://github.com/vulca-org/vulca/issues" rel="noopener noreferrer"&gt;issues welcome&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you use VULCA in research, please cite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight bibtex"&gt;&lt;code&gt;&lt;span class="nc"&gt;@inproceedings&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;vulca2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{VULCA: A Framework for Cultural Art Evaluation}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;booktitle&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{Findings of the Association for Computational Linguistics: EMNLP 2025}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{2025}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj032vjdb2m6n19gmevmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj032vjdb2m6n19gmevmc.png" alt="Tradition grid" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;13 traditions. One SDK. Your machine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
