DEV Community

YAIT

Building a Multimodal AI Pipeline: Text → Image → Text Across Three Providers

Three providers, three modalities, under 55 lines of Python — and a PNG file on disk at the end. Claude writes a sunset description, an image generation model paints it, and Qwen Vision analyzes the result. Each model does one thing well; the script wires them together.

This article walks through building exactly that pipeline using yait_aichain's Skill and Model primitives. We'll go step by step: generate text with Claude, turn that text into an image, then feed the image to Qwen Vision for analysis.

What We're Building

The pipeline has three stages:

  1. Text → Text (Claude claude-3-5-sonnet-20241022): Generate a one-sentence description of a sunset.
  2. Text → Image (imagine-image-pro): Turn that description into a 1024×1024 image.
  3. Image → Text (Qwen qwen-vl-max): Feed the generated image to a vision model and ask what it sees.

Each stage uses a different provider — Anthropic, xAI, and DashScope. The output of one stage becomes the input of the next.

Prerequisites

You need three API keys, each set as an environment variable:

export ANTHROPIC_API_KEY="your-anthropic-key"
export XAI_API_KEY="your-xai-key"
export DASHSCOPE_API_KEY="your-dashscope-key"

Install the library:

pip install yait_aichain

No extra dependencies for image handling — Python's base64 and pathlib modules cover the file I/O. yait_aichain handles provider routing internally, so you won't need to install Anthropic, xAI, or DashScope SDKs separately.

The Two Primitives You Need to Know

Model represents a connection to a specific model at a specific provider. You pass the model name and an API key — no provider-specific client classes, no adapter patterns to memorize.

Skill is a single unit of work. It takes a Model, an input (structured as messages), and optionally an output configuration. Call .run() and it executes. The message format uses a parts list inside each message, which is how yait_aichain handles multimodal content uniformly — text, images, and mixed content all go through the same structure.

Stage 1: Generating Text with Claude

import os, sys, base64, pathlib
from yait_aichain import Model, Skill

text_skill = Skill(
    model = Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input = {"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)

description = text_skill.run()
print(f"[text → text · Claude]\n{description}\n")

The input dictionary contains a messages list — identical in shape to what you'd see in a chat API. Each message has a role and a parts list. For plain text, parts is just a list of strings.

Notice the use of os.environ["KEY"] rather than os.getenv("KEY"). That's a deliberate choice for multi-provider scripts: os.getenv silently returns None when a key is missing, which defers the failure to the provider's API, where the error message is far less useful. os.environ["KEY"] raises a KeyError immediately, and the exception names the missing variable. When you're juggling three different API keys for the first time, you want to know which one is missing.
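
If you'd rather learn about every missing key in one go instead of one KeyError per run, a small preflight check can do that. This is a minimal sketch using only the standard library; missing_env_vars is a hypothetical helper name, not something yait_aichain provides.

```python
import os

# The three variables this article's pipeline reads.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "XAI_API_KEY", "DASHSCOPE_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty, in the order given."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Typical use at the top of the script:
#     missing = missing_env_vars()
#     if missing:
#         raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Treating an empty string as missing is a choice made here, on the assumption that an empty key is never valid for any of the three providers.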

text_skill.run() returns the model's response as a string. On a typical call, you'll get something like:

"The sun melted into the horizon, painting the sky in layered bands of amber, rose, and deep violet as the ocean mirrored its fading warmth."

That string becomes the input for Stage 2.

Why parts Instead of content?

The parts list is the design decision that makes multimodal work without special-casing. A text-only message uses ["some string"]. A message with an image uses a dictionary inside parts. A message with both uses both. Same field, same structure, every modality.
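
To make the three shapes concrete, here is a sketch of plain-Python builders for them. The dictionary layouts are the ones used in this article's own snippets (the typed forms appear in the Stage 3 code); the helper names text_part, image_part, and user_message are hypothetical and not part of yait_aichain.

```python
def text_part(text):
    """Typed text part, matching the dictionary shape used in Stage 3."""
    return {"type": "text", "text": text}


def image_part(b64_data, mime_type):
    """Typed image part carrying base64 data and its MIME type."""
    return {
        "type": "image",
        "source": {"kind": "base64", "data": b64_data, "mime": mime_type},
    }


def user_message(*parts):
    """Wrap any mix of bare strings and typed parts into one user message."""
    return {"role": "user", "parts": list(parts)}


# Text-only: a bare string inside parts.
text_only = user_message("Describe a sunset in one sentence.")

# Mixed: an image part followed by a text part ("aGVsbG8=" is placeholder data).
mixed = user_message(
    image_part("aGVsbG8=", "image/png"),
    text_part("What do you see in this image?"),
)
```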

Stage 2: Turning Text Into an Image

We take Claude's text output and pass it to the image generation model as a prompt:

image_skill = Skill(
    model  = Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input  = {"messages": [{"role": "user", "parts": [description]}]},
    output = {"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)

image    = image_skill.run()
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")

Two things to notice here.

The output configuration. This is the first time we specify how the response should come back. "modalities": ["image"] tells the Skill we expect an image. The "format" dictionary specifies the type and dimensions. Without this, the model might return text describing how it would generate an image — which is not helpful.

The return value. When a Skill produces an image, .run() returns a dictionary with at least two keys: "base64" (the image data) and "mime_type" (e.g., "image/png"). We decode the base64 data and write it to disk.

pathlib.Path("output_sunset.png") writes to the current working directory rather than using __file__. That's deliberate — __file__ is undefined in interactive environments like Jupyter notebooks or a REPL and raises a NameError. A relative path works consistently across all contexts.
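
The script hardcodes a .png extension. If you can't be sure the provider returns PNG, one option is to pick the extension from the "mime_type" key instead. The sketch below assumes the return dictionary has the "base64" and "mime_type" keys described above; save_image and the EXTENSIONS table are hypothetical names of my own.

```python
import base64
import pathlib

# Small explicit table; extend it if your provider returns other formats.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def save_image(image, stem="output_sunset", directory="."):
    """Decode image["base64"] and write it with an extension matching its MIME type."""
    # Unknown or missing MIME types fall back to a neutral .bin suffix.
    suffix = EXTENSIONS.get(image.get("mime_type"), ".bin")
    path = pathlib.Path(directory) / f"{stem}{suffix}"
    # validate=True rejects non-base64 characters instead of silently skipping them.
    path.write_bytes(base64.b64decode(image["base64"], validate=True))
    return path
```

With this in place, the Stage 2 save step becomes img_path = save_image(image), still writing relative to the current working directory.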

A Note on Image Sizes

"1024x1024" is a common default for image generation models. If you pass a size the model doesn't support, you'll get an error at runtime rather than a silently resized image. Check your provider's documentation for supported dimensions before you assume.
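
A cheap guard is to validate the size string locally before spending an API call on it. This sketch only checks the "WIDTHxHEIGHT" format and, optionally, membership in an allow-list that you fill in from your provider's documentation. parse_size is a hypothetical helper, and no real provider's supported sizes are encoded here.

```python
import re

# Plain ASCII "x" between two integers, e.g. "1024x1024".
_SIZE_RE = re.compile(r"(\d+)x(\d+)")


def parse_size(size, allowed=None):
    """Parse "WIDTHxHEIGHT" into (width, height), optionally checking an allow-list."""
    match = _SIZE_RE.fullmatch(size)
    if match is None:
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    if allowed is not None and size not in allowed:
        raise ValueError(f"{size!r} is not in the supported sizes: {sorted(allowed)}")
    return int(match.group(1)), int(match.group(2))
```

Note that a typographic "×" (as in 1024×1024) is rejected on purpose, since the output configuration in this article uses a plain "x".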

Stage 3: Analyzing the Image with Qwen Vision

The image from Stage 2 goes into Qwen's vision-language model:

vision_skill = Skill(
    model = Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input = {
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                              "data": image["base64"],
                                              "mime": image["mime_type"]}},
                {"type": "text",  "text": "What do you see in this image?"},
            ],
        }]
    },
)

analysis = vision_skill.run()
print(f"[image → text · Qwen]\n{analysis}")

The parts list now contains two items:

  1. An image part — a dictionary with "type": "image" and a "source" object. The source specifies "kind": "base64", the actual base64 data, and the MIME type — both pulled directly from Stage 2's output dictionary.

  2. A text part — a dictionary with "type": "text" and the question.

Same parts structure as Stage 1. The only difference is that instead of bare strings, we use typed dictionaries to describe each piece of content. The vision model receives the image and the question in a single message, and Qwen's response comes back as a plain string, something like:

"The image shows a vivid sunset over an ocean. The sky displays gradients of orange, pink, and purple. The sun is partially below the horizon, with its reflection stretching across calm water."
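
Since Stage 2 already saved the image to disk, you may want to re-run only Stage 3 against that file without generating a new image. The sketch below rebuilds the image part from a file, using the same dictionary shape as the Stage 3 code above. image_part_from_file is a hypothetical helper, and it assumes the file extension reflects the real image format.

```python
import base64
import mimetypes
import pathlib


def image_part_from_file(path):
    """Build a typed image part from an image file on disk."""
    path = pathlib.Path(path)
    # Infer the MIME type from the file name, e.g. "image/png" for *.png.
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path.name!r}")
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image", "source": {"kind": "base64", "data": data, "mime": mime}}
```

The result can go straight into a parts list next to the text question, for example image_part_from_file("output_sunset.png").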

The Complete Script

"""
Multimodal pipeline: Text → Image → Text, three different providers.

  1. text  → text   Claude  (claude-3-5-sonnet-20241022)
  2. text  → image           (imagine-image-pro)
  3. image → text   Qwen    (qwen-vl-max)

Required env vars:
    ANTHROPIC_API_KEY
    XAI_API_KEY
    DASHSCOPE_API_KEY
"""

import os, sys, base64, pathlib
from yait_aichain import Model, Skill

# ── 1. Text → Text (Claude) ──────────────────────────────────────────────────
text_skill = Skill(
    model = Model("claude-3-5-sonnet-20241022", api_key=os.environ["ANTHROPIC_API_KEY"]),
    input = {"messages": [{"role": "user", "parts": ["Describe a sunset in one sentence."]}]},
)

try:
    description = text_skill.run()
except Exception as e:
    sys.exit(f"Stage 1 failed: {e}")  # prints the message to stderr, exit status 1
print(f"[text → text · Claude]\n{description}\n")

# ── 2. Text → Image ──────────────────────────────────────────────────────────
image_skill = Skill(
    model  = Model("imagine-image-pro", api_key=os.environ["XAI_API_KEY"]),
    input  = {"messages": [{"role": "user", "parts": [description]}]},
    output = {"modalities": ["image"], "format": {"type": "image", "size": "1024x1024"}},
)

try:
    image = image_skill.run()
except Exception as e:
    sys.exit(f"Stage 2 failed: {e}")  # prints the message to stderr, exit status 1
img_path = pathlib.Path("output_sunset.png")
img_path.write_bytes(base64.b64decode(image["base64"]))
print(f"[text → image]\nsaved → {img_path}\n")

# ── 3. Image → Text (Qwen Vision) ────────────────────────────────────────────
vision_skill = Skill(
    model = Model("qwen-vl-max", api_key=os.environ["DASHSCOPE_API_KEY"]),
    input = {
        "messages": [{
            "role": "user",
            "parts": [
                {"type": "image", "source": {"kind": "base64",
                                              "data": image["base64"],
                                              "mime": image["mime_type"]}},
                {"type": "text",  "text": "What do you see in this image?"},
            ],
        }]
    },
)

try:
    analysis = vision_skill.run()
except Exception as e:
    sys.exit(f"Stage 3 failed: {e}")  # prints the message to stderr, exit status 1
print(f"[image → text · Qwen]\n{analysis}")

Three providers. Two modality transitions. Each stage wrapped in its own try/except so a failure at Stage 2 tells you it was Stage 2 — not a cryptic traceback from somewhere inside a provider SDK you didn't even know you were calling.
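
If the three near-identical try/except blocks bother you, the labeling can be factored into one helper. This is a sketch of one way to do it, not something the library ships: StageError and run_stage are hypothetical names, and the helper raises instead of exiting so the caller decides what to do with the failure.

```python
class StageError(RuntimeError):
    """Raised when a pipeline stage fails; the message names the stage."""


def run_stage(label, func, *args, **kwargs):
    """Call func and re-raise any failure with the stage label attached."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        # "from exc" keeps the original traceback available as __cause__.
        raise StageError(f"{label} failed: {exc}") from exc


# In the script above this would read, for example:
#     description = run_stage("Stage 1", text_skill.run)
```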

How the Stages Connect

There's no special "chaining" API. The variable description (a string) goes directly into image_skill's input. The variable image (a dictionary) gets its fields plucked out for vision_skill's input. Regular Python variables carry data between stages.

When you need to transform data between stages — truncating a description to 200 characters before image generation, for instance — you write normal Python between the calls. No callbacks, no middleware, no pipeline DSL. This is actually one of the things I like about this approach: the "pipeline" is just a script.
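
As an illustration of that kind of in-between step, here is a sketch of the truncation mentioned above. It cuts at a word boundary where it can, so the image prompt doesn't end mid-word. truncate_words is a hypothetical helper, and the 200-character limit is just the example figure from the paragraph above, not a documented provider limit.

```python
def truncate_words(text, limit=200):
    """Collapse whitespace and cut text to at most limit characters, at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    # Look one character past the limit so a word ending exactly at the limit is kept.
    cut = text[: limit + 1].rsplit(" ", 1)[0]
    # If there was no space at all, fall back to a hard cut.
    return cut[:limit]


# Between Stage 1 and Stage 2 this would be:
#     description = truncate_words(description, 200)
```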

The parts list is what keeps the interface uniform across modalities:

  • Text-only: "parts": ["your string here"]
  • Image-only: "parts": [{"type": "image", "source": {...}}]
  • Mixed: "parts": [image_dict, text_dict]

One structure, every model, every modality.

Swapping Providers

Notice what's absent from the Skill configurations: no Anthropic client initialization, no provider-specific headers, no DashScope SDK imports. The Model constructor takes a model name and an API key; provider routing happens internally. Swapping the image generation model means changing one string and one environment variable — nothing else in the script changes.
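
One way to make that swap a one-line change is to keep the model names and key variables in a single table. The sketch below uses the model names and environment variables from this article; STAGES and model_args are hypothetical names, and the code only prepares the arguments you would pass to Model.

```python
import os

# One entry per stage: model name plus the environment variable holding its key.
STAGES = {
    "text":   {"model": "claude-3-5-sonnet-20241022", "key_var": "ANTHROPIC_API_KEY"},
    "image":  {"model": "imagine-image-pro",          "key_var": "XAI_API_KEY"},
    "vision": {"model": "qwen-vl-max",                "key_var": "DASHSCOPE_API_KEY"},
}


def model_args(stage, env=None):
    """Return (model_name, api_key) for a stage; raises KeyError if the key is unset."""
    env = os.environ if env is None else env
    cfg = STAGES[stage]
    return cfg["model"], env[cfg["key_var"]]


# Usage with the library, following the constructor shown in this article:
#     name, key = model_args("image")
#     model = Model(name, api_key=key)
```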

Extending the Pipeline

Once you have this pattern, extensions are straightforward.

  • Add a fourth stage. Take Qwen's analysis and feed it to a text model for summarization or translation. Another Skill, another Model, same shape.
  • Branch instead of chain. Generate 3 different images from the same description using 3 different models. Compare the results by feeding all of them to the vision model in separate Skill calls.
  • Save intermediate results. The script already saves the image to disk. Add JSON logging for the text outputs and you have a full audit trail of the pipeline's execution.
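
For the audit-trail idea in the last bullet, a JSON Lines file (one JSON object per line) is probably the simplest format to append to. A minimal sketch, assuming the text outputs are plain strings as in this article; log_stage and the record fields are hypothetical choices of my own.

```python
import json
import time


def log_stage(log_path, stage, output, clock=time.time):
    """Append one JSON line describing a stage's text output and return the record."""
    record = {"stage": stage, "timestamp": clock(), "output": output}
    with open(log_path, "a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII characters readable in the log file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# After each text stage, for example:
#     log_stage("pipeline_log.jsonl", "text", description)
#     log_stage("pipeline_log.jsonl", "vision", analysis)
```

The clock parameter exists only so the timestamp can be fixed in tests; in normal use you leave it at the default.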

The models do the hard work. The code connects them — and stays out of the way.
