How I Built a Visual AI Orchestration Engine

Srinath Reddy — Tue, 19 May 2026 16:30:10 +0000

Every time I started a new AI project I wrote the same code.

Chain the LLM call. Wire up the tools. Handle the tool loop. Stream the output. Add a REST endpoint. Write logs. Fix the one case where the model calls two tools at once and the whole thing breaks.

By the fourth project I wasn't building products anymore — I was rebuilding infrastructure. So I stopped and built the infrastructure once, properly, as a visual tool I could just open and use.

The Problem

If you've built anything non-trivial with LLMs you know the real work isn't prompting. It's orchestration.

You need a loop that keeps running until the model stops calling tools
You need to pass tool results back correctly
You need to handle multiple nodes executing in the right order
You need structured output so your frontend isn't parsing free-text
You need streaming so users aren't staring at a blank screen for 8 seconds
You need a public endpoint so the workflow is actually callable from your app

Every one of these is solvable. But solving all of them, again, for every new project? That's the tax that kills momentum.

The other frustration was debugging. When something broke in a multi-step pipeline I had no visual sense of where it broke. I was reading logs and mentally reconstructing the execution graph that I should just be able to see.

My Approach

I built Pipecat — a visual AI workflow builder where you design pipelines as a DAG (Directed Acyclic Graph) on a canvas and run them via a public API.

The core idea: the execution graph should be visible. Not inferred from logs — actually drawn, with nodes that light up as they run.

The secondary idea: once built, a workflow should be callable with a single curl command. No re-deployment. No glue code. Just an API key and an endpoint.

The original target was developers — people who build AI-powered features but don't want to maintain a custom orchestration layer for every one of them. That's still the core.

But as I built it I kept running the same workflow myself: connect to an external API, search something, return a structured result. The most common version of that pattern turned out to be product search for e-commerce stores. So Pipecat grew a second use case: a drop-in AI shopping assistant for Shopify and any other storefront, built on the same DAG engine underneath. Same infrastructure, different front door — developers get a visual workflow builder and a public invoke API; merchants get a chat widget that knows their catalog and can push items straight to a Shopify cart.

Technical Breakdown

The Canvas

The workflow is a DAG. Nodes execute in topological order. The three primitives are:

Input node — receives the user's prompt
LLM node — runs the model, handles the tool-use loop internally
Output node — returns the result (plain text or structured JSON)

You wire them on a canvas. Parallel branches are supported — if two nodes have no dependency on each other they execute concurrently.

The LLM node runs an agentic loop: call the model → if it wants a tool, call it → feed the result back → repeat until the model produces a final response. max_iterations caps runaway loops.

Tools

Tools are HTTP endpoints you register in the dashboard. You give each one a name, a description, a method, a URL, headers, and parameters.

The description is what the model reads to decide whether to call the tool. Write it like a prompt, not a docstring — be specific about inputs and what the tool returns.

Parameters work exactly like function calling schemas. The model extracts values from the user's message and maps them to your API's expected fields.

Headers (including auth tokens) are encrypted at rest before storage.

The Invoke API

Once you publish a workflow it becomes callable:

curl -X POST https://api.pipecat.in/invoke/{workflow_id}/invoke \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"input": "What is the weather in Tokyo?"}'

Response:

{
  "run_id": "123e4567-...",
  "status": "success",
  "output": "It's currently 22°C and sunny in Tokyo.",
  "tool_calls": [
    {
      "tool": "get_weather",
      "input": { "location": "Tokyo" },
      "result": "{\"temperature\": 22, \"condition\": \"Sunny\"}"
    }
  ]
}

If you turn on structured output on the Output node, output becomes a parsed JSON object matching the schema you defined — not a string. Useful when the downstream consumer is code, not a human.

WebSocket Streaming

For real-time UIs there's a WebSocket endpoint:

const ws = new WebSocket(`wss://api.pipecat.in/ws/run/${workflowId}`);

ws.onopen = () => {
  ws.send(JSON.stringify({ token: apiKey, input: userMessage }));
};

ws.onmessage = (event) => {
  const { status, node_id, tool_name, output } = JSON.parse(event.data);
  // status: "running" | "tool_call" | "tool_result" | "success" | "error"
};

You get events as each node starts, as tools fire and return, and when the full workflow completes. Enough to build a proper "thinking..." UI with per-step feedback.

Multi-provider LLM Support

LLM nodes support OpenAI, Anthropic, Gemini, and OpenRouter. You switch models from the canvas without touching anything else in the workflow. Useful for cost experiments — swap GPT-4.1 for Gemini Flash on a high-volume node and see the difference in the stats panel.

Shopify Integration and Hybrid Product Retrieval

One of the workflows I built on top of Pipecat is a shop chat assistant — a widget you embed in a Shopify storefront that answers product questions and pushes items to cart. It's a good example of how the node model composes with external data.

The sync path. When a store connects their Shopify Storefront API token, we pull the full product catalog via paginated GraphQL rather than scraping HTML. Then each product goes through two enrichment passes: a Gemini call that extracts structured attributes (style, occasion, gender, materials, type) and an embedding pass that turns the product's title + description + tags into a semantic vector stored in pgvector.

The retrieval path is where it gets interesting. The naive approach would be: embed the user's query → cosine similarity → return top-N. That works fine for "red dress" but falls apart for "something under $50 for a 3-year-old boy." Vector similarity has no way to enforce a hard price constraint.

Instead, the chat endpoint does a multi-stage search that treats vector as a ranker, not a filter:

Filter extraction — an LLM call pulls structured filters out of the natural-language query: max_price, min_price, style, gender, occasion, material, type, keyword, age range.
SQL pre-filter — those filters become hard WHERE clauses against our product table. Price and gender are hard stops; style, occasion, and material use ILIKE for fuzzy matching.
Vector ranking within candidates — the query is embedded in parallel with the candidate count check. We rank the filtered set by cosine distance and return the top 10. Crucially, we only consult vector space after the structured filters have already narrowed the catalog.

The fallback cascade is what makes it reliable in practice. Three things can go wrong:

The keyword extracted from the query is too specific and returns fewer than 5 candidates — we retry without the keyword filter, keeping price/gender/style intact.
All metadata filters combined still produce fewer than 3 candidates — we drop to a pure embedding search across the whole store.
The embedding service is down — we fall back to recency-sorted SQL results.

The upshot: a query like "gift for my toddler nephew, budget around 30 USD" routes through price + gender + keyword filters first, gets ranked by semantic similarity within that matching subset, and only falls back to pure vector if the intersection is genuinely empty. Compared to pure vector search, this eliminates entire categories of wrong results (expensive items surfaced for a budget query, adult products surfaced for a children's query) before the model ever touches the output.

Support queries are short-circuited entirely. If the message matches known support keywords (returns, shipping, order status, size guide) and the store has scraped FAQ content, we skip product search and route straight to an LLM call that answers from the FAQ text. No embeddings, no SQL, just context injection.

At checkout time, the widget calls the Shopify Storefront API directly — fetching real-time variant availability and creating carts via GraphQL mutations — so inventory and pricing are always live, not cached.

If you're a Shopify merchant rather than a workflow developer, the dedicated landing page covers setup end-to-end: https://app.pipecat.in/ecomm

Going Live: One Script Tag

Once the store is connected and the product catalog is synced, embedding the assistant takes three steps:

1. Grab your embed snippet from the Overview tab:

<script
  src="https://app.pipecat.in/embed.js"
  data-store-id="YOUR_STORE_ID"
  defer
></script>

2. Paste it before </body> in your theme.

For Shopify: Admin → Online Store → Themes → Edit code → layout/theme.liquid → paste before </body> → Save.

For any other site: same snippet, same place in your HTML.

3. That's it.

The script is a self-contained IIFE. It reads the data-store-id attribute, injects a fixed-position floating bubble, and lazy-loads the chat UI inside a sandboxed iframe — no framework dependencies, no extra requests on page load. On first click it fetches your widget config (brand color, logo URL, welcome message) and applies it to the bubble in real time, so the launcher always matches your store's branding without you touching CSS.

The iframe communicates back to the parent via postMessage — when the user closes the chat, the message cw:close collapses the iframe and brings the bubble back. The whole open/close animation is CSS transitions, no JavaScript jank.

Challenges

The tool-call loop edge cases were brutal.

The spec says: model outputs tool calls → you call the tools → you feed results back → model continues. Simple. But in practice: what if the model calls three tools simultaneously? What if a tool returns an error — do you surface it to the model or abort? What if max_iterations is hit mid-loop?

I had to make explicit decisions for all of these and test each one. Concurrent tool calls now execute in parallel. Tool errors are returned to the model as error messages so it can attempt recovery. Hitting max_iterations terminates with a partial result rather than silently dropping output.

Debugging parallel branches.

When two branches execute concurrently and one fails, the error needs to be attributed to the right node without corrupting the other branch's state. Early versions had race conditions that produced interleaved log entries and wrong duration timestamps. Fixing this meant being much more deliberate about execution context isolation per-node.

Making the canvas feel fast.

The visual editor needs to feel instant even on large workflows. React Flow handles most of the heavy lifting but there were edge cases with node state updates during live execution (nodes lighting up as they run) that caused full re-renders. Fixed by memoizing node state and only updating the affected node's data rather than broadcasting to the full graph.

Demo

The default workflow when you sign up looks like this:

[Input] → [LLM] → [Output]

You can turn that into a real research agent in about 5 minutes:

Add a Tool node pointing to a search API
Attach it to the LLM node
Set the system prompt: "You are a research assistant. Use the search tool to find current information before answering."
Enable structured output — fields: summary (string), sources (array), confidence (number)
Publish
Call the invoke endpoint from anywhere

The canvas shows you each node's status in real time during a run — grey (waiting), blue (running), green (done), red (error). When a tool fires you see it light up separately from the LLM node. It makes the execution graph immediately legible.

Lessons Learned

Write the description for every tool as if the model is a new hire.

The LLM decides whether to call your tool based entirely on the description. Vague descriptions ("fetches data") produce unpredictable tool usage. Specific descriptions ("returns current weather conditions for a given city name — use this when the user asks about weather, temperature, or climate in a specific location") produce reliable behavior.

Structured output is underused.

Most people default to plain text output and then write parsing logic downstream. Defining the output schema upfront and letting the model fill it in is almost always the right move when your consumer is code. It's more reliable than regex on a free-text response and the model respects it well across all the major providers.

Don't use vector search as a filter — use it as a ranker.

The biggest retrieval mistake I see in demos is embedding the query and ranking the entire catalog by cosine similarity. That doesn't enforce hard constraints. The right pattern: extract structured filters from the query with an LLM call, apply them as SQL WHERE clauses to narrow the candidate set, then rank that subset by vector similarity. You get semantic relevance within a constraint-respecting pool, with cascading fallbacks when the intersection is too narrow.

Visual debugging changes how you think about architecture.

Once I could see the graph execute I started designing workflows differently. You notice immediately when a workflow is sequential when it could be parallel. You see duration bottlenecks at the node level. The canvas isn't just a UI — it's also a profiler.

Try It

If you're building anything that involves chaining LLM calls, calling external APIs from a model, or exposing an AI feature via a REST endpoint — Pipecat cuts the setup down significantly.

Free tier gets you started without a credit card: https://app.pipecat.in
For stores : https://app.pipecat.in/ecomm
For tutorials : https://app.pipecat.in/blogs

Happy to answer questions and hear your feedback.

Actively Iterating on the product.

Tags: ai python webdev productivity

DEV Community: Srinath Reddy