I've been building web apps for a while now. REST APIs, GraphQL, WebSockets — the usual suspects. Nothing really surprises me anymore when it comes to integrating a backend.
Then I tried to integrate ComfyUI.
ComfyUI, if you haven't used it, is basically THE tool for local image generation right now. It's insanely powerful — you build image generation pipelines by connecting nodes in a visual graph editor. Stable Diffusion, Flux, ControlNet, upscaling, inpainting — if it exists in the image gen world, ComfyUI probably supports it.
The catch? Its API is... unique.
## The API That Speaks Node Graphs
Most APIs work like this: you send a request with some parameters, you get a response. Simple.
ComfyUI works like this: you send an entire node graph as a JSON object, where each node has connections to other nodes, and the server executes the whole graph and gives you back the result.
Here's an abridged version of what a "simple" text-to-image request looks like:
```json
{
  "3": {
    "class_type": "KSampler",
    "inputs": {
      "seed": 156680208700286,
      "steps": 20,
      "cfg": 8,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1,
      "model": ["4", 0],
      "positive": ["6", 0],
      "negative": ["7", 0],
      "latent_image": ["5", 0]
    }
  },
  "4": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": {
      "ckpt_name": "v1-5-pruned-emaonly.safetensors"
    }
  }
}
```
See those ["4", 0] references? Those are connections between nodes: node 3's "model" input connects to output index 0 of node 4. You basically have to construct an entire directed acyclic graph in JSON just to generate one image.
first time i saw this i just stared at my screen for like 10 minutes
## The WebSocket Surprise
Okay so you've built your workflow JSON. You POST it to /prompt. Cool. But how do you get the image back?
Not from the response. The response just gives you a prompt ID. To actually get your image, you need to open a WebSocket connection and listen for execution updates.
```javascript
const ws = new WebSocket('ws://localhost:8188/ws?clientId=my-app')

ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  if (data.type === 'executing' && data.data.node === null) {
    // workflow finished, now fetch the image via HTTP
  }
}
```
So the flow is: POST the workflow -> get a prompt ID -> listen on WebSocket for completion -> then fetch the actual image from another HTTP endpoint using the prompt ID.
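Glued together, the three hops look roughly like this. The endpoint paths (`/prompt`, `/ws`, `/history`, `/view`) are ComfyUI's defaults, but the helper names and the way I pull the first image out of the `/history` response are my own sketch, not a drop-in client:

```typescript
// Sketch of the full round trip: queue -> wait -> fetch.
// Assumes a local ComfyUI on its default port; helper names are hypothetical.

const COMFY = 'http://localhost:8188'

// Build the /view URL ComfyUI uses to serve finished images.
const imageUrl = (img: { filename: string; subfolder: string; type: string }) =>
  `${COMFY}/view?filename=${encodeURIComponent(img.filename)}` +
  `&subfolder=${encodeURIComponent(img.subfolder)}&type=${img.type}`

async function generate(workflow: object, clientId: string): Promise<string> {
  // 1. Queue the workflow; the HTTP response only carries a prompt ID.
  const res = await fetch(`${COMFY}/prompt`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: workflow, client_id: clientId }),
  })
  const { prompt_id } = await res.json()

  // 2. Wait on the WebSocket until the whole graph has executed.
  await new Promise<void>((resolve) => {
    const ws = new WebSocket(`ws://localhost:8188/ws?clientId=${clientId}`)
    ws.onmessage = (event) => {
      const msg = JSON.parse(String(event.data))
      if (msg.type === 'executing' && msg.data.node === null &&
          msg.data.prompt_id === prompt_id) {
        ws.close()
        resolve()
      }
    }
  })

  // 3. Look up the finished outputs in /history, then build the /view URL.
  const history = await (await fetch(`${COMFY}/history/${prompt_id}`)).json()
  const images = Object.values(history[prompt_id].outputs as Record<string, any>)
    .flatMap((o: any) => o.images ?? [])
  return imageUrl(images[0])
}
```

Note that the prompt ID comes back over HTTP but completion arrives over the WebSocket, so you have to correlate the two yourself.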
three different communication patterns for one image. cool cool cool.
## What Actually Tripped Me Up
The node graph stuff is weird but manageable once you understand it. What actually cost me sleep was the silent failures.
ComfyUI doesn't really do error messages in the traditional sense. If your workflow JSON has a bad connection, sometimes it just... doesn't execute. No error. No response. The WebSocket goes quiet and you're sitting there wondering if your code is broken or if the server crashed.
I ended up building a timeout system where if I don't get a WebSocket update within 30 seconds, I assume something went wrong and check the server's queue endpoint to see what happened. Not elegant but it works.
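The watchdog itself is simple. This is a minimal sketch rather than my exact code — `makeWatchdog` and `checkQueue` are hypothetical names, and the `/queue` response fields (`queue_pending`, `queue_running`) are what I observed from my install, so treat the shape as an assumption:

```typescript
// Hypothetical watchdog: reset the deadline on every WebSocket message,
// and if it ever fires, ask the server's /queue endpoint what's going on.

type QueueSnapshot = { pending: number; running: number }

function makeWatchdog(timeoutMs: number, onTimeout: () => void) {
  let timer = setTimeout(onTimeout, timeoutMs)
  return {
    // Call on every WebSocket message to push the deadline back.
    kick() {
      clearTimeout(timer)
      timer = setTimeout(onTimeout, timeoutMs)
    },
    // Call once the generation completes (or fails) normally.
    stop() {
      clearTimeout(timer)
    },
  }
}

// On timeout, ask ComfyUI what it thinks is happening.
async function checkQueue(baseUrl = 'http://localhost:8188'): Promise<QueueSnapshot> {
  const queue = await (await fetch(`${baseUrl}/queue`)).json()
  return {
    pending: queue.queue_pending?.length ?? 0,
    running: queue.queue_running?.length ?? 0,
  }
}
```

If the queue is empty when the watchdog fires, the workflow silently failed; if something is still running, the 30-second guess was just too aggressive.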
The other thing: model names have to match EXACTLY what's on disk. Not the display name, not a slug — the actual filename including the extension. v1-5-pruned-emaonly.safetensors, not stable-diffusion-1.5 or whatever you might expect. I spent an embarrassing amount of time debugging a "model not found" issue that turned out to be a missing .safetensors extension.
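One way to fail fast on bad names is to ask the server what's actually on disk before queueing anything. The `/object_info` endpoint and the response path below match what I've seen ComfyUI return for `CheckpointLoaderSimple`, but treat that shape as an assumption; `resolveModelName` is a hypothetical helper:

```typescript
// Guard against the exact-filename requirement: try the name as given,
// then with common model extensions appended. Helper names are my own.

function resolveModelName(requested: string, installed: string[]): string | null {
  if (installed.includes(requested)) return requested
  // Common case: the caller forgot the extension.
  for (const ext of ['.safetensors', '.ckpt']) {
    if (installed.includes(requested + ext)) return requested + ext
  }
  return null
}

// Ask ComfyUI which checkpoint files actually exist on disk.
async function installedCheckpoints(baseUrl = 'http://localhost:8188'): Promise<string[]> {
  const info = await (await fetch(`${baseUrl}/object_info/CheckpointLoaderSimple`)).json()
  // The first element of ckpt_name holds the list of valid filenames.
  return info.CheckpointLoaderSimple.input.required.ckpt_name[0]
}
```

A `null` here means you can surface a real error to the user instead of queueing a workflow that will die quietly.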
## How I Abstracted It Away
The whole point of building Locally Uncensored was that users should NOT have to think about any of this. You type a prompt, pick a model, hit generate.
So I built a workflow template system. Instead of constructing node graphs on the fly, I have pre-built workflow templates for common tasks (txt2img, img2img, video gen) where only the dynamic parameters get swapped in.
```typescript
const buildTxt2ImgWorkflow = (params: {
  prompt: string
  negativePrompt: string
  model: string
  width: number
  height: number
  steps: number
  seed: number
}) => {
  const workflow = structuredClone(TXT2IMG_TEMPLATE)
  workflow["6"].inputs.text = params.prompt
  workflow["7"].inputs.text = params.negativePrompt
  workflow["4"].inputs.ckpt_name = params.model
  // ... etc
  return workflow
}
```
This way the React frontend just deals with simple form inputs and the template layer handles all the node graph nonsense.
## The WebSocket State Machine
For the frontend, I needed to track the state of each generation request: queued, running, which node is currently executing, progress percentage, done, or failed.
I ended up with a custom React hook that manages the WebSocket connection and exposes a clean state object:
```typescript
const { status, progress, image, error } = useComfyGeneration({
  workflow: buildTxt2ImgWorkflow(params),
  onComplete: (imageUrl) => addToGallery(imageUrl)
})
```
Under the hood it's handling the WebSocket connection, matching prompt IDs, tracking node execution progress, fetching the final image, and cleaning up. Took me probably 3 days to get this right — way longer than the actual Ollama chat integration which was like an afternoon.
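The core of that hook is really just a reducer over WebSocket messages. Here's a simplified sketch — the state names are mine, though the message types (`executing`, `progress`, `execution_error`) are ones ComfyUI actually emits:

```typescript
// Simplified state machine for one generation request.
// State names are hypothetical; message shapes follow ComfyUI's WS events.

type GenState =
  | { status: 'queued' }
  | { status: 'running'; node: string; progress: number }
  | { status: 'done' }
  | { status: 'failed'; error: string }

function reduce(state: GenState, msg: { type: string; data: any }): GenState {
  switch (msg.type) {
    case 'executing':
      // node === null is ComfyUI's signal that the whole graph finished.
      if (msg.data.node === null) return { status: 'done' }
      return { status: 'running', node: msg.data.node, progress: 0 }
    case 'progress':
      return state.status === 'running'
        ? { ...state, progress: msg.data.value / msg.data.max }
        : state
    case 'execution_error':
      return { status: 'failed', error: msg.data.exception_message ?? 'unknown error' }
    default:
      return state
  }
}
```

In the real hook this sits inside a `useReducer`, with the WebSocket handler dispatching each parsed message, but the transition logic is the part that took the time to get right.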
## Was It Worth It?
Absolutely. ComfyUI's node-based architecture is what makes it so powerful and extensible. The API complexity is a direct consequence of that flexibility. And once you've abstracted it properly, you get the best of both worlds — the power of ComfyUI's ecosystem with a clean user experience on top.
If you're building something on top of ComfyUI, my advice: don't try to be clever with dynamic workflow generation at first. Start with templates. Get it working. Then add complexity later.
The full source is on GitHub if you want to see how the integration works in practice: Locally Uncensored
the ComfyUI integration code specifically lives in the api layer if you want to skip straight to the pain