zaochuan5854
Stop Choosing Between Speed and LoRAs: Meet ComfyUI-TensorRT-Reforge πŸš€

πŸ‘‹ Introduction

Hey ComfyUI creators! Have you ever found yourself generating images and thinking, "I really wish this was blazingly fast"?

If you've looked into accelerating AI model inference, you've probably heard of TensorRT. While there are a few custom nodes out there that bring TensorRT to ComfyUI, they often come with frustrating trade-offs. You usually hear complaints like, "I can't use my LoRAs anymore," or "The node is outdated and unmaintained..."

To solve this, I've developed ComfyUI-TensorRT-Reforge! πŸš€ It's a brand-new custom node that lets you reap the benefits of TensorRT's insane speeds while still using your favorite LoRAs freely.

In this post, I'll walk you through how to set it up, how to use it, and take a peek at the cool tech working under the hood. Let's dive in! πŸ‘‡

πŸ™Œ Acknowledgments

This project builds upon the fantastic ComfyUI-TensorRT originally created by ComfyUI's author, comfyanonymous. Huge thanks to them!


πŸ› οΈ What's in the Box?

ComfyUI-TensorRT-Reforge is kept simple and consists of two main custom nodes:

  1. TensorRT Exporter Reforge (The Exporter)
  2. TensorRT Loader Reforge (The Loader)

Their roles are straightforward.
First, the Exporter takes your standard .safetensors model and converts it into a highly optimized TensorRT model. Next, the Loader brings that converted model into ComfyUI, wrapping it so you can use it exactly like you would any normal model.


πŸ’» System Requirements

Here are the requirements and the environment I used for testing:

|  | Target Requirements | My Test Environment |
| --- | --- | --- |
| OS | Windows 11/10, WSL, Ubuntu | Docker on WSL (see Dockerfile below) |
| GPU | RTX 2000 series or newer | RTX 4070 Ti |
| VRAM | 8 GB minimum | 12 GB |
| CUDA | 12.x | 12.8 |
| Models | SD1.5, SDXL, AuraFlow, Flux, SD3, Anima, SVD | SD1.5, SDXL, SD3, Anima |

⚠️ Note on CUDA versions
CUDA 11 and CUDA 13 are not officially supported right now. However, you might get them to work by tweaking the ONNX/TensorRT versions or export options. If you manage to get it running on those versions, please drop a comment in our Discussions sectionβ€”I'd love to hear about it!


πŸ“¦ Installation

Installing ComfyUI-TensorRT-Reforge is just as easy as any other custom node. Choose the method that best fits your workflow.

1. Via ComfyUI-Manager (Recommended)

If you use ComfyUI-Manager, you're just a few clicks away:

  1. Click on [Manager] in the ComfyUI menu.
  2. Open the [Custom Nodes Manager].
  3. Search for TensorRT-Reforge.
  4. Once you spot ComfyUI-TensorRT-Reforge, hit [Install].
  5. Restart ComfyUI.

πŸ’‘ Can't find it in the Manager?
Since this project is brand new, it might take a moment to appear in the default list. If you don't see it, you can use the "Install via Git URL" feature in the Manager, or just fall back to the manual installation below.

2. Manual Installation (git clone)

For the terminal lovers, just navigate to your ComfyUI directory and run:

```shell
# Navigate to the custom_nodes directory
cd custom_nodes

# Clone the repository
git clone https://github.com/zaochuan5854/ComfyUI-TensorRT-Reforge

# Install the required dependencies
cd ComfyUI-TensorRT-Reforge
pip install -r requirements.txt
```

πŸš€ How to Use It

Let's walk through the workflow step-by-step.
(Pro-tip: You can drag and drop the workflow image at the bottom of this article directly into ComfyUI to import it!)

Step 1: Convert your model (The Exporter)

Drop an Exporter node into your workspace and select the .safetensors model you want to turbocharge.
Next, configure your constraints: batch size, resolution range, and whether you want to enable LoRA. Give it a prefix name, and hit queue!

TensorRT-Reforge Exporter

πŸ‘† In this example, I'm converting anima-preview2.safetensors into a TensorRT model locked to a batch size of 1 and an exact 1024x1024 resolution.

⚠️ Crucial Exporting Tips:

  • The conversion process can take anywhere from 3 to 10 minutes. Grab a coffee and be patient! β˜•
  • If you don't enable LoRA here, you CANNOT apply LoRAs to this model later!
  • If you're using an Anima model or have LoRA enabled, the exporter will generate a custom .bundle file (see the appendix for nerds).

πŸ’‘ Why the "Resolution Range"?
TensorRT requires you to strictly define the shape (size) of the data it will process beforehand. Locking it to a single, exact size yields the fastest speeds, but restricts you to that specific output size. (Setting min/max to 0 leaves it unrestricted).
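To make this concrete, here's a toy Python sketch of the kind of shape check a fixed-shape engine ends up enforcing. This is my own illustration, not the node's actual validation code, and all parameter names and defaults are made up; it just models why an engine exported at an exact 1024x1024 / batch 1 will reject anything else:

```python
# Toy sketch of the shape constraints baked into a TensorRT engine at
# export time. Names and defaults here are illustrative, not the node's API.
def check_latent(width: int, height: int, batch: int,
                 *, min_wh: int = 1024, max_wh: int = 1024,
                 max_batch: int = 1) -> bool:
    """min_wh == max_wh models an engine locked to one exact resolution."""
    if not (min_wh <= width <= max_wh and min_wh <= height <= max_wh):
        raise ValueError(
            f"{width}x{height} is outside the engine's range [{min_wh}, {max_wh}]")
    if batch > max_batch:
        raise ValueError(
            f"batch size {batch} exceeds the engine's maximum of {max_batch}")
    return True

check_latent(1024, 1024, 1)  # matches the export constraints, so this passes
```

This is the trade-off in a nutshell: the tighter the range, the faster the engine, but anything outside it errors out instead of silently falling back.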

Step 2: Load the Engine (The Loader)

Once the export is done, grab a Loader node. Select the shiny new model you just created and specify the Model Type.

TensorRT-Reforge Loader

Note: The Model Type is usually auto-detected from the filename, but it never hurts to double-check!

Step 3: Setup Latents and Generate!

From here on out, just wire it up like your standard ComfyUI workflow and hit generate.

TensorRT-Reforge Anima Workflow

🚨 ALERT: Watch your Latent Sizes!
If your empty latent image size or batch size conflicts with the constraints you set in the Exporter, ComfyUI will throw an error. If you forget your settings, just check the TensorRT filenameβ€”the constraints are usually tagged right there.

TensorRT-Reforge Latent

πŸ‘† For example, here I'm using Width 1024, Height 1024, and Batch Size 1, perfectly matching my export settings.


πŸ€“ The Tech Behind It: How is it so fast AND supports LoRA?

Time for some under-the-hood geekery! How exactly does this node manage to achieve dramatic speedups while still allowing dynamic LoRA swapping?

The Basics

TensorRT: The Custom-Built F1 Car

Normally, when ComfyUI loads a .safetensors file, the math is handled by PyTorch. PyTorch is incredibly flexibleβ€”like a highly reliable, all-terrain manual transmission vehicle. But that flexibility comes with a bit of processing overhead.

TensorRT, on the other hand, performs hyper-optimizations specifically for your exact GPU:

  • Kernel Fusion: It takes separate mathematical operations (like ReLU and Conv) and smashes them together into single commands, drastically reducing memory access times.
  • Optimal Routing: It automatically benchmarks and selects the absolute fastest algorithms your specific silicon can run.

Think of TensorRT as ditching the all-terrain vehicle to build a custom F1 car tuned to drive perfectly on one specific circuit.

LoRA (Low-Rank Adaptation): Smart Delta Learning

Instead of painfully modifying the massive, billions-of-parameters base model, LoRA alters the output by simply injecting tiny matrices.

Mathematically, the update to the original weight matrix W looks like this:

$$W_{\text{updated}} = W + \Delta W = W + BA$$

Because matrices A and B are ridiculously small compared to the original, you get incredible stylistic control with barely any computational or storage overhead.
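A quick numpy sketch makes the size difference obvious. The layer width and rank below are made-up example numbers, not values from any specific model:

```python
import numpy as np

d, r = 1024, 8  # hypothetical layer width and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight: d*d parameters
B = rng.standard_normal((d, r)) * 0.01   # LoRA "down" matrix
A = rng.standard_normal((r, d)) * 0.01   # LoRA "up" matrix

W_updated = W + B @ A                    # W + delta_W: what refit pours in

lora_params = A.size + B.size            # 2 * d * r
print(f"LoRA adds {100 * lora_params / W.size:.2f}% of the base layer's params")
# β†’ LoRA adds 1.56% of the base layer's params
```

At rank 8 the LoRA matrices carry under 2% of the base layer's parameters, yet the fused weight `W + BA` has exactly the same shape as `W`, which is what makes the hot-swapping trick below possible.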

Enter "Refit": Making the Rigid Flexible

Historically, TensorRT's biggest flaw for AI artists was its rigidity.
Because it physically "compiles" that highly optimized F1 engine, if you wanted to change the weights (like adding a new LoRA), you essentially had to spend several minutes rebuilding the entire F1 car from scratch. Not ideal for rapid iteration.

The solution? A feature called Refit.

What does Refit do?

Refit allows us to rapidly overwrite the internal "weights" of the engine without altering the underlying mathematical structure.

  1. The skeleton stays intact: TensorRT keeps all of its ultra-fast, fused kernel paths.
  2. Injecting the payload: We mark specific weight zones in advance ("Hey, we might change this later!"). When you swap a LoRA, we just pour the new LoRA math directly into those pre-marked slots.
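Here's a toy Python sketch of the refit idea, purely my own illustration rather than the node's code. (The real thing goes through TensorRT's `Refitter` API, roughly `set_named_weights()` followed by `refit_cuda_engine()`, on an engine built with the refit flag enabled.)

```python
import numpy as np

class ToyEngine:
    """Toy stand-in for a compiled engine: the structure (layers, shapes)
    is frozen at build time; only the weight *values* can be refit."""
    def __init__(self, weights):
        self.weights = {k: np.array(v, dtype=np.float32)
                        for k, v in weights.items()}

    def refit(self, name, new_values):
        buf = self.weights[name]
        new = np.asarray(new_values, dtype=np.float32)
        if new.shape != buf.shape:
            raise ValueError("refit can only change values, never shapes")
        buf[...] = new  # overwrite in place, no recompilation

    def run(self, x):
        return x @ self.weights["w"]

# Build once (slow in real TensorRT), then swap LoRAs cheaply via refit:
engine = ToyEngine({"w": np.eye(2)})
B = np.array([[0.5], [0.0]])
A = np.array([[0.0, 1.0]])
engine.refit("w", np.eye(2) + B @ A)  # pour W + BA into the pre-marked slot
```

The expensive compile happens once; after that, every LoRA swap is just an in-place weight copy, which is why it takes seconds instead of minutes.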

Why is this a game changer?

By leveraging Refit, the previously agonizing process of swapping LoRAs in TensorRT is solved:

  • Lightning-fast swaps: What used to take 5-10 minutes now takes a few seconds.
  • Zero speed loss: You maintain 100% of the insane inference speeds of TensorRT while freely hot-swapping LoRAs.

"TensorRT Speeds" Γ— "LoRA Flexibility". We finally get the best of both worlds, and that's exactly what ComfyUI-TensorRT-Reforge delivers!


πŸ“ Appendix

Workflow Example

Drag and drop this image into your ComfyUI canvas to instantly import the workflow used in this article:

image.png

Dockerfile for Development

If you prefer running ComfyUI in Docker, here is my setup. Feel free to tweak it to your needs (warning: building this takes a while!).

```dockerfile
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    tzdata && \
    ln -sf /usr/share/zoneinfo/Asia/Tokyo /etc/localtime && \
    echo "Asia/Tokyo" > /etc/timezone && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential curl git tmux nano htop lsyncd ssh-client fontconfig fonts-ipafont fonts-ipaexfont \
    && rm -rf /var/lib/apt/lists/*

RUN fc-cache -fv

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /opt
ENV UV_HTTP_TIMEOUT=600
RUN uv venv $VIRTUAL_ENV --python 3.12 --seed \
    && uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 \
    && uv pip install comfy-cli ComfyUI-EasyNodes beautifulsoup4 aiohttp_retry

RUN (echo n; echo y) | comfy --workspace /opt/comfyui install --nvidia --cuda-version 12.8 \
    && comfy --workspace /opt/comfyui node install \
    ComfyUI-Impact-Pack \
    ComfyUI-Manager \
    was-node-suite-comfyui

RUN cd /opt/comfyui/custom_nodes \
    && git clone https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes.git

RUN cd /opt/comfyui/custom_nodes \
    && git clone https://github.com/rgthree/rgthree-comfy.git

RUN cd /opt/comfyui/custom_nodes \
    && git clone https://github.com/cosmicbuffalo/comfyui-mobile-frontend.git

ENV UV_INDEX_STRATEGY=unsafe-best-match
RUN cd /opt/comfyui/custom_nodes \
    && git clone https://github.com/zaochuan5854/ComfyUI-TensorRT-Reforge.git \
    && cd ComfyUI-TensorRT-Reforge \
    && uv pip install -r requirements.txt

WORKDIR /opt/comfyui

ENV COMFYUI_PATH="/opt/comfyui"
```

Bundle File Format

For advanced users who want to extract the underlying .engine or .onnx files, here is the .bundle layout. You can also reference the source code for parser logic.

```text
## File Layout
Each data chunk's role is defined by its leading ID. The appearance order within the file is arbitrary.

+-----------------------------------------+ <--- Offset 0
| [ID:1B][Size:8B][Chunk Data...]         | Data Chunk A
+-----------------------------------------+
| [ID:1B][Size:8B][Chunk Data...]         | Data Chunk B
+-----------------------------------------+
| ...                                     | (Additional Chunks in any order)
+-----------------------------------------+ <--- End of Data Chunks (data_limit)
|                                         |
|      Metadata Section (JSON)            | Variable Length (No ID prefix)
|                                         |
+-----------------------------------------+ <--- Metadata End (EOF - 8 bytes)
|      Metadata Size (8 bytes)            | uint64, Little Endian
+-----------------------------------------+ <--- EOF

## ID Definition
- 0x01: TensorRT Engine Data
- 0x02: ONNX Model Data
- 0x03: WeightsMap (JSON / Binary)
- 0x04-0xFF: Reserved for future extensions

## Parsing Logic (ID-Driven)
1. Read the last 8 bytes of the file to get `meta_size`.
2. Calculate `data_limit` = (EOF - 8 - meta_size).
3. Initialize `current_offset = 0`.
4. While `current_offset < data_limit`:
    a. Read 1 byte as `chunk_id`.
    b. Read 8 bytes as `chunk_size`.
    c. Record `data_start = current_offset + 9`.
    d. Store the mapping of `chunk_id` -> `data_start`.
    e. Jump to the next chunk: `current_offset += (9 + chunk_size)`.
5. Seek to `data_limit` and parse the Metadata JSON.
```
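The parsing logic above can be sketched in a few lines of Python. This is my own reading of the layout, not the node's actual parser, so treat it as a starting point and check it against the source code:

```python
# Hedged sketch of a .bundle parser following the layout above.
# Function and variable names are mine, not the project's.
import io
import json
import struct

def parse_bundle(path):
    """Return ({chunk_id: (data_offset, size)}, metadata_dict)."""
    with open(path, "rb") as f:
        f.seek(0, io.SEEK_END)
        eof = f.tell()
        f.seek(eof - 8)
        (meta_size,) = struct.unpack("<Q", f.read(8))  # uint64, little endian
        data_limit = eof - 8 - meta_size

        chunks = {}
        offset = 0
        while offset < data_limit:
            f.seek(offset)
            chunk_id = f.read(1)[0]                       # 1-byte ID
            (chunk_size,) = struct.unpack("<Q", f.read(8))  # 8-byte size
            chunks[chunk_id] = (offset + 9, chunk_size)   # data starts after header
            offset += 9 + chunk_size                      # jump to next chunk

        f.seek(data_limit)
        metadata = json.loads(f.read(meta_size))
    return chunks, metadata
```

With the chunk offsets in hand, extracting the raw `.engine` (ID 0x01) or `.onnx` (ID 0x02) bytes is just a seek-and-read away.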
