Yang ella

Posted on Aug 6

Qwen-Image by Tongyi Achieves New SOTA in Image Generation, Disrupting the Open-Source Landscape

#qwenimage

The Tongyi team has open-sourced its first image model, Qwen-Image, a 20B MMDiT model that is said to rival GPT-4o in complex text rendering and image editing. It is released under the Apache 2.0 license, which permits commercial use. (The image editing mode is not yet available.)

Key highlights include:

Exceptional Text Generation and Typography: Qwen-Image excels at complex text rendering, supporting multi-line layouts, paragraph-level text generation, and refined character details. It achieves highly realistic visual output for both English and Chinese text.
Precise and Consistent Image Editing: Thanks to a reinforced multi-task joint training strategy, Qwen-Image demonstrates excellent contextual consistency during image editing, ensuring modified content blends naturally into the original scene.
Leading Performance Across Benchmarks: On multiple public benchmarks, Qwen-Image achieves state-of-the-art (SOTA) performance in both image generation and editing tasks, showing its strength as a foundation model for image generation.

Qwen-Image Free Online Experience: https://kontextflux.io/image-models/qwen-image

Qwen-Image has been comprehensively evaluated in numerous public benchmarks, consistently demonstrating superior performance and outperforming existing models across various tasks. For general image generation, the model was rigorously tested on GenEval, DPG, and OneIG-Bench. For image editing, its capabilities were assessed using benchmarks like GEdit, ImgEdit, and GSO. Notably, Qwen-Image's performance in text rendering on LongText-Bench, ChineseWord, and TextCraft is particularly outstanding, especially in Chinese text generation tasks. This consistent leading performance across diverse benchmarks establishes Qwen-Image as a top-tier image generation model, equipped with both broad general capabilities and exceptional precision in text rendering.

In addition to the official benchmark comparisons, the third-party Artificial Analysis Image Arena Leaderboard also provides a performance ranking for Qwen-Image.

Among all image generation models (both open-source and closed-source), Qwen-Image's performance is roughly on par with Flux Kontext Pro, Imagen 3.0, and Ideogram 3.0.

When compared exclusively to other open-source models, Qwen-Image indeed achieves SOTA performance.

Open-Source Model Usage

Qwen-Image's code and model weights are available on GitHub, Hugging Face, and ModelScope.

ComfyUI has already added support for Qwen-Image:

Workflow: JSON workflow
Docs: ComfyUI Native Workflow Example

Local Deployment

transformers>=4.51.3 (required for Qwen2.5-VL support)
Install the latest version of diffusers from source:
pip install git+https://github.com/huggingface/diffusers
System requirements: 24GB GPU memory and 64GB+ RAM
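Before loading the model, it can help to confirm that the installed transformers version meets the minimum listed above. The following is a minimal sketch using only the standard library. The helper names (parse_version, meets_minimum, check_package) are hypothetical, and the version comparison is deliberately simplified: it looks only at the leading numeric parts and ignores pre-release tags.

```python
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.51.3' or '4.52.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum (simplified)."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(name: str, minimum: str) -> str:
    """Return a one-line status; reports a missing package instead of raising."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    return f"{name} {installed}: {status}"


# The minimum below is the one stated in this post.
print(check_package("transformers", "4.51.3"))
```

For anything beyond a quick check, a full version parser such as the one in the packaging library would be the safer choice.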

from diffusers import DiffusionPipeline
import torch

model_name = "Qwen/Qwen-Image"

# Load the pipeline
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda"
else:
    torch_dtype = torch.float32
    device = "cpu"

pipe = DiffusionPipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
pipe = pipe.to(device)

positive_magic = {
    "en": "Ultra HD, 4K, cinematic composition.", # for english prompt
    "zh": "超清，4K，电影级构图" # for chinese prompt
}

# Generate image
prompt = '''A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".'''

negative_prompt = " "  # Use a single space when you have no specific negative prompt (recommended).

# Generate with different aspect ratios
aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

width, height = aspect_ratios["16:9"]

image = pipe(
    prompt=prompt + " " + positive_magic["en"],
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    # Use the device selected above so the script also runs without CUDA
    generator=torch.Generator(device=device).manual_seed(42)
).images[0]

image.save("example.png")
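The aspect_ratios table above lists the sizes this snippet uses. If an application receives arbitrary target sizes, one option is to snap each request to the closest listed ratio instead of passing an unusual size straight to the pipeline. The sketch below is an assumption on my part, not an official recommendation; closest_aspect_ratio is a hypothetical helper, and the table is copied from the snippet above.

```python
import math

# Same table as in the snippet above: label -> (width, height)
ASPECT_RATIOS = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}


def closest_aspect_ratio(width: int, height: int) -> str:
    """Return the label whose width/height ratio is closest to the request.

    Ratios are compared in log space so that wide and tall shapes are
    treated symmetrically (2:1 is as far from 1:1 as 1:2 is).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(label: str) -> float:
        w, h = ASPECT_RATIOS[label]
        return abs(math.log(w / h) - target)

    return min(ASPECT_RATIOS, key=distance)


# Example: map a 1920x1080 request onto the table, then look up the size.
label = closest_aspect_ratio(1920, 1080)
width, height = ASPECT_RATIOS[label]
print(label, width, height)
```

The returned width and height can then be passed to pipe(...) in place of the hard-coded "16:9" lookup.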

Showcase

I asked it to create an image summarizing Qwen-Image's capabilities for a website:

A movie poster titled "The Power of Qwen-Image". The first row is the main title in a bold, modern font: "QWEN-IMAGE: THE FUTURE OF IMAGING". The second row, directly below, reads "Witness Unparalleled Text Rendering and Precise Image Editing". The third row states "Starring: Superior Chinese & English Text Generation". The fourth row reads "Director: The 20B MMDiT Architecture". The central visual features a sleek, futuristic computer (representing the 20B MMDiT model) from which radiant colors, whimsical creatures, and dynamic, swirling patterns explosively emerge, symbolizing its generative power. Emerging from the digital energy are clear, realistic depictions of its capabilities: a shop sign with the Chinese text "云存储", a book cover with the English text "The Silent Patient", and a traditional Chinese couplet with elegant calligraphy. The background transitions from dark, cosmic tones into a luminous, dreamlike expanse, evoking a digital fantasy realm. At the bottom edge, the text "Powered by State-of-the-Art Cross-Benchmark Performance" appears in a bold, modern sans-serif font with a glowing, slightly transparent effect. The overall style blends sci-fi surrealism with graphic design flair—sharp contrasts, vivid color grading, and layered visual depth—reminiscent of visionary concept art and digital matte painting. 32K resolution, ultra-detailed, masterpiece.

However, I found that with the same prompt, an aspect ratio that deviates too far from square can make the text less clear, with letters appearing "stuck together." For example, changing the aspect ratio to 3:1 produces this (see "Witness" in the second line and "Architecture" in the fourth):

Of course, many other cases are still excellent:

A miniature raccoon explorer made of wool wearing all kinds of equipment, walking through dry grass, the whole world is made of felt textile

Some official examples:

According to tests by users on X (formerly Twitter), other languages like Japanese also work:

Practical Applications and Future Outlook

Qwen-Image's outstanding capabilities, especially its breakthrough in complex text rendering, make it more than just an image generation tool. It's a creative platform with broad applications across multiple industries.

Applications in Design and Content Creation

Advertising and Marketing: For graphic designers, generating an image with a specific slogan, brand name, and clear product information has always been a challenge. Qwen-Image handles multi-line text, varied fonts, and mixed Chinese/English text with ease, significantly shortening the production cycle for ad posters and product images.
Game Development: In games, UI elements, street signs, posters, or specific props often need to include text. With Qwen-Image, developers can quickly generate textures with precise text, eliminating the need for complex post-processing.
Education and Publishing: Teachers and publishers can use Qwen-Image to create educational illustrations or posters with clear charts, titles, and body text, for example an infographic explaining "deep learning" in which all the text is accurately rendered.

The Immense Potential of Image Editing

Although Qwen-Image's image editing mode is not yet available, its underlying architecture has already demonstrated a powerful ability to understand context. Once this feature is released, its potential will be immense:

Precise Replacement and Modification: Imagine being able to select text in an image and replace it with any new content, with the font, lighting, and style blending seamlessly with the original image. This will fundamentally change the image editing workflow.
Content Personalization: In e-commerce, you could quickly generate personalized product images with different customer names. On social media, you could easily modify the slogan or text in an image to suit different communication needs.
Seamless Integration: Whether it's adding new objects to an existing scene or applying stylistic adjustments, the powerful multi-task joint training will ensure that the edited image maintains a high degree of consistency and naturalness with the original.

AI Arena

Beyond the open-source model, to comprehensively evaluate Qwen-Image's general image generation capabilities and objectively compare it with advanced closed-source models, the team has launched AI Arena - an open-source benchmark platform based on the Elo rating system. AI Arena provides a fair, transparent, and dynamic evaluation environment for continuous comparison of different models. In each round of evaluation, the system generates two anonymous images based on the same prompt, inviting users to compare them and vote. The voting results are used to update individual and global leaderboards in real-time via the Elo algorithm, enabling a scientific, data-driven assessment of model performance. AI Arena is now open to the public.
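To make the voting mechanism concrete, here is a minimal sketch of a standard Elo update applied to pairwise votes. The post does not state AI Arena's actual parameters, so the K-factor of 32, the 400-point scale, the starting rating of 1000, and the model names are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one vote.

    score_a is 1.0 if A's image won, 0.0 if B's image won, 0.5 for a tie.
    K = 32 is a conventional default, not a documented AI Arena setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: (first model, second model, score of first)
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_y", "model_x", 1.0),
]

for a, b, score in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)

# Highest rating first
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote adds to one rating exactly what it subtracts from the other, the total rating pool stays constant, and a win over a higher-rated model moves the ratings more than a win over a lower-rated one.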

Free Online Platforms:

Qwen Chat: Tongyi's intelligent conversation platform. Select Image Generation during a chat; it can be slow at times.
Hugging Face: Can be slow.
qwen-image: Get 20 credits upon registration.
wavespeed: Get 50 generation credits upon registration.