DEV Community: Shawn Wang

Manus: The AI Agent That's Breaking the Internet - Revolutionary Breakthrough or Marketing Masterpiece?

Shawn Wang — Thu, 06 Mar 2025 15:57:37 +0000

In a thunderous overnight sensation, a new AI product called Manus has exploded across Chinese tech circles, claiming to be the world's first truly general-purpose AI Agent. With bold assertions of capabilities that surpass industry giants like OpenAI's DeepResearch and Anthropic's Claude, Manus has triggered both excitement and skepticism. But what exactly is this mysterious new player, and does it live up to the extraordinary hype?


The Midnight Thunder: Manus Emerges

"Last night, thunder in the deep night," as one Chinese tech blogger dramatically described it. A previously unknown team suddenly released what they claim is the first truly autonomous AI agent: Manus. The demonstrations left viewers stunned, with reactions ranging from "humans are about to be utterly defeated" to more measured skepticism.

According to enthusiasts, Manus represents the ultimate fusion of OpenAI's DeepResearch and Claude's Computer Use capabilities, with the added ability to write and execute its own code. This combination creates what one blogger called a "monster" of AI capability that arrived far sooner than expected.


What Makes Manus So Special?

Unlike conventional AI chatbots where users must guide the conversation step-by-step, Manus operates with remarkable autonomy. When presented with a complex task, it:

Automatically decomposes tasks into logical sub-steps
Plans and executes each step independently
Searches and analyzes information autonomously
Adapts to user preferences (e.g., remembering preferred output formats)
Runs operations in the cloud without requiring constant user input
Notifies users when complex processes are complete

For example, when analyzing Amazon store sales data, Manus can automatically visualize the data, analyze trends, and recommend data-driven strategies to increase next month's sales by 10%. For resume screening, it can independently extract information from dozens of resumes, rank candidates, and categorize them based on experience levels - all without manual file uploads or human intervention.

The interface features a distinctive split-screen design where users can watch Manus work in real-time through a virtual machine window. This creates what fans describe as an addictive sense of control and transparency, as users can observe the AI's "thought process" unfold through each action.

Impressive Capabilities Showcased

The demonstrations that have circulated showcase Manus performing tasks that would typically require multiple specialized tools and human intervention:

Document transformation: Converting academic PDFs into presentation-ready PowerPoints with minimal prompting
Resume screening: Autonomously extracting, analyzing, and ranking candidate information from multiple resumes
Real estate analysis: Acting as a property consultant by researching neighborhoods, safety statistics, and educational resources
Stock market analysis: Functioning as a financial advisor by collecting and cross-referencing market data
Data visualization: Creating comprehensive reports with charts and actionable recommendations

In one particularly impressive demonstration, a user simply requested "Help me create a ten-page PPT introducing Xiaomi SU7" - and Manus independently researched, organized, and generated a complete presentation without further input.

GAIA Benchmark: The Crown Jewel Claim

Perhaps most striking is Manus's claimed performance on the GAIA (General AI Assistants) benchmark - a rigorous evaluation framework developed by Meta AI and Hugging Face containing 466 carefully designed problems.

According to promotional materials, Manus has achieved the highest GAIA score to date, surpassing even OpenAI's DeepResearch. This is significant because GAIA tests practical problem-solving rather than just specialized knowledge, requiring capabilities like web searching, tool usage, programming, and file processing.

For context, when GAIA was introduced in 2023, human respondents answered about 92% of its questions correctly, while GPT-4 equipped with plugins, the strongest AI setup tested at the time, scored only about 15%. Manus's purported dominance of this benchmark has raised eyebrows throughout the AI community.

The Skeptic's Perspective: Red Flags

Despite the excitement, several aspects of Manus's sudden rise have triggered skepticism among more cautious observers:

1. The Curious Case of Regional Virality

While Manus exploded across Chinese social media starting around 6-7 AM one morning, it remains virtually unknown internationally, with only one video having a few thousand views on international platforms. This geographically isolated virality pattern is unusual for truly groundbreaking AI technology.

2. Media-First, Not Expert-First

Unlike other significant AI breakthroughs, Manus's popularity was driven primarily by influencers and content creators. Most demonstrations relied on official promotional materials, with very few independent tests. This contrasts sharply with products like DeepSeek, which gained recognition through rigorous testing by AI professionals before wider media coverage.

3. Invitation Code Economics

Manus operates on an invitation-only basis, with codes reportedly selling for as much as ¥88,000 ($12,000) on secondary markets. This artificial scarcity model resembles classic viral marketing techniques from the early mobile internet era rather than typical AI product launches.

The company has officially denied selling invitation codes, yet the synchronized wave of influencer posts - many claiming special access through "connections" - suggests a coordinated promotional campaign.

4. Contradictory Positioning

While influencers emphasize Manus as a triumph of Chinese innovation with emotional phrases like "the night sky belongs to China" and "dawn breaks in the East," the product itself presents entirely differently:

The official website is exclusively in English
Promotional videos feature English narration with Chinese subtitles
Contact information directs to Western platforms like Twitter/X
Registration requires international authentication methods

This disconnect between nationalistic marketing and international product positioning has raised questions about the company's actual target audience and strategy.

5. Technical Analysis Raises Doubts

AI developers who've analyzed available information suggest Manus may be less revolutionary than claimed. According to technical assessments shared by industry insiders from Baichuan and MGX:

Manus appears to be a combination of computer use, virtual machines, artifacts, and pre-built agents
Its positioning as a universal agent contradicts the personalized nature of agent technology
The most likely path forward would be becoming a new interface integrating various agents and compute capabilities
The technology faces significant barriers to mass adoption due to its complexity
Its core capabilities risk being internalized by larger language models in the future

The Verdict: Impressive But Unproven

Combining all available information, Manus appears to be an intriguing product with genuine capabilities, but one surrounded by marketing tactics that raise legitimate questions.

The company's representative has stated they've never paid for marketing promotion, yet the synchronized wave of influencer content and podcast appearances before launch suggests a carefully orchestrated campaign. One influencer even mentioned giving an award to the Manus team and participating in a podcast with them days before the public launch.

💡 Looking for AI Image Inspiration and Artistic Character Images?

Explore VisionGeni AI: a completely free, no-signup gallery of Stable Diffusion 3.5 & Flux images with prompts. Try our Flux prompt generator instantly to spark your creativity.

Looking Forward: The Real Test Awaits

Whether Manus represents a genuine leap forward or clever marketing will become clear once it becomes more widely available for independent testing. Until then, potential users would be wise to:

Acknowledge the impressive capabilities demonstrated while maintaining healthy skepticism
Wait for thorough evaluations from technical experts rather than influencers
Consider how Manus compares to rapidly evolving offerings from established players
Remember that genuine technological revolutions typically withstand scrutiny over time

The excitement surrounding Manus reflects the market's hunger for truly autonomous AI agents that can handle complex tasks with minimal supervision. Whether Manus fulfills that promise or simply capitalizes on the desire remains to be seen. As the dust settles and more users gain access, the true nature of this midnight thunder will become clear. Until then, both excitement and skepticism are equally warranted in the face of what might be either the next frontier of AI or simply the next chapter in tech marketing.

Alibaba Releases Wan2.1: A Breakthrough in Open-Source Video Generation Models

Shawn Wang — Wed, 05 Mar 2025 16:00:00 +0000

Introduction

Alibaba has recently open-sourced Wan2.1, a powerful video generation model that has achieved state-of-the-art performance in the field of AI video generation. Released under the Apache 2.0 license, Wan2.1 is now available for developers worldwide through GitHub and HuggingFace platforms.


Key Features of Wan2.1

Wan2.1 stands out in the AI video generation landscape with several impressive capabilities:

Superior Performance: Ranks #1 on the VBench leaderboard, outperforming both open-source and commercial models
High Resolution Support: Capable of generating videos up to 720P resolution
Low Hardware Requirements: Can run on consumer-grade GPUs with about 8 GB of VRAM (8.19 GB for the 1.3B text-to-video model)
Multilingual Text Support: Uniquely able to generate videos with both Chinese and English text/subtitles
Natural Motion: Produces videos with natural movement, avoiding the distortions common in earlier AI video models

Multiple Task Support

Wan2.1 supports a variety of generation tasks:

Task Type	Description
Text-to-Video (T2V)	Generates complete videos from text descriptions
Image-to-Video (I2V)	Creates dynamic videos from a single image
Video-Edit	AI optimization or modification of existing videos
Text-to-Image (T2I)	Generates high-quality images from text
Video-to-Audio (V2A)	Creates AI audio that matches video content

Available Models

The open-source release includes four specific models across two parameter sizes:

Wan2.1-I2V-14B-720P: 14B parameter model for generating high-definition 720P videos from images
Wan2.1-I2V-14B-480P: 14B parameter model for generating 480P videos from images
Wan2.1-T2V-14B: 14B parameter model for text-to-video generation, supporting both 480P and 720P resolutions
Wan2.1-T2V-1.3B: A lightweight 1.3B parameter model that can run on almost any consumer GPU, requiring only 8.19GB VRAM to generate a 5-second 480P video

The 1.3B model is particularly noteworthy as it outperforms other 5B parameter models and even some larger models, making it an efficient option for developers with limited computational resources.

Technical Innovations

Wan2.1 incorporates several technical innovations:

3D Spatiotemporal VAE

Wan2.1 utilizes an advanced 3D spatiotemporal variational autoencoder (Wan-VAE) that achieves:

More efficient video compression while maintaining temporal consistency
Support for 1080P long videos without losing temporal information
Faster processing and higher quality compared to traditional VAEs
2.5x faster video reconstruction speed on A800 GPUs compared to HunYuanVideo

Video Diffusion Transformer (DiT)

The model employs a mainstream video DiT structure with:

Full Attention mechanism for effective modeling of long-term spatiotemporal dependencies
Flow Matching framework combined with T5 encoder
MLP for processing time embeddings
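The Flow Matching bullet above can be made concrete with the common rectified-flow form of the training objective. This is a generic sketch. The interpolation direction, the timestep sampling, and the exact conditioning that Wan2.1 uses are assumptions here. In the notation below, $x_1$ is a video latent from the Wan-VAE, $x_0$ is Gaussian noise, $c$ is the T5 text embedding, and $u_\theta$ is the DiT's predicted velocity.

```latex
\begin{aligned}
x_t &= t\,x_1 + (1 - t)\,x_0, \qquad x_0 \sim \mathcal{N}(0, I),\ t \in [0, 1] \\
v_t &= \frac{\mathrm{d}x_t}{\mathrm{d}t} = x_1 - x_0 \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\,x_1,\,c,\,t}\,\bigl\lVert u_\theta(x_t, c, t) - v_t \bigr\rVert_2^2
\end{aligned}
```

At inference, an ODE solver integrates the learned velocity field from pure noise at t = 0 to a clean latent at t = 1. The VAE then decodes that latent into video frames.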

Data Processing

The training data went through a four-step data-cleaning workflow that focuses on three dimensions:

Basic dimensions
Visual quality
Motion quality

The pre-training process was divided into four stages, gradually increasing resolution and video duration to optimize training within computational constraints.

How to Use

Developers can download and use the models through:

GitHub: https://github.com/Wan-Video/Wan2.1
HuggingFace: https://huggingface.co/Wan-AI

The models can be run locally using a Gradio Web interface for an interactive experience.

Wan2.1 also has been integrated with ComfyUI, allowing users to leverage the model within the ComfyUI workflow: https://comfyanonymous.github.io/ComfyUI_examples/wan/
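If you want a scripted run instead of the Gradio interface, the sketch below downloads the 1.3B text-to-video checkpoint and builds a command line for the repository's generate.py script. Several details are assumptions based on our reading of the Wan2.1 README:

- the repo ID Wan-AI/Wan2.1-T2V-1.3B
- the --task, --size, --ckpt_dir, --prompt, --offload_model and --t5_cpu flags
- the helper names download_weights and build_generate_command, which are hypothetical

Check them against the current README before relying on them.

```python
"""Hypothetical helpers for fetching and running Wan2.1-T2V-1.3B locally.

The repo ID and generate.py flags are assumptions taken from the Wan2.1
README; verify them against the current README before use.
"""
import shlex

REPO_ID = "Wan-AI/Wan2.1-T2V-1.3B"  # assumed repo ID under https://huggingface.co/Wan-AI


def download_weights(local_dir: str = "./Wan2.1-T2V-1.3B") -> str:
    """Download the checkpoint with huggingface_hub (pip install huggingface_hub)."""
    # Imported lazily so the rest of this module works without the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def build_generate_command(prompt: str, ckpt_dir: str = "./Wan2.1-T2V-1.3B",
                           low_vram: bool = True) -> list[str]:
    """Build the argv for generate.py, to be run from a clone of the Wan2.1 repo."""
    cmd = [
        "python", "generate.py",
        "--task", "t2v-1.3B",
        "--size", "832*480",  # 480P output for the 1.3B model
        "--ckpt_dir", ckpt_dir,
        "--prompt", prompt,
    ]
    if low_vram:
        # Offload the model and run the T5 encoder on the CPU to reduce GPU memory use.
        cmd += ["--offload_model", "True", "--t5_cpu"]
    return cmd


if __name__ == "__main__":
    command = build_generate_command("A cat surfing a wave at sunset")
    print(shlex.join(command))
    # To generate a video, call download_weights() first, then run this
    # inside the cloned repository:
    # import subprocess; subprocess.run(command, check=True)
```

As written, the script only prints the command. Keeping generation behind an explicit subprocess call makes it easy to inspect or tweak the flags before committing GPU time.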

How to Use ComfyUI API with Python: A Complete Guide

Shawn Wang — Wed, 05 Mar 2025 04:28:04 +0000

ComfyUI is an open-source, node-based application for generating images, videos, and audio with generative AI. The graphical interface is user-friendly, but programmatic access through its API enables automation and integration into your own applications. This guide walks through two approaches to interacting with the ComfyUI API from Python.

Prerequisites

Python 3.x
websocket-client library (pip install websocket-client)
Pillow (pip install pillow), used in the examples below to decode the returned images
A running ComfyUI instance
- For local deployment: use 127.0.0.1:8188
- For remote deployment: use your server's IP address, e.g. 192.168.1.100:8188

Method 1: Basic API with Image Saving

Use this method when your ComfyUI workflow contains SaveImage nodes, which write the generated images to the server's output directory. Your script then retrieves those saved images through ComfyUI's HTTP endpoints.

Key Steps

1. Prepare the Workflow Prompt

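The workflow prompt is a JSON structure that defines your entire generation pipeline. It includes all nodes (such as model loading, sampling, and encoding) and their connections. Build your workflow in the ComfyUI interface, then export it in API format. In recent versions this option is labeled "Export (API)" in the workflow menu. In older versions it is "Save (API Format)", which appears only after you enable the dev mode options in the settings. The regular workflow save format is not accepted by the /prompt endpoint.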

prompt_text = """
{
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "cfg": 8,
            "denoise": 1,
            "seed": 8566257,
            "steps": 20,
            "sampler_name": "euler",
            "scheduler": "normal",
            "latent_image": ["5", 0],
            "model": ["4", 0],
            "positive": ["6", 0],
            "negative": ["7", 0]
        }
    },
    "4": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {
            "ckpt_name": "v1-5-pruned-emaonly.safetensors"
        }
    },
    # ... other nodes configuration
}
"""
prompt = json.loads(prompt_text)

2. Customize the Prompt

Before execution, you can modify various parameters in the prompt to customize the generation. Common modifications include changing the text prompt, seed, or sampling parameters.

# Modify the text prompt for the positive CLIPTextEncode node
prompt["6"]["inputs"]["text"] = "masterpiece best quality man"

# Change the seed for different results
prompt["3"]["inputs"]["seed"] = 5

3. Set Up WebSocket Connection

ComfyUI uses WebSocket to provide real-time updates about the generation process. This connection allows you to monitor the execution status and receive preview images during generation.

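import uuid
import websocket  # provided by the websocket-client package
server_address = "127.0.0.1:8188"  # or a remote server, e.g. "192.168.1.100:8188"
client_id = str(uuid.uuid4())  # Generate a unique client ID
ws = websocket.WebSocket()
ws.connect(f"ws://{server_address}/ws?clientId={client_id}")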

4. Queue the Prompt

Submit the generation request to ComfyUI's queue with an HTTP POST to /prompt. The response contains a unique prompt_id that we'll use to track execution and retrieve the results.

import json
import urllib.request

def queue_prompt(prompt):
    # POST the workflow together with our client_id so that WebSocket
    # updates for this prompt are routed to our connection
    p = {"prompt": prompt, "client_id": client_id}
    data = json.dumps(p).encode('utf-8')
    req = urllib.request.Request(f"http://{server_address}/prompt", data=data)
    return json.loads(urllib.request.urlopen(req).read())

# Get prompt_id for tracking the execution
prompt_id = queue_prompt(prompt)['prompt_id']
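If a workflow fails validation (for example, a missing model file or an unconnected input), urlopen raises an HTTPError and the useful explanation sits in the response body. The sketch below, using the hypothetical name queue_prompt_checked, surfaces that body. It assumes that ComfyUI answers a rejected workflow with HTTP 400 and a JSON body containing "error" and "node_errors" keys. That matches the server code as we understand it, but verify it against your version.

```python
import json
import urllib.error
import urllib.request


def queue_prompt_checked(prompt: dict, client_id: str, server_address: str) -> str:
    """POST a workflow to /prompt and return its prompt_id, or raise with the server's reason."""
    body = json.dumps({"prompt": prompt, "client_id": client_id}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{server_address}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read())["prompt_id"]
    except urllib.error.HTTPError as err:
        # The error body describes which node failed validation and why.
        details = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"ComfyUI rejected the workflow ({err.code}): {details}") from err
```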

5. Monitor Execution Status

Listen to WebSocket messages to track the generation progress. The server sends updates about which node is currently executing and when the entire process is complete. You can also receive preview images during generation.

while True:
    out = ws.recv()
    if isinstance(out, str):
        message = json.loads(out)
        if message['type'] == 'executing':
            data = message['data']
            if data['node'] is None and data['prompt_id'] == prompt_id:
                break  # Execution complete
    else:
        # Binary data (preview images)
        continue
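The loop above only watches for completion. A slightly richer version can also report sampling progress and stop on errors. The sketch below assumes two things about ComfyUI's JSON messages. First, "progress" messages carry "value" and "max" fields. Second, failures arrive as "execution_error". Both match the server as we understand it, but field names may vary between versions. The function names are hypothetical.

```python
import json


def classify_message(message: dict, prompt_id: str) -> tuple[str, object]:
    """Interpret one JSON WebSocket message relative to the prompt we queued.

    Returns ("done", None), ("node", node_id), ("progress", (value, max)),
    ("error", data) or ("other", None).
    """
    msg_type = message.get("type")
    data = message.get("data", {})
    ours = data.get("prompt_id") == prompt_id
    if msg_type == "executing" and ours:
        # node == None signals that the whole prompt has finished.
        node = data.get("node")
        return ("done", None) if node is None else ("node", node)
    if msg_type == "progress" and data.get("prompt_id", prompt_id) == prompt_id:
        # Progress without a prompt_id is treated as ours, in case the server omits it.
        return ("progress", (data["value"], data["max"]))
    if msg_type == "execution_error" and ours:
        return ("error", data)
    return ("other", None)


def wait_for_completion(ws, prompt_id: str) -> None:
    """Block until prompt_id finishes, printing sampler progress along the way."""
    while True:
        out = ws.recv()
        if not isinstance(out, str):
            continue  # binary preview frames are ignored here
        kind, detail = classify_message(json.loads(out), prompt_id)
        if kind == "done":
            return
        if kind == "progress":
            value, maximum = detail
            print(f"step {value}/{maximum}")
        elif kind == "error":
            raise RuntimeError(f"execution failed: {detail}")
```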

6. Get History and Retrieve Images

Once execution is complete, we need to:

Fetch the execution history to get information about generated images
Use that information to retrieve the actual image data through the view endpoint

import json
import urllib.parse
import urllib.request

def get_history(prompt_id):
    # /history/{prompt_id} returns {prompt_id: {"outputs": {...}, ...}}
    with urllib.request.urlopen(f"http://{server_address}/history/{prompt_id}") as response:
        return json.loads(response.read())

def get_image(filename, subfolder, folder_type):
    # /view returns the raw bytes of a file that ComfyUI saved
    data = {"filename": filename, "subfolder": subfolder, "type": folder_type}
    url_values = urllib.parse.urlencode(data)
    with urllib.request.urlopen(f"http://{server_address}/view?{url_values}") as response:
        return response.read()

# Get history for the executed prompt
history = get_history(prompt_id)[prompt_id]

# Since a ComfyUI workflow may contain multiple SaveImage nodes,
# and each SaveImage node might save multiple images,
# we need to iterate through all outputs to collect all generated images
output_images = {}
for node_id in history['outputs']:
    node_output = history['outputs'][node_id]
    images_output = []
    if 'images' in node_output:
        for image in node_output['images']:
            image_data = get_image(image['filename'], image['subfolder'], image['type'])
            images_output.append(image_data)
    output_images[node_id] = images_output

7. Process Images and Clean Up

Finally, process the retrieved images as needed (save to disk, display, or further processing) and clean up resources by closing the WebSocket connection.

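import io
from PIL import Image  # provided by the Pillow package

# Process the generated images
for node_id in output_images:
    for index, image_data in enumerate(output_images[node_id]):
        # Decode the returned bytes into a PIL Image
        image = Image.open(io.BytesIO(image_data))
        # Process the image as needed, e.g. save it under a unique name:
        # image.save(f"output_{node_id}_{index}.png")

# Always close the WebSocket connection
ws.close()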
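If you only need the files on disk, you can skip PIL entirely. The /view endpoint returns the file exactly as ComfyUI saved it, so writing those bytes directly avoids a decode and re-encode. It also keeps the workflow metadata that SaveImage embeds in its PNG files. The helper save_outputs and its <node_id>_<index>.png naming scheme are hypothetical. The .png extension assumes the default SaveImage output format.

```python
from pathlib import Path


def save_outputs(output_images: dict[str, list[bytes]], out_dir: str = "outputs") -> list[Path]:
    """Write every retrieved image to out_dir as <node_id>_<index>.png and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for node_id, images in output_images.items():
        for index, image_data in enumerate(images):
            path = target / f"{node_id}_{index}.png"
            path.write_bytes(image_data)  # bytes are already an encoded image file
            written.append(path)
    return written
```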

Method 2: WebSocket-Based Image Transfer

Use this method when your ComfyUI workflow contains SaveImageWebsocket nodes, which stream the generated images directly over the WebSocket connection instead of saving them to disk. Skipping the round trip through disk and HTTP makes this method a better fit for real-time applications.

Key Steps

1. Prepare and Customize Prompt

The workflow is the same as in Method 1, except that the SaveImage node is replaced by a SaveImageWebsocket node:

prompt_text = """
{
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "cfg": 8,
            "denoise": 1,
            "seed": 8566257,
            "steps": 20,
            "sampler_name": "euler",
            "scheduler": "normal",
            "latent_image": ["5", 0],
            "model": ["4", 0],
            "positive": ["6", 0],
            "negative": ["7", 0]
        }
    },
    # ... other nodes remain the same ...
    "save_image_websocket_node": {
        "class_type": "SaveImageWebsocket",
        "inputs": {
            "images": ["8", 0]
        }
    }
}
"""
prompt = json.loads(prompt_text)

# Customize the prompt
prompt["6"]["inputs"]["text"] = "masterpiece best quality man"
prompt["3"]["inputs"]["seed"] = 5

2. Set Up WebSocket Connection

client_id = str(uuid.uuid4())
ws = websocket.WebSocket()
ws.connect(f"ws://{server_address}/ws?clientId={client_id}")

3. Queue the Prompt

def queue_prompt(prompt):
    p = {"prompt": prompt, "client_id": client_id}
    data = json.dumps(p).encode('utf-8')
    req = urllib.request.Request(f"http://{server_address}/prompt", data=data)
    return json.loads(urllib.request.urlopen(req).read())

# Get prompt_id for tracking the execution
prompt_id = queue_prompt(prompt)['prompt_id']

4. Monitor Execution Status

As in Method 1, we monitor WebSocket messages to track execution progress. We also record which node is currently executing, so that image data can be attributed to the right node. While save_image_websocket_node is executing, every binary message we receive carries its image data, prefixed by an 8-byte header that we skip. We collect that data directly from the WebSocket stream.

current_node = ""
output_images = {}

while True:
    out = ws.recv()
    if isinstance(out, str):
        message = json.loads(out)
        if message['type'] == 'executing':
            data = message['data']
            if data['prompt_id'] == prompt_id:
                if data['node'] is None:
                    break  # Execution complete
                else:
                    current_node = data['node']
    else:
        # Handle binary image data from SaveImageWebsocket node
        if current_node == 'save_image_websocket_node':
            images_output = output_images.get(current_node, [])
            images_output.append(out[8:])  # Skip first 8 bytes of binary header
            output_images[current_node] = images_output
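The out[8:] slice above discards the binary header without looking at it. The sketch below decodes the header instead. It is based on our reading of ComfyUI's server code, and that reading is an assumption. The first 4 bytes are a big-endian uint32 event type, where 1 appears to mark a preview image. The next 4 bytes are a big-endian uint32 image format, with 1 for JPEG and 2 for PNG. The function name split_binary_frame and the IMAGE_FORMATS table are hypothetical.

```python
import struct

# Assumed meaning of the image-format field; verify against your ComfyUI version.
IMAGE_FORMATS = {1: "JPEG", 2: "PNG"}


def split_binary_frame(frame: bytes) -> tuple[int, int, bytes]:
    """Split a binary WebSocket frame into (event_type, image_format, payload)."""
    if len(frame) < 8:
        raise ValueError("frame too short to contain the 8-byte header")
    # ">II" = two big-endian unsigned 32-bit integers.
    event_type, image_format = struct.unpack(">II", frame[:8])
    return event_type, image_format, frame[8:]
```

Reading the format field lets you pick the right file extension when saving. Sampler previews, if enabled, may arrive as binary frames too. For that reason the current_node check in the loop above remains the way to tell final images from previews.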

5. Process Images and Clean Up

Once all images are collected, we can process them as needed:

import io
from PIL import Image  # provided by the Pillow package

# Process the images
for node_id in output_images:
    for image_data in output_images[node_id]:
        # Convert binary data (header already stripped) to a PIL Image
        image = Image.open(io.BytesIO(image_data))
        # Process image as needed
        # image.show()

# Clean up
ws.close()

Complete Example Code

For complete working examples of both methods, see the script_examples directory of the official ComfyUI repository:

Method 1 (Basic API): websockets_api_example.py
Method 2 (WebSocket): websockets_api_example_ws_images.py

Choosing Between Methods

Use Method 1 (Basic API) when:
- You need to persist images to disk
- You want simpler error recovery
- Network stability is a concern
Use Method 2 (WebSocket) when:
- You need real-time image processing
- You want to avoid disk I/O
- You're building an interactive application
- Performance is critical