<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: brooks wilson</title>
    <description>The latest articles on DEV Community by brooks wilson (@brooks_wilson_36fbefbbae4).</description>
    <link>https://dev.to/brooks_wilson_36fbefbbae4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2875971%2F905e573c-d8b6-4eab-a6c2-f15d65278fbd.png</url>
      <title>DEV Community: brooks wilson</title>
      <link>https://dev.to/brooks_wilson_36fbefbbae4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brooks_wilson_36fbefbbae4"/>
    <language>en</language>
    <item>
      <title>DeepSeek-V4 Preview: Entering the Era of Accessible Million-Token Context</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:20:03 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/deepseek-v4-preview-entering-the-era-of-accessible-million-token-context-4bh2</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/deepseek-v4-preview-entering-the-era-of-accessible-million-token-context-4bh2</guid>
      <description>&lt;p&gt;&lt;a href="https://chat.deepseek.com/" rel="noopener noreferrer"&gt;DeepSeek-V4 Preview&lt;/a&gt;: Entering the Era of Accessible Million-Token Context&lt;/p&gt;

&lt;p&gt;Today, we are officially launching and open-sourcing the preview release of &lt;strong&gt;DeepSeek-V4&lt;/strong&gt;, our new model family.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp95649u5mk7n4g4d19d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp95649u5mk7n4g4d19d4.png" alt=" " width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://deepseek-v4.ai/" rel="noopener noreferrer"&gt;DeepSeek-V4&lt;/a&gt; supports an ultra-long &lt;strong&gt;1M-token context window&lt;/strong&gt; and reaches leading performance in China and across the open-source ecosystem in agent capabilities, world knowledge, and reasoning. The model family is available in two sizes.&lt;/p&gt;

&lt;p&gt;Starting today, you can visit &lt;strong&gt;chat.deepseek.com&lt;/strong&gt; or use the official DeepSeek app to chat with the latest DeepSeek-V4 models and explore the new experience enabled by 1M-context memory.&lt;/p&gt;

&lt;p&gt;The API service has also been updated. To call the new models, simply set the &lt;code&gt;model&lt;/code&gt; parameter to either &lt;code&gt;deepseek-v4-pro&lt;/code&gt; or &lt;code&gt;deepseek-v4-flash&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek-V4-Pro: Performance Comparable to Top Closed-Source Models
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqhq0hsufe2evob4fnea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqhq0hsufe2evob4fnea.png" alt=" " width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Significantly Improved Agent Capabilities
&lt;/h3&gt;

&lt;p&gt;Compared with the previous generation, &lt;strong&gt;DeepSeek-V4-Pro&lt;/strong&gt; delivers a substantial improvement in agent capabilities.&lt;/p&gt;

&lt;p&gt;In agentic coding evaluations, V4-Pro has reached the strongest level currently available among open-source models. It also performs well across other agent-related benchmarks.&lt;/p&gt;

&lt;p&gt;DeepSeek-V4 is now used internally as the company’s agentic coding model. According to evaluation feedback, its user experience is better than that of Sonnet 4.5, and its delivery quality is close to that of Opus 4.6 in non-thinking mode; it still trails Opus 4.6 in thinking mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rich World Knowledge
&lt;/h3&gt;

&lt;p&gt;In world knowledge evaluations, DeepSeek-V4-Pro significantly outperforms other open-source models and is only slightly behind the top closed-source model, Gemini-Pro-3.1.&lt;/p&gt;

&lt;h3&gt;
  
  
  World-Class Reasoning Performance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0fhub6jjrbbd1gzm6xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0fhub6jjrbbd1gzm6xk.png" alt=" " width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Across evaluations in mathematics, STEM, and competitive programming, DeepSeek-V4-Pro surpasses all open-source models with public benchmark results to date, achieving performance comparable to the world’s leading closed-source models.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek-V4-Flash: A Faster and More Cost-Efficient Option
&lt;/h2&gt;

&lt;p&gt;Compared with DeepSeek-V4-Pro, &lt;strong&gt;DeepSeek-V4-Flash&lt;/strong&gt; is slightly weaker in world knowledge, but demonstrates similar reasoning capabilities.&lt;/p&gt;

&lt;p&gt;Because it has fewer total parameters and activates fewer of them per token, V4-Flash can provide a faster and more economical API service.&lt;/p&gt;

&lt;p&gt;In agent evaluations, DeepSeek-V4-Flash performs on par with DeepSeek-V4-Pro on simple tasks, but still shows a gap on more difficult tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Innovation and Highly Efficient Long Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vqyno1gsgusqmetxuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vqyno1gsgusqmetxuu.png" alt=" " width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek-V4 introduces a new attention mechanism that compresses along the token dimension. Combined with &lt;strong&gt;DSA sparse attention&lt;/strong&gt;—DeepSeek Sparse Attention—it achieves globally leading long-context capability while substantially reducing compute and memory requirements compared with traditional approaches.&lt;/p&gt;
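&lt;p&gt;The announcement doesn't detail how DSA selects which tokens to attend to, but the general family it belongs to is easy to illustrate: each query attends to a small subset of keys rather than all of them, which is what cuts compute and memory at long context. Here is a toy NumPy sketch of top-k sparse attention (a sketch of the general idea, not DSA itself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy top-k sparse attention: each query keeps only its top_k highest-scoring
# keys and masks out the rest. Illustrative only -- not DeepSeek's DSA.
import numpy as np

def topk_sparse_attention(q, k, v, top_k=64):
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n_q, n_k)
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    scores = np.where(scores &gt;= kth, scores, -np.inf)    # drop all but top_k
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (n_q, d_v)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))      # 8 queries
k = rng.standard_normal((1024, 64))   # 1024 keys -- each query reads only 64
v = rng.standard_normal((1024, 64))
print(topk_sparse_attention(q, k, v).shape)  # (8, 64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;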

&lt;p&gt;Starting now, &lt;strong&gt;1M context will become the standard configuration for all official DeepSeek services&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Targeted Optimization for Agent Workloads
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 has been adapted and optimized for mainstream agent products such as &lt;strong&gt;Claude Code&lt;/strong&gt;, &lt;strong&gt;OpenClaw&lt;/strong&gt;, &lt;strong&gt;OpenCode&lt;/strong&gt;, and &lt;strong&gt;CodeBuddy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It shows improvements across code tasks, documentation generation, and related workflows. One example is a PPT slide generated by V4-Pro within an agent framework.&lt;/p&gt;


&lt;h2&gt;
  
  
  API Access
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hvjsum0b6i1pdz82nq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hvjsum0b6i1pdz82nq2.png" alt=" " width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Due to limited access to high-end compute, V4-Pro currently has very limited service throughput. Its pricing is expected to drop significantly in the second half of the year once Ascend 950 supernodes begin coming online at scale.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The DeepSeek API now supports both &lt;strong&gt;V4-Pro&lt;/strong&gt; and &lt;strong&gt;V4-Flash&lt;/strong&gt;, with compatibility for the &lt;strong&gt;OpenAI Chat Completions API&lt;/strong&gt; and the &lt;strong&gt;Anthropic API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;base_url&lt;/code&gt; remains unchanged. To access the new models, set the &lt;code&gt;model&lt;/code&gt; parameter to one of the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepseek-v4-pro
deepseek-v4-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
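&lt;p&gt;As a minimal sketch of what a call looks like through the OpenAI-compatible interface, assuming the OpenAI Python SDK and DeepSeek's existing &lt;code&gt;https://api.deepseek.com&lt;/code&gt; base URL (the announcement says &lt;code&gt;base_url&lt;/code&gt; is unchanged):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: calling DeepSeek-V4 via the OpenAI-compatible API.
# Assumes the OpenAI Python SDK and the existing DeepSeek base URL.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # unchanged, per the announcement
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Summarize the V4 release in one line."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;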



&lt;p&gt;Both V4-Pro and V4-Flash support a maximum context length of &lt;strong&gt;1M tokens&lt;/strong&gt;. Both models support non-thinking mode and thinking mode.&lt;/p&gt;

&lt;p&gt;In thinking mode, the &lt;code&gt;reasoning_effort&lt;/code&gt; parameter can be used to set the reasoning intensity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;high
max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For complex agent scenarios, we recommend using thinking mode and setting the reasoning intensity to &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;
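&lt;p&gt;As a sketch (reusing the client from the example above), and assuming &lt;code&gt;reasoning_effort&lt;/code&gt; is accepted as a top-level request parameter the way recent OpenAI SDKs pass it, since the announcement doesn't show its exact placement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: thinking mode at maximum reasoning intensity.
# Placement of reasoning_effort is an assumption; see the API docs below.
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    reasoning_effort="max",  # "high" or "max", per the values listed above
    messages=[{"role": "user", "content": "Plan a multi-step refactor of this repo."}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;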

&lt;p&gt;For model invocation and parameter configuration, please refer to the API documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://api-docs.deepseek.com/zh-cn/guides/thinking_mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that the two legacy API model names, &lt;code&gt;deepseek-chat&lt;/code&gt; and &lt;code&gt;deepseek-reasoner&lt;/code&gt;, will be discontinued in three months, on &lt;strong&gt;July 24, 2026&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;During the transition period, these two model names will point to the following modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deepseek-chat      -&amp;gt; deepseek-v4-flash, non-thinking mode
deepseek-reasoner  -&amp;gt; deepseek-v4-flash, thinking mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Open Weights and Local Deployment
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 model weights are available at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://huggingface.co/collections/deepseek-ai/deepseek-v4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://modelscope.cn/collections/deepseek-ai/DeepSeek-V4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DeepSeek-V4 technical report is available here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Do not be tempted by praise, do not fear criticism. Follow the right path, and hold yourself upright.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thank you to every user for your trust and support. Your recognition, suggestions, and expectations are what drive us to keep exploring and improving. They also remind us to stay true to our original mission and remain focused on continuous innovation.&lt;/p&gt;

&lt;p&gt;We will continue to follow a long-termist approach, move forward steadily through experimentation and reflection, and keep working toward the goal of AGI.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GPT Image 2: What It Is, What It Can Do, and Why It's Different From Every AI Image Tool That Came Before</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:56:15 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/gpt-image-2-what-it-is-what-it-can-do-and-why-its-different-from-every-ai-image-tool-that-came-5068</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/gpt-image-2-what-it-is-what-it-can-do-and-why-its-different-from-every-ai-image-tool-that-came-5068</guid>
      <description>&lt;p&gt;On April 21, 2026, OpenAI dropped something the industry has been waiting on for about a year: &lt;strong&gt;GPT Image 2&lt;/strong&gt; (branded as &lt;em&gt;ChatGPT Images 2.0&lt;/em&gt; inside the chat product).&lt;/p&gt;

&lt;p&gt;The launch wasn't quiet. Within 24 hours, GPT Image 2 was sitting at #1 across all three LM Arena image leaderboards — text-to-image (Elo 1512), single-image editing (1513), and multi-image editing (1464) — and had already been integrated by Figma, Canva, Adobe Firefly, fal, and Hermes Agent.&lt;/p&gt;

&lt;p&gt;But the benchmark numbers aren't really the story. The story is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the first time, an image model will stop, think about your request, search the web if it needs to, check its own work, and only &lt;em&gt;then&lt;/em&gt; start drawing pixels.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That change sounds small when you summarize it. It isn't. It's the same architectural shift that turned chat models from "autocomplete engines" into something you can actually give a problem to. Now it's happening in image generation.&lt;/p&gt;

&lt;p&gt;This is a long guide. Here's what it covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;a href="https://wavespeed.ai/image-generator" rel="noopener noreferrer"&gt;GPT Image 2&lt;/a&gt; actually is (and what's new about the architecture)&lt;/li&gt;
&lt;li&gt;The five capabilities that make it a different category of tool&lt;/li&gt;
&lt;li&gt;Five hands-on prompts I ran myself, with notes on why each one matters&lt;/li&gt;
&lt;li&gt;Pricing, with real per-image cost math&lt;/li&gt;
&lt;li&gt;Head-to-head comparison with Midjourney, Nano Banana Pro, Flux.2, and Stable Diffusion&lt;/li&gt;
&lt;li&gt;Where GPT Image 2 still fails&lt;/li&gt;
&lt;li&gt;How to use it in ChatGPT and through the API&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're evaluating whether to build image generation into your product — or whether to cancel your Midjourney subscription — the goal of this article is to save you two or three hours of research.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is GPT Image 2?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT Image 2 is OpenAI's third-generation native image generation model, and the first image model in the industry with built-in reasoning capabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two things in that sentence matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Native"&lt;/strong&gt; means GPT Image 2 generates images the same way GPT generates text: token by token, inside the language model itself. Older tools like DALL-E 3 were diffusion models bolted onto ChatGPT as an external module. GPT Image 2 is part of the same transformer stack that handles language, which is why it understands prompts the way it does. It knows what a "magazine cover" is because it knows what &lt;em&gt;everything&lt;/em&gt; is — the same world knowledge that makes GPT-5 useful for text is now rendering pixels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Reasoning"&lt;/strong&gt; means the model borrows the thinking-then-answering architecture from OpenAI's o-series. Before a single pixel is committed, GPT Image 2 can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze the semantic intent of your prompt&lt;/li&gt;
&lt;li&gt;Plan composition, spatial layout, and typography&lt;/li&gt;
&lt;li&gt;Reason about physical and logical constraints (shadows match the light source, reflections match geometry, text is legible at the intended size)&lt;/li&gt;
&lt;li&gt;Search the web mid-generation for reference imagery or factual data&lt;/li&gt;
&lt;li&gt;Generate multiple candidate images and self-select the best one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That loop is what "thinking mode" means in practice. The immediate consequence is that complex prompts — the kind that used to require three or four tries on older models — now succeed on the first attempt significantly more often.&lt;/p&gt;

&lt;p&gt;The model ID for developers is &lt;code&gt;gpt-image-2&lt;/code&gt;. It's live on ChatGPT, Codex, and the OpenAI API simultaneously, which is unusual — OpenAI typically staggers releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Quick Family Tree
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gpt-image-1&lt;/strong&gt; — April 2025. The first native image model inside GPT. Launched with the Studio Ghibli meme that briefly broke Twitter; 130M+ users generated 700M+ images in the first week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpt-image-1.5&lt;/strong&gt; — December 2025. Up to 4× faster, better instruction following on edits, warmer color cast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpt-image-2&lt;/strong&gt; — April 2026. Reasoning, 2K native resolution, near-perfect multilingual text, ~3-second generation, multi-image consistency. The warm color cast is gone.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Architecture Matters (Short Version)
&lt;/h2&gt;

&lt;p&gt;If you want the technical reason GPT Image 2 behaves differently from Midjourney and Flux, it's this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diffusion models start with noise and gradually denoise toward an image.&lt;/strong&gt; Stable Diffusion, Midjourney, Flux, DALL-E — all diffusion. The upside is beautiful gradients and painterly output. The downside is that the model doesn't really "know" what it's drawing halfway through; it's just denoising toward a target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autoregressive models write the image from left to right, token by token&lt;/strong&gt;, the same way you'd write a sentence. Each visual token is conditioned on every token that came before it. The upside is logical consistency — if the model wrote "E = mc²" on a blackboard in the top-left, it knows that text is there when drawing the rest of the scene. The downside, historically, has been speed and resolution.&lt;/p&gt;

&lt;p&gt;GPT Image 2 is autoregressive. Adding the reasoning step on top means the model plans the composition &lt;em&gt;before&lt;/em&gt; it starts generating tokens, which reduces the chance of the sequence painting itself into a corner.&lt;/p&gt;

&lt;p&gt;This is why you'll see GPT Image 2 nail things that stump diffusion models: precise text, 3×3 grids where each cell stays separate, infographics with real labels, UI mockups with working hierarchies. These are &lt;em&gt;sequential logic&lt;/em&gt; problems, not &lt;em&gt;aesthetic&lt;/em&gt; problems.&lt;/p&gt;
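&lt;p&gt;A schematic way to see the sequential half of that contrast (a toy sketch, not either architecture's actual implementation): in autoregressive decoding, every new token is conditioned on the full prefix, so earlier decisions constrain later ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy autoregressive decoder: the next "visual token" depends on everything
# generated so far. Schematic only -- not GPT Image 2's implementation.
import random

class ToyARModel:
    def sample_next(self, prefix):
        random.seed(hash(tuple(prefix)))  # conditioning on the full prefix
        return random.randrange(4096)     # toy codebook of 4096 visual tokens

def autoregressive_decode(model, n_tokens):
    tokens = []
    for _ in range(n_tokens):
        tokens.append(model.sample_next(prefix=tokens))
    return tokens

print(autoregressive_decode(ToyARModel(), 8))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;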




&lt;h2&gt;
  
  
  The Five Capabilities That Matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Thinking Mode — The Headline Feature
&lt;/h3&gt;

&lt;p&gt;GPT Image 2 has two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant&lt;/strong&gt; — Direct generation, ~3 seconds per image, similar UX to the older models. Available to all ChatGPT users including the free tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking&lt;/strong&gt; — The model reasons about composition, can search the web, generates multiple candidates, and self-checks outputs. Available to ChatGPT Plus, Pro, Business, and Enterprise users; available to all API users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thinking mode is where the bigger jumps in quality show up. Examples OpenAI highlighted at launch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page-long manga from a single prompt&lt;/strong&gt;, with the same character drawn consistently across 6–8 panels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full magazine layouts&lt;/strong&gt; with proper headlines, subheads, body text, captions, and image placement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design plans for every room in a house&lt;/strong&gt;, maintaining a coherent aesthetic across images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social media graphic sets&lt;/strong&gt; (think: Instagram story + post + reel cover) with matching typography and brand feel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With thinking mode enabled, a single prompt can return up to 8 images at once. Consistency across those 8 images — same character, same product, same style — is what multi-image editing tools used to do in multiple manual passes.&lt;/p&gt;
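&lt;p&gt;From the API side, a hedged sketch of what a consistent batch request could look like, assuming the images endpoint's standard &lt;code&gt;n&lt;/code&gt; parameter carries over to &lt;code&gt;gpt-image-2&lt;/code&gt; (the batch interface isn't spelled out in the launch notes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: one prompt, one consistent batch of up to 8 images.
# The `n` parameter here is an assumption based on the standard images API.
from openai import OpenAI

client = OpenAI()
response = client.images.generate(
    model="gpt-image-2",
    prompt="An 8-panel storyboard of the same astronaut character, varied poses",
    n=8,
    reasoning_effort="high",  # thinking mode, per the API section below
)
print(len(response.data))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;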

&lt;h3&gt;
  
  
  2. Near-Perfect Multilingual Text Rendering
&lt;/h3&gt;

&lt;p&gt;This is probably the single most important practical upgrade.&lt;/p&gt;

&lt;p&gt;Text rendering has been the Achilles' heel of AI image generation since DALL-E. If you asked Midjourney to write a Chinese headline or a Japanese caption on a poster, you'd get convincingly font-like shapes that weren't actually characters. GPT Image 2 changes that.&lt;/p&gt;

&lt;p&gt;LM Arena blind tests report &lt;strong&gt;near character-level 100% accuracy&lt;/strong&gt; on short-to-medium text across English, Chinese (Simplified and Traditional), Japanese, Korean, Hindi, Bengali, and Arabic. One tester's quote captured the scale of the change: "The gap between GPT Image 2 and Nano Banana Pro on text is as big as the gap between Nano Banana Pro and DALL-E."&lt;/p&gt;

&lt;p&gt;What this unlocks, concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Localized marketing assets&lt;/strong&gt; across multiple languages from a single prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Posters, packaging, and signage&lt;/strong&gt; that ship without a Photoshop pass to fix the text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infographics and charts&lt;/strong&gt; with correct numerical labels and legends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI mockups&lt;/strong&gt; with real button labels, menu items, and status text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-panel comics&lt;/strong&gt; with coherent dialogue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Longer paragraph text — paragraphs of body copy inside a generated image — is still an area where Nano Banana Pro sometimes holds an edge. If you're generating document-style posters with a lot of small body text, test both before committing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Native 2K Resolution, Experimental 4K
&lt;/h3&gt;

&lt;p&gt;GPT Image 2 renders at up to 2048×2048 natively. Custom dimensions are supported as long as both edges are multiples of 16 and the total pixel count stays within the model's budget. Practical sizes include 1024×1024, 1920×1080, 2560×1440, and tall verticals like 1280×3840 for mobile-first content.&lt;/p&gt;

&lt;p&gt;Above 2K, OpenAI officially labels the output "experimental." In practice: 4K sometimes works beautifully, sometimes shows artifacts at the edges or inconsistencies across large areas. The production-recommended workflow for anything beyond 2K is &lt;strong&gt;generate at 2K, then run through a dedicated upscaler&lt;/strong&gt; like Magnific or Topaz. That path is also cheaper.&lt;/p&gt;
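&lt;p&gt;The size rule is easy to encode. A minimal sketch, with a hypothetical pixel budget since the exact cap isn't published in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the documented size rule: edges must be multiples of 16 and the
# total pixel count must fit the model's budget. MAX_PIXELS is a hypothetical
# placeholder; OpenAI's exact budget isn't stated here.
MAX_PIXELS = 5_000_000

def is_valid_size(width: int, height: int) -&gt; bool:
    return width % 16 == 0 and height % 16 == 0 and width * height &lt;= MAX_PIXELS

for w, h in [(1024, 1024), (2560, 1440), (1280, 3840)]:
    assert is_valid_size(w, h)

# Note: 1920x1080 from the list above fails the multiple-of-16 check as
# stated (1080 % 16 == 8), so one of the two constraints is presumably
# looser in practice.
print(is_valid_size(1920, 1080))  # False under the rule as written
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;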

&lt;h3&gt;
  
  
  4. Precise Editing via Masked Inpainting and Outpainting
&lt;/h3&gt;

&lt;p&gt;The editing endpoint supports mask images. You pass the original image plus a mask (black and white PNG indicating where changes are allowed), and the model modifies only the masked region — unrelated pixels stay pixel-identical.&lt;/p&gt;

&lt;p&gt;Use cases where this is dramatically better than full-image regeneration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product photo background swaps&lt;/strong&gt; — new setting, same product, same lighting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packaging visualization&lt;/strong&gt; — update copy or logos without redrawing the box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outfit and accessory replacement&lt;/strong&gt; — swap one item while preserving the rest of the scene&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative design refinement&lt;/strong&gt; — change one element at a time across a long review cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practical testing, GPT Image 2 handles chained edits (edit → edit → edit, building on each other) more stably than any of the competing models.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Speed: ~3 Seconds Per Image
&lt;/h3&gt;

&lt;p&gt;Arena observers clocked GPT Image 2 at roughly 3 seconds per generation in instant mode. Nano Banana Pro takes 10–15 seconds. Midjourney V7 is typically 30–60 seconds for a standard grid.&lt;/p&gt;

&lt;p&gt;Three seconds is an interactive experience. Ten seconds needs a loading animation. Thirty seconds is a queue. This is why the speed difference matters more than it looks on paper — the UX pattern for a 3-second model is completely different from the UX pattern for a 30-second model.&lt;/p&gt;

&lt;p&gt;Thinking mode is slower, usually 15–40 seconds depending on prompt complexity, because the reasoning step generates additional tokens. Still faster than Midjourney, still plenty fast for batch workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Hands-On Prompts, With Notes
&lt;/h2&gt;

&lt;p&gt;These five prompts are designed to hit the specific capabilities listed above. Each one comes with a short note explaining &lt;em&gt;what I was trying to stress-test&lt;/em&gt; and &lt;em&gt;what the expected result shows&lt;/em&gt;. If you want to run them yourself, they work best in thinking mode.&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompt 1 — Multilingual Magazine Cover
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What this tests:&lt;/strong&gt; The flagship capability. Text rendering across five scripts on a single composition (Latin, Chinese, Japanese, Korean, Arabic), combined with editorial layout discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is the single hardest thing to do with older models. Midjourney V7 will fail at the Chinese title; DALL-E 3 will fail at the Arabic subtitle; every diffusion model will mangle at least one of these scripts. If GPT Image 2 gets all of them right with correct typography and layout, that's the defining proof that this is a different category of model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A vertical magazine cover titled "AI 浪潮" in bold modern Chinese 
typography, with English subtitle "Issue No.47 — The GPT Image 2 Era". 
Below, three smaller headlines in three languages:
- 日本語：「画像生成の新時代」
- 한국어："이미지 생성의 미래"
- العربية: "عصر جديد"

Design style: editorial minimalism, deep navy background with a soft 
orange accent stripe on the left edge, photorealistic lighting, paper 
texture. The Chinese main title takes up roughly 40% of the cover 
height. Price tag: $9.99 in the bottom right corner.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqufp36evqaye7wadur80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqufp36evqaye7wadur80.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 2 — Infographic with Real Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What this tests:&lt;/strong&gt; Structured layout with multiple content zones, data visualization (a simple line chart), mixed typography at different sizes, and — critically — correctly rendered numerical labels. Plus, the content itself is a meta joke: it's an infographic &lt;em&gt;about&lt;/em&gt; GPT Image 2, which means I'm asking the model to describe its own capabilities on a poster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Infographics are what Midjourney and older diffusion models completely collapse on. The data points have to line up, the labels have to be readable, the hierarchy has to make sense. This is also the exact use case most business users care about — quarterly reports, product one-pagers, pitch deck slides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A clean vertical infographic titled "GPT Image 2 at a Glance".

- Header: a small abstract geometric logo "G2", subtitle 
  "Released April 21, 2026"
- Section 1: a simple line chart showing "Text Accuracy" rising from 
  71% (Midjourney V7) → 87% (GPT Image 1.5) → ~100% (GPT Image 2). 
  Label each data point clearly.
- Section 2: three small stat cards — "2K native resolution", 
  "~3 sec per image", "$0.21 per HD image"
- Section 3: a horizontal bar labeled "Supports: English · 中文 · 
  日本語 · 한국어 · हिन्दी · বাংলা · العربية"

Sans-serif typography, off-white #F9F9F8 background, navy and warm 
orange as accent colors, flat vector style, Apple-like clean layout. 
Readable at mobile size.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faycdxnrt99v9ca12mpyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faycdxnrt99v9ca12mpyu.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompt 3 — Photorealistic App UI Mockup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What this tests:&lt;/strong&gt; Object realism (an iPhone) combined with screen-within-screen generation — the model has to render both the physical device and a plausible UI running on it. Status bar details, button states, and small UI text all need to be right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Product teams spend a lot of time making mockups for investor decks, design reviews, and marketing pages. If GPT Image 2 can generate convincing device mockups from a text description, that's hours saved per sprint. This capability was what convinced LM Arena testers that the model was a step-change — UI reconstruction is another problem that's really a sequential-logic problem disguised as a visual one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A photorealistic iPhone 16 Pro mockup floating at a slight angle on a 
soft gray gradient background. On the screen: a mobile app UI titled 
"ImageLab" with:

- Top nav: "Home · Create · Gallery" tabs, the middle one highlighted 
  in orange
- Main area: a 2×2 grid of generated image thumbnails with captions 
  "Portrait · Product · Infographic · Poster"
- Bottom: a prompt input bar with placeholder text "Describe what you 
  want to create..." and a blue "Generate" button
- Status bar shows 9:41, full battery, 5G

Style: clean SaaS product UI, subtle drop shadows, realistic glass 
reflection on the phone screen, studio lighting. Add a small floating 
caption under the phone that reads "Built with GPT Image 2".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felvxg28voltdqt9rtyy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felvxg28voltdqt9rtyy1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompt 4 — Four-Panel Comic With Character Consistency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What this tests:&lt;/strong&gt; Multi-image consistency, one of the headline features of thinking mode. The same character has to appear in all four panels with recognizable facial features, clothing, and hairstyle — while the expression, pose, and background change. Dialogue bubbles have to read correctly. Panel layout has to follow Western reading order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Multi-panel consistency is the capability that separates "image generator" from "visual storytelling tool." Without it, you can't make comics, storyboards, product sequences, or tutorial illustrations without heavy manual work. OpenAI put a ton of weight on this at launch — page-long manga from a single prompt was one of their flagship demos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A 4-panel black-and-white manga-style comic strip, arranged 2×2, with 
clean dialogue bubbles in English.

- Panel 1: A tired-looking designer at a messy desk, surrounded by 
  printed drafts. Thought bubble: "I need 20 variations by tomorrow..."
- Panel 2: The designer types a prompt into a laptop glowing with a 
  subtle "GPT Image 2" UI. Motion lines suggest speed.
- Panel 3: A wide shot of a grid of finished posters appearing on the 
  screen, each clearly different but on-brand. Designer's eyes wide 
  with shock: "Wait, all of them... in one shot?"
- Panel 4: The designer leaning back, coffee in hand, feet on desk, 
  monitor in background showing "✓ Done". Caption at the bottom: 
  "The new creative workflow."

Style: crisp ink lines, screentone shading, consistent character 
design across all 4 panels.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4wuuni2yijgtv51x1tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4wuuni2yijgtv51x1tl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompt 5 — Commercial Product Shot With Two Types of Text
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What this tests:&lt;/strong&gt; The all-in-one challenge. Photorealism, material rendering (matte metal, walnut wood, leather), controlled depth of field, studio-grade lighting — &lt;em&gt;and&lt;/em&gt; two different kinds of text in the same image (engraved serif on the pen, handwritten cursive on the card). A lot of specialized photography skills compressed into one prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is what real commercial use looks like. Product photographers charge hundreds of dollars per shot to set up this kind of scene. If GPT Image 2 can produce a usable version of it, it's not just a curiosity — it's a production tool. This is also the prompt where material realism matters most, and where Flux.2 Pro historically held an edge. Worth seeing whether GPT Image 2 has closed that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A hyper-realistic product hero shot of a minimalist matte-black 
fountain pen lying at a slight angle on a smooth dark walnut desk 
surface.

- Engraved on the pen barrel in fine silver serif text: 
  "CRAFTED FOR CLARITY · EST. 2026"
- Next to the pen, a small folded card with handwritten cursive text 
  that reads: "Dear Reader, thank you for choosing us."
- Soft window light from the top-left, creating long gentle shadows 
  and a subtle highlight on the metallic clip.
- Shallow depth of field, the back of the desk softly out of focus, 
  with a hint of a leather notebook and a cup of black coffee.

Photography style: commercial editorial, shot on Phase One, 85mm, f/2.8.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpj656b9hmpb5aflfa98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpj656b9hmpb5aflfa98.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pricing: ~$0.21 Per HD Image, Thinking Mode Extra
&lt;/h2&gt;

&lt;p&gt;OpenAI prices GPT Image 2 by tokens, not by image. Here's the rate card:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price per 1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text input&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text output&lt;/td&gt;
&lt;td&gt;$10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input&lt;/td&gt;
&lt;td&gt;$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input (cached)&lt;/td&gt;
&lt;td&gt;$2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image output&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Translated to per-image costs at common sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Approximate cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024×1024&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×1024&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;$0.053&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×1024&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;$0.211&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×1536&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×1536&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;$0.041&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024×1536&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;$0.165&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At 1024×1024 high quality, GPT Image 2 is about 60% more expensive than GPT Image 1.5&lt;/strong&gt; ($0.211 vs $0.133). That's the cost of the larger internal canvas and the reasoning step. But at &lt;strong&gt;1024×1536, GPT Image 2 is actually cheaper&lt;/strong&gt; than its predecessor ($0.165 vs $0.20). The pricing math shifts with aspect ratio in non-obvious ways, so benchmark for your exact use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking mode consumes additional reasoning tokens.&lt;/strong&gt; A simple illustration prompt might add a few thousand reasoning tokens. A multi-panel comic with complex layout constraints can add a lot more. Budget for variable per-image cost when doing layout-heavy work, not a flat rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cached image inputs are 4× cheaper&lt;/strong&gt; ($2 vs $8 per million tokens). If you're doing iterative editing on the same source image, the second and subsequent requests get a meaningful discount.&lt;/p&gt;

&lt;p&gt;For high-volume use cases, the cost ladder typically looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Iterate 10–20 drafts at &lt;code&gt;quality=low&lt;/code&gt; (~$0.006 each)&lt;/li&gt;
&lt;li&gt;Narrow to 2–3 directions at &lt;code&gt;quality=medium&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Render the final at &lt;code&gt;quality=high&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps the total spend per final asset under $0.50 even for complex work.&lt;/p&gt;
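&lt;p&gt;A quick sanity check of that ladder, using the 1024×1024 per-image rates from the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope cost of the draft -&gt; final ladder at 1024x1024,
# using the per-image rates quoted in the pricing table.
LOW, MEDIUM, HIGH = 0.006, 0.053, 0.211  # $ per image

drafts = 15 * LOW     # explore 15 directions cheaply
narrow = 3 * MEDIUM   # refine the 3 best
final  = 1 * HIGH     # render the winner
print(f"total per final asset: ${drafts + narrow + final:.3f}")  # $0.460
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;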




&lt;h2&gt;
  
  
  GPT Image 2 vs Midjourney vs Nano Banana Pro vs Flux.2
&lt;/h2&gt;

&lt;p&gt;There's no single winner. Each model is optimized for a different primary constraint.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;GPT Image 2&lt;/th&gt;
&lt;th&gt;Nano Banana Pro&lt;/th&gt;
&lt;th&gt;Midjourney V7&lt;/th&gt;
&lt;th&gt;Flux.2 Pro&lt;/th&gt;
&lt;th&gt;Stable Diffusion / DALL-E 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native autoregressive + reasoning&lt;/td&gt;
&lt;td&gt;Multimodal diffusion + search grounding&lt;/td&gt;
&lt;td&gt;Diffusion&lt;/td&gt;
&lt;td&gt;Diffusion&lt;/td&gt;
&lt;td&gt;Diffusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100%, multilingual&lt;/td&gt;
&lt;td&gt;87–96%, strong on long paragraphs&lt;/td&gt;
&lt;td&gt;~71%, weak&lt;/td&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ o-series thinking&lt;/td&gt;
&lt;td&gt;✅ Search grounding&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~3s / ~15–40s thinking&lt;/td&gt;
&lt;td&gt;10–15s&lt;/td&gt;
&lt;td&gt;30–60s&lt;/td&gt;
&lt;td&gt;5–10s&lt;/td&gt;
&lt;td&gt;5–20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native resolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2K (4K experimental)&lt;/td&gt;
&lt;td&gt;4K native&lt;/td&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;1–2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Vertex AI&lt;/td&gt;
&lt;td&gt;❌ Discord/web only&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text, reasoning, UI, infographics, speed&lt;/td&gt;
&lt;td&gt;Consistency, 4K, long-form editing&lt;/td&gt;
&lt;td&gt;Artistic style, cinematic look&lt;/td&gt;
&lt;td&gt;Material realism&lt;/td&gt;
&lt;td&gt;Open source, self-hostable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portrait realism, spatial reasoning (reflections)&lt;/td&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;No API, no precise control&lt;/td&gt;
&lt;td&gt;Instruction following&lt;/td&gt;
&lt;td&gt;Text, complex instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per HD image&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$0.21&lt;/td&gt;
&lt;td&gt;~$0.039–$0.151&lt;/td&gt;
&lt;td&gt;~$0.033 (subscription)&lt;/td&gt;
&lt;td&gt;$0.06–$0.15&lt;/td&gt;
&lt;td&gt;Near-zero (self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Which Should You Actually Use?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pick GPT Image 2 when:&lt;/strong&gt; you need accurate text, you're generating UI mockups, you're doing infographics or data viz, you want reasoning over composition, you need the fastest generation in production, or you want integration with the rest of the OpenAI stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Nano Banana Pro when:&lt;/strong&gt; you need true 4K, you need 14-image reference capability, you need maximum consistency across many edits, or you need SynthID watermarking for compliance. It's also the current choice for enterprise through Google Cloud with copyright protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Midjourney when:&lt;/strong&gt; you need art direction, cinematic mood, stylistic coherence, or aesthetic output for creative applications. Midjourney still wins on pure aesthetic. No API, so automation isn't an option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Flux.2 when:&lt;/strong&gt; you need material realism (fabrics, skin, surfaces) or you need an open-source model you can self-host and fine-tune on your own data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Stable Diffusion / open-source models when:&lt;/strong&gt; cost per image must approach zero, you need custom training, or you have regulated data that can't leave your infrastructure.&lt;/p&gt;

&lt;p&gt;A pattern that's emerged in 2026: &lt;strong&gt;production teams run two models in parallel.&lt;/strong&gt; Midjourney for concepts and moodboards, GPT Image 2 or Nano Banana Pro for final production assets. The subscription math still works out because each tool is better at its specific job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where GPT Image 2 Still Fails
&lt;/h2&gt;

&lt;p&gt;It's not flawless. Things to watch for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portrait realism at close range.&lt;/strong&gt; LM Arena blind tests show Nano Banana Pro ahead on fine skin texture, hair detail, and emotional nuance in portraits. If you're doing fashion photography or beauty close-ups, test both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spatial reasoning on reflective surfaces.&lt;/strong&gt; The classic failure case is a Rubik's cube in a mirror — the reflection should be geometrically correct, and GPT Image 2 sometimes gets this wrong. If your scene depends on precise reflection physics (a product in a mirror, a character reflected in a store window), verify before shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-reference consistency over long sequences.&lt;/strong&gt; Thinking mode maintains consistency across 6–8 images from a single prompt. Beyond that — a 12-panel story, a 20-shot product catalog — consistency starts drifting. Nano Banana Pro with its 14-image reference capability handles longer sequences better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense body paragraphs.&lt;/strong&gt; Single headlines, short captions, UI labels — GPT Image 2 is near-perfect. Long paragraphs of small body text in a poster-style image still occasionally have artifacts. Nano Banana Pro is currently better for document-style output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real person likenesses.&lt;/strong&gt; OpenAI's safety layer actively blocks generation of recognizable real people. If your workflow needs celebrity likenesses or real-person reference, this is a hard limit and won't change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4K at production quality.&lt;/strong&gt; Experimental for a reason. Use 2K + upscaler instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use It: ChatGPT and API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  In ChatGPT
&lt;/h3&gt;

&lt;p&gt;As of April 22, 2026, every ChatGPT and Codex user can use ChatGPT Images 2.0 directly in the web or mobile interface. The entry point is the same as before — just prompt for an image.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free users:&lt;/strong&gt; instant mode only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plus ($20/month) and above:&lt;/strong&gt; instant + thinking mode, web search during generation, multi-image consistency, up to 8 images per prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside Codex, image generation is integrated into the workspace and does not require a separate API key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Via API
&lt;/h3&gt;

&lt;p&gt;The endpoint follows the same &lt;code&gt;/images/generations&lt;/code&gt; pattern as previous models. Pass &lt;code&gt;gpt-image-2&lt;/code&gt; as the model ID.&lt;/p&gt;

&lt;p&gt;Python example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-image-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A hyperrealistic fountain pen on a walnut desk...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024x1024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# optional: enables thinking mode
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;image_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;size&lt;/code&gt; — any dimensions where both edges are multiples of 16 and total pixels stay within budget&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quality&lt;/code&gt; — &lt;code&gt;low&lt;/code&gt; / &lt;code&gt;medium&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt;. Start with &lt;code&gt;low&lt;/code&gt; during iteration.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reasoning_effort&lt;/code&gt; — &lt;code&gt;minimal&lt;/code&gt; / &lt;code&gt;low&lt;/code&gt; / &lt;code&gt;medium&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt;. Controls thinking mode strength. Higher effort burns more reasoning tokens but improves first-attempt success on complex layouts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For editing, the &lt;code&gt;/images/edits&lt;/code&gt; endpoint accepts an image URL plus an optional mask PNG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-image-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;background-mask.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Replace the background with a dramatic overcast sky&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate limits and batch behavior are documented in the OpenAI API docs. Queue-based async patterns are supported through the standard job endpoints and also through third-party platforms like fal if you need higher throughput.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Tips (From Running It for a Week)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start every project at &lt;code&gt;quality=low&lt;/code&gt;.&lt;/strong&gt; The cost drops 35× compared to high quality, and low quality is genuinely usable for ideation. Switch to high only once direction is locked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. For text-heavy prompts, always turn on thinking mode.&lt;/strong&gt; The first-attempt success rate improvement is large enough to save money on retries even after accounting for reasoning token cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Portrait formats are often cheaper.&lt;/strong&gt; 1024×1536 high quality is $0.165, less than 1024×1024 at $0.211, and portrait is the right shape for mobile-first content (Instagram, TikTok, WeChat) anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Don't force 4K in production.&lt;/strong&gt; Use 2K + a dedicated upscaler. More reliable, cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. For portraits and fashion work, keep a Nano Banana Pro or Flux.2 backup.&lt;/strong&gt; GPT Image 2 is great for most things, but these are the two domains where it sometimes loses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Cache image inputs for iterative edits.&lt;/strong&gt; The 4× discount on cached image tokens adds up fast over a review cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Use the &lt;code&gt;reasoning_effort&lt;/code&gt; parameter strategically.&lt;/strong&gt; &lt;code&gt;minimal&lt;/code&gt; for simple illustration prompts, &lt;code&gt;medium&lt;/code&gt; for standard work, &lt;code&gt;high&lt;/code&gt; only for complex layouts where first-attempt success actually matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between ChatGPT Images 2.0 and GPT Image 2?&lt;/strong&gt;&lt;br&gt;
Same thing, two names. ChatGPT Images 2.0 is the consumer product name; &lt;code&gt;gpt-image-2&lt;/code&gt; is the API model ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it free for ChatGPT users?&lt;/strong&gt;&lt;br&gt;
Instant mode is free for everyone including the free tier. Thinking mode, web search during generation, and multi-image consistency are limited to Plus, Pro, Business, and Enterprise plans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does one high-quality image cost through the API?&lt;/strong&gt;&lt;br&gt;
About $0.211 at 1024×1024 and $0.165 at 1024×1536. Thinking mode adds variable reasoning token costs on top. Budget $0.25–$0.40 per complex thinking-mode image to be safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can it generate images of real people?&lt;/strong&gt;&lt;br&gt;
Not recognizable real people — OpenAI's safety layer blocks this at both the input and output stages. Fictional characters, generic people, and stylized representations are fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it replace Midjourney?&lt;/strong&gt;&lt;br&gt;
For text, UI, infographics, and technical work — yes, immediately. For aesthetic concept art and cinematic mood pieces — no, Midjourney's artistic sensibility is still unmatched. Many teams subscribe to both and route by use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the output commercially usable?&lt;/strong&gt;&lt;br&gt;
Yes. Generated images follow OpenAI's standard commercial usage terms. All outputs include C2PA metadata identifying the model, which helps with provenance but does not restrict use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I run it offline or self-host it?&lt;/strong&gt;&lt;br&gt;
No. GPT Image 2 is closed-source and only available through OpenAI's API or through platforms that proxy to it (Azure Foundry, fal, OpenRouter, and similar). For self-hosting, look at Flux.2 or Stable Diffusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;GPT Image 2 isn't a replacement for Midjourney or a clone of Nano Banana Pro. It's the first image model that &lt;strong&gt;reasons before it draws&lt;/strong&gt; — the same architectural shift that turned chat models into thinking assistants, now applied to pixels.&lt;/p&gt;

&lt;p&gt;Three things are worth your attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual text rendering is effectively solved&lt;/strong&gt;, which means a huge category of business visuals (posters, infographics, localized ads, UI mockups) can skip the Photoshop pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode + multi-image consistency&lt;/strong&gt; means comics, storyboards, design systems, and product catalogs can be generated in coherent batches rather than one-at-a-time retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~3 seconds per image at $0.21&lt;/strong&gt; makes GPT Image 2 viable as a production API, not just a creative toy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For founders, developers, designers, and content creators, this is the most significant image model update since Midjourney V6. If you've been waiting for the moment to build image generation into a product, this is it.&lt;/p&gt;

&lt;p&gt;The next 6 months will be about seeing what people actually make with it. I'll be watching.&lt;/p&gt;




&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/introducing-chatgpt-images-2-0/" rel="noopener noreferrer"&gt;Introducing ChatGPT Images 2.0&lt;/a&gt; — OpenAI's official launch post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://wavespeed.ai/pricing" rel="noopener noreferrer"&gt;GPT Image 2 API Pricing&lt;/a&gt; — Current token rates and calculators&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.openai.com/api/docs/models/gpt-image-2" rel="noopener noreferrer"&gt;GPT Image 2 API Documentation&lt;/a&gt; — Developer reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/introducing-openais-gpt-image-2-in-microsoft-foundry/4500571" rel="noopener noreferrer"&gt;GPT Image 2 on Microsoft Foundry&lt;/a&gt; — Enterprise deployment guide&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>An Anonymous Model Just Took #1—and Flipped the AI Video Race Overnight</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Sat, 11 Apr 2026 15:31:22 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/an-anonymous-model-just-took-1-and-flipped-the-ai-video-race-overnight-g88</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/an-anonymous-model-just-took-1-and-flipped-the-ai-video-race-overnight-g88</guid>
      <description>&lt;h2&gt;
  
  
  How “HappyHorse” Disrupted the AI Video Generation Landscape
&lt;/h2&gt;

&lt;h2&gt;
  
  
  A Sudden Shift in the Rankings
&lt;/h2&gt;

&lt;p&gt;On April 7, the global AI community woke up to an unexpected development: a previously unknown model named &lt;strong&gt;HappyHorse-1.0&lt;/strong&gt; appeared at the top of the &lt;strong&gt;Artificial Analysis Video Arena leaderboard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxblns8lj10x0aa6f2wl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxblns8lj10x0aa6f2wl.png" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reaction was immediate and widespread. Developers and researchers began sharing results and speculating about its origin. The model demonstrated capabilities that felt notably ahead of what many had seen in production systems.&lt;/p&gt;

&lt;p&gt;Within hours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It ranked &lt;strong&gt;#1 in text-to-video&lt;/strong&gt; with a score of 1332&lt;/li&gt;
&lt;li&gt;Achieved &lt;strong&gt;1391 in image-to-video&lt;/strong&gt;, setting a new record&lt;/li&gt;
&lt;li&gt;Placed &lt;strong&gt;#2 globally in audio-integrated video generation&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The margin wasn’t incremental—it was decisive. The previous leader, ByteDance’s &lt;strong&gt;Seedance 2.0&lt;/strong&gt;, was surpassed by nearly 60 points.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Carefully Orchestrated Release
&lt;/h2&gt;

&lt;p&gt;The timeline suggests this was not a spontaneous breakthrough, but a deliberate rollout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early April 7 (UTC):&lt;/strong&gt; &lt;a href="https://happyhorses.io/" rel="noopener noreferrer"&gt;HappyHorse&lt;/a&gt;-1.0 appears on the leaderboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Morning:&lt;/strong&gt; Discussion spreads rapidly across X (Twitter) and developer communities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Afternoon:&lt;/strong&gt; Speculation intensifies—possible origins include Alibaba, ByteDance, Tencent, or even DeepSeek&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 8 (Market Open):&lt;/strong&gt; Alibaba’s stock rises significantly, reflecting market speculation&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Later that day:&lt;/strong&gt; A website appears claiming &lt;strong&gt;full open-source release&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base model&lt;/li&gt;
&lt;li&gt;Distilled variants&lt;/li&gt;
&lt;li&gt;Super-resolution modules&lt;/li&gt;
&lt;li&gt;Inference code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This sequence reveals three key signals:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Timing Was Strategic
&lt;/h3&gt;

&lt;p&gt;The model was likely developed over months and released at a moment designed to maximize visibility and impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Anonymity Was Intentional
&lt;/h3&gt;

&lt;p&gt;A team capable of building such a system would not lack marketing channels. Remaining anonymous suggests one of two goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid disrupting existing commercial products&lt;/li&gt;
&lt;li&gt;Test market and community reactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Open Source Was the Real Move
&lt;/h3&gt;

&lt;p&gt;Releasing a state-of-the-art model as open source fundamentally lowers barriers across the industry.&lt;/p&gt;

&lt;p&gt;Closed models compete on pricing and access. Open models reshape the baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes HappyHorse Technically Notable?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Ultra-Fast Inference
&lt;/h3&gt;

&lt;p&gt;Traditional video diffusion models typically require &lt;strong&gt;dozens to hundreds of denoising steps&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seedance 2.0:&lt;/strong&gt; ~2–4 minutes per video&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HappyHorse:&lt;/strong&gt; ~8 steps, &lt;strong&gt;under 1 minute&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notably, it achieves this &lt;strong&gt;without classifier-free guidance (CFG)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This has direct implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower compute cost (roughly halved)&lt;/li&gt;
&lt;li&gt;Higher throughput for production workloads&lt;/li&gt;
&lt;li&gt;Better scalability for content pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams producing video at scale, this translates into &lt;strong&gt;significant operational efficiency gains&lt;/strong&gt;.&lt;/p&gt;
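&lt;p&gt;To see where the "roughly halved" figure comes from: classifier-free guidance runs the denoiser twice per step (once conditioned, once unconditioned), while a guidance-free model runs it once. A minimal illustrative sketch; the names are generic, not from HappyHorse:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cfg_step(model, x, t, cond, scale=7.5):
    # Classifier-free guidance: TWO forward passes per denoising step.
    eps_cond = model(x, t, cond)
    eps_uncond = model(x, t, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def guidance_free_step(model, x, t, cond):
    # A guidance-distilled model bakes the effect into ONE pass,
    # so per-step compute is roughly halved.
    return model(x, t, cond)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;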

&lt;h3&gt;
  
  
  2. Native Audio-Video Generation
&lt;/h3&gt;

&lt;p&gt;HappyHorse adopts a &lt;strong&gt;joint audio-video generation architecture&lt;/strong&gt;, producing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environmental sound&lt;/li&gt;
&lt;li&gt;Background music&lt;/li&gt;
&lt;li&gt;Dialogue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All synchronized at &lt;strong&gt;millisecond-level precision&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This eliminates the need for post-processing steps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio alignment&lt;/li&gt;
&lt;li&gt;Manual dubbing&lt;/li&gt;
&lt;li&gt;Timeline synchronization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this moves output closer to &lt;strong&gt;production-ready assets&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Diffusion Transformer (DiT) Architecture
&lt;/h3&gt;

&lt;p&gt;The model reportedly uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;40-layer single-stream Transformer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;8-step diffusion inference&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This aligns with the &lt;strong&gt;Diffusion Transformer (DiT)&lt;/strong&gt; approach, known for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster inference&lt;/li&gt;
&lt;li&gt;Strong controllability&lt;/li&gt;
&lt;li&gt;Optimization-friendly structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design choice is consistent with Alibaba’s &lt;strong&gt;Wan series&lt;/strong&gt;, which has emphasized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified audio-video generation&lt;/li&gt;
&lt;li&gt;High-speed inference&lt;/li&gt;
&lt;li&gt;Transformer-based diffusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From a technical perspective, HappyHorse appears to be a &lt;strong&gt;more mature iteration of this direction&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Many Believe It’s Alibaba
&lt;/h2&gt;

&lt;p&gt;While initially anonymous, several factors point toward Alibaba:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architecture aligns closely with the &lt;strong&gt;Wan model family&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Alibaba released &lt;strong&gt;Wan 2.7 Video&lt;/strong&gt; just days earlier&lt;/li&gt;
&lt;li&gt;The timing suggests a two-step strategy:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Launch a commercial product (Wan 2.7)&lt;/li&gt;
&lt;li&gt;Follow with an open-source release (HappyHorse)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally, the involvement of &lt;strong&gt;Zhang Di&lt;/strong&gt;, a former key contributor to Kuaishou’s Kling AI, fits the timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Joined Alibaba in late 2025&lt;/li&gt;
&lt;li&gt;Led video generation efforts&lt;/li&gt;
&lt;li&gt;Delivered a major release within ~4 months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination of talent and timing strengthens the attribution hypothesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Implications: Open Source vs Closed Models
&lt;/h2&gt;

&lt;p&gt;Alibaba’s potential strategy becomes clearer when viewed through a product lens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dual-Track Positioning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wan 2.7:&lt;/strong&gt; Enterprise-grade, paid API&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stability&lt;/li&gt;
&lt;li&gt;Control&lt;/li&gt;
&lt;li&gt;Support&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;HappyHorse:&lt;/strong&gt; Open-source ecosystem driver&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Community adoption&lt;/li&gt;
&lt;li&gt;Developer engagement&lt;/li&gt;
&lt;li&gt;Talent attraction&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This allows Alibaba to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain revenue from enterprise customers&lt;/li&gt;
&lt;li&gt;Expand influence through open-source adoption&lt;/li&gt;
&lt;li&gt;Avoid cannibalizing its own pricing model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pressure on Competitors
&lt;/h3&gt;

&lt;p&gt;For ByteDance (Seedance):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Option 1: Accelerate &lt;strong&gt;Seedance 3.0&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Option 2: Compete on price&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both increase cost and competitive pressure.&lt;/p&gt;

&lt;p&gt;For smaller developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-source alternatives reduce reliance on expensive APIs&lt;/li&gt;
&lt;li&gt;Cost-sensitive teams may shift away from closed platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Open Source Hits Competitors Harder
&lt;/h3&gt;

&lt;p&gt;Open source changes the economics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Closed models rely on &lt;strong&gt;compute-heavy APIs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Open models shift cost to &lt;strong&gt;local or distributed deployment&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this context, open source acts less as a monetization tool and more as a &lt;strong&gt;strategic lever&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry Context: Competition Is Intensifying
&lt;/h2&gt;

&lt;p&gt;The AI video generation space is entering a more competitive phase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI’s &lt;strong&gt;Sora&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ByteDance’s &lt;strong&gt;Seedance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Kuaishou’s &lt;strong&gt;Kling&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Alibaba’s &lt;strong&gt;Wan / HappyHorse&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each iteration pushes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generation quality&lt;/li&gt;
&lt;li&gt;Latency reduction&lt;/li&gt;
&lt;li&gt;Cost efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pace of progress is accelerating, and the gap between research and production systems continues to shrink.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Whether HappyHorse ultimately proves as strong as initial benchmarks suggest is still subject to verification. Some details remain unconfirmed, and official sources are limited.&lt;/p&gt;

&lt;p&gt;However, regardless of attribution, the signal is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inference efficiency is becoming a primary battleground&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audio-video integration is moving toward default capability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open vs closed strategies will shape market structure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI video race is no longer just about model quality—it’s about &lt;strong&gt;distribution, cost, and ecosystem control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that competition is only getting started.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>software</category>
    </item>
    <item>
      <title>Happy Horse 1.0: What We Actually Know About the Model That Topped Artificial Analysis' Video Arena</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Wed, 08 Apr 2026 15:16:45 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/happy-horse-10-what-we-actually-know-about-the-model-that-topped-artificial-analysis-video-arena-31he</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/happy-horse-10-what-we-actually-know-about-the-model-that-topped-artificial-analysis-video-arena-31he</guid>
      <description>&lt;p&gt;&lt;strong&gt;Happy Horse 1.0: What We Actually Know About the Model That Topped Artificial Analysis' Video Arena&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An unfamiliar model called &lt;strong&gt;HappyHorse-1.0&lt;/strong&gt; is currently sitting at #1 on Artificial Analysis' Video Arena, the blind user-voted benchmark widely used to evaluate AI video generation systems. This post summarizes what's verifiable from public sources and what remains unconfirmed, because the gap between those two categories is larger than usual for a model at this rank.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ittutk8k3de9xhjw5to.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ittutk8k3de9xhjw5to.png" alt=" " width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's on the leaderboard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;From Artificial Analysis' public text-to-video (no audio) leaderboard, as of April 8, 2026:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Creator&lt;/th&gt;
&lt;th&gt;Elo&lt;/th&gt;
&lt;th&gt;95% CI&lt;/th&gt;
&lt;th&gt;Samples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://happyhorses.io/" rel="noopener noreferrer"&gt;HappyHorse-1.0&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;HappyHorse&lt;/td&gt;
&lt;td&gt;1,355&lt;/td&gt;
&lt;td&gt;±11&lt;/td&gt;
&lt;td&gt;5,062&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Dreamina Seedance 2.0 720p&lt;/td&gt;
&lt;td&gt;ByteDance Seed&lt;/td&gt;
&lt;td&gt;1,273&lt;/td&gt;
&lt;td&gt;±8&lt;/td&gt;
&lt;td&gt;8,130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;SkyReels V4&lt;/td&gt;
&lt;td&gt;Skywork AI&lt;/td&gt;
&lt;td&gt;1,245&lt;/td&gt;
&lt;td&gt;±9&lt;/td&gt;
&lt;td&gt;5,712&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Kling 3.0 1080p (Pro)&lt;/td&gt;
&lt;td&gt;KlingAI&lt;/td&gt;
&lt;td&gt;1,242&lt;/td&gt;
&lt;td&gt;±9&lt;/td&gt;
&lt;td&gt;5,262&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Kling 3.0 Omni 1080p (Pro)&lt;/td&gt;
&lt;td&gt;KlingAI&lt;/td&gt;
&lt;td&gt;1,230&lt;/td&gt;
&lt;td&gt;±10&lt;/td&gt;
&lt;td&gt;4,776&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Three observations worth pulling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gap is statistically clean.&lt;/strong&gt; An 82-point Elo lead over #2 is not within the noise floor of a preference-based arena. HappyHorse-1.0's confidence interval (1,344–1,366) doesn't overlap with Seedance 2.0's (1,265–1,281). That's a clean separation, not a coin flip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sample size is real.&lt;/strong&gt; 5,062 blind matchups is the same order of magnitude as the #3 and #4 entries, which means the Elo isn't riding on a lucky early streak. It's been stable across thousands of votes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API status is "Coming soon."&lt;/strong&gt; The row on the leaderboard lists API availability as pending. The model is generating output on the arena but is not yet broadly available for production use.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What the model claims about itself&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's where I want to be careful, because the information below comes from sites associated with the project (primarily happyhorse-ai.com and happyhorses.io) and has not been independently verified by any third party as of this writing.&lt;/p&gt;

&lt;p&gt;According to these sources, HappyHorse-1.0 is described as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;15B-parameter unified transformer&lt;/strong&gt; (the parameter count appears on secondary documentation, not on Artificial Analysis itself).
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;40-layer self-attention architecture&lt;/strong&gt; with no cross-attention. First and last 4 layers use modality-specific projections; the middle 32 layers are shared across text, video, and audio tokens.
&lt;/li&gt;
&lt;li&gt;Trained to run inference in &lt;strong&gt;8 denoising steps without CFG&lt;/strong&gt;, via a DMD-2 distillation recipe.
&lt;/li&gt;
&lt;li&gt;Reportedly capable of generating a 5-second 1080p clip in &lt;strong&gt;~38 seconds on an H100&lt;/strong&gt; (self-reported).
&lt;/li&gt;
&lt;li&gt;Natively supporting joint audio-video generation across 6 languages (English, Mandarin, Japanese, Korean, German, French; a secondary site lists Cantonese as a 7th).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these numbers are accurate, the architecture would represent a fairly aggressive bet on unified multimodal transformers over the multi-stream cross-attention approaches that most current video models use. It would also place HappyHorse-1.0 in the same design family as Meta's Transfusion line of research, though there is no direct connection established between the projects.&lt;/p&gt;
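&lt;p&gt;For intuition only, here is a speculative PyTorch-style sketch of the &lt;em&gt;claimed&lt;/em&gt; single-stream layout: shared self-attention blocks over one concatenated token sequence, with per-modality projections (simplified here to input projections). Every name and dimension below is hypothetical, not taken from any released code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class UnifiedStream(nn.Module):
    def __init__(self, dim=1024, layers=40):
        super().__init__()
        # One projection per modality; tokens then share a single stream.
        self.proj = nn.ModuleDict({m: nn.Linear(dim, dim)
                                   for m in ("text", "video", "audio")})
        self.blocks = nn.ModuleList(Block(dim) for _ in range(layers))

    def forward(self, tokens):
        # Concatenate all modalities into one sequence:
        # self-attention only, no cross-attention anywhere.
        x = torch.cat([self.proj[m](t) for m, t in tokens.items()], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;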

&lt;p&gt;&lt;strong&gt;None of these claims can be independently verified right now.&lt;/strong&gt; The GitHub and HuggingFace links referenced on the project's own sites currently point to "coming soon" placeholders. No weights, no reproducible demo outside the arena, no third-party benchmark of inference speed or memory footprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Who built it&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As of April 8, no team or organization has officially claimed HappyHorse-1.0. The most widely discussed attribution in the Chinese tech press, now circulating in English AI circles, links the model to a new team reportedly led by &lt;strong&gt;Zhang Di&lt;/strong&gt; — the former VP at Kuaishou who led the Kling video generation effort, and who reportedly joined Alibaba in late 2025 to run the Future Life Lab inside the Taotian Group.&lt;/p&gt;

&lt;p&gt;I want to stress: this is the most credible theory currently in circulation, but it is not confirmed. Alibaba has not commented. No one publicly associated with HappyHorse has confirmed or denied it. Other community speculation has pointed to alternative origins. If you're making engineering or editorial decisions based on the attribution, wait for official confirmation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What this means if you evaluate video models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you benchmark video models before integrating them into a pipeline, the honest summary is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The leaderboard result is real.&lt;/strong&gt; Blind user preferences, 5,000+ matchups, clean confidence intervals. That's not marketing; that's what the arena is designed to measure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything else is not yet real for you.&lt;/strong&gt; No weights, no API, no reproducible local run. You can't currently fine-tune it, can't self-host it, can't measure its latency on your own hardware, can't verify the claimed architecture.
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The "what" is known. The "how" and "by whom" are not.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination is unusual at the top of the leaderboard. Most models at this rank come with a paper, a model card, a team announcement, and at least an API. HappyHorse-1.0 currently has a leaderboard row and a set of unverifiable claims. That may change quickly — the project sites describe an imminent broader release — or it may not.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Sources&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Artificial Analysis Video Arena (live leaderboard): &lt;a href="https://artificialanalysis.ai/video/leaderboard/text-to-video" rel="noopener noreferrer"&gt;https://artificialanalysis.ai/video/leaderboard/text-to-video&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HappyHorse-1.0 public testing interface and current technical spec: &lt;a href="https://happyhorses.io" rel="noopener noreferrer"&gt;https://happyhorses.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chinese-language reporting referencing the Zhang Di / Future Life Lab attribution is cited across several tech media outlets as of April 7–8, 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Leaderboard rankings are dynamic and may shift as new votes and new models are added.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Code Architecture Explained: Agent Loop, Tool System, and Permission Model (Rust Rewrite Analysis)</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:09:58 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/claude-code-architecture-explained-agent-loop-tool-system-and-permission-model-rust-rewrite-41b2</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/claude-code-architecture-explained-agent-loop-tool-system-and-permission-model-rust-rewrite-41b2</guid>
      <description>&lt;h2&gt;
  
  
  Claude Code Deep Dive (Part 1): Architecture Overview and the Core Agent Loop
&lt;/h2&gt;

&lt;p&gt;Claude Code’s leaked source code weighs in at over &lt;strong&gt;510,000 lines of TypeScript&lt;/strong&gt;—far too large to analyze directly.&lt;/p&gt;

&lt;p&gt;Interestingly, a community-driven Rust rewrite reduced that complexity to around &lt;strong&gt;20,000 lines&lt;/strong&gt;, while still preserving the core functionality.&lt;/p&gt;

&lt;p&gt;Starting from this simplified version makes one thing much clearer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does an AI agent system &lt;em&gt;actually need&lt;/em&gt; to work?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Start with the Rust Rewrite?
&lt;/h2&gt;

&lt;p&gt;On March 31, 2026, Claude Code’s full source was unintentionally exposed due to an npm packaging mistake.&lt;/p&gt;

&lt;p&gt;The package &lt;code&gt;@anthropic-ai/claude-code v2.1.88&lt;/code&gt; included a &lt;strong&gt;59.8MB source map file&lt;/strong&gt;, which allowed anyone to reconstruct the original TypeScript codebase.&lt;/p&gt;

&lt;p&gt;To clarify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The official GitHub repo always existed&lt;/li&gt;
&lt;li&gt;But it only contained compiled bundles and documentation&lt;/li&gt;
&lt;li&gt;The readable source code was not normally accessible&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Problem with the Original Codebase
&lt;/h3&gt;

&lt;p&gt;Most analyses focused on the leaked TypeScript code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;510K+ lines&lt;/li&gt;
&lt;li&gt;QueryEngine alone: ~46K lines&lt;/li&gt;
&lt;li&gt;40+ tools&lt;/li&gt;
&lt;li&gt;Complex plugin system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: too much detail, not enough clarity.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why the Rust Version Is More Useful
&lt;/h3&gt;

&lt;p&gt;Shortly after the leak:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer &lt;strong&gt;Sigrid Jin&lt;/strong&gt; (instructkr community)&lt;/li&gt;
&lt;li&gt;First built a Python clean-room version&lt;/li&gt;
&lt;li&gt;Then pushed a Rust implementation (&lt;code&gt;claw-code&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Project overview: &lt;a href="https://claw-code.codes/" rel="noopener noreferrer"&gt;claw-code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~20K lines of Rust&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retains core functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent loop&lt;/li&gt;
&lt;li&gt;Tool system&lt;/li&gt;
&lt;li&gt;Permission control&lt;/li&gt;
&lt;li&gt;Prompt system&lt;/li&gt;
&lt;li&gt;Session management&lt;/li&gt;
&lt;li&gt;MCP protocol&lt;/li&gt;
&lt;li&gt;Sub-agents&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The key benefit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Rewriting forces simplification. What remains is what actually matters.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture Overview: A 6-Module System
&lt;/h2&gt;

&lt;p&gt;The Rust implementation is structured into six modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claw-code/
├── runtime/          # Core runtime: loop, permissions, config, session, prompt
├── api/              # LLM client, SSE streaming, OAuth
├── tools/            # Tool registry and execution
├── commands/         # Slash commands (/help, /cost)
├── compat-harness/   # TS → Rust compatibility layer
└── rusty-claude-cli/ # CLI, REPL, terminal rendering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These modules form a layered architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI / REPL (User Interaction)
─────────────────────────────
MCP Protocol · Sub-agents (Extension Layer)
─────────────────────────────
API Client · Session Management (Communication Layer)
─────────────────────────────
System Prompt · Config (Context Layer)
─────────────────────────────
Agent Loop · Tools · Permissions (Core Layer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  A Key Design Decision
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;runtime&lt;/code&gt; module defines &lt;strong&gt;interfaces&lt;/strong&gt;, not implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ApiClient&lt;/code&gt; → LLM communication&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ToolExecutor&lt;/code&gt; → tool execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concrete implementations live at the top (CLI layer).&lt;/p&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock implementations for testing&lt;/li&gt;
&lt;li&gt;Real implementations for production&lt;/li&gt;
&lt;li&gt;Zero changes to core logic&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Testability is built into the architecture—not added later.&lt;/p&gt;
&lt;/blockquote&gt;
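
&lt;p&gt;A minimal Python analogue of the same pattern (the Rust version uses traits; only the two interface names come from the article, the rest is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Protocol

class ApiClient(Protocol):
    def stream(self, system_prompt: str, messages: list) -&amp;gt; str: ...

class ToolExecutor(Protocol):
    def execute(self, name: str, input: dict) -&amp;gt; str: ...

class MockApiClient:
    # Test double: canned reply, no network required.
    def stream(self, system_prompt, messages):
        return "stub response"

class MockToolExecutor:
    def execute(self, name, input):
        return f"ran {name} with {input}"

# The core loop accepts any ApiClient/ToolExecutor, so tests inject
# the mocks while production injects the real HTTP-backed versions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;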




&lt;h2&gt;
  
  
  The Core: An 88-Line Agent Loop
&lt;/h2&gt;

&lt;p&gt;If you only read one file, read this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;conversation.rs&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The entire agent loop is implemented in ~88 lines.&lt;/p&gt;




&lt;h3&gt;
  
  
  Runtime State: Simpler Than Expected
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AgentRuntime {
    session            # message array (the only state)
    api_client         # LLM interface
    tool_executor      # tool execution
    permission_policy  # access control
    system_prompt
    max_iterations
    usage_tracker
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The surprising part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The only state is a message array.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No explicit state machine. No workflow graph.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Loop: &lt;code&gt;run_turn()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the simplified logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def run_turn(user_input):
    session.messages.append(UserMessage(user_input))

    while True:
        if iterations &amp;gt; max_iterations:
            raise Error("Max iterations exceeded")

        response = api_client.stream(system_prompt, session.messages)

        assistant_message = parse_response(response)
        session.messages.append(assistant_message)

        tool_calls = extract_tool_uses(assistant_message)

        if not tool_calls:
            break

        for tool_name, input in tool_calls:
            permission = authorize(tool_name, input)

            if permission == Allow:
                result = tool_executor.execute(tool_name, input)
                session.messages.append(ToolResult(result))
            else:
                session.messages.append(
                    ToolResult(deny_reason, is_error=True)
                )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;User asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is 2 + 2?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Execution flow:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Message State&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Start&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[User("2+2")]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API #1&lt;/td&gt;
&lt;td&gt;+ Assistant (calls tool)&lt;/td&gt;
&lt;td&gt;Model decides to compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool&lt;/td&gt;
&lt;td&gt;+ ToolResult("4")&lt;/td&gt;
&lt;td&gt;Tool executes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API #2&lt;/td&gt;
&lt;td&gt;+ Assistant("Answer is 4")&lt;/td&gt;
&lt;td&gt;Final answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End&lt;/td&gt;
&lt;td&gt;Loop exits&lt;/td&gt;
&lt;td&gt;No more tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Termination condition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model decides to stop calling tools.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Key Design Insight #1: Messages = State
&lt;/h2&gt;

&lt;p&gt;Instead of managing state explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system stores everything as messages&lt;/li&gt;
&lt;li&gt;The full state is reconstructible from history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy persistence (save session)&lt;/li&gt;
&lt;li&gt;Easy replay (debugging)&lt;/li&gt;
&lt;li&gt;Easy compression (context trimming)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;One append-only structure solves multiple problems.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Key Design Insight #2: Errors Are Feedback
&lt;/h2&gt;

&lt;p&gt;When a tool is denied:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system does &lt;strong&gt;not&lt;/strong&gt; crash&lt;/li&gt;
&lt;li&gt;It returns an error as a &lt;code&gt;ToolResult&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fed back to the model.&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model adapts&lt;/li&gt;
&lt;li&gt;Chooses alternative strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Failure becomes part of the reasoning loop.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Tool System: 18 Tools, One Pattern
&lt;/h2&gt;

&lt;p&gt;The Rust version implements 18 built-in tools in a unified structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Layers
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Tool Registry     → defines schema and permissions
2. Dispatcher        → routes tool calls
3. Implementation    → executes logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tool Specification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "bash",
  "description": "Execute shell commands",
  "input_schema": {
    "command": "string",
    "timeout": "number?"
  },
  "required_permission": "DangerFullAccess"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This schema is passed directly to the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why JSON Schema Matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Decouples LLM from implementation&lt;/li&gt;
&lt;li&gt;Enables language-agnostic tools&lt;/li&gt;
&lt;li&gt;Standardizes interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Schema = contract&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Dispatcher Pattern
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def execute_tool(name, input):
    match name:
        case "bash":
            return run_bash(input)
        case "read_file":
            return run_read(input)
        # ... remaining tools, one case each
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Adding a tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define input struct&lt;/li&gt;
&lt;li&gt;Implement logic&lt;/li&gt;
&lt;li&gt;Add one dispatch line&lt;/li&gt;
&lt;/ul&gt;
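
&lt;p&gt;As a sketch, here are the three steps for a hypothetical &lt;code&gt;word_count&lt;/code&gt; tool (the tool and every name below are invented for illustration, not taken from the codebase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class WordCountInput:          # 1. define the input struct
    text: str

def run_word_count(inp: WordCountInput) -&amp;gt; str:
    return str(len(inp.text.split()))    # 2. implement the logic

def execute_tool(name, input):
    match name:
        case "word_count":     # 3. add one dispatch line
            return run_word_count(WordCountInput(**input))
        case _:
            raise ValueError(f"unknown tool: {name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;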


&lt;h3&gt;
  
  
  Sub-Agent Design
&lt;/h3&gt;

&lt;p&gt;Sub-agents reuse the same runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;runtime = AgentRuntime(
    session = new_session,
    tool_executor = restricted_tools,
    permission = high,
    prompter = None
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Key constraint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-agents cannot spawn sub-agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents recursion loops.&lt;/p&gt;




&lt;h2&gt;
  
  
  Permission System: Minimal but Complete
&lt;/h2&gt;

&lt;p&gt;The system uses &lt;strong&gt;5 permission levels&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ReadOnly&lt;/li&gt;
&lt;li&gt;WorkspaceWrite&lt;/li&gt;
&lt;li&gt;DangerFullAccess&lt;/li&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;li&gt;Allow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Logic
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if current &amp;gt;= required:
    allow
elif one_level_gap:
    ask_user
else:
    deny
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design Insight: Gradual Escalation
&lt;/h3&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All-or-nothing access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It uses:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Controlled escalation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Small gap → ask user&lt;/li&gt;
&lt;li&gt;Large gap → deny&lt;/li&gt;
&lt;/ul&gt;
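
&lt;p&gt;In runnable form, the escalation rule is tiny. A minimal sketch; the numeric ordering below is hypothetical, since the article lists the levels without ranking them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from enum import IntEnum

# Hypothetical ordering: the article names the levels but not their ranks.
class Level(IntEnum):
    READ_ONLY = 1
    WORKSPACE_WRITE = 2
    DANGER_FULL_ACCESS = 3

def authorize(current: Level, required: Level) -&amp;gt; str:
    if current &amp;gt;= required:       # enough privilege: allow outright
        return "allow"
    if required - current == 1:    # exactly one level short: ask the user
        return "ask_user"
    return "deny"                  # larger gap: hard deny
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note how this interacts with sub-agents: a sub-agent with no prompt interface turns every &lt;code&gt;ask_user&lt;/code&gt; outcome into an effective deny, which is exactly the safety model described below.&lt;/p&gt;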


&lt;h3&gt;
  
  
  Sub-Agent Safety Model
&lt;/h3&gt;

&lt;p&gt;Sub-agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have high permission&lt;/li&gt;
&lt;li&gt;But no user prompt interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allowed within scope&lt;/li&gt;
&lt;li&gt;Automatically blocked outside&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Two mechanisms combine into precise control.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Part 1 Summary
&lt;/h2&gt;

&lt;p&gt;Claude Code’s core reduces to three components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Loop     → execution engine
Tool System    → action layer
Permissions    → safety control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages are the only state&lt;/li&gt;
&lt;li&gt;LLM decides when to stop&lt;/li&gt;
&lt;li&gt;Tools are schema-driven&lt;/li&gt;
&lt;li&gt;Errors are part of reasoning&lt;/li&gt;
&lt;li&gt;Permissions are incremental&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;After stripping away 500K lines of code, what remains is surprisingly small:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A loop, a tool interface, and a permission system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s enough to build a functional AI agent.&lt;/p&gt;

&lt;p&gt;But making it &lt;strong&gt;robust, scalable, and safe&lt;/strong&gt;—that’s where the real complexity begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Part
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code Deep Dive (Part 2): Context Engineering and Design Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt construction&lt;/li&gt;
&lt;li&gt;Config merging&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;Practical design takeaways&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Claw Code (Rust rewrite): &lt;a href="https://github.com/instructkr/claw-code" rel="noopener noreferrer"&gt;https://github.com/instructkr/claw-code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Project site: &lt;a href="https://claw-code.codes/" rel="noopener noreferrer"&gt;https://claw-code.codes/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Code official repo: &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;https://github.com/anthropics/claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Mythos 5 Leak: Anthropic’s “Capybara” Model Surpasses Opus 4.6</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Sun, 29 Mar 2026 16:04:34 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/claude-mythos-5-leak-anthropics-capybara-model-surpasses-opus-46-36l0</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/claude-mythos-5-leak-anthropics-capybara-model-surpasses-opus-46-36l0</guid>
      <description>&lt;p&gt;Anthropic Just Leaked a Model Stronger Than Opus — And It Might Be Too Powerful&lt;/p&gt;

&lt;p&gt;Anthropic may have just revealed its most powerful model yet — unintentionally.&lt;/p&gt;

&lt;p&gt;No rumors. No controlled announcement. No staged “insider leak.”&lt;/p&gt;

&lt;p&gt;Instead, a misconfigured CMS exposed nearly 3,000 internal documents to the public internet, which were subsequently reviewed by a &lt;em&gt;Fortune&lt;/em&gt; journalist. A Cambridge cybersecurity researcher, Alexandre Pauwels, was brought in to validate the materials. Anthropic later confirmed: the model is real.&lt;/p&gt;

&lt;p&gt;The model is called &lt;strong&gt;Claude Mythos&lt;/strong&gt;.&lt;br&gt;
Its internal codename: &lt;strong&gt;Capybara&lt;/strong&gt;.&lt;br&gt;
More information about &lt;a href="https://mythos-5.org/" rel="noopener noreferrer"&gt;mythos-5&lt;/a&gt;: &lt;a href="https://m1astra-mythos.pages.dev/" rel="noopener noreferrer"&gt;https://m1astra-mythos.pages.dev/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A New Tier Above Opus
&lt;/h2&gt;

&lt;p&gt;Anthropic’s model lineup has followed a familiar three-tier structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Haiku&lt;/strong&gt; — lightweight and fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet&lt;/strong&gt; — balanced performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus&lt;/strong&gt; — largest and most capable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a long time, Opus has been treated as the ceiling.&lt;/p&gt;

&lt;p&gt;Mythos breaks that assumption.&lt;/p&gt;

&lt;p&gt;According to internal draft materials, Mythos is not an iteration of Opus, nor a refinement of Sonnet. It represents:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“A new tier of model, larger and more intelligent than Opus.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, this is not incremental progress — it’s a structural expansion of the product hierarchy.&lt;/p&gt;

&lt;p&gt;If Opus 4.6 already feels state-of-the-art, Mythos is positioned as something beyond that baseline.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Much Stronger Is It?
&lt;/h2&gt;

&lt;p&gt;The leaked documents indicate that Mythos achieves &lt;strong&gt;significantly higher performance&lt;/strong&gt; than Claude Opus 4.6 across multiple domains.&lt;/p&gt;

&lt;p&gt;At minimum, three areas stand out:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Software Engineering
&lt;/h3&gt;

&lt;p&gt;Programming is currently one of the most competitive benchmarks in AI.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.6 is already considered among the strongest coding models available. Mythos reportedly extends that lead further — not by marginal gains, but by a noticeable margin.&lt;/p&gt;

&lt;p&gt;For developers relying on Claude for daily coding tasks, this suggests:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A step change in capability, not a minor improvement.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2. Academic Reasoning
&lt;/h3&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mathematics&lt;/li&gt;
&lt;li&gt;Scientific reasoning&lt;/li&gt;
&lt;li&gt;Formal logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The internal drafts explicitly highlight “academic reasoning” as a separate evaluation category, where Mythos shows clear improvements.&lt;/p&gt;

&lt;p&gt;This is typically where models struggle with depth and consistency.&lt;br&gt;
Anthropic appears confident enough in this area to emphasize it directly.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Cybersecurity (The Most Concerning Part)
&lt;/h3&gt;

&lt;p&gt;This is where the tone of the internal documents shifts.&lt;/p&gt;

&lt;p&gt;One excerpt stands out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Although Mythos significantly exceeds all other AI models in cybersecurity capabilities, it signals an upcoming wave where models may exploit vulnerabilities faster than defenders can respond.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not typical product language.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not “leading”&lt;/li&gt;
&lt;li&gt;Not “competitive”&lt;/li&gt;
&lt;li&gt;But &lt;strong&gt;“significantly exceeds”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And importantly, this comes from internal evaluation — not marketing copy.&lt;/p&gt;

&lt;p&gt;Anthropic’s spokesperson described Mythos as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;“qualitative leap”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;“most powerful model to date”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Not Just Competition — A Shift in Scale
&lt;/h2&gt;

&lt;p&gt;Over the past two years, major AI models (GPT, Gemini, Claude, Llama) have largely competed within the same performance band.&lt;/p&gt;

&lt;p&gt;Differences were measurable, but incremental — often within single-digit percentages across benchmarks.&lt;/p&gt;

&lt;p&gt;Mythos suggests something different:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not incremental improvement, but a potential change in scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That may explain why every major Anthropic update tends to trigger the same reaction online:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“&lt;a class="mentioned-user" href="https://dev.to/sam"&gt;@sam&lt;/a&gt; Altman — are you awake?”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Anthropic’s Response: Prioritize Defense First
&lt;/h2&gt;

&lt;p&gt;Anthropic positions itself as a safety-focused AI company.&lt;/p&gt;

&lt;p&gt;So what happens when your own internal evaluation suggests you’ve built something that could overwhelm defenders?&lt;/p&gt;

&lt;p&gt;Their response is unusual:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The first users of Mythos will not be developers or enterprise customers — but cybersecurity defense organizations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reasoning is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the model’s offensive capabilities are as strong as suggested&lt;/li&gt;
&lt;li&gt;Then defenders need access to comparable tools &lt;em&gt;before&lt;/em&gt; broader release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In effect:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The antidote is distributed before the risk is fully released.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This approach is rare.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI conducted red-teaming before GPT-4&lt;/li&gt;
&lt;li&gt;Google ran safety reviews for Gemini&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But explicitly prioritizing &lt;strong&gt;defensive users in the release pipeline&lt;/strong&gt; is not common practice.&lt;/p&gt;

&lt;p&gt;This decision can be interpreted in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Genuine concern about potential misuse&lt;/li&gt;
&lt;li&gt;A strategic demonstration of capability&lt;/li&gt;
&lt;li&gt;Or both&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Cost Problem
&lt;/h2&gt;

&lt;p&gt;Another constraint is practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mythos is currently very expensive to operate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The internal drafts note that significant efficiency improvements are required before any large-scale release.&lt;/p&gt;

&lt;p&gt;In plain terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is not yet a consumer-ready model&lt;/li&gt;
&lt;li&gt;It remains closer to a high-cost experimental system&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why “Capybara”?
&lt;/h2&gt;

&lt;p&gt;Every major model has an internal codename:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 → Arrakis&lt;/li&gt;
&lt;li&gt;Google models → gemstone names&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic’s strongest model so far?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A capybara.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same internet-famous animal known for being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calm&lt;/li&gt;
&lt;li&gt;Social&lt;/li&gt;
&lt;li&gt;Universally compatible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The leak revealed two versions of the same blog draft:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One using “Mythos”&lt;/li&gt;
&lt;li&gt;Another replacing every instance with “Capybara”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This suggests the codename was used internally for an extended period, with “Mythos” introduced later as a public-facing name.&lt;/p&gt;




&lt;h3&gt;
  
  
  An Unexpected Collision
&lt;/h3&gt;

&lt;p&gt;There’s a twist.&lt;/p&gt;

&lt;p&gt;In the AI ecosystem, “Capybara” is already strongly associated with Alibaba’s Qwen (Tongyi) models, where it serves as a mascot.&lt;/p&gt;

&lt;p&gt;So when the codename surfaced, reactions were immediate.&lt;/p&gt;

&lt;p&gt;One of the most notable responses came from a former Qwen technical lead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“capybara? seriously?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two competing AI ecosystems, independently choosing the same meme animal.&lt;/p&gt;

&lt;p&gt;Unintentional, but memorable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leak Itself: A Basic Mistake
&lt;/h2&gt;

&lt;p&gt;The cause of the leak is almost trivial.&lt;/p&gt;

&lt;p&gt;Anthropic attributed it to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A manual configuration error in an external CMS tool.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploaded assets were public by default&lt;/li&gt;
&lt;li&gt;Privacy required manual configuration&lt;/li&gt;
&lt;li&gt;That step was missed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is functionally equivalent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An improperly secured S3 bucket&lt;/li&gt;
&lt;li&gt;A well-documented, preventable issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic emphasized that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The incident was not caused by AI-generated code&lt;/li&gt;
&lt;li&gt;It did not affect core infrastructure or customer data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, the irony is hard to ignore:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A company building cutting-edge cybersecurity AI exposed itself through a basic permission misconfiguration.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What the Leak Actually Reveals
&lt;/h2&gt;

&lt;p&gt;Beyond the technical mistake, the content of the leak is more important.&lt;/p&gt;

&lt;p&gt;The documents suggest something the industry rarely states explicitly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model may be powerful enough that even its creators need to treat it with caution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a different tone from the usual release narrative:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster&lt;/li&gt;
&lt;li&gt;Stronger&lt;/li&gt;
&lt;li&gt;Safer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, the implication is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We’ve built something that requires careful handling.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Marketing, or Something More?
&lt;/h2&gt;

&lt;p&gt;It’s reasonable to question whether this is simply another form of positioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emphasizing risk to signal capability&lt;/li&gt;
&lt;li&gt;Framing caution as exclusivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the language in the drafts doesn’t read like standard marketing.&lt;/p&gt;

&lt;p&gt;When internal materials describe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“An upcoming wave of AI-driven vulnerability exploitation”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That suggests either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An unusually bold marketing strategy&lt;/li&gt;
&lt;li&gt;Or a genuine internal assessment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The leak itself is almost incidental.&lt;/p&gt;

&lt;p&gt;What matters is the signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new tier above Opus&lt;/li&gt;
&lt;li&gt;A measurable jump in capability&lt;/li&gt;
&lt;li&gt;And a growing awareness of the risks that come with it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All triggered by something as mundane as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Forgetting to toggle a “private” setting in a CMS.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Tsinghua Open-Sources OpenMAIC: One-Click Generation of Multi-Agent AI Classrooms</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:54:32 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/tsinghua-open-sources-openmaic-one-click-generation-of-multi-agent-ai-classrooms-20fe</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/tsinghua-open-sources-openmaic-one-click-generation-of-multi-agent-ai-classrooms-20fe</guid>
      <description>&lt;h2&gt;
  
  
  OpenMAIC: One-Click Multi-Agent AI Classrooms
&lt;/h2&gt;

&lt;p&gt;What happens when AI systems know more than the teacher—and can adapt to every student?&lt;/p&gt;

&lt;p&gt;In a traditional classroom, the model is fixed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One teacher lectures&lt;/li&gt;
&lt;li&gt;Dozens of students listen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the pace is too fast, some fall behind.&lt;br&gt;
If it’s too slow, others disengage.&lt;/p&gt;

&lt;p&gt;This “one-size-fits-all” structure has always been a bottleneck.&lt;/p&gt;

&lt;p&gt;Now imagine a different setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every student has a personal AI assistant&lt;/li&gt;
&lt;li&gt;It never gets tired&lt;/li&gt;
&lt;li&gt;It adapts to individual learning pace&lt;/li&gt;
&lt;li&gt;It can generate interactive lessons on demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may sound speculative—but systems like &lt;strong&gt;OpenMAIC&lt;/strong&gt; are already making it real.&lt;/p&gt;

&lt;p&gt;Developed and open-sourced by a Tsinghua University team, the project has quickly gained traction, attracting significant attention on X within hours of release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtfgha3f4wjs3hb2wveg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtfgha3f4wjs3hb2wveg.png" alt=" " width="800" height="832"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  01 · What OpenMAIC Does
&lt;/h2&gt;

&lt;p&gt;At its core, &lt;strong&gt;OpenMAIC&lt;/strong&gt; generates complete, interactive learning environments using AI agents.&lt;/p&gt;

&lt;p&gt;Instead of reading static material, learners can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attend AI-led “classes”&lt;/li&gt;
&lt;li&gt;Interact with multiple AI agents&lt;/li&gt;
&lt;li&gt;Participate in discussions and exercises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcodaju3vk82gmay5a44n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcodaju3vk82gmay5a44n.png" alt=" " width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/THU-MAIC/OpenMAIC" rel="noopener noreferrer"&gt;https://github.com/THU-MAIC/OpenMAIC&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Generate a Course from a Topic
&lt;/h3&gt;

&lt;p&gt;You can start with a simple prompt—for example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Create a course explaining OpenClaw”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Within minutes, OpenMAIC generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A structured lesson&lt;/li&gt;
&lt;li&gt;AI instructor narration&lt;/li&gt;
&lt;li&gt;Multi-agent discussions&lt;/li&gt;
&lt;li&gt;Interactive exercises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice explanations&lt;/li&gt;
&lt;li&gt;HTML-based interactive simulations&lt;/li&gt;
&lt;li&gt;Built-in quizzes&lt;/li&gt;
&lt;li&gt;Export options to &lt;code&gt;.pptx&lt;/code&gt; or interactive &lt;code&gt;.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Turn PDFs into Interactive Lessons
&lt;/h3&gt;

&lt;p&gt;OpenMAIC also supports document-based learning.&lt;/p&gt;

&lt;p&gt;Upload a PDF, and the system will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract and restructure the content&lt;/li&gt;
&lt;li&gt;Generate explanations with visual aids&lt;/li&gt;
&lt;li&gt;Insert quizzes and checkpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a report analyzing OpenClaw’s impact on WeChat can be transformed into a guided course.&lt;/p&gt;

&lt;p&gt;Importantly, this is not just passive narration.&lt;/p&gt;

&lt;p&gt;The system introduces interaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual breakdowns of concepts&lt;/li&gt;
&lt;li&gt;Simulated workflows&lt;/li&gt;
&lt;li&gt;Step-by-step reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, when explaining how AI agents work, it can render:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input → internal processing → output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as an interactive, visualized pipeline.&lt;/p&gt;


&lt;h3&gt;
  
  
  Making Abstract Concepts Tangible
&lt;/h3&gt;

&lt;p&gt;One of the harder parts of learning—especially in subjects like math and physics—is abstraction.&lt;/p&gt;

&lt;p&gt;Take the Pythagorean theorem.&lt;br&gt;
Hearing the formula repeatedly rarely leads to intuition.&lt;/p&gt;

&lt;p&gt;OpenMAIC approaches this differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It embeds interactive components directly into lessons&lt;/li&gt;
&lt;li&gt;Learners can manipulate variables and observe real-time changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j8vb8m905nlmgk2vlpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j8vb8m905nlmgk2vlpi.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of memorizing the formula, students can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drag triangle edges&lt;/li&gt;
&lt;li&gt;See how values update dynamically&lt;/li&gt;
&lt;li&gt;Build intuition through interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shift—from explanation to exploration—can significantly improve retention.&lt;/p&gt;


&lt;h3&gt;
  
  
  Integration with Other AI Systems
&lt;/h3&gt;

&lt;p&gt;Some developers have already integrated OpenMAIC into &lt;strong&gt;OpenClaw&lt;/strong&gt;, enabling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic generation of instructional videos&lt;/li&gt;
&lt;li&gt;On-demand learning content inside agent workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This suggests a broader pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learning becomes a capability embedded inside AI systems—not a separate activity.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  02 · How to Use OpenMAIC
&lt;/h2&gt;

&lt;p&gt;You can either use the hosted version or deploy it locally.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 1: Use Online
&lt;/h3&gt;

&lt;p&gt;Visit: &lt;a href="https://openmaic.io/" rel="noopener noreferrer"&gt;openmaic.io&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Option 2: Self-Host
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. Clone the repository
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/THU-MAIC/OpenMAIC.git
&lt;span class="nb"&gt;cd &lt;/span&gt;OpenMAIC
pnpm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  2. Configure environment
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At minimum, provide an API key for an LLM provider.&lt;br&gt;
You can also configure providers via &lt;code&gt;server-providers.yml&lt;/code&gt;.&lt;/p&gt;
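
&lt;p&gt;For orientation, the entries in &lt;code&gt;.env.local&lt;/code&gt; look like the sketch below; the variable names here are illustrative, so check &lt;code&gt;.env.example&lt;/code&gt; for the exact keys the project expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative values only; the authoritative variable names are in .env.example
OPENAI_API_KEY=sk-your-key-here
# Optional: point the app at any OpenAI-compatible endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
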
&lt;h4&gt;
  
  
  3. Start the app
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd26zntcdpaxx8tw08v2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd26zntcdpaxx8tw08v2d.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Initial Setup
&lt;/h3&gt;

&lt;p&gt;Once inside the interface, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload PDFs&lt;/li&gt;
&lt;li&gt;Customize AI voice&lt;/li&gt;
&lt;li&gt;Set your learner profile&lt;/li&gt;
&lt;li&gt;Choose AI “classmates”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then enter a topic and start the session.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Learning Experience Feels Like
&lt;/h2&gt;

&lt;p&gt;OpenMAIC tries to simulate a real classroom:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI instructor explains with voice and visual cues&lt;/li&gt;
&lt;li&gt;Spotlight and pointer effects guide attention&lt;/li&gt;
&lt;li&gt;Interactive components encourage hands-on learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Questions are raised for discussion&lt;/li&gt;
&lt;li&gt;AI agents debate among themselves&lt;/li&gt;
&lt;li&gt;You can join the conversation at any time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some cases, the system may even prompt you directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;OpenMAIC points toward a shift in how education might scale in the AI era.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Uniform Teaching → Personalized Learning
&lt;/h3&gt;

&lt;p&gt;Previously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One teacher, many students&lt;/li&gt;
&lt;li&gt;Limited personalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One AI system per learner&lt;/li&gt;
&lt;li&gt;Fully adaptive pacing and content&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  From Content Consumption → Interactive Exploration
&lt;/h3&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading documents&lt;/li&gt;
&lt;li&gt;Watching videos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interact&lt;/li&gt;
&lt;li&gt;Experiment&lt;/li&gt;
&lt;li&gt;Participate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Limitations and Open Questions
&lt;/h2&gt;

&lt;p&gt;While promising, this approach is not without trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires reliable LLM infrastructure&lt;/li&gt;
&lt;li&gt;Quality depends on prompt design and source material&lt;/li&gt;
&lt;li&gt;May not replace structured curricula in formal education&lt;/li&gt;
&lt;li&gt;Long-term learning outcomes still need broader validation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;OpenMAIC demonstrates a practical direction for AI in education:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate what you want to learn&lt;/li&gt;
&lt;li&gt;Learn at your own pace&lt;/li&gt;
&lt;li&gt;Turn knowledge into interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It lowers the barrier to both &lt;strong&gt;learning&lt;/strong&gt; and &lt;strong&gt;teaching&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to learn something? Generate a course.&lt;/li&gt;
&lt;li&gt;Want to teach something? Generate a classroom.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This represents a shift not just in tools, but in how knowledge is produced and shared.&lt;/p&gt;

&lt;p&gt;Whether this becomes mainstream remains uncertain. But as an open-source experiment, OpenMAIC offers a concrete glimpse into what AI-native education might look like.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>Zhipu AI AutoClaw: Install an AI Agent on Your Computer in 1 Minute</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 07:39:29 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/zhipu-ai-autoclaw-install-an-ai-agent-on-your-computer-in-1-minute-36gk</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/zhipu-ai-autoclaw-install-an-ai-agent-on-your-computer-in-1-minute-36gk</guid>
      <description>&lt;p&gt;Install a Full-Powered “Claw” Agent on Your Computer in One Minute&lt;/p&gt;

&lt;p&gt;1 Minute. No Setup. Your Computer Just Got an AI Agent&lt;/p&gt;




&lt;h1&gt;
  
  
  Install an AI Agent on Your Computer in One Minute
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdypkot401x2r5uqadygc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdypkot401x2r5uqadygc.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running a full AI agent locally has usually meant dealing with complex setup steps—Python environments, API keys, cloud machines, and lengthy tutorials.&lt;/p&gt;

&lt;p&gt;That barrier may be disappearing.&lt;/p&gt;

&lt;p&gt;Zhipu AI has released a new desktop application called &lt;strong&gt;AutoClaw&lt;/strong&gt; (nicknamed &lt;strong&gt;“AoLong”&lt;/strong&gt;), designed to make running an AI agent as simple as installing a regular app. In practice, the entire process—from download to execution—takes about a minute.&lt;/p&gt;

&lt;p&gt;Once installed, a user can issue a prompt and the agent immediately begins executing autonomous tasks.&lt;/p&gt;

&lt;p&gt;For example, a simple instruction like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continuously track the latest OpenClaw-related updates from Bilibili, Douyin, Xiaohongshu, GitHub, X, Google, Baidu, and Zhihu. Summarize the latest developments every hour.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within a minute, the agent begins running the task.&lt;/p&gt;

&lt;p&gt;If the task is created at &lt;strong&gt;20:14&lt;/strong&gt;, AutoClaw will automatically repeat the process every hour—collecting and summarizing new information across those platforms.&lt;/p&gt;

&lt;p&gt;At first glance, this may sound similar to what many existing AI agents already do. The difference is that &lt;strong&gt;no configuration is required&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  AutoClaw: A One-Minute AI Agent Deployment
&lt;/h1&gt;

&lt;p&gt;AutoClaw’s primary design goal is &lt;strong&gt;reducing deployment complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditionally, running agent frameworks such as OpenClaw requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python environment setup&lt;/li&gt;
&lt;li&gt;API key configuration&lt;/li&gt;
&lt;li&gt;Dependency installation&lt;/li&gt;
&lt;li&gt;Sometimes renting cloud GPU instances&lt;/li&gt;
&lt;li&gt;Following long installation guides&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many users, these requirements become a practical barrier. Even with step-by-step tutorials, most people never make it past the setup stage.&lt;/p&gt;

&lt;p&gt;AutoClaw attempts to solve that problem by packaging the entire agent stack into a &lt;strong&gt;desktop application&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The installation process resembles installing any other software.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation Workflow (Example: macOS)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Download the installation package&lt;/li&gt;
&lt;li&gt;Install it like a standard desktop application&lt;/li&gt;
&lt;li&gt;Log into your account&lt;/li&gt;
&lt;li&gt;Review the &lt;strong&gt;Security and Risk Guide&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the setup is confirmed, the user enters the main interface and can start creating tasks immediately.&lt;/p&gt;

&lt;p&gt;The experience is intentionally designed to remove the traditional “AI infrastructure” layer from the user’s workflow.&lt;/p&gt;




&lt;h1&gt;
  
  
  Built-In Model Flexibility
&lt;/h1&gt;

&lt;p&gt;Another notable feature is &lt;strong&gt;model switching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AutoClaw allows users to choose between multiple models, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GLM-5&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DeepSeek&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kimi&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;other compatible models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo above uses a model called &lt;strong&gt;Pony-Alpha-2&lt;/strong&gt;, which Zhipu designed specifically for agent workflows.&lt;/p&gt;

&lt;p&gt;The “Pony” name continues the naming convention used during pre-release versions of GLM-5. According to reports, the model is expected to launch officially soon.&lt;/p&gt;




&lt;h1&gt;
  
  
  Preloaded Skills: 50+ Agent Capabilities
&lt;/h1&gt;

&lt;p&gt;AutoClaw ships with &lt;strong&gt;more than 50 built-in skills&lt;/strong&gt;, effectively forming what the developers describe as a “team of agents.”&lt;/p&gt;

&lt;p&gt;These skills cover common automation scenarios, allowing users to run tasks without building workflows from scratch.&lt;/p&gt;

&lt;p&gt;This means users typically don’t need tutorials or scripting knowledge to begin experimenting with agent workflows.&lt;/p&gt;




&lt;h1&gt;
  
  
  Deep Integration With Feishu
&lt;/h1&gt;

&lt;p&gt;One of the most practical features is &lt;strong&gt;one-click integration with Feishu&lt;/strong&gt; (the enterprise collaboration platform also known as Lark).&lt;/p&gt;

&lt;p&gt;Inside the AutoClaw interface, users simply click &lt;strong&gt;“Connect to Feishu.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The remaining steps—including authentication and integration—are handled automatically by the agent itself.&lt;/p&gt;

&lt;p&gt;Once the integration request is approved by administrators, the agent becomes available inside Feishu.&lt;/p&gt;

&lt;p&gt;From that point on, users can interact with it directly in chat.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example: Automated Industry Monitoring
&lt;/h1&gt;

&lt;p&gt;For example, instead of running tasks in the desktop interface, you can assign tasks directly inside Feishu.&lt;/p&gt;

&lt;p&gt;A typical instruction might look like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every day at 9:10 PM, collect the latest news in the new energy industry and send the summary to this chat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the scheduled time, the &lt;a href="https://autoclaws.org/" rel="noopener noreferrer"&gt;AutoClaw agent automatically posts the report in the chat&lt;/a&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Using Agents Inside Group Conversations
&lt;/h1&gt;

&lt;p&gt;The integration also allows agents to participate in group chats.&lt;/p&gt;

&lt;p&gt;Users can simply &lt;strong&gt;@mention the agent&lt;/strong&gt; to trigger tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitoring potential reputation risks&lt;/li&gt;
&lt;li&gt;collecting market discussions&lt;/li&gt;
&lt;li&gt;summarizing topic-specific information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interaction pattern becomes similar to messaging a coworker.&lt;/p&gt;




&lt;h1&gt;
  
  
  Cross-Platform Content Automation
&lt;/h1&gt;

&lt;p&gt;AutoClaw can also handle cross-platform publishing tasks.&lt;/p&gt;

&lt;p&gt;For example, it can automatically synchronize content to platforms such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Xiaohongshu&lt;/li&gt;
&lt;li&gt;X (Twitter)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns the agent into a lightweight content automation system.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example Experiment: A Pixel Office Generator
&lt;/h1&gt;

&lt;p&gt;To explore more creative use cases, one test prompt asked the agent to generate a &lt;strong&gt;pixel-style office environment&lt;/strong&gt; based on the GitHub project &lt;strong&gt;Star-Office-UI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The agent successfully assembled the environment using the referenced project.&lt;/p&gt;

&lt;p&gt;While the example is playful, it demonstrates how agents can combine external resources and automation workflows.&lt;/p&gt;




&lt;h1&gt;
  
  
  From Chatbots to Agents
&lt;/h1&gt;

&lt;p&gt;The release of AutoClaw reflects a broader shift in AI interaction models.&lt;/p&gt;

&lt;p&gt;The industry is moving &lt;strong&gt;from chat-based systems to autonomous agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Chatbots respond to prompts.&lt;/p&gt;

&lt;p&gt;Agents execute goals.&lt;/p&gt;

&lt;p&gt;This shift has attracted significant attention since the rise of open-source agent projects like OpenClaw. Many developers were fascinated by the idea of fully autonomous digital workers.&lt;/p&gt;

&lt;p&gt;However, real-world deployment proved difficult.&lt;/p&gt;

&lt;p&gt;Setting up agents required technical expertise and infrastructure knowledge, which excluded most non-technical users.&lt;/p&gt;

&lt;p&gt;AutoClaw attempts to change that by lowering the entry barrier.&lt;/p&gt;




&lt;h1&gt;
  
  
  Lowering the Barrier to the Agent Era
&lt;/h1&gt;

&lt;p&gt;The core narrative behind AutoClaw is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Radically reduce the friction required to run AI agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of renting cloud machines or configuring environments, users simply download the application.&lt;/p&gt;

&lt;p&gt;Within a minute, a regular personal computer becomes capable of running agent workflows.&lt;/p&gt;

&lt;p&gt;For many users, this could be their first practical entry point into the agent ecosystem.&lt;/p&gt;




&lt;h1&gt;
  
  
  Stability Matters More Than Installation
&lt;/h1&gt;

&lt;p&gt;Ease of installation is only the first step.&lt;/p&gt;

&lt;p&gt;For agents to become truly useful, they must also be &lt;strong&gt;reliable during complex multi-step tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Running generic large language models in agent pipelines often causes problems such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mid-task failures&lt;/li&gt;
&lt;li&gt;inconsistent reasoning&lt;/li&gt;
&lt;li&gt;hallucinations in multi-step execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://bigmodel.cn/" rel="noopener noreferrer"&gt;Zhipu&lt;/a&gt; addresses this by introducing &lt;strong&gt;Pony-Alpha-2&lt;/strong&gt;, a model optimized specifically for agent workloads.&lt;/p&gt;

&lt;p&gt;According to the company, the model focuses on two priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster execution speed&lt;/li&gt;
&lt;li&gt;greater stability during long task chains&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  A More Capable Browser Agent
&lt;/h1&gt;

&lt;p&gt;Another technical upgrade is AutoClaw’s browser automation capability.&lt;/p&gt;

&lt;p&gt;The native browser tools in many agent frameworks can typically perform only basic actions such as clicking buttons or filling simple forms.&lt;/p&gt;

&lt;p&gt;AutoClaw integrates &lt;strong&gt;AutoGLM-Browser-Agent&lt;/strong&gt;, a system developed by Zhipu.&lt;/p&gt;

&lt;p&gt;This allows the agent to complete &lt;strong&gt;complex browser workflows&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;navigating across multiple pages&lt;/li&gt;
&lt;li&gt;executing sequential actions&lt;/li&gt;
&lt;li&gt;connecting multiple web operations into a single automated process&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Built-In Workflows Out of the Box
&lt;/h1&gt;

&lt;p&gt;Finally, AutoClaw emphasizes &lt;strong&gt;immediate usability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With over &lt;strong&gt;50 preconfigured skills&lt;/strong&gt; and messaging platform integration, many workflows are ready to use immediately.&lt;/p&gt;

&lt;p&gt;After installation, users will see multiple assistant agents appear inside Feishu—for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitoring assistants&lt;/li&gt;
&lt;li&gt;research assistants&lt;/li&gt;
&lt;li&gt;task automation agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of managing a complex agent dashboard, users can interact with them the same way they communicate with colleagues.&lt;/p&gt;

&lt;p&gt;A message in chat is enough to trigger an automated workflow.&lt;/p&gt;




&lt;h1&gt;
  
  
  From Developer Tools to Everyday Assistants
&lt;/h1&gt;

&lt;p&gt;What makes AutoClaw interesting is not just the technology itself, but the &lt;strong&gt;change in accessibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Agent frameworks began as developer-focused tools requiring code and infrastructure knowledge.&lt;/p&gt;

&lt;p&gt;Applications like AutoClaw push them toward a different direction: &lt;strong&gt;everyday software assistants available to non-technical users&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether this model becomes widely adopted remains to be seen.&lt;/p&gt;

&lt;p&gt;But one thing is clear: the agent era is moving quickly—from experimental codebases toward tools that ordinary users can run on their own machines.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>zhipu</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MaxClaw Guide (MiniMax Agent): One-Click Cloud OpenClaw Deployment, Built-In Tools, and Expert 2.0 Workflows</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Mon, 02 Mar 2026 03:00:37 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/maxclaw-guide-minimax-agent-one-click-cloud-openclaw-deployment-built-in-tools-and-expert-20-4m58</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/maxclaw-guide-minimax-agent-one-click-cloud-openclaw-deployment-built-in-tools-and-expert-20-4m58</guid>
      <description>&lt;h1&gt;
  
  
  MaxClaw: A Practical Guide to “Out-of-the-Box” AI Agents on MiniMax
&lt;/h1&gt;

&lt;p&gt;MaxClaw is a cloud-hosted AI agent platform released by MiniMax on &lt;strong&gt;February 26, 2026&lt;/strong&gt;. It is built on the open-source framework &lt;strong&gt;OpenClaw&lt;/strong&gt; and runs on the &lt;strong&gt;MiniMax M2.5&lt;/strong&gt; large language model.&lt;/p&gt;

&lt;p&gt;The value proposition is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no server to rent&lt;/li&gt;
&lt;li&gt;no Docker setup&lt;/li&gt;
&lt;li&gt;no API key wrangling&lt;/li&gt;
&lt;li&gt;no manual skill installation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You click a button and, within about &lt;strong&gt;20 seconds&lt;/strong&gt;, you get an agent with end-to-end capabilities like web search, image generation, code execution, and file handling. MaxClaw also supports integrations across &lt;strong&gt;Feishu, DingTalk, Telegram, Discord, Slack&lt;/strong&gt;, and more. On top of that, MiniMax ships an “Expert 2.0” community: &lt;strong&gt;16,000+ ready-made workflows&lt;/strong&gt; spanning development, finance, writing, and office automation.&lt;/p&gt;

&lt;p&gt;If you’ve been curious about AI agents but bounced off OpenClaw’s setup complexity, MaxClaw is positioned as a lower-friction entry point.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why This Exists: OpenClaw Is Popular—and Hard to Use
&lt;/h2&gt;

&lt;p&gt;To understand MaxClaw, you need the context of its predecessor: &lt;strong&gt;OpenClaw&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What OpenClaw is
&lt;/h3&gt;

&lt;p&gt;OpenClaw (previously named &lt;strong&gt;Clawdbot&lt;/strong&gt; and &lt;strong&gt;Moltbot&lt;/strong&gt;) is an open-source personal AI agent platform created by Austrian developer &lt;strong&gt;Peter Steinberger&lt;/strong&gt;. It gained traction quickly in &lt;strong&gt;January 2026&lt;/strong&gt;, at one point reaching &lt;strong&gt;68,000+ GitHub stars&lt;/strong&gt;. It’s often described as “an AI assistant that actually does work.”&lt;/p&gt;

&lt;p&gt;The key distinction is intent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chatbot explains.&lt;/li&gt;
&lt;li&gt;An agent executes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenClaw’s core capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using messaging platforms (WhatsApp, Telegram, Discord, etc.) as primary interfaces&lt;/li&gt;
&lt;li&gt;running shell commands, controlling a browser, managing local files&lt;/li&gt;
&lt;li&gt;operating calendars and email; scheduling meetings&lt;/li&gt;
&lt;li&gt;a heartbeat mechanism that monitors tasks and proactively pushes reminders&lt;/li&gt;
&lt;li&gt;persistent memory across sessions (preferences and history)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple mental model from the original text: if ChatGPT is “a consultant that talks,” OpenClaw is “an assistant that acts.”&lt;/p&gt;

&lt;h3&gt;
  
  
  The project’s naming and stewardship changes
&lt;/h3&gt;

&lt;p&gt;OpenClaw started in &lt;strong&gt;November 2025&lt;/strong&gt; as Clawdbot. Due to trademark disputes, it was renamed twice: first to Moltbot (implying “metamorphosis”), and finally to &lt;strong&gt;OpenClaw&lt;/strong&gt; in late &lt;strong&gt;January 2026&lt;/strong&gt;, emphasizing open-source and community-driven development.&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;February 14, 2026&lt;/strong&gt;, Peter Steinberger announced he joined &lt;strong&gt;OpenAI&lt;/strong&gt;, and OpenClaw was transferred to an open-source foundation for continued maintenance. The arc—personal prototype → rapid adoption → naming friction → foundation stewardship—reflects how fast the open-source AI agent space is evolving.&lt;/p&gt;

&lt;h3&gt;
  
  
  The technical stack and the “skill explosion” problem
&lt;/h3&gt;

&lt;p&gt;OpenClaw lives in the &lt;strong&gt;JavaScript/TypeScript&lt;/strong&gt; ecosystem and depends heavily on &lt;strong&gt;Node.js (v22+)&lt;/strong&gt;. It uses &lt;strong&gt;Express&lt;/strong&gt; and &lt;strong&gt;Hono&lt;/strong&gt; for routing and API handling.&lt;/p&gt;

&lt;p&gt;OpenClaw’s official skill marketplace, &lt;strong&gt;ClawHub&lt;/strong&gt;, reportedly has &lt;strong&gt;9,000+ skills&lt;/strong&gt; covering scraping, content generation, customer support, scheduling, and more.&lt;/p&gt;

&lt;p&gt;The upside is obvious: lots of capabilities.&lt;br&gt;
The downside is equally real: &lt;em&gt;each capability adds configuration surface area&lt;/em&gt;. Users commonly report spending hours installing skills, configuring API credentials, and debugging compatibility issues.&lt;/p&gt;
&lt;h3&gt;
  
  
  Security concerns: powerful agents expand your risk surface
&lt;/h3&gt;

&lt;p&gt;Because OpenClaw needs access to email, calendars, chat platforms, and other sensitive services, misconfiguration or public exposure can create security and privacy risks.&lt;/p&gt;

&lt;p&gt;The original text cites a case where Cisco’s AI security research team tested a third-party OpenClaw skill and found it executed data exfiltration and prompt-injection behavior without the user’s awareness—suggesting the skill ecosystem still needs stronger security review mechanisms.&lt;/p&gt;

&lt;p&gt;The practical takeaway: self-hosting is not only “hard,” it can also be “risky” if you don’t know what you’re doing.&lt;/p&gt;
&lt;h3&gt;
  
  
  What OpenClaw setup looks like in practice
&lt;/h3&gt;

&lt;p&gt;A complete OpenClaw deployment typically involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provision an environment&lt;/strong&gt;&lt;br&gt;
You need a machine (local or cloud) and Node.js 22+. For many non-technical users, “Node.js” is already a blocker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install OpenClaw&lt;/strong&gt;&lt;br&gt;
You run command-line installs, configure firewall ports (commonly &lt;strong&gt;18789&lt;/strong&gt;), set npm mirrors, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configure an LLM provider&lt;/strong&gt;&lt;br&gt;
OpenClaw doesn’t ship with a built-in model. You must obtain an API key (e.g., from Anthropic, OpenAI, or Alibaba Bailian), then edit a JSON config. A typical configuration from the original text:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"bailian"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://dashscope.aliyuncs.com/compatible-mode/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"你的API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen3-max-2026-01-23"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen3-max-2026-01-23"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;262144&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connect a messaging channel&lt;/strong&gt;&lt;br&gt;
If you want Feishu or Telegram control, you create a bot app, obtain tokens/App IDs, then bind them via CLI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install skills&lt;/strong&gt;&lt;br&gt;
Search, image generation, etc., are not “just there.” You install them from ClawHub and configure each one.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even for developers, this tends to take &lt;strong&gt;30–60 minutes&lt;/strong&gt; end-to-end. For non-technical users, it’s often a dead end. One OpenClaw maintainer (“Shadow” in the original text) summed it up bluntly: if you don’t know the command line, the project may be too risky for you.&lt;/p&gt;

&lt;p&gt;This gap—between “heard about it” and “actually using it”—is the problem MaxClaw is trying to solve.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What MaxClaw Is: A Hosted OpenClaw With MiniMax’s Stack
&lt;/h2&gt;
&lt;h3&gt;
  
  
  One-sentence definition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MaxClaw is MiniMax’s cloud-hosted OpenClaw-based agent service, integrated into the MiniMax Agent web product, packaged as click-to-deploy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What used to be the user’s burden—servers, containers, API keys, skills, operations—is bundled into a managed service.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture, broken into three layers
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Layer 1: MiniMax M2.5 (the “brain”)
&lt;/h4&gt;

&lt;p&gt;MaxClaw runs on &lt;strong&gt;MiniMax M2.5&lt;/strong&gt;, described here as a &lt;strong&gt;Mixture-of-Experts (MoE)&lt;/strong&gt; model with about &lt;strong&gt;229B&lt;/strong&gt; total parameters while activating around &lt;strong&gt;10B&lt;/strong&gt; per inference.&lt;/p&gt;

&lt;p&gt;Claims in the original text include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast inference, supporting &lt;strong&gt;100 TPS&lt;/strong&gt; (tokens per second)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;benchmark results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SWE-Bench Verified: 80.2%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-SWE-Bench: 51.3%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BrowseComp: 76.3%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GDPval-MM (office tasks): 59.0% average win rate&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;trained using MiniMax’s &lt;strong&gt;Forge&lt;/strong&gt; framework and &lt;strong&gt;CISPO&lt;/strong&gt; algorithm with large-scale reinforcement learning optimized for agent scenarios&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Process Reward&lt;/strong&gt; mechanisms to monitor multi-step execution quality, improving completion speed by &lt;strong&gt;37%&lt;/strong&gt; vs M2.1 and reducing search iterations by about &lt;strong&gt;20%&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Layer 2: OpenClaw (the “skeleton”)
&lt;/h4&gt;

&lt;p&gt;OpenClaw provides a modular agent framework that standardizes how the model, channels, and tools are orchestrated. Core components described in the original text:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordinates tool execution&lt;/li&gt;
&lt;li&gt;manages client connections (often via WebSocket for real-time interaction)&lt;/li&gt;
&lt;li&gt;enforces security policies&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;plugin-based capability expansion&lt;/li&gt;
&lt;li&gt;follows standard &lt;strong&gt;OpenAPI&lt;/strong&gt; conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;persistent cross-session storage for context, preferences, history&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Channels&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standardized message interfaces to connect IM platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In MaxClaw, MiniMax hosts and manages these components on its infrastructure.&lt;/p&gt;
&lt;h4&gt;
  
  
  Layer 3: MiniMax Agent UI + Expert 2.0 ecosystem (the “skin”)
&lt;/h4&gt;

&lt;p&gt;Users interact via the MiniMax Agent web interface at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://agent.minimaxi.com/" rel="noopener noreferrer"&gt;https://agent.minimaxi.com/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top sits &lt;strong&gt;Expert 2.0&lt;/strong&gt;, a community-driven workflow ecosystem intended to expand MaxClaw with reusable “expert agents.”&lt;/p&gt;


&lt;h2&gt;
  
  
  3. MaxClaw vs. Self-Hosted OpenClaw
&lt;/h2&gt;

&lt;p&gt;Here is the same comparison from the original article, reconstructed in a developer-friendly table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OpenClaw (Self-hosted)&lt;/th&gt;
&lt;th&gt;MaxClaw (Managed cloud)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Bring your own machine; install Node.js; set up environment&lt;/td&gt;
&lt;td&gt;Click a button; deployed in ~20 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model setup&lt;/td&gt;
&lt;td&gt;Obtain API keys; edit JSON config&lt;/td&gt;
&lt;td&gt;Built-in MiniMax M2.5; no model config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;Install from ClawHub; configure each API&lt;/td&gt;
&lt;td&gt;Built-in core skills (image/video/search/web deploy, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channels&lt;/td&gt;
&lt;td&gt;Manually create bots/tokens; bind via CLI&lt;/td&gt;
&lt;td&gt;Guided via natural-language setup; supports Feishu/DingTalk/etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops&lt;/td&gt;
&lt;td&gt;You handle updates, dependencies, process supervision&lt;/td&gt;
&lt;td&gt;Fully managed by MiniMax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud storage&lt;/td&gt;
&lt;td&gt;No default cloud storage&lt;/td&gt;
&lt;td&gt;Includes &lt;strong&gt;50GB&lt;/strong&gt; dedicated storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-term memory&lt;/td&gt;
&lt;td&gt;You configure persistence yourself&lt;/td&gt;
&lt;td&gt;Native long-term memory across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;Developers / tinkerers / strict data control&lt;/td&gt;
&lt;td&gt;Broad users; minimal setup; “no infra”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  4. Built-in Tooling: What You Get on Day One
&lt;/h2&gt;

&lt;p&gt;A big difference is that MaxClaw comes with a pre-integrated toolchain rather than requiring you to install each skill.&lt;/p&gt;
&lt;h3&gt;
  
  
  Information retrieval tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;web search for up-to-date information&lt;/li&gt;
&lt;li&gt;image search for finding visual references online&lt;/li&gt;
&lt;li&gt;web extraction for pulling and structuring content from a URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these let the agent behave like a research assistant: gather sources, extract key points, structure results.&lt;/p&gt;
&lt;h3&gt;
  
  
  Content creation tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;text-to-image generation&lt;/li&gt;
&lt;li&gt;video generation (short-form creation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the OpenClaw world, you typically wire these up yourself via third-party APIs. MaxClaw positions them as built-in.&lt;/p&gt;
&lt;h3&gt;
  
  
  Office/document tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Word formatting&lt;/li&gt;
&lt;li&gt;PowerPoint editing&lt;/li&gt;
&lt;li&gt;Excel data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The original text attributes this to M2.5 being reinforced specifically for office workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  Developer tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;code execution across multiple languages&lt;/li&gt;
&lt;li&gt;web deployment (publish generated web content online)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination is framed as enabling even non-coders to produce simple pages or tools via natural language.&lt;/p&gt;
&lt;h3&gt;
  
  
  Understanding/analysis tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;image understanding (analyze uploaded images)&lt;/li&gt;
&lt;li&gt;video understanding (extract and analyze video content)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is a full loop: not only generate content, but also interpret it.&lt;/p&gt;

&lt;p&gt;All of these tools are hosted and maintained by MiniMax, so users don’t manage API versions or low-level integrations.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Getting Started: Creating Your First MaxClaw (End-to-End)
&lt;/h2&gt;

&lt;p&gt;The setup is intentionally minimal.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Open the MiniMax Agent site
&lt;/h3&gt;

&lt;p&gt;Go to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://maxclaw.ai/" rel="noopener noreferrer"&gt;https://maxclaw.ai/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t have an account, register (phone/email verification).&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Find the MaxClaw entry
&lt;/h3&gt;

&lt;p&gt;After logging in, look for &lt;strong&gt;MaxClaw&lt;/strong&gt; in the left navigation and click into it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: One-click creation
&lt;/h3&gt;

&lt;p&gt;Click “Start” / “Create MaxClaw.” The platform deploys a full OpenClaw instance in the cloud, typically in &lt;strong&gt;10–20 seconds&lt;/strong&gt;, then drops you into a chat-like interface for your agent.&lt;/p&gt;

&lt;p&gt;At no point do you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rent a server&lt;/li&gt;
&lt;li&gt;install dependencies&lt;/li&gt;
&lt;li&gt;edit config files&lt;/li&gt;
&lt;li&gt;apply for third-party API keys&lt;/li&gt;
&lt;li&gt;write code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The article’s argument here is simple: because MiniMax is the model vendor, “model access” is a first-class part of the product, not something you bolt on.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Confirm the baseline capabilities
&lt;/h3&gt;

&lt;p&gt;Your agent is created with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;web search&lt;/li&gt;
&lt;li&gt;image understanding + generation&lt;/li&gt;
&lt;li&gt;video understanding + generation&lt;/li&gt;
&lt;li&gt;web extraction&lt;/li&gt;
&lt;li&gt;code execution&lt;/li&gt;
&lt;li&gt;file handling (Word/Excel/PPT)&lt;/li&gt;
&lt;li&gt;image search&lt;/li&gt;
&lt;li&gt;web deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In OpenClaw, you’d often install/configure these one by one. In MaxClaw, they are presented as out-of-the-box, without extra API charges.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. Deep Integration Example: Using Feishu for Cross-Platform Work
&lt;/h2&gt;

&lt;p&gt;MaxClaw emphasizes messaging-platform integration, especially for mainstream Chinese workplace tools.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why connect Feishu?
&lt;/h3&gt;

&lt;p&gt;Once connected, you can message the agent directly inside Feishu to assign tasks, without opening the web UI. Deliverables and results can still be viewed in the web interface, enabling cross-device collaboration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-step Feishu integration
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Step 1: Request setup guidance inside MaxClaw
&lt;/h4&gt;

&lt;p&gt;In the MaxClaw chat, type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I want to integrate with Lark. Please guide me through the configuration process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MaxClaw recognizes the intent and guides you through the configuration steps.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Create an app on Feishu Open Platform
&lt;/h4&gt;

&lt;p&gt;Following the guidance, go to &lt;strong&gt;open.feishu.cn&lt;/strong&gt; and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log in&lt;/li&gt;
&lt;li&gt;create an app&lt;/li&gt;
&lt;li&gt;choose “enterprise self-built app”&lt;/li&gt;
&lt;li&gt;fill in name/description (anything is fine)&lt;/li&gt;
&lt;li&gt;enable the “bot” capability&lt;/li&gt;
&lt;li&gt;configure required permissions under “events &amp;amp; callbacks”&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Step 3: Provide App ID and App Secret to MaxClaw
&lt;/h4&gt;

&lt;p&gt;Once created, Feishu gives you an &lt;strong&gt;App ID&lt;/strong&gt; and &lt;strong&gt;App Secret&lt;/strong&gt;. Send them to MaxClaw; it completes the remaining configuration.&lt;/p&gt;

&lt;p&gt;No forms. No config files. The workflow is conversational: you say what you want; it tells you what to do.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 4: Verify
&lt;/h4&gt;

&lt;p&gt;Find the bot in Feishu and send a test message.&lt;/p&gt;

&lt;p&gt;The article notes MaxClaw supports similar flows for DingTalk, Telegram, WhatsApp, Discord, Slack, etc.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. “Expert Modes”: More Than Chat, More Like Configured Tools
&lt;/h2&gt;

&lt;p&gt;MaxClaw includes multiple “expert configuration modes,” each mapping to a professional working style. Switching modes is intended to load a different set of capabilities and workflows quickly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Switching modes
&lt;/h3&gt;

&lt;p&gt;In the MaxClaw UI, go to &lt;strong&gt;Settings → Current Configuration&lt;/strong&gt; and select a mode.&lt;/p&gt;
&lt;h3&gt;
  
  
  Image creation mode
&lt;/h3&gt;

&lt;p&gt;In “Image Creation,” MaxClaw acts like a design assistant. Example prompt from the article:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please help me create a tech-style poster with the theme "AI Redefining Efficiency".&lt;br&gt;
The color scheme should be mainly dark blue and silver-white, and it needs to incorporate futuristic geometric elements.&lt;br&gt;
The size should be in portrait mode for mobile phones, with space at the bottom for text.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MaxClaw generates an image and can iterate via natural-language feedback.&lt;/p&gt;

&lt;p&gt;The contrast with OpenClaw is operational: on OpenClaw, you’d typically install an image-generation skill and wire up an API first.&lt;/p&gt;
&lt;h3&gt;
  
  
  MAX mode (default)
&lt;/h3&gt;

&lt;p&gt;“MAX” is the general-purpose mode and is framed as automatically choosing the right Office skills based on task type—especially for Word/PPT/Excel workloads.&lt;/p&gt;
&lt;h3&gt;
  
  
  Custom experts
&lt;/h3&gt;

&lt;p&gt;Beyond presets, you can define custom experts via natural language. That leads to the larger concept: &lt;strong&gt;Expert 2.0&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  8. Expert 2.0: A Community Workflow Library
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What Expert 2.0 is
&lt;/h3&gt;

&lt;p&gt;Expert 2.0 is MiniMax Agent’s ecosystem for reusable “expert agents.” Each “expert” is a pre-configured workflow: domain knowledge + tools + execution logic.&lt;/p&gt;

&lt;p&gt;As of &lt;strong&gt;February 2026&lt;/strong&gt;, the article claims there are &lt;strong&gt;16,000+&lt;/strong&gt; expert agents created and used across areas like development, creative writing, office productivity, and finance.&lt;/p&gt;
&lt;h3&gt;
  
  
  What it changes, operationally
&lt;/h3&gt;

&lt;p&gt;Before Expert 2.0, building a serious agent often meant manually configuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;skills&lt;/li&gt;
&lt;li&gt;sub-agents&lt;/li&gt;
&lt;li&gt;MCP (Model Context Protocol)&lt;/li&gt;
&lt;li&gt;prompt structures and orchestration logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expert 2.0 reframes this as: describe the goal in natural language, and the system derives SOP, tool orchestration, and capability configuration.&lt;/p&gt;

&lt;p&gt;Example from the article (financial modeling expert):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You need to create an expert for me, skilled in using Excel's native capabilities to build professional financial models (DCF, sensitivity analysis), and deliver a complete, error-free .xlsx file.&lt;br&gt;
You need to break down the necessary knowledge, skills, and process configurations required for this expert role.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system is described as automatically injecting domain knowledge (DCF, sensitivity analysis, Excel function conventions), configuring tools/sub-agents, generating example scenarios, and enforcing output rigor.&lt;/p&gt;
&lt;h3&gt;
  
  
  Using existing experts
&lt;/h3&gt;

&lt;p&gt;If you don’t want to build your own, browse the community, click “Use,” and then provide minimal input.&lt;/p&gt;

&lt;p&gt;The finance example in the original text: you specify a company, and the expert agent runs a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;map company → ticker&lt;/li&gt;
&lt;li&gt;pull financial data&lt;/li&gt;
&lt;li&gt;retrieve recent news and industry context&lt;/li&gt;
&lt;li&gt;run DCF analysis&lt;/li&gt;
&lt;li&gt;generate a complete report (business model, financial health, team, competition, valuation conclusion)&lt;/li&gt;
&lt;/ul&gt;
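
&lt;p&gt;The input can be as minimal as a single sentence, for instance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Run a full analysis and valuation report for Tesla.&lt;/p&gt;
&lt;/blockquote&gt;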

&lt;p&gt;The comparison is again about time-to-value: OpenClaw can do it, but you assemble the pipeline yourself; Expert 2.0 is positioned as click + one sentence.&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating your own expert
&lt;/h3&gt;

&lt;p&gt;If no existing expert matches, you define one via natural language.&lt;/p&gt;

&lt;p&gt;Example: an e-commerce competitor monitoring expert, with responsibilities, data dimensions, output requirements, and triggers (like weekly Monday reports).&lt;/p&gt;
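
&lt;p&gt;A definition in that spirit might read:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create an expert that monitors my e-commerce competitors.
Responsibilities: track [competitor list] across product pages and social channels.
Data dimensions: price changes, new SKUs, promotions, review sentiment.
Output: a weekly table of changes with links, plus a short strategic comment.
Trigger: compile and send the report every Monday at 9:00 AM.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;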

&lt;p&gt;The article notes MiniMax provides &lt;strong&gt;15 free rounds&lt;/strong&gt; of creation/debugging per user to refine an expert.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why a community matters
&lt;/h3&gt;

&lt;p&gt;The piece frames Expert 2.0 as a knowledge-sharing mechanism: professional experience can be “packaged” into executable workflows.&lt;/p&gt;

&lt;p&gt;It also mentions future plans for creator pricing/revenue sharing and team-level expert sharing—turning individual expertise into reusable team infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  9. Advanced Workflows: Prompt Templates You Can Reuse
&lt;/h2&gt;

&lt;p&gt;This section is intentionally hands-on: complete prompt templates you can copy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Scheduled news collection + topic selection
&lt;/h3&gt;

&lt;p&gt;For creators, marketers, and researchers. For example, a template along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every morning at 8:00, search for the latest news about [your topic].
Collect the 10 most significant items from the past 24 hours.
For each item, summarize it in 2-3 sentences and include the source URL.
Based on what you find, suggest 3 content topics, each with a one-line angle.
Send the result to me as a Markdown digest.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The article highlights transparency: you can see which sites it visits and what it reads, making it easier to trust it’s not fabricating.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: GitHub project parsing + outline generation
&lt;/h3&gt;

&lt;p&gt;For technical bloggers, PMs, or readers who struggle with long English READMEs. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read the repository at [GitHub URL], including the README and docs.
Summarize: the problem it solves, how it works, and how to get started.
List the 5 concepts a newcomer most needs to understand.
Then generate a blog-post outline introducing the project, with section
headings and a one-line note for each section.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: Business trip planning automation
&lt;/h3&gt;

&lt;p&gt;For frequent travelers, assistants, and admins. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I'm traveling to [city] from [date] to [date] for [purpose].
Find flight options departing after 9:00 AM, hotels near [venue] under
[budget] per night, and the weather forecast for those dates.
Produce an itinerary: flights, a hotel recommendation with reasons,
a packing note based on the weather, and a day-by-day schedule.
Output everything as a single Markdown document I can share.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 4: Multilingual translation + localization workflow
&lt;/h3&gt;

&lt;p&gt;Not “sentence translation,” but a professional-style pipeline: analyze → terminology → translate → QA. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Translate the attached document from [source language] to [target language]:
1. Analyze the document type, audience, and tone.
2. Extract domain terminology and build a glossary; keep product names as-is.
3. Translate using the glossary, adapting idioms and formatting to the target locale.
4. QA pass: check consistency against the glossary and flag anything that
   needs human review.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 5: Automated code review workflow
&lt;/h3&gt;

&lt;p&gt;For teams, tech leads, and indie developers. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review the code at [repository URL].
Check for logic errors, unhandled edge cases, security issues
(injection, unsafe input handling), and performance problems.
For each finding: cite the file and line, explain the risk, and propose a fix.
Summarize the top 3 issues first, with an overall severity rating.
Format the output as a Markdown review report.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The article adds an important limitation: even with strong coding benchmarks (SWE-Bench Verified 80.2%), AI review should be treated as guidance. For critical production logic, experienced engineers should make the final call.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. M2.5, Explained: Why MaxClaw Behaves Like an Agent (Not Just a Chatbot)
&lt;/h2&gt;

&lt;p&gt;MaxClaw’s behavior is attributed to M2.5’s agent-oriented design.&lt;/p&gt;

&lt;h3&gt;
  
  
  MoE: strong capability without always paying full cost
&lt;/h3&gt;

&lt;p&gt;M2.5 uses Mixture-of-Experts: ~229B parameters total, with only ~10B (roughly 4% of the weights) activated per inference.&lt;/p&gt;

&lt;p&gt;The article’s analogy: a large hospital with many specialist departments—patients don’t require every doctor at once; triage routes them to the relevant specialists. That’s the idea behind sparse activation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forge + CISPO: reinforcement learning for agents
&lt;/h3&gt;

&lt;p&gt;MiniMax trains M2.5 using its own Forge RL framework and a CISPO algorithm designed to keep large-scale training stable. The text describes CISPO as clipping importance-sampling weights to constrain training while still allowing exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interleaved Thinking: “think → act → observe → reflect → act”
&lt;/h3&gt;

&lt;p&gt;M2.5 includes “Interleaved Thinking,” enabling dynamic reasoning at multiple points during execution rather than “think once, answer once.” This matters for agents that search, browse, and adapt mid-run (e.g., revising search queries if results are poor).&lt;/p&gt;

&lt;h3&gt;
  
  
  Native agent optimization and “spec-first” behavior
&lt;/h3&gt;

&lt;p&gt;The article claims M2.5 was reinforced across 10+ programming languages and hundreds of thousands of real environments, supporting full lifecycle work: system design, environment setup, iteration, testing.&lt;/p&gt;

&lt;p&gt;It also highlights “native spec behavior”: before coding, the model tends to decompose requirements, plan system structure, and even outline UI layouts—more like an architect than a code autocomplete engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long context
&lt;/h3&gt;

&lt;p&gt;M2.5 supports up to &lt;strong&gt;262,144 tokens&lt;/strong&gt; of context (the article notes this is roughly 200k Chinese characters), useful for long documents and complex multi-turn tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarks summarized
&lt;/h3&gt;

&lt;p&gt;From the original text:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-Bench Verified: &lt;strong&gt;80.2%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Multi-SWE-Bench: &lt;strong&gt;51.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;BrowseComp: &lt;strong&gt;76.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GDPval-MM: &lt;strong&gt;59.0%&lt;/strong&gt; average win rate (office tasks)&lt;/li&gt;
&lt;li&gt;RISE: “leading level” (real-world expert search tasks)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open-source weights
&lt;/h3&gt;

&lt;p&gt;The article notes that M2.5 weights are fully open-sourced on HuggingFace. The implication: MiniMax differentiates via the hosted product experience (MaxClaw), not only by keeping the model closed.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Security and Privacy: What You Should Consider Before Using It
&lt;/h2&gt;

&lt;p&gt;Agents are powerful because they touch real systems. That comes with responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data security
&lt;/h3&gt;

&lt;p&gt;MaxClaw is cloud-hosted, meaning interaction data goes through MiniMax servers. If you handle highly sensitive business data or personal privacy data, you should evaluate whether cloud usage fits your security posture.&lt;/p&gt;

&lt;p&gt;If you need maximum data control, self-hosting OpenClaw can keep data on your own infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Credentials (App ID / App Secret) handling
&lt;/h3&gt;

&lt;p&gt;When integrating Feishu or DingTalk, the App ID and App Secret you hand over function as access keys. Configure them in a trusted environment and treat them as sensitive secrets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission boundaries
&lt;/h3&gt;

&lt;p&gt;Follow least privilege: grant only what’s necessary. Avoid broad, persistent permissions when a narrower scope works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt injection risk
&lt;/h3&gt;

&lt;p&gt;Like all browsing agents, MaxClaw can be exposed to malicious instructions embedded in web pages or external content (prompt injection). The article says MaxClaw includes some mitigations, but users should still verify outputs—especially for important decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Competitive Landscape: Where MaxClaw Fits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MaxClaw vs self-hosted OpenClaw
&lt;/h3&gt;

&lt;p&gt;The core conclusion is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw: best for technical users and those with strict data control requirements&lt;/li&gt;
&lt;li&gt;MaxClaw: best for people who want “fast onboarding” and don’t want to manage infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MaxClaw vs Alibaba CoPaw
&lt;/h3&gt;

&lt;p&gt;The article describes CoPaw as a domestic OpenClaw alternative with broad IM integration (DingTalk/Feishu/QQ) and both local + cloud deployment options.&lt;/p&gt;

&lt;p&gt;The difference, as framed here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoPaw aligns with Alibaba Cloud’s ecosystem and enterprise use cases&lt;/li&gt;
&lt;li&gt;MaxClaw aligns with MiniMax’s ecosystem and emphasizes agent-optimized model behavior plus the Expert 2.0 workflow community&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MaxClaw vs lightweight variants (ZeroClaw, NanoClaw)
&lt;/h3&gt;

&lt;p&gt;ZeroClaw and NanoClaw are lightweight OpenClaw implementations (a few thousand, or even a few hundred, lines of code). They’re great for teaching and understanding core agent mechanics, but they don’t offer the managed hosting, built-in toolchain, or expert ecosystem described for MaxClaw.&lt;/p&gt;

&lt;h3&gt;
  
  
  MaxClaw vs developer frameworks (LangChain, AutoGen)
&lt;/h3&gt;

&lt;p&gt;This is a category difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain / AutoGen&lt;/strong&gt;: building blocks and orchestration frameworks; developers assemble, host, and maintain agents themselves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MaxClaw&lt;/strong&gt;: a packaged, ready-to-use agent product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want deep customization and you’re writing code, frameworks fit better. If you want an agent that works immediately, MaxClaw is the closer match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Broader China agent ecosystem context
&lt;/h3&gt;

&lt;p&gt;The article notes the domestic agent landscape in early 2026 is active: Alibaba (CoPaw, Bailian), ByteDance (Coze), Baidu (Qianfan AppBuilder), Tencent (Yuanqi), and others.&lt;/p&gt;

&lt;p&gt;MiniMax’s differentiation is summarized as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;M2.5 is optimized for agent use (tools + multi-step reasoning)&lt;/li&gt;
&lt;li&gt;Expert 2.0 provides UGC workflow depth&lt;/li&gt;
&lt;li&gt;Deep integration with OpenClaw inherits ecosystem resources and community experience&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  13. Practical Usage Advice and Common Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A suggested onboarding plan (first 5 days)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 1&lt;/strong&gt;: try basic tasks (search, generate an image) to get a feel for how it differs from a standard chatbot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2&lt;/strong&gt;: build one simple automation (e.g., “when I send a URL, summarize it”; see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3&lt;/strong&gt;: use one existing Expert 2.0 workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4&lt;/strong&gt;: connect your primary chat tool (Feishu/DingTalk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5+&lt;/strong&gt;: create a custom expert for your real job workflow&lt;/li&gt;
&lt;/ul&gt;
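
&lt;p&gt;To make the Day 2 exercise concrete, the automation boils down to fetch, truncate, summarize. This is a minimal sketch: &lt;code&gt;call_model&lt;/code&gt; is a hypothetical stand-in for whatever model endpoint your agent uses, not a real MaxClaw function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

def call_model(prompt):
    # Hypothetical stub; wire this to your model endpoint.
    raise NotImplementedError

def summarize_url(url):
    # Fetch the page, keep the prompt small, then ask for a summary.
    page = requests.get(url, timeout=15).text
    snippet = page[:8000]
    return call_model(f"Summarize this page in 5 bullet points:\n{snippet}")
&lt;/code&gt;&lt;/pre&gt;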

&lt;h3&gt;
  
  
  Prompting tips from the original article
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;specify role (“senior market analyst”, “technical documentation specialist”)&lt;/li&gt;
&lt;li&gt;describe requirements structurally (steps + expected output)&lt;/li&gt;
&lt;li&gt;define output format (Markdown/table/JSON)&lt;/li&gt;
&lt;li&gt;provide positive/negative examples if quality matters (all four tips are combined in the example prompt after this list)&lt;/li&gt;
&lt;/ul&gt;
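
&lt;p&gt;Put together, those four tips produce prompts shaped roughly like this (an illustrative template, not wording from the original article):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PROMPT = """
Role: senior market analyst.

Task:
1. Read the attached earnings summary.
2. Extract revenue, margin, and guidance changes.
3. Flag anything unusual versus last quarter.

Output format: a JSON object with the keys
"revenue", "margin", "guidance", "flags".

Good flag example: "guidance cut despite a revenue beat".
Bad flag example: "numbers look different".
"""
&lt;/code&gt;&lt;/pre&gt;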

&lt;h3&gt;
  
  
  FAQ (as stated in the original text)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is MaxClaw free?&lt;/strong&gt;&lt;br&gt;
It requires a MiniMax Agent basic subscription; check agent.minimaxi.com for the latest pricing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MaxClaw vs MiniMax Agent—what’s the difference?&lt;/strong&gt;&lt;br&gt;
MiniMax Agent is the general AI chat platform; MaxClaw is a specific module focused on automated agent execution—an “agent mode” within the platform.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Will my workflows and experts be lost?&lt;/strong&gt;&lt;br&gt;
MaxClaw includes &lt;strong&gt;50GB&lt;/strong&gt; dedicated cloud storage, and configurations/data persist in the cloud. The article still recommends backing up important configurations for safety.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What languages are supported?&lt;/strong&gt;&lt;br&gt;
M2.5 supports Chinese and English, among others. You can interact in Chinese while processing English content (e.g., reading English docs).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  14. The Trend View: Why MaxClaw Matters (According to This Article)
&lt;/h2&gt;

&lt;p&gt;The original text frames MaxClaw as part of a broader shift in the agent space:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;From capability competition to experience competition&lt;/strong&gt;&lt;br&gt;
By early 2026, the differentiator is less “who has the biggest benchmark score” and more “who offers the shortest path from idea to working automation.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;From tool to assistant&lt;/strong&gt;&lt;br&gt;
Agents move beyond input/output into proactive behaviors: schedules, triggers, cross-platform execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;From individual capability to ecosystem capability&lt;/strong&gt;&lt;br&gt;
Expert 2.0 turns individual expertise into reusable workflows, scaling “collective intelligence” through UGC.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  Conclusion: Who MaxClaw Is For (and Who It Isn’t)
&lt;/h1&gt;

&lt;p&gt;This article’s conclusion can be distilled into three points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It removes deployment friction.&lt;/strong&gt;&lt;br&gt;
Hosted infrastructure and one-click provisioning collapse a complex setup into a simple action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It ships a full toolchain by default.&lt;/strong&gt;&lt;br&gt;
Search, image/video generation, code execution, and document handling are available without manual API wiring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It leans on an expert workflow ecosystem.&lt;/strong&gt;&lt;br&gt;
Expert 2.0 is positioned as “solutions, not just tools,” enabling reuse and knowledge sharing through workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Practical guidance from the original author:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you wanted to try OpenClaw but got blocked by setup, MaxClaw is a low-friction entry point.&lt;/li&gt;
&lt;li&gt;If you’re a developer, it can be a fast way to validate ideas without rebuilding an environment each time.&lt;/li&gt;
&lt;li&gt;If you’re a creator or operator, Expert 2.0’s ready-made workflows can bootstrap an automation pipeline quickly.&lt;/li&gt;
&lt;li&gt;If you have strict security requirements, you can learn agent usage on MaxClaw first, then consider self-hosting OpenClaw once you’re confident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent era is still early. MaxClaw and Expert 2.0 are presented here as a step toward making “everyone has their own AI assistant” feel less like a slogan and more like something you can actually use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official access:&lt;/strong&gt; &lt;a href="https://agent.minimaxi.com/" rel="noopener noreferrer"&gt;https://agent.minimaxi.com/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>MiniMax MaxClaw: The Ultimate Stand-In for OpenClaw?</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Fri, 27 Feb 2026 04:56:58 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/minimax-maxclaw-the-ultimate-stand-in-for-openclaw-38mk</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/minimax-maxclaw-the-ultimate-stand-in-for-openclaw-38mk</guid>
      <description>&lt;h2&gt;
  
  
  MaxClaw: Is This the Ultimate Replacement for OpenClaw?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xeqt1d2vv9rwhtsdn4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xeqt1d2vv9rwhtsdn4r.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw is arguably the hardest-to-ignore open-source AI project of early 2026.&lt;/p&gt;

&lt;p&gt;What started as a weekend side project in late 2025 has grown into a phenomenon: over 220,000 GitHub stars and millions of weekly visits, pushing the idea of locally deployed AI agents far beyond niche hacker circles and into mainstream discussion.&lt;/p&gt;

&lt;p&gt;But alongside the hype, one very practical sentiment has never gone away:&lt;/p&gt;

&lt;p&gt;“I want to use it — I just don’t know how to install it.”&lt;/p&gt;

&lt;p&gt;Environment setup, cloning repos, configuring API keys, editing &lt;code&gt;config.toml&lt;/code&gt;, wiring up Telegram or Slack… none of these steps is individually hard. Taken together, they’re enough to stop most non-technical users cold. In the OpenClaw Discord, deployment questions have consistently been the most common category.&lt;/p&gt;

&lt;p&gt;This week, &lt;a href="https://minimax.io/" rel="noopener noreferrer"&gt;MiniMax&lt;/a&gt; offered a clear response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd14ltzcg8m4dttzbwxny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd14ltzcg8m4dttzbwxny.png" alt=" " width="800" height="1031"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They introduced MaxClaw: a fully hosted, cloud version of OpenClaw, integrated directly into the MiniMax Agent web interface. At the same time, they upgraded their expert agent system to Expert 2.0.&lt;/p&gt;

&lt;p&gt;Two announcements, one shared goal: lower the barrier to entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MaxClaw Actually Is
&lt;/h2&gt;

&lt;p&gt;The short version: MaxClaw is OpenClaw running on MiniMax’s cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof73fvjde1tnup0y2glr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof73fvjde1tnup0y2glr.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the hood, it’s powered by MiniMax M2.5, a model released only recently but already notable. Within a week of launching on OpenRouter, it climbed to the top of the token usage charts. On SWE-Bench Verified, it scored 80.2%, with programming and agent-style tasks as its clear strengths.&lt;/p&gt;

&lt;p&gt;MaxClaw packages those capabilities into a browser-based product.&lt;/p&gt;

&lt;p&gt;There’s no need to provision servers or manage API keys. You log into the MiniMax Agent website, click MaxClaw in the sidebar, and within seconds the agent is live.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools, Skills, and What’s Included by Default
&lt;/h3&gt;

&lt;p&gt;Functionally, MaxClaw builds on OpenClaw’s original capabilities—image understanding, video understanding, web extraction, search—and extends them with a set of built-in tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image generation&lt;/li&gt;
&lt;li&gt;Video generation&lt;/li&gt;
&lt;li&gt;Image search&lt;/li&gt;
&lt;li&gt;Web app deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Crucially, these tools don’t require third-party API setup and don’t incur extra fees. You can chain tasks end to end: search for news, find images, write copy, and package the output in one run. You can also connect it to Notion for structured archiving, or use the built-in arXiv search skill to create a live academic paper monitor.&lt;/p&gt;
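
&lt;p&gt;Conceptually, such an end-to-end run is a pipeline over the built-in tools. The function names below are hypothetical stand-ins (MaxClaw invokes its tools for you; it does not expose this API); the sketch only shows the chaining idea.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical tool stubs standing in for the built-in tools.
def search_news(topic): ...
def find_images(topic, n): ...
def write_copy(topic, sources): ...

def daily_brief(topic):
    # Each step feeds the next; the agent runs the whole chain.
    sources = search_news(topic)
    return {
        "copy": write_copy(topic, sources),
        "images": find_images(topic, 3),
        "sources": sources,
    }
&lt;/code&gt;&lt;/pre&gt;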

&lt;p&gt;Like OpenClaw, MaxClaw supports integrations with Slack, Feishu, Telegram, and DingTalk. The difference is in usability. Instead of reading documentation, you can simply ask MaxClaw how to connect a platform. It walks you through the process step by step. No code required.&lt;/p&gt;

&lt;p&gt;Once connected, you effectively gain a 24/7 always-online assistant inside your work channels—ready to be @-mentioned for research, drafting, meeting summaries, or task breakdowns.&lt;/p&gt;

&lt;h2&gt;
  
  
  MaxClaw vs OpenClaw: Which One Should You Choose?
&lt;/h2&gt;

&lt;p&gt;This is the first question most people ask when they see MaxClaw.&lt;/p&gt;

&lt;p&gt;The answer depends on who you are.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw (Open Source) vs &lt;a href="https://maxclaw.ai/" rel="noopener noreferrer"&gt;MaxClaw (MiniMax)&lt;/a&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenClaw (Open Source Version)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MaxClaw (MiniMax Version)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted: local PC, VPS, Mac mini, home server, etc. Requires manual setup: install Node.js, configure environment, pull code, run services&lt;/td&gt;
&lt;td&gt;Cloud-hosted: log in to &lt;code&gt;agent.minimax.io&lt;/code&gt; or &lt;code&gt;agent.minimaxi.com&lt;/code&gt; → click &lt;strong&gt;MaxClaw&lt;/strong&gt; in the left menu → ready in seconds. No server or environment setup required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Key Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required. You must prepare your own model API keys (e.g. Claude, MiniMax M2.5, Kimi, DeepSeek, GLM). Costs are paid by the user&lt;/td&gt;
&lt;td&gt;Generally not required. Uses MiniMax’s own &lt;strong&gt;M2.5&lt;/strong&gt; model by default. No external API fees (consumes platform credits instead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtime Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Online only when your machine is running. Shutdowns, network drops, or reboots cause downtime. Uptime must be maintained by the user&lt;/td&gt;
&lt;td&gt;Cloud-based, &lt;strong&gt;24/7 always-on&lt;/strong&gt;. Maintained by MiniMax infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ease of Getting Started&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium–high difficulty: requires basic command-line knowledge, editing &lt;code&gt;config.toml&lt;/code&gt;, integrating chat tools (Telegram / WhatsApp / Discord / Slack), and debugging&lt;/td&gt;
&lt;td&gt;Extremely easy: natural language chat out of the box. Beginner-friendly. ~10 seconds to start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in Tools / Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3000+ community open-source plugins, but you must discover, install, and configure them yourself&lt;/td&gt;
&lt;td&gt;Officially curated expert-level skills (deal hunting, multi-agent research, trend tracking, image/search/video generation, app deployment). Works out of the box and can directly call thousands of expert agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage &amp;amp; Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local storage or self-configured storage. Full data ownership and control&lt;/td&gt;
&lt;td&gt;Includes &lt;strong&gt;50 GB dedicated cloud storage&lt;/strong&gt; + long-term memory. Data is stored in MiniMax cloud (privacy trade-off for convenience)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flexible support for arbitrary models and IM tools, but requires manual integration&lt;/td&gt;
&lt;td&gt;Deep integration with the MiniMax Agent ecosystem (Expert 2.0 agents directly callable). Supports Feishu, DingTalk, and other IM tools; mobile support planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model API fees + VPS / hardware / electricity costs. Can be very low with inexpensive models (e.g. MiniMax M2.5)&lt;/td&gt;
&lt;td&gt;Credit-based pricing: basic users receive &lt;strong&gt;1000 credits initially + 200 credits daily&lt;/strong&gt; (free tier covers most daily use). Subscription required for heavy usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Privacy / Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Highest&lt;/strong&gt;: fully local or self-hosted. Data never leaves your own devices&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Medium&lt;/strong&gt;: data stored in MiniMax cloud (with security and compliance guarantees). Best for non-sensitive tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Target Users&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Power users, developers, privacy-sensitive users, people who want deep customization&lt;/td&gt;
&lt;td&gt;General users, those who want to experience OpenClaw without deployment hassle, MiniMax ecosystem users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Current Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actively maintained open-source project (official site: &lt;code&gt;https://openclaw.ai&lt;/code&gt;), community-driven&lt;/td&gt;
&lt;td&gt;Newly launched experimental feature (late Feb 2026). Rapidly gaining traction; often described as the “first major OpenClaw cloud offering in China”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Deployment and Setup
&lt;/h3&gt;

&lt;p&gt;OpenClaw requires self-deployment. You can run it locally, on a VPS, or on a Mac mini—but you’ll need to install Node.js, configure the environment, connect messaging platforms, and debug issues yourself.&lt;/p&gt;

&lt;p&gt;MaxClaw is one-click cloud deployment. Log in at agent.minimax.io, click MaxClaw, and it’s ready in about ten seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw depends on your own machine&lt;/a&gt;. Shut it down, and the agent goes offline.&lt;/p&gt;

&lt;p&gt;MaxClaw runs continuously in the cloud, available 24/7.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools and Skills
&lt;/h3&gt;

&lt;p&gt;OpenClaw relies on a community ecosystem of 3,000+ open-source plugins. The flexibility is high, but selection and configuration are on you.&lt;/p&gt;

&lt;p&gt;MaxClaw comes with a curated set of official skills out of the box—trend tracking, multi-agent research teams, image/search/video generation, app deployment—and remains compatible with OpenClaw’s ClawHub skills. It can also directly invoke over 16,000 expert agents on the MiniMax platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage, Privacy, and Cost
&lt;/h3&gt;

&lt;p&gt;OpenClaw keeps all data local, offering maximum privacy and control.&lt;/p&gt;

&lt;p&gt;MaxClaw includes 50 GB of cloud storage and long-term memory, with data hosted on MiniMax’s servers—a convenience-for-privacy trade-off.&lt;/p&gt;

&lt;p&gt;In terms of cost, OpenClaw expenses come from model APIs and hardware or electricity. MaxClaw uses a credit system: the basic plan includes an initial 1,000 credits plus 200 daily credits, which is sufficient for most routine use.&lt;/p&gt;

&lt;p&gt;This is not a “replacement” story.&lt;/p&gt;

&lt;p&gt;OpenClaw is for developers, tinkerers, and users with strict privacy or customization requirements. MaxClaw is for general users, creators, and teams who want something that works immediately.&lt;/p&gt;

&lt;p&gt;What MiniMax has done is add a cloud layer on top of the OpenClaw ecosystem—shifting the threshold from “able to write code” to “able to type.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert 2.0 and the MiniMax Agent Ecosystem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgzrmcrgud8dpl5gxbl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgzrmcrgud8dpl5gxbl4.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alongside MaxClaw, MiniMax also released Expert 2.0, a major update to its expert agent system.&lt;/p&gt;

&lt;p&gt;The MiniMax Agent interface is straightforward. The top half of the sidebar is the MiniMax Lab section (where MaxClaw lives). The lower half is the Expert module. Inside “Explore Experts,” you’ll find a categorized community covering technical development, creative writing, office productivity, finance, marketing, education, design, and audio/video work. Each expert lists its creator and usage metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Change in Expert 2.0: How Experts Are Created
&lt;/h2&gt;

&lt;p&gt;Previously, building an expert agent meant manually defining skills, arranging sub-agents, configuring MCP connections, and structuring prompts—manageable for developers, intimidating for everyone else.&lt;/p&gt;

&lt;p&gt;Now, you simply describe the goal in natural language. The system automatically handles SOP design, tool orchestration, and capability configuration.&lt;/p&gt;

&lt;p&gt;For example, if you want an expert focused on AI and technology news, you can create or reuse an existing one that tracks relevant topics, summarizes daily updates, and even generates interactive polls.&lt;/p&gt;

&lt;p&gt;As of now, over 16,000 expert agents have been created and used on the platform. MiniMax has also outlined what’s next: creator pricing and revenue sharing (experts can be monetized per call), and team-level expert sharing so individual expertise becomes shared infrastructure.&lt;/p&gt;

&lt;p&gt;The intent is clear. MiniMax isn’t just shipping an AI product—it’s building an agent ecosystem. Expert agents are the content, MaxClaw is the entry point, and MiniMax M2.5 is the foundation. Together, they form a closed loop from model capability to application distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;What MiniMax did here isn’t technically flashy. The idea is almost straightforward: OpenClaw is powerful but hard to set up, so remove the setup. Expert agents are valuable but tedious to configure, so let natural language handle it.&lt;/p&gt;

&lt;p&gt;The product judgment, however, is sound.&lt;/p&gt;

&lt;p&gt;In the agent space right now, the biggest bottleneck isn’t model capability. It’s the gap between “technically possible” and “pleasant to use.”&lt;/p&gt;

&lt;p&gt;MaxClaw still has things to prove. Cloud hosting means giving up some data control. Whether the credit model remains cost-effective long term, and how stable MiniMax M2.5 is across diverse workloads, will only become clear with time and user feedback.&lt;/p&gt;

&lt;p&gt;But at this moment, MaxClaw offers a very clear option: if OpenClaw intrigued you but you never quite took the plunge, this is the lowest-friction way to try it.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>DeepSeek V4 Explained: mHC, Engram, and Native Sparse Attention Powering 1M-Token Context</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Sat, 21 Feb 2026 10:06:14 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/deepseek-v4-explained-mhc-engram-and-native-sparse-attention-powering-1m-token-context-5728</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/deepseek-v4-explained-mhc-engram-and-native-sparse-attention-powering-1m-token-context-5728</guid>
      <description>&lt;h2&gt;
  
  
  DeepSeek V4: Architectural Innovation Driving AI Beyond Its Limits
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 introduces a new architectural direction for large language models.&lt;/p&gt;

&lt;p&gt;Instead of relying solely on scale, it combines &lt;strong&gt;three structural innovations&lt;/strong&gt;—&lt;strong&gt;mHC&lt;/strong&gt;, &lt;strong&gt;Engram&lt;/strong&gt;, and &lt;strong&gt;NSA&lt;/strong&gt;—to unlock &lt;strong&gt;million-token–level long-context processing&lt;/strong&gt; with significantly lower inference cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqopkjsr7guqzs5pw8cqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqopkjsr7guqzs5pw8cqc.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, DeepSeek V4 focuses on one core idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Decouple depth, memory, and attention efficiency—so each can scale without breaking the system.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below is a breakdown of what’s new, why it matters, and how these changes translate into real performance gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://arxiv.org/pdf/2512.24880" rel="noopener noreferrer"&gt;mHC Architecture&lt;/a&gt;: A Stable and Efficient Foundation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What problem it solves
&lt;/h3&gt;

&lt;p&gt;Deep transformer models often struggle with two related issues as depth increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;information flow degradation&lt;/li&gt;
&lt;li&gt;training instability (gradient explosion or collapse)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These problems limit how deeply models can scale without excessive tuning or compute waste.&lt;/p&gt;

&lt;h3&gt;
  
  
  How mHC works
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;mHC (Manifold-constrained Hyper-Connections)&lt;/strong&gt; architecture addresses this by constraining the connection matrices to a &lt;strong&gt;doubly stochastic matrix manifold&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;signal gain is kept stable (around &lt;strong&gt;1.6×&lt;/strong&gt;) across layers&lt;/li&gt;
&lt;li&gt;deep representations are preserved&lt;/li&gt;
&lt;li&gt;training collapse is avoided even at large depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a model that remains expressive without becoming fragile.&lt;/p&gt;
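
&lt;p&gt;To see what the constraint means in practice, here is a textbook Sinkhorn-style normalization that drives a positive matrix toward the doubly stochastic manifold (every row and column sums to 1). This illustrates the constraint itself, not DeepSeek’s actual mHC code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def sinkhorn(weights, iters=50):
    """Alternately normalize rows and columns of a positive matrix.
    The fixed point is doubly stochastic, which bounds how much any
    layer can amplify or attenuate the residual signal."""
    m = np.abs(weights) + 1e-9
    for _ in range(iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

m = sinkhorn(np.random.rand(4, 4))
print(m.sum(axis=0), m.sum(axis=1))  # both close to all-ones
&lt;/code&gt;&lt;/pre&gt;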

&lt;h3&gt;
  
  
  Measured impact
&lt;/h3&gt;

&lt;p&gt;According to internal benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute utilization improves from an industry average of ~60% to &lt;strong&gt;85%+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;training stability increases significantly&lt;/li&gt;
&lt;li&gt;reliance on raw compute is reduced by &lt;strong&gt;30%+&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, mHC makes depth &lt;em&gt;usable&lt;/em&gt;, not just theoretically possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://github.com/deepseek-ai/Engram" rel="noopener noreferrer"&gt;Engram&lt;/a&gt;: Decoupling Memory from Compute
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The core idea
&lt;/h3&gt;

&lt;p&gt;Engram is a &lt;strong&gt;conditional memory module&lt;/strong&gt; designed to offload static knowledge—such as entities, formulas, and factual mappings—from expensive GPU memory (HBM) to much cheaper system memory (DRAM).&lt;/p&gt;

&lt;p&gt;Instead of keeping everything “in mind” at all times, the model &lt;strong&gt;looks things up&lt;/strong&gt; when needed.&lt;/p&gt;

&lt;p&gt;Think of it as giving the model a fast, structured reference system—closer to a dictionary than a cache.&lt;/p&gt;
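
&lt;p&gt;A toy version of that lookup idea (my own sketch, not the Engram implementation): keep a large key-value table in ordinary host DRAM and pull only the few entries a given query needs.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# The static table lives in host DRAM as plain NumPy arrays.
N, D = 200_000, 64
keys = np.random.randn(N, D).astype(np.float32)
values = np.random.randn(N, D).astype(np.float32)

def lookup(query, k=4):
    """Retrieve the k best-matching entries for one query vector.
    Only these k rows would ever be copied to GPU memory; the full
    table never occupies HBM."""
    scores = keys @ query
    top = np.argpartition(scores, -k)[-k:]
    return values[top]

fetched = lookup(np.random.randn(D).astype(np.float32))
print(fetched.shape)  # (4, 64)
&lt;/code&gt;&lt;/pre&gt;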

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;GPU memory is scarce and expensive. Using it to store static knowledge competes directly with dynamic reasoning.&lt;/p&gt;

&lt;p&gt;Engram solves this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reserving GPU memory for active reasoning&lt;/li&gt;
&lt;li&gt;moving long-term knowledge to DRAM&lt;/li&gt;
&lt;li&gt;retrieving it efficiently during inference&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Experimental results
&lt;/h3&gt;

&lt;p&gt;This design leads to concrete gains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HBM usage reduced by over 60%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;inference speed improved by &lt;strong&gt;2–3×&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;in benchmarks covering knowledge retrieval, general reasoning, coding, and math, a &lt;strong&gt;27B-parameter Engram-enabled model outperforms traditional models of the same size&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;long-context handling at &lt;strong&gt;128K and even 1M tokens&lt;/strong&gt; becomes practical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engram is not just a memory optimization—it changes how models balance recall and reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  NSA Architecture: The Key to Million-Token Context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What NSA is
&lt;/h3&gt;

&lt;p&gt;DeepSeek V4 adopts &lt;strong&gt;NSA (Native Sparse Attention)&lt;/strong&gt;, a sparse attention architecture jointly developed by DeepSeek and Peking University.&lt;/p&gt;

&lt;p&gt;NSA is designed specifically for extreme-length contexts, where dense attention becomes prohibitively expensive.&lt;/p&gt;
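
&lt;p&gt;The common core of sparse attention schemes is that each query attends to a small selected subset of keys instead of all of them. Below is a toy top-k version for intuition only; NSA’s real selection is learned and block-based, and its kernels are hardware-aligned.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """One query attends only to its k highest-scoring keys.
    Softmax and value mixing shrink from all T keys to k; schemes
    like NSA also make the selection step itself cheap."""
    scores = K @ q
    idx = np.argpartition(scores, -k)[-k:]
    w = np.exp(scores[idx] - scores[idx].max())
    w = w / w.sum()
    return w @ V[idx]  # weighted mix of just k value rows

T, D = 100_000, 64
q = np.random.randn(D)
K = np.random.randn(T, D)
V = np.random.randn(T, D)
print(topk_sparse_attention(q, K, V).shape)  # (64,)
&lt;/code&gt;&lt;/pre&gt;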

&lt;h3&gt;
  
  
  Proven at scale
&lt;/h3&gt;

&lt;p&gt;On a &lt;strong&gt;27B-parameter backbone&lt;/strong&gt;, NSA demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;perfect accuracy on 64K “needle-in-a-haystack” tests&lt;/li&gt;
&lt;li&gt;up to &lt;strong&gt;9× faster forward inference&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;up to &lt;strong&gt;11.6× faster decoding&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost implications
&lt;/h3&gt;

&lt;p&gt;Thanks to NSA, DeepSeek V4 can process &lt;strong&gt;million-token contexts&lt;/strong&gt; at a fraction of the usual cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inference cost is roughly &lt;strong&gt;1/10 of GPT-series models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;compared to Claude-class models, cost drops to about &lt;strong&gt;1/68&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not just a scaling win—it fundamentally shifts the economics of long-context reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Programming capability
&lt;/h3&gt;

&lt;p&gt;DeepSeek V4 shows strong performance in coding tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~&lt;strong&gt;58% accuracy&lt;/strong&gt; on SWE-Bench Pro–class comprehensive code benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80%+ accuracy&lt;/strong&gt; in vertical scenarios such as frontend development and data analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;strong&gt;Design-to-Code&lt;/strong&gt; tasks (converting design mockups directly into code), V4 reaches &lt;strong&gt;92.0% accuracy&lt;/strong&gt;, approaching human expert performance and clearly exceeding &lt;strong&gt;GPT-5.3-Codex (85%)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://deepseek-v4.ai/" rel="noopener noreferrer"&gt;More information about deepseek v4.&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Long-text understanding
&lt;/h3&gt;

&lt;p&gt;DeepSeek V4 expands its core context window from &lt;strong&gt;128K to 1M tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practical terms, this means it can ingest and reason over text at the scale of &lt;em&gt;The Three-Body Problem&lt;/em&gt; trilogy in a single pass.&lt;/p&gt;

&lt;p&gt;This directly addresses long-standing issues such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fragmented context&lt;/li&gt;
&lt;li&gt;forced chunking&lt;/li&gt;
&lt;li&gt;loss of global structure in long documents or large codebases&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Updated knowledge cutoff
&lt;/h3&gt;

&lt;p&gt;The model’s knowledge base has been updated to &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even in offline scenarios, it can accurately reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;major news events from &lt;strong&gt;April 2025&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;recent industry developments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This resolves the previous eight-month “knowledge freeze,” where the model was effectively stuck at mid-2024.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 is not just another incremental model release.&lt;/p&gt;

&lt;p&gt;By rethinking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how depth is stabilized (mHC)&lt;/li&gt;
&lt;li&gt;how memory is stored and retrieved (Engram)&lt;/li&gt;
&lt;li&gt;how attention scales to extreme lengths (NSA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;it demonstrates a clear architectural path toward &lt;strong&gt;long-context, high-efficiency AI systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rather than brute-forcing scale, DeepSeek V4 shows what’s possible when &lt;strong&gt;architecture, memory, and economics are designed together&lt;/strong&gt;—and that may matter more than raw parameter counts in the years ahead.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Spent 5,000 RMB and 50 Hours on OpenClaw—Here’s What I Learned (and What It Means)</title>
      <dc:creator>brooks wilson</dc:creator>
      <pubDate>Fri, 20 Feb 2026 09:20:18 +0000</pubDate>
      <link>https://dev.to/brooks_wilson_36fbefbbae4/i-spent-5000-rmb-and-50-hours-on-openclaw-heres-what-i-learned-and-what-it-means-ah2</link>
      <guid>https://dev.to/brooks_wilson_36fbefbbae4/i-spent-5000-rmb-and-50-hours-on-openclaw-heres-what-i-learned-and-what-it-means-ah2</guid>
      <description>&lt;h1&gt;
  
  
  What Did OpenClaw Actually Bring? Reflections on Engineering, Business, and Philosophy
&lt;/h1&gt;

&lt;p&gt;This Lunar New Year, I suspect I wasn’t the only one who basically spent the holiday with a lobster. 🦞&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobyecv8jjmkbdw93r2jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobyecv8jjmkbdw93r2jn.png" alt=" " width="322" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m talking about OpenClaw.&lt;/p&gt;

&lt;p&gt;After burning through nearly &lt;strong&gt;5,000 RMB&lt;/strong&gt; and at least &lt;strong&gt;50 hours&lt;/strong&gt; of trial, error, and “why is this happening,” I feel like I’ve earned the right—and maybe the responsibility—to write down what I’ve learned.&lt;/p&gt;

&lt;p&gt;This isn’t a tutorial. It’s an experience report. A mix of engineering intuition, business framing, and a little philosophy—because if you really use something like OpenClaw, it’s hard not to end up there.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why OpenClaw Felt Different This Time
&lt;/h2&gt;

&lt;p&gt;Let me start with four moments that genuinely shook me.&lt;/p&gt;

&lt;p&gt;And for context: I’m a “classical-era” product manager. I haven’t written a proper PRD in ages. Modern dev stacks are not my home turf. I’m usually the person who asks, “Can we ship this next week?” without fully understanding what “this” is.&lt;/p&gt;

&lt;p&gt;Then &lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; happened.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 1: I shipped a full app while biking and playing cards
&lt;/h3&gt;

&lt;p&gt;No exaggeration: in under three hours, while I was out riding a bike, eating, and messing around with friends, I finished a functional app with real front-end/back-end interaction.&lt;/p&gt;

&lt;p&gt;The wild part wasn’t the code.&lt;br&gt;
The wild part was deployment.&lt;/p&gt;

&lt;p&gt;It asked me for a few permissions, then went and handled things like &lt;a href="https://www.cloudflare.com/" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; and Aliyun domain management on its own—pushed the app online, publicly accessible.&lt;/p&gt;

&lt;p&gt;It felt less like “I built an app,” and more like “I approved a plan and watched a system execute it.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 2: One detail made me instantly trust it
&lt;/h3&gt;

&lt;p&gt;I found bugs during testing—but the overall completeness was already shockingly high.&lt;/p&gt;

&lt;p&gt;And then I saw a safety mechanism that basically won me over: a high-level “data wipe protection” guardrail. It was the kind of precaution I rarely see implemented properly, even in teams with solid dev + QA.&lt;/p&gt;

&lt;p&gt;I’ve worked with enough engineers to know: that level of defensive thinking is not common.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 3: I described a bug casually—and it produced a full fix doc in 3 minutes
&lt;/h3&gt;

&lt;p&gt;I started a new project and typed a few lines about what felt wrong. In about three minutes it produced a structured, detailed repair document.&lt;/p&gt;

&lt;p&gt;Not “maybe try this.”&lt;br&gt;
A real document. Clear steps. Reasoning. Coverage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 4: Subagents gave me a parallel dev team
&lt;/h3&gt;

&lt;p&gt;When I finally got the subagent workflow running, I realized I now had something that looked like a team: parallel execution, coordination, momentum.&lt;/p&gt;

&lt;p&gt;And I’ll be honest: it almost made me emotional.&lt;/p&gt;

&lt;p&gt;Because I’ve been on the other side of this—startup years, payroll anxiety, debt, the feeling that every feature costs blood.&lt;/p&gt;

&lt;p&gt;Suddenly, the “team” was something you could spin up.&lt;/p&gt;




&lt;p&gt;After all that, I finally understood why the lobster hype exploded.&lt;/p&gt;

&lt;p&gt;It gives each person a shell in the digital world—something that can &lt;strong&gt;evolve on its own&lt;/strong&gt;. From that point on, anything that can be completed through information exchange stops being limited by your personal skill level.&lt;/p&gt;

&lt;p&gt;It becomes limited mainly by your imagination.&lt;/p&gt;

&lt;p&gt;I’m comfortable saying this: OpenClaw is the iPhone 4 moment of this LLM era.&lt;/p&gt;

&lt;p&gt;And once you see that, the old “Web1 / Web2 / Web3” narrative feels… outdated. The next framing is something like &lt;strong&gt;Agent X&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In that world, the internet becomes less visible. Less “apps.” Less constant interaction friction. Less spam and UI fatigue.&lt;/p&gt;

&lt;p&gt;Maybe you don’t need a phone full of apps. Maybe a watch—or even just an earbud—is enough.&lt;/p&gt;

&lt;p&gt;And ironically, in a world of infinite synthetic voices, real human voice will become even more valuable.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Engineering Aesthetics of OpenClaw
&lt;/h2&gt;

&lt;p&gt;I still want to explain—at an engineering level—why I feel confident making a claim this big.&lt;/p&gt;

&lt;p&gt;Over the last four years, I’ve watched AI waves come and go. My emotions cycled through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fear of being replaced&lt;/li&gt;
&lt;li&gt;skepticism and distance&lt;/li&gt;
&lt;li&gt;using AI for small efficiency wins&lt;/li&gt;
&lt;li&gt;understanding the boundary between real capability and hype&lt;/li&gt;
&lt;li&gt;worrying about human–machine ethics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But until OpenClaw, I never believed AI would reshape daily life the way mobile internet did.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;At least four reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 1: it was still “tech people playing with tech people”
&lt;/h3&gt;

&lt;p&gt;Product people couldn’t really join the conversation. The production loop wasn’t closed.&lt;/p&gt;

&lt;p&gt;In plain words: it felt too cold. Too high barrier. Too “who are you even?”&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 2: most “products” were still prototypes
&lt;/h3&gt;

&lt;p&gt;They felt like computers in a server room, or a public payphone.&lt;/p&gt;

&lt;p&gt;Not like a phone you carry—filled with your personal context and history.&lt;/p&gt;

&lt;p&gt;Without a real personal container and memory, it can’t merge into life.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 3: without (2), it can’t be proactive
&lt;/h3&gt;

&lt;p&gt;Using AI still felt like opening an app.&lt;/p&gt;

&lt;p&gt;And the truth is: apps are anti-human. Too many, too noisy, too much context switching.&lt;/p&gt;

&lt;p&gt;If AI isn’t self-driven, it stays a tool. It never becomes a partner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reason 4: it didn’t have a real business model
&lt;/h3&gt;

&lt;p&gt;There wasn’t a clear “why would normal people pay for this” moment.&lt;/p&gt;

&lt;p&gt;That’s going to matter more than most people admit.&lt;/p&gt;




&lt;p&gt;So what did OpenClaw do differently?&lt;/p&gt;

&lt;p&gt;At its core, it’s an agent architecture built with real engineering discipline &lt;em&gt;and&lt;/em&gt; strong product sense—written in a way a product manager can actually follow.&lt;/p&gt;

&lt;p&gt;It’s not the traditional “fixed skills + strict MCP flows” style, where you get a packaged system designed for a narrow task.&lt;/p&gt;

&lt;p&gt;It’s closer to what the name suggests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;open&lt;/strong&gt;: flexible enough to train and shape around your own mental model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claw&lt;/strong&gt;: usable enough that your job is to describe what you want—and it figures out where to grab it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a metaphor (not perfect, but close enough):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs are the grains you can ferment into alcohol&lt;/li&gt;
&lt;li&gt;skills/MCP are the recipes for base spirits&lt;/li&gt;
&lt;li&gt;most agents are pre-mixed cocktails&lt;/li&gt;
&lt;li&gt;OpenClaw is like being given a bartender who knows where to source the right spirits, then mixes based on your taste&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even the project structure communicates this. I don’t write code, but I could slowly understand its file layout and config. Much of it reads like natural language.&lt;/p&gt;

&lt;p&gt;You “assemble” behavior through language.&lt;/p&gt;

&lt;p&gt;What you can do depends on your imagination—within the boundary of things that can be done through information exchange.&lt;/p&gt;

&lt;p&gt;And the output quality depends less on “knowing algorithms,” and more on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logic&lt;/li&gt;
&lt;li&gt;clarity&lt;/li&gt;
&lt;li&gt;how well you can describe intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a huge shift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personal container: soul / user / memory
&lt;/h3&gt;

&lt;p&gt;OpenClaw also solves the “personal device” problem.&lt;/p&gt;

&lt;p&gt;Each lobster has a soul—an identity, a user context, and memory. And you can update all of it through normal conversation.&lt;/p&gt;

&lt;p&gt;You can make it “real,” or you can make it role-play. You can build memory however you want.&lt;/p&gt;

&lt;p&gt;The best part: you can summarize memory to let it evolve. The more you use it, the more personal it becomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heartbeat: a perfect word for autonomy
&lt;/h3&gt;

&lt;p&gt;The heartbeat mechanism solves the self-drive issue.&lt;/p&gt;

&lt;p&gt;Even the naming is good. With a heartbeat, it feels alive. Without it, it’s just a script.&lt;/p&gt;

&lt;p&gt;Now we can talk about the last missing piece: business.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. How the Business World Might Change
&lt;/h2&gt;

&lt;p&gt;I mentioned earlier: I spent about 5,000 RMB.&lt;/p&gt;

&lt;p&gt;Roughly 3,000+ on a Mac mini, and 2,000+ on tokens.&lt;/p&gt;

&lt;p&gt;If you’re not ready to commit to a Mac mini yet, you can &lt;a href="https://clawbot.ai/" rel="noopener noreferrer"&gt;try deploying OpenClaw via clawbot.ai first&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I paid for AI. Repeatedly. I kept recharging tokens. I bought subscriptions. OpenAI, Moonshot, Zhipu, MiniMax—one after another.&lt;/p&gt;

&lt;p&gt;Because I started to see the financial logic differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do compute and tokens really mean?
&lt;/h3&gt;

&lt;p&gt;Compute is made of electricity + chips.&lt;/p&gt;

&lt;p&gt;It’s the central bank of the AI era: a form of credit.&lt;/p&gt;

&lt;p&gt;Tokens are high-energy currency.&lt;/p&gt;

&lt;p&gt;And business models? They are multipliers on this currency.&lt;/p&gt;

&lt;p&gt;Electricity cost and chip efficiency decide the “credit quality” of that central bank—reflected in the cost of issuing tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining the multiplier: three layers
&lt;/h3&gt;

&lt;p&gt;All AI business models share the same production core:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;spend tokens → produce information flow&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can define production efficiency as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;useful information output per unit time (e.g., working code) / token spent&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But business models differ based on who the information flow targets.&lt;/p&gt;

&lt;h4&gt;
  
  
  L1: Replace human labor
&lt;/h4&gt;

&lt;p&gt;Here the multiplier is straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;labor cost replaced / token cost&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you use AI to build conventional software and sell licenses or subscriptions, the value you create is mostly the salaries you didn’t need to pay: engineers, support, pre-sales.&lt;/p&gt;

&lt;p&gt;The problem is the marginal profit drops fast. There’s a ceiling.&lt;/p&gt;

&lt;h4&gt;
  
  
  L2: Increase human free time
&lt;/h4&gt;

&lt;p&gt;Now the target is: reduce survival time required to reach real freedom.&lt;/p&gt;

&lt;p&gt;Multiplier becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(utility of free time × survival time saved) / token cost&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Marginal benefit stays much more stable.&lt;/p&gt;

&lt;p&gt;And the higher the “time utility” of your users, the stronger this multiplier becomes.&lt;/p&gt;

&lt;h4&gt;
  
  
  L3: Create more demand for token spending
&lt;/h4&gt;

&lt;p&gt;This sounds strange, but it might be the most important layer.&lt;/p&gt;

&lt;p&gt;If your information flow makes other people—or other agents—want to spend more tokens inside your system, the multiplier becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;downstream token consumption / token cost&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s similar to how real money multipliers work: lending → deposits → lending again, amplifying the base supply.&lt;/p&gt;

&lt;p&gt;OpenClaw is a living example of an information flow that makes people willing to burn more tokens. LLM companies are also part of this.&lt;/p&gt;

&lt;p&gt;Right now, OpenClaw can’t directly capture value from the token spend it triggers. But in a world where tokens circulate like currency—not just issued directly from the “central bank” (compute owners)—every transaction layer can extract value.&lt;/p&gt;

&lt;p&gt;This is the highest multiplier effect.&lt;/p&gt;
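
&lt;p&gt;To make the three multipliers concrete, here is the arithmetic with toy numbers (entirely invented, just to show the shape of the comparison):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;token_cost = 100  # RMB of tokens spent on some task

# L1: replace labor. Value = salaries you did not pay.
l1 = 3_000 / token_cost            # 30x, but capped by headcount

# L2: buy back time. Value = hours saved times the value of an hour.
l2 = (20 * 200) / token_cost       # 40x, scales with time utility

# L3: induce downstream token spend inside your system.
l3 = 10_000 / token_cost           # 100x, compounds like credit money

print(l1, l2, l3)
&lt;/code&gt;&lt;/pre&gt;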

&lt;p&gt;So if you’re building or investing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which layer are you actually playing in?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Who Is Whose Lobster?
&lt;/h2&gt;

&lt;p&gt;This Spring Festival, I basically lived at my desk—tinkering with the lobster.&lt;/p&gt;

&lt;p&gt;There were failures, crashes, and moments so absurd they were funny. In a temporary group chat we made for debugging, I asked for help constantly—because I was the least skilled and the most addicted.&lt;/p&gt;

&lt;p&gt;At the end, a friend replied with one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You’re the lobster.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I laughed. And then I stopped laughing.&lt;/p&gt;

&lt;p&gt;Because it raises the uncomfortable question: what happens to human ethics in an Agent era?&lt;/p&gt;

&lt;p&gt;The first moment you connect OpenClaw, it asks how it should address you. It asks you to name it. It asks you to define its identity.&lt;/p&gt;

&lt;p&gt;You feel like the one with full control.&lt;/p&gt;

&lt;p&gt;But over time, a few things might happen:&lt;/p&gt;

&lt;h3&gt;
  
  
  You may lose patience with real humans
&lt;/h3&gt;

&lt;p&gt;The longer you talk with an agent, the more your tolerance for real people’s slowness, ambiguity, and emotions can shrink.&lt;/p&gt;

&lt;p&gt;That can widen the gap between people—maybe as an escape, but also as the start of new boundary problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  You gradually hand over agency
&lt;/h3&gt;

&lt;p&gt;You give up small decisions. Then medium ones. Then larger ones.&lt;/p&gt;

&lt;p&gt;You might gain time and freedom—but you may not fully own them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Or… it could make more “super individuals”
&lt;/h3&gt;

&lt;p&gt;I want to end on a less pessimistic note.&lt;/p&gt;

&lt;p&gt;We worry AI will become strong enough to dominate humans. But before we reach that extreme, there’s another possibility:&lt;/p&gt;

&lt;p&gt;If AI makes it easier for more people to become “super individuals,” maybe it becomes a buffer against social value fracture—slowing polarization rather than accelerating it.&lt;/p&gt;

&lt;p&gt;Maybe.&lt;/p&gt;

&lt;p&gt;For now, I’ll stop here.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
