
guanjiawei

Posted on • Originally published at guanjiawei.ai

DeepSeek V4 Day: It's About Infra, Not the Model

The AI industry feels like New Year's today.

OpenAI dropped GPT-5.5 in the morning, and DeepSeek V4 went live in the afternoon. Add DeepMind's Vision Banana from a couple of days ago, plus Haoyu Cai's Anuttacon paper on digital humans. This week has crammed in more new stuff than the entire past quarter. Let me talk through the highlights.

DeepSeek V4

I took part in some internal testing before the V4 release. I had to keep quiet then, but the embargo has now lifted.

First, the specs. V4 launched in two versions: V4-Pro with 1.6T total params and 49B active; V4-Flash at 284B / 13B. Both pre-trained on 32T+ tokens, with 1M context as standard, open-sourced under MIT, supporting three inference modes: non-think / think high / think max. On API pricing, V4-Pro is $1.74 per million tokens input and $3.48 output; V4-Flash is cheaper by an order of magnitude, at $0.14 / $0.28.
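To make the price gap concrete, here is a back-of-envelope cost comparison using the per-million-token rates quoted above. The workload size (10M input / 2M output tokens) is a made-up example, not anything from DeepSeek:

```python
# Illustrative API cost comparison from the listed V4 prices.
# The workload (10M input / 2M output tokens) is hypothetical.

PRICES = {  # (input, output) in USD per 1M tokens
    "V4-Pro":   (1.74, 3.48),
    "V4-Flash": (0.14, 0.28),
}

def monthly_cost(model, input_tokens, output_tokens):
    """Total USD cost for a given token volume."""
    inp, out = PRICES[model]
    return inp * input_tokens / 1e6 + out * output_tokens / 1e6

for model in PRICES:
    cost = monthly_cost(model, input_tokens=10_000_000, output_tokens=2_000_000)
    print(f"{model}: ${cost:.2f}")
```

At these rates the Flash tier comes out roughly 12x cheaper on the same mix, which is where the "order of magnitude" framing comes from.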

On capabilities, the team itself admits that V4 sits roughly around the Opus 4.6 tier, perhaps even slightly weaker. Not at the absolute top.

This is the same pattern as last time with R1 versus O3. Close, but not at the frontier.

So where does DeepSeek's significance lie? I've always felt it's not a company that crushes competitors with its model; it's a company that leads competitors through Infra. And Infra doesn't follow the model—it moves ahead of it.

The Infra in this V4 release is a disaster for every inference company on the market. That word is not an exaggeration.

Native FP4. V4 is an FP8 + FP4 mixed-precision model: MoE experts use FP4, the rest FP8. Right now, most chips and most inference stacks either don't support FP4 or support it very poorly.
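To see why FP4 support is non-trivial, here is a minimal sketch of block-wise quantization onto the standard FP4 (E2M1) value grid. The grid itself is the real E2M1 magnitude set; the shared per-block scale scheme is a simplification for illustration, not DeepSeek's actual recipe:

```python
# Minimal sketch: block-wise quantization to the FP4 (E2M1) grid.
# The per-block scaling below is a simplification, not V4's recipe.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(weights):
    """Quantize one block of weights to FP4 with a shared scale."""
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 6.0                      # map the block onto [-6, 6]
    def q(w):
        # snap |w|/scale to the nearest representable magnitude
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        return (mag if w >= 0 else -mag) * scale
    return [q(w) for w in weights], scale

block = [0.03, -0.11, 0.47, -0.92, 0.005, 0.31]
dq, scale = quantize_block(block)
print([round(x, 3) for x in dq])
```

Even this toy version shows the catch: every block carries a scale, small weights collapse to zero, and hardware has to do the grid lookup natively to be fast, which is exactly what most current chips and inference stacks do poorly.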

Operators sliced extremely fine. Running V4 inference, I found an enormous amount of model-specific optimization at the operator level; mainstream open-source engines basically can't match the official performance. To catch up to its price-performance ratio, you'd have to grind through the low-level compiler and kernel stack line by line.

Single machines struggle. The previous generation could at least run on a single machine; this generation can't even do that, so you're into multi-node cluster territory from day one. Without running the official stack, there's basically no way to hit that price point.

This reminds me of when DeepSeek disclosed a 545% theoretical gross margin during the V3/R1 era. Meaning if you run strictly on their architecture, margins can be extremely high; at the same time, all the replica inference services were losing money. V4 is a more radical version of that story.

A side story: the company's Infra lead chatted with us and warned, quite seriously, that technology is changing so fast that some architectures in the previous generation may be transitional; the next generation might drop them entirely. If you pour heavy investment into adapting your Infra to them, it could all go down the drain when the next generation lands.

There's a fundamental divergence behind this. Most model companies build the model first and leave Infra for later; DeepSeek puts Infra first, using bottom-level innovation to reverse-engineer the model's economics. Both can survive. But if you really want to serve consumers at scale without thinking through Infra, you'll definitely crash. DeepSeek itself stumbled during its first viral moment—the web page went down, the API dropped. And that was when its Infra was already relatively solid.

The Model Itself

From hands-on experience:

  • Chinese text capability remains its strong suit. For tasks like Chinese writing, report generation, and content organization, it's a solid pick.
  • Tool calling feels pretty good, somewhat Claude-like.
  • It's not at the absolute top tier. It doesn't reach the level of GPT-5.5 or Opus 4.7.
  • No coding plan for now. Probably not coming anytime soon. A real shame.

1 Million Context as the Default

This might be the most noteworthy thing about V4. One million tokens of context, made the default for all online services, with no segmented price hikes.

What does 1 million tokens mean? Roughly 2 million Chinese characters—an entire web novel serialized over one or two years can fit inside.

Everyone wanted to do this before, but those who actually pulled it off either priced it separately or shut it down after a short while. Anthropic opened its million-tier last year then pulled it back, only re-releasing recently; OpenAI still hasn't officially opened the million-tier API. It's not a capability problem—Infra can't handle the load.

V4 can make this standard at no extra cost because it performed major surgery on attention. It introduces two structures used in alternation: CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention). CSA first compresses KV 4× along the sequence dimension, then applies sparse attention to pick the most relevant tokens (top-1024 for V4-Pro, top-512 for V4-Flash), paired with a 128-token sliding window to preserve local context. HCA compresses even more aggressively (128× compression ratio), but applies dense attention on the compressed representation, effectively leaving a low-resolution "global summary" for some layers. The two layer types are interleaved throughout the network: some do precise look-up, others do fuzzy global attention. Additionally, a layer of Manifold-Constrained Hyper-Connections (mHC) is stacked on top to stabilize cross-layer signal propagation.
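The CSA token-selection idea described above can be sketched in a few lines. This is a toy version of the mechanism, not V4's implementation: the dimensions, pooling factor, top-k, and window size are all illustrative placeholders:

```python
import numpy as np

# Toy sketch of the CSA selection idea: 1) compress the KV sequence
# 4x by mean pooling, 2) score the query against the pooled blocks,
# 3) attend only to tokens from the top-k blocks plus a local
# sliding window. All sizes here are illustrative, not V4's.

rng = np.random.default_rng(0)
seq_len, d, pool, k, window = 64, 16, 4, 4, 8

keys = rng.standard_normal((seq_len, d))
query = rng.standard_normal(d)

# 1) 4x compression along the sequence dimension (mean pooling)
blocks = keys.reshape(seq_len // pool, pool, d).mean(axis=1)

# 2) score the query against compressed blocks, keep the top-k
block_scores = blocks @ query
top_blocks = np.argsort(block_scores)[-k:]

# 3) expand selected blocks back to token indices + sliding window
selected = {t for b in top_blocks for t in range(b * pool, (b + 1) * pool)}
selected |= set(range(seq_len - window, seq_len))   # local context

print(f"attending to {len(selected)} of {seq_len} tokens")
```

The payoff is that the expensive dense attention only ever runs over the selected subset, so per-token cost stops scaling with the full context length; HCA then covers the rest of the sequence at low resolution.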

The official efficiency numbers: at 1M context, V4-Pro needs only 27% of the per-token inference FLOPs of V3.2, and 10% of the KV cache. Making the million-token tier standard without price hikes is built on this foundation.
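A rough sizing exercise shows why the KV-cache reduction is the load-bearing number at 1M context. Every architectural figure below (layer count, KV heads, head dimension, FP16 storage) is a hypothetical placeholder; only the 10% ratio comes from the post:

```python
# Rough KV-cache sizing at 1M context. The layer/head/dim numbers
# are hypothetical placeholders; only the 10% ratio is from the post.

def kv_cache_gib(seq_len, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes for K and V across all layers, converted to GiB."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

baseline = kv_cache_gib(1_000_000, layers=61, kv_heads=8, head_dim=128,
                        bytes_per_elem=2)
print(f"dense FP16 KV cache at 1M tokens: {baseline:.1f} GiB")
print(f"at 10% of that:                   {baseline * 0.10:.1f} GiB")
```

With placeholder dimensions like these, a dense cache at 1M tokens lands in the hundreds of GiB per sequence, i.e. it doesn't fit a single accelerator; cutting it to 10% is the difference between "priced tier" and "free default."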

Kimi was the earliest to push in this direction back in '23, betting that million-token context could cover most scenarios. Three years later, this has finally become an infrastructure-level default capability.

Day-0 Adaptation for Domestic Chips

This time V4 achieved day-0 deep adaptation for domestic chips like Huawei Ascend. I think that shows real vision.

Perpetually relying on overseas chips for training and inference isn't a technical problem; it's a risk problem. V4 thinking through domestic chip adaptation on day 0 is more meaningful than the model itself.

Other Releases This Week

Vision Banana: Generative Models Actually "Understand" Images

DeepMind released something called Vision Banana these past couple of days. The approach: take Nano Banana Pro, a text-to-image model, do a round of instruction tuning, and have it tackle traditional vision tasks like segmentation, depth estimation, and normal estimation.

The results match or even beat specialized models like Segment Anything and Depth Anything, without losing image generation capability.

This is quite interesting. Text-to-image models already possess an intrinsic understanding of images; it's just that no one knew how to "query" that understanding before. Now image understanding and generation are unified under the same interface: all tasks are solved via image-to-image.

Following this line of thinking, generative models naturally lead to "world models." The dimensions of 2D, 3D, video, and physics may all be folded into a single model.

Haoyu Cai's LPM 1.0: Digital Humans Can Finally "Listen"

On April 10, Haoyu Cai—former founder of miHoYo (the studio behind Genshin Impact)—published a paper on arXiv through his new company Anuttacon. LPM 1.0 is a 1.7-billion-parameter diffusion Transformer for "performance generation" of video characters.

Digital humans are an exhausted topic. But this paper defines two problems that no one had seriously tackled before: persistent identity consistency and interaction while listening.

Identity consistency isn't just about stable appearance. It means the character's reactions across different scenes must conform to the same "personality"—you shouldn't suddenly feel like "this isn't the same person."

Previous digital humans were output-oriented: make it speak, make it move, and those are doing fine now. The truly hard part is listening. When you talk to it, it needs to give you facial feedback, micro-gestures, breathing rhythm—making you feel like there's a living person on the other side. In real life, when you talk to someone, they don't wait expressionlessly for you to finish before responding; they're feeding back the entire time. The volume of this feedback is enormous, and almost no one had done it before.

The paper is out, but the model isn't open-sourced. Because the more realistic it gets, the more human-like it becomes, and the fraud risk is too high. I think that's the right call.

The Opus 4.7 Mess: An Infra Failure

Anthropic has taken a lot of heat over the past couple of weeks. On April 23 the official postmortem identified three overlapping issues:

  1. On March 4, default reasoning effort was lowered from high to medium to improve UI latency, at the cost of noticeably lower intelligence for Sonnet 4.6 and Opus 4.6.
  2. On March 26, a feature to clear idle session thinking was launched; a bug caused it to clear every round, making the model forgetful and repetitive.
  3. On April 16, a system prompt was added to limit response length, causing Opus 4.7 coding quality to drop 3%.

All three were Infra-layer mistakes. Infra accounts for an ever-larger share of whether a model service is actually usable.

A Company That Didn't Release a Top Model, Yet Still Dominates Trending

DeepSeek went a long time without a new model release, and the intermediate generations barely made a ripple. Today it dropped, and it took several trending spots.

I see this as strategic strength. It did several things that are highly representative right now: making million-token context standard, pushing FP4 to production, and getting day-0 domestic chip adaptation working. These are all hardcore Infra feats.

You also have to admit that standing alone at the top is hard in the current landscape. Kimi K2.6, GLM 5.1, MiniMax's new models—the entire open-source camp's water level has risen. It's not like the V3/R1 era when it monopolized the open-source high ground.

Repeating That Conclusion

A friend came to me this morning. His company wants to transform, wants to "buy an AI product" to help drive team change. A small financial-industry firm.

My exact words to him were: don't rush into talking about transformation. First make everyone in the company a heavy user of coding agents, then talk about organizational change.

Then I cast my desktop to show him how I use Claude Code every day, how many agent threads are running simultaneously on my screen. His first reaction after watching was to immediately go subscribe to a coding plan.

That reaction is the right one. The best investment in this era is buying a coding plan and using it every day. Not the "used ChatGPT a few times" kind of usage, but really letting agents into your daily workflow. Without that foundation, organizational-level change is a castle in the air.

2026 will definitely go down in the books. Not because of any single model, but because of density—three models and a paper can drop in a single day. If you can feel this rhythm, you're already in the arena.


Originally published at https://guanjiawei.ai/en/blog/deepseek-v4-infra-matters
