On April 8, a pony suddenly appeared on the Artificial Analysis Video Arena leaderboard.
An anonymous model called HappyHorse scored an Elo of 1333 for text-to-video and 1391 for image-to-video, breaking records in both categories and knocking ByteDance's SEEDANCE 2.0 off the top spot. SEEDANCE 2.0 had been released just this March, ranking first above competitors like Google Veo 3, OpenAI Sora 2, and Runway Gen-4.5. Then a little horse came along and dethroned it.
The Tradition of Anonymous Benchmarking
This kind of anonymous leaderboard climbing has become something of a recurring performance in the Chinese AI scene.
In February this year, an anonymous language model called Pony Alpha appeared on OpenRouter: free to use, with a 200k context window, processing 40 billion tokens on its first day. Five days later, Zhipu AI announced that Pony Alpha was actually GLM-5, a 745B-parameter MoE model. In March came Hunter Alpha; the community speculated it was DeepSeek V4, but Xiaomi stepped forward to claim it as MiMo-V2-Pro, a trillion-parameter model.
The benefits of anonymous benchmarking are straightforward: obtaining real blind test data without brand halo or baggage. Benchmark scores can be gamed; blind tests cannot.
HappyHorse follows the same playbook. However, one detail quickly gave it away—Chinese and Cantonese appeared at the top of its supported languages list.
The Domain Wars
With the model going viral, domains naturally became targets.
When I went looking for HappyHorse's official website, I found something amusing: both happyhorse.io and happyhorse.com had already been registered, with full websites built and billing switched on. Click through and you get a complete suite of services (text-to-image, text-to-music, text-to-video), quite an impressive setup. Look closely, though, and they aren't using the HappyHorse model at all. Running in the backend is Lightricks' LTX, an open-source model from an Israeli company with only 2 billion parameters in its original form. I had tested it before; it's nothing like the HappyHorse that topped the leaderboard.
Domain squatting happens faster than model training. But if someone unaware of the situation pays money thinking they're using that chart-topping HappyHorse, that's quite a scam.
It doesn't stop at domains. Several HappyHorse-related repositories also popped up on HuggingFace (happyhorse-lab, happyhorseai, HappyHorseOrg), all looking official. Check the creation dates, though, and every one of them was registered on April 9. Inside, you find either a lone README or an empty repository. The READMEs are polished, mentioning "open source" and "number one," but there are no weight files. Riding the hype wave is no longer limited to domain squatting; even HuggingFace namespaces get squatted.
The Mystery Remains Unsolved
As of this writing, HappyHorse's origin remains undetermined.
The model page on Artificial Analysis still reads "More details coming soon," with the placeholder image used for mystery models. The leaderboard records its scores but gives no team background. No technical report, official GitHub repository, company announcement, or paper page has surfaced to confirm its identity.
The most convincing inference currently points to Sand.ai. The technical descriptions circulating for HappyHorse—15B parameters, 40-layer single-stream Transformer, joint text-video-audio modeling, 8-step DMD-2 distillation, multilingual lip-sync—closely match daVinci-MagiHuman, jointly released by Sand.ai and SII-GAIR. Reports from 36Kr also point in this direction. But so far, this remains inference, not official confirmation.
The claim of being "already open sourced" also warrants skepticism. Artificial Analysis marks models with open weights using the Open Weights label on the leaderboard; HappyHorse currently lacks this designation. The current leading open-source video models remain in the LTX-2 Pro tier. Online articles claiming HappyHorse has been fully released under Apache 2.0 currently do not match any verifiable weight releases.
Around the same time, Alibaba released Wanxiang 2.7, a 27B-parameter MoE model (14B active) that supports a "thinking mode." However, Wanxiang 2.7 currently offers only API access; the weights have not been published. Previous Wanxiang releases were open-sourced immediately; why that changed this time is unclear.
The Hard Demand for Text-to-Video Privatization
HappyHorse's identity will be revealed sooner or later. But what interests me more isn't who built it, but the privatization logic of text-to-video models.
Every model category ignites a hardware category. When DeepSeek emerged, H20 orders exploded—Chinese companies placed over $16 billion in orders in Q1 2025 alone. After open-source language models gained traction, DeepSeek V3 running on a cluster of eight M4 Pro Mac Minis caused Mac Minis to sell out.
What will text-to-video ignite? I believe the answer is consumer-grade GPUs and small inference boxes. Moreover, text-to-video has a stronger hard demand for privatization than language models.
Latency-Insensitive, Cost-Sensitive
Text-to-video is naturally a "can wait" scenario. Generating a video in the cloud takes several minutes regardless. Running it locally a bit slower, say ten minutes or even half an hour, makes little practical difference. You won't sit there staring at the progress bar; you'll go do something else in the meantime.
When latency isn't sensitive, cost becomes sensitive. Beyond compute, cloud video has one major expense that's easy to overlook: bandwidth. Videos run from dozens to hundreds of megabytes, and moving them around racks up frightening network fees. I recently ran the numbers: the servers themselves aren't that expensive, but the bandwidth bill is something you don't want to look at twice. Deploy locally, and that line item disappears.
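To make that concrete, here's a back-of-the-envelope sketch. Every number in it (clip size, daily volume, egress price) is an illustrative assumption, not a quote from any provider:

```python
# Back-of-the-envelope cloud egress cost for serving generated videos.
# Every number here is an illustrative assumption, not a provider quote.

video_size_gb = 0.15        # ~150 MB per clip
videos_per_day = 10_000     # daily generation volume
downloads_per_video = 3     # previews, re-downloads, re-encodes
egress_usd_per_gb = 0.08    # a typical cloud egress ballpark

monthly_gb = video_size_gb * videos_per_day * downloads_per_video * 30
print(f"Monthly egress: {monthly_gb:,.0f} GB")                         # 135,000 GB
print(f"Monthly egress bill: ${monthly_gb * egress_usd_per_gb:,.0f}")  # ~$10,800
# Deploy locally and this entire line item goes to zero.
```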
Latency insensitivity also leads to another conclusion: you don't need top-tier compute. Language model inference chases low latency and demands the best cards available. Text-to-video is different; a bit slower is acceptable. That makes "not fast enough but cheap enough" compute (gaming GPUs, previous-generation datacenter cards) a highly cost-effective choice.
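The same point as a tiny sketch: once nobody is waiting, the metric that matters is cost per video, not videos per hour. The hourly rates and generation times below are hypothetical, purely for illustration:

```python
# Cost per generated video: fast datacenter GPU vs. slower consumer GPU.
# Hourly rates and generation times are hypothetical, for illustration.

def cost_per_video(gpu_usd_per_hour: float, minutes_per_video: float) -> float:
    """Cost of one video at a given hourly rate and generation time."""
    return gpu_usd_per_hour * minutes_per_video / 60

datacenter = cost_per_video(4.00, 3)   # fast and pricey
consumer = cost_per_video(0.40, 12)    # 4x slower, 10x cheaper per hour

print(f"Datacenter card: ${datacenter:.2f}/video")  # $0.20
print(f"Consumer card:   ${consumer:.2f}/video")    # $0.08
# When nobody is watching the progress bar, only cost per video matters.
```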
The Dilemma of Regulations and Content Moderation
This is something many haven't considered.
Text content moderation is manageable; most scenarios won't encounter legal issues. Images and videos are different—IP infringement, portrait rights, sensitive content—regulations haven't fully settled, leaving cloud service providers caught in a difficult position.
Cloud services face a dilemma: don't block, and you're liable if something happens; block, and you can't achieve precise technical filtering, resulting in massive collateral damage. The result is a "better to over-censor than under-censor" approach—no portrait uploads allowed, no specific IP references permitted, immediate blocking at the slightest detection of possible sensitivity. The user experience becomes heavily restricted.
Local deployment avoids these problems. The model runs on your own machine, bypassing third-party moderation. During the Stable Diffusion era, massive text-to-image workflows ran locally not because local was faster, but because there were no moderation restrictions. Text-to-video will follow the same pattern.
Shorter Path to Monetization
The value of language models has always been hard to quantify. A better model writes a better paragraph; how much revenue does that generate? Unclear. Upgrade from a 32B model to a private deployment of a hundreds-of-billions-parameter model, spending ten times as much on H20s: will you earn ten times more? Nobody can say. Coding workloads have improved this somewhat, but before they arrived, nobody could really make the equation balance.
Text-to-video is completely different. A good video is good traffic; traffic is money. Spend a few hundred to generate a decent-quality video—if the content is interesting, the traffic generated might be worth thousands or even tens of thousands. Everyone can calculate this equation.
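Written out as a sketch, with deliberately round, made-up numbers:

```python
# The text-to-video ROI equation, with deliberately round made-up numbers.
cost_per_clip = 300       # spend to produce one decent clip
hit_rate = 0.2            # fraction of clips that actually take off
value_per_hit = 10_000    # what a clip that takes off brings in

expected_return = hit_rate * value_per_hit
print(f"Spend {cost_per_clip}, expect {expected_return:.0f} back")
# 300 in, 2,000 expected out. Language models never had math this legible.
```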
SEEDANCE 2.0 is an example. Creators are willing to pay and queue for resources because videos produced with it genuinely achieve better metrics. The gap between good and bad models becomes visible after posting just a few videos.
Hardware Chain Reaction
Whether HappyHorse will open source, and when, remains uncertain. But we can already run the numbers.
Going by the rumored 15B parameters, the FP16 weights alone take roughly 30GB of VRAM, while INT8 quantization needs only around 15GB. A single RTX 4090 or 5090 could handle it. A small box like the DGX Spark, with 128GB of unified memory, would be even more comfortable, running inference with room to spare.
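A quick sanity check on those figures. Keep in mind the parameter count is still only a rumor, and real inference needs headroom for activations and caches beyond the weights:

```python
# VRAM needed just to hold the weights of a 15B-parameter model
# at different precisions (decimal GB). The parameter count is a rumor;
# real inference adds activation and cache overhead on top of this.

PARAMS = 15e9

for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")

# FP16: ~30 GB -> a 32 GB RTX 5090, or split across two consumer cards
# INT8: ~15 GB -> fits on a single RTX 4090
# INT4: ~8 GB  -> fits mid-range gaming GPUs
```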
If it actually open-sources at this scale, RTX 4090/5090 cards will likely become even harder to acquire. The DGX Spark's price has already risen from the initially announced $3,000 in 2025 to $4,699—an increase of over 50%—with supply already tight. Adding another VRAM-hungry workload to the mix will only make the situation more extreme.
We've seen this script play out several times before. DeepSeek ignited H20 demand; open-source LLMs moved Mac Minis. Text-to-video quality is already there; all that's missing is a good enough open-source model for it to land on local hardware. Whether HappyHorse seizes that opportunity remains to be seen, but someone will, sooner or later.
Unresolved
Returning to HappyHorse itself.
Will it officially open source? It's impossible to tell right now. The leaderboard scores are there, but weights and code have not materialized. If it ultimately only offers an API service, the impact on the hardware market will be limited—just another powerful closed-source model.
How large is it actually? The marketing page claims 15B; if true, a single consumer GPU can run it. But if it's actually larger, requiring multi-GPU setups or even clusters, then local deployment becomes unrealistic, and we're back to the cloud provider model.
Different answers to these two questions lead to completely different storylines. But regardless of how HappyHorse turns out, the trend of text-to-video moving local won't change. Tools like ComfyUI and WebUI are waiting for a good enough open-source model; the quantization community is waiting too. Once it arrives, the consumer hardware side will get lively.
Originally published at https://guanjiawei.ai/en/blog/happy-horse-video-privatization