HappyHorse-1.0: What We Actually Know About the Model That Topped Artificial Analysis' Video Arena
An unfamiliar model called HappyHorse-1.0 is currently sitting at #1 on Artificial Analysis' Video Arena, the blind user-voted benchmark widely used to evaluate AI video generation systems. This post summarizes what's verifiable from public sources and what remains unconfirmed, because the gap between those two categories is larger than usual for a model at this rank.
What's on the leaderboard
From Artificial Analysis' public text-to-video (no audio) leaderboard, as of April 8, 2026:
| Rank | Model | Creator | Elo | 95% CI | Samples |
|------|-------|---------|-----|--------|---------|
| 1 | HappyHorse-1.0 | HappyHorse | 1,355 | ±11 | 5,062 |
| 2 | Dreamina Seedance 2.0 720p | ByteDance Seed | 1,273 | ±8 | 8,130 |
| 3 | SkyReels V4 | Skywork AI | 1,245 | ±9 | 5,712 |
| 4 | Kling 3.0 1080p (Pro) | KlingAI | 1,242 | ±9 | 5,262 |
| 5 | Kling 3.0 Omni 1080p (Pro) | KlingAI | 1,230 | ±10 | 4,776 |
Three observations worth pulling out:
The gap is statistically clean. An 82-point Elo lead over #2 is not within the noise floor of a preference-based arena. HappyHorse-1.0's confidence interval (1,344–1,366) doesn't overlap with Seedance 2.0's (1,265–1,281). That's a clean separation, not a coin flip; a quick numeric check follows these observations.
The sample size is real. At 5,062 blind matchups, HappyHorse-1.0's vote count is on the same order of magnitude as the #3 and #4 entries, which means the Elo isn't riding on a lucky early streak; it has held steady across thousands of votes.
API status is "Coming soon." The row on the leaderboard lists API availability as pending. The model is generating output on the arena but is not yet broadly available for production use.
What the model claims about itself
Here's where I want to be careful, because the information below comes from sites associated with the project (primarily happyhorse-ai.com and happyhorses.io) and has not been independently verified by any third party as of this writing.
According to these sources, HappyHorse-1.0 is described as:
- A 15B-parameter unified transformer (the parameter count appears on secondary documentation, not on Artificial Analysis itself).
- A 40-layer self-attention architecture with no cross-attention: the first and last 4 layers use modality-specific projections, while the middle 32 layers are shared across text, video, and audio tokens (a code sketch of this layout follows the list).
- Trained to run inference in 8 denoising steps without classifier-free guidance (CFG), via a DMD-2 distillation recipe.
- Capable of generating a 5-second 1080p clip in ~38 seconds on a single H100 (self-reported).
- Native joint audio-video generation across 6 languages (English, Mandarin, Japanese, Korean, German, French; a secondary site lists Cantonese as a 7th).
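To make the claimed layout concrete, here is a minimal PyTorch sketch of one plausible reading: per-modality layers at the entry and exit of the stack, a shared 32-layer trunk over the concatenated token sequence, and an 8-step CFG-free sampling loop. Everything in it (class names, hidden size, token counts, the re-noising schedule) is hypothetical; no weights or reference code are public, so treat this as an illustration of the described design, not HappyHorse's implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm self-attention + MLP layer (no cross-attention)."""
    def __init__(self, d: int, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class UnifiedTransformer(nn.Module):
    """4 modality-specific entry layers, 32 shared layers, 4 modality-specific
    exit layers: one plausible reading of the self-reported 40-layer design."""
    MODALITIES = ("text", "video", "audio")

    # d is kept small for the sketch; the real hidden size is unknown.
    def __init__(self, d: int = 256, n_outer: int = 4, n_shared: int = 32):
        super().__init__()
        def per_modality():
            return nn.ModuleDict(
                {m: nn.Sequential(*[Block(d) for _ in range(n_outer)])
                 for m in self.MODALITIES})
        self.entry = per_modality()
        self.trunk = nn.Sequential(*[Block(d) for _ in range(n_shared)])
        self.exit = per_modality()

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Modality-specific entry layers, then one joint sequence through the
        # shared trunk, then modality-specific exit layers.
        parts = {m: self.entry[m](x) for m, x in tokens.items()}
        joint = self.trunk(torch.cat(list(parts.values()), dim=1))
        out, i = {}, 0
        for m, x in parts.items():
            out[m] = self.exit[m](joint[:, i:i + x.shape[1]])
            i += x.shape[1]
        return out

def sample_video(model, text_tokens, d: int = 256, steps: int = 8):
    """8 forward passes, no CFG: distilled few-step models run a single
    conditional pass per step instead of the usual cond + uncond pair."""
    x = torch.randn(1, 64, d)  # noisy video latents (hypothetical shape)
    sigmas = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        denoised = model({"text": text_tokens, "video": x})["video"]
        # Re-noise toward the next, lower noise level: a generic few-step
        # schedule, not HappyHorse's actual DMD-2 recipe.
        x = denoised + sigmas[i + 1] * torch.randn_like(denoised)
    return x

model = UnifiedTransformer()
prompt = torch.randn(1, 16, 256)  # stand-in for encoded prompt tokens
clip_latents = sample_video(model, prompt)
```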
If these numbers are accurate, the architecture would represent a fairly aggressive bet on unified multimodal transformers over the multi-stream cross-attention approaches that most current video models use. It would also place HappyHorse-1.0 in the same design family as Meta's Transfusion line of research, though there is no direct connection established between the projects.
None of these claims can be independently verified right now. The GitHub and HuggingFace links referenced on the project's own sites currently point to "coming soon" placeholders. No weights, no reproducible demo outside the arena, no third-party benchmark of inference speed or memory footprint.
Who built it
As of April 8, no team or organization has officially claimed HappyHorse-1.0. The most widely discussed attribution in the Chinese tech press, now circulating in English AI circles, links the model to a new team reportedly led by Zhang Di — the former VP at Kuaishou who led the Kling video generation effort, and who reportedly joined Alibaba in late 2025 to run the Future Life Lab inside the Taotian Group.
I want to stress: this is the most credible theory currently in circulation, but it is not confirmed. Alibaba has not commented. No one publicly associated with HappyHorse has confirmed or denied it. Other community speculation has pointed to alternative origins. If you're making engineering or editorial decisions based on the attribution, wait for official confirmation.
What this means if you evaluate video models
If you benchmark video models before integrating them into a pipeline, the honest summary is:
- The leaderboard result is real. Blind user preferences, 5,000+ matchups, clean confidence intervals. That's not marketing; that's what the arena is designed to measure.
- Everything else is not yet real for you. No weights, no API, no reproducible local run. You can't currently fine-tune it, can't self-host it, can't measure its latency on your own hardware, can't verify the claimed architecture.
- The "what" is known. The "how" and "by whom" are not.
That combination is unusual at the top of the leaderboard. Most models at this rank come with a paper, a model card, a team announcement, and at least an API. HappyHorse-1.0 currently has a leaderboard row and a set of unverifiable claims. That may change quickly — the project sites describe an imminent broader release — or it may not.
Sources
- Artificial Analysis Video Arena (live leaderboard): https://artificialanalysis.ai/video/leaderboard/text-to-video
- HappyHorse-1.0 public testing interface and current technical spec: https://happyhorses.io
- Chinese-language tech media reporting on the Zhang Di / Future Life Lab attribution, published across several outlets as of April 7–8, 2026
Leaderboard rankings are dynamic and may shift as new votes and new models are added.
