# Open Source Transformer Video Models: Architecture, Licenses, and Benchmarks
The video generation field is undergoing an architectural shift. HappyHorse-1.0's debut at #1 on the Artificial Analysis leaderboard with a pure self-attention Transformer (no diffusion backbone, no cross-attention) lends credibility to an approach many researchers considered too simple to compete on video quality.
## The Architectural Divide
### Multi-Stream Diffusion (Traditional)
Most established video models use multi-stream architectures in which text, video, and audio each have a dedicated encoder branch. The branches interact through cross-attention layers, which adds parameters and lengthens the inference path.
Examples: Stable Video Diffusion, early Kling versions, Hunyuan Video
Pros: Modular; each stream can be optimized independently
Cons: Parameter redundancy, longer inference paths, fragmented audio-visual alignment
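The cross-attention pattern that links these streams can be sketched in a few lines. This is a minimal NumPy illustration, not code from any of the models above: token counts and widths are toy values, and the learned Q/K/V projection matrices a real block would apply are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    """Video-stream queries attend to text-stream keys/values.

    Learned projections are omitted for brevity; a real block would
    project both inputs before the dot product.
    """
    scores = queries @ context.T / np.sqrt(d_k)  # (n_video, n_text)
    return softmax(scores) @ context             # (n_video, d)

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(16, 64))  # toy video stream
text_tokens = rng.normal(size=(8, 64))    # toy text stream
fused = cross_attention(video_tokens, text_tokens, d_k=64)
```

Note that every such bridge between streams carries its own weights, which is where the parameter redundancy mentioned above comes from.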
### Single-Stream Transformer (Emerging)
HappyHorse-1.0 represents the single-stream approach: all modalities — text, video, audio — are tokenized into a single sequence and processed through shared self-attention layers.
Claimed specs: 40 layers total, with 4 modality-specific layers at each end and 32 shared layers in the middle
Pros: Higher parameter efficiency, shorter inference path, native audio-video synchronization
Cons: Harder to train (all modalities must be learned jointly), single failure point
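The single-stream idea can be sketched as concatenating all modality tokens into one sequence and running it through shared self-attention, so every token attends across modality boundaries. Layer count, token counts, and width below are toy values, not HappyHorse's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k):
    # every token attends to every other token, regardless of modality
    scores = tokens @ tokens.T / np.sqrt(d_k)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))    # toy text tokens
video = rng.normal(size=(16, 64))  # toy video tokens
audio = rng.normal(size=(4, 64))   # toy audio tokens

# single stream: one concatenated sequence through shared layers
seq = np.concatenate([text, video, audio], axis=0)  # (28, 64)
for _ in range(3):  # stand-in for the shared middle layers
    seq = self_attention(seq, d_k=64)
```

Because audio and video tokens sit in the same attention window, their alignment is learned directly rather than stitched together across separate branches.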
### Diffusion Transformers (DiT)
A middle ground: using Transformer blocks as the backbone of a diffusion process. Kling 3.0 and FLUX use this approach.
Pros: Combines Transformer scalability with proven diffusion training
Cons: Still requires many denoising steps (typically 20-50)
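The denoising-step cost compounds with classifier-free guidance (CFG), which runs the model twice per step. The sketch below uses a stand-in for the model forward pass and a toy update rule; real samplers follow a noise schedule, but the call-count arithmetic is the point here.

```python
import numpy as np

def predict_noise(x, t, cond):
    # stand-in for a DiT forward pass; in practice this is a large Transformer
    return 0.1 * x if cond is None else 0.1 * x + 0.05

def sample_with_cfg(shape, steps=25, guidance=7.5, seed=0):
    x = np.random.default_rng(seed).normal(size=shape)
    calls = 0
    for t in range(steps):
        eps_cond = predict_noise(x, t, cond="prompt")  # conditional pass
        eps_uncond = predict_noise(x, t, cond=None)    # unconditional pass
        calls += 2  # CFG doubles the model calls per step
        # classifier-free guidance: push the prediction toward the prompt
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        x = x - eps / steps  # toy update; real samplers use a noise schedule
    return x, calls

latent, n_calls = sample_with_cfg((4, 8))  # 25 steps -> 50 forward passes
```

A 25-step CFG sampler therefore pays for 50 forward passes, which is the baseline against which few-step, no-CFG approaches look so attractive.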
## Inference Efficiency Compared
| Model | Steps | CFG | Approx. Time (5s clip, 1080p) |
|---|---|---|---|
| HappyHorse-1.0 | 8 | No | ~38s (H100, claimed) |
| Seedance 2.0 | ~30 | Yes | ~60s (estimated) |
| Kling 3.0 Pro | ~25 | Yes | ~45s (estimated) |
| WAN 2.6 | ~30 | Yes | ~90s (A100) |
| LTX 2.3 | ~20 | Yes | ~30s (consumer GPU) |
HappyHorse's claimed 8-step, no-CFG inference is remarkable. Skipping classifier-free guidance (CFG) halves the model calls per step, since no unconditional pass is needed, and reaching eight steps likely implies consistency distillation or rectified-flow training, techniques that compress multi-step sampling into a few direct prediction steps.
## License Landscape
| Model | License | Commercial Use | Weights Available |
|---|---|---|---|
| HappyHorse-1.0 | Claimed open | Claimed yes | No (Coming Soon) |
| WAN 2.6 | Apache 2.0 | Yes | Yes |
| Hunyuan Video | Tencent License | Limited | Yes |
| LTX Video 2.3 | Apache 2.0 | Yes | Yes |
| Open-Sora | Apache 2.0 | Yes | Yes |
For production deployment today, Apache 2.0 licensed models (WAN 2.6, LTX 2.3) offer the clearest legal path. HappyHorse's license terms cannot be evaluated until weights are actually released.
## What HappyHorse Means for the Field
If the claims hold up when weights are released:
- Cross-attention may be unnecessary for top-quality video generation
- Sub-10-step inference is achievable at SOTA quality levels
- Joint audio-video in a single model can match or beat separate pipelines
These implications would affect how the next generation of video models is designed. Teams currently investing in complex multi-stream architectures may need to reconsider.
The caveat remains: none of this is verified. The weights will tell the truth.