# Open Source Transformer Video Models: Architecture, Licenses, and Benchmarks
The video generation field is undergoing an architectural shift. HappyHorse-1.0's debut at #1 on the Artificial Analysis leaderboard with a pure self-attention Transformer (no diffusion backbone, no cross-attention) lends credibility to an approach many researchers considered too simple to compete on video quality.
## The Architectural Divide
### Multi-Stream Diffusion (Traditional)
Most established video models use multi-stream architectures in which text, video, and audio each have a dedicated encoder branch. The branches interact through cross-attention layers, which adds parameters and lengthens the inference path.
Examples: Stable Video Diffusion, early Kling versions, Hunyuan Video
Pros: Modular; each stream can be optimized independently
Cons: Parameter redundancy, longer inference paths, fragmented audio-visual alignment
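The cross-attention pattern that links these streams can be sketched in a few lines. This is a minimal NumPy illustration, not code from any of the models above: token counts and widths are toy values, and the learned Q/K/V projection matrices a real block would apply are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    """Video-stream queries attend to text-stream keys/values.

    Learned projections are omitted for brevity; a real block would
    project both inputs before the dot product.
    """
    scores = queries @ context.T / np.sqrt(d_k)  # (n_video, n_text)
    return softmax(scores) @ context             # (n_video, d)

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(16, 64))  # toy video stream
text_tokens = rng.normal(size=(8, 64))    # toy text stream
fused = cross_attention(video_tokens, text_tokens, d_k=64)
```

Note that every such bridge between streams carries its own weights, which is where the parameter redundancy mentioned above comes from.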
### Single-Stream Transformer (Emerging)
HappyHorse-1.0 represents the single-stream approach: all modalities — text, video, audio — are tokenized into a single sequence and processed through shared self-attention layers.
Claimed specs: 40 layers total, with 4 modality-specific layers at each end and 32 shared layers in the middle
Pros: Higher parameter efficiency, shorter inference path, native audio-video synchronization
Cons: Harder to train (all modalities must be learned jointly), single failure point
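The single-stream idea can be sketched as concatenating all modality tokens into one sequence and running it through shared self-attention, so every token attends across modality boundaries. Layer count, token counts, and width below are toy values, not HappyHorse's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k):
    # every token attends to every other token, regardless of modality
    scores = tokens @ tokens.T / np.sqrt(d_k)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))    # toy text tokens
video = rng.normal(size=(16, 64))  # toy video tokens
audio = rng.normal(size=(4, 64))   # toy audio tokens

# single stream: one concatenated sequence through shared layers
seq = np.concatenate([text, video, audio], axis=0)  # (28, 64)
for _ in range(3):  # stand-in for the shared middle layers
    seq = self_attention(seq, d_k=64)
```

Because audio and video tokens sit in the same attention window, their alignment is learned directly rather than stitched together across separate branches.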
### Diffusion Transformers (DiT)
A middle ground: using Transformer blocks as the backbone of a diffusion process. Kling 3.0 and FLUX use this approach.
Pros: Combines Transformer scalability with proven diffusion training
Cons: Still requires many denoising steps (typically 20-50)
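The denoising-step cost compounds with classifier-free guidance (CFG), which runs the model twice per step. The sketch below uses a stand-in for the model forward pass and a toy update rule; real samplers follow a noise schedule, but the call-count arithmetic is the point here.

```python
import numpy as np

def predict_noise(x, t, cond):
    # stand-in for a DiT forward pass; in practice this is a large Transformer
    return 0.1 * x if cond is None else 0.1 * x + 0.05

def sample_with_cfg(shape, steps=25, guidance=7.5, seed=0):
    x = np.random.default_rng(seed).normal(size=shape)
    calls = 0
    for t in range(steps):
        eps_cond = predict_noise(x, t, cond="prompt")  # conditional pass
        eps_uncond = predict_noise(x, t, cond=None)    # unconditional pass
        calls += 2  # CFG doubles the model calls per step
        # classifier-free guidance: push the prediction toward the prompt
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        x = x - eps / steps  # toy update; real samplers use a noise schedule
    return x, calls

latent, n_calls = sample_with_cfg((4, 8))  # 25 steps -> 50 forward passes
```

A 25-step CFG sampler therefore pays for 50 forward passes, which is the baseline against which few-step, no-CFG approaches look so attractive.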
## Inference Efficiency Compared
| Model | Steps | CFG | Approx. Time (5s clip, 1080p) |
|---|---|---|---|
| HappyHorse-1.0 | 8 | No | ~38s (H100, claimed) |
| Seedance 2.0 | ~30 | Yes | ~60s (estimated) |
| Kling 3.0 Pro | ~25 | Yes | ~45s (estimated) |
| WAN 2.6 | ~30 | Yes | ~90s (A100) |
| LTX 2.3 | ~20 | Yes | ~30s (consumer GPU) |
HappyHorse's claimed 8-step, no-CFG inference is remarkable. Skipping classifier-free guidance (CFG) halves the model calls per step, since no unconditional pass is needed, and reaching eight steps likely implies consistency distillation or rectified-flow training, techniques that compress multi-step sampling into a few direct prediction steps.
## License Landscape
| Model | License | Commercial Use | Weights Available |
|---|---|---|---|
| HappyHorse-1.0 | Claimed open | Claimed yes | No (Coming Soon) |
| WAN 2.6 | Apache 2.0 | Yes | Yes |
| Hunyuan Video | Tencent License | Limited | Yes |
| LTX Video 2.3 | Apache 2.0 | Yes | Yes |
| Open-Sora | Apache 2.0 | Yes | Yes |
For production deployment today, Apache 2.0 licensed models (WAN 2.6, LTX 2.3) offer the clearest legal path. HappyHorse's license terms cannot be evaluated until weights are actually released.
## What HappyHorse Means for the Field
If the claims hold up when weights are released:
- Cross-attention may be unnecessary for top-quality video generation
- Sub-10-step inference is achievable at SOTA quality levels
- Joint audio-video in a single model can match or beat separate pipelines
These implications would affect how the next generation of video models is designed. Teams currently investing in complex multi-stream architectures may need to reconsider.
The caveat remains: none of this is verified. The weights will tell the truth.