Image-to-Video vs Text-to-Video: Which Open Source Models Win Each Category
HappyHorse-1.0 topped both the T2V and I2V leaderboards, and its image-to-video score is notably higher in absolute terms. This isn't a coincidence: the two tasks have different requirements, and models that excel at one don't always lead in the other.
Why Image-to-Video Scores Higher for HappyHorse
In the Artificial Analysis Video Arena, HappyHorse-1.0 scored:
- T2V (no audio): 1333 Elo (+60 over #2)
- I2V (no audio): 1392 Elo (+37 over #2)
The absolute I2V score is 59 points higher than the T2V score. This suggests HappyHorse's architecture is particularly strong at preserving and animating reference-image content, which is the core requirement for digital human and character animation use cases.
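To put those margins in perspective, an Elo gap translates directly into an expected head-to-head win rate. A minimal sketch, assuming the arena uses the standard logistic Elo formula with a 400-point scale:

```python
def elo_win_prob(rating_gap: float) -> float:
    """Expected win probability for the higher-rated model,
    given its Elo advantage (standard 400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# HappyHorse-1.0's margins over the #2 model in each arena:
print(f"T2V (+60 Elo): {elo_win_prob(60):.1%}")  # ~58.5% expected win rate
print(f"I2V (+37 Elo): {elo_win_prob(37):.1%}")  # ~55.3% expected win rate
```

In other words, a 60-point lead implies the model wins roughly 3 out of every 5 blind pairwise comparisons against the runner-up.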
The official site emphasizes "human-centric scenarios, facial performance, lip-syncing" — all of which are I2V strengths. This positioning targets the virtual streamer, AI micro-drama, and cross-lingual promotional video markets.
Text-to-Video: Creative Control
T2V models generate video purely from text descriptions. This gives the model full creative control over composition, lighting, character appearance, and camera movement.
Strengths
- No reference image needed
- Full creative freedom
- Better for abstract or fantastical content
- Easier prompt iteration
Limitations
- Character consistency is harder
- Style can vary between generations
- Requires more detailed prompts for specific visuals
Best Open Source T2V Models (April 2026)
- HappyHorse-1.0 — 1333 Elo (unavailable)
- WAN 2.6 — 1189 Elo (available, Apache 2.0)
- LTX Video 2.3 — ~1100 Elo (available, consumer GPU)
Image-to-Video: Visual Consistency
I2V models take a reference image and animate it. This ensures visual consistency — the character, style, and composition match the input.
Strengths
- Perfect character consistency from frame 1
- Works with existing brand assets
- Better for product demos and character animation
- More predictable output quality
Limitations
- Requires a quality reference image
- Less creative flexibility
- Can look "uncanny" if animation quality doesn't match image quality
Best Open Source I2V Models (April 2026)
- HappyHorse-1.0 — 1392 Elo (unavailable)
- WAN 2.6 — Competitive I2V (available)
- Kling 3.0 Omni — 1297 Elo (API only)
When to Use Which
| Scenario | Better Mode | Why |
|---|---|---|
| Brand video with existing characters | I2V | Consistency with brand assets |
| Creative concept exploration | T2V | Maximum creative freedom |
| Virtual streamer content | I2V | Character identity preservation |
| Product demo animation | I2V | Match product photos exactly |
| Music video with abstract visuals | T2V | No reference constraint |
| Multi-shot narrative | Both | I2V for key shots, T2V for establishing shots |
| Social media content | T2V | Speed of iteration |
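The decision logic in the table can be condensed into a simple heuristic. A rough sketch; the rules and parameter names are illustrative, not from any official tooling:

```python
def choose_mode(has_reference_image: bool,
                needs_identity_consistency: bool,
                abstract_or_fantastical: bool) -> str:
    """Illustrative mode-selection heuristic distilled from the table above."""
    if has_reference_image and needs_identity_consistency:
        return "I2V"  # brand assets, virtual streamers, product demos
    if abstract_or_fantastical or not has_reference_image:
        return "T2V"  # creative exploration, abstract visuals
    return "I2V"      # default to consistency when a reference exists

print(choose_mode(True, True, False))   # I2V: e.g. virtual streamer content
print(choose_mode(False, False, True))  # T2V: e.g. abstract music video
```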
Unified Models: The HappyHorse Approach
HappyHorse-1.0's single-pipeline architecture handles both T2V and I2V with the same model. This is significant because:
- One model to deploy: Simpler infrastructure, lower cost
- Shared learning: I2V and T2V training data benefit each other
- Consistent style: Outputs from both modes look like they came from the same model
- Audio included: Both modes generate synchronized audio
Most production pipelines today run separate specialized models for T2V and I2V. A unified model that leads in both categories could simplify these pipelines significantly.
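A unified model also collapses client-side branching into a single request shape. A hypothetical sketch of what a unified-endpoint payload might look like; the field names, structure, and `VideoRequest` type are assumptions for illustration, not HappyHorse's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VideoRequest:
    """Hypothetical request type for a unified T2V/I2V endpoint."""
    prompt: str
    reference_image: Optional[bytes] = None  # None -> T2V, set -> I2V

def build_payload(req: VideoRequest) -> dict:
    # Mode is inferred from the presence of a reference image,
    # so callers never pick a separate model for each task.
    payload = {"prompt": req.prompt,
               "mode": "i2v" if req.reference_image else "t2v"}
    if req.reference_image:
        payload["image"] = req.reference_image
    return payload

print(build_payload(VideoRequest("a horse galloping"))["mode"])        # t2v
print(build_payload(VideoRequest("animate this", b"...png"))["mode"])  # i2v
```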
The Practical Recommendation
Today, for teams that need both T2V and I2V capabilities:
- Self-hosted: WAN 2.6 for both modes (Apache 2.0, available now)
- API-based: PixVerse V6 for T2V ($5.40/min), Kling 3.0 for I2V ($13.44/min)
- When available: HappyHorse-1.0 for both (single model, potentially best quality in both modes)
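For budgeting against the API rates above, per-minute pricing prorates linearly to clip length. A quick sketch, assuming simple linear proration with no per-request minimum (actual billing granularity may differ):

```python
def clip_cost(rate_per_min: float, seconds: float) -> float:
    """Prorate a per-minute API rate to an arbitrary clip length."""
    return rate_per_min * seconds / 60.0

# Cost of a 30-second clip at the listed rates:
print(f"PixVerse V6 T2V: ${clip_cost(5.40, 30):.2f}")  # $2.70
print(f"Kling 3.0 I2V:   ${clip_cost(13.44, 30):.2f}")  # $6.72
```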