YoCausal: How Far is Video Generation from World Model?

TL;DR: YoCausal is the first benchmark evaluating causal cognition in video generation models, inspired by cognitive science experiments that test whether infants perceive causality using reversed videos. Our benchmark can incorporate any real-world video at zero cost, making it arbitrarily extensible to easily assess video generation models' understanding of diverse types of causality.

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an infinitely extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Contribution

The first causality benchmark for VDMs grounded in cognitive science.
An arbitrarily scalable real-world dataset that frees evaluation from the sim-to-real gap.
A cognitive-science-grounded two-level framework that disentangles arrow-of-time perception from causal cognition.
Evidence that current open-source VDMs lack causal understanding, revealing a critical gap on the path toward world models and providing guidance for closing it.

Methodology Framework

Overview of the YoCausal evaluation framework.

(a) Dataset Construction: We construct an infinitely extensible benchmark by using real-world videos from different domains. By applying zero-cost temporal reversal, we generate natural counterfactual pairs (forward x_f and reverse x_r).
(b) Level 1 Temporal Perception: We add identical sampled noise ε to both sequences and compute their denoising losses, L_f for the forward video and L_r for the reversed one. The Reverse Surprise Index (RSI) quantifies the model’s perception of the arrow of time by measuring the proportion of instances where the reversed video has a higher loss (L_r > L_f).
(c) Level 2 Causality Disentanglement: To disentangle genuine causal cognition from statistical temporal biases, a Vision-Language Model (VLM) divides the dataset into causal (D_c) and non-causal (D_nc) subsets. The Level-2 metric, Causal Cognition Index (CCI), is computed as the difference in RSI between these subsets, isolating the model’s genuine causal cognition ability.
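The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: videos are reduced to one scalar per frame, `toy_loss` is a hypothetical stand-in for a VDM's denoising loss, and every function name here is invented for the example. The sign convention CCI = RSI(D_c) - RSI(D_nc) is also an assumption, since the text only says the metric is the difference in RSI between the two subsets.

```python
import random
from typing import Callable, List, Sequence, Tuple

Video = List[float]  # toy stand-in: one scalar per frame, not real pixels
LossFn = Callable[[Video, Sequence[float]], float]
LossPair = Tuple[float, float]  # (L_f, L_r)


def reverse_video(frames: Video) -> Video:
    """Temporal reversal: the forward clip x_f becomes its counterpart x_r."""
    return frames[::-1]


def paired_losses(x_f: Video, loss_fn: LossFn, rng: random.Random,
                  sigma: float = 0.1) -> LossPair:
    """Return (L_f, L_r), scoring both directions with the same noise sample."""
    eps = [rng.gauss(0.0, sigma) for _ in x_f]  # sampled once, shared by both
    return loss_fn(x_f, eps), loss_fn(reverse_video(x_f), eps)


def rsi(pairs: Sequence[LossPair]) -> float:
    """Fraction of pairs where the reversed loss exceeds the forward loss."""
    if not pairs:
        raise ValueError("RSI is undefined for an empty set of pairs")
    return sum(1 for l_f, l_r in pairs if l_r > l_f) / len(pairs)


def cci(causal_pairs: Sequence[LossPair],
        non_causal_pairs: Sequence[LossPair]) -> float:
    """Assumed sign convention: RSI(D_c) - RSI(D_nc)."""
    return rsi(causal_pairs) - rsi(non_causal_pairs)


def toy_loss(video: Video, eps: Sequence[float]) -> float:
    """Hypothetical scorer that expects each noisy frame to rise by 1."""
    noisy = [v + e for v, e in zip(video, eps)]
    steps = [b - a for a, b in zip(noisy, noisy[1:])]
    return sum((s - 1.0) ** 2 for s in steps) / len(steps)


rng = random.Random(0)
rising = [[float(t) for t in range(8)] for _ in range(5)]
pairs = [paired_losses(v, toy_loss, rng) for v in rising]
print("RSI on rising toy clips:", rsi(pairs))
```

Ties (L_r equal to L_f) are counted as not surprised here, following the strict inequality L_r > L_f in the text. In a real evaluation, `toy_loss` would be replaced by the model's denoising loss, and the causal and non-causal pair lists passed to `cci` would come from the VLM-based split.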

Ranking

RSI leaderboard columns: Rank, Model, Release Date, General, Physics, Human, Animal, Average

CCI leaderboard columns: Rank, Model, Release Date, CCI(D), Normalize, RSI(Dc), RSI(Dnc)

Aggregate Rank leaderboard columns: Rank, Model, RSI Avg, RSI Rank, CCI(D), CCI Rank, Aggregate Score

We provide two submission protocols for the leaderboard:

Result Submission: Participants run the evaluation on their own side and submit the resulting JSON files to us.
Model Submission: Participants submit their model weights, inference code, and environment setup, and we will run the evaluation on their behalf.

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.

Citation

@article{xie2026yocausal,
  title   = {YoCausal: How Far is Video Generation from World Model? A Causality Perspective},
  author  = {Xie, You-Zhe and Li, Yu-Hsuan and Lee, Jie-Ying and Zhang, Kaipeng and Liu, Yu-Lun and Wang, Zhixiang},
  journal = {arXiv preprint arXiv:2605.30346},
  year    = {2026}
}
