YoCausal: How Far is Video Generation from World Model?
A Causality Perspective

You-Zhe Xie*, Yu-Hsuan Li*, Jie-Ying Lee, Kaipeng Zhang,
Yu-Lun Liu, Zhixiang Wang
(* equal contribution, † corresponding authors)
National Yang Ming Chiao Tung University    Alaya Studio

Teaser Video

Legend: Causally Correct / Partially Causally Correct / Causal Failure
TL;DR: YoCausal is the first benchmark evaluating causal cognition in video generation models, inspired by cognitive science experiments that use reversed videos to test whether infants perceive causality. Because the benchmark can incorporate any real-world video at zero cost, it can be extended arbitrarily to assess video generation models' understanding of diverse types of causality.

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an infinitely extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Contribution

  1. The first causality benchmark for VDMs grounded in cognitive science.
  2. An arbitrarily scalable real-world dataset that frees evaluation from the sim-to-real gap.
  3. A cognitive-science-grounded two-level framework that disentangles arrow-of-time perception from causal cognition.
  4. Evidence that current open-source VDMs lack causal understanding, revealing a critical gap on the path toward world models and offering guidance for closing it.

Methodology Framework

Overview

Overview of the YoCausal evaluation framework.

  • (a) Dataset Construction: We construct an infinitely extensible benchmark from real-world videos across different domains. By applying zero-cost temporal reversal, we generate natural counterfactual pairs (forward x_f and reverse x_r).
  • (b) Level 1 Temporal Perception: We add identical sampled noise ε to both sequences and compute their denoising losses. The Reverse Surprise Index (RSI) quantifies the model’s perception of the arrow of time as the proportion of instances in which the reversed video has a higher loss (L_r > L_f).
  • (c) Level 2 Causality Disentanglement: To disentangle genuine causal cognition from statistical temporal biases, a Vision-Language Model (VLM) divides the dataset into causal (D_c) and non-causal (D_nc) subsets. The Level-2 metric, the Causality Cognition Index (CCI), is computed as the difference in RSI between these subsets, isolating the model’s genuine causal cognition ability.
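
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's released code: the toy video representation (a list of flattened frames), the `predict_noise` model interface, the additive noising rule `x + sigma * eps`, and the sign convention CCI = RSI(D_c) - RSI(D_nc) are all assumptions made for this example. The normalization shown on the leaderboard is not covered.

```python
import random
from typing import Callable, List, Sequence, Tuple

Video = List[List[float]]                  # toy stand-in: frames x flattened pixels
NoisePredictor = Callable[[Video], Video]  # hypothetical model interface


def reverse_video(video: Video) -> Video:
    """Temporal reversal: the same frames in the opposite order."""
    return video[::-1]


def denoising_loss(video: Video, noise: Video,
                   predict_noise: NoisePredictor, sigma: float = 0.5) -> float:
    """Mean squared error between predicted and true noise.

    Assumption: noise is added as x + sigma * eps. Real VDMs use their own
    noise schedule and usually operate in a latent space.
    """
    noisy = [[x + sigma * e for x, e in zip(frame, eps)]
             for frame, eps in zip(video, noise)]
    pred = predict_noise(noisy)
    errors = [(p - e) ** 2
              for pred_frame, eps in zip(pred, noise)
              for p, e in zip(pred_frame, eps)]
    return sum(errors) / len(errors)


def loss_pair(video: Video, predict_noise: NoisePredictor,
              rng: random.Random) -> Tuple[float, float]:
    """Return (L_f, L_r) using one shared noise sample for both directions."""
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in video]
    l_f = denoising_loss(video, noise, predict_noise)
    l_r = denoising_loss(reverse_video(video), noise, predict_noise)
    return l_f, l_r


def rsi(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (L_f, L_r) pairs in which the reversed loss is strictly higher."""
    if not pairs:
        raise ValueError("RSI is undefined for an empty set of pairs")
    return sum(1 for l_f, l_r in pairs if l_r > l_f) / len(pairs)


def cci(causal_pairs: Sequence[Tuple[float, float]],
        non_causal_pairs: Sequence[Tuple[float, float]]) -> float:
    """Assumed sign convention: RSI on the causal subset minus RSI on the non-causal subset."""
    return rsi(causal_pairs) - rsi(non_causal_pairs)
```

In this sketch, ties (L_r equal to L_f) do not count as surprise, so a model that ignores frame order entirely scores an RSI of 0 rather than 0.5. How ties are handled in the actual benchmark is not stated here.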

Ranking

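Leaderboard views and their columns:

  • RSI: Rank | Model | Release Date | General | Physics | Human | Animal | Average
  • CCI: Rank | Model | Release Date | CCI(D) | Normalize | RSI(Dc) | RSI(Dnc)
  • Aggregate Rank: Rank | Model | RSI Avg | RSI Rank | CCI(D) | CCI Rank | Aggregate Score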

We provide two submission protocols for the leaderboard:

  1. Result Submission: Participants run the evaluation on their own side and submit the resulting JSON files to us.
  2. Model Submission: Participants submit their model weights, inference code, and environment setup, and we will run the evaluation on their behalf.

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.

Subset Video

Causal / Non-Causal Video

Limitation: Bidirectionally Plausible Example

Citation

@article{xie2026yocausal,
  title   = {YoCausal: How Far is Video Generation from World Model? A Causality Perspective},
  author  = {Xie, You-Zhe and Li, Yu-Hsuan and Lee, Jie-Ying and Zhang, Kaipeng and Liu, Yu-Lun and Wang, Zhixiang},
  journal = {arXiv preprint arXiv:2605.30346},
  year    = {2026}
}