Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask²DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask²DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
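To make the symmetric binary mask concrete, below is a minimal PyTorch sketch. It is an illustration rather than the paper's implementation, and it assumes one plausible token layout (all text segments followed by all video segments) and one plausible reading of the mask: every visual token attends to every visual token for temporal coherence, while each text prompt attends only to itself and its own video segment.

```python
import torch

def build_symmetric_mask(text_lens, video_lens):
    """Sketch of a symmetric binary attention mask for n scenes.

    Assumed token layout: [text_1, ..., text_n, video_1, ..., video_n].
    Assumed rules (an illustrative reading of the mechanism):
      * every video token attends to every video token (temporal coherence),
      * text_i and video_i attend to each other,
      * text_i never attends to text_j or video_j for j != i.
    Returns a boolean mask of shape (L, L); True = attention allowed.
    """
    n = len(text_lens)
    total = sum(text_lens) + sum(video_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Token index ranges of each text and video segment.
    spans, offset = [], 0
    for length in list(text_lens) + list(video_lens):
        spans.append((offset, offset + length))
        offset += length
    text_spans, video_spans = spans[:n], spans[n:]

    # Visual tokens attend to all visual tokens across segments.
    v0, v1 = video_spans[0][0], video_spans[-1][1]
    mask[v0:v1, v0:v1] = True

    # Each text prompt attends only to itself and its own video segment (and vice versa).
    for (ts, te), (vs, ve) in zip(text_spans, video_spans):
        mask[ts:te, ts:te] = True
        mask[ts:te, vs:ve] = True
        mask[vs:ve, ts:te] = True
    return mask

# Example: 2 scenes, 3 text tokens and 4 video tokens per scene.
m = build_symmetric_mask([3, 3], [4, 4])
assert torch.equal(m, m.T)  # the mask is symmetric
```

Because the mask is binary and symmetric, the same block-structured mask can be applied unchanged at every attention layer, for example as the boolean attention mask of a standard scaled dot-product attention call.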
Mask²DiT is a dual-mask diffusion transformer designed for multi-scene video generation under a multi-prompt setting. Built upon the scalable DiT architecture, it introduces two key components:
(1) A symmetric binary attention mask that ensures fine-grained alignment between each text prompt and its corresponding video segment, allowing the model to focus on scene-specific guidance while preserving temporal coherence across visual tokens.
(2) A segment-level conditional mask that enables auto-regressive scene extension by conditioning the generation of each new scene on the preceding segments.
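A minimal sketch of this segment-level conditioning is given below. It is an illustration under assumed details that the summary does not specify (the noising rule and token bookkeeping are simplified): tokens belonging to previously generated segments are kept as clean conditioning, and only the new segment is noised, so its denoising is conditioned on the preceding scenes.

```python
import torch

def apply_segment_condition(latents, segment_ids, cond_segments, noise, t_scale):
    """Illustrative segment-level conditioning for auto-regressive scene extension.

    latents:       (L, D) clean video-token latents covering all segments.
    segment_ids:   (L,) integer segment index of each token.
    cond_segments: set of segment indices coming from already-generated scenes.
    noise:         (L, D) Gaussian noise.
    t_scale:       scalar noise level of the current diffusion step (assumed schedule).

    Tokens of conditioning segments stay clean; only the new segment is noised,
    so denoising the new scene is conditioned on the preceding video segments.
    """
    cond_mask = torch.tensor(
        [sid in cond_segments for sid in segment_ids.tolist()], dtype=torch.bool
    ).unsqueeze(-1)                                # (L, 1), True = conditioning token
    noisy = latents + t_scale * noise              # simplified noising, for illustration only
    return torch.where(cond_mask, latents, noisy)  # keep conditioning tokens clean

# Example: 8 tokens over 2 segments; segment 0 is the previously generated scene.
lat = torch.randn(8, 16)
seg = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
out = apply_segment_condition(lat, seg, {0}, torch.randn(8, 16), t_scale=0.5)
assert torch.equal(out[:4], lat[:4])  # the preceding scene is left untouched
```

The same boolean segment mask could also restrict the training loss to the newly generated segment, though that choice is an assumption here rather than something stated in the summary.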
To support this design, the model is trained in two stages: pretraining on concatenated single-scene clips to adapt the model to longer sequences, followed by fine-tuning on curated multi-scene datasets to improve consistency and alignment. During inference, the same masking mechanisms guide generation toward coherent, semantically aligned multi-scene videos, yielding significant improvements over state-of-the-art baselines in both objective metrics and human evaluations.
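The first training stage is essentially a data-construction step; the sketch below shows one simple way such pseudo multi-scene samples could be built. The clip sampling and caption handling are illustrative assumptions, not the paper's pipeline.

```python
import random
import torch

def make_pseudo_multiscene(clips, captions, n_scenes=2, seed=None):
    """Illustrative stage-1 sample: concatenate single-scene clips along time.

    clips:    list of frame tensors, each of shape (T_i, C, H, W) with matching C, H, W.
    captions: list of matching single-scene captions.
    Returns (video, prompts, seg_lens): a longer video of shape (sum T_i, C, H, W)
    built from n_scenes randomly chosen clips, their captions in the same order,
    and the per-segment lengths needed to build the attention mask.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(clips)), n_scenes)
    video = torch.cat([clips[i] for i in idx], dim=0)
    prompts = [captions[i] for i in idx]
    seg_lens = [clips[i].shape[0] for i in idx]
    return video, prompts, seg_lens

# Example with three toy clips of 8, 12, and 10 frames.
clips = [torch.randn(8, 3, 32, 32), torch.randn(12, 3, 32, 32), torch.randn(10, 3, 32, 32)]
caps = ["a dog runs on grass", "a cat sleeps on a sofa", "waves break on a beach"]
video, prompts, seg_lens = make_pseudo_multiscene(clips, caps, n_scenes=2, seed=0)
```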
Quantitatively: Mask²DiT consistently outperforms state-of-the-art baselines in multi-scene video generation, achieving superior Visual and Sequence Consistency and delivering the best visual quality (lowest FVD), while maintaining competitive semantic alignment. In addition, it supports auto-regressive scene extension and effectively maintains both visual and semantic consistency between the generated and preceding scenes.
Mask²DiT delivers significantly better visual coherence than SOTA baselines, demonstrating superior consistency in character appearance, background integrity, and overall style across multi-scene videos.
[Qualitative comparison videos: CogVideoX, StoryDiffusion, TALC, VideoStudio, Mask²DiT]