Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

¹University of Science and Technology of China,  ²Bytedance Intelligent Creation,  ³Yuanshi Inc.
*Work done during an internship at Bytedance Intelligent Creation.  Project lead.  Corresponding author.

Model Capabilities

🎬 Fixed-Scene Generation

This video is generated in a single pass using three different text prompts, each guiding a 6-second scene, resulting in an 18-second multi-scene video.

⏩ Auto-Regressive Scene Extension

This video demonstrates auto-regressive scene extension, where the model generates the third 6-second scene conditioned on the first two 6-second scenes (12s in total) as context.

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask²DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask²DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.

Method

Mask²DiT is a dual-mask diffusion transformer designed for multi-scene video generation under a multi-prompt setting. Built upon the scalable DiT architecture, it introduces two key components:
(1) a symmetric binary attention mask that ensures fine-grained alignment between each text prompt and its corresponding video segment, allowing the model to focus on scene-specific guidance while maintaining intra-segment visual coherence; and (2) a segment-level conditional mask that enables auto-regressive scene extension by conditioning the generation of each new scene on the preceding segments. To support this design, the model is trained in two stages: pretraining on concatenated single-scene clips to adapt to longer sequences, followed by fine-tuning on curated multi-scene datasets to improve consistency and alignment. During inference, these masking mechanisms guide the model to generate coherent and semantically aligned multi-scene videos, showing significant improvements over state-of-the-art baselines in both objective metrics and human evaluations.
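To make the two masking mechanisms concrete, below is a minimal illustrative sketch in PyTorch, not the released implementation. It assumes the transformer's token sequence concatenates all prompt tokens followed by all visual tokens, with `n_text` tokens per prompt and `n_vid` tokens per 6-second segment; the function names, token layout, and arguments are hypothetical.

```python
import torch

def build_symmetric_binary_mask(num_scenes: int, n_text: int, n_vid: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over the token layout
    [text_1 | ... | text_S | video_1 | ... | video_S]: each prompt attends
    only to its own video segment (and vice versa), while visual tokens
    attend to all visual tokens to preserve temporal coherence."""
    total = num_scenes * (n_text + n_vid)
    mask = torch.zeros(total, total, dtype=torch.bool)
    vid_start = num_scenes * n_text
    for i in range(num_scenes):
        t = slice(i * n_text, (i + 1) * n_text)                      # prompt i
        v = slice(vid_start + i * n_vid, vid_start + (i + 1) * n_vid)  # segment i
        mask[t, t] = True   # text tokens attend within their own prompt
        mask[t, v] = True   # prompt i sees only video segment i ...
        mask[v, t] = True   # ... and segment i sees only prompt i (symmetric)
    mask[vid_start:, vid_start:] = True  # full attention among visual tokens
    return mask

def build_conditional_mask(num_scenes: int, n_vid: int, num_condition_scenes: int) -> torch.Tensor:
    """Segment-level conditional flags: True marks visual tokens of existing
    (clean) segments that condition the newly generated scene."""
    cond = torch.zeros(num_scenes * n_vid, dtype=torch.bool)
    cond[: num_condition_scenes * n_vid] = True
    return cond

# Example: 3 scenes, 32 text tokens per prompt, 1024 visual tokens per segment;
# the first 2 segments serve as clean context when extending to a third scene.
attn_mask = build_symmetric_binary_mask(num_scenes=3, n_text=32, n_vid=1024)
cond_flags = build_conditional_mask(num_scenes=3, n_vid=1024, num_condition_scenes=2)
```

In a DiT block, such a boolean mask could be passed to the attention operation (e.g., as the `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention`), while the conditioning flags mark which visual tokens are kept noise-free during auto-regressive extension.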


Comparisons with State-of-the-Arts

Quantitatively: Mask²DiT consistently outperforms state-of-the-art baselines in multi-scene video generation, achieving superior Visual and Sequence Consistency and delivering the best visual quality (lowest FVD), while maintaining competitive semantic alignment. In addition, it supports auto-regressive scene extension and effectively maintains both visual and semantic consistency between the generated and preceding scenes.

Qualitative Video Comparisons

Mask²DiT delivers significantly better visual coherence than SOTA baselines, demonstrating superior consistency in character appearance, background integrity, and overall style across multi-scene videos.

Video results for Scripts 1–8, each comparing CogVideoX, StoryDiffusion, TALC, VideoStudio, and Mask²DiT.