FreeNoise: Tuning-Free Longer Video Diffusion
via Noise Rescheduling

ICLR 2024

Haonan Qiu1 Menghan Xia*2 Yong Zhang2

Yingqing He2,3 Xintao Wang2 Ying Shan2 Ziwei Liu*1

1 Nanyang Technological University   2 Tencent AI Lab   3 Hong Kong University of Science and Technology
(* Corresponding Author)

[arXiv]      [Code]


✅ completely tuning-free      ✅ less than 20% extra time      ✅ supports 512 frames     

Abstract

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time. Furthermore, these models support only single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building upon this observation, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noise for all frames independently, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them via a window-based function. 2) Additionally, we design a novel motion injection method to support generating videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, compared with the previous best-performing method, which incurred about 255% extra time cost, our method incurs a negligible time cost of approximately 17%.
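For intuition, below is a minimal PyTorch sketch of the two inference-time ingredients named above: rescheduling the initial noise by reusing the base noise frames with local shuffles, and fusing temporal attention over sliding windows of the training length. The function names, window/stride defaults, and latent shape are illustrative assumptions, not the exact implementation in our code release.

```python
import torch

def reschedule_noise(base_noise: torch.Tensor, total_frames: int,
                     shuffle_window: int = 4) -> torch.Tensor:
    """Extend (L, C, H, W) i.i.d. noise to `total_frames` frames.

    Frames beyond L reuse the first L noise frames, shuffled within
    local windows, so distant frames stay correlated (content
    consistency) without repeating in exactly the same order.
    """
    L = base_noise.shape[0]
    blocks = [base_noise]
    while sum(b.shape[0] for b in blocks) < total_frames:
        shuffled = base_noise.clone()
        for s in range(0, L, shuffle_window):
            e = min(s + shuffle_window, L)
            shuffled[s:e] = base_noise[s + torch.randperm(e - s)]
        blocks.append(shuffled)
    return torch.cat(blocks)[:total_frames]

def windowed_temporal_attention(attn_fn, q, k, v,
                                window: int = 16, stride: int = 8):
    """Run a temporal attention callable `attn_fn` over sliding windows
    of the training length and average overlapping outputs, so attention
    never spans more frames than the model saw during training.
    Tensors are (T, ...) with frames on the first axis; assumes T >= window.
    """
    T = q.shape[0]
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:  # make sure the tail frames are covered
        starts.append(T - window)
    out = torch.zeros_like(q)
    count = torch.zeros(T, *([1] * (q.dim() - 1)))
    for s in starts:
        out[s:s + window] += attn_fn(q[s:s + window],
                                     k[s:s + window],
                                     v[s:s + window])
        count[s:s + window] += 1
    return out / count  # average the overlapping window outputs

# Example: extend 16 frames of latent noise to 64 frames.
base = torch.randn(16, 4, 40, 64)  # (L, C, H, W), illustrative latent shape
long_noise = reschedule_noise(base, total_frames=64)
```

In the actual pipeline these windows operate inside the U-Net's temporal attention layers on noisy latents; the sketch only conveys the scheduling and fusion logic.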

Comparisons of Longer Video Generation

Comparisons of Multi-Prompt Video Generation

Ablation for Noise Rescheduling

Ablation for Motion Injection

Longer Results with 512 Frames

Multi-Prompt Results with 256 Frames

FreeNoise+AnimateDiff [Code]

FreeNoise is also applicable to another video LDM framework, AnimateDiff.
w/o Noise Rescheduling
w/ Noise Rescheduling

FreeNoise+LaVie [Code]

FreeNoise is also applicable to another video LDM framework, LaVie.

BibTex

@misc{qiu2023freenoise,
    title={FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling},
    author={Haonan Qiu and Menghan Xia and Yong Zhang and Yingqing He and Xintao Wang and Ying Shan and Ziwei Liu},
    year={2023},
    eprint={2310.15169},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}