Details
Original language | English |
---|---|
Publication status | Published - 9 Oct 2023 |
Event | 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria Duration: 7 May 2024 → 11 May 2024 |
Conference
Conference | 12th International Conference on Learning Representations, ICLR 2024 |
---|---|
Country/Territory | Austria |
City | Hybrid, Vienna |
Period | 7 May 2024 → 11 May 2024 |
Abstract
Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating the 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may introduce irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module of the diffusion model's U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces that patches on the same flow path across different frames attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos. The project page is available at https://flatten-video-editing.github.io/.
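The core idea in the abstract, restricting attention so that patches on the same flow path attend to each other across frames, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `flow_path_attention`, the `(T, N, D)` feature layout, the precomputed `paths` index array, and the use of identity query/key/value projections are all assumptions made for illustration. In a real pipeline the paths would presumably be derived from an optical-flow estimator, and the projections would come from the pretrained U-Net.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flow_path_attention(features, paths):
    """Toy flow-guided attention (hypothetical helper, not the paper's code).

    features: float array of shape (T, N, D), i.e. T frames, N patches
              per frame, D channels per patch.
    paths:    int array of shape (P, T); paths[p, t] is the index of the
              patch that path p passes through in frame t. Paths are
              assumed to be disjoint.

    Each patch on a path is replaced by an attention-weighted mix of the
    patches on the same path; patches on no path are left unchanged.
    """
    num_frames, _, dim = features.shape
    out = features.copy()
    frame_idx = np.arange(num_frames)
    for path in paths:
        # Gather the one patch per frame that lies on this path: (T, D).
        tokens = features[frame_idx, path]
        # Assumption: identity projections, so Q = K = V = tokens.
        scores = tokens @ tokens.T / np.sqrt(dim)   # (T, T)
        weights = softmax(scores, axis=-1)          # each row sums to 1
        out[frame_idx, path] = weights @ tokens
    return out


# Usage: 3 frames, 4 patches per frame, 8 channels, one flow path that
# visits patch 0 in frame 0, patch 1 in frame 1, and patch 2 in frame 2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 8))
paths = np.array([[0, 1, 2]])
mixed = flow_path_attention(feats, paths)
```

Because each output token is a convex combination of tokens from a single path, information is shared only along that path, which is the property the abstract credits for improved visual consistency compared with dense spatio-temporal attention.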
ASJC Scopus subject areas
- Arts and Humanities(all)
- Language and Linguistics
- Computer Science(all)
- Computer Science Applications
- Social Sciences(all)
- Education
- Social Sciences(all)
- Linguistics and Language
Cite this
2023. Paper presented at 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria.
Research output: Contribution to conference › Paper › Research › peer review
TY - CONF
T1 - FLATTEN
T2 - 12th International Conference on Learning Representations, ICLR 2024
AU - Cong, Yuren
AU - Xu, Mengmeng
AU - Simon, Christian
AU - Chen, Shoufa
AU - Ren, Jiawei
AU - Xie, Yanping
AU - Perez-Rua, Juan Manuel
AU - Rosenhahn, Bodo
AU - Xiang, Tao
AU - He, Sen
PY - 2023/10/9
Y1 - 2023/10/9
UR - http://www.scopus.com/inward/record.url?scp=85195309009&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2310.05922
DO - 10.48550/arXiv.2310.05922
M3 - Paper
AN - SCOPUS:85195309009
Y2 - 7 May 2024 through 11 May 2024
ER -