FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Yuren Cong; Mengmeng Xu; Christian Simon; Shoufa Chen; Jiawei Ren; Yanping Xie; Juan Manuel Perez-Rua; Bodo Rosenhahn; Tao Xiang; Sen He

doi:10.48550/arXiv.2310.05922

Details

Original language	English
Publication status	Published - 7 May 2024
Event	12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria Duration: 7 May 2024 → 11 May 2024

Conference

Conference	12th International Conference on Learning Representations, ICLR 2024
Country/Territory	Austria
City	Hybrid, Vienna
Period	7 May 2024 → 11 May 2024

Abstract

Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is ensuring that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating the 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may introduce irrelevant information for each patch and thereby cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module of the diffusion model's U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces that patches on the same flow path across different frames attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos.
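The core idea in the abstract, that patches lying on the same optical-flow path attend only to each other, can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the function names `trace_paths` and `flow_guided_attention` are hypothetical, paths are started only from the first frame, flow vectors are rounded to the nearest pixel, and the features are used directly as queries, keys, and values instead of the learned projections of a diffusion U-Net.

```python
import numpy as np


def trace_paths(flow):
    """Follow forward flow from every pixel of frame 0.

    flow: (T-1, H, W, 2) array holding (dx, dy) from frame t to frame t+1.
    Returns integer (y, x) coordinates of shape (H*W, T, 2).
    Assumption: nearest-pixel rounding, coordinates clipped to the image.
    """
    steps, H, W, _ = flow.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y, x = ys.ravel(), xs.ravel()
    paths = [np.stack([y, x], axis=-1)]
    for t in range(steps):
        dx = flow[t, y, x, 0]
        dy = flow[t, y, x, 1]
        x = np.clip(np.rint(x + dx).astype(int), 0, W - 1)
        y = np.clip(np.rint(y + dy).astype(int), 0, H - 1)
        paths.append(np.stack([y, x], axis=-1))
    return np.stack(paths, axis=1)


def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def flow_guided_attention(feats, flow):
    """Self-attention restricted to patches on the same flow path.

    feats: (T, H, W, C) per-frame patch features.
    Returns an array of the same shape. Patches not reached by any
    path keep their input features (a simplification of this sketch).
    """
    T, H, W, C = feats.shape
    paths = trace_paths(flow)                              # (N, T, 2)
    t_idx = np.arange(T)[None, :]                          # (1, T)
    traj = feats[t_idx, paths[..., 0], paths[..., 1]]      # (N, T, C)
    # Attention over the T positions of each trajectory only,
    # with Q = K = V = traj (no learned projections here).
    scores = traj @ traj.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    attended = softmax(scores) @ traj                      # (N, T, C)
    out = feats.copy()
    # If several paths land on the same patch, the last write wins.
    out[t_idx, paths[..., 0], paths[..., 1]] = attended
    return out
```

With a flow field of all zeros, each path stays at one pixel location, so the sketch reduces to per-pixel attention along the time axis. In contrast to dense spatio-temporal attention, each patch then mixes information from at most T positions rather than from all T·H·W patches, which is the restriction the abstract credits for the improved consistency.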

ASJC Scopus subject areas

Arts and Humanities (all): Language and Linguistics
Computer Science (all): Computer Science Applications
Social Sciences (all): Education; Linguistics and Language

Cite this

FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing. / Cong, Yuren; Xu, Mengmeng; Simon, Christian et al.
2024. Poster session presented at 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria.

Research output: Contribution to conference › Poster › Research › peer review

Cong, Y, Xu, M, Simon, C, Chen, S, Ren, J, Xie, Y, Perez-Rua, JM, Rosenhahn, B, Xiang, T & He, S 2024, 'FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing', 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria, 7 May 2024 - 11 May 2024. https://doi.org/10.48550/arXiv.2310.05922

Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J. M., Rosenhahn, B., Xiang, T., & He, S. (2024). FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing. Poster session presented at 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria. https://doi.org/10.48550/arXiv.2310.05922

Cong Y, Xu M, Simon C, Chen S, Ren J, Xie Y et al. FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing. 2024. Poster session presented at 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria. Epub 2023 Oct 9. doi: 10.48550/arXiv.2310.05922

Cong, Yuren ; Xu, Mengmeng ; Simon, Christian et al. / FLATTEN : optical FLow-guided ATTENtion for consistent text-to-video editing. Poster session presented at 12th International Conference on Learning Representations, ICLR 2024, Hybrid, Vienna, Austria.


@conference{2e20e067e830438b85df048980b70833,

title = "FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing",

abstract = "Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.",

author = "Yuren Cong and Mengmeng Xu and Christian Simon and Shoufa Chen and Jiawei Ren and Yanping Xie and Perez-Rua, {Juan Manuel} and Bodo Rosenhahn and Tao Xiang and Sen He",

note = "Publisher Copyright: {\textcopyright} 2024 12th International Conference on Learning Representations, ICLR 2024. All rights reserved.; 12th International Conference on Learning Representations, ICLR 2024 ; Conference date: 07-05-2024 Through 11-05-2024",

year = "2024",

month = may,

day = "7",

doi = "10.48550/arXiv.2310.05922",

language = "English",

}


TY - CONF

T1 - FLATTEN

T2 - 12th International Conference on Learning Representations, ICLR 2024

AU - Cong, Yuren

AU - Xu, Mengmeng

AU - Simon, Christian

AU - Chen, Shoufa

AU - Ren, Jiawei

AU - Xie, Yanping

AU - Perez-Rua, Juan Manuel

AU - Rosenhahn, Bodo

AU - Xiang, Tao

AU - He, Sen

PY - 2024/5/7

Y1 - 2024/5/7

N2 - Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.

AB - Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.

UR - http://www.scopus.com/inward/record.url?scp=85195309009&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2310.05922

DO - 10.48550/arXiv.2310.05922

M3 - Poster

AN - SCOPUS:85195309009

Y2 - 7 May 2024 through 11 May 2024

ER -


By the same author(s)

Robust Shape Fitting for 3D Scene Abstraction

Quantum normalizing flows for anomaly detection

A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data

Segment Any Object Model (SAOM): Real-To-Simulation Fine-Tuning Strategy For Multi-Class Multi-Instance Segmentation

Indoor Scene Change Understanding (SCU): Segment, Describe, and Revert Any Change
