Details
Field | Value |
---|---|
Original language | English |
Title of host publication | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops |
Subtitle of host publication | CVPRW 2023 |
Publisher | IEEE Computer Society |
Pages | 2555-2565 |
Number of pages | 11 |
ISBN (electronic) | 979-8-3503-0249-3 |
ISBN (print) | 979-8-3503-0250-9 |
Publication status | Published - 2023 |
Event | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 - Vancouver, Canada |
Duration | 17 Jun 2023 → 24 Jun 2023 |
Publication series
Field | Value |
---|---|
Name | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops |
Volume | 2023-June |
ISSN (print) | 2160-7508 |
ISSN (electronic) | 2160-7516 |
Abstract
As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.
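The abstract's key idea is that video scene graphs are temporally discrete: only some frames carry annotations, so the VSG encoder must predict graph representations for the unlabeled frames in between. The toy sketch below illustrates that gap-filling step only. It is not the authors' implementation; the hash-based embedding and the linear interpolation are illustrative stand-ins for the learned, contrastively pre-trained encoder, and all function names are hypothetical.

```python
# Toy illustration of the VSG encoder's role: encode scene graphs for
# annotated frames, then predict representations for unlabeled frames.
# "Encoding" here is a deterministic toy embedding; "prediction" is
# linear interpolation between the nearest annotated frames.

def encode_graph(triplets, dim=4):
    """Map a set of (subject, predicate, object) triplets to a toy vector."""
    vec = [0.0] * dim
    for s, p, o in triplets:
        h = hash((s, p, o)) % 1000
        for i in range(dim):
            vec[i] += ((h >> i) % 10) / 10.0
    return vec

def interpolate(a, b, t):
    """Linear blend of two equal-length vectors, t in [0, 1]."""
    return [x + (y - x) * t for x, y in zip(a, b)]

def frame_representations(annotations, num_frames):
    """annotations: {frame_index: triplets}. Returns one vector per frame,
    interpolating between annotated frames and clamping at the ends."""
    keys = sorted(annotations)
    encoded = {k: encode_graph(annotations[k]) for k in keys}
    reps = []
    for f in range(num_frames):
        if f <= keys[0]:
            reps.append(encoded[keys[0]])
        elif f >= keys[-1]:
            reps.append(encoded[keys[-1]])
        else:
            lo = max(k for k in keys if k <= f)
            hi = min(k for k in keys if k >= f)
            t = 0.0 if hi == lo else (f - lo) / (hi - lo)
            reps.append(interpolate(encoded[lo], encoded[hi], t))
    return reps
```

For example, with annotations only at frames 0 and 4, the representation for frame 2 lands midway between the two encoded graphs. In the actual framework these per-frame representations come from a learned encoder and condition an auto-regressive Transformer that generates the video in the discrete latent space of a VQ-VAE.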
ASJC Scopus subject areas
- Computer Science (all)
- Computer Vision and Pattern Recognition
- Engineering (all)
- Electrical and Electronic Engineering
Cite this
Cong, Y., Yi, J., Rosenhahn, B., & Yang, M. Y. (2023). SSGVS. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops: CVPRW 2023 (pp. 2555-2565). IEEE Computer Society. (IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; Vol. 2023-June).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - SSGVS
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
AU - Cong, Yuren
AU - Yi, Jinhui
AU - Rosenhahn, Bodo
AU - Yang, Michael Ying
N1 - Funding Information: Acknowledgements This work was supported by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor (grant no. 01DD20003) and the AI service center KISSKI (grant no. 01IS22093C), ZDIN and DFG under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).
PY - 2023
Y1 - 2023
AB - As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.
UR - http://www.scopus.com/inward/record.url?scp=85168723164&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2211.06119
DO - 10.48550/arXiv.2211.06119
M3 - Conference contribution
AN - SCOPUS:85168723164
SN - 979-8-3503-0250-9
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 2555
EP - 2565
BT - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
PB - IEEE Computer Society
Y2 - 17 June 2023 through 24 June 2023
ER -