Details
Field | Value |
---|---|
Original language | English |
Title of host publication | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops |
Subtitle of host publication | CVPRW 2023 |
Publisher | IEEE Computer Society |
Pages | 2555-2565 |
Number of pages | 11 |
ISBN (electronic) | 979-8-3503-0249-3 |
ISBN (print) | 979-8-3503-0250-9 |
Publication status | Published - 2023 |
Event | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 - Vancouver, Canada |
Duration | 17 Jun 2023 → 24 Jun 2023 |
Publication series
Field | Value |
---|---|
Name | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops |
Volume | 2023-June |
ISSN (print) | 2160-7508 |
ISSN (electronic) | 2160-7516 |
Abstract
As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.
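The abstract's key idea is that video scene graphs are temporally discrete: only some frames carry annotations, so the VSG encoder must predict graph representations for the unlabeled frames in between. The toy sketch below illustrates that gap-filling step only. It is not the authors' implementation; the hash-based embedding and the linear interpolation are illustrative stand-ins for the learned, contrastively pre-trained encoder, and all function names are hypothetical.

```python
# Toy illustration of the VSG encoder's role: encode scene graphs for
# annotated frames, then predict representations for unlabeled frames.
# "Encoding" here is a deterministic toy embedding; "prediction" is
# linear interpolation between the nearest annotated frames.

def encode_graph(triplets, dim=4):
    """Map a set of (subject, predicate, object) triplets to a toy vector."""
    vec = [0.0] * dim
    for s, p, o in triplets:
        h = hash((s, p, o)) % 1000
        for i in range(dim):
            vec[i] += ((h >> i) % 10) / 10.0
    return vec

def interpolate(a, b, t):
    """Linear blend of two equal-length vectors, t in [0, 1]."""
    return [x + (y - x) * t for x, y in zip(a, b)]

def frame_representations(annotations, num_frames):
    """annotations: {frame_index: triplets}. Returns one vector per frame,
    interpolating between annotated frames and clamping at the ends."""
    keys = sorted(annotations)
    encoded = {k: encode_graph(annotations[k]) for k in keys}
    reps = []
    for f in range(num_frames):
        if f <= keys[0]:
            reps.append(encoded[keys[0]])
        elif f >= keys[-1]:
            reps.append(encoded[keys[-1]])
        else:
            lo = max(k for k in keys if k <= f)
            hi = min(k for k in keys if k >= f)
            t = 0.0 if hi == lo else (f - lo) / (hi - lo)
            reps.append(interpolate(encoded[lo], encoded[hi], t))
    return reps
```

For example, with annotations only at frames 0 and 4, the representation for frame 2 lands midway between the two encoded graphs. In the actual framework these per-frame representations come from a learned encoder and condition an auto-regressive Transformer that generates the video in the discrete latent space of a VQ-VAE.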
ASJC Scopus subject areas
- Computer Science (all)
- Computer Vision and Pattern Recognition
- Engineering (all)
- Electrical and Electronic Engineering
Cite this
Cong, Y., Yi, J., Rosenhahn, B., & Yang, M. Y. (2023). SSGVS. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops: CVPRW 2023 (pp. 2555-2565). IEEE Computer Society. (IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; Vol. 2023-June).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - SSGVS
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
AU - Cong, Yuren
AU - Yi, Jinhui
AU - Rosenhahn, Bodo
AU - Yang, Michael Ying
N1 - Funding Information: Acknowledgements This work was supported by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor (grant no. 01DD20003) and the AI service center KISSKI (grant no. 01IS22093C), ZDIN and DFG under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).
PY - 2023
Y1 - 2023
AB - As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.
UR - http://www.scopus.com/inward/record.url?scp=85168723164&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2211.06119
DO - 10.48550/arXiv.2211.06119
M3 - Conference contribution
AN - SCOPUS:85168723164
SN - 979-8-3503-0250-9
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 2555
EP - 2565
BT - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
PB - IEEE Computer Society
Y2 - 17 June 2023 through 24 June 2023
ER -