
Holistic scene understanding through image and video scene graphs

Research output: Thesis › Doctoral thesis

Authors

  • Yuren Cong

Research Organisations

Details

Original language: English
Qualification: Doctor of Engineering
Awarding Institution
Supervised by
Date of Award: 31 May 2024
Place of Publication: Hannover
Publication status: Published - 21 Jun 2024

Abstract

A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.
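The scene graph structure described in the abstract can be sketched as a small data structure in which entities are nodes and relationships are <subject, predicate, object> triples. This is an illustrative sketch only, not the dissertation's actual implementation; all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Node labels, e.g. "person", "horse"; index into this list identifies an entity.
    entities: list[str] = field(default_factory=list)
    # Edges as (subject_index, predicate, object_index) triples.
    relations: list[tuple[int, str, int]] = field(default_factory=list)

    def add_entity(self, label: str) -> int:
        """Add an entity node and return its index."""
        self.entities.append(label)
        return len(self.entities) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed relationship edge between two entity nodes."""
        self.relations.append((subj, predicate, obj))

    def triples(self) -> list[tuple[str, str, str]]:
        """Resolve edges to human-readable (subject, predicate, object) triples."""
        return [(self.entities[s], p, self.entities[o]) for s, p, o in self.relations]

# Example scene: "person riding horse", "horse on grass"
g = SceneGraph()
person = g.add_entity("person")
horse = g.add_entity("horse")
grass = g.add_entity("grass")
g.add_relation(person, "riding", horse)
g.add_relation(horse, "on", grass)
print(g.triples())  # [('person', 'riding', 'horse'), ('horse', 'on', 'grass')]
```

Scene graph generation models predict exactly such triples from pixels; the evaluation question the dissertation raises is how well a predicted set of triples semantically matches the image as a whole.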

Cite this

Holistic scene understanding through image and video scene graphs. / Cong, Yuren.
Hannover, 2024. 135 p.


Cong, Y 2024, 'Holistic scene understanding through image and video scene graphs', Doctor of Engineering, Leibniz University Hannover, Hannover. https://doi.org/10.15488/17548
Cong, Y. (2024). Holistic scene understanding through image and video scene graphs. [Doctoral thesis, Leibniz University Hannover]. https://doi.org/10.15488/17548
Cong Y. Holistic scene understanding through image and video scene graphs. Hannover, 2024. 135 p. doi: 10.15488/17548
@phdthesis{be84bbfe3f2b42e9ac32f955d3bcd5cc,
title = "Holistic scene understanding through image and video scene graphs",
abstract = "A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.",
author = "Yuren Cong",
year = "2024",
month = jun,
day = "21",
doi = "10.15488/17548",
language = "English",
school = "Leibniz University Hannover",

}


TY - BOOK

T1 - Holistic scene understanding through image and video scene graphs

AU - Cong, Yuren

PY - 2024/6/21

Y1 - 2024/6/21

N2 - A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.

AB - A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.

U2 - 10.15488/17548

DO - 10.15488/17548

M3 - Doctoral thesis

CY - Hannover

ER -
