Details
Original language | English |
---|---|
Qualification | Doctor of Engineering |
Awarding Institution | |
Supervised by | |
Date of Award | 31 May 2024 |
Place of Publication | Hannover |
Publication status | Published - 21 Jun 2024 |
Abstract

A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. Scene graphs are viewed as a promising route to holistic scene understanding and as a tool for bridging the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills that gap with contributions to both image-based and video-based scene graphs.

For image-based scene graphs, a high-performing two-stage scene graph generation method is proposed first; it performs generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed that infers visual relationships without requiring object proposals. In the course of this work, we find that existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images and can also serve as a scene graph encoder for downstream applications.

For video-based scene graphs, a Transformer-based dynamic scene graph generation method is proposed to capture spatial context and temporal dependencies; it has become a popular baseline for this task. Moreover, to broaden video scene graph applications, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs, with the video and graph representations modeled by a GPT-like Transformer under an auto-regressive prior. These methods demonstrated state-of-the-art performance at the time of publication, marking a substantial advance in scene graph research and holistic scene understanding.
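To make the representation concrete, the sketch below shows a scene graph as entities (nodes) connected by directed, labeled relationships (edges), as described in the abstract. It is a minimal, hypothetical Python illustration, not code from the thesis; all class and variable names are assumptions.

```python
# Minimal sketch of a scene graph: nodes are entities, directed labeled
# edges are relationships. Hypothetical illustration, not thesis code.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entity:
    """A node: one entity observed in the scene."""
    name: str


@dataclass(frozen=True)
class Relationship:
    """A directed edge: subject --predicate--> object."""
    subject: Entity
    predicate: str
    obj: Entity


@dataclass
class SceneGraph:
    entities: List[Entity] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def relate(self, subject: Entity, predicate: str, obj: Entity) -> None:
        # Register both endpoints as nodes, then add the labeled edge.
        for e in (subject, obj):
            if e not in self.entities:
                self.entities.append(e)
        self.relationships.append(Relationship(subject, predicate, obj))


# Example scene: "a person rides a horse on a beach".
g = SceneGraph()
person, horse, beach = Entity("person"), Entity("horse"), Entity("beach")
g.relate(person, "riding", horse)
g.relate(horse, "on", beach)
for r in g.relationships:
    print(f"{r.subject.name} --{r.predicate}--> {r.obj.name}")
```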
Cite this
Cong, Y. (2024). Holistic scene understanding through image and video scene graphs. Doctoral thesis. Hannover. 135 p. https://doi.org/10.15488/17548
Research output: Thesis › Doctoral thesis
TY - BOOK
T1 - Holistic scene understanding through image and video scene graphs
AU - Cong, Yuren
PY - 2024/6/21
Y1 - 2024/6/21
U2 - 10.15488/17548
DO - 10.15488/17548
M3 - Doctoral thesis
CY - Hannover
ER -