
Holistic scene understanding through image and video scene graphs

Research output: Thesis › Doctoral thesis

Authors

  • Yuren Cong

Research Organisations

Details

Original language: English
Qualification: Doctor of Engineering
Awarding Institution
Supervised by
Date of Award: 31 May 2024
Place of Publication: Hannover
Publication status: Published - 21 Jun 2024

Abstract

A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.
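The scene graph structure described in the abstract can be sketched as a small data structure in which entities are nodes and relationships are <subject, predicate, object> triples. This is an illustrative sketch only, not the dissertation's actual implementation; all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Node labels, e.g. "person", "horse"; index into this list identifies an entity.
    entities: list[str] = field(default_factory=list)
    # Edges as (subject_index, predicate, object_index) triples.
    relations: list[tuple[int, str, int]] = field(default_factory=list)

    def add_entity(self, label: str) -> int:
        """Add an entity node and return its index."""
        self.entities.append(label)
        return len(self.entities) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed relationship edge between two entity nodes."""
        self.relations.append((subj, predicate, obj))

    def triples(self) -> list[tuple[str, str, str]]:
        """Resolve edges to human-readable (subject, predicate, object) triples."""
        return [(self.entities[s], p, self.entities[o]) for s, p, o in self.relations]

# Example scene: "person riding horse", "horse on grass"
g = SceneGraph()
person = g.add_entity("person")
horse = g.add_entity("horse")
grass = g.add_entity("grass")
g.add_relation(person, "riding", horse)
g.add_relation(horse, "on", grass)
print(g.triples())  # [('person', 'riding', 'horse'), ('horse', 'on', 'grass')]
```

Scene graph generation models predict exactly such triples from pixels; the evaluation question the dissertation raises is how well a predicted set of triples semantically matches the image as a whole.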

Cite this

Holistic scene understanding through image and video scene graphs. / Cong, Yuren.
Hannover, 2024. 135 p.


Cong, Y 2024, 'Holistic scene understanding through image and video scene graphs', Doctor of Engineering, Leibniz University Hannover, Hannover. https://doi.org/10.15488/17548
Cong, Y. (2024). Holistic scene understanding through image and video scene graphs. [Doctoral thesis, Leibniz University Hannover]. https://doi.org/10.15488/17548
Cong Y. Holistic scene understanding through image and video scene graphs. Hannover, 2024. 135 p. doi: 10.15488/17548
@phdthesis{be84bbfe3f2b42e9ac32f955d3bcd5cc,
title = "Holistic scene understanding through image and video scene graphs",
abstract = "A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.",
author = "Yuren Cong",
year = "2024",
month = jun,
day = "21",
doi = "10.15488/17548",
language = "English",
school = "Leibniz University Hannover",

}


TY - BOOK

T1 - Holistic scene understanding through image and video scene graphs

AU - Cong, Yuren

PY - 2024/6/21

Y1 - 2024/6/21

N2 - A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.

AB - A scene graph is a graph structure in which nodes represent the entities in a scene and edges indicate the relationships between those entities. It is viewed as a promising approach to achieving holistic scene understanding, as well as a tool to bridge the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills this gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performance two-stage scene graph generation method is first proposed. The approach performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed to infer visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a dynamic scene graph generation method based on Transformers is proposed to capture both spatial context and temporal dependencies; it has become a popular baseline model for this task. Moreover, to extend the applications of video scene graphs, a semantic scene graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer with an auto-regressive prior. These methods achieved state-of-the-art performance at the time of publication, marking a substantial advancement in scene graph research and holistic scene understanding.

U2 - 10.15488/17548

DO - 10.15488/17548

M3 - Doctoral thesis

CY - Hannover

ER -
