Text to Image Generation with Semantic-Spatial Aware GAN

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Wentong Liao
  • Kai Hu
  • Michael Ying Yang
  • Bodo Rosenhahn

External Research Organisations

  • University of Twente

Details

Original language: English
Title of host publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 18166-18175
Number of pages: 10
ISBN (electronic): 978-1-6654-6946-3
ISBN (print): 978-1-6654-6947-0
Publication status: Published - 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Abstract

A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the given text descriptions. Built upon recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) conditional batch normalization is applied uniformly to the entire image feature map, ignoring local semantics; (2) the text encoder is fixed during training, although it should be trained jointly with the image generator to learn better text representations for image generation. To address these limitations, we propose a novel framework, the Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns a semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way, depending on the current text-image fusion process, to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over recent state-of-the-art approaches in both visual fidelity and alignment with the input text description. Code is available at https://github.com/wtliao/text2image.
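
The fusion idea described in the abstract can be illustrated with a short sketch: text-conditioned scale and shift parameters modulate the normalized image features, and a spatial mask predicted from the current image features decides where that modulation is applied. The following is a minimal PyTorch illustration under assumed names and dimensions (SSAFusion, text_dim, and the layer choices are illustrative); it is not the authors' implementation, which is available at https://github.com/wtliao/text2image.

# Minimal sketch of a semantic-spatial aware fusion block. Per-channel affine
# parameters are predicted from the sentence embedding, and a spatial mask
# predicted from the current image features decides WHERE the text-conditioned
# modulation is applied. All names and dimensions are illustrative assumptions,
# not the official code.
import torch
import torch.nn as nn


class SSAFusion(nn.Module):
    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Parameter-free normalization of the image features.
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        # Text-conditioned scale and shift (semantic-adaptive transformation).
        self.gamma = nn.Linear(text_dim, num_channels)
        self.beta = nn.Linear(text_dim, num_channels)
        # Spatial mask predicted from the image features (learned without
        # explicit mask supervision, i.e. weakly supervised).
        self.mask = nn.Sequential(
            nn.Conv2d(num_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) image feature map; sent_emb: (B, text_dim).
        normalized = self.norm(img_feat)
        gamma = self.gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(sent_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        modulated = normalized * (1 + gamma) + beta
        m = self.mask(img_feat)                                   # (B, 1, H, W)
        # Apply the text-conditioned transformation only where the mask is active.
        return m * modulated + (1 - m) * img_feat


if __name__ == "__main__":
    block = SSAFusion(num_channels=64, text_dim=256)
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 64, 32, 32])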

Keywords

    cs.CV, cs.LG, Image and video synthesis and generation, Vision + language

Cite this

Text to Image Generation with Semantic-Spatial Aware GAN. / Liao, Wentong; Hu, Kai; Yang, Michael Ying et al.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. p. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Liao, W, Hu, K, Yang, MY & Rosenhahn, B 2022, Text to Image Generation with Semantic-Spatial Aware GAN. in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, Institute of Electrical and Electronics Engineers Inc., pp. 18166-18175. https://doi.org/10.1109/CVPR52688.2022.01765
Liao, W., Hu, K., Yang, M. Y., & Rosenhahn, B. (2022). Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 18166-18175). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR52688.2022.01765
Liao W, Hu K, Yang MY, Rosenhahn B. Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc. 2022. p. 18166-18175. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52688.2022.01765
Liao, Wentong ; Hu, Kai ; Yang, Michael Ying et al. / Text to Image Generation with Semantic-Spatial Aware GAN. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).
BibTeX
@inproceedings{1a43e4964fa449d19bcbed27698416e1,
title = "Text to Image Generation with Semantic-Spatial Aware GAN",
abstract = "A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image. ",
keywords = "cs.CV, cs.LG, Image and video synthesis and generation, Vision + language",
author = "Wentong Liao and Kai Hu and Yang, {Michael Ying} and Bodo Rosenhahn",
note = "Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Ger- many, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innova- tions (ZDIN) and the Deutsche Forschungsgemein- schaft (DFG) under Germany{\textquoteright}s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).",
year = "2022",
doi = "10.1109/CVPR52688.2022.01765",
language = "English",
isbn = "978-1-6654-6947-0",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "18166--18175",
booktitle = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
address = "United States",

}

RIS

TY - GEN

T1 - Text to Image Generation with Semantic-Spatial Aware GAN

AU - Liao, Wentong

AU - Hu, Kai

AU - Yang, Michael Ying

AU - Rosenhahn, Bodo

N1 - Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

PY - 2022

Y1 - 2022

N2 - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

AB - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

KW - cs.CV

KW - cs.LG

KW - Image and video synthesis and generation

KW - Vision + language

UR - http://www.scopus.com/inward/record.url?scp=85139192930&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01765

DO - 10.1109/CVPR52688.2022.01765

M3 - Conference contribution

SN - 978-1-6654-6947-0

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 18166

EP - 18175

BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

PB - Institute of Electrical and Electronics Engineers Inc.

ER -
