Text to Image Generation with Semantic-Spatial Aware GAN

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Wentong Liao
  • Kai Hu
  • Michael Ying Yang
  • Bodo Rosenhahn

External Research Organisations

  • University of Twente

Details

Original language: English
Title of host publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 18166-18175
Number of pages: 10
ISBN (electronic): 978-1-6654-6946-3
ISBN (print): 978-1-6654-6947-0
Publication status: Published - 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Abstract

A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the given text descriptions. Built upon recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) conditional batch normalization is applied uniformly to the entire image feature map, ignoring local semantics; (2) the text encoder is fixed during training, although it should be trained jointly with the image generator to learn better text representations for image generation. To address these limitations, we propose a novel framework, the Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns a semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way, depending on the current text-image fusion process, to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over recent state-of-the-art approaches in both visual fidelity and alignment with the input text description. Code is available at https://github.com/wtliao/text2image.
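
The fusion idea described in the abstract can be illustrated with a short sketch: text-conditioned scale and shift parameters modulate the normalized image features, and a spatial mask predicted from the current image features decides where that modulation is applied. The following is a minimal PyTorch illustration under assumed names and dimensions (SSAFusion, text_dim, and the layer choices are illustrative); it is not the authors' implementation, which is available at https://github.com/wtliao/text2image.

# Minimal sketch of a semantic-spatial aware fusion block. Per-channel affine
# parameters are predicted from the sentence embedding, and a spatial mask
# predicted from the current image features decides WHERE the text-conditioned
# modulation is applied. All names and dimensions are illustrative assumptions,
# not the official code.
import torch
import torch.nn as nn


class SSAFusion(nn.Module):
    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        # Parameter-free normalization of the image features.
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        # Text-conditioned scale and shift (semantic-adaptive transformation).
        self.gamma = nn.Linear(text_dim, num_channels)
        self.beta = nn.Linear(text_dim, num_channels)
        # Spatial mask predicted from the image features (learned without
        # explicit mask supervision, i.e. weakly supervised).
        self.mask = nn.Sequential(
            nn.Conv2d(num_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) image feature map; sent_emb: (B, text_dim).
        normalized = self.norm(img_feat)
        gamma = self.gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(sent_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        modulated = normalized * (1 + gamma) + beta
        m = self.mask(img_feat)                                   # (B, 1, H, W)
        # Apply the text-conditioned transformation only where the mask is active.
        return m * modulated + (1 - m) * img_feat


if __name__ == "__main__":
    block = SSAFusion(num_channels=64, text_dim=256)
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 64, 32, 32])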

Keywords

    cs.CV, cs.LG, Image and video synthesis and generation, Vision + language

Cite this

Text to Image Generation with Semantic-Spatial Aware GAN. / Liao, Wentong; Hu, Kai; Yang, Michael Ying et al.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. p. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Liao, W, Hu, K, Yang, MY & Rosenhahn, B 2022, Text to Image Generation with Semantic-Spatial Aware GAN. in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, Institute of Electrical and Electronics Engineers Inc., pp. 18166-18175. https://doi.org/10.1109/CVPR52688.2022.01765
Liao, W., Hu, K., Yang, M. Y., & Rosenhahn, B. (2022). Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 18166-18175). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR52688.2022.01765
Liao W, Hu K, Yang MY, Rosenhahn B. Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc. 2022. p. 18166-18175. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52688.2022.01765
Liao, Wentong ; Hu, Kai ; Yang, Michael Ying et al. / Text to Image Generation with Semantic-Spatial Aware GAN. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).
BibTeX
@inproceedings{1a43e4964fa449d19bcbed27698416e1,
title = "Text to Image Generation with Semantic-Spatial Aware GAN",
abstract = "A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image. ",
keywords = "cs.CV, cs.LG, Image and video synthesis and generation, Vision + language",
author = "Wentong Liao and Kai Hu and Yang, {Michael Ying} and Bodo Rosenhahn",
note = "Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Ger- many, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innova- tions (ZDIN) and the Deutsche Forschungsgemein- schaft (DFG) under Germany{\textquoteright}s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).",
year = "2022",
doi = "10.1109/CVPR52688.2022.01765",
language = "English",
isbn = "978-1-6654-6947-0",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "18166--18175",
booktitle = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
address = "United States",

}

RIS

TY - GEN

T1 - Text to Image Generation with Semantic-Spatial Aware GAN

AU - Liao, Wentong

AU - Hu, Kai

AU - Yang, Michael Ying

AU - Rosenhahn, Bodo

N1 - Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

PY - 2022

Y1 - 2022

N2 - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

AB - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.

KW - cs.CV

KW - cs.LG

KW - Image and video synthesis and generation

KW - Vision + language

UR - http://www.scopus.com/inward/record.url?scp=85139192930&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01765

DO - 10.1109/CVPR52688.2022.01765

M3 - Conference contribution

SN - 978-1-6654-6947-0

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 18166

EP - 18175

BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

PB - Institute of Electrical and Electronics Engineers Inc.

ER -
