Visual speech synthesis from 3D mesh sequences driven by combined speech features

Felix Kuhnke; Jörn Ostermann

doi:10.1109/icme.2017.8019546

Details

Original language	English
Title of host publication	2017 IEEE International Conference on Multimedia and Expo
Subtitle of host publication	ICME 2017
Publisher	IEEE Computer Society
Pages	1075-1080
Number of pages	6
ISBN (electronic)	9781509060672
Publication status	Published - 28 Aug 2017
Event	2017 IEEE International Conference on Multimedia and Expo, ICME 2017 - Hong Kong, Hong Kong
Duration	10 Jul 2017 → 14 Jul 2017

Publication series

Name	Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print)	1945-7871
ISSN (electronic)	1945-788X

Abstract

Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.
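
As a rough illustration of the sliding-window idea described in the abstract, the following Python sketch builds one feature vector per audio frame by concatenating a one-hot phoneme label with that frame's acoustic values over a window of neighbouring frames. It is not the authors' implementation: the function names, the one-hot encoding, the edge clamping, the window sizes, and the toy data are all assumptions made for illustration, and the abstract does not say how the two feature types are actually combined. In a full pipeline, vectors like these would be the inputs to a regressor (random forests in the paper) that predicts face model parameters.

```python
from typing import List, Sequence


def one_hot(label: str, inventory: Sequence[str]) -> List[float]:
    """Encode a phoneme label as a one-hot vector over a fixed inventory."""
    return [1.0 if label == p else 0.0 for p in inventory]


def combined_window_features(
    phonemes: Sequence[str],
    acoustic: Sequence[Sequence[float]],
    inventory: Sequence[str],
    past: int = 2,
    future: int = 2,
) -> List[List[float]]:
    """Return one combined feature vector per frame.

    Each vector concatenates, for every frame in the window
    [t - past, t + future], the one-hot phoneme label and the
    acoustic values of that frame (hypothetical layout).
    """
    if len(phonemes) != len(acoustic):
        raise ValueError("phonemes and acoustic need one entry per frame")
    n = len(phonemes)
    features: List[List[float]] = []
    for t in range(n):
        vec: List[float] = []
        for k in range(t - past, t + future + 1):
            # Assumption: repeat the boundary frame at sequence edges.
            j = min(max(k, 0), n - 1)
            vec.extend(one_hot(phonemes[j], inventory))
            vec.extend(acoustic[j])
        features.append(vec)
    return features


# Toy data (made up): four frames, one acoustic value per frame.
inventory = ["sil", "a", "m"]
phonemes = ["sil", "m", "a", "a"]
acoustic = [[0.0], [0.5], [1.0], [0.8]]
feats = combined_window_features(phonemes, acoustic, inventory, past=1, future=1)
```

In this sketch the `future` argument is the number of look-ahead frames, so a small value keeps the delay before a frame can be synthesized short, at the cost of less context.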

Keywords

Facial Animation, Lip Synchronization, Speech Features, Visual Speech Synthesis

ASJC Scopus subject areas

Computer Science(all): Computer Networks and Communications
Computer Science(all): Computer Science Applications

Cite this

Visual speech synthesis from 3D mesh sequences driven by combined speech features. / Kuhnke, Felix; Ostermann, Jörn.
2017 IEEE International Conference on Multimedia and Expo: ICME 2017. IEEE Computer Society, 2017. p. 1075-1080 8019546 (Proceedings - IEEE International Conference on Multimedia and Expo).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Kuhnke, F & Ostermann, J 2017, Visual speech synthesis from 3D mesh sequences driven by combined speech features. in 2017 IEEE International Conference on Multimedia and Expo: ICME 2017., 8019546, Proceedings - IEEE International Conference on Multimedia and Expo, IEEE Computer Society, pp. 1075-1080, 2017 IEEE International Conference on Multimedia and Expo, ICME 2017, Hong Kong, Hong Kong, 10 Jul 2017. https://doi.org/10.1109/icme.2017.8019546

Kuhnke, F., & Ostermann, J. (2017). Visual speech synthesis from 3D mesh sequences driven by combined speech features. In 2017 IEEE International Conference on Multimedia and Expo: ICME 2017 (pp. 1075-1080). Article 8019546 (Proceedings - IEEE International Conference on Multimedia and Expo). IEEE Computer Society. https://doi.org/10.1109/icme.2017.8019546

Kuhnke F, Ostermann J. Visual speech synthesis from 3D mesh sequences driven by combined speech features. In 2017 IEEE International Conference on Multimedia and Expo: ICME 2017. IEEE Computer Society. 2017. p. 1075-1080. 8019546. (Proceedings - IEEE International Conference on Multimedia and Expo). doi: 10.1109/icme.2017.8019546

Kuhnke, Felix ; Ostermann, Jörn. / Visual speech synthesis from 3D mesh sequences driven by combined speech features. 2017 IEEE International Conference on Multimedia and Expo: ICME 2017. IEEE Computer Society, 2017. pp. 1075-1080 (Proceedings - IEEE International Conference on Multimedia and Expo).

@inproceedings{07e3e28e69eb4517b9945274201e4c6b,
  title = "Visual speech synthesis from 3D mesh sequences driven by combined speech features",
  abstract = "Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.",
  keywords = "Facial Animation, Lip Synchronization, Speech Features, Visual Speech Synthesis",
  author = "Felix Kuhnke and J{\"o}rn Ostermann",
  year = "2017",
  month = aug,
  day = "28",
  doi = "10.1109/icme.2017.8019546",
  language = "English",
  series = "Proceedings - IEEE International Conference on Multimedia and Expo",
  publisher = "IEEE Computer Society",
  pages = "1075--1080",
  booktitle = "2017 IEEE International Conference on Multimedia and Expo",
  address = "United States",
  note = "2017 IEEE International Conference on Multimedia and Expo, ICME 2017 ; Conference date: 10-07-2017 Through 14-07-2017",
}

TY  - GEN
T1  - Visual speech synthesis from 3D mesh sequences driven by combined speech features
AU  - Kuhnke, Felix
AU  - Ostermann, Jörn
PY  - 2017/8/28
Y1  - 2017/8/28
N2  - Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.
AB  - Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.
KW  - Facial Animation
KW  - Lip Synchronization
KW  - Speech Features
KW  - Visual Speech Synthesis
UR  - http://www.scopus.com/inward/record.url?scp=85030238866&partnerID=8YFLogxK
U2  - 10.1109/icme.2017.8019546
DO  - 10.1109/icme.2017.8019546
M3  - Conference contribution
AN  - SCOPUS:85030238866
T3  - Proceedings - IEEE International Conference on Multimedia and Expo
SP  - 1075
EP  - 1080
BT  - 2017 IEEE International Conference on Multimedia and Expo
PB  - IEEE Computer Society
T2  - 2017 IEEE International Conference on Multimedia and Expo, ICME 2017
Y2  - 10 July 2017 through 14 July 2017
ER  -

By the same author(s)

MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression

Acoustic Emission Detection in Noisy Environments using Linear Prediction

Genie: the first open-source ISO/IEC encoder for genomic data

On the Rate-Distortion-Complexity Trade-Offs of Neural Video Coding

Self-supervised domain adaptation for machinery remaining useful life prediction
