Fusion architectures for word-based audiovisual speech recognition

Michael Wand, Jürgen Schmidhuber

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations


In this study we investigate architectures for modality fusion in audiovisual speech recognition, where one aims to alleviate the adverse effect of acoustic noise on the speech recognition accuracy by using video images of the speaker's face as an additional modality. Starting from an established neural network fusion system, we substantially improve the recognition accuracy by taking single-modality losses into account: late fusion (at the output logits level) is substantially more robust than the baseline, in particular for unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. The latter requirement can be removed by making the fusion itself a trainable part of the network.
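The late fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`late_fusion`, `cross_entropy`, `total_loss`), the single scalar `audio_weight`, and the unweighted sum of fused and per-stream losses are all assumptions made for illustration. The abstract states only that fusion happens at the output logits level, that single-modality losses are taken into account, and that the stream weighting must be determined.

```python
import numpy as np


def late_fusion(audio_logits, video_logits, audio_weight):
    """Combine per-stream word logits with a fixed stream weight.

    audio_weight is a hypothetical scalar in [0, 1]; the video stream
    receives the complementary weight 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_logits + (1.0 - audio_weight) * video_logits


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    shifted = logits - logits.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])


def total_loss(audio_logits, video_logits, target, audio_weight):
    """Fused loss plus one loss per single modality (assumed equal weighting)."""
    fused = late_fusion(audio_logits, video_logits, audio_weight)
    return (
        cross_entropy(fused, target)
        + cross_entropy(audio_logits, target)
        + cross_entropy(video_logits, target)
    )


# Toy logits over a three-word vocabulary (made-up numbers).
audio = np.array([2.0, 0.0, 0.0])  # audio stream favours word 0
video = np.array([0.0, 3.0, 0.0])  # video stream favours word 1

for w in (0.0, 0.5, 1.0):
    fused = late_fusion(audio, video, w)
    print(f"audio_weight={w}: predicted word {int(np.argmax(fused))}")
```

In this sketch the stream weight is a fixed value that would have to be tuned, for example per noise condition. Replacing it with a quantity the network learns corresponds to the trainable fusion mentioned at the end of the abstract.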
Original language: English (US)
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Number of pages: 5
State: Published - Jan 1 2020
Externally published: Yes

Bibliographical note

Generated from Scopus record by KAUST IRTS on 2022-09-14

