In this study we investigate architectures for modality fusion in audiovisual speech recognition, where video images of the speaker's face serve as an additional modality to alleviate the adverse effect of acoustic noise on recognition accuracy. Starting from an established neural-network fusion system, we substantially improve recognition accuracy by taking single-modality losses into account: late fusion (at the level of the output logits) is considerably more robust than the baseline, in particular for unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. This requirement can be removed by making the fusion itself a trainable part of the network.
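The late-fusion scheme described above can be sketched as a weighted combination of per-stream output logits. This is a minimal NumPy illustration, not the paper's implementation; the stream weight `w_audio` stands in for the weighting that the abstract says must be tuned (or, in the trainable variant, learned as a network parameter).

```python
import numpy as np

def late_fuse(audio_logits, video_logits, w_audio=0.7):
    """Fuse two modality streams at the logits level.

    w_audio is a hypothetical stream weight in [0, 1]; in the fixed-weight
    variant it must be tuned per noise condition, while in the trainable
    variant it would be a learned parameter of the network.
    """
    return w_audio * audio_logits + (1.0 - w_audio) * video_logits

def softmax(x):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: one frame, two classes per stream.
audio = np.array([[2.0, 0.0]])   # audio stream favors class 0
video = np.array([[0.0, 1.0]])   # video stream favors class 1
fused = late_fuse(audio, video, w_audio=0.5)
posterior = softmax(fused)
```

With `w_audio=1.0` the fusion degenerates to the audio-only system, which is why a poorly chosen fixed weight hurts robustness under unseen noise.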
| Original language | English (US) |
| Title of host publication | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Publisher | International Speech Communication Association |
| Number of pages | 5 |
| State | Published - Jan 1 2020 |