Video generation from text

Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, Lawrence Carin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

122 Scopus citations

Abstract

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch the text-conditioned background color and object layout structure. Dynamic features are captured by transforming the input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.
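To make the text-to-filter mechanism described above concrete, here is a minimal PyTorch sketch. Everything in it, the class name Text2Filter, the embedding and kernel sizes, and the use of a grouped convolution to apply per-example filters, is an illustrative assumption rather than the paper's exact architecture: a linear layer maps the text embedding to a bank of convolution kernels, which are then applied to the "gist" image to produce text-conditioned dynamic features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text2Filter(nn.Module):
    """Map a text embedding to convolution kernels applied to the gist image.
    Dimensions are illustrative assumptions, not the paper's exact values."""

    def __init__(self, text_dim=256, gist_ch=3, out_ch=64, k=3):
        super().__init__()
        self.gist_ch, self.out_ch, self.k = gist_ch, out_ch, k
        # One (out_ch x gist_ch x k x k) filter bank predicted per example.
        self.fc = nn.Linear(text_dim, out_ch * gist_ch * k * k)

    def forward(self, text_emb, gist):
        # text_emb: (B, text_dim); gist: (B, gist_ch, H, W)
        b, _, h, w = gist.shape
        filters = self.fc(text_emb).view(b * self.out_ch, self.gist_ch, self.k, self.k)
        # Fold the batch into channels so one grouped conv2d call applies
        # each example's own predicted filters to its own gist image.
        gist = gist.reshape(1, b * self.gist_ch, h, w)
        out = F.conv2d(gist, filters, padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, h, w)

# Stand-in inputs: a text embedding (e.g. from an RNN encoder) and a gist
# image (e.g. from the conditional VAE); both are random here.
text_emb = torch.randn(4, 256)
gist = torch.randn(4, 3, 64, 64)
features = Text2Filter()(text_emb, gist)
print(features.shape)  # torch.Size([4, 64, 64, 64])
```

The grouped convolution is just a batching trick: folding the batch dimension into channels lets a single conv2d call apply a different predicted filter bank to each example's gist.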
Original language: English (US)
Title of host publication: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Publisher: AAAI Press
Pages: 7065-7072
Number of pages: 8
ISBN (Print): 9781577358008
State: Published - Jan 1 2018
Externally published: Yes

Bibliographical note

Generated from Scopus record by KAUST IRTS on 2021-02-09
