Meta AI has released a new AI model allowing you to generate videos from text

Michael Louis
Co-Founder & CEO

With the release of large language and text-to-image (T2I) models such as GPT-3, DALL-E, and Stable Diffusion, a lot of attention has been drawn to the AI community in recent months. Everybody, from researchers to designers and creators, is experimenting with these models, and some have even built successful businesses on top of these technologies, such as Copy.AI. With the influx of creations posted online using DALL-E and Stable Diffusion, the community has been asking when we will be able to generate videos from text. The answer is: today.

On the 29th of September, the team at Meta AI released a paper about their new text-to-video (T2V) model, Make-A-Video, which tries to solve exactly this problem: generating high-quality videos from textual inputs. However, video generation poses significantly greater challenges than those already faced by image generation models, and the team at Meta AI has made some significant contributions in this regard:

  1. It accelerates training of the T2V model by leveraging existing T2I models instead of training a model from scratch.
  2. The training data does not require paired text-video data.
  3. The generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models.

Unfortunately, Meta has not released a public version of this model just yet, as they are still working through the ethical concerns around these types of models. However, we can definitely expect to see a new wave of text-to-video models in the not-so-distant future.

High-level Technical overview

Collecting a vast dataset of high-quality, labelled text-video data is extremely difficult. Most available datasets contain roughly 10 million records, which is nowhere near sufficient for large generative models. Given this, and the fact that training a model from scratch would be expensive, Meta argues that it is advantageous to leverage the large T2I models that already exist.

Meta implements an unsupervised learning approach similar to those used in models such as GPT-3. Such an approach enables the network to learn from orders of magnitude more data. Specifically, it allows the network to learn the subtleties of certain concepts, such as how different objects move and interact, and to build internal representations of them. This has already been achieved in some image-based action recognition systems, where actions are inferred from images. For large language and vision models, networks pre-trained in an unsupervised manner yield considerably higher performance than those trained solely in a supervised manner, due to the amount of data available.

A high-level overview of the technical implementation consists of three main components:

  1. A base T2I model that has been trained on text-image pairs.
  2. Spatiotemporal convolution and attention layers that extend the network's foundation to the temporal dimension. This is the component that learns temporal dynamics from a collection of videos.
  3. Lastly, a spatial super-resolution model as well as frame interpolation models to increase the resolution of the generated videos and enable a controllable frame rate.
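To make the roles of these components concrete, here is a toy sketch of the pipeline in NumPy. This is purely illustrative, not Meta's implementation: random low-resolution frames stand in for the T2I base plus temporal layers, frame interpolation is a simple linear blend between keyframes, and super-resolution is a nearest-neighbour upscale.

```python
import numpy as np

def generate_keyframes(prompt_seed: int, n_keyframes: int = 4, size: int = 16):
    """Stand-in for the T2I base model + spatiotemporal layers:
    produce a handful of low-resolution keyframes."""
    rng = np.random.default_rng(prompt_seed)
    return rng.random((n_keyframes, size, size, 3))

def interpolate_frames(keyframes: np.ndarray, factor: int = 4):
    """Frame interpolation stand-in: linearly blend between each pair
    of consecutive keyframes to raise the frame rate."""
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, factor, endpoint=False):
            frames.append((1.0 - t) * a + t * b)
    frames.append(keyframes[-1])
    return np.stack(frames)

def super_resolve(frames: np.ndarray, scale: int = 4):
    """Spatial super-resolution stand-in: nearest-neighbour upscaling."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

keyframes = generate_keyframes(prompt_seed=42)        # (4, 16, 16, 3)
video = super_resolve(interpolate_frames(keyframes))  # (13, 64, 64, 3)
print(video.shape)
```

In the real model, each of these stand-ins is a learned network; the point is only to show where each component sits in the generation pipeline.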

Evaluation against prior T2V models

Make-A-Video outperformed previous T2V models such as CogVideo and VDM on a variety of datasets. The results demonstrate significantly better generalisation capabilities as well as more coherent generated videos. We recommend reading the paper here if you would like to understand how the evaluation was performed.

Below is an image showing a comparison to other T2V models as well as other capabilities of Make-a-Video.

The four sets of images show the following:

a) A comparison of Make-A-Video with VDM and CogVideo on the text prompt “Busy freeway at night”.

b) Video generation from a user's own image, giving them the opportunity to personalise and directly control the generated video.

c) Interpolation between two images used as the beginning and end frames, with the 14 frames in between masked for generation.

d) An example of video variation: the average CLIP embedding of all frames from a video is taken as the condition to generate a semantically similar video.
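The video-variation trick in (d) is easy to sketch: embed every frame, average the embeddings, and renormalise to get a single conditioning vector. Below is a hedged illustration where random unit vectors stand in for real CLIP image embeddings (the actual model uses CLIP's image encoder):

```python
import numpy as np

def average_frame_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average per-frame embeddings into one condition vector, then
    L2-normalise it (CLIP embeddings are typically unit-length)."""
    mean = frame_embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Stand-in for CLIP embeddings of 16 video frames (embedding dim 512).
rng = np.random.default_rng(0)
frame_embeddings = rng.standard_normal((16, 512))
frame_embeddings /= np.linalg.norm(frame_embeddings, axis=1, keepdims=True)

condition = average_frame_embedding(frame_embeddings)
print(condition.shape)  # (512,)
```

Because the condition summarises the whole clip rather than any single frame, generating from it yields a video that is semantically similar to the original without copying it frame by frame.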

What impact could this technology have?

When thinking about the potential use cases of Make-A-Video it is important to take into consideration the rise of video, specifically short video, over the last 5–10 years. Many are familiar with TikTok and its success, driven by its easy-to-consume short-video format and addictive personalised recommendation system. Its explosive growth has forced other platforms to follow suit, with Instagram and YouTube releasing Instagram Reels and YouTube Shorts. We would be remiss not to consider the rise of similar platforms such as Twitter and Vine, which became large successes due to the short nature of their media.

Make-A-Video currently only produces very short animations, closer to GIFs, which are a large part of internet culture. But it begs the question of what the next few years will look like as this technology is applied in practice and the length of text-generated videos grows.

  1. Will Make-A-Video or similar technology replace jobs?

In short, no. The reasoning resembles the answers given about T2I systems like DALL-E replacing designers. Gary Marcus, scientist and author of Rebooting AI, said: “DALL-E is probably best used as a source of inspiration rather than a tool for final products,” and the same will be true for video. There are many complexities when it comes to visual effects and contrast that a videographer will need to touch up. However, it will allow content to be created at a much faster pace.

  2. Will this lead to an influx of creators and indie creators?

Absolutely. Many creators will tell you that it takes a considerable amount of time to shoot content, reshoot different scenes and edit video footage. However, we are heading in a direction where you could release content as quickly as you can write it, radically reducing the time spent editing footage. Also, a technology like this doesn't necessarily have to fully create content; it could instead serve as a source of inspiration for many creators out there.

  3. Dynamic video ads

As mentioned above, another trend TikTok has popularised is personalisation: giving each user personalised recommendations and experiences, with the end goal of leading customers to some conversion event. Currently, customer interactions across most businesses are static, showing a set of images of products a customer might like to buy. But what if those static images could become videos, and the products advertised in those videos were dynamic, chosen according to which product has the highest chance of leading to a user's conversion?

To end, the team at Meta said:

Learning from the world around us is one of the greatest strengths of human intelligence. Just as we quickly learn to recognize people, places, things, and actions through observation, generative systems will be more creative and useful if they can mimic the way humans learn.

As both a business and a group of AI enthusiasts, we are excited to see the influx of research this model will create in the AI community, and, more importantly, we are excited to see the creativity it will stimulate in the world as the technology improves!

P.S.: What are your thoughts about the impact this technology could have as its capabilities increase? Leave them in the comments.
