
New Video from @Computerphile Explores Evolution of Video Generation Models
The video opens by discussing how quickly video generation models have improved, comparing a clip of Will Smith eating spaghetti generated in March 2023 with a more recent and far more realistic version. This comparison serves as a starting point for exploring how these models work and how they have evolved to produce high-fidelity videos.
The presenter explains that video generation models work much like 2D image generation models, using a technique called diffusion. During training, noise is progressively added to an image until only noise remains, and a model is trained to predict and remove that noise so the original image can be reconstructed. To generate a new image, the trained model starts from pure noise and removes it step by step. The model is guided by textual descriptions, called prompts, which allow it to generate specific images by combining different concepts.
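To make the noising step concrete, here is a minimal NumPy sketch. It assumes the common variance-preserving formulation, in which a single parameter (called alpha_bar here) controls how much of the clean image survives; the video does not specify a formula, so the function names and the blending rule are illustrative assumptions, and the trained model is replaced by the true noise to show what a perfect prediction would achieve.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """Blend a clean image with Gaussian noise.

    alpha_bar in [0, 1]: 1.0 keeps the image intact, 0.0 leaves pure noise.
    (Assumed variance-preserving blend; not taken from the video.)
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise

def reconstruct(x_t, predicted_noise, alpha_bar):
    """Invert the blend, given a prediction of the noise that was added."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # stand-in for a real RGB image

noisy, noise = add_noise(image, alpha_bar=0.5, rng=rng)

# With a perfect noise prediction the original image comes back exactly.
# A real model only approximates this, which is why generation takes many steps.
recovered = reconstruct(noisy, noise, alpha_bar=0.5)
```

In practice the noise prediction comes from a neural network conditioned on the prompt, and the removal is repeated over many small steps rather than done in one jump.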
Generating videos is more complex. A video is essentially a sequence of frames, and the model must keep every frame consistent with the others to produce smooth motion. The presenter notes that if a 2D diffusion model is simply run once per frame, successive frames will not be coherent, because the model has no knowledge of what it generated previously. To solve this problem, video generation models are trained on complete video sequences: noise is added to every frame in the sequence, and the model is trained to predict and remove this noise to reconstruct the original video.
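The difference from the image case is mainly in the shape of the data: the whole clip is noised and scored as one training example. The sketch below illustrates this under assumptions of my own, namely the same variance-preserving blend as is commonly used for images and a mean-squared-error loss on the predicted noise; `zero_model` is a hypothetical placeholder, not a real network.

```python
import numpy as np

def make_training_pair(clip, alpha_bar, rng):
    """Build one (input, target) pair from a whole clip of shape (T, H, W, C)."""
    noise = rng.standard_normal(clip.shape)  # noise for every frame at once
    noisy_clip = np.sqrt(alpha_bar) * clip + np.sqrt(1.0 - alpha_bar) * noise
    return noisy_clip, noise

def denoising_loss(predict_noise, noisy_clip, target_noise):
    """Mean squared error between predicted and true noise over the whole clip."""
    predicted = predict_noise(noisy_clip)
    return float(np.mean((predicted - target_noise) ** 2))

rng = np.random.default_rng(0)
clip = rng.random((4, 8, 8, 3))  # 4 frames of 8x8 RGB, stand-in for real video

noisy_clip, target = make_training_pair(clip, alpha_bar=0.5, rng=rng)

# Hypothetical placeholder model that always predicts "no noise".
# It receives all frames together, which is what lets a real model
# learn relationships between frames.
def zero_model(x):
    return np.zeros_like(x)

loss = denoising_loss(zero_model, noisy_clip, target)
```

Because the model's input is the full (T, H, W, C) array rather than a single frame, training can penalize predictions that are plausible frame by frame but inconsistent across time.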
The presenter then introduces latent diffusion models, which use autoencoders to compress images into a smaller, more manageable latent space. This reduces the complexity of the problem and makes the diffusion process more efficient. For videos, the frames are divided into patches, which can cover a region of a single frame or the same region across several successive frames. These patches are passed through the autoencoder to be compressed into latent space, where the diffusion process runs, and the result is decoded to produce the final video.
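As an illustration of patches that span both space and time, the following NumPy sketch cuts a small video into blocks covering several pixels and several frames each. The function name and patch sizes are hypothetical, and the autoencoder that would compress each patch is left out.

```python
import numpy as np

def to_spacetime_patches(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened patches.

    Each patch covers pt frames, ph rows and pw columns.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("patch sizes must divide the video dimensions evenly")
    # Split each of T, H, W into (number of patches, size within a patch).
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-grid axes to the front, in-patch axes to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(-1, pt * ph * pw * C)

video = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
patches = to_spacetime_patches(video, pt=2, ph=4, pw=4)
print(patches.shape)  # (8, 96)
```

Setting pt=1 gives purely spatial patches taken from a single frame, while larger values of pt make each patch carry a short slice of motion.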
Another challenge is maintaining temporal coherence, meaning that what happens in one frame must be consistent with what happens in the frames that follow. For this, models use transformers with attention, which let every patch be compared with and related to patches from other moments in time, creating continuity in the video.
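The core of that comparison can be sketched as single-head scaled dot-product attention over a list of patch vectors. This is a minimal illustration with random, untrained weights; the dimensions and variable names are assumptions, and real video models stack many such layers with multiple heads and positional information.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over (N, D) patch tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of every patch's value vector.
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, dim = 8, 16  # e.g. patches drawn from several different frames
tokens = rng.standard_normal((n_patches, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

mixed, weights = self_attention(tokens, w_q, w_k, w_v)
```

Row i of `weights` shows how strongly patch i draws on every other patch, including patches from earlier or later frames, which is the mechanism that lets information flow across time.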
The video concludes with examples of videos generated by Google DeepMind's Veo 3 model, showing frogs on stilts in a pond, frogs rapping, and a frog asking viewers to subscribe to the Computerphile channel. These examples illustrate the quality and complexity of the videos these models can generate, while raising questions about ethical implications and the risks of misinformation.
To learn more, watch the full video at the following address: https://www.youtube.com/watch?v=hJHfZKYUKMw