
New Video from @Computerphile: Creating an AI Version of Mike
In this video, Mike Pound and Lewis explore whether they can build an artificial version of Mike, called Mikebot, using only open-source tools and desktop hardware. The goal is to generate short videos of Mikebot explaining computer science topics, and in doing so to highlight the implications of this technology. The project grew out of Mike's idea of replacing himself with an AI so that he could take a nap while still producing content.
To make this possible, Lewis built a pipeline entirely from open-source tools. It starts with a prompt, such as "Tell me about this," which is sent to a large language model (LLM) to generate a script. The script is divided into scenes, each with an image prompt, a video prompt, audio, and an indication of whether Mike appears on screen.
To generate the images, Lewis uses Flux, an open-source image generator based on diffusion. Diffusion is a technique that starts from random noise and gradually denoises it into a coherent image. To teach the model what Mike looks like, Lewis uses LoRA (low-rank adaptation), a technique that fine-tunes the model with just a few images of Mike. With only 20 images of Mike, the LoRA-adapted model can generate images of him in various contexts, such as skydiving or in a jungle.
For video generation, Lewis uses Wan, a 3D diffusion model that generates 4-second clips. To create longer videos, he chains these clips together, using the last frame of each clip as the starting point for the next. The audio is generated by T5 TTS, a text-to-speech model trained on an hour of recordings of Mike's voice. Finally, to sync Mike's lips with the audio, Lewis uses LatentSync, a lip-sync model.
Lewis then shows some examples of what Mikebot can do, including an automatically generated video in which Mikebot explains firewalls. Although the video has some imperfections, such as changes in clothing and variations in Mike's face, it demonstrates the potential of this technology.
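To make the low-rank adaptation (LoRA) idea mentioned above more concrete, here is a minimal NumPy sketch of a single adapted layer. It is not the code from the video: the layer sizes, the rank, the scaling factor `alpha`, and the helper name `lora_forward` are all illustrative assumptions. The point it shows is that the pretrained weight stays frozen and only two small matrices are trained, which is part of why a handful of images can be enough.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these numbers come from the video.
d_out, d_in, rank = 512, 512, 8
alpha = 16.0  # assumed scaling factor for the low-rank update

# Frozen pretrained weight: it is never updated during fine-tuning.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero, so before any training
# the adapted layer behaves exactly like the original layer.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))


def lora_forward(x, W, A, B, alpha, rank):
    """Apply the frozen weight plus the scaled low-rank update B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)


x = rng.standard_normal((4, d_in))
base_out = x @ W.T
adapted_out = lora_forward(x, W, A, B, alpha, rank)

# Compare how many values full fine-tuning would update versus LoRA.
full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tune: {full_params} params, LoRA: {lora_params} params")
```

In a real image model this kind of update would be attached to many layers and the values of `A` and `B` would be learned from the training images; the sketch only shows the shape of the computation and the difference in parameter count.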
Lewis also discusses the broader implications, including the possibility of creating convincing videos of anyone with very little data, which raises ethical and security questions. To illustrate what closed-source solutions can do, Lewis shows a video generated with Hedra, a commercial tool that can create very realistic videos from a single image of Mike. The demonstration shows that while open-source tools are already powerful, commercial solutions are even more advanced and accessible.
In conclusion, the video underscores the rapid advances in generative AI and their potential implications for content creation, security, and ethics. It also shows that these technologies are becoming more accessible, meaning we need to start thinking seriously about how to regulate and use them responsibly.