
New Computerphile Video: Using File Compression for Phylogenetic Trees
In this video, the presenter shares one of his favorite computer science tricks: using file compression utilities to build phylogenetic trees. The central question is how to determine the evolutionary relationships between different types of animals or plants from their genomes, using only file compression tools.
To begin, the presenter explains what a file compression utility is, using ZIP files as an example. ZIP files use lossless compression, meaning that when you decompress a file, you get exactly the same file as the original. This contrasts with formats like JPEG, which use lossy compression, where data is lost and the decompressed image is not identical to the original.
The presenter demonstrates how compression algorithms work by searching for repetitive patterns in a sequence of bits. When a pattern repeats, the algorithm uses a pointer to reference the previous location of the pattern, rather than rewriting the pattern. This allows the size of the compressed file to be reduced. The length of the compressed file can thus be used as a measure of the file's complexity: the more repetitive a file is, the shorter its compressed version will be.
Next, the presenter introduces a method for comparing two files, for example, the human genome and the chimpanzee genome. By concatenating the two files and compressing them together, one can compare the length of the resulting compressed file with the lengths of the separately compressed files. If the two files are identical, the length of the combined compressed file will be almost the same as that of a single compressed file. If the files are completely different, the length of the combined compressed file will be close to the sum of the lengths of the separately compressed files.
This method allows for creating a similarity metric for arbitrary data, simply by compressing concatenated versions of the files and comparing the resulting lengths. The presenter shows how this technique can be applied to build phylogenetic trees of different virus variants or to trace the evolutionary history of different animals. He also mentions that this method can be used to compare languages using translated texts, such as the Universal Declaration of Human Rights.
Although this method is generic and works reasonably well, the presenter admits that it is not very important for practical applications. If one is interested in a specific domain, it is preferable to write a more suitable similarity metric. However, he finds this method fun and interesting due to its generality and surprising effectiveness.
In conclusion, this video presents an ingenious and generic technique for comparing arbitrary data using file compression algorithms. Although this method is not the most precise for specific applications, it offers an interesting and accessible approach to exploring the relationships between different types of data.