
New Video from @Underscore_ Discusses Evolution of AI Models and Benchmarking
The video opens with a reflection on the evolution of artificial intelligence models, focusing on OpenAI's Deep Research, which marked a turning point in how AI is used. Earlier models answered questions instantly; Deep Research instead handles complex queries that require tools, internet access, and the ability to read PDFs or images. This has changed how users interact with AI and made previously impossible tasks feasible. The video's creator shares his own experience with Deep Research, describing how it has changed the way he works, especially when writing reports and comparing products before a purchase.
He then looks at the technical changes behind this improvement, which leads him to the GAIA (General AI Assistants) benchmark. Created by researchers from Hugging Face and Meta, this benchmark evaluates how well AI models perform complex, real-world tasks. The creators of GAIA explain that their initial approach was to measure capabilities that AI models had not yet unlocked. They designed tasks requiring multiple steps and tools, such as reading PDFs, interpreting images, and searching the internet. Early tests on the models available at the time, primarily ChatGPT, gave poor results, with a success rate below 10%. As models evolved and new tools were integrated, performance improved significantly.
The video also addresses benchmark contamination, where evaluation data ends up in a model's training corpus and skews the results. To counter this, benchmarks are updated regularly and some evaluations keep their answers unpublished. The video mentions cases of cheating in benchmark submissions and stresses the importance of community vigilance.
A key point of the discussion is how benchmarks have evolved over time. Early benchmarks primarily measured factual knowledge, whereas newer ones like GAIA evaluate complex reasoning and the ability to interact with tools and the environment.
This evolution reflects a paradigm shift in AI model evaluation, from measuring pure knowledge to assessing how well models reason through real tasks. The video closes with a discussion of upcoming benchmarks, such as BrowseComp and DABstep, which measure scientific assistance and data analysis tasks. These benchmarks aim to evaluate even more complex and useful capabilities in real-world contexts.
Overall, the video offers a broad overview of the evolution of AI models and the benchmarks used to evaluate them. It highlights the challenges and opportunities in this fast-moving field and emphasizes the importance of rigor and transparency in evaluating AI performance.