
New Video from @Computerphile Discusses AI Model Capabilities
In this video, Sydney Von Arx, a member of the technical staff at METR, an organization specializing in model evaluation and threat research, discusses how AI model capabilities are assessed. Von Arx explains that METR focuses on measuring the performance of AI models such as Claude, Grok, and ChatGPT-like models. Common metrics rely on multiple-choice question datasets. Although models often outperform humans on such benchmarks, they still seem limited in practical contexts, such as carrying out complex professional tasks.

To better understand what models can actually do, Von Arx and her team created a dataset and evaluated several models on it. They published two papers: one describing the dataset and the other presenting their findings on how model performance has evolved over time. The dataset focuses primarily on software engineering and cybersecurity tasks, ranging from simple tasks that take a human a few seconds to complex tasks requiring up to 16 hours of human work.

The results show a surprising trend: the length of tasks that AI models can complete appears to be growing exponentially over time. Using logistic regression, the team predicted the probability that a model completes a task from the time the task takes a human. For example, Claude 3.7 Sonnet can complete tasks taking about an hour with 50% reliability. If 80% reliability is required, the length of tasks that models can complete drops significantly.

One of the most interesting findings is that the exponential trend is very robust. The doubling time is estimated at about seven months, regardless of the success threshold used (50% or 80%). If this trend continues, models could be capable of performing 16-hour tasks by 2028, and potentially week-long or month-long tasks shortly thereafter.

Von Arx also emphasizes the importance of setting models up in a way that maximizes their performance.
The team uses different structures to help models complete tasks, having them adopt different roles, such as advisor, actor, and critic. This approach resembles a meeting in which different aspects of the model interact to find the best solution.

To validate their results, the team performed several checks, including repeating the analysis on real tasks from their own workflow and on another dataset, SWE-bench. Although the estimates from SWE-bench are less precise, the exponential trend remains the same. Von Arx concludes that the results are solid and that the length of tasks AI models can complete is increasing exponentially, doubling approximately every seven months.

In summary, this video offers a fascinating look at how AI model capabilities have evolved and where they may be heading. The practical implications are vast: if the trend holds, AI models could soon perform complex, long-duration tasks, which could transform many professional sectors.
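
The relationship between the logistic fit, the reliability threshold, and the seven-month doubling time can be illustrated with a small sketch. This is not the team's code. It assumes the success curve is logistic in the base-2 logarithm of task length, the `slope` value is invented purely for illustration, and all function names are hypothetical. Only the one-hour 50% horizon, the 16-hour target, and the seven-month doubling time come from the summary above.

```python
import math


def success_probability(task_minutes, horizon_minutes, slope):
    """Assumed logistic curve in log2(task length).

    By construction, the predicted success rate is exactly 50% when
    task_minutes equals horizon_minutes, and falls as tasks get longer.
    """
    x = slope * (math.log2(horizon_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(reliability, horizon_minutes, slope):
    """Task length at which the assumed curve predicts `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return horizon_minutes * 2.0 ** (-logit / slope)


def months_until(target_minutes, current_minutes, doubling_months):
    """Months needed to reach a target horizon at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    fifty_percent_horizon = 60.0  # about one hour, as quoted in the summary
    slope = 0.6                   # hypothetical value, not from the papers

    # A stricter reliability requirement shortens the horizon.
    print(horizon_at(0.8, fifty_percent_horizon, slope))

    # Four doublings separate 1-hour tasks from 16-hour tasks.
    print(months_until(16 * 60.0, fifty_percent_horizon, 7.0))  # 28.0
```

Under these assumptions, moving from one-hour to 16-hour tasks takes four doublings, or 28 months at seven months per doubling. How much shorter the 80% horizon is than the 50% horizon depends entirely on the assumed slope, which would have to come from the fitted data.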