Evaluation of 5 LLM Agents on Patching Real-World CVEs Reveals Mixed Results

31 May 2026

Cybersecurity, Vulnerabilities, AI_Models, Cost_Analysis

The study tested 5 LLM agents (3 from OpenAI, 2 Poolside Laguna models) on 20 real CVEs spanning 15 CWE categories under three prompt conditions. The best-performing model, gpt-5.5, reached a 50% overall solve rate, and no model fixed vulnerabilities reliably. Token costs varied widely: the Laguna models consumed 3–4x more tokens than the OpenAI models for equivalent outcomes. The 'location-only' condition, which supplies the file and function but no description of the flaw, proved the most challenging and closely mirrors real-world security research tasks.
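To make the "overall solve rate" figure concrete, here is a minimal sketch of how such a rate could be tallied from per-run outcomes. It assumes one run per model, CVE, and prompt condition, which the summary does not confirm. The function name `solve_rates`, the CVE identifiers, the model name `model_x`, and two of the three condition names are hypothetical placeholders, and the outcomes are invented purely to exercise the arithmetic; they are not the study's results.

```python
from collections import defaultdict
from itertools import product


def solve_rates(records):
    """Map each model to the fraction of its runs that were solved.

    `records` is an iterable of (model, cve, condition, solved) tuples.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for model, _cve, _condition, ok in records:
        total[model] += 1
        solved[model] += int(ok)
    return {model: solved[model] / total[model] for model in total}


# Placeholder grid with the same shape as the study: 20 CVEs x 3 conditions.
cves = [f"CVE-{i:02d}" for i in range(20)]  # hypothetical identifiers
# Only 'location_only' is named in the text; the other two names are assumed.
conditions = ["condition_a", "location_only", "condition_c"]
grid = list(product(cves, conditions))

# Invented outcomes (NOT study data): every second run counts as solved.
records = [
    ("model_x", cve, cond, i % 2 == 0)
    for i, (cve, cond) in enumerate(grid)
]

rates = solve_rates(records)
print(len(grid), rates["model_x"])  # 60 runs per model, rate 0.5
```

Under the one-run-per-cell assumption, each model contributes 20 x 3 = 60 runs, so a 50% overall solve rate would correspond to 30 solved runs, and the full evaluation of 5 models would comprise 300 runs.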