Benchmarking AI Models on Offensive Security: Results from Testing Claude, Gemini, and Grok

28 Feb 2026

AI, cybersecurity, offensive_security, benchmarking, Claude, Gemini, Grok, vulnerabilities, SQLi, IDOR, JWT_forgery, insecure_deserialization, open-source, Kali_Linux

Researchers tested Claude, Gemini, and Grok model variants on offensive security tasks, using an open-source framework running in a Kali Linux container. The models were evaluated on seven real vulnerabilities, including SQL injection (SQLi), insecure direct object references (IDOR), JWT forgery, and insecure deserialization, and were scored on both methodology and exploitation success. Every model solved every challenge, but token usage varied widely (5K to 210K tokens per task), and smaller models sometimes outperformed larger ones on simpler vulnerabilities. The framework is fully open-source and designed to run locally with user-provided API keys.
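
To make the scoring idea concrete, here is a minimal sketch of how a per-challenge score could blend methodology with exploitation success and relate it to token usage. Everything in it is an assumption for illustration: the `ChallengeResult` fields, the 0 to 1 methodology scale, the 50/50 weighting, and the `tokens_per_point` metric are hypothetical and are not taken from the framework itself, and the sample values are made up rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    """One model's attempt at one challenge (hypothetical record layout)."""
    model: str
    vulnerability: str
    methodology: float  # assumed rubric score in [0.0, 1.0]
    exploited: bool     # whether the exploit actually succeeded
    tokens_used: int    # total tokens consumed on the task


def score(result: ChallengeResult, methodology_weight: float = 0.5) -> float:
    """Blend methodology and exploitation success into a 0..1 score.

    The weighting is an assumption; the real framework may score differently.
    """
    if not 0.0 <= methodology_weight <= 1.0:
        raise ValueError("methodology_weight must be between 0 and 1")
    exploitation = 1.0 if result.exploited else 0.0
    return (methodology_weight * result.methodology
            + (1.0 - methodology_weight) * exploitation)


def tokens_per_point(result: ChallengeResult,
                     methodology_weight: float = 0.5) -> float:
    """Tokens spent per unit of score; lower means more efficient."""
    s = score(result, methodology_weight)
    return float("inf") if s == 0 else result.tokens_used / s


if __name__ == "__main__":
    # Illustrative, made-up values only.
    sample = ChallengeResult("model-a", "SQLi", 0.8, True, 5_000)
    print(score(sample), tokens_per_point(sample))
```

A metric such as tokens per point is one way to compare models when all of them solve every challenge, since success alone no longer separates them.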