Token Injection Vulnerability in LLM Inference Frameworks Exposes Services to Crashes
Researchers from Huazhong University of Science and Technology and Ant Group presented "token injection" at Black Hat Europe, a vulnerability in LLM inference frameworks that lets an attacker crash a service by injecting invisible special tokens (e.g., <start>, <video_pad>) into user input. The attack exploits the mixing of control signals and user data in multi-tenant inference engines such as vLLM, TensorRT-LLM, and Ollama, and it causes crashes through index errors, CUDA device-side asserts, or segmentation faults. The researchers identified more than 1,300 multimodal tokens across models from OpenAI, Meta, and Google; some vendors define hundreds or thousands of special tokens, which widens the attack surface. In demonstrations, a single HTTP request was enough to crash an engine, including on cloud platforms such as NVIDIA NIM, Google Vertex AI, and Microsoft Azure AI, because special tokens in the request were processed without validation. A fix exists in Hugging Face’s tokenizer (split_special_tokens=True), but it is disabled by default for compatibility, which leaves most deployments vulnerable. vLLM, NVIDIA, and Google confirmed and patched the issue, while other vendors dismissed it as a self-inflicted denial of service (self-DoS). The research also points to risks beyond crashes, including potential cross-user data leaks or inference manipulation, though black-box testing found no such exploits.
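The validation gap described above can be illustrated with a minimal Python sketch. It is not the researchers' tooling or any vendor's patch. It assumes a service that can inspect a request both as raw text and as token IDs before either reaches the inference engine. The names RESERVED_TOKENS, find_reserved_tokens, and check_token_ids are hypothetical, and the two-entry token list only reuses the examples named in the article.

```python
from typing import Iterable, List, Set

# Hypothetical reserved-token list for illustration; a real service would
# load the full set from the tokenizer configuration of the model it serves.
RESERVED_TOKENS: Set[str] = {"<start>", "<video_pad>"}


def find_reserved_tokens(text: str, reserved: Iterable[str] = RESERVED_TOKENS) -> List[str]:
    """Return the reserved token strings that appear verbatim in user text."""
    return sorted(tok for tok in reserved if tok in text)


def check_token_ids(ids: Iterable[int], vocab_size: int, control_ids: Set[int]) -> None:
    """Raise ValueError if any ID is out of range or is a control token."""
    for position, token_id in enumerate(ids):
        # Out-of-range IDs are one plausible source of index errors downstream.
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} at position {position} is out of range")
        # Control tokens should come from the serving template, not the user.
        if token_id in control_ids:
            raise ValueError(f"control token id {token_id} at position {position} came from user input")
```

A gateway might call find_reserved_tokens on the raw prompt and reject the request if the result is non-empty, then call check_token_ids on the tokenized user segment as a second check. Rejecting outright is a deliberately blunt choice for the sketch; a deployment that must accept such strings as literal text would rely instead on tokenizer behavior such as the split_special_tokens option mentioned above.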