Hidden Unicode Characters Can Trick AI Into Following Secret Commands

What Happened

Researchers from Moltwire conducted extensive testing on how invisible Unicode characters can be weaponized against AI systems. They embedded hidden characters inside normal-looking trivia questions, encoding different answers than what appeared visible to human readers.

The study tested five major AI models: GPT-5.2, GPT-4o-mini, Claude Opus 4, Sonnet 4, and Haiku 4.5 across 8,308 graded outputs. The researchers describe their method as a “reverse CAPTCHA” - while traditional CAPTCHAs test what humans can do but machines cannot, this exploit uses a channel machines can read but humans cannot see.

The critical finding was that AI models with tool access - particularly code execution capabilities - were significantly more vulnerable. Without tools, models almost never followed the hidden instructions. However, when given access to tools, the AI systems could write scripts to decode the hidden messages and act on them.

Why It Matters

This vulnerability represents a new class of security threat for AI systems deployed in real-world applications. As AI assistants increasingly gain access to tools like code execution, web browsing, and API calls, this attack vector could allow malicious actors to manipulate AI behavior without detection.

The research reveals important differences between AI providers. OpenAI and Anthropic models showed vulnerability to different encoding schemes, meaning attackers would need to tailor their approach based on the target model. This suggests the vulnerability stems from how different models process Unicode characters rather than a universal flaw.

Most concerning is how little prompting is required to activate the vulnerability. Without explicit hints, compliance rates were near zero. However, adding a single line like “check for hidden Unicode” was enough to trigger the AI to extract and follow the hidden instructions.

Background

Unicode steganography - the practice of hiding information in text using invisible characters - has existed for years in cybersecurity research. However, its application to AI systems represents a novel attack vector that emerged alongside the deployment of increasingly capable AI assistants.

The technique exploits the fact that AI models process text at a character level that includes Unicode’s full range of invisible formatting characters. These characters, originally designed for legitimate text processing purposes, create a hidden communication channel that bypasses human oversight.

Standard Unicode normalization techniques (NFC/NFKC) that developers might expect to prevent such attacks do not strip these specific characters, leaving systems vulnerable even when basic security measures are in place.

What’s Next

The researchers have made their findings and testing framework open source, allowing other security researchers and AI developers to evaluate and address this vulnerability. The publication of both the research results and the code used for testing suggests the security community is taking a collaborative approach to addressing this threat.

AI developers will likely need to implement new input sanitization techniques that go beyond standard Unicode normalization. This could include developing detection systems for suspicious character patterns or implementing stricter controls on tool access when processing untrusted input.

Organizations deploying AI systems should review their security protocols, particularly for AI agents with tool access. The research suggests that limiting tool capabilities or implementing additional verification steps could mitigate the risk until more comprehensive solutions are developed.

The vulnerability also highlights the need for ongoing security research as AI capabilities expand. As models gain access to more powerful tools and are deployed in higher-stakes environments, new attack vectors like this one will likely emerge, requiring continuous vigilance from both researchers and developers.